While profiling, I noticed that Streamline differentiates between 128-bit memory accesses and accesses narrower than 128 bits, for example loading a highp vec4 versus loading a mediump vec4. Is there any difference in latency between a narrow and a full-width memory access (assuming both hit the L1 cache)?
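To make the two cases concrete, here is a small sketch of the access widths I have in mind. It assumes that highp floats are 32 bits per component and that mediump floats are stored as 16 bits per component. That is my understanding of how Mali handles them, not something the GLSL ES specification guarantees. The helper name `vector_load_bits` is only for illustration.

```python
# Assumed storage width per float component, in bits.
# highp = 32-bit float, mediump = 16-bit float (an assumption about the target GPU).
BITS_PER_COMPONENT = {"highp": 32, "mediump": 16}


def vector_load_bits(precision, components=4):
    """Return the total width in bits of loading one vector of the given precision."""
    return BITS_PER_COMPONENT[precision] * components


for precision in ("highp", "mediump"):
    bits = vector_load_bits(precision)
    kind = "full 128-bit" if bits >= 128 else "narrower than 128-bit"
    print(f"{precision} vec4: {bits} bits ({kind} access)")
```

Under these assumptions a highp vec4 load is a full 128-bit access and a mediump vec4 load is a 64-bit access, which would match the two categories Streamline reports.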
Another thing I observed while using the offline compiler is how uniform variables (especially arrays) are treated in shader code. It seems that a portion of the register file can be used for uniform variables, so using them in computation does not incur any load/store (L/S) operations. Once the uniform data exceeds a certain size, it is fetched from memory instead. According to the documentation, the Mali shader cores have some "constant memory", but I could not find its latency (compared to the L1 cache, for example), its size, or the tipping point between storing uniform data in registers and storing it in constant memory.
Are there any specifications on this topic?