While profiling I noticed that Streamline differentiates between 128-bit memory accesses and <128-bit memory accesses, for example loading a highp vec4 vs. loading a mediump vec4. Is there any difference in latency between a short and a full memory access (assuming they both hit the L1 cache)?
Another thing that I observed while using the offline compiler is the treatment of uniform variables (especially arrays) in shader code. It seems that a portion of the register file can/will be used for uniform variables, so using them in computation does not incur any load/store (L/S) operation. Once the uniform data exceeds a certain amount, it is fetched from memory instead. According to the documentation, the Mali shader cores do have some "constant memory", but I could not find its latency (as compared to L1 cache, for example), its size, or the tipping point between storing uniform data in registers and storing it in constant memory.
Are there any specifications on that topic?
All LSC accesses take a single cycle (assuming a cache hit); the short accesses simply indicate cases where the shader program is not using the full available data path width. These counters are mostly provided to help applications optimize memory accesses in compute shaders, as they help identify where memory access is not sufficiently vectorized. For example, you could change the memory access vectorization in a compute kernel so that each thread makes 4 full 128-bit accesses rather than 8 half-width 64-bit accesses.
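To put numbers on that last example, here is a minimal sketch of the arithmetic, assuming a thread that moves 64 bytes in total (the amount implied by 4 x 128 bits). The helper name `access_stats` and the constant `FULL_WIDTH_BITS` are made up for illustration; they are not part of Streamline or any Mali tool.

```python
# Hypothetical helper, for illustration only: counts how many load/store
# accesses a thread needs for a given access width, and what fraction of
# the 128-bit data path each access uses.
FULL_WIDTH_BITS = 128  # full data path width discussed above


def access_stats(bytes_per_thread, access_bits):
    """Return (number_of_accesses, fraction_of_data_path_used_per_access)."""
    access_bytes = access_bits // 8
    # Round up: a partial trailing chunk still costs one access.
    accesses = -(-bytes_per_thread // access_bytes)
    utilization = access_bits / FULL_WIDTH_BITS
    return accesses, utilization


if __name__ == "__main__":
    for bits in (64, 128):
        n, util = access_stats(64, bits)
        print(f"{bits}-bit accesses: {n} per thread, {util:.0%} of the data path")
```

Since each access costs the same single cycle on a hit, the 64-bit version issues twice as many accesses for the same data, which is what the short-access counters are meant to expose.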
> According to the documentation, the Mali shader cores do have some "constant memory"
As noted by the offline compiler performance reports, most* uniform accesses can be mapped to a constant register file; there isn't any separate constant memory other than that register file. The uniform register space available is relatively small; our best practices document** recommends keeping the total size of uniforms and Vulkan push constants in a draw call to around 128 bytes.
* "Most" means that, to be mapped this way, a uniform must be accessed directly, or via a constant or uniform integer array subscript.
** https://static.docs.arm.com/100019/0100/arm_mali_application_developer_best_practices_developer_guide_100019_0100_00_en2.pdf
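As a rough way to check a draw call against that 128-byte guideline, the sketch below adds up uniform sizes. It assumes highp (32-bit) components and tight packing with no layout padding, which may not match how the compiler actually allocates uniform registers; the names `TYPE_SIZES`, `uniform_bytes`, and `within_budget` are hypothetical. The offline compiler report remains the authoritative check.

```python
# Hypothetical estimate, for illustration only. Sizes are in bytes for
# highp (32-bit) GLSL types, assuming tight packing with no padding.
TYPE_SIZES = {"float": 4, "vec2": 8, "vec3": 12, "vec4": 16, "mat4": 64}
UNIFORM_BUDGET_BYTES = 128  # guideline from the reply above, not a hard limit


def uniform_bytes(uniforms):
    """uniforms: list of (type_name, array_length); use length 1 for non-arrays."""
    return sum(TYPE_SIZES[type_name] * length for type_name, length in uniforms)


def within_budget(uniforms, budget=UNIFORM_BUDGET_BYTES):
    """True if the estimated uniform footprint fits the recommended budget."""
    return uniform_bytes(uniforms) <= budget


if __name__ == "__main__":
    small = [("mat4", 1), ("vec4", 4)]   # 64 + 64 bytes
    large = [("mat4", 1), ("vec4", 8)]   # 64 + 128 bytes
    print(uniform_bytes(small), within_budget(small))
    print(uniform_bytes(large), within_budget(large))
```

For example, one mat4 plus a four-element vec4 array lands exactly on the guideline, while doubling the array length goes over it.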
Regards, Pete