
How best to maximize cache-write utilization for gpu-compute?

What are some best practices for preventing data from being written out to RAM when structuring a compute job on the GPU that requires a small amount of data? For example, if I wanted to do 10M read/write operations on a contiguous 1024B array and finally output, say, 1024B, would this be automatically cached or are there things that should be done to make caching more likely?

  • All data read by a shader core will get cached automatically. Some of the basics for Mali:

    • As always, try to minimize data sizes; 16-bit types are faster than 32-bit types to process, and use half the storage in RAM and registers, so it's a huge boost if you can design your algorithm to exploit them.
    • Mali doesn't have any dedicated local memory region; we just use system RAM backed by normal cache policies. From a performance point of view local memory is therefore identical to global memory, so don't waste cycles copying data from global to local - it won't help and just adds more cache pressure.
    • The usual caching 101 principles of spatial and temporal locality apply - try to keep accesses to the same cache lines close together in time to minimize cache pressure and chances of having to re-fetch the same data from memory. This can be controlled by algorithm design, and by the shape of your kernel work groups.
    • Design your data accesses to use vector load/store operations; they make much more cache-friendly accesses. In 2D datasets, such as the matrices in SGEMM, this may mean you need to transpose one or more of the input data sets to allow this (the alternative, not transposing, means scalar loads from each row, loading and using one entry from each cache line, and in sufficiently large input datasets probably thrashing the MMU TLB as well as the cache).
    • The L1 texture cache is a separate resource from the L1 load/store cache; use texturing operations rather than array accesses for constant read-only texture resources. (This can have some other performance side-effects, so it's worth testing both options to see which one works best for your particular algorithm.)
    • Try to avoid writeable data, including atomics, which is heavily accessed by multiple work groups spanning shader cores. You lose internal bandwidth within the shader core handling the parallel writes to the same cache line from multiple shader cores. If you need some form of "global but local to a single core" data structure to minimize this, we have a published OpenCL extension which provides the core_id to kernel code to allow dynamic array indexing: https://www.khronos.org/registry/cl/extensions/arm/cl_arm_get_core_id.txt
    • When writing output data try to write entire 64-byte cache lines from a workgroup without gaps or holes; partial writes are more expensive than full writes. Partial = either shorter than 64-bytes (bad for AXI utilization) or with unwritten regions which must be masked out to avoid corrupting the unwritten parts (bad for DDR utilization in LPDDR4 onwards).
    • Use the offline Mali shader compiler to review your shaders to see if they are spilling to stack; stack spills are a very quick way to burn bandwidth which isn't obviously visible just looking at the shader source code.
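
    To make the 16-bit point concrete, here's a minimal host-side C sketch (illustrative only; on the GPU the same halving applies to the register file and cache footprint, and you need to check your value ranges fit in 16 bits):

    ```c
    #include <assert.h>
    #include <stdint.h>

    int main(void) {
        /* A 1024-entry buffer of 16-bit values occupies half the storage
           of a 32-bit one -- half the cache lines, half the bandwidth. */
        int16_t narrow[1024];
        int32_t wide[1024];
        assert(sizeof(narrow) == 2048);
        assert(sizeof(wide) == 4096);
        return 0;
    }
    ```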
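
    On the transpose point, a rough C sketch of the idea (the 4x4 size and names are just for illustration): after transposing, what used to be a strided column walk becomes a contiguous row walk, so each fetched cache line is used in full and is amenable to vector loads.

    ```c
    #include <assert.h>
    #include <stddef.h>

    #define N 4

    /* Transpose so column accesses on `in` become contiguous row
       accesses on `out`. */
    static void transpose(const float in[N][N], float out[N][N]) {
        for (size_t r = 0; r < N; ++r)
            for (size_t c = 0; c < N; ++c)
                out[c][r] = in[r][c];
    }

    int main(void) {
        float a[N][N], t[N][N];
        for (size_t r = 0; r < N; ++r)
            for (size_t c = 0; c < N; ++c)
                a[r][c] = (float)(r * N + c);
        transpose(a, t);
        /* Column 1 of `a` is now the contiguous row 1 of `t`. */
        for (size_t r = 0; r < N; ++r)
            assert(t[1][r] == a[r][1]);
        return 0;
    }
    ```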
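
    And for the full-cache-line writes, one way to encourage this is to shape each workgroup's output record as a whole aligned 64-byte line and write all of it, padding included (a sketch; the 14-word payload split is hypothetical):

    ```c
    #include <assert.h>
    #include <stdint.h>

    /* Pad the per-workgroup output record to exactly one aligned 64-byte
       cache line, so stores cover the line completely and nothing has to
       be masked out on the way to memory. */
    typedef struct {
        _Alignas(64) uint32_t results[14]; /* payload: 56 bytes */
        uint32_t pad[2];                   /* explicit padding; write it too */
    } OutputLine;

    int main(void) {
        assert(sizeof(OutputLine) == 64);
        assert(_Alignof(OutputLine) == 64);
        return 0;
    }
    ```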

    HTH,
    Pete

  • Thanks, Pete!

    This was amazingly helpful, and extremely informative! Stack spilling is something that I hadn't considered but I can appreciate how expensive such operations would be, making a very good case for using a profiler. I also appreciated (but hadn't considered) the recommendation of using 16b types for a more efficient use of the register file!

    I have an unrelated question regarding atomics. While I understand that atomics have the potential to be non-cache friendly, are they GPU-compute or GL/VK friendly operations? Are stalls caused by mutual exclusion locking of atomic data easily hidden by other jobs or is some level of higher-than-normal cycle-latency guaranteed when using them?

    This answer was so good, it almost deserves its own post! Is there an official PDF resource that has more detail about memory optimization?

    Sean

  • While I understand that atomics have the potential to be non-cache friendly, are they GPU-compute or GL/VK friendly operations? Are stalls caused by mutual exclusion locking of atomic data easily hidden by other jobs or is some level of higher-than-normal cycle-latency guaranteed when using them?

    Within a single shader core atomics are exceptionally fast (single-cycle back-to-back access to the same atomic is entirely possible). The only point where things start slowing down is cache line locking: only one core can own an atomic cache line at any point in time, so a high frequency of access to that atomic from work groups on multiple cores will rapidly erode the single-cycle throughput through cross-core contention. Hence the OpenCL extension, which allows an array of atomics to be specified (one per core, spaced one cache line apart) and indexed from shader programs. If there is interest in a GL version of that extension it would be possible, but it's not available today.
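
    As a plain-C sketch of the layout that extension enables (NUM_CORES and the serial loop are stand-ins; on the GPU each core would index its own slot via the core_id and the slots would be hit in parallel, each on a private cache line):

    ```c
    #include <assert.h>
    #include <stdatomic.h>

    #define NUM_CORES 8  /* hypothetical core count */

    /* One counter per core, padded out to its own 64-byte cache line so
       no two cores ever contend for the same line. */
    typedef struct {
        atomic_uint count;
        char pad[64 - sizeof(atomic_uint)];
    } PaddedCounter;

    int main(void) {
        PaddedCounter counters[NUM_CORES] = {0};
        assert(sizeof(PaddedCounter) == 64);

        /* Each "core" bumps only its own slot (serially here for
           illustration). */
        for (unsigned core = 0; core < NUM_CORES; ++core)
            for (unsigned i = 0; i < 100; ++i)
                atomic_fetch_add(&counters[core].count, 1);

        /* A single reduction pass at the end recovers the global total. */
        unsigned total = 0;
        for (unsigned core = 0; core < NUM_CORES; ++core)
            total += atomic_load(&counters[core].count);
        assert(total == NUM_CORES * 100);
        return 0;
    }
    ```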

    Pete

  • Peter Harris wrote:

    Within a single shader core the atomics are exceptionally fast (single cycle back-to-back access to the same atomic is entirely possible).

    This is wonderful! And I suppose it makes intuitive sense in hindsight: confining the data to a single compute core ensures that only one core accesses it at a time, which likely plays a large role in the low-latency access. The CL extension is tremendously useful for tasks that require atomic access.

    I'm looking forward to exploring how atomic operations can be efficiently utilized in both Vulkan fragment and compute shaders. Perhaps a single compute-shader instance can be designed to contain a work-loop (rather than each loop being split across a workgroup) with exclusive access to an atomic block. While the instance would certainly have a longer run-time than an equivalent workgroup, there is the potential for it to be effectively threaded with other instances...

    Sean