
How best to maximize cache-write utilization for GPU compute?

What are some best practices for preventing data from being written out to RAM when structuring a GPU compute job that only needs a small amount of data? For example, if I wanted to do 10M read/write operations on a contiguous 1024B array and finally output, say, 1024B, would this be cached automatically, or are there things I should do to make caching more likely?

  • All data read by a shader core will get cached automatically. Some of the basics for Mali:

    • As always, try to minimize data sizes; 16-bit types are faster to process than 32-bit types and use half the storage in RAM and registers, so they can be a big win if you can design your algorithm to exploit them (there is a small fp16 sketch after this list).
    • Mali doesn't have any dedicated local memory region; we just use system RAM backed by normal cache policies. From a performance point of view local memory is therefore identical to global memory, so don't waste cycles copying data from global to local - it won't help and just adds more cache pressure.
    • The usual caching 101 principles of spatial and temporal locality apply - try to keep accesses to the same cache lines close together in time to minimize cache pressure and chances of having to re-fetch the same data from memory. This can be controlled by algorithm design, and by the shape of your kernel work groups.
    • Design your data accesses to use vector load/store operations; they make much more cache-friendly accesses (see the vload/vstore sketch after this list). In 2D datasets, such as the matrices in SGEMM, this may mean you need to transpose one or more of the input data sets to allow this; without the transpose you end up with scalar loads from each row, loading and using one entry from each cache line, and for sufficiently large input datasets probably thrashing the MMU TLB as well as the cache.
    • The L1 texture cache is a separate resource from the L1 load/store (LS) cache; use texturing operations rather than array accesses for constant, read-only resources. (This can have other performance side-effects, so it is worth testing both options to see which works best for your particular algorithm.)
    • Try to avoid writeable data, including atomics, which is heavily accessed by multiple work groups that span shader cores; you lose internal bandwidth within the shader core handling the parallel writes to the same cache line from multiple shader cores. If you need some form of "global but local to a single core" data structure to minimize this, we have a published OpenCL extension which can provide the core_id to the kernel code to allow dynamic array indexing (see the per-core counter sketch after this list): https://www.khronos.org/registry/cl/extensions/arm/cl_arm_get_core_id.txt
    • When writing output data, try to write entire 64-byte cache lines from a workgroup, without gaps or holes; partial writes are more expensive than full writes (see the full-line write sketch after this list). Partial means either shorter than 64 bytes (bad for AXI utilization) or containing unwritten regions which must be masked out to avoid corrupting the untouched parts (bad for DDR utilization from LPDDR4 onwards).
    • Use the offline Mali shader compiler to review your shaders and check whether they are spilling to stack; stack spills are a very quick way to burn bandwidth, and they are not obviously visible just from looking at the shader source code.
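
    To make the fp16 point concrete, here is a minimal sketch of the kind of kernel change involved. It assumes the device exposes the standard cl_khr_fp16 extension and that your data really can tolerate 16-bit precision; the kernel and argument names are just placeholders:

        #pragma OPENCL EXTENSION cl_khr_fp16 : enable

        // Same arithmetic as a float kernel, but every element is 2 bytes
        // instead of 4, halving load/store traffic and register footprint.
        __kernel void scale_fp16(__global const half *src,
                                 __global half *dst,
                                 const float k)
        {
            size_t i = get_global_id(0);
            dst[i] = src[i] * (half)k;
        }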
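
    For the vector load/store point, a sketch of the pattern (assuming float data and a global work size of element-count / 4; names are illustrative):

        // One float4 per work-item: a single 16-byte load and a single
        // 16-byte store instead of four scalar accesses each way.
        __kernel void scale_vec4(__global const float *src,
                                 __global float *dst,
                                 const float k)
        {
            size_t i = get_global_id(0);
            float4 v = vload4(i, src);
            vstore4(v * k, i, dst);
        }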
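
    For the "global but local to a single core" pattern, a rough sketch of a per-core counter, assuming the cl_arm_get_core_id extension linked above is available and exposes a built-in along the lines of arm_get_core_id() (check the extension text and your driver for the exact names); the match value and stride are illustrative:

        #pragma OPENCL EXTENSION cl_arm_get_core_id : enable

        // 16 uints = 64 bytes, so each core's counter sits on its own cache
        // line and cores never contend for the same line.
        #define CORE_SLOT_STRIDE 16

        __kernel void count_matches(__global const uint *data,
                                    __global uint *per_core_counts, /* zero-initialized */
                                    const uint match)
        {
            size_t i = get_global_id(0);
            if (data[i] == match) {
                uint core = arm_get_core_id();   /* built-in assumed from the extension */
                atomic_inc(&per_core_counts[core * CORE_SLOT_STRIDE]);
            }
        }

    A second pass (host side or a tiny follow-up kernel) still has to sum the per-core slots, but that touches one cache line per core rather than having every work-item hammer a single line.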
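
    And for the full-cache-line write point, the simplest illustration is a store that covers exactly one 64-byte line per work-item (again just a sketch; a float16 is 16 x 4 bytes):

        // Each work-item writes one complete, contiguous 64-byte cache line,
        // so the write needs no byte masking and no read-modify-write.
        __kernel void fill_lines(__global float *dst, const float value)
        {
            size_t i = get_global_id(0);
            vstore16((float16)(value), i, dst);
        }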

    HTH,
    Pete
