
How best to maximize cache-write utilization for gpu-compute?

What are some best practices for preventing data from being written out to RAM when structuring a compute job on the GPU that requires a small amount of data? For example, if I wanted to do 10M read/write operations on a contiguous 1024B array and finally output, say, 1024B, would this be automatically cached or are there things that should be done to make caching more likely?

Parents
  • Thanks, Pete!

    This was amazingly helpful, and extremely informative! Stack spilling is something that I hadn't considered, but I can appreciate how expensive such operations would be, which makes a very good case for using a profiler. I also appreciated (but hadn't considered) the recommendation of using 16-bit types for more efficient use of the register file!

    I have an unrelated question regarding atomics. While I understand that atomics have the potential to be non-cache friendly, are they GPU-compute or GL/VK friendly operations? Are stalls caused by mutual exclusion locking of atomic data easily hidden by other jobs or is some level of higher-than-normal cycle-latency guaranteed when using them?

    This answer was so good, it almost deserves its own post! Is there an official PDF resource that has more detail about memory optimization?

    Sean

Children
  • While I understand that atomics have the potential to be non-cache friendly, are they GPU-compute or GL/VK friendly operations? Are stalls caused by mutual exclusion locking of atomic data easily hidden by other jobs or is some level of higher-than-normal cycle-latency guaranteed when using them?

    Within a single shader core, atomics are exceptionally fast: single-cycle back-to-back access to the same atomic is entirely possible. The only point where we start slowing down is cache line locking. Only one core can own an atomic cache line at any point in time, so if multiple work groups access that atomic at high frequency, contention across cores will rapidly erode that single-cycle throughput. Hence the OpenCL extension that allows an array of atomics to be specified (one per core, spaced one cache line apart) and indexed from shader programs. A GL version of that extension would be possible if there is interest, but it is not available today.
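
    To make the "one atomic per core, spaced one cache line apart" layout concrete, here is a minimal CPU-side sketch in C++. It is only an analogy for the idea, not the OpenCL extension's API: the 64-byte line size, the `PaddedCounter` and `PerCoreCounters` names, and the `core_id` parameter are all assumptions made for illustration.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumption: 64-byte cache lines. The real line size is hardware-specific.
constexpr std::size_t kCacheLine = 64;

// One counter per core, aligned and padded so that no two counters
// ever share a cache line.
struct alignas(kCacheLine) PaddedCounter {
    std::atomic<std::uint32_t> value{0};
};

// Hypothetical wrapper: an array of atomics indexed by core.
class PerCoreCounters {
public:
    // num_cores must be greater than zero.
    explicit PerCoreCounters(std::size_t num_cores) : slots_(num_cores) {}

    // Each core updates only its own slot, so the cache line holding that
    // slot never has to move between cores.
    void add(std::size_t core_id, std::uint32_t delta) {
        slots_[core_id % slots_.size()].value.fetch_add(
            delta, std::memory_order_relaxed);
    }

    // Combine the per-core slots once, after the work has finished.
    std::uint64_t total() const {
        std::uint64_t sum = 0;
        for (const PaddedCounter& slot : slots_) {
            sum += slot.value.load(std::memory_order_relaxed);
        }
        return sum;
    }

private:
    std::vector<PaddedCounter> slots_;
};
```

    The trade-off in this sketch is memory for contention: each counter occupies a whole cache line, and the final value only becomes available after the slots are summed.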

    Pete

  • Peter Harris wrote:

    Within a single shader core the atomics are exceptionally fast (single cycle back-to-back access to the same atomic is entirely possible).

    This is wonderful! And I suppose it makes intuitive sense in hindsight: confining the data to a single compute core ensures that only one access to it can happen at a time, which likely plays a large role in the low-latency access. The CL extension is tremendously useful for tasks that require atomic access.

    I'm looking forward to exploring how atomic operations can be efficiently utilized in both Vulkan fragment and compute shaders. Perhaps a single compute-shader instance can be designed to contain a work-loop (rather than each loop being split across a workgroup) with exclusive access to an atomic block. While the instance would certainly have a longer run-time than an equivalent workgroup, there is the potential for it to be effectively threaded with other instances...
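
    One possible reading of the work-loop idea, sketched in C++ rather than in a shader language: a single instance loops over all of its items, accumulates privately, and touches the shared atomic only once at the end. The `run_instance` helper and its parameters are hypothetical names for illustration, and this is a CPU-side analogy rather than a measured GPU pattern.

```cpp
#include <atomic>
#include <cstdint>
#include <vector>

// Hypothetical helper: one "instance" processes every item in its own loop
// and publishes the result with a single atomic operation.
std::uint32_t run_instance(std::atomic<std::uint32_t>& shared_total,
                           const std::vector<std::uint32_t>& items) {
    std::uint32_t local = 0;  // private to this instance, no atomics needed
    for (std::uint32_t item : items) {
        local += item;
    }
    // One atomic update per instance instead of one per item.
    // fetch_add returns the value held before the addition.
    return shared_total.fetch_add(local, std::memory_order_relaxed);
}
```

    Whether this beats splitting the loop across a workgroup would depend on how well the longer-running instance overlaps with other work, which is exactly what a profiler would need to confirm.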

    Sean