What are some best practices for preventing data from being written out to RAM when structuring a compute job on the GPU that requires a small amount of data? For example, if I wanted to do 10M read/write operations on a contiguous 1024B array and finally output, say, 1024B, would this be automatically cached or are there things that should be done to make caching more likely?
Peter Harris wrote:
Within a single shader core the atomics are exceptionally fast (single cycle back-to-back access to the same atomic is entirely possible).
This is wonderful! And I suppose it makes intuitive sense in hindsight: confining the data to a single compute core ensures that only one access to it can happen at a time, which likely plays a large role in the low-latency access. The CL extension is tremendously useful for tasks that require atomic access.
I'm looking forward to exploring how atomic operations can be efficiently utilized in both Vulkan fragment and compute shaders. Perhaps a single compute-shader instance can be designed to contain a work-loop (rather than each loop being split across a workgroup) with exclusive access to an atomic block. While the instance would certainly have a longer run-time than an equivalent workgroup, there is the potential for it to be effectively threaded with other instances...
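To make the work-loop idea concrete, here is a rough CPU-side analogue in C++. It is not GPU code and says nothing about how a particular driver caches data; it only sketches the access pattern I have in mind: copy the small block in once, run every iteration against a private copy, and write the result out once. The names `kBlockBytes`, `Block` and `run_instance`, and the increment used as the "work", are placeholders of my own.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical block size, matching the 1024B example in the question.
constexpr std::size_t kBlockBytes = 1024;
using Block = std::array<std::uint8_t, kBlockBytes>;

// One "instance": owns its block exclusively for the whole work loop.
// The input is read once, all iterations touch only the private copy,
// and the result is handed back once at the end.
Block run_instance(const Block& input, std::uint32_t iterations) {
    Block local = input;  // single read of the source data

    for (std::uint32_t i = 0; i < iterations; ++i) {
        // Placeholder work: increment one byte per iteration, walking
        // the block cyclically. Unsigned arithmetic wraps at 256.
        const std::size_t idx = i % kBlockBytes;
        local[idx] = static_cast<std::uint8_t>(local[idx] + 1);
    }

    return local;  // single write of the final 1024B result
}
```

Calling `run_instance` on a zero-filled block with 2048 iterations visits every byte twice, so every byte ends up as 2. In a real shader the private copy would presumably live in workgroup-local or register storage, and several such instances could run side by side, each on its own block.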
Sean