Hi there,
I am playing around with the Mali-G78 in a Pixel 6a and Streamline.
As a trial, I want to copy the contents of one 64 MB buffer into another. Here is my shader code:
```
#version 450
layout (local_size_x = 16, local_size_y = 1, local_size_z = 1) in;
layout(constant_id = 0) const int kBufferSizeElements = (64*1024*1024)/4;
layout(set = 0, binding = 0) buffer InputBuffer { uint input_buffer[kBufferSizeElements]; };
layout(set = 0, binding = 1) buffer OutputBuffer { uint output_buffer[kBufferSizeElements]; };

void main() {
    output_buffer[gl_GlobalInvocationID.x] = input_buffer[gl_GlobalInvocationID.x];
}
```
When I use Streamline to check the GPU counters, I observe the following:
Generally, can you explain how memory should be accessed for best performance?
And finally, a short question on the datasheet: is the "FP32 operations/cycle" figure per arithmetic unit or per core?
Thank you very much for the fast and detailed answer.
That threads are grouped into 4 and not 16 is totally new to me. Is this documented or explained somewhere in more detail? I checked "The Valhall shader core" document but didn't find anything related.
Yes, in the experiment with the uvec4, I only changed the access size from uint to uvec4; the workgroup size is still set to 16 threads. (Obviously, I dispatch 4x fewer workgroups.)
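For reference, the uvec4 variant described above might look roughly like the sketch below. It is not the exact shader from the experiment: the constant name `kBufferSizeVec4` is made up for illustration, and the block assumes the same 64 MB buffers and bindings as the original uint shader.

```glsl
#version 450
// Sketch only: same copy kernel as the uint version, but each invocation
// moves one uvec4 (16 bytes) instead of one uint (4 bytes).
layout (local_size_x = 16, local_size_y = 1, local_size_z = 1) in;

// Hypothetical name: element count expressed in uvec4 units (64 MB / 16 bytes).
layout(constant_id = 0) const int kBufferSizeVec4 = (64*1024*1024)/16;

layout(set = 0, binding = 0) buffer InputBuffer { uvec4 input_buffer[kBufferSizeVec4]; };
layout(set = 0, binding = 1) buffer OutputBuffer { uvec4 output_buffer[kBufferSizeVec4]; };

void main() {
    // One 16-byte load and one 16-byte store per invocation.
    output_buffer[gl_GlobalInvocationID.x] = input_buffer[gl_GlobalInvocationID.x];
}
```

With the workgroup size unchanged at 16, the dispatch covers a quarter as many invocations as the uint version, since each invocation now copies four times as much data.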
opm said: Is this documented / explained somewhere in more detail?
Not aware of anything more detailed (but not sure there is much more detail to give either ;)