Hi there,
I am playing around with the Mali-G78 in a Pixel 6a, using Streamline.
As a trial I want to copy the contents of a 64 MB buffer into another one - here is my shader code:
```
#version 450

layout (local_size_x = 16, local_size_y = 1, local_size_z = 1) in;

layout(constant_id = 0) const int kBufferSizeElements = (64*1024*1024)/4;

layout(set = 0, binding = 0) buffer InputBuffer  { uint input_buffer[kBufferSizeElements];  };
layout(set = 0, binding = 1) buffer OutputBuffer { uint output_buffer[kBufferSizeElements]; };

void main()
{
    output_buffer[gl_GlobalInvocationID.x] = input_buffer[gl_GlobalInvocationID.x];
}
```
When I use Streamline to check the GPU counters, I observe the following:
Generally, can you explain how the memory should be accessed for best performance?
And finally a short question on the datasheet: is the "FP32 operations/cycle" figure per arithmetic unit or per core?
opm said: Is this max value realistic to achieve or am I already relatively high with the 30 GB/s?
I can't speak to any Pixel 6 specifics, but generally you'll never hit 100% DRAM utilization in real-world usage scenarios. Hitting 75% of theoretical is probably as good as it gets before DRAM paging overheads start to dominate.
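As purely illustrative arithmetic (the theoretical number here is made up; plug in the real figure for your platform):

```
practical ceiling ≈ 0.75 × theoretical bandwidth
e.g. theoretical  = 50 GB/s  ->  ceiling ≈ 37.5 GB/s
     measured     = 30 GB/s  ->  30 / 37.5 ≈ 80% of that ceiling
```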
opm said: What is the reason for these partial read/writes?
For most Mali cores, accesses are coalesced in groups of 4 threads, not across the whole warp. This is why the uvec4 helps - you only need 4 threads to fill the width of the cache load path. The latest Immortalis-G715 can coalesce across the whole warp, so this should improve on newer microarchitectures.
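For reference, here's a minimal untested sketch of what I mean by the uvec4 variant, assuming the bindings and the 16-thread workgroup stay the same as in your original shader:

```
#version 450

// Sketch: same copy, but each thread moves one uvec4 (16 bytes),
// so 4 coalescing threads together cover 64 bytes per access.
layout (local_size_x = 16, local_size_y = 1, local_size_z = 1) in;

layout(constant_id = 0) const int kBufferSizeVec4 = (64*1024*1024)/16;

layout(set = 0, binding = 0) buffer InputBuffer  { uvec4 input_buffer[kBufferSizeVec4];  };
layout(set = 0, binding = 1) buffer OutputBuffer { uvec4 output_buffer[kBufferSizeVec4]; };

// Dispatch 4x fewer invocations than the uint version.
void main()
{
    output_buffer[gl_GlobalInvocationID.x] = input_buffer[gl_GlobalInvocationID.x];
}
```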
opm said: Why is it getting slower when I have full read/writes only?
Is the only change to the shader the change to a uvec4 access size? Is your workgroup size still 16 threads?
TBH it's hard to say without more data; memory performance close to peak is very sensitive to small changes in timing and thread scheduling. The large increase in write stalls implies the issue may be outside of the GPU (i.e. that's the external bus saying it's busy). Perhaps the more efficient access pattern caused the bus or DRAM clock to drop?
Thank you very much for the fast and detailed answer.
The fact that threads are grouped into 4 and not 16 is totally new to me. Is this documented / explained somewhere in more detail? I checked "The Valhall shader core" document but didn't find anything related.
Yes, in the experiment with the uvec4, I only changed the access size from uint to uvec4; the workgroup size is still set to 16 threads. (Well, obviously I dispatch 4x fewer workgroups.)
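Concretely (unless I'm miscounting), the dispatch arithmetic for the two variants is:

```
64 MiB buffer = 16,777,216 uint elements = 4,194,304 uvec4 elements

uint  version: 16,777,216 / 16 threads per workgroup = 1,048,576 workgroups
uvec4 version:  4,194,304 / 16 threads per workgroup =   262,144 workgroups (4x fewer)
```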
opm said: Is this documented / explained somewhere in more detail ?
Not aware of anything more detailed (but not sure there is much more detail to give either ;)