Mali Memory Bandwidth and Access Pattern

Hi there,

I am experimenting with the Mali-G78 in a Pixel 6a, using Streamline for profiling.

As a test, I want to copy the contents of a 64 MB buffer into another one. Here is my shader code:

```
#version 450

layout (local_size_x = 16, local_size_y = 1, local_size_z = 1) in;
layout(constant_id = 0) const int kBufferSizeElements = (64*1024*1024)/4;
layout(set = 0, binding = 0) buffer InputBuffer {uint input_buffer[kBufferSizeElements];};
layout(set = 0, binding = 1) buffer OutputBuffer {uint output_buffer[kBufferSizeElements];};

void main() {
   output_buffer[gl_GlobalInvocationID.x] = input_buffer[gl_GlobalInvocationID.x];
}

```

When I use Streamline to check the GPU counters, I observe the following:

  • The memory bandwidth (read and write together) is roughly 30 GB/s, even though the spec of the Pixel 6a lists a maximum of 50 GB/s. Is this maximum realistic to achieve, or am I already relatively high with 30 GB/s?
  • The load/store counters indicate that I am doing partial reads/writes only; no full reads/writes are reported. This I don't understand at all. I expect the warp, so 16 threads, to access the L/S unit with all adjacent addresses. According to the "Mali-G78 Performance Counters Reference Guide", threads should access a 64-byte range for maximum performance, which I think I do. What is the reason for these partial reads/writes?
  • When I change the buffers from type `uint` to `uvec4` (and dispatch 4x fewer threads), I get rid of the partial reads/writes. However, bandwidth drops from 30 GB/s to less than 25 GB/s, memory write stalls increase from 46% to 62%, and the time for the copy increases from ~4.5 ms to ~5.5 ms. Why does it get slower when I have only full reads/writes? Obviously I am accessing the memory in a less optimal way, but why?
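
For reference, the `uvec4` variant from the last bullet can be sketched as follows. This is a minimal sketch, assuming the only changes are the element type and the element count (now the buffer size in bytes divided by 16); the workgroup size stays at 16 and all names are carried over from the shader above.

```
#version 450

// Same workgroup size as the uint version: 16 threads in x.
layout (local_size_x = 16, local_size_y = 1, local_size_z = 1) in;

// Assumption: each element is now a 16-byte uvec4, so the 64 MB buffer
// holds 4x fewer elements than the uint version.
layout(constant_id = 0) const int kBufferSizeElements = (64*1024*1024)/16;
layout(set = 0, binding = 0) buffer InputBuffer {uvec4 input_buffer[kBufferSizeElements];};
layout(set = 0, binding = 1) buffer OutputBuffer {uvec4 output_buffer[kBufferSizeElements];};

void main() {
   // Each thread copies one uvec4 (16 bytes) instead of one uint (4 bytes).
   output_buffer[gl_GlobalInvocationID.x] = input_buffer[gl_GlobalInvocationID.x];
}
```

With this layout, 4 adjacent threads together cover 64 bytes and 16 adjacent threads cover 256 bytes, compared with 16 and 64 bytes respectively in the `uint` version.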

Generally, can you explain how memory should be accessed for best performance?

And finally, a short question on the datasheet: is the "FP32 operations/cycle" figure per arithmetic unit or per core?

  • Thank you very much for the fast and detailed answer.

    That threads are grouped into 4 and not 16 is totally new to me. Is this documented or explained somewhere in more detail? I checked "The Valhall shader core" document but didn't find anything related.

    Yes, in the experiment with the `uvec4`, I only changed the access size from `uint` to `uvec4`; the workgroup size is still set to 16 threads. (And, obviously, I dispatch 4x fewer threads.)
