Mali Memory Bandwidth and Access Pattern

Hi there,

I am playing around with the Mali-G78 in a Pixel 6a and Streamline.

As a trial, I want to copy the contents of one 64 MB buffer into another. Here is my shader code:

```
#version 450

layout (local_size_x = 16, local_size_y = 1, local_size_z = 1) in;
layout(constant_id = 0) const int kBufferSizeElements = (64*1024*1024)/4;
layout(set = 0, binding = 0) buffer InputBuffer {uint input_buffer[kBufferSizeElements];};
layout(set = 0, binding = 1) buffer OutputBuffer {uint output_buffer[kBufferSizeElements];};

void main() {
   output_buffer[gl_GlobalInvocationID.x] = input_buffer[gl_GlobalInvocationID.x];
}

```

When I use Streamline to check the GPU counters, I observe the following:

  • The memory bandwidth (read and write combined) is roughly 30 GB/s, even though the Pixel 6a spec quotes 50 GB/s max. Is this max value realistic to achieve, or am I already relatively high at 30 GB/s?
  • The load/store counters indicate that I am doing only partial reads/writes; no full reads/writes are reported. This I don't understand at all. I expect the warp, so 16 threads, to access the L/S unit with all-adjacent addresses. According to the "Mali-G78 Performance Counters Reference Guide", threads should access a 64-byte range for max performance, which I think I do. What is the reason for these partial reads/writes?
  • When I change the buffers from type `uint` to `uvec4` (and dispatch 4x less), I get rid of the partial reads/writes. However, bandwidth drops from 30 GB/s to less than 25 GB/s, memory write stalls increase from 46% to 62%, and the time for the copy increases from ~4.5 ms to ~5.5 ms. Why does it get slower when I have only full reads/writes? Evidently I am accessing the memory in a less optimal way, but why?
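For reference, here is a minimal sketch of the `uvec4` variant described in the last bullet. It assumes that only the element type and the element count change, and that the workgroup size stays at 16; the exact shader used for that measurement is not shown here, so treat this as an illustration rather than the measured code.

```glsl
#version 450

// Same workgroup size as the uint version (assumption: left unchanged).
layout (local_size_x = 16, local_size_y = 1, local_size_z = 1) in;

// 64 MiB of data, now counted in 16-byte uvec4 elements instead of 4-byte uints.
layout(constant_id = 0) const int kBufferSizeElements = (64*1024*1024)/16;
layout(set = 0, binding = 0) buffer InputBuffer {uvec4 input_buffer[kBufferSizeElements];};
layout(set = 0, binding = 1) buffer OutputBuffer {uvec4 output_buffer[kBufferSizeElements];};

void main() {
   // Each invocation copies one 16-byte vector.
   output_buffer[gl_GlobalInvocationID.x] = input_buffer[gl_GlobalInvocationID.x];
}
```

Each invocation now moves 16 bytes instead of 4, so covering the same 64 MB takes a quarter as many invocations, which is why the dispatch is 4x smaller.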

More generally, can you explain how memory should be accessed for best performance?

And finally, a short question on the datasheet: is the "FP32 operations/cycle" figure per arithmetic unit or per core?

  • Thank you very much for the fast and detailed answer.

    That threads are grouped into 4 and not 16 is totally new to me. Is this documented or explained in more detail somewhere? I checked "The Valhall shader core" document but didn't find anything related.

    Yes, in the experiment with the uvec4, I only changed the access size from uint to uvec4; the workgroup size is still set to 16 threads. (Well, obviously I dispatch 4x less.)
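
    If the grouping of 4 threads mentioned above is also the granularity at which load/store accesses are merged (an assumption on my side, not something confirmed by the documents I checked), the arithmetic behind the counters would look roughly like this, using the 64-byte range from the counter guide:

    ```latex
    \begin{aligned}
    \texttt{uint}:\quad  & 4 \text{ threads} \times 4\,\text{B}  = 16\,\text{B} < 64\,\text{B} && \text{(partial access)} \\
    \texttt{uvec4}:\quad & 4 \text{ threads} \times 16\,\text{B} = 64\,\text{B}               && \text{(full access)}
    \end{aligned}
    ```

    That would be consistent with seeing only partial reads/writes for `uint` and only full ones for `uvec4`, although it does not by itself explain the lower bandwidth.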
