
High Load/Store Cycles

Hi, I'm using Performance Advisor together with Streamline to profile our game.
The report says our GPU seems to be busy with load/store operations. The optimization advice is mainly about how to optimize compute shaders.
The GPU cycles indicate we are fragment bound (Total 20K, Fragment 18K, Non-Fragment 16K).
The GPU external read bandwidth is 137M, and the write bandwidth is around 350M.

Our game is quite complex, and the only compute shader pass is some post-processing related to tone mapping.
So I looked into the related Streamline performance counters:
The Streamline sample duration is 100 ms, the total load/store cycle count is 3526K, and that is about 1.8 times more than the other ALU operations.

Load/store total issues: 10.19 mega-cycles
Load/store full read issues: 0.31 mega-cycles
Load/store partial read issues: 3.84 mega-cycles
Load/store full write issues: 4 mega-cycles
Load/store partial write issues: 2.03 mega-cycles
Load/store atomic issues: < 0.01 mega-cycles

We're testing on a G77 MP16 GPU; the docs say its LSC bandwidth is designed to be 32 bytes/cycle.

Load/store bytes read from L2 per access cycle: 8 bytes
Load/store bytes read from external memory per access cycle: 3 bytes
Load/store bytes written to L2 per access cycle: 16 bytes
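
As a rough way to read the issue counters and per-cycle figures above, the sketch below works out what share of load/store issues are partial-width and how the quoted per-cycle byte counts compare with the 32 bytes/cycle design figure. Comparing each per-cycle figure directly against 32 bytes/cycle is an assumption made for illustration, not something the counter documentation is quoted as saying here.

```python
# Issue counts quoted in the post, in mega-cycles per 100 ms sample.
full_read, partial_read = 0.31, 3.84
full_write, partial_write = 4.00, 2.03
total_issues = 10.19

def share(part, whole):
    """Return part as a fraction of whole."""
    return part / whole

# Fraction of read / write issues that were partial-width.
partial_read_share = share(partial_read, full_read + partial_read)
partial_write_share = share(partial_write, full_write + partial_write)
partial_total_share = share(partial_read + partial_write, total_issues)

# Assumption: treat 32 bytes/cycle as the peak for each per-cycle figure.
LSC_PEAK_BYTES_PER_CYCLE = 32
l2_read_util = 8 / LSC_PEAK_BYTES_PER_CYCLE
l2_write_util = 16 / LSC_PEAK_BYTES_PER_CYCLE

print(f"partial reads:  {partial_read_share:.1%}")
print(f"partial writes: {partial_write_share:.1%}")
print(f"partial total:  {partial_total_share:.1%}")
print(f"L2 read bytes/cycle vs peak:  {l2_read_util:.0%}")
print(f"L2 write bytes/cycle vs peak: {l2_write_util:.0%}")
```

With these numbers, most read issues are partial-width, which suggests looking at access width and data layout rather than only at the total cycle count.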

Load/store read bytes from L2 cache: 32.48 MB
Texture read bytes from L2 cache: 8.48 MB
Load/store read bytes from external memory: 12.3 MB
Texture read bytes from external memory: 6.08 MB
Load/store write bytes: 97 MB
Tile buffer write bytes: 2.56 MB

Full quad warp rate: 96%
Diverged instruction issue rate: < 1%
All registers warp rate: 25%
Constant tile kill rate: 7%

The load/store write bytes seem to consume most of the L2 cache's bandwidth.

Although the counters above don't indicate any problem related to external memory load/store, these counters look suspicious:

Output external read bytes: 409 MB / 100 ms, 137 MB / frame
Output external write bytes: 1037 MB / 100 ms, 342 MB / frame

Output external read stall rate: 6%
Output external write stall rate: 200%

The external write stall rate is 200% ...

So, to my knowledge, Performance Advisor says we have a load/store problem, and there are at least two types of load/store traffic:
one to the L2 cache, and one to external memory.

For L2 cache load/stores, are we write bound? But when does the GPU write to the L2 cache? When register spilling happens (we have a 25% all registers warp rate)? When filling the L2 cache with geometry or thread stack data for shader warps? I also tried showing and hiding parts of our game, and the execution engine and load/store cycles go down together. What should I do? I'm a little confused about what could cause such high L2 load/store write bandwidth.

For external memory load/store, the counters say we're still write bound. I understand that writing to external memory is time consuming.
Is 14GB/s too much for the external memory system? Should we try to optimize external write bandwidth as well?
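
For reference, the 14GB/s figure appears to come from adding the external read and write counters and scaling the 100 ms sample up to one second. A minimal sketch of that arithmetic, assuming 1 GB = 1000 MB:

```python
# Counter values quoted in the post, per 100 ms Streamline sample.
SAMPLE_SECONDS = 0.100
read_mb = 409.0
write_mb = 1037.0

def mb_per_sample_to_gb_per_s(megabytes, sample_seconds):
    """Convert MB moved in one sample window to GB/s (1 GB = 1000 MB)."""
    return megabytes / 1000.0 / sample_seconds

read_gbps = mb_per_sample_to_gb_per_s(read_mb, SAMPLE_SECONDS)
write_gbps = mb_per_sample_to_gb_per_s(write_mb, SAMPLE_SECONDS)
total_gbps = read_gbps + write_gbps

print(f"read:  {read_gbps:.2f} GB/s")
print(f"write: {write_gbps:.2f} GB/s")
print(f"total: {total_gbps:.2f} GB/s")
```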

  • So, to my knowledge, Performance Advisor says we have a load/store problem, and there are at least two types of load/store traffic.

    That part of the advice is local to the shader core - load/store is the highest pipeline load, so that's the critical path for shader execution (ignoring external memory effects).

    In addition to compute, load/store is used for reading and writing vertex data. Check your mesh complexity (primitive count, vertex count, and number of bytes per vertex) and CPU-side culling to make sure you are not processing more geometry than the GPU can cope with.

    In general for mobile aim for < 300-500K input vertices per frame, and ~32 bytes per vertex.
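
To put that vertex guideline into bandwidth terms, here is a rough sketch. It assumes every input vertex is read exactly once per frame and ignores index buffers and cache reuse, so treat the result as an order-of-magnitude estimate only.

```python
BYTES_PER_VERTEX = 32  # guideline figure from the reply above
FPS = 60               # assumed target frame rate

def vertex_read_mb_per_frame(vertex_count, bytes_per_vertex=BYTES_PER_VERTEX):
    """Input vertex data per frame in MB, assuming each vertex is read once."""
    return vertex_count * bytes_per_vertex / 1e6

def mb_per_frame_to_gb_per_s(mb_per_frame, fps=FPS):
    """Scale a per-frame figure in MB to GB/s at the given frame rate."""
    return mb_per_frame * fps / 1000.0

low_mb = vertex_read_mb_per_frame(300_000)
high_mb = vertex_read_mb_per_frame(500_000)

print(f"{low_mb:.1f} to {high_mb:.1f} MB/frame")
print(f"{mb_per_frame_to_gb_per_s(low_mb):.2f} to "
      f"{mb_per_frame_to_gb_per_s(high_mb):.2f} GB/s at {FPS} FPS")
```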

    Is 14GB/s too much for the external memory system?

    Yes, that's too high.

    A high-end chipset might be able to provide that for short periods, but DRAM access is very power hungry. That much memory access costs about 1.4 Watts even if the CPU and GPU are completely idle, so you'll tend to overheat. A mass-market chipset can probably only provide a maximum of 6-10 GB/s for the whole SoC.

    In general we recommend aiming for 3-5 GB/s for the GPU, which gives a maximum of 80MB/frame (total for read and write) at 60 FPS.
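
The per-frame budget and the power figure above follow from simple arithmetic. The sketch below derives both, taking the power cost per GB/s from the "14GB/s costs about 1.4 Watts" statement in this reply; the linear scaling is an assumption for illustration.

```python
def per_frame_budget_mb(gb_per_s, fps):
    """Bandwidth budget per frame in MB (1 GB = 1000 MB)."""
    return gb_per_s * 1000.0 / fps

# Derived from the reply: about 1.4 W for 14 GB/s of DRAM traffic.
WATTS_PER_GBPS = 1.4 / 14.0

def dram_power_watts(gb_per_s):
    """Estimated DRAM access power, assuming it scales linearly with bandwidth."""
    return gb_per_s * WATTS_PER_GBPS

low_budget = per_frame_budget_mb(3.0, 60)
high_budget = per_frame_budget_mb(5.0, 60)

print(f"budget at 60 FPS: {low_budget:.0f} to {high_budget:.0f} MB/frame")
print(f"power at 5 GB/s:  {dram_power_watts(5.0):.2f} W")
print(f"power at 14 GB/s: {dram_power_watts(14.0):.2f} W")
```

At the top of the recommended range this gives roughly 83 MB per frame, which matches the rounded 80MB/frame figure.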

    In your other post I think you said you were writing MSAA data back to memory at the end of a render pass. Don't do that - it's very expensive. You have 4x the framebuffer samples to write, and you also lose framebuffer compression, so ~8x the bandwidth of a normal render pass. If you use MSAA, ensure you resolve samples before you write back at the end of the render pass. If you can't do that for algorithm reasons, then it's better not to use MSAA.
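
A sketch of where the ~8x comes from. The resolution, pixel format, and 2:1 compression ratio below are hypothetical values chosen for illustration; the 2:1 ratio is only inferred from "4x samples plus lost compression gives ~8x", and real framebuffer compression ratios depend on content.

```python
# Hypothetical render target, for illustration only.
WIDTH, HEIGHT = 1920, 1080
BYTES_PER_PIXEL = 4        # e.g. an RGBA8 color target
MSAA_SAMPLES = 4
COMPRESSION_RATIO = 2.0    # assumed average saving from framebuffer compression

def resolved_write_bytes():
    """Write-back of a resolved, compressed single-sample target."""
    return WIDTH * HEIGHT * BYTES_PER_PIXEL / COMPRESSION_RATIO

def unresolved_msaa_write_bytes():
    """Write-back of every MSAA sample, assumed uncompressed."""
    return WIDTH * HEIGHT * BYTES_PER_PIXEL * MSAA_SAMPLES

resolved = resolved_write_bytes()
unresolved = unresolved_msaa_write_bytes()
ratio = unresolved / resolved

print(f"resolved:   {resolved / 1e6:.2f} MB per pass")
print(f"unresolved: {unresolved / 1e6:.2f} MB per pass")
print(f"ratio:      {ratio:.0f}x")
```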

    Pete

  • Unknown said:
    That part of the advice is local to the shader core - load/store is the highest pipeline load, so that's the critical path for shader execution (ignoring external memory effects).

    Our average primitive count is around 500K, and I'll keep optimizing geometry culling and primitive counts, thanks.

    Anyway, I'm using mali-offline-compiler to inspect our shaders. I tested some basic vertex shaders and found that we seem to exceed both the work register and uniform register limits.

    So I wonder what would happen if I exceed the work register limit, for example when running on Mali-G76:

    The maximum number of work registers is 64. If, for example, the position or varying variant exceeds this limit, what would happen?
    Does this mean the GPU may have to load/store part of the register data through the LSC when needed, which is time consuming?

    BTW, what's the relationship between the LSC (16K) and the load/store L2 cache? Do they mean the same thing?

    And when exceeding the uniform register limit (16, I guess?), will the Mali GPU do the same thing and load/store uniform registers through the LSC?

    If that's true, does this mean some part of our load/store pressure comes from using too many registers?

  • The maximum number of work registers is 64. If, for example, the position or varying variant exceeds this limit, what would happen?

    If you exceed the 64 register capacity a shader would start to spill to stack. You can see that's not happening here - stack spilling is reported as false. The compiler tries hard not to do this - spilling in a GPU is very expensive in terms of bandwidth due to the thread count involved.

    BTW, what's the relationship between the LSC (16K) and the load/store L2 cache? Do they mean the same thing?

    LSC - level 1 cache in the shader core, only used for load/store data, and typically 16KB in size.

    L2 - level 2 unified cache, shared by multiple shader cores, used for most types of data (shaders, descriptors, buffers, textures, etc), and typically anywhere up to 2-4MB for large GPU designs. Slower to access than the L1 though. 

    And when exceeding the uniform register limit (16, I guess?), will the Mali GPU do the same thing and load/store uniform registers through the LSC?

    The uniform register limit is large (128 registers per draw), so you have to try very hard to exceed it. If you do, or if uniform registers can't be used (e.g. dynamic array index), then yes, the uniforms will be loaded through the load/store cache.

    If that's true, does this mean some part of our load/store pressure comes from using too many registers?

    Without seeing the shader, it's impossible for me to give a definitive answer. You're not spilling, so work register pressure isn't the problem, but you might be loading uniforms from memory (either due to size or how they are being used).

    For vertex shaders, loading vertex attributes is more likely to be the major cost, as that goes down the load/store path.

  • Thank you again for sharing this information. It's very inspiring.

  • Hi, one last thing to confirm :D

    This chart shows the performance of the shader core's different units in cycle counts, right?

    So the high load/store cycle count is caused by loading and storing data from the unified L2 cache and external DRAM, and doesn't include LSC bandwidth. Is this correct?

    If we confirm that we consume too much external memory write bandwidth, what's the main cause of writing to external DRAM?
    In my understanding, a TBR GPU only writes back the necessary tile memory to DRAM, and I could calculate the total bandwidth from the render target resolution and format.
    What else could be written to DRAM? Is it related to draw calls?

    I understand that tile buffer write means tile write-back. What do load/store writeback and load/store other mean?
    For example, when register spilling happens, does the GPU need to write stack memory to L2, and eventually to external memory?
    What other memory write actions does it cover?

  • This chart shows the performance of the shader core's different units in cycle counts, right?

    Yes, it shows the utilization of the major functional part of each pipeline - i.e. the bit that does real work.  

    It is deliberately simplified and can't show all the detail of every pipeline, so there may be bottlenecks in each pipe which don't show up here if content starts stressing the pipe in unusual ways. Streamline has the full counter set and can show more detail.

    So the high load/store cycle count is caused by loading and storing data from the unified L2 cache and external DRAM, and doesn't include LSC bandwidth. Is this correct?

    The LSC counter just shows L1 cache accesses. Each access might not access L2 or external memory at all if the data is already in the L1. 

    If we confirm that we consume too much external memory write bandwidth, what's the main cause of writing to external DRAM?

    Intermediate outputs from the vertex shader are stored back to memory before being read by the fragment shader, so the main cause of load/store writes is vertex shader outputs. Minimizing vertex count and reducing precision (use mediump outputs from the vertex shader as much as possible - usually only need highp for position and depth) can help to reduce the write bandwidth here.

    Other possible sources are writes from compute shaders (either to buffers or to images).
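
To get a feel for how much the mediump advice can save, here is a rough sketch of vertex shader output size per frame. The vertex count and varying layout are hypothetical, and the sketch assumes highp outputs are stored as 32-bit floats and mediump outputs as 16-bit floats, ignoring any culling or packing done by the hardware.

```python
HIGHP_VEC4_BYTES = 16    # assumed: 4 x 32-bit float
MEDIUMP_VEC4_BYTES = 8   # assumed: 4 x 16-bit float

def vertex_output_mb(vertices, highp_vec4s, mediump_vec4s):
    """Vertex shader output written per frame in MB for the given vec4 counts."""
    bytes_per_vertex = (highp_vec4s * HIGHP_VEC4_BYTES
                        + mediump_vec4s * MEDIUMP_VEC4_BYTES)
    return vertices * bytes_per_vertex / 1e6

VERTICES = 500_000  # hypothetical per-frame vertex count

# Position plus four varyings, everything highp.
all_highp_mb = vertex_output_mb(VERTICES, highp_vec4s=5, mediump_vec4s=0)
# Position stays highp, the four varyings drop to mediump.
mixed_mb = vertex_output_mb(VERTICES, highp_vec4s=1, mediump_vec4s=4)

print(f"all highp:           {all_highp_mb:.0f} MB/frame")
print(f"mediump varyings:    {mixed_mb:.0f} MB/frame")
print(f"write traffic saved: {1 - mixed_mb / all_highp_mb:.0%}")
```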

  • With your help, I now have a much deeper understanding of the Mali GPU's characteristics. I think I can handle the rest of the problems now, thank you!

  • No problem at all, we're always happy to help.

    Cheers, 
    Pete