Hi, I'm using Performance Advisor together with Streamline to profile our game.
The report says our GPU seems to be busy with load/store operations, and the optimization advice is mainly about how to optimize compute shaders.
The GPU cycle counters indicate we are fragment bound (Total 20K, Fragment 18K, Non-Fragment 16K).
The GPU external read bandwidth is 137M and the write bandwidth is around 350M.
Our game is quite complex, and the only compute-shader pass is some post-processing related to tone mapping.
So I looked into the related Streamline performance counters. The Streamline sample duration is 100 ms, and total load/store cycles is 3526K, which is about 1.8 times more than the other ALU operations.
Load/store total issues: 10.19 mega-cycles
Load/store full read issues: 0.31 mega-cycles
Load/store partial read issues: 3.84 mega-cycles
Load/store full write issues: 4 mega-cycles
Load/store partial write issues: 2.03 mega-cycles
Load/store atomic issues: < 0.01 mega-cycles
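To make the split easier to read, here is a minimal sketch that turns the issue counters above into shares of the total. The numbers are copied from the list; the helper name `issue_shares` is just something I made up for illustration, and atomics are left out because they are only reported as "< 0.01".

```python
# Load/store issue cycles in mega-cycles, copied from the counters above.
# Atomic issues are reported as "< 0.01" and are ignored here.
issues = {
    "full read": 0.31,
    "partial read": 3.84,
    "full write": 4.00,
    "partial write": 2.03,
}


def issue_shares(issue_cycles):
    """Return each issue type as a fraction of the summed issue cycles."""
    total = sum(issue_cycles.values())
    return {name: cycles / total for name, cycles in issue_cycles.items()}


shares = issue_shares(issues)
# Partial accesses are the ones that do not use the full width of the data path.
partial_share = shares["partial read"] + shares["partial write"]
write_share = shares["full write"] + shares["partial write"]

for name, share in shares.items():
    print(f"{name:>13}: {share:6.1%}")
print(f"partial total: {partial_share:6.1%}")
print(f"  write total: {write_share:6.1%}")
```

With these inputs, a little under 58% of the issues are partial accesses and a little over 59% are writes, so most reads are partial while most writes are full.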
We're testing on a G77 MP16 GPU. The docs say its LSC bandwidth is designed to be 32 bytes/cycle.
Load/store bytes read from L2 per access cycle: 8 bytes
Load/store bytes read from external memory per access cycle: 3 bytes
Load/store bytes written to L2 per access cycle: 16 bytes
Load/store read bytes from L2 cache: 32.48 MB
Texture read bytes from L2 cache: 8.48 MB
Load/store read bytes from external memory: 12.3 MB
Texture read bytes from external memory: 6.08 MB
Load/store write bytes: 97 MB
Tile buffer write bytes: 2.56 MB
Full quad warp rate: 96%
Diverged instruction issue rate: < 1%
All registers warp rate: 25%
Constant tile kill rate: 7%
The load/store write bytes seem to consume most of the L2 cache's bandwidth.
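As a rough check of that impression, here is a small sketch that compares the four L2-side byte counters listed above. Treating these four counters as the whole of the shader-core-to-L2 traffic is my own assumption, so the result is only indicative.

```python
# Byte counters in MB for the 100 ms sample, copied from the list above.
# Assumption: these four counters approximate the traffic between the
# shader cores and the L2 cache; other L2 clients are not included.
l2_traffic_mb = {
    "load/store read from L2": 32.48,
    "texture read from L2": 8.48,
    "load/store write": 97.0,
    "tile buffer write": 2.56,
}

total_mb = sum(l2_traffic_mb.values())
ls_write_share = l2_traffic_mb["load/store write"] / total_mb

for name, megabytes in l2_traffic_mb.items():
    print(f"{name:>24}: {megabytes:7.2f} MB ({megabytes / total_mb:5.1%})")
print(f"load/store write share of the listed traffic: {ls_write_share:.1%}")
```

Under that assumption, load/store writes are roughly 69% of the listed traffic, which is why they stand out to me.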
Although the counters above don't indicate any problem related to external memory load/store, these counters look suspicious:
Output external read bytes: 409 MB / 100 ms, 137 MB / frame
Output external write bytes: 1037 MB / 100 ms, 342 MB / frame
Output external read stall rate: 6%
Output external write stall rate: 200%
The external write stall rate is 200% ...
So, to my knowledge, Performance Advisor says we have a load/store problem, and there are at least two types of load/store traffic: one to the L2 cache and one to external memory.
For L2 cache load/stores, are we write bound? But when does the GPU write to the L2 cache? When register spilling happens (our all registers warp rate is 25%)? When filling the L2 cache with geometry or thread stack data for shader warps? I also tried showing and hiding some parts of our game, and the execution engine and load/store cycles go down together. What should I do? I'm a little confused here: what could cause high L2 load/store write bandwidth?
For external memory load/store, the counters say we're still write bound. I understand that writing to external memory is time consuming. Is 14 GB/s too much for the external memory system? Should we try to optimize external write bandwidth as well?
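For reference, this is how I arrive at the 14 GB/s figure: a minimal sketch that converts the external byte counters for the 100 ms sample into GB/s. It assumes 1 GB = 1000 MB, and the function name `bandwidth_gb_per_s` is only illustrative.

```python
def bandwidth_gb_per_s(megabytes, window_ms):
    """Convert bytes moved in a sample window to GB/s (1 GB = 1000 MB)."""
    gigabytes = megabytes / 1000.0
    seconds = window_ms / 1000.0
    return gigabytes / seconds


# External byte counters for the 100 ms Streamline sample, from the post.
SAMPLE_MS = 100
read_gbps = bandwidth_gb_per_s(409, SAMPLE_MS)
write_gbps = bandwidth_gb_per_s(1037, SAMPLE_MS)
total_gbps = read_gbps + write_gbps

print(f"external read : {read_gbps:.2f} GB/s")
print(f"external write: {write_gbps:.2f} GB/s")
print(f"total         : {total_gbps:.2f} GB/s")
```

So the writes alone account for a bit over 10 GB/s of the roughly 14.5 GB/s total.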
With your help, I now have a much deeper understanding of the Mali GPU's characteristics. I think I can handle the rest of the problems now, thank you!