Hi, I'm using Performance Advisor together with Streamline to profile our game. The report says our GPU seems to be busy with load/store operations, and the optimization advice is mainly about how to optimize compute shaders.
The GPU cycles indicate we are fragment bound (Total 20K, Fragment 18K, Non-Fragment 16K).
The GPU external read bandwidth is 137M and the write bandwidth is around 350M.
Our game is quite complex, and the only compute-shader-related pass is some post-processing related to tone mapping.
So I looked into Streamline's related performance counters:
The Streamline sample duration is 100 ms, and total load/store cycles is 3526K, which is about 1.8 times more than other ALU operations.
Load/store total issues: 10.19 mega-cycles
Load/store full read issues: 0.31 mega-cycles
Load/store partial read issues: 3.84 mega-cycles
Load/store full write issues: 4 mega-cycles
Load/store partial write issues: 2.03 mega-cycles
Load/store atomic issues: < 0.01 mega-cycles
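To make the breakdown easier to read, here is a small Python sketch that turns the issue counters above into shares of the total. The variable and function names are only for illustration, and the atomic issues (< 0.01 mega-cycles) are left out, so the four categories add up to slightly less than the reported total.

```python
# Load/store issue counters from the 100 ms Streamline sample, in mega-cycles.
# Atomic issues (< 0.01 mega-cycles) are omitted.
issues = {
    "full_read": 0.31,
    "partial_read": 3.84,
    "full_write": 4.00,
    "partial_write": 2.03,
}
total_issues = 10.19  # reported "Load/store total issues"


def share(*names):
    """Fraction of total load/store issues spent on the named categories."""
    return sum(issues[n] for n in names) / total_issues


partial_share = share("partial_read", "partial_write")
write_share = share("full_write", "partial_write")

for name, cycles in issues.items():
    print(f"{name:14s} {cycles:5.2f} Mcycles  {cycles / total_issues:6.1%}")
print(f"partial accesses: {partial_share:.1%}")
print(f"write accesses:   {write_share:.1%}")
```

By this arithmetic, a bit under 60% of the load/store issues are partial accesses, and a bit under 60% are writes.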
We're testing on a G77MP16 GPU; the docs say its LSC bandwidth is designed to be 32 bytes/cycle.
Load/store bytes read from L2 per access cycle: 8 bytes
Load/store bytes read from external memory per access cycle: 3 bytes
Load/store bytes written to L2 per access cycle: 16 bytes
Load/store read bytes from L2 cache: 32.48 MB
Texture read bytes from L2 cache: 8.48 MB
Load/store read bytes from external memory: 12.3 MB
Texture read bytes from external memory: 6.08 MB
Load/store write bytes: 97 MB
Tile buffer write bytes: 2.56 MB
Full quad warp rate: 96%
Diverged instruction issue rate: <1%
All registers warp rate: 25%
Constant tile kill rate: 7%
The load/store write bytes seem to consume most of the L2 cache's bandwidth.
Although the counters above don't indicate any problem related to external memory load/store, these counters look suspicious:
Output external read bytes: 409 MB / 100 ms, 137 MB / frame
Output external write bytes: 1037 MB / 100 ms, 342 MB / frame
Output external read stall rate: 6%
Output external write stall rate: 200%
The external write stall rate is 200% ...
So, to my knowledge, Performance Advisor says we have a load/store problem, and there are at least two types of load/store: one for the L2 cache and one for external memory.
For L2 cache load/stores, are we write bound? But when does the GPU write to the L2 cache? When register spilling happens (we have an all registers warp rate of 25%)? When filling the L2 cache with geometry or thread stack data for shader warps? I also tried showing and hiding some portions of our game, and the execution engine and load/store cycles go down together. What should I do? I'm a little confused here about what could cause a high L2 load/store write bandwidth.
For external memory load/store, the counters say we're still write bound. I understand that writing to external memory is time consuming. Is 14 GB/s too much for the external memory system? Should we try to optimize external write bandwidth as well?
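The 14 GB/s figure is the sum of the external read and write counters above, converted from the 100 ms sample to a per-second rate. A minimal Python sketch of that arithmetic, assuming the counters are in decimal megabytes (1 GB = 1000 MB); the helper name is only for illustration:

```python
SAMPLE_SECONDS = 0.1  # Streamline sample duration (100 ms)


def mb_per_sample_to_gb_per_s(megabytes, sample_seconds=SAMPLE_SECONDS):
    """Convert MB per sample window to GB/s, assuming 1 GB = 1000 MB."""
    return megabytes / sample_seconds / 1000.0


read_gbps = mb_per_sample_to_gb_per_s(409)    # Output external read bytes
write_gbps = mb_per_sample_to_gb_per_s(1037)  # Output external write bytes
total_gbps = read_gbps + write_gbps

# Per-sample bytes divided by per-frame bytes gives frames per 100 ms sample.
frames_per_sample = 409 / 137

print(f"read:  {read_gbps:.2f} GB/s")
print(f"write: {write_gbps:.2f} GB/s")
print(f"total: {total_gbps:.2f} GB/s")
print(f"frames per sample: {frames_per_sample:.2f}")
```

So the external traffic is roughly 4 GB/s of reads plus roughly 10 GB/s of writes, at about three frames per 100 ms sample.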
Thank you again for sharing this information. It's very inspiring.