
How do I find the shader or render pass corresponding to a write stall?

I found a high write stall rate in the timeline, but I can't find the corresponding shader or render pass in Streamline. Is there a solution?

  • Should I add Streamline annotations in the engine? I found that Arm has a Unity Streamline annotation package, but I use an engine I wrote myself. Is there a solution?
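
    For my own engine, I am guessing I could use the C annotation API that ships with Streamline's gator sources (streamline_annotate.h), something like the sketch below. The macro names are written from memory and need checking against the header in the Streamline install, and the channel regions mark CPU-side recording/submission time rather than GPU execution, so lining them up with GPU counters is only approximate.

    ```cpp
    // Hypothetical per-render-pass instrumentation for a custom engine, using the
    // annotation macros from gator's streamline_annotate.h. Macro names are quoted
    // from memory; verify them against the header in your Streamline version.
    #include "streamline_annotate.h"

    // Channel and group IDs are arbitrary application-chosen integers.
    enum { GROUP_GPU_PASSES = 1, CHANNEL_RENDER_PASS = 1 };

    void InitProfilingAnnotations()
    {
        ANNOTATE_SETUP;                                        // initialise the connection to gatord once at startup
        ANNOTATE_NAME_GROUP(GROUP_GPU_PASSES, "Render passes");
        ANNOTATE_NAME_CHANNEL(CHANNEL_RENDER_PASS, GROUP_GPU_PASSES, "Pass timeline");
    }

    // Wrap CPU-side recording/submission of each render pass so pass names show up
    // as coloured regions in Streamline's timeline, next to the counter charts.
    void BeginPassAnnotation(const char* passName)
    {
        ANNOTATE_CHANNEL_COLOR(CHANNEL_RENDER_PASS, ANNOTATE_GREEN, passName);
    }

    void EndPassAnnotation()
    {
        ANNOTATE_CHANNEL_END(CHANNEL_RENDER_PASS);
    }
    ```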

  • Hi Chashao, 

    At the moment there isn't a good way to correlate counters to specific render passes, and it is impossible to correlate with single draw calls (hardware interleaves bits of draw calls, tile by tile). 

    If you are able to share the Streamline capture I'd be happy to provide some initial analysis, but usual candidates for high write stalls are:

    • Simple content with very high pixel throughput, writing complete pixels faster than the memory system can accept them.
    • Vertex processing with simple shaders and a large amount of output data per vertex.
    • Content using MSAA and writing the MSAA data back to memory rather than resolving as part of writeback. 
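
    To illustrate the MSAA point, here is a sketch assuming a Vulkan backend (on OpenGL ES the equivalent is EXT_multisampled_render_to_texture plus glInvalidateFramebuffer); CreateMsaaRenderPass and the format parameter are placeholder names, not your engine's API:

    ```cpp
    // Sketch of a Vulkan render pass that resolves MSAA on-tile instead of
    // writing the multisampled data back to memory.
    #include <vulkan/vulkan.h>

    VkRenderPass CreateMsaaRenderPass(VkDevice device, VkFormat colorFormat)
    {
        // Attachment 0: 4x MSAA colour. STORE_OP_DONT_CARE means the raw
        // multisampled data is never written back to main memory; back it with
        // LAZILY_ALLOCATED / TRANSIENT memory so it only ever lives in tile RAM.
        VkAttachmentDescription attachments[2] = {};
        attachments[0].format         = colorFormat;
        attachments[0].samples        = VK_SAMPLE_COUNT_4_BIT;
        attachments[0].loadOp         = VK_ATTACHMENT_LOAD_OP_CLEAR;
        attachments[0].storeOp        = VK_ATTACHMENT_STORE_OP_DONT_CARE;
        attachments[0].stencilLoadOp  = VK_ATTACHMENT_LOAD_OP_DONT_CARE;
        attachments[0].stencilStoreOp = VK_ATTACHMENT_STORE_OP_DONT_CARE;
        attachments[0].initialLayout  = VK_IMAGE_LAYOUT_UNDEFINED;
        attachments[0].finalLayout    = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL;

        // Attachment 1: single-sample resolve target. Only this 1x image reaches
        // memory, so write traffic is roughly 1/4 of storing the 4x MSAA data.
        attachments[1]               = attachments[0];
        attachments[1].samples       = VK_SAMPLE_COUNT_1_BIT;
        attachments[1].loadOp        = VK_ATTACHMENT_LOAD_OP_DONT_CARE;
        attachments[1].storeOp       = VK_ATTACHMENT_STORE_OP_STORE;
        attachments[1].finalLayout   = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL;

        VkAttachmentReference colorRef   = {0, VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL};
        VkAttachmentReference resolveRef = {1, VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL};

        VkSubpassDescription subpass = {};
        subpass.pipelineBindPoint    = VK_PIPELINE_BIND_POINT_GRAPHICS;
        subpass.colorAttachmentCount = 1;
        subpass.pColorAttachments    = &colorRef;
        subpass.pResolveAttachments  = &resolveRef;   // resolve happens during tile writeback

        VkRenderPassCreateInfo info = {VK_STRUCTURE_TYPE_RENDER_PASS_CREATE_INFO};
        info.attachmentCount = 2;
        info.pAttachments    = attachments;
        info.subpassCount    = 1;
        info.pSubpasses      = &subpass;

        VkRenderPass renderPass = VK_NULL_HANDLE;
        vkCreateRenderPass(device, &info, nullptr, &renderPass);
        return renderPass;
    }
    ```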

    Kind regards, 
    Pete

  • Thanks Peter, this is our game capture: a deferred pipeline using TAA. The GPU is a Mali-G51 MP4. I found fragment utilization is 100%, so I want to reduce some shader effects, and bandwidth may be another problem.

  • Hi Chashao, 

    If it's possible to share the same screenshots with a single frame highlighted using the cross-section marker, so the data bubbles show totals for the frame, it would be great. However, here is some early analysis:

    Scheduling:

    • There looks to be some vertex-fragment serialization once per frame, so possibly worth seeing if you can tweak barriers or fences to get fragment shading running all of the time.
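
    As a sketch of what this looks like with a Vulkan backend, assuming the serialization comes from a barrier between a pass that writes an attachment and a pass that samples it (InsertRenderTargetBarrier is a placeholder name): expressing the dependency with fragment-stage masks, rather than a full-pipeline barrier, lets the consuming pass's vertex and tiling work keep running.

    ```cpp
    // Sketch of a render-target dependency expressed so that only fragment shading
    // of the consuming pass waits on the producing pass.
    #include <vulkan/vulkan.h>

    void InsertRenderTargetBarrier(VkCommandBuffer cmd, VkImage image)
    {
        VkImageMemoryBarrier barrier = {VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER};
        barrier.srcAccessMask       = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT;
        barrier.dstAccessMask       = VK_ACCESS_SHADER_READ_BIT;
        barrier.oldLayout           = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL;
        barrier.newLayout           = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL;
        barrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
        barrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
        barrier.image               = image;
        barrier.subresourceRange    = {VK_IMAGE_ASPECT_COLOR_BIT, 0, 1, 0, 1};

        // Waiting only FRAGMENT_SHADER on COLOR_ATTACHMENT_OUTPUT keeps the next
        // pass's non-fragment (vertex/tiling) work overlapping with this pass.
        // A BOTTOM_OF_PIPE -> TOP_OF_PIPE (or ALL_COMMANDS) barrier here can cause
        // the kind of vertex-fragment serialization described above.
        vkCmdPipelineBarrier(cmd,
                             VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT,
                             VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT,
                             0, 0, nullptr, 0, nullptr, 1, &barrier);
    }
    ```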

    Bandwidth:

    • The spikes in bandwidth look mostly related to geometry (they correlate with the Non-fragment queue active time).

    Geometry:

    • For the main part of your geometry pipeline you have very low visible primitive counts - only 16% of the input primitives end up on screen. The main cause of this is sample culling - triangles that are so small they don't hit any sample points. Dense meshes with small triangles are _very_ expensive in terms of bandwidth, and will also reduce fragment efficiency. I don't see the series for workload properties here, but check the number of partial quads that you are rendering - you want that number to be as low as possible.
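
    As a rough illustration of keeping triangles above sample size, a distance-based LOD pick could look like the sketch below. Everything in it (the names and the 4-pixels-per-triangle target) is an assumption for the example, not a measured recommendation for your content.

    ```cpp
    // Illustrative heuristic: pick the finest mesh LOD whose average projected
    // triangle size stays above a few pixels, to avoid sample-culled micro-triangles.
    #include <cmath>
    #include <cstdint>
    #include <vector>

    struct MeshLod { uint32_t triangleCount; };   // ordered finest first, coarsest last

    uint32_t PickLod(const std::vector<MeshLod>& lods,
                     float boundingRadiusWorld,    // object bounding-sphere radius
                     float distanceToCamera,
                     float viewportHeightPx,
                     float verticalFovRadians)
    {
        // Approximate projected radius of the bounding sphere in pixels.
        float projectedRadiusPx =
            (boundingRadiusWorld / distanceToCamera) *
            (viewportHeightPx / (2.0f * std::tan(verticalFovRadians * 0.5f)));

        // Rough screen area covered by the mesh, in pixels.
        float projectedAreaPx = 3.14159265f * projectedRadiusPx * projectedRadiusPx;

        const float kMinPixelsPerTriangle = 4.0f;  // assumed target; tune per title
        for (uint32_t i = 0; i < lods.size(); ++i) {
            if (projectedAreaPx / lods[i].triangleCount >= kMinPixelsPerTriangle)
                return i;                           // finest LOD that is still coarse enough
        }
        return static_cast<uint32_t>(lods.size()) - 1;  // fall back to the coarsest LOD
    }
    ```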

    Late ZS:

    • You seem to have a relatively high percentage of late ZS kills (25% test and kill). Can't tell what you are doing here, but beware that late ZS can really cause interesting scheduling issues due to dependencies between layers. Try to minimize shaders using discard, alpha-to-coverage, or programmable depth. This is definitely causing some bubbles - the "Fragment FPKB utilization" is dropping down to ~80% a lot of the time. (The other cause of late ZS is depth attachment readbacks when starting a render pass from an existing non-cleared depth buffer, so this might be unavoidable).
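
    If the readback case is avoidable, i.e. the previous depth contents are not actually needed, declaring the depth attachment so it is cleared on load and discarded on store removes both the readback and the depth write-back. A Vulkan-flavoured sketch (MakeDepthAttachment is a placeholder name):

    ```cpp
    // Sketch: depth attachment that never touches main memory - cleared on-tile at
    // the start of the pass and discarded at the end, instead of loaded/stored.
    #include <vulkan/vulkan.h>

    VkAttachmentDescription MakeDepthAttachment(VkFormat depthFormat)
    {
        VkAttachmentDescription depth = {};
        depth.format         = depthFormat;
        depth.samples        = VK_SAMPLE_COUNT_1_BIT;
        depth.loadOp         = VK_ATTACHMENT_LOAD_OP_CLEAR;       // not LOAD: no depth readback
        depth.storeOp        = VK_ATTACHMENT_STORE_OP_DONT_CARE;  // depth stays on-tile
        depth.stencilLoadOp  = VK_ATTACHMENT_LOAD_OP_DONT_CARE;
        depth.stencilStoreOp = VK_ATTACHMENT_STORE_OP_DONT_CARE;
        depth.initialLayout  = VK_IMAGE_LAYOUT_UNDEFINED;
        depth.finalLayout    = VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL;
        return depth;
    }
    ```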

    Shader core load:

    • The major problem seems to be arithmetic complexity in your fragment shaders. You have 81%+ utilization of the execution engine, which is the shader core arithmetic unit, and the next highest load is a long way below that. Reducing shader complexity and making aggressive use of mediump should help here.

    Kind regards, 
    Pete