How to find the shader or render pass that corresponds to write stalls?

I found a high write stall rate over time, but I can't find the corresponding shader or render pass in Streamline. Is there a way to do this?

  • Hi Chashao, 

    It would be great if you could share the same screenshots with a single frame highlighted using the cross-section marker, so that the data bubbles show totals for that frame. In the meantime, here is some early analysis:

    Scheduling:

    • There appears to be some vertex-fragment serialization once per frame, so it may be worth checking whether you can adjust barriers or fences to keep fragment shading running all of the time.

    Bandwidth:

    • The spikes in bandwidth look mostly related to geometry (they correlate with the Non-fragment queue active time).

    Geometry:

    • For the main part of your geometry pipeline you have very low visible primitive counts - only 16% of the input primitives end up on screen. The main cause of this is sample culling - triangles that are so small they don't hit any sample points. Dense meshes with small triangles are _very_ expensive in terms of bandwidth, and will also reduce fragment efficiency. I don't see the series for workload properties here, but check the number of partial quads that you are rendering - you want that number to be as low as possible.
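
As a rough illustration of the two ratios mentioned above, here is a minimal Python sketch. The function names, dictionary keys, and numbers are hypothetical and are not taken from the capture or from Streamline's counter names; substitute the per-frame totals from your own data bubbles.

```python
def visible_primitive_pct(input_primitives: int, visible_primitives: int) -> float:
    """Percentage of input primitives that survive culling and reach the screen."""
    if input_primitives <= 0:
        raise ValueError("input_primitives must be positive")
    return 100.0 * visible_primitives / input_primitives


def partial_quad_pct(total_quads: int, partial_quads: int) -> float:
    """Percentage of rasterized 2x2 quads that are only partially covered."""
    if total_quads <= 0:
        raise ValueError("total_quads must be positive")
    return 100.0 * partial_quads / total_quads


# Illustrative per-frame totals only (assumed values, not measured data).
frame = {
    "input_primitives": 1_000_000,
    "visible_primitives": 160_000,
    "total_quads": 500_000,
    "partial_quads": 200_000,
}

visible = visible_primitive_pct(frame["input_primitives"], frame["visible_primitives"])
partial = partial_quad_pct(frame["total_quads"], frame["partial_quads"])
print(f"visible primitives: {visible:.1f}%")
print(f"partial quads:      {partial:.1f}%")
```

A low visible-primitive percentage combined with a high partial-quad percentage usually points to meshes that are too dense for their on-screen size.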

    Late ZS:

    • You seem to have a relatively high percentage of late ZS kills (25% test and kill). Can't tell what you are doing here, but beware that late ZS can really cause interesting scheduling issues due to dependencies between layers. Try to minimize shaders using discard, alpha-to-coverage, or programmable depth. This is definitely causing some bubbles - the "Fragment FPKB utilization" is dropping down to ~80% a lot of the time. (The other cause of late ZS is depth attachment readbacks when starting a render pass from an existing non-cleared depth buffer, so this might be unavoidable).

    Shader core load:

    • The major problem seems to be arithmetic complexity in your fragment shaders. You have 81%+ utilization of the execution engine, which is the shader core arithmetic unit, and the next highest load is a long way below that. Reducing shader complexity and making aggressive use of mediump precision should help here.
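
The reasoning here is simply "find the busiest unit and see how far ahead of the others it is". A minimal Python sketch of that comparison follows. Only the 81% execution engine figure comes from the analysis above; the other unit names and values are placeholders assumed for illustration.

```python
def dominant_unit(utilization: dict[str, float]) -> tuple[str, float, float]:
    """Return (unit name, its utilization %, margin over the runner-up in points)."""
    if not utilization:
        raise ValueError("utilization must not be empty")
    # Sort units from most to least utilized.
    ranked = sorted(utilization.items(), key=lambda kv: kv[1], reverse=True)
    top_name, top_value = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    return top_name, top_value, top_value - runner_up


# 81.0 is the figure quoted above; the remaining values are assumed placeholders.
shader_core_load = {
    "execution_engine": 81.0,
    "load_store": 35.0,
    "texture": 30.0,
    "varying": 20.0,
}

name, value, margin = dominant_unit(shader_core_load)
print(f"{name} is the critical path at {value:.0f}% ({margin:.0f} points above the next unit)")
```

When the margin is this large, optimizing any other unit is unlikely to change frame time until the arithmetic load comes down.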

    Kind regards, 
    Pete
