Hi, I'm a mobile game developer and I recently tried using 4x MSAA in my game. As far as I know, MSAA is almost "free" on Mali GPUs. I'm using UE 4.27 and built a demo to profile the performance.
The demo uses the forward rendering pipeline (Vulkan), and the scene objects use an unlit material that simply outputs the world-space normal as the pixel color.
The profiling results show that GPU Active increases by about 20%! Fragment queue active also increases by about 20%.
I understand that 4x MSAA sends more primitives to the rasterizer and creates more quads, because there are four sample points within each pixel. What confuses me is that the increase in Fragment Warps and Execution Core Active does not match the increase in GPU Active and Fragment queue active: Fragment Warps increase by about 3%, while Execution Core Active increases by about 30%.
Since all objects in my demo scene use the same simple material, I expected that when the workload (e.g. fragment warps) increases by A%, GPU Active would also increase by around A%, or even less. But that does not seem to be the case according to the profiling results.
Even stranger, after enabling 4x MSAA the utilization of the varying unit and the texture unit decreases!? (More warps but less varying/texturing????)
So, my questions are:
1. Is MSAA not "free" actually? The increase of GPU Active (20% ~ 30%) is expected?
2. Why the growth rates of Fragment Warp/Execution Core Active/GPU Active are different?
3. What's going on with each unit when using MSAA 4x?
Thanks!
yonghao lu said: Is MSAA not "free" actually? The increase of GPU Active (20% ~ 30%) is expected?
Cost is very content dependent - some content is close to free, and some content isn't.
It is more expensive than it used to be on older Mali GPUs because some of the fixed-function logic that is used more by MSAA has not scaled up in performance as much as the programmable shader core. Content that is bound by these fixed function paths (rasterizer, ZS test, blend, etc) can see more impact than content which is bound by shader performance.
yonghao lu said: 2. Why the growth rates of Fragment Warp/Execution Core Active/GPU Active are different?
This is just highlighting where the bottlenecks are occurring. You have a bigger increase in cycles than warps, which indicates that the workload is not limited by warp shader performance, but something outside of the shader core. It's not always easy to tell what unfortunately - we have limited visibility of the fixed-function blocks in the performance counters.
yonghao lu said: 3. What's going on with each unit when using MSAA 4x?
These counters are percentage load counters, so the drop is simply that there is only a little more work (3% more warps) spread over many more cycles (30% more cycles). So overall you are seeing a 27% drop in utilisation. Again, this is just showing that the bottleneck is outside of the programmable core and in one of the fixed-function paths.
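The arithmetic behind this can be sketched in a few lines. This is only an illustration, assuming a percentage load counter behaves like work done divided by active cycles; the function name `utilisation_change` is hypothetical and not part of Streamline. Subtracting the two growth rates (30% minus 3%) gives the rough 27% figure above, while dividing them gives a relative drop closer to 21%. Either way, the unit is doing only a little more work over many more cycles.

```python
def utilisation_change(work_growth: float, cycle_growth: float) -> float:
    """Relative change in utilisation, modelled as work per active cycle.

    Both arguments are fractional growth rates, e.g. 0.03 for +3%.
    This model is an assumption made for illustration only.
    """
    return (1.0 + work_growth) / (1.0 + cycle_growth) - 1.0


# Numbers from this thread: about 3% more warps, about 30% more cycles.
change = utilisation_change(0.03, 0.30)
print(f"{change:.1%}")  # prints -20.8%
```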
What values do you get for the "Fragment FPK buffer utilization" counter?
Hi Peter! Thank you very much for your reply; it's a really inspiring answer!
I agree that the bottleneck may come from a fixed-function unit. According to some Arm documentation, the blend/ZS test unit is responsible for writing data to tile memory. I suspect it may be causing the performance degradation, because it has four times as much data to process when 4x MSAA is enabled. But as you say, we have no way of knowing what is happening inside it right now :(
Anyway, thank you for sharing your perspective on how to analyze the problem. It has been very enlightening for me. This is my first time using Streamline, and it has given me a lot of information about the hardware (better than Snapdragon Profiler, in my opinion :) ).
Oh, the value of "Fragment FPK buffer utilization" is 84.3% -> 85.2%. Do you suspect that the performance degradation is caused by FPK?
yonghao lu said: Do you suspect that the performance degradation is caused by FPK?
No. That metric gives some indication of how well the fixed-function front-end (rasterization, early-zs, etc) is keeping the core fed with fragment quads. If you saw a large drop then that would be indicative that you were seeing a front-end problem. The fact you don't points more towards late-zs or blending being the slow path.
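The reasoning in this thread can be summarised as a rough triage sketch. The function `classify_bottleneck` and both thresholds are hypothetical, chosen only for illustration; they are not Arm guidance, and real captures need to be read in context.

```python
def classify_bottleneck(warp_growth: float, cycle_growth: float,
                        fpk_before: float, fpk_after: float,
                        gap_threshold: float = 0.10,
                        fpk_drop_threshold: float = 5.0) -> str:
    """Guess where the extra MSAA cost comes from.

    warp_growth and cycle_growth are fractional growth rates (0.03 = +3%).
    fpk_before and fpk_after are "Fragment FPK buffer utilization" values
    in percent. Both thresholds are illustrative assumptions.
    """
    # Cycles growing roughly in line with warps: shader work is the limit.
    if cycle_growth - warp_growth < gap_threshold:
        return "shader core"
    # A large drop in FPK utilization suggests the front-end cannot keep
    # the core fed with fragment quads.
    if fpk_before - fpk_after > fpk_drop_threshold:
        return "front-end (rasterizer / early-ZS)"
    # Otherwise the slow path is more likely after the shader core.
    return "back-end (late-ZS / blend)"


# Numbers from this thread: +3% warps, +30% cycles, FPK 84.3% -> 85.2%.
print(classify_bottleneck(0.03, 0.30, 84.3, 85.2))
# prints back-end (late-ZS / blend)
```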
If you are able to share the Streamline capture I'd be happy to take a look. If you can't share this publicly, feel free to get in touch via developer@arm.com.
Kind regards, Pete
Thank you Peter! I really appreciate your help in analysing my problem, and I'm happy to share my Streamline file and a screenshot of my demo scene. I hope it will also help anyone else who is confused by the same question.
Google Drive link (including the Streamline file and screenshot)
My Streamline version is 9.0.0, Build 20240215_172843
Best Wishes,
Yonghao
Hi Yonghao.
I think the problem in this case is likely to be the high percentage of Late ZS test/update that your shaders are triggering. During the last 40% of the frame, you have ~50% of fragment quads using late ZS testing. For MSAA this consumes proportionally more ZS test resource due to the higher sample count that needs testing and, depending on how many quads end up with partial coverage after late-ZS, can cause higher blending load too.
You can see that the shader core functional unit utilisation is particularly poor during this phase when you compare before/after MSAA. The earlier part of the frame isn't so bad.
To avoid late-ZS try to minimize use of alpha-to-coverage, shader-based alpha testing, or shader-based writes to fragment depth.
Hi Peter.
The analysis approach you provided is really helpful. It gives me a new perspective on my rendering pipeline. I need some time to profile my rendering pipeline further as you suggested. I may run into new problems along the way, and I hope to get your suggestions then.
Again, thank you and have a good day :)