Hi! We're currently implementing Vulkan subpasses and have run into really strange behaviour on Mali GPUs, specifically the G76 (Samsung S10) and the G77 (S20FE). The Samsung S10 is running Android 12. In short, it looks like the driver is not merging subpasses.
The render pass in question consists of two subpasses. In the first we output something similar to a G-buffer, including depth; in the second we read that data back using input attachments.
We first noticed that subpasses on Mali gave us no performance improvement, and in the case of the Note 8 Pro, a noticeable performance degradation. When we looked at AGI captures, AGI showed two different render passes with the same VkRenderPass handle, which suggested that the driver did not merge the subpasses.
Next, we tried to reproduce the issue using the samples from the following repositories, and observed the same behaviour:
https://github.com/KhronosGroup/Vulkan-Samples
https://github.com/SaschaWillems/Vulkan
With the Vulkan Samples repo on the Samsung S10, switching between Subpasses and Render Passes did not change the Tile Count or the system memory accesses. When we ran Vulkan Samples on a Huawei Nova 5T (A10, Mali-G76 MP10), switching from Render Passes to Subpasses yielded a 2x decrease in Tile Count and in system memory reads/writes. As for the G77, AGI also shows our new merged pass with two subpasses as two render passes.
The S10 result is especially surprising, as the Vulkan Samples page on Subpasses (https://github.com/KhronosGroup/Vulkan-Samples/tree/main/samples/performance/subpasses) mentions this exact phone and shows the expected tile usage improvements.
As those samples exhibit the same issue as our client code, is there anything wrong, or potentially wrong, that might cause the driver not to merge the subpasses? And what should correctly merged subpasses look like in AGI?
About the slowdown on the S20 with the input attachment sample: having now looked at the code, a theory for what happens is:
As you can see, in the first two cases (no fusion) we end up with quite a lot of off-chip (DRAM) traffic in order to write out the 'intermediate' images, whereas when the subpasses are fused we avoid this by keeping the data on-chip, and the only thing we need to write to memory is the final swapchain image. The difference between the first two cases is that in the second case we need a readback of the depth buffer in the second fragment job (because one is specified), and this extra work and bandwidth may explain the slowdown.
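To put rough numbers on the reasoning above, here is a minimal back-of-the-envelope sketch. Everything in it is an assumption made for illustration: the names (`FrameConfig`, `Mode`, `estimate_traffic`) are hypothetical, the byte sizes are placeholders rather than the formats used by the sample, and the model ignores framebuffer compression, caches and geometry traffic. It simply assumes that each intermediate image is written once and read once when the passes are not fused, and that the depth readback costs one extra read of the depth buffer.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical description of one frame; all sizes are in bytes per pixel.
struct FrameConfig {
    std::uint64_t width;
    std::uint64_t height;
    std::uint64_t gbuffer_color_bytes_per_pixel;  // sum over all intermediate colour images
    std::uint64_t depth_bytes_per_pixel;
    std::uint64_t swapchain_bytes_per_pixel;
};

// Estimated off-chip traffic for one frame.
struct Traffic {
    std::uint64_t read_bytes;
    std::uint64_t write_bytes;
    std::uint64_t total() const { return read_bytes + write_bytes; }
};

// The three cases discussed above.
enum class Mode {
    TwoPasses,               // no fusion, no depth attachment on the second pass
    TwoPassesDepthReadback,  // no fusion, depth buffer read back in the second fragment job
    Fused                    // subpasses merged, intermediates stay on-chip
};

Traffic estimate_traffic(const FrameConfig& cfg, Mode mode) {
    assert(cfg.width > 0 && cfg.height > 0);
    const std::uint64_t pixels = cfg.width * cfg.height;
    const std::uint64_t intermediates =
        cfg.gbuffer_color_bytes_per_pixel + cfg.depth_bytes_per_pixel;

    // The final swapchain image is written to memory in every case.
    Traffic t{0, pixels * cfg.swapchain_bytes_per_pixel};
    if (mode == Mode::Fused) {
        return t;  // nothing else leaves the tile memory in this simplified model
    }

    // Without fusion the intermediates are written out, then read by the second pass.
    t.write_bytes += pixels * intermediates;
    t.read_bytes += pixels * intermediates;

    // Assumption: the depth readback adds one more full read of the depth buffer.
    if (mode == Mode::TwoPassesDepthReadback) {
        t.read_bytes += pixels * cfg.depth_bytes_per_pixel;
    }
    return t;
}
```

For a made-up 1920x1080 frame with 12 bytes per pixel of G-buffer colour and 4 bytes per pixel each for depth and swapchain, `estimate_traffic(cfg, mode).total()` comes out at roughly 75 MB, 83 MB and 8 MB per frame for the three modes. The absolute values mean little, but the ordering matches the argument: the depth readback case is the most expensive and the fused case is by far the cheapest.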
With this in mind, it could be that the slowdown you see here is not related to fusion working or not working, and that fusion in fact never happens on these devices. If so, it would explain why subpasses make no difference on the S10 and S20, while they do make a difference on the Nova 5T.
If this sounds good so far, I guess the only question remaining is why you see a noticeable slowdown on the Note 8 in your original case. A theory is that fusion is in fact working here, which would explain why there is *a* difference, but of course we'd expect an improvement, not a slowdown.
Are you able to do any profiling on the Note 8 to try to get some information out of it about what might be happening on this device?
Regarding the Note 8 Pro, below are the relevant metrics captured from our bench device before (left) and after (right). It does look like memory traffic and the number of tiles modified are both greatly reduced.
One thing to try here is to zoom in (use the 1ms option) and select a single frame using the range selector. If you are vsync limited there is usually a nice gap between frames, which makes this fairly easy. If not, look for patterns in the workload, e.g. when the bulk of the VS shading happens (this is usually the start of a frame).
This way you can directly compare the workload for a single frame, as opposed to the workload inside some arbitrary period of time. I find this useful when comparing performance between two cases.
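If you export the counter data and want to do the same frame selection offline, the gap-based approach could be sketched roughly as below. This is only an illustration under stated assumptions: `Sample`, `FrameRange`, `split_frames` and `sum_in_range` are hypothetical names, the samples are assumed to be sorted by time, and nothing here corresponds to an actual AGI or Streamline export format.

```cpp
#include <cassert>
#include <vector>

// One counter sample from an exported trace (hypothetical layout).
struct Sample {
    double time_ms;  // timestamp of the sample
    double value;    // counter value, e.g. bytes written during this sample
};

// Time range covered by one detected frame, both ends inclusive.
struct FrameRange {
    double start_ms;
    double end_ms;
};

// Splits a time-sorted trace into frames. A sample counts as busy when its
// value is above idle_threshold; an idle stretch of at least min_gap_ms
// between two busy samples is treated as a frame boundary.
std::vector<FrameRange> split_frames(const std::vector<Sample>& samples,
                                     double idle_threshold,
                                     double min_gap_ms) {
    assert(min_gap_ms > 0.0);
    std::vector<FrameRange> frames;
    bool open = false;
    double start = 0.0;
    double last_busy = 0.0;
    for (const Sample& s : samples) {
        if (s.value <= idle_threshold) {
            continue;  // idle sample, may be part of a gap between frames
        }
        if (open && s.time_ms - last_busy >= min_gap_ms) {
            frames.push_back({start, last_busy});  // gap found, close the frame
            open = false;
        }
        if (!open) {
            start = s.time_ms;
            open = true;
        }
        last_busy = s.time_ms;
    }
    if (open) {
        frames.push_back({start, last_busy});  // close the trailing frame
    }
    return frames;
}

// Sums the counter over one frame so two captures can be compared per frame.
double sum_in_range(const std::vector<Sample>& samples, const FrameRange& r) {
    double total = 0.0;
    for (const Sample& s : samples) {
        if (s.time_ms >= r.start_ms && s.time_ms <= r.end_ms) {
            total += s.value;
        }
    }
    return total;
}
```

Calling `split_frames` on each capture and then `sum_in_range` on one frame from each gives per-frame totals that can be compared directly, instead of totals over an arbitrary time window. The gap threshold has to be tuned by hand, and if the content is not vsync limited there may be no usable gaps at all.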