Hi! We're currently working on implementing subpasses for Vulkan and encountered really strange behaviour on Mali GPUs, specifically G76 (Samsung S10), G77 (S20FE). Samsung S10 is running Android 12. In short, it looks like the driver is not merging subpasses.
The render pass in question consisted of two subpasses. We first output something similar to G-Buffer, including depth, then read the data using input attachments.
We first noticed that subpasses on Mali did not give us performance improvement, or in case of Note 8 Pro, noticeable performance degradation. When we looked at AGI captures, the AGI showed two different render passes with the same VkRenderPass handle, which suggested that driver did not merge subpasses.
Next, we tried to reproduce the issue using the following examples, and observed the same behaviour.
https://github.com/KhronosGroup/Vulkan-Samples
https://github.com/SaschaWillems/Vulkan
In case of Vulkan Samples repo, on Samsung S10, switching between Subpasses and Render Passes did not change Tile Count or system memory accesses. When we tried running Vulkan Samples on Huawei Nova 5T (A10, Mali-G76 MP10), switching from Render Passes to Subpasses yields 2x decrease in Tile Count and system memory reads/writes. As for G77, it also shows our new merged pass with two subpasses as two render passes.
In case of S10 it's especially surprising, as Vulkan Samples page on Subpasses (https://github.com/KhronosGroup/Vulkan-Samples/tree/main/samples/performance/subpasses) mentions this exact phone and shows expected tile usage improvements.
As those samples exhibit the same issues as our client code, is there anything wrong or potentially wrong that may hint the driver to not merge the subpasses? And how should correctly merged subpasses look in AGI?
One thing to try here is to zoom in (use the 1ms option) and try to select 1 frame using the range selector. Usually there is a nice gap between frames, if vsync limited, so this should be fairly easy if so. If not look for patterns in the workload, like e.g. when the bulk of the VS shading happens (this is usually the start of a frame).
This way you can directly compare the workload for a single frame, as opposed to workload inside some specific period of time. I find this useful when comparing performance between two cases.