Hi! We're currently working on implementing subpasses for Vulkan and have encountered some really strange behaviour on Mali GPUs, specifically the G76 (Samsung S10) and the G77 (Samsung S20 FE). The S10 is running Android 12. In short, it looks like the driver is not merging subpasses.
The render pass in question consisted of two subpasses: we first output something similar to a G-Buffer, including depth, then read the data back using input attachments.
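Schematically, the setup is along these lines (a heavily simplified sketch rather than our exact code; attachment indices, formats, and layouts are illustrative):

```cpp
#include <vulkan/vulkan.h>

// Subpass 0 writes two G-Buffer colour attachments plus depth;
// subpass 1 reads all three back as input attachments and shades into the swapchain image.
VkAttachmentReference gbufferWrites[] = {
    {0, VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL},
    {1, VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL},
};
VkAttachmentReference depthWrite = {2, VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL};

VkAttachmentReference gbufferReads[] = {
    {0, VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL},
    {1, VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL},
    {2, VK_IMAGE_LAYOUT_DEPTH_STENCIL_READ_ONLY_OPTIMAL},
};
VkAttachmentReference swapchainWrite = {3, VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL};

VkSubpassDescription subpasses[2] = {
    // Subpass 0: fill the G-Buffer
    {0, VK_PIPELINE_BIND_POINT_GRAPHICS,
     0, nullptr,                  // no input attachments
     2, gbufferWrites, nullptr,   // two colour outputs, no resolve
     &depthWrite,
     0, nullptr},                 // no preserve attachments
    // Subpass 1: consume the G-Buffer via input attachments
    {0, VK_PIPELINE_BIND_POINT_GRAPHICS,
     3, gbufferReads,
     1, &swapchainWrite, nullptr,
     nullptr,
     0, nullptr},
};

// Per-region dependency between the subpasses: VK_DEPENDENCY_BY_REGION_BIT tells the
// driver that subpass 1 only reads the current pixel, which is what allows a tiler
// to keep the data in the tile buffer instead of splitting into two passes.
VkSubpassDependency dependency = {
    0, 1,
    VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT | VK_PIPELINE_STAGE_LATE_FRAGMENT_TESTS_BIT,
    VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT,
    VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT | VK_ACCESS_DEPTH_STENCIL_ATTACHMENT_WRITE_BIT,
    VK_ACCESS_INPUT_ATTACHMENT_READ_BIT,
    VK_DEPENDENCY_BY_REGION_BIT,
};
```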
We first noticed that subpasses on Mali did not give us a performance improvement, or, in the case of the Note 8 Pro, caused noticeable performance degradation. When we looked at AGI captures, AGI showed two different render passes with the same VkRenderPass handle, which suggested that the driver did not merge the subpasses.
Next, we tried to reproduce the issue using the following examples, and observed the same behaviour.
https://github.com/KhronosGroup/Vulkan-Samples
https://github.com/SaschaWillems/Vulkan
In the case of the Vulkan Samples repo, on the Samsung S10, switching between Subpasses and Render Passes did not change the Tile Count or system memory accesses. When we tried running Vulkan Samples on a Huawei Nova 5T (Android 10, Mali-G76 MP10), switching from Render Passes to Subpasses yielded a 2x decrease in Tile Count and in system memory reads/writes. As for the G77, AGI likewise shows our merged pass with two subpasses as two render passes.
The S10 case is especially surprising, as the Vulkan Samples page on subpasses (https://github.com/KhronosGroup/Vulkan-Samples/tree/main/samples/performance/subpasses) mentions this exact phone and shows the expected tile-usage improvements.
As those samples exhibit the same issues as our client code: is there anything wrong, or potentially wrong, in our setup that may hint the driver not to merge the subpasses? And how should correctly merged subpasses look in AGI?
I can try to explain some of the things I had in mind, though it should be said that it's possible I misunderstood some things, and a lot of this is semantics anyway :)
> 1. For base resolution X (lower than native), MSAA 4X would make for 4X system memory cost and 4X bandwidth cost when subpasses are not merged.
Yep, though as mentioned, because AFBC only applies to non-MSAA images, this can easily be e.g. 8x or more instead of 4x.
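As a back-of-the-envelope sketch (the 2:1 AFBC compression ratio below is just an assumed round number; real compression ratios are content-dependent):

```cpp
// Rough arithmetic only; the 2:1 AFBC ratio is an assumption for illustration.
constexpr double rawBytesPerPixel = 4.0;  // e.g. RGBA8
constexpr double afbcRatio        = 0.5;  // assumed ~2:1 compression on non-MSAA images

constexpr double nonMsaaBw = rawBytesPerPixel * afbcRatio;  // 1 sample/pixel, AFBC applies
constexpr double msaa4xBw  = rawBytesPerPixel * 4.0;        // 4 samples/pixel, no AFBC

static_assert(msaa4xBw / nonMsaaBw == 8.0, "4x the samples plus losing AFBC -> ~8x");
```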
> 2. If we instead make our new resolution something like 1.2X, we still get improved visual fidelity at the cost of 1.2X bandwidth and 1.2X raster cost.
Here there are some details depending on what you compare. Comparing 0x MSAA at 1.0x resolution vs 0x MSAA at 1.2x resolution, that's fair. But if comparing e.g. 1.0x resolution w/ 4x MSAA with 1.2x resolution w/ 0x MSAA, then keep in mind that MSAA *also* comes with some extra rasterization cost -- so in this sense the trade-off is not entirely obvious. And because MSAA only adds quality at the edges, whereas 1.2x would give a more general improvement, there are some details to the quality comparison as well.
> 3. [...] In other words, if MSAA 4X causes 4x tile count, if we instead make it 1.2x tile count, we'll spend 1.2x more time rasterizing fragments, but 3.3x time less on loading and storing tiles.
For this one, it's maybe mostly a semantic argument, but here I'd say the pixel/tile count will be the same with both 0x and 4x MSAA -- while increasing resolution to 1.2x, on the other hand, would cause an increase in pixel/tile count. *However*, 4x MSAA has (a) an increased rasterization cost compared to 0x MSAA (because of the additional samples per pixel), and (b) an increased BW cost, since the storage per pixel is higher as well. So when comparing bandwidth there is a huge difference between 0x and 4x in this case, and here it's definitely the case that using e.g. 1.2x resolution instead of 4x MSAA should be a very nice win, both in BW and in time spent on tile loads/stores.
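To make that concrete with some made-up numbers (the resolution, 16x16 tiles, and 8 bytes per sample are all illustrative assumptions):

```cpp
// Made-up numbers purely for illustration.
constexpr long long w = 2280, h = 1080;              // hypothetical 1.0x resolution
constexpr long long bytesPerSample = 8;              // e.g. RGBA8 colour + D24S8 depth

constexpr long long tiles1_0x = ((w + 15) / 16) * ((h + 15) / 16);  // same for 0x and 4x MSAA
constexpr long long tiles1_2x = (tiles1_0x * 12) / 10;              // ~1.2x more tiles

constexpr long long bytes0xMsaa = w * h * 1 * bytesPerSample;       // 1 sample per pixel
constexpr long long bytes4xMsaa = w * h * 4 * bytesPerSample;       // same tile count, 4x storage
constexpr long long bytes1_2x   = ((w * h * 12) / 10) * bytesPerSample;

// bytes4xMsaa / bytes1_2x ≈ 3.3: far fewer bytes to load/store at 1.2x resolution,
// even though the tile *count* there is slightly higher.
```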
> 4. Assuming that moving data over system memory bus produces more heat than rasterization over time (one SoC vendor did hint about that), rasterizing 1.2x fragments would produce *much* less heat than transferring 4x tiles over system bus.
Here it's probably also mostly a semantic argument, but 1.2x resolution would indeed mean more tiles than the 1.0x w/ 4x MSAA case, while the 4x MSAA case would mean more bytes per pixel/tile -- and thus more total BW. So yes, this is effectively mostly a BW vs pixel-processing trade-off -- but strictly speaking, the rasterization cost specifically may actually be higher for the 4x MSAA case (fewer pixels, but 4x samples per pixel), while e.g. the fragment shading cost might be higher for the 1.2x resolution case (1.2x more pixels, but now only 1 sample per pixel to rasterize).
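Or, splitting the two costs apart with the same made-up numbers (and assuming the common case where MSAA shades once per pixel rather than once per sample):

```cpp
// Same made-up numbers; separating rasterization work from fragment-shading work.
constexpr long long w = 2280, h = 1080;              // hypothetical 1.0x resolution

// Rasterization (coverage/depth testing happens per sample):
constexpr long long samples4xMsaa = w * h * 4;          // higher for 4x MSAA
constexpr long long samples1_2x   = (w * h * 12) / 10;  // only ~1.2x samples

// Fragment shading (MSAA typically shades once per pixel, not per sample):
constexpr long long frags4xMsaa = w * h;                // 1.0x invocations
constexpr long long frags1_2x   = (w * h * 12) / 10;    // ~1.2x invocations
```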
In short, mostly nit-picking on some of the specific language (especially rasterization cost vs pixel shading cost, and tile count vs the storage needed for a given tile), but it seems to me your thinking is very much on point :)
> As for MSAA, did I understand you correctly that, when subpasses are not merged, input attachments essentially become texel fetches from system memory?
Yep, that's correct.
> And when subpasses are not merged, does specifying depth both as a depth attachment and as an input attachment cause additional depth tile loads?
It's primarily because of the first point, really. Any time you have a depth/stencil attachment specified, unless the attachment is LOAD_OP_DONT_CARE/LOAD_OP_CLEAR, we'll usually need to read it from memory into the tile buffer. This also holds for non-fused subpasses, as in this case the subpass effectively becomes a render pass with LOAD_OP_LOAD for each non-input attachment.
A useful mental model here might be: any time we do not fuse, each subpass becomes a render pass with the load/store ops defined by previous/upcoming usage in other subpasses (or by the render-pass attachment specification if there is no previous/upcoming use in other subpasses). In this case, the depth buffer was written to in the previous subpass and is also used in the current subpass, so it will be STORE_OP_STORE-ed at the end of the first subpass and LOAD_OP_LOAD-ed at the beginning of this one.
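In code terms, for the depth attachment in your case, the unfused behaviour is roughly as if it had been declared like this across two separate render passes (a conceptual sketch only, not actual driver code; the depth format is just an example):

```cpp
#include <vulkan/vulkan.h>

// Conceptual sketch of the "effective" attachment descriptions when the two
// subpasses run as two separate render passes.

// First pass: depth written here and still needed later -> forced STORE_OP_STORE.
VkAttachmentDescription depthPass0 = {
    0,
    VK_FORMAT_D32_SFLOAT,                 // example depth format
    VK_SAMPLE_COUNT_1_BIT,
    VK_ATTACHMENT_LOAD_OP_CLEAR,          // from your render-pass spec
    VK_ATTACHMENT_STORE_OP_STORE,         // forced: the next pass reads it
    VK_ATTACHMENT_LOAD_OP_DONT_CARE,      // stencil unused in this example
    VK_ATTACHMENT_STORE_OP_DONT_CARE,
    VK_IMAGE_LAYOUT_UNDEFINED,
    VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL,
};

// Second pass: depth produced by the previous pass -> forced LOAD_OP_LOAD,
// i.e. an extra round-trip through system memory compared to the fused case.
VkAttachmentDescription depthPass1 = {
    0,
    VK_FORMAT_D32_SFLOAT,
    VK_SAMPLE_COUNT_1_BIT,
    VK_ATTACHMENT_LOAD_OP_LOAD,           // forced: written by the first pass
    VK_ATTACHMENT_STORE_OP_DONT_CARE,     // from your render-pass spec
    VK_ATTACHMENT_LOAD_OP_DONT_CARE,
    VK_ATTACHMENT_STORE_OP_DONT_CARE,
    VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL,
    VK_IMAGE_LAYOUT_DEPTH_STENCIL_READ_ONLY_OPTIMAL,
};
```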
Hope this helps :)