Hi! We're currently working on implementing subpasses for Vulkan and have encountered some really strange behaviour on Mali GPUs, specifically the G76 (Samsung S10, running Android 12) and the G77 (S20 FE). In short, it looks like the driver is not merging subpasses.
The render pass in question consists of two subpasses: we first output something similar to a G-Buffer, including depth, then read the data back using input attachments.
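For reference, here's a minimal sketch of the kind of render pass structure we're describing (the attachment formats, counts, and indices are illustrative assumptions, not our exact setup):

```cpp
#include <vulkan/vulkan.h>

// Attachment indices (illustrative): 0 = swapchain, 1 = albedo,
// 2 = normals, 3 = depth.

// Subpass 0 writes the G-Buffer-like data plus depth.
VkAttachmentReference gbufferWrite[] = {
    {1, VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL},
    {2, VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL},
};
VkAttachmentReference depthWrite = {3, VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL};

// Subpass 1 reads everything back as input attachments and writes the
// final color output.
VkAttachmentReference gbufferRead[] = {
    {1, VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL},
    {2, VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL},
    {3, VK_IMAGE_LAYOUT_DEPTH_STENCIL_READ_ONLY_OPTIMAL},
};
VkAttachmentReference swapchainWrite = {0, VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL};

VkSubpassDescription subpasses[2] = {};
subpasses[0].pipelineBindPoint       = VK_PIPELINE_BIND_POINT_GRAPHICS;
subpasses[0].colorAttachmentCount    = 2;
subpasses[0].pColorAttachments       = gbufferWrite;
subpasses[0].pDepthStencilAttachment = &depthWrite;

subpasses[1].pipelineBindPoint    = VK_PIPELINE_BIND_POINT_GRAPHICS;
subpasses[1].inputAttachmentCount = 3;
subpasses[1].pInputAttachments    = gbufferRead;
subpasses[1].colorAttachmentCount = 1;
subpasses[1].pColorAttachments    = &swapchainWrite;

// Per-region dependency between the subpasses, so that a tiler can in
// principle keep the intermediate data on-chip.
VkSubpassDependency dependency = {};
dependency.srcSubpass      = 0;
dependency.dstSubpass      = 1;
dependency.srcStageMask    = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT |
                             VK_PIPELINE_STAGE_LATE_FRAGMENT_TESTS_BIT;
dependency.dstStageMask    = VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT;
dependency.srcAccessMask   = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT |
                             VK_ACCESS_DEPTH_STENCIL_ATTACHMENT_WRITE_BIT;
dependency.dstAccessMask   = VK_ACCESS_INPUT_ATTACHMENT_READ_BIT;
dependency.dependencyFlags = VK_DEPENDENCY_BY_REGION_BIT;
```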
We first noticed that subpasses on Mali did not give us a performance improvement, or, in the case of the Note 8 Pro, caused a noticeable performance degradation. When we looked at AGI captures, AGI showed two different render passes with the same VkRenderPass handle, which suggested that the driver did not merge the subpasses.
Next, we tried to reproduce the issue using the following examples, and observed the same behaviour.
https://github.com/KhronosGroup/Vulkan-Samples
https://github.com/SaschaWillems/Vulkan
In the case of the Vulkan Samples repo on the Samsung S10, switching between Subpasses and Render Passes did not change the Tile Count or system memory accesses. When we ran the Vulkan Samples on a Huawei Nova 5T (Android 10, Mali-G76 MP10), switching from Render Passes to Subpasses yielded a 2x decrease in Tile Count and in system memory reads/writes. As for the G77, AGI also shows our new merged pass with two subpasses as two render passes.
The S10 case is especially surprising, as the Vulkan Samples page on subpasses (https://github.com/KhronosGroup/Vulkan-Samples/tree/main/samples/performance/subpasses) mentions this exact phone and shows the expected tile usage improvements.
As those samples exhibit the same issues as our client code, is there anything wrong, or potentially wrong, in our setup that may hint the driver not to merge the subpasses? And how should correctly merged subpasses look in AGI?
We're able to reproduce the subpass sample not showing any difference between render-passes and sub-passes on a stock Galaxy S20 with an r38p1 driver. It works as expected on our stock driver, however -- so our best guess is that this is caused by a driver modification by the device vendor.
For the input attachment sample, on a G710 device I can see the number of tiles is reduced after making your modifications, indicating subpass fusion is now happening (left side is after the modifications, right side is the default code):
Notice there's no real performance change in this case (if anything the modified code is slightly faster) and we can also see that bandwidth is now significantly reduced:
About the slowdown on the S20 with the input attachment sample: having now looked at the code, a theory for what happens is:
As you can see, in the first two cases (no fusion) we end up with quite a lot of off-chip/DRAM traffic in order to write out the 'intermediate' images. When fusing the subpasses we avoid this by being able to keep the data on-chip, and the only thing we need to write to memory is the final swapchain image. The difference between the first two cases is that in the second case we need a readback of the depth buffer in the second fragment job (because there is one specified) -- and this extra work / BW may explain the slowdown.
With this in mind, it could be that the slowdown you see here is not related to fusion working / not working, and that fusion in fact never happens on these devices. If so, that would explain the lack of any difference with/without subpasses on the S10 and S20, and the fact that there *is* a difference on the Nova 5T.
If this sounds good so far, I guess the only question remaining is why you see a noticeable slowdown on the Note 8 in your original case. A theory is that fusion is in fact working here, which would explain why there is *a* difference -- but of course we'd expect an improvement, not a slowdown.
Are you able to do any profiling on the Note 8 to try to get some information out of it about what might be happening on this device?
Regarding the Note 8 Pro, below are the relevant before (left) and after (right) metrics captured from our bench device. It does look like memory traffic and the number of tiles modified are greatly reduced.
Thanks! It's both good news and bad news. Considering what you said before, that subpass merging might reduce raw performance in favour of sustained thermal performance, I can see why some vendors might be interested in modifying the driver to disable subpass merging.
As for subpass merging not working, do you think it would make more sense for us to trade MSAA for a slight resolution increase? The theory is that:
1. For base resolution X (lower than native), MSAA 4X would make for 4X system memory cost and 4X bandwidth cost when subpasses are not merged.
2. If we instead make our new resolution something like 1.2X, we still get improved visual fidelity at the cost of 1.2X bandwidth and 1.2X raster cost.
3. Assuming ALU performance scales better than system bus throughput generation-over-generation, there's a chance that we'll exchange tile load/store time for the time it takes to rasterize more fragments. In other words, if MSAA 4X causes 4x tile count, and we instead make it 1.2x tile count, we'll spend 1.2x more time rasterizing fragments, but 3.3x less time on loading and storing tiles (rough numbers are sketched after this list).
4. Assuming that moving data over system memory bus produces more heat than rasterization over time (one SoC vendor did hint about that), rasterizing 1.2x fragments would produce *much* less heat than transferring 4x tiles over system bus.
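To make point 3 concrete, here's a rough back-of-envelope sketch; the resolution and bytes-per-pixel figures are made-up assumptions purely for illustration, and "1.2x" is treated as a pixel-count scale to match the numbers above:

```cpp
#include <cstdio>

int main() {
    const double basePixels     = 1920.0 * 1080.0;  // hypothetical base resolution X
    const double bytesPerSample = 12.0;             // assumed G-Buffer storage per sample

    // No fusion + 4x MSAA: every intermediate attachment is written out
    // (and read back by the next pass) at 4 samples per pixel.
    const double msaaBytes = basePixels * 4.0 * bytesPerSample;

    // No fusion + no MSAA, at 1.2x the pixel count.
    const double scaledBytes = basePixels * 1.2 * bytesPerSample;

    std::printf("4x MSAA writeout:  %.1f MB/frame\n", msaaBytes / 1e6);   // ~99.5 MB
    std::printf("1.2x res writeout: %.1f MB/frame\n", scaledBytes / 1e6); // ~29.9 MB
    std::printf("ratio:             %.2fx\n", msaaBytes / scaledBytes);   // ~3.33x
    return 0;
}
```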
One additional thing to note here is that MSAA write-out cannot be framebuffer-compressed on our current GPUs, but non-MSAA can. So comparing 4x MSAA writeout vs 0x MSAA writeout the BW difference can easily be 8x or more in practice, given AFBC usually gives 2:1 compression ratios (and much better for solid-color tiles).
So, in practice, knowing some vendors may disable fusion, it seems difficult to recommend using MSAA in combination with subpasses at all, as the risk and performance consequence of MSAA writeouts is decidedly non-trivial...
Your general thinking there makes sense to me -- I might nitpick some of the details, but the general direction seems reasonable.
One thing to try here is to zoom in (use the 1ms option) and try to select one frame using the range selector. Usually there is a nice gap between frames if vsync-limited, so this should be fairly easy. If not, look for patterns in the workload, e.g. when the bulk of the VS shading happens (this is usually the start of a frame).
This way you can directly compare the workload for a single frame, as opposed to workload inside some specific period of time. I find this useful when comparing performance between two cases.
Thanks! If possible, can you please point out the details that I got wrong?
As for MSAA, did I understand you correctly, that when subpasses are not merged, input attachments essentially become texel fetches from system memory? And when subpasses are not merged, specifying depth both as depth attachment and as input attachment causes additional depth tile loads?
As for the range selector in Streamline, great point! When I zoomed in at 1ms, I could see that Bus Beats/Core Tiles with subpasses do have fewer spikes.
I can try to explain some of the things I had in mind, though it should be said that it's possible I misunderstood some things, and a lot of this is semantics anyway :)
> 1. For base resolution X (lower than native), MSAA 4X would make for 4X system memory cost and 4X bandwidth cost when subpasses are not merged.
Yep, though as mentioned, because AFBC only applies to non-MSAA images, this can easily be e.g. 8x or more instead of 4x.
> 2. If we instead make our new resolution something like 1.2X, we still get improved visual fidelity at the cost of 1.2X bandwidth and 1.2X raster cost.
Here there are some details depending on what you compare. Comparing 0x MSAA at 1.0x resolution vs 0x MSAA at 1.2x resolution, that's fair. But if comparing e.g. 1.0x resolution w/ 4x MSAA against 1.2x resolution w/ 0x MSAA, then keep in mind MSAA *also* comes with some extra rasterization cost -- so in this sense the trade-off is not entirely obvious. And because MSAA only adds quality at the edges, whereas 1.2x resolution gives a more general improvement, there are some details to the quality comparison as well.
> 3. [...] In other words, if MSAA 4X causes 4x tile count, and we instead make it 1.2x tile count, we'll spend 1.2x more time rasterizing fragments, but 3.3x less time on loading and storing tiles.
For this one it's maybe mostly a semantic argument, but here I'd say the pixel/tile count will be the same with both 0x and 4x MSAA -- while increasing resolution to 1.2x, on the other hand, would cause an increase in pixel/tile count. *However*, 4x MSAA has (a) an increased rasterization cost compared to 0x MSAA (because of the additional samples per pixel), and (b) an increased BW cost, since the storage per pixel is higher as well. So when comparing bandwidth there is a huge difference between 0x and 4x in this case, and here it's definitely true that using e.g. 1.2x resolution instead of 4x MSAA should be a very nice win in both BW and tile load/store time.
> 4. Assuming that moving data over system memory bus produces more heat than rasterization over time (one SoC vendor did hint about that), rasterizing 1.2x fragments would produce *much* less heat than transferring 4x tiles over system bus.
Here it's probably also mostly a semantic argument: 1.2x resolution would mean more tiles than the 1.0x w/ 4x MSAA case, but indeed the 4x MSAA case would use more bytes per pixel/tile -- thus more total BW. So yes, this is effectively mostly a BW vs pixel-processing trade-off -- but strictly speaking the rasterization cost specifically may actually be higher for the 4x MSAA case (fewer pixels, but 4x samples per pixel), while e.g. the fragment shading cost might be higher for the 1.2x resolution case (1.2x more pixels, but now only 1 sample per pixel to rasterize).
In short, I'm mostly nit-picking some of the specific language (especially rasterization cost vs pixel shading cost, and tile count vs the storage needed for a given tile), but it seems to me your thinking is very much on point :)
> As for MSAA, did I understand you correctly, that when subpasses are not merged, input attachments essentially become texel fetches from system memory?
Yep, that's correct.
> And when subpasses are not merged, specifying depth both as depth attachment and as input attachment causes additional depth tile loads?
It's primarily because of the first point, really. Any time you have a depth/stencil attachment specified, unless the attachment is LOAD_OP_DONT_CARE/LOAD_OP_CLEAR, we'll usually need to read it from memory into the tile buffer. This also holds for non-fused subpasses, as in this case the sub-pass effectively becomes a render-pass with LOAD_OP_LOAD for each non-input attachment.
A useful mental model here might be: any time we do not fuse, each sub-pass becomes a render-pass with the load/storeOps defined by previous/upcoming usage in other subpasses (or by the render-pass attachment specification if there is no previous/upcoming use in other sub-passes). In this case the depth buffer was written to in the previous subpass and is also used in the current subpass, so it will be STORE_OP_STORE-ed at the end of the first subpass and LOAD_OP_LOAD-ed at the beginning of this one.
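To make that mental model concrete, here's a hedged sketch of the effective load/storeOps for the depth attachment; this is not observable driver behaviour or a real API transformation, just an illustration with assumed values:

```cpp
#include <vulkan/vulkan.h>

// The depth attachment as the application declares it for the whole
// render pass (illustrative values):
VkAttachmentDescription depthAsDeclared = {
    0,                                    // flags
    VK_FORMAT_D32_SFLOAT,                 // format
    VK_SAMPLE_COUNT_1_BIT,                // samples
    VK_ATTACHMENT_LOAD_OP_CLEAR,          // loadOp
    VK_ATTACHMENT_STORE_OP_DONT_CARE,     // storeOp: not needed after the pass
    VK_ATTACHMENT_LOAD_OP_DONT_CARE,      // stencilLoadOp
    VK_ATTACHMENT_STORE_OP_DONT_CARE,     // stencilStoreOp
    VK_IMAGE_LAYOUT_UNDEFINED,            // initialLayout
    VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL, // finalLayout
};

// If the two subpasses are NOT fused, the driver effectively runs two
// render passes, with the ops at the split point derived from usage:

// Effective "render pass A" (was subpass 0): depth is written here and
// used later, so it must be written out to memory despite the app's
// STORE_OP_DONT_CARE.
VkAttachmentDescription depthPassA = depthAsDeclared;
depthPassA.storeOp = VK_ATTACHMENT_STORE_OP_STORE;

// Effective "render pass B" (was subpass 1): depth was produced earlier,
// so it must be read back in -- the extra tile loads discussed above.
VkAttachmentDescription depthPassB = depthAsDeclared;
depthPassB.loadOp = VK_ATTACHMENT_LOAD_OP_LOAD;
```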
Hope this helps :)