Unexpected behaviour of VKN subpasses on G76 (Samsung S10), G77 (S20FE)

Hi! We're currently working on implementing subpasses for Vulkan and encountered really strange behaviour on Mali GPUs, specifically G76 (Samsung S10), G77 (S20FE). Samsung S10 is running Android 12. In short, it looks like the driver is not merging subpasses.

The render pass in question consists of two subpasses. We first output something similar to a G-Buffer, including depth, then read the data using input attachments.

We first noticed that subpasses on Mali gave us no performance improvement and, in the case of the Note 8 Pro, caused a noticeable performance degradation. When we looked at AGI captures, AGI showed two separate render passes with the same VkRenderPass handle, which suggested that the driver did not merge the subpasses.

Next, we tried to reproduce the issue using the following examples, and observed the same behaviour.

https://github.com/KhronosGroup/Vulkan-Samples

https://github.com/SaschaWillems/Vulkan

In the case of the Vulkan Samples repo, on the Samsung S10, switching between Subpasses and Render Passes did not change the Tile Count or system memory accesses. When we ran Vulkan Samples on a Huawei Nova 5T (A10, Mali-G76 MP10), switching from Render Passes to Subpasses yielded a 2x decrease in Tile Count and system memory reads/writes. As for the G77, it also shows our new merged pass with two subpasses as two render passes.
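
To make the 2x Tile Count figure concrete, here is a rough sketch of how the number of tiles processed per frame scales with the number of separate fragment passes. The 16x16 tile size is an assumption for illustration (the real value depends on the GPU and on the attachment formats), and the function names are hypothetical.

```cpp
#include <cstdint>

// Assumed tile edge in pixels. Illustrative only: the real tile size depends
// on the GPU and on the per-pixel storage needed by the attachments.
constexpr std::uint32_t kAssumedTileSize = 16;

// Number of tiles needed to cover a width x height render target once.
constexpr std::uint64_t tilesPerPass(std::uint32_t width, std::uint32_t height,
                                     std::uint32_t tileSize = kAssumedTileSize)
{
    const std::uint64_t tilesX = (width + tileSize - 1) / tileSize;   // round up
    const std::uint64_t tilesY = (height + tileSize - 1) / tileSize;  // round up
    return tilesX * tilesY;
}

// Total tiles processed when a frame runs as `fragmentPasses` separate
// full-screen fragment passes: 1 if two subpasses are merged, 2 if the
// driver executes them as two render passes.
constexpr std::uint64_t tilesPerFrame(std::uint32_t width, std::uint32_t height,
                                      std::uint32_t fragmentPasses)
{
    return tilesPerPass(width, height) * fragmentPasses;
}
```

Under these assumptions a 1920x1080 target needs 8160 tiles when the two subpasses are merged and 16320 when they are not, which is the kind of 2x difference a Tile Count counter would show.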

In the case of the S10 this is especially surprising, as the Vulkan Samples page on subpasses (https://github.com/KhronosGroup/Vulkan-Samples/tree/main/samples/performance/subpasses) mentions this exact phone and shows the expected tile usage improvements.

As those samples exhibit the same issues as our client code, is there anything wrong, or potentially wrong, in the setup that could cause the driver not to merge the subpasses? And how should correctly merged subpasses look in AGI?

  • The other thing I'd like to mention is that we also tried binding depth simultaneously as an input attachment (IA) and as pDepthStencilAttachment in the DEPTH_STENCIL_READ_ONLY layout. This did not change anything, although ARM guides suggest that this is the correct way of using depth after it has been rendered.

  • Hi Ivan,

    The first thing to check in this situation is usually what kind of VkSubpassDependency entries you have set up. Notably, any use of VK_ACCESS_SHADER_READ_BIT in the dependencies between two subpasses will prevent the driver from merging them. This is often the culprit when fusion is not triggering as expected.

    I find https://github.com/ARM-software/vulkan-sdk/blob/master/samples/multipass/multipass.cpp#L923 a useful 'reference' setup here. Do any differences jump out if you compare this with your use-case?

    Cheers,
    Christian
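
The check described in this reply can be sketched as a small helper that scans the subpass dependencies for VK_ACCESS_SHADER_READ_BIT. This is only an illustration of the rule of thumb stated above, not a description of what the driver actually does. The `Dependency` struct and the `k...` constants are local stand-ins (assumptions) so that the snippet compiles without the Vulkan headers; their numeric values match the Vulkan enumerants named in the comments.

```cpp
#include <cstddef>
#include <cstdint>

// Local stand-ins so the sketch compiles without the Vulkan headers.
// The numeric values match the Vulkan enumerants named in the comments.
constexpr std::uint32_t kSubpassExternal            = ~0u;    // VK_SUBPASS_EXTERNAL
constexpr std::uint32_t kAccessInputAttachmentRead  = 0x010;  // VK_ACCESS_INPUT_ATTACHMENT_READ_BIT
constexpr std::uint32_t kAccessShaderRead           = 0x020;  // VK_ACCESS_SHADER_READ_BIT
constexpr std::uint32_t kAccessColorAttachmentWrite = 0x100;  // VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT

// Hypothetical reduced mirror of VkSubpassDependency (only the fields used here).
struct Dependency {
    std::uint32_t srcSubpass;
    std::uint32_t dstSubpass;
    std::uint32_t srcAccessMask;
    std::uint32_t dstAccessMask;
};

// True if a dependency between two subpasses of the same render pass
// mentions the shader-read access bit in either access mask.
constexpr bool usesShaderReadBetweenSubpasses(const Dependency& dep)
{
    const bool internal = dep.srcSubpass != kSubpassExternal &&
                          dep.dstSubpass != kSubpassExternal;
    return internal &&
           ((dep.srcAccessMask | dep.dstAccessMask) & kAccessShaderRead) != 0;
}

// Counts the dependencies that would block merging under this rule of thumb.
constexpr std::size_t countMergeBlockers(const Dependency* deps, std::size_t count)
{
    std::size_t blockers = 0;
    for (std::size_t i = 0; i < count; ++i) {
        if (usesShaderReadBetweenSubpasses(deps[i])) {
            ++blockers;
        }
    }
    return blockers;
}

// Two hypothetical subpass 0 -> 1 dependencies for comparison.
constexpr Dependency kSampledRead[] = {
    {0, 1, kAccessColorAttachmentWrite, kAccessShaderRead}};
constexpr Dependency kInputAttachmentRead[] = {
    {0, 1, kAccessColorAttachmentWrite, kAccessInputAttachmentRead}};
```

With these inputs, `countMergeBlockers(kSampledRead, 1)` reports one blocker, while the variant that uses the input attachment read bit reports none. Dependencies to or from VK_SUBPASS_EXTERNAL are ignored because the advice above concerns dependencies between two subpasses.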

  • vkCreateRenderPass2                      vkCreateRenderPass2({ VkAttachmentDescription2[6], { { { 0, 1, 2 }, 3 }, { { 0 }, 3 } } })
    	device                                 Device 50
    	CreateInfo                             VkRenderPassCreateInfo2()
    		sType                                VK_STRUCTURE_TYPE_RENDER_PASS_CREATE_INFO_2
    		pNext                                NULL
    		flags                                VkRenderPassCreateFlagBits(0)
    		attachmentCount                      6
    		pAttachments                         VkAttachmentDescription2[6]
    			[0]                                VkAttachmentDescription2()
    				sType                            VK_STRUCTURE_TYPE_ATTACHMENT_DESCRIPTION_2
    				pNext                            NULL
    				flags                            VkAttachmentDescriptionFlagBits(0)
    				format                           VK_FORMAT_B10G11R11_UFLOAT_PACK32
    				samples                          VK_SAMPLE_COUNT_4_BIT
    				loadOp                           VK_ATTACHMENT_LOAD_OP_CLEAR
    				storeOp                          VK_ATTACHMENT_STORE_OP_DONT_CARE
    				stencilLoadOp                    VK_ATTACHMENT_LOAD_OP_DONT_CARE
    				stencilStoreOp                   VK_ATTACHMENT_STORE_OP_DONT_CARE
    				initialLayout                    VK_IMAGE_LAYOUT_UNDEFINED
    				finalLayout                      VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL
    			[1]                                VkAttachmentDescription2()
    				sType                            VK_STRUCTURE_TYPE_ATTACHMENT_DESCRIPTION_2
    				pNext                            NULL
    				flags                            VkAttachmentDescriptionFlagBits(0)
    				format                           VK_FORMAT_R8G8_UNORM
    				samples                          VK_SAMPLE_COUNT_4_BIT
    				loadOp                           VK_ATTACHMENT_LOAD_OP_CLEAR
    				storeOp                          VK_ATTACHMENT_STORE_OP_DONT_CARE
    				stencilLoadOp                    VK_ATTACHMENT_LOAD_OP_DONT_CARE
    				stencilStoreOp                   VK_ATTACHMENT_STORE_OP_DONT_CARE
    				initialLayout                    VK_IMAGE_LAYOUT_UNDEFINED
    				finalLayout                      VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL
    			[2]                                VkAttachmentDescription2()
    				sType                            VK_STRUCTURE_TYPE_ATTACHMENT_DESCRIPTION_2
    				pNext                            NULL
    				flags                            VkAttachmentDescriptionFlagBits(0)
    				format                           VK_FORMAT_R8_UNORM
    				samples                          VK_SAMPLE_COUNT_4_BIT
    				loadOp                           VK_ATTACHMENT_LOAD_OP_CLEAR
    				storeOp                          VK_ATTACHMENT_STORE_OP_DONT_CARE
    				stencilLoadOp                    VK_ATTACHMENT_LOAD_OP_DONT_CARE
    				stencilStoreOp                   VK_ATTACHMENT_STORE_OP_DONT_CARE
    				initialLayout                    VK_IMAGE_LAYOUT_UNDEFINED
    				finalLayout                      VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL
    			[3]                                VkAttachmentDescription2()
    				sType                            VK_STRUCTURE_TYPE_ATTACHMENT_DESCRIPTION_2
    				pNext                            NULL
    				flags                            VkAttachmentDescriptionFlagBits(0)
    				format                           VK_FORMAT_D32_SFLOAT
    				samples                          VK_SAMPLE_COUNT_4_BIT
    				loadOp                           VK_ATTACHMENT_LOAD_OP_CLEAR
    				storeOp                          VK_ATTACHMENT_STORE_OP_DONT_CARE
    				stencilLoadOp                    VK_ATTACHMENT_LOAD_OP_DONT_CARE
    				stencilStoreOp                   VK_ATTACHMENT_STORE_OP_DONT_CARE
    				initialLayout                    VK_IMAGE_LAYOUT_UNDEFINED
    				finalLayout                      VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL
    			[4]                                VkAttachmentDescription2()
    				sType                            VK_STRUCTURE_TYPE_ATTACHMENT_DESCRIPTION_2
    				pNext                            NULL
    				flags                            VkAttachmentDescriptionFlagBits(0)
    				format                           VK_FORMAT_B10G11R11_UFLOAT_PACK32
    				samples                          VK_SAMPLE_COUNT_1_BIT
    				loadOp                           VK_ATTACHMENT_LOAD_OP_DONT_CARE
    				storeOp                          VK_ATTACHMENT_STORE_OP_STORE
    				stencilLoadOp                    VK_ATTACHMENT_LOAD_OP_DONT_CARE
    				stencilStoreOp                   VK_ATTACHMENT_STORE_OP_DONT_CARE
    				initialLayout                    VK_IMAGE_LAYOUT_UNDEFINED
    				finalLayout                      VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL
    			[5]                                VkAttachmentDescription2()
    				sType                            VK_STRUCTURE_TYPE_ATTACHMENT_DESCRIPTION_2
    				pNext                            NULL
    				flags                            VkAttachmentDescriptionFlagBits(0)
    				format                           VK_FORMAT_D32_SFLOAT
    				samples                          VK_SAMPLE_COUNT_1_BIT
    				loadOp                           VK_ATTACHMENT_LOAD_OP_DONT_CARE
    				storeOp                          VK_ATTACHMENT_STORE_OP_STORE
    				stencilLoadOp                    VK_ATTACHMENT_LOAD_OP_DONT_CARE
    				stencilStoreOp                   VK_ATTACHMENT_STORE_OP_DONT_CARE
    				initialLayout                    VK_IMAGE_LAYOUT_UNDEFINED
    				finalLayout                      VK_IMAGE_LAYOUT_DEPTH_STENCIL_READ_ONLY_OPTIMAL
    		subpassCount                         2
    		pSubpasses                           VkSubpassDescription2[2]
    			[0]                                VkSubpassDescription2()
    				sType                            VK_STRUCTURE_TYPE_SUBPASS_DESCRIPTION_2
    				pNext                            NULL
    				flags                            VkSubpassDescriptionFlagBits(0)
    				pipelineBindPoint                VK_PIPELINE_BIND_POINT_GRAPHICS
    				viewMask                         0
    				inputAttachmentCount             0
    				pInputAttachments                VkAttachmentReference2[0]
    				colorAttachmentCount             3
    				pColorAttachments                VkAttachmentReference2[3]
    					[0]                            VkAttachmentReference2()
    						sType                        VK_STRUCTURE_TYPE_ATTACHMENT_REFERENCE_2
    						pNext                        NULL
    						attachment                   0
    						layout                       VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL
    						aspectMask                   VK_IMAGE_ASPECT_COLOR_BIT
    					[1]                            VkAttachmentReference2()
    						sType                        VK_STRUCTURE_TYPE_ATTACHMENT_REFERENCE_2
    						pNext                        NULL
    						attachment                   1
    						layout                       VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL
    						aspectMask                   VK_IMAGE_ASPECT_COLOR_BIT
    					[2]                            VkAttachmentReference2()
    						sType                        VK_STRUCTURE_TYPE_ATTACHMENT_REFERENCE_2
    						pNext                        NULL
    						attachment                   2
    						layout                       VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL
    						aspectMask                   VK_IMAGE_ASPECT_COLOR_BIT
    				pResolveAttachments              VkAttachmentReference2[0]
    				pDepthStencilAttachment          VkAttachmentReference2()
    					sType                          VK_STRUCTURE_TYPE_ATTACHMENT_REFERENCE_2
    					pNext                          NULL
    					attachment                     3
    					layout                         VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL
    					aspectMask                     VK_IMAGE_ASPECT_DEPTH_BIT
    				preserveAttachmentCount          0
    				pPreserveAttachments             uint32_t[0]
    			[1]                                VkSubpassDescription2()
    				sType                            VK_STRUCTURE_TYPE_SUBPASS_DESCRIPTION_2
    				pNext                            VkSubpassDescriptionDepthStencilResolve()
    					sType                          VK_STRUCTURE_TYPE_SUBPASS_DESCRIPTION_DEPTH_STENCIL_RESOLVE
    					pNext                          NULL
    					depthResolveMode               VK_RESOLVE_MODE_SAMPLE_ZERO_BIT
    					stencilResolveMode             VK_RESOLVE_MODE_SAMPLE_ZERO_BIT
    					pDepthStencilResolveAttachment VkAttachmentReference2()
    						sType                        VK_STRUCTURE_TYPE_ATTACHMENT_REFERENCE_2
    						pNext                        NULL
    						attachment                   5
    						layout                       VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL
    						aspectMask                   VK_IMAGE_ASPECT_DEPTH_BIT
    				flags                            VkSubpassDescriptionFlagBits(0)
    				pipelineBindPoint                VK_PIPELINE_BIND_POINT_GRAPHICS
    				viewMask                         0
    				inputAttachmentCount             3
    				pInputAttachments                VkAttachmentReference2[3]
    					[0]                            VkAttachmentReference2()
    						sType                        VK_STRUCTURE_TYPE_ATTACHMENT_REFERENCE_2
    						pNext                        NULL
    						attachment                   1
    						layout                       VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL
    						aspectMask                   VK_IMAGE_ASPECT_COLOR_BIT
    					[1]                            VkAttachmentReference2()
    						sType                        VK_STRUCTURE_TYPE_ATTACHMENT_REFERENCE_2
    						pNext                        NULL
    						attachment                   2
    						layout                       VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL
    						aspectMask                   VK_IMAGE_ASPECT_COLOR_BIT
    					[2]                            VkAttachmentReference2()
    						sType                        VK_STRUCTURE_TYPE_ATTACHMENT_REFERENCE_2
    						pNext                        NULL
    						attachment                   3
    						layout                       VK_IMAGE_LAYOUT_DEPTH_STENCIL_READ_ONLY_OPTIMAL
    						aspectMask                   VK_IMAGE_ASPECT_DEPTH_BIT
    				colorAttachmentCount             1
    				pColorAttachments                VkAttachmentReference2[1]
    					[0]                            VkAttachmentReference2()
    						sType                        VK_STRUCTURE_TYPE_ATTACHMENT_REFERENCE_2
    						pNext                        NULL
    						attachment                   0
    						layout                       VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL
    						aspectMask                   VK_IMAGE_ASPECT_COLOR_BIT
    				pResolveAttachments              VkAttachmentReference2[1]
    					[0]                            VkAttachmentReference2()
    						sType                        VK_STRUCTURE_TYPE_ATTACHMENT_REFERENCE_2
    						pNext                        NULL
    						attachment                   4
    						layout                       VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL
    						aspectMask                   VK_IMAGE_ASPECT_COLOR_BIT
    				pDepthStencilAttachment          VkAttachmentReference2()
    					sType                          VK_STRUCTURE_TYPE_ATTACHMENT_REFERENCE_2
    					pNext                          NULL
    					attachment                     3
    					layout                         VK_IMAGE_LAYOUT_DEPTH_STENCIL_READ_ONLY_OPTIMAL
    					aspectMask                     VK_IMAGE_ASPECT_DEPTH_BIT
    				preserveAttachmentCount          0
    				pPreserveAttachments             uint32_t[0]
    		dependencyCount                      3
    		pDependencies                        VkSubpassDependency2[3]
    			[0]                                VkSubpassDependency2()
    				sType                            VK_STRUCTURE_TYPE_SUBPASS_DEPENDENCY_2
    				pNext                            NULL
    				srcSubpass                       UINT32_MAX
    				dstSubpass                       0
    				srcStageMask                     VK_PIPELINE_STAGE_LATE_FRAGMENT_TESTS_BIT | VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT
    				dstStageMask                     VK_PIPELINE_STAGE_EARLY_FRAGMENT_TESTS_BIT | VK_PIPELINE_STAGE_LATE_FRAGMENT_TESTS_BIT | VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT
    				srcAccessMask                    VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT | VK_ACCESS_DEPTH_STENCIL_ATTACHMENT_READ_BIT | VK_ACCESS_DEPTH_STENCIL_ATTACHMENT_WRITE_BIT
    				dstAccessMask                    VK_ACCESS_COLOR_ATTACHMENT_READ_BIT | VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT | VK_ACCESS_DEPTH_STENCIL_ATTACHMENT_READ_BIT
    				dependencyFlags                  VK_DEPENDENCY_BY_REGION_BIT
    				viewOffset                       0
    			[1]                                VkSubpassDependency2()
    				sType                            VK_STRUCTURE_TYPE_SUBPASS_DEPENDENCY_2
    				pNext                            NULL
    				srcSubpass                       0
    				dstSubpass                       1
    				srcStageMask                     VK_PIPELINE_STAGE_LATE_FRAGMENT_TESTS_BIT | VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT
    				dstStageMask                     VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT | VK_PIPELINE_STAGE_EARLY_FRAGMENT_TESTS_BIT | VK_PIPELINE_STAGE_LATE_FRAGMENT_TESTS_BIT | VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT
    				srcAccessMask                    VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT | VK_ACCESS_DEPTH_STENCIL_ATTACHMENT_WRITE_BIT
    				dstAccessMask                    VK_ACCESS_INPUT_ATTACHMENT_READ_BIT | VK_ACCESS_COLOR_ATTACHMENT_READ_BIT | VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT | VK_ACCESS_DEPTH_STENCIL_ATTACHMENT_READ_BIT
    				dependencyFlags                  VK_DEPENDENCY_BY_REGION_BIT
    				viewOffset                       0
    			[2]                                VkSubpassDependency2()
    				sType                            VK_STRUCTURE_TYPE_SUBPASS_DEPENDENCY_2
    				pNext                            NULL
    				srcSubpass                       1
    				dstSubpass                       UINT32_MAX
    				srcStageMask                     VK_PIPELINE_STAGE_LATE_FRAGMENT_TESTS_BIT | VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT
    				dstStageMask                     VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT | VK_PIPELINE_STAGE_EARLY_FRAGMENT_TESTS_BIT | VK_PIPELINE_STAGE_LATE_FRAGMENT_TESTS_BIT | VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT
    				srcAccessMask                    VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT | VK_ACCESS_DEPTH_STENCIL_ATTACHMENT_READ_BIT
    				dstAccessMask                    VK_ACCESS_INPUT_ATTACHMENT_READ_BIT | VK_ACCESS_COLOR_ATTACHMENT_READ_BIT | VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT | VK_ACCESS_DEPTH_STENCIL_ATTACHMENT_READ_BIT | VK_ACCESS_DEPTH_STENCIL_ATTACHMENT_WRITE_BIT
    				dependencyFlags                  VK_DEPENDENCY_BY_REGION_BIT
    				viewOffset                       0
    		correlatedViewMaskCount              0
    		pCorrelatedViewMasks                 uint32_t[0]
    	pAllocator                             NULL
    	RenderPass                             Render Pass 1847
    

    Hi Christian, this create info is taken from a RenderDoc capture. This is the render pass where we try to apply multipass rendering. We don't use VK_ACCESS_SHADER_READ_BIT. Can you take a look and see if we may be doing something wrong?

  • Also, here are my experiments with the inputattachments example from

    https://github.com/SaschaWillems/Vulkan

    The original render pass creation code goes like this:

    		/*
    			First subpass
    			Fill the color and depth attachments
    		*/
    		VkAttachmentReference colorReference = { 1, VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL };
    		VkAttachmentReference depthReference = { 2, VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL };
    
    		subpassDescriptions[0].pipelineBindPoint = VK_PIPELINE_BIND_POINT_GRAPHICS;
    		subpassDescriptions[0].colorAttachmentCount = 1;
    		subpassDescriptions[0].pColorAttachments = &colorReference;
    		subpassDescriptions[0].pDepthStencilAttachment = &depthReference;
    
    		/*
    			Second subpass
    			Input attachment read and swap chain color attachment write
    		*/
    
    		// Color reference (target) for this sub pass is the swap chain color attachment
    		VkAttachmentReference colorReferenceSwapchain = { 0, VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL };
    
    		subpassDescriptions[1].pipelineBindPoint = VK_PIPELINE_BIND_POINT_GRAPHICS;
    		subpassDescriptions[1].colorAttachmentCount = 1;
    		subpassDescriptions[1].pColorAttachments = &colorReferenceSwapchain;
    
    		// Color and depth attachment written to in first sub pass will be used as input attachments to be read in the fragment shader
    		VkAttachmentReference inputReferences[2];
    		inputReferences[0] = { 1, VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL };
    		inputReferences[1] = { 2, VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL };
    
    		// Use the attachments filled in the first pass as input attachments
    		subpassDescriptions[1].inputAttachmentCount = 2;
    		subpassDescriptions[1].pInputAttachments = inputReferences;
    
    		/*
    			Subpass dependencies for layout transitions
    		*/
    		std::array<VkSubpassDependency, 3> dependencies;
    
    		dependencies[0].srcSubpass = VK_SUBPASS_EXTERNAL;
    		dependencies[0].dstSubpass = 0;
    		dependencies[0].srcStageMask = VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT;
    		dependencies[0].dstStageMask = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT | VK_PIPELINE_STAGE_EARLY_FRAGMENT_TESTS_BIT;
    		dependencies[0].srcAccessMask = VK_ACCESS_MEMORY_READ_BIT;
    		dependencies[0].dstAccessMask = VK_ACCESS_COLOR_ATTACHMENT_READ_BIT | VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT | VK_ACCESS_DEPTH_STENCIL_ATTACHMENT_WRITE_BIT;
    		dependencies[0].dependencyFlags = VK_DEPENDENCY_BY_REGION_BIT;
    
    		// This dependency transitions the input attachment from color attachment to shader read
    		dependencies[1].srcSubpass = 0;
    		dependencies[1].dstSubpass = 1;
    		dependencies[1].srcStageMask = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;
    		dependencies[1].dstStageMask = VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT;
    		dependencies[1].srcAccessMask = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT;
    		dependencies[1].dstAccessMask = VK_ACCESS_SHADER_READ_BIT;
    		dependencies[1].dependencyFlags = VK_DEPENDENCY_BY_REGION_BIT;
    
    		dependencies[2].srcSubpass = 0;
    		dependencies[2].dstSubpass = VK_SUBPASS_EXTERNAL;
    		dependencies[2].srcStageMask = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;
    		dependencies[2].dstStageMask = VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT;
    		dependencies[2].srcAccessMask = VK_ACCESS_COLOR_ATTACHMENT_READ_BIT | VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT;
    		dependencies[2].dstAccessMask = VK_ACCESS_MEMORY_READ_BIT;
    		dependencies[2].dependencyFlags = VK_DEPENDENCY_BY_REGION_BIT;

    After modifying it like this:

    inputReferences[1] = { 2, VK_IMAGE_LAYOUT_DEPTH_READ_ONLY_OPTIMAL };
        
    .....
        
    subpassDescriptions[1].pDepthStencilAttachment = inputReferences + 1;
    
    .....
    
    dependencies[1].srcStageMask = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT | VK_PIPELINE_STAGE_EARLY_FRAGMENT_TESTS_BIT | VK_PIPELINE_STAGE_LATE_FRAGMENT_TESTS_BIT;
    dependencies[1].dstStageMask = VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT | VK_PIPELINE_STAGE_EARLY_FRAGMENT_TESTS_BIT | VK_PIPELINE_STAGE_LATE_FRAGMENT_TESTS_BIT;
    dependencies[1].srcAccessMask = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT | VK_ACCESS_DEPTH_STENCIL_ATTACHMENT_WRITE_BIT;
    dependencies[1].dstAccessMask = VK_ACCESS_INPUT_ATTACHMENT_READ_BIT | VK_ACCESS_DEPTH_STENCIL_ATTACHMENT_READ_BIT;
    dependencies[1].dependencyFlags = VK_DEPENDENCY_BY_REGION_BIT;

    The FPS in the demo drops from ~37 to ~35.

  • Interesting, thanks! For the SaschaWillems sample, they are using VK_ACCESS_SHADER_READ_BIT in the original code there, which will prevent fusion. With your changes I would expect fusion to happen, and this could explain the performance difference. Which device did you see the FPS change on? And any indications from e.g. AGI that fusion might be happening for this case?

    Otherwise, to check my understanding: you're not seeing any indications of fusion happening on either the Galaxy S10 or the S20 with the Vulkan Samples subpass sample, but on the Nova 5T you are seeing it?

    Would it be possible to share an APK reproducer of your original case with us? If so we could check exactly what happens. Please email us at developer@arm.com if so, thanks. :)

    (To mention it, once drivers get updated you'll also be able to use the VK_EXT_subpass_merge_feedback extension to get some feedback on this on your side. We're hoping that will be helpful for exactly this kind of case.)

  • I tested those changes on Samsung S10. The GPU queue in AGI on the device looks like this:

    Sometimes there are two fragment blocks, sometimes there are three(!!). Also, there are two vertex phases that belong to the same VkRenderPass handle.

    As for your understanding, yes, you are correct. The Samsung S10 and S20FE do not show any indication of fusion (on-screen metrics, AGI), but the Huawei Nova 5T does (tile count and memory bandwidth decrease by 2x).

    Unfortunately we cannot share an APK based on our game client, but you can use either the most recent https://github.com/KhronosGroup/Vulkan-Samples , or even the multipass sample from https://github.com/ARM-software/vulkan-sdk

    Below is the AGI capture of ARM multipass sample running on S10. Again, two vertex blocks, two fragment blocks. There are no local modifications apart from updating Gradle to 4.2.0 and Android SDK to 28.

  • Also, regarding performance changes -- since FPS dropped after potential subpass fusion, does that mean that subpass fusion can lower performance?

  • Thanks again. This all sounds fairly strange. One thing which comes to mind is there's a non-zero chance the driver might have been modified by the device vendor here, which could explain things. It might be worth it to reach out to them to check this.

    About performance, subpass fusion can indeed result in lower performance, particularly on pre-G710 GPUs. On these devices fusion can be a trade-off between performance and memory-bandwidth (and power). This can create a situation where in a peak-performance comparison fusion is worse, however in a longer gameplay-type scenario fusion is likely better (less power consumption/more battery time, less heat, less throttling, etc). However, on G710 and up we generally expect performance to be similar -- making fusion a clear win overall. There is some more info about this here for reference: community.arm.com/.../mali-g710-developer-overview

    It should be said that this can also be quite case-dependent. In your case, with MSAA, having to write the MSAA attachments out to memory is going to be very painful -- so this should in theory be a case where fusion is quite helpful.

  • We're able to reproduce the subpass sample not showing any difference between render-passes and sub-passes on a stock Galaxy S20 with a r38p1 driver. It works as expected on our stock driver, however -- so our best guess is this is caused by a driver modification by the device vendor.

    For the input attachment sample, on a G710 device I can see the number of tiles is reduced after making your modifications, indicating subpass fusion is now happening (left side is after modifications, right side is the default code):

    Notice there's no real performance change in this case (if anything the modified code is slightly faster) and we can also see that bandwidth is now significantly reduced:

  • About the slowdown on S20 with the input attachment sample, now having looked at the code a theory for what happens is:

    • In the original code, there is no DS attachment in the second subpass. If there's no fusion you'll have:
      • Fragment job 0 (subpass 0): Clears color+depth, render, writes out color+depth
      • Fragment job 1 (subpass 1): Samples color- or depth-buffer from previous pass, writes out color to swap-chain
    • However, with your changes there is now a DS attachment in the second subpass. So we get:
      • Fragment job 0 (subpass 0): Clears color+depth, render, writes out color+depth. (Same as before)
      • Fragment job 1 (subpass 1): Reads back depth-buffer (since there is now a DS attachment present), samples color- or depth-buffer from previous pass, writes out color to swap-chain
    • In the ideal case, with fusion, we'd expect this:
      • Fragment job 0 (subpass 0 and 1 combined): Clear color+depth, render, <NextSubpass>, read color/depth from tile-buffer and write result to swap-chain image, write out swap-chain image to memory

    As you can see, in the first two cases (no fusion) we end up with quite a lot of off-chip/DRAM traffic in order to write out the 'intermediate' images. Whereas when fusing the subpasses we avoid this by being able to keep the data on-chip, and the only thing we need to write to memory is the final swapchain image. The difference between the first two cases is that in the second case we need a readback of the depth-buffer in the second fragment job (because there is one specified) -- and this extra work / BW may explain the slowdown.
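
The three cases above can be compared with a very rough count of full-screen attachment transfers per frame. This is only a sketch of the reasoning in this reply: it treats every write-out, readback and sampled read as one full-screen transfer, and it ignores caches and framebuffer compression. The struct and function names are hypothetical, and the 4 bytes per pixel default is an assumed value.

```cpp
#include <cstdint>

// Full-screen attachment transfers between the tile buffer and DRAM per frame.
struct Traffic {
    std::uint32_t attachmentWrites;  // attachments written out by fragment jobs
    std::uint32_t attachmentReads;   // attachments read back or sampled from memory
    constexpr std::uint32_t total() const { return attachmentWrites + attachmentReads; }
};

// Case 1: original sample code, subpasses not fused.
// Job 0 writes color + depth; job 1 samples one of them and writes the swapchain image.
constexpr Traffic originalNotFused() { return Traffic{2 + 1, 1}; }

// Case 2: modified code (DS attachment in subpass 1), subpasses not fused.
// Same as case 1, plus a depth readback at the start of job 1.
constexpr Traffic modifiedNotFused() { return Traffic{2 + 1, 1 + 1}; }

// Case 3: subpasses fused into one fragment job.
// Only the final swapchain image is written out.
constexpr Traffic fused() { return Traffic{1, 0}; }

// Converts a transfer count to bytes for an assumed resolution and an
// assumed (uniform) attachment size in bytes per pixel.
constexpr std::uint64_t bytesPerFrame(Traffic traffic, std::uint32_t width,
                                      std::uint32_t height,
                                      std::uint32_t bytesPerPixel = 4)
{
    return static_cast<std::uint64_t>(traffic.total()) * width * height * bytesPerPixel;
}
```

In this simplified model the three cases cost 4, 5 and 1 full-screen transfers respectively. The extra transfer in the second case corresponds to the depth readback suggested above as a possible cause of the slowdown.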

    With this in mind it could be the slowdown you see here is not related to fusion working / not working, and that fusion in fact never happens on these devices. If so it would explain the lack of any differences with/without subpasses on the S10 and S20, but that there is a difference on the Nova 5T.

    If this sounds good so far, I guess the only question remaining is why you see a noticeable slowdown on the Note 8 in your original case. A theory is that fusion is in fact working here, explaining there is *a* difference -- but of course we'd expect an improvement, not a slowdown.

    Are you able to do any profiling on the Note 8 to try to get some information out of it about what might be happening on this device?

  • Regarding the Note 8 Pro, below are the relevant metrics captured from our bench device before (left) and after (right). It does look like memory traffic and the number of tiles are greatly reduced.

  • Thanks! It's both good news and bad news. Considering what you said before, that subpass merging might reduce raw performance in favour of sustained thermal performance, I can see why some vendors might be interested in modifying the driver to disable subpass merging.

    As for subpass merging not working, do you think it would make more sense for us to trade MSAA for a slight resolution increase? The theory is that:

    1. For base resolution X (lower than native), MSAA 4X would make for 4X system memory cost and 4X bandwidth cost when subpasses are not merged.

    2. If we instead make our new resolution something like 1.2X, we still get improved visual fidelity at the cost of 1.2X bandwidth and 1.2X raster cost.

    3. Assuming ALU performance scales better than system bus throughput generation-over-generation, there's a chance that we'll trade time spent in the tile load/store cycle for time spent rasterizing more fragments. In other words, if MSAA 4X causes a 4x tile count and we instead make it a 1.2x tile count, we'll spend 1.2x more time rasterizing fragments, but 3.3x less time loading and storing tiles.

    4. Assuming that moving data over system memory bus produces more heat than rasterization over time (one SoC vendor did hint about that), rasterizing 1.2x fragments would produce *much* less heat than transferring 4x tiles over system bus.
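
The arithmetic behind the 3.3x figure in the list above can be sketched as follows. It assumes, as the list does, that tile load/store traffic scales linearly with the sample count and with the pixel count when subpasses are not merged. The function names are hypothetical. The second helper covers the case where "1.2X" is meant as a per-axis resolution scale rather than a pixel-count scale, which is an interpretation the post does not state.

```cpp
// Relative reduction in tile load/store traffic when replacing N-sample MSAA
// at the base resolution with single-sampled rendering at a scaled pixel count.
constexpr double trafficSaving(int msaaSamples, double pixelCountScale)
{
    return static_cast<double>(msaaSamples) / pixelCountScale;
}

// If the scale is applied to width and height separately, the pixel count
// grows with the square of the per-axis scale.
constexpr double pixelCountScaleFromAxisScale(double axisScale)
{
    return axisScale * axisScale;
}
```

With 4x MSAA and a 1.2x pixel count, `trafficSaving(4, 1.2)` is about 3.33. If 1.2x were a per-axis scale instead, the pixel count would grow by about 1.44x and the saving would drop to roughly 2.78x.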

  • One additional thing to note here is that MSAA write-out cannot be framebuffer-compressed on our current GPUs, but non-MSAA can. So comparing 4x MSAA writeout vs 0x MSAA writeout the BW difference can easily be 8x or more in practice, given AFBC usually gives 2:1 compression ratios (and much better for solid-color tiles).

    So, in practice, knowing some vendors may disable fusion, it seems difficult to recommend to use MSAA in combination with subpasses at all, as the risk and performance-consequence of MSAA writeouts is decidedly non-trivial...

    So your general thinking there makes sense to me I think -- I might nitpick some of the details but the general direction seems reasonable.
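
The "8x or more" estimate in this reply follows from combining the sample count with the compression ratio. A minimal sketch, assuming the MSAA write-out is uncompressed and the single-sampled write-out is compressed at a fixed ratio (the function name is hypothetical):

```cpp
// Ratio of MSAA write-out bandwidth to single-sampled write-out bandwidth,
// assuming MSAA data is written uncompressed (msaaSamples values per pixel)
// and single-sampled data is compressed by afbcCompressionRatio to 1.
constexpr double writeoutBandwidthRatio(int msaaSamples, double afbcCompressionRatio)
{
    return static_cast<double>(msaaSamples) * afbcCompressionRatio;
}
```

For 4x MSAA and the 2:1 ratio quoted above this gives 8x. Higher compression ratios, such as those mentioned for solid-color tiles, push the ratio further.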

  • One thing to try here is to zoom in (use the 1ms option) and try to select 1 frame using the range selector. Usually there is a nice gap between frames if the workload is vsync limited, so this should be fairly easy in that case. If not, look for patterns in the workload, like e.g. when the bulk of the VS shading happens (this is usually the start of a frame).

    This way you can directly compare the workload for a single frame, as opposed to workload inside some specific period of time. I find this useful when comparing performance between two cases.

  • Thanks! If possible, can you please point out the details that I got wrong?

    As for MSAA, did I understand you correctly, that when subpasses are not merged, input attachments essentially become texel fetches from system memory? And when subpasses are not merged, specifying depth both as depth attachment and as input attachment causes additional depth tile loads?

    As for the range selector in Streamline, great point! When I zoomed in to 1ms, I could see that Bus Beats/Core Tiles with subpasses do have fewer spikes.