When testing my applications on Android 15 with a Mali-G710 MP7 (Pixel 7), I noticed that any time I submit a secondary command buffer to a render pass I get the following validation error when presenting the swap chain:
VUID-VkPresentInfoKHR-pImageIndices-01430: Validation Error: [ VUID-VkPresentInfoKHR-pImageIndices-01430 ] Object 0: handle = 0xb400007be5014a90, type = VK_OBJECT_TYPE_QUEUE; | MessageID = 0x48ad24c6 | vkQueuePresentKHR(): pPresentInfo->pSwapchains[0] images passed to present must be in layout VK_IMAGE_LAYOUT_PRESENT_SRC_KHR or VK_IMAGE_LAYOUT_SHARED_PRESENT_KHR but is in VK_IMAGE_LAYOUT_UNDEFINED. The Vulkan spec states: Each element of pImageIndices must be the index of a presentable image acquired from the swapchain specified by the corresponding element of the pSwapchains array, and the presented image subresource must be in the VK_IMAGE_LAYOUT_PRESENT_SRC_KHR or VK_IMAGE_LAYOUT_SHARED_PRESENT_KHR layout at the time the operation is executed on a VkDevice (docs.vulkan.org/.../wsi.html)
After a few frames, the device is lost. I haven't encountered either issue on any other platform when running the same tester going through the same Vulkan code paths.
Here is a screenshot from RenderDoc when running the same tester on Linux with the same configuration to draw with a secondary command buffer:
I have highlighted the barriers that perform the layout transitions. Here are the contents of the barriers themselves:
Some other things to note:
When debugging, I even tried waiting for the GPU to be idle before processing the submit, and I still had the same behavior. The following can avoid both the validation and device lost issues:
So far I haven't been able to find the source of the error in the image transitions, command buffer setup, queue submission, or swapchain presentation. I see that Mali-G710 was advertised as the first Mali GPU with native secondary command buffer support. Are there extra steps that I'm missing when the hardware support is used on Mali, such as extra barriers? Could this be a driver issue, perhaps Android-specific with the swapchains?
The project is open source; I can share it, along with instructions to reproduce, if need be.
The validation layer error can't be related to an Android issue or Mali driver issue, as that check is implemented above the API, so I'd focus on solving that one before worrying about anything else. It doesn't seem like that error should be related to secondary command buffers at all.
Hard to help more from screenshots - repro instructions would be useful.
After some more digging around, another difference I noticed was the image barriers may appear in different primary command buffers for the code path that uses secondary command buffers. This is because the multithreaded rendering code path may create command buffers for other operations such as copying data on separate threads, then they get submitted together with the rendering command buffers. The ordering of those command buffers for submission is determined on the main thread (i.e. no thread contention), and I verified with printouts that they are submitted in order.
Another thing worth mentioning is that I am able to run on a different Android device with different hardware (Qualcomm) without errors.
The validation layer error can't be related to an Android issue or Mali driver issue, as that check is implemented above the API, so I'd focus on solving that one before worrying about anything else.
I'm thinking it might be platform-specific differences in what is returned from various Vulkan functions (e.g. swapchain images and indices, perhaps which command buffers are returned) that change what the validator sees. Along those lines, I put in a bunch of print statements and verified that the images are what I expect (4 images are in the swap chain, and they are acquired in index order) and that the images are consistent with the image views used in the framebuffer and swapchain operations. As described above, I also verified that the command buffers are submitted in order, and I use a ring buffer to avoid re-using any command buffer pools for 3 frames. The first queue submit appears to succeed without error, but the second and all later queue submits have validation errors. By the time I wait for a fence, the device is lost.
Thus far I haven't been able to find anything incorrect with how the swap chain images are being submitted. I've tried to get a frame trace, but the Android Graphics Inspector tool crashes for me when loading the trace and RenderDoc hasn't been able to gather a trace either. The fact that it crashes from device lost so quickly might also complicate matters.
The project exhibiting the issue is available here: https://github.com/akb825/DeepSea. The update.sh script may be used to checkout the submodules and download pre-built packages for the tools required for building (texture conversion, shader compilation, etc.) and pre-built external libraries. To build for Android, you can run it with the following arguments: update.sh -m -t -l android-all. If running on Windows, this may be run through git bash. Once you have the requirements, you can open the Android project under the android folder.
The different testers can be selected under the Build Variants section in Android Studio, with the GUI testers being under the "app" variants. There are two testers that exhibit the issue: TestLighting and TestScene. TestScene is the simpler of the two, though multithreaded rendering (which is what triggers the issue) is disabled by default. To enable multithreaded rendering in that tester, in the file "testers/TestScene/TestScene.c" add the following line at the top of the setup() function on line 224: testScene->multithreadedRendering = true;
I remembered that RenderDoc fails if validations are enabled and my automatic detection doesn't work on Android. After manually disabling validations (by making sure the enableValidation() function on line 329 in modules/Render/RenderVulkan/src/VkInit.c returns false), I was able to get a frame trace in RenderDoc.
The RenderDoc frame trace of the crashing frame on Android looks identical to the one I posted above from desktop Linux. The requisite memory barriers for performing the image layout transitions for the swapchain image are present, and the image instance matches for both the barriers and the call to present at the end. The RenderDoc process running on the device also ends up crashing due to device lost when trying to replay the trace.
I realize that validations run at a level above the driver, but do they query any state from the underlying Vulkan implementation? The frame trace seems to show that I am already fulfilling the requirements for the validation error, and the Mali device is the only one showing the validation errors and device lost, which makes me think that something beneath the surface is causing the failure.
I have managed to track down what triggers this issue: setting the occlusionQueryEnable member to true in VkCommandBufferInheritanceInfo while beginning the secondary command buffer triggers both the validation errors and device lost. Setting occlusionQueryEnable to false prevents both from happening. Based on this behavior, I suspect it is dropping the primary command buffer that the secondary command buffer is executed on during the queue submit, since that's the only situation I can think of that would lead the validation layer to not see any of the image layout transitions for the swap chain image. As stated earlier, the same code path works without errors on other Android chipsets as well as other platforms (Windows and Linux), so I think this conclusively points to a driver or vendor Vulkan implementation bug.
For this specific situation, it's worth noting that I didn't have occlusion queries enabled for the primary command buffer. I initially set that so I wouldn't have to track the query state when managing secondary command buffers, but I think I'll want to re-evaluate that approach regardless. As a workaround, I will want to force the inheritedQueries device feature to false when running on Mali, and make sure my own code avoids situations that would require it when not available.
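A minimal sketch of that workaround, assuming the engine keeps its own copy of the feature flag after calling vkGetPhysicalDeviceFeatures. The helper name useInheritedQueries is hypothetical and is not taken from DeepSea. 0x13B5 is the vendor ID that Arm drivers report in VkPhysicalDeviceProperties::vendorID. The blanket override for all Arm devices is an assumption made for simplicity; it could be narrowed to affected driver versions once a fixed release is known.

```c
#include <stdbool.h>
#include <stdint.h>

/* Vendor ID reported by Arm drivers in VkPhysicalDeviceProperties::vendorID. */
#define VENDOR_ID_ARM 0x13B5u

/*
 * Hypothetical helper: decides whether the engine should treat the
 * inheritedQueries feature as usable. On Arm devices the feature is treated
 * as unsupported regardless of what the driver reports.
 */
static bool useInheritedQueries(uint32_t vendorID, bool driverReportsInheritedQueries)
{
    if (vendorID == VENDOR_ID_ARM)
        return false;
    return driverReportsInheritedQueries;
}
```

The result would be stored in place of the raw inheritedQueries value, so the rest of the renderer only ever sees the overridden flag.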
Peter Harris, is this something that can be escalated on the ARM side? While there is a viable workaround (treating inheritedQueries as being unsupported), I'd imagine that others may run into this as well.
Hi Aaron,
Support for inheritedQueries is an optional feature in the API. Applications must query that the feature is supported by using vkGetPhysicalDeviceFeatures() before trying to use it. It should be available on newer devices (needs Mali-G710 or newer hardware, with r49p1 or newer drivers). Can you confirm whether the feature is showing as available on your Pixel, and the driver version? It should be supported on the Pixel 7, but only with the very latest software update which includes the r50p0 drivers.
Cheers, Pete
I am querying this feature using vkGetPhysicalDeviceFeatures, and inheritedQueries is being set to 1.
According to the Vulkan device info, the driver version is 0xC800000. I am on Android 15, build number AP4A.250105.002, and the device shows no updates available (last updated Jan 5). I'm not sure how to check for a human-readable driver version like you're reporting.
I'm not sure how to check for a human-readable driver version like you're reporting.
Use vkGetPhysicalDeviceProperties and dump the deviceName and driverVersion fields:
std::string name { deviceProperties.deviceName };
uint32_t driverVersion = deviceProperties.driverVersion;
uint32_t major = VK_VERSION_MAJOR(driverVersion);
uint32_t minor = VK_VERSION_MINOR(driverVersion);
uint32_t patch = VK_VERSION_PATCH(driverVersion);
The Arm standard numbering scheme is r<major>p<minor>. In your case 0xC800000 = r50p0, so this should work.
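To make the decoding concrete, here is a small sketch that turns the raw driverVersion value into the r<major>p<minor> form. It re-implements the bit layout of VK_VERSION_MAJOR and VK_VERSION_MINOR locally so that it builds without the Vulkan headers; the function names versionMajor, versionMinor, and formatArmDriverVersion are made up for this example.

```c
#include <stdint.h>
#include <stdio.h>

/* Same bit layout as the VK_VERSION_MAJOR / VK_VERSION_MINOR macros:
 * major in the bits above bit 22, minor in the 10 bits starting at bit 12. */
static uint32_t versionMajor(uint32_t version) { return version >> 22; }
static uint32_t versionMinor(uint32_t version) { return (version >> 12) & 0x3FFu; }

/*
 * Hypothetical helper: writes "r<major>p<minor>" into buffer.
 * Returns the value returned by snprintf.
 */
static int formatArmDriverVersion(uint32_t driverVersion, char* buffer, size_t size)
{
    return snprintf(buffer, size, "r%up%u",
        (unsigned int)versionMajor(driverVersion),
        (unsigned int)versionMinor(driverVersion));
}
```

Passing 0xC800000 from the post above yields a major of 50 and a minor of 0, which matches the r50p0 reading.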
Based on this behavior, I suspect it is dropping the primary command buffer that the secondary command buffer is executed during the queue submit.
I don't quite understand what the "it" is in this case. Khronos validation layer is run before the driver sees each API call, so if the validation layer is not seeing a primary command buffer I don't see how the driver is causing it. The driver hasn't run at the point that the validation layer reads the submit data from the application. Are you sure the engine is passing in the correct command buffers on Mali?
(Although I would have expected another validation error if you just tried to submit a secondary directly)
Possibly, if timing is different on Mali and you are missing a sync, are you passing in a command buffer that has not been recorded or has been reset?
Peter Harris said:I don't quite understand what the "it" is in this case.
I am referring to Vulkan with the queue submission.
Peter Harris said:Khronos validation layer is run before the driver sees each API call, so if the validation layer is not seeing a primary command buffer I don't see how the driver is causing it.
This is based on my observations, though I admit that I only have limited knowledge of the inner workings of the validation layers. I know that they keep track of the commands on the command buffer in a layer above the device, but is there any communication with the driver for situations such as queue submissions? At least in this case, when the validation error fires it does so inside the call to vkQueuePresentKHR(), whereas the queue submission was done earlier. So if the command buffer was never "consumed" from the queue submission, I could see how the validation layer might conclude that it wasn't executed before the present.
Peter Harris said:Are you sure the engine is passing in the correct command buffers on Mali?
I am as sure as I possibly can be. I used printouts to verify that the correct command buffers were being written to, as well as submitted for the queue submission. Both the usage of the swapchain image at the start of the submission and the present after the submission are protected by semaphores, and I verified that the correct semaphore IDs were used for these operations as well.
Beyond that, I was able to capture the frame trace in RenderDoc, verifying that all the command buffers and related commands were submitted correctly. Additionally, the RenderDoc process that replays the commands on the device will crash with device lost when occlusion query inheritance is enabled in the secondary command buffer, and runs properly with occlusion query inheritance disabled. This is a single frame's commands in isolation in a completely separate process, which I expect would rule out many of the potential errors I could have, especially when it comes to synchronization. The flags set on the secondary command buffer inheritance info are the only difference between the two runs.
Peter Harris said:Possibly, if timing is different on Mali and you are missing a sync, are you passing in a command buffer that has not been recorded or has been reset?
The primary command buffer is always constructed on the same thread that submits it, and the threads that create the secondary command buffers are waited on before submitting them to the primary command buffer. In fact, for the TestScene tester since it only creates a single secondary command buffer, the multithreaded rendering system sees there's only one task and executes it on the main thread, so in this case there's actually no threading going on.
For the Vulkan synchronization, I have two ring buffers of semaphores: one for the queue submissions (6, assuming 3 frames before we synchronize with the GPU and expecting up to 2 submissions per frame), and one for the images used in the swapchains (based on the number of swapchain images). The queue submission semaphores are used to synchronize the call to vkQueuePresentKHR() so that the present happens after the commands finish, while the swapchain image semaphores are used to synchronize the call to vkQueueSubmit() so that the commands run only once the swapchain image is ready.
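As a rough illustration of the ring buffer sizing described above, the sketch below tracks only the index into a ring of semaphores. It is not DeepSea's actual implementation: the SemaphoreRing type, the semaphoreRingAcquire function, and the macro names are assumptions for this example, and the VkSemaphore handles themselves are left out so the code builds without the Vulkan headers.

```c
#include <stdint.h>

/* Sizing from the description: 3 frames in flight, up to 2 submits per frame. */
#define FRAMES_IN_FLIGHT 3u
#define MAX_SUBMITS_PER_FRAME 2u
#define SUBMIT_SEMAPHORE_COUNT (FRAMES_IN_FLIGHT * MAX_SUBMITS_PER_FRAME)

/* Hypothetical ring state; a real version would also hold the semaphore handles. */
typedef struct SemaphoreRing
{
    uint32_t count; /* number of semaphores in the ring */
    uint32_t next;  /* index handed out by the next acquire */
} SemaphoreRing;

/* Returns the index of the semaphore to use and advances the ring, wrapping at count. */
static uint32_t semaphoreRingAcquire(SemaphoreRing* ring)
{
    uint32_t index = ring->next;
    ring->next = (ring->next + 1u) % ring->count;
    return index;
}
```

With this sizing, a semaphore index is only handed out again on the seventh submit, so at one or two submits per frame nothing is re-used within the first three frames.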
When the bug exhibits itself, I see the validation errors on the second frame for the TestScene tester. It hasn't had a chance to re-use any command buffer pools, swapchain images, or semaphores before that point. The device lost error occurs the first time it attempts to synchronize with the GPU after the third frame, which is also before any of these resources have a chance to be re-used.
Thanks for double checking - I'll ask the driver team to take a look.
Yeah, it does indeed look like we have a bug in the driver which causes this to fail. It triggers if you set VkCommandBufferInheritanceInfo.occlusionQueryEnable = VK_TRUE for a secondary command buffer and use it when there is not an active query in the primary. The workaround is to only set occlusionQueryEnable when you have an active query in the primary.
The validation error you are getting probably happens after this occurs, and is a side-effect of this earlier error.
Sorry for the inconvenience, and thank you for reporting it. Cheers, P
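The workaround described in this reply could be sketched as a small decision helper that is evaluated when beginning each secondary command buffer. The SecondaryBeginState type and the occlusionQueryEnableValue function are hypothetical, and the sketch assumes the engine tracks whether an occlusion query is currently active in the primary command buffer. Plain integer values stand in for VK_TRUE and VK_FALSE so the code builds without the Vulkan headers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical state that the engine would track; not part of the Vulkan API. */
typedef struct SecondaryBeginState
{
    bool inheritedQueriesSupported;      /* VkPhysicalDeviceFeatures::inheritedQueries */
    bool primaryHasActiveOcclusionQuery; /* an occlusion query is open in the primary */
} SecondaryBeginState;

/*
 * Returns the value to store in VkCommandBufferInheritanceInfo::occlusionQueryEnable:
 * 1 (VK_TRUE) only when the feature is supported and the primary command buffer
 * has an active occlusion query, otherwise 0 (VK_FALSE).
 */
static uint32_t occlusionQueryEnableValue(const SecondaryBeginState* state)
{
    if (state->inheritedQueriesSupported && state->primaryHasActiveOcclusionQuery)
        return 1u;
    return 0u;
}
```

The trade-off is that the engine has to know the primary command buffer's query state at the time the secondary command buffer is begun, which is the bookkeeping the original approach was trying to avoid.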