We have seen unexpected frame drops since integrating PLS (pixel local storage) into our UE4-based engine. For example, the following scene is rendered with both the multi-pass and the PLS pipelines. The fps shows that the latter suffers a non-trivial performance degradation.
After some analysis, we found that the major cause of this performance penalty is the lack of Early-Stencil culling in PLS, as the Streamline captures show:
Roughly speaking, our deferred shading pipeline is as follows. We have three primary shading models, each with its own fragment shader. During the GBuffer pass, the stencil buffer is tagged with the shading model ID of the geometry's material. The subsequent shading pass is then split into three full-screen quad draws, one per shading model.
In the traditional multi-pass pipeline, we can rely on the Early-Stencil culling mechanism to kill any fragment whose shading model ID does not match the stencil buffer before it enters the fragment shader. However, it turns out that this doesn't work in the PLS case.
In this situation, a large number of fragments are fully shaded and only then killed in the Late-ZS stage.
Please note that, judging from the chart, the Early-stencil test does seem to have happened, but for some reason the culling has not.
I can confirm that we don't do anything unusual, such as discarding pixels or modifying depth, during the shading passes.
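To put a rough number on the cost described above, here is a minimal CPU-side sketch that counts fragment shader invocations for the three full-screen draws with and without early stencil culling. It is only an illustrative model, not driver or hardware behaviour: `ShadingCost`, `shadeAllModels` and the sample stencil contents are hypothetical names and data, and no graphics API is called.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CPU-side model of the shading pass; no graphics API involved.
struct ShadingCost {
    std::size_t shaderInvocations = 0;  // fragments that ran a fragment shader
    std::size_t fragmentsWritten = 0;   // fragments that passed the stencil test
};

// stencil holds the shading model ID tagged per pixel by the GBuffer pass.
// One full-screen draw is issued per entry in modelIds.
ShadingCost shadeAllModels(const std::vector<std::uint8_t>& stencil,
                           const std::array<std::uint8_t, 3>& modelIds,
                           bool earlyStencilCull) {
    ShadingCost cost;
    for (std::uint8_t id : modelIds) {        // one full-screen quad per model
        for (std::uint8_t tag : stencil) {    // one fragment per pixel
            const bool pass = (tag == id);    // stencil test: EQUAL against the ID
            if (earlyStencilCull && !pass) {
                continue;                     // killed before the shader runs
            }
            ++cost.shaderInvocations;         // shader runs; a late test decides
            if (pass) {
                ++cost.fragmentsWritten;
            }
        }
    }
    return cost;
}
```

With six pixels tagged {1, 1, 2, 3, 3, 3} and model IDs {1, 2, 3}, this model runs 6 shader invocations with early culling and 18 without, while writing the same 6 fragments in both cases.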
Unfortunately, I couldn't find any resources about the Early-stencil test in PLS, either in the extension specification (https://registry.khronos.org/OpenGL/extensions/EXT/EXT_shader_pixel_local_storage.txt), in the Arm documentation (https://developer.arm.com/documentation/100587/0100/pixel-local-storage?lang=en), or in any other relevant forums.
So can anyone tell me whether this is a known issue, or whether I have missed something needed to make it work correctly?
Thanks, any replies will be greatly appreciated.
Stencil-based Early-ZS with PLS should be substantially improved with Immortalis-G715/Mali-G715 series hardware, so there have been some hardware improvements here.
If you can share some example shaders or an APK via developer@arm.com we can confirm the expected improvement gets triggered.
Thanks, Peter. We eventually found that the issue has nothing to do with PLS. It is the 'GL_ARM_shader_framebuffer_fetch_depth_stencil' extension that causes Early-stencil culling to be lost. We must fetch the current depth to reconstruct the world-space position during the shading pass. Due to the limited PLS storage and the complexity of our materials, we have no room for depth in the PLS.
To work around this problem, we are considering a single full-screen shading draw instead of three. Another option would be to migrate to Vulkan, but we are worried about Vulkan's performance on older Mali GPU hardware, specifically early-generation Bifrost parts such as the Mali-G72.
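For readers unfamiliar with why the depth fetch is needed, the following is a minimal CPU-side sketch of reconstructing a world-space position from a fetched depth value. It mirrors the kind of math a shading pass would do after reading the depth (for example from 'gl_LastFragDepthARM'), but it is not our actual shader: `reconstructWorldPosition`, the `Vec3`/`Mat4` aliases and the parameter names are hypothetical, and it assumes the default OpenGL ES depth range of [0, 1] and a column-major inverse view-projection matrix supplied by the caller.

```cpp
#include <array>
#include <cassert>

using Vec3 = std::array<float, 3>;
using Mat4 = std::array<float, 16>;  // column-major, as in OpenGL ES

// Reconstructs a world-space position from a fetched depth value.
// u, v:        screen coordinates in [0, 1]
// depth:       window-space depth in [0, 1] (assumed default depth range)
// invViewProj: inverse of (projection * view), assumed precomputed
Vec3 reconstructWorldPosition(float u, float v, float depth,
                              const Mat4& invViewProj) {
    // Map screen coordinates and depth to normalized device coordinates.
    const float ndc[4] = {u * 2.0f - 1.0f, v * 2.0f - 1.0f,
                          depth * 2.0f - 1.0f, 1.0f};

    // world = invViewProj * ndc (column-major matrix times column vector).
    float world[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    for (int row = 0; row < 4; ++row) {
        for (int col = 0; col < 4; ++col) {
            world[row] += invViewProj[col * 4 + row] * ndc[col];
        }
    }

    // Perspective divide; w must be non-zero for a valid position.
    assert(world[3] != 0.0f);
    return {world[0] / world[3], world[1] / world[3], world[2] / world[3]};
}
```

The depth value is the only per-pixel input here, which is why it has to come from somewhere: either from PLS storage, or from a framebuffer depth fetch as in our case.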
I don't think Vulkan would be a magic fix for this, assuming the algorithm is the same, given it maps on to the same hardware functionality.
On all hardware prior to the Mali-G715 series we use a single dependency tracker for depth and stencil, so use of one in the shader can cause false dependencies for the other. Mali-G715 switches to separate ZS trackers to remove the false dependencies.
Just to understand your use case: are all draws that update depth happening first in the render pass (i.e. object meshes update depth, then decal or lighting layers just test against it but don't write it)? And is it only the later layers that are reading depth?
If we switch to Vulkan, we will follow its sub-pass paradigm. In that case, the depth buffer image would be declared as a depth input attachment in the render pass setup. The mesh pass is then the main pass, followed by three sub-passes corresponding to the three full-screen shading draws of our current OpenGL ES deferred pipeline. During those sub-passes, we would fetch the depth via 'subpassLoad' instead of 'gl_LastFragDepthARM'.
Each sub-pass has its own pipeline. The depth-stencil states of those pipelines are equivalent to the states of their OpenGL ES counterparts.
Since fetching depth via 'gl_LastFragDepthARM' breaks early-stencil culling, does the same happen when fetching it via 'subpassLoad' in Vulkan?
Besides this, we are also concerned about other performance issues in Vulkan, such as compute shader efficiency. We have heard that these differ quite a lot between different hardware as well as between different driver implementations. So are there any pitfalls or caveats when switching from OpenGL ES to Vulkan?
Exactly. The mesh draws that update the depth-stencil buffer are performed first. The lighting layers that follow only read and test against the stencil, and never modify it.