We are seeing unexpected frame drops after integrating PLS (Pixel Local Storage) into our UE4-based engine. For example, the following scene is rendered with both the multi-pass path and the PLS path. The fps counters show that the PLS path suffers a non-trivial performance degradation.
After some analysis, we found that the major cause of this performance penalty is the lack of Early-Stencil culling with PLS, according to the Streamline captures:
Roughly speaking, our deferred shading pipeline is as follows: we have three primary shading models, each with its own fragment shader. During the GBuffer pass, the stencil buffer is tagged with the shading model ID of the geometry's material. The shading pass that follows is split into three full-screen quad draws, each responsible for one shading model.
In the traditional multi-pass pipeline, we can rely on the Early-Stencil culling mechanism to kill the fragments whose shading model ID does not match the stencil buffer before they enter the fragment shader. However, it turns out that this doesn't work in the PLS case.
In this situation, a large number of fragments are fully shaded, only to be killed at the Late-ZS stage.
Note that, judging from the chart, the Early-Stencil test does seem to run. But for some reason, no culling takes place.
I can confirm that we don't do anything unusual during the shading passes, such as discarding (clipping) pixels or modifying depth.
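To make the cost difference concrete, here is a minimal CPU-side sketch of the three stencil-filtered full-screen draws described above. It is only a model of the behaviour we expect versus the behaviour we observe, not driver or GPU code, and the names (PassCost, simulateShadingPasses) are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical cost counters for the whole shading pass (illustration only).
struct PassCost {
    std::size_t shaded = 0;  // fragment shader invocations
    std::size_t kept = 0;    // fragments that pass the stencil test
};

// Models one full-screen draw per shading model, each using a stencil EQUAL test.
// `stencil` holds, per pixel, the shading model ID tagged during the GBuffer pass.
// earlyStencil == true : the test runs before shading, so mismatches are never shaded.
// earlyStencil == false: every fragment is shaded first, then mismatches are killed.
PassCost simulateShadingPasses(const std::vector<std::uint8_t>& stencil,
                               std::uint8_t numShadingModels,
                               bool earlyStencil) {
    PassCost cost;
    for (std::uint8_t ref = 0; ref < numShadingModels; ++ref) {
        for (std::uint8_t s : stencil) {
            const bool pass = (s == ref);
            if (earlyStencil && !pass) {
                continue;  // culled before the fragment shader runs
            }
            ++cost.shaded;
            if (pass) {
                ++cost.kept;
            }
        }
    }
    return cost;
}
```

In this model, for N pixels and three shading models, the early path shades N fragments in total, while the late path shades 3N and throws two thirds of them away. That is the pattern the Late-ZS kills in our capture suggest.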
Unfortunately, I couldn't find any information about the Early-Stencil test with PLS, either in the extension specification
https://registry.khronos.org/OpenGL/extensions/EXT/EXT_shader_pixel_local_storage.txt
or in the Arm documentation
https://developer.arm.com/documentation/100587/0100/pixel-local-storage?lang=en
or on any relevant forums.
So can anyone tell me whether this is a known (if undocumented) limitation, or whether I missed something needed to make it work correctly?
Thanks. Any replies would be greatly appreciated.
If we switch to Vulkan, we are going to follow its sub-pass paradigm. In that case, the depth buffer image would be declared as a depth input attachment in the render pass setup. The mesh pass would be the main pass, followed by three sub-passes corresponding to the three full-screen shading draws of our current OpenGL ES deferred pipeline. During those sub-passes, we would fetch the depth via 'subpassLoad' instead of 'gl_LastFragDepthARM'.
Each sub-pass has its own pipeline. The depth-stencil states of those pipelines are equivalent to the states of their OpenGL ES counterparts.
Since the depth fetch via 'gl_LastFragDepthARM' broke Early-Stencil culling, would fetching via 'subpassLoad' in Vulkan break it as well?
Besides this, we are also concerned about other performance issues in Vulkan, such as compute shader efficiency. We have heard that they differ quite a lot between different hardware as well as between different driver implementations. So are there any pitfalls or caveats when switching from OpenGL ES to Vulkan on Android?