The Vulkan and OpenGL ES APIs expose mechanisms for fetching framebuffer attachments on-tile. A special feature of tile-based GPUs, like Arm Mali, is the ability to implement fast programmable blending and on-chip deferred shading-like techniques.
In this post, we will look at how these features can be used to implement deferred shading specifically, a style of rendering which is still quite common. Deferred techniques have evolved over time, but the fundamental structure remains: a G-buffer is rendered, and lighting is computed based on that G-buffer data. This decouples geometry information from shading. Most of the innovation in recent years in the deferred space has revolved around reformulating the lighting pass, but the fundamentals remain the same.
For example, here we have a typical G-buffer layout of albedo, normals, material parameters (here: metallic-roughness) and depth.
These G-buffers are then used to shade the final pixel; here, just a trivial directional light.
The traditional way to implement this style of rendering is the use of multiple render targets (MRT) with a G-buffer pass, which is then followed by a lighting pass, where the G-buffer is sampled with plain old textures. This is still the conventional way on immediate mode GPUs (primarily desktop).
First, the G-buffer pass:

#version 320 es
precision mediump float;

// B10G11R11_UFLOAT
layout(location = 0) out vec3 Emissive;
// RGBA8_SRGB
layout(location = 1) out vec3 Albedo;
// B10G10R10A2_UNORM
layout(location = 2) out vec3 Normal;
// R8G8_UNORM
layout(location = 3) out vec2 MetallicRoughness;

layout(location = 0) in highp vec2 vUV;
layout(location = 1) in vec3 vNormal;
layout(location = 2) in vec4 vTangent;

layout(binding = 0) uniform sampler2D TexAlbedo;
layout(binding = 1) uniform sampler2D TexNormal;
layout(binding = 2) uniform sampler2D TexMetallicRoughness;

void main()
{
    Albedo = texture(TexAlbedo, vUV).rgb;
    MetallicRoughness = texture(TexMetallicRoughness, vUV).xy;

    // Many different ways to implement this.
    vec2 tangent_xy = 2.0 * texture(TexNormal, vUV).xy - 1.0;
    float tangent_z = sqrt(max(0.0, 1.0 - dot(tangent_xy, tangent_xy)));
    vec3 tangent_normal = vec3(tangent_xy, tangent_z);
    vec3 bitangent = cross(vNormal, vTangent.xyz) * vTangent.w;
    mat3 TBN = mat3(vTangent.xyz, bitangent, vNormal);
    vec3 normal = normalize(TBN * tangent_normal);

    // [-1, 1] -> [0, 1] range.
    Normal = 0.5 * normal + 0.5;

    // This may or may not be relevant.
    // If present, we can reuse the lighting accumulation attachment.
    Emissive = vec3(0.0);
}
Then the lighting pass, which samples the G-buffer with plain old textures:

#version 320 es
precision mediump float;

// B10G11R11_UFLOAT
layout(location = 0) out vec3 Light;

layout(binding = 4) uniform Parameters
{
    highp mat4 inv_view_projection;
    highp vec2 inv_resolution;
};

// Textures we rendered to in G-buffer pass.
layout(binding = 0) uniform sampler2D GBufferAlbedo;
layout(binding = 1) uniform sampler2D GBufferNormal;
layout(binding = 2) uniform sampler2D GBufferMetallicRoughness;
layout(binding = 3) uniform sampler2D GBufferDepth;

vec3 compute_light(vec3 albedo, vec3 normal, vec2 metallic_roughness, highp vec3 world)
{
    // Arbitrary complexity.
    // ...
}

void main()
{
    ivec2 coord = ivec2(gl_FragCoord.xy);
    highp float depth = texelFetch(GBufferDepth, coord, 0).x;
    vec2 metallic_roughness = texelFetch(GBufferMetallicRoughness, coord, 0).xy;
    vec3 normal = 2.0 * texelFetch(GBufferNormal, coord, 0).xyz - 1.0;
    vec3 albedo = texelFetch(GBufferAlbedo, coord, 0).rgb;

    // Reconstruct world position.
    highp vec4 clip = vec4(2.0 * gl_FragCoord.xy * inv_resolution - 1.0, depth, 1.0);
    highp vec4 world4 = inv_view_projection * clip;
    highp vec3 world = world4.xyz / world4.w;

    Light = compute_light(albedo, normal, metallic_roughness, world);
}
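API-side, a sketch of the MRT framebuffer setup for the G-buffer pass might look like the following (texture creation is elided, and the tex_* names are placeholders):

GLuint fbo;
glGenFramebuffers(1, &fbo);
glBindFramebuffer(GL_DRAW_FRAMEBUFFER, fbo);

// Attach the G-buffer textures (created elsewhere with matching formats).
glFramebufferTexture2D(GL_DRAW_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_TEXTURE_2D, tex_emissive, 0);
glFramebufferTexture2D(GL_DRAW_FRAMEBUFFER, GL_COLOR_ATTACHMENT1, GL_TEXTURE_2D, tex_albedo, 0);
glFramebufferTexture2D(GL_DRAW_FRAMEBUFFER, GL_COLOR_ATTACHMENT2, GL_TEXTURE_2D, tex_normal, 0);
glFramebufferTexture2D(GL_DRAW_FRAMEBUFFER, GL_COLOR_ATTACHMENT3, GL_TEXTURE_2D, tex_metallic_roughness, 0);
glFramebufferTexture2D(GL_DRAW_FRAMEBUFFER, GL_DEPTH_ATTACHMENT, GL_TEXTURE_2D, tex_depth, 0);

// Enable all four color outputs.
const GLenum draw_buffers[] = {
    GL_COLOR_ATTACHMENT0, GL_COLOR_ATTACHMENT1,
    GL_COLOR_ATTACHMENT2, GL_COLOR_ATTACHMENT3,
};
glDrawBuffers(4, draw_buffers);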
Blending is enabled for the color attachment, and we can render as many lights as we want, each with different shaders. This kind of flexibility was a key early motivation for deferred shading, as shaders back in the mid-2000s had to be extremely simple and highly specialized. The downside of this technique is its fill-rate, memory, and bandwidth requirements. Storing large G-buffers is costly, and shading multiple lights with overdraw is a big drain on bandwidth on immediate-mode GPUs.
With some help from the engine developer, there is a lot to gain by exploiting tile memory. For example, in the case of deferred shading, a straightforward implementation consumes a certain amount of bandwidth. For the sake of simplicity and clarity, let us use normalized units and call them BU, where 1 BU represents width * height * sizeof(pixel) bytes of either reads or writes to external memory.
- Write G-buffer: 4 BU (albedo, normal, pbr, depth)
- Read G-buffer in lighting pass: >= 4 BU (could be more if caches are thrashed or there is lots of overdraw)
- Blend lighting buffer in lighting pass: >= 2 BU (1 BU read + 1 BU write, could be more with overdraw)
Cost: ~10 BU
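To put BU in concrete terms, here is a quick back-of-the-envelope calculation in C, assuming (purely for illustration) a 1080p framebuffer and 4 bytes per pixel:

#include <stdio.h>

int main(void)
{
    // 1 BU = width * height * sizeof(pixel) bytes of external memory traffic.
    const double bu = 1920.0 * 1080.0 * 4.0; // ~8.3 MB at 1080p, 4 bytes/pixel
    const double MB = 1e6;

    printf("1 BU           : %.1f MB\n", bu / MB);                  // ~8.3 MB
    printf("Naive (~10 BU) : %.1f MB per frame\n", 10.0 * bu / MB); // ~82.9 MB
    printf("At 60 FPS      : %.2f GB/s\n", 10.0 * bu * 60.0 / 1e9); // ~4.98 GB/s
    return 0;
}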
The first observation we make is that whenever a texelFetch() at gl_FragCoord.xy appears, it could map directly to the current pixel's data. Since we have framebuffer fetching mechanisms, we do not even need the texture, and the shader code could be replaced with a hypothetical readDataFromTile(). However, the assumption that pixels reside on-tile only holds if everything happens in the same render pass. Thus, to make tile-based GPUs shine, the G-buffer and lighting passes must happen back-to-back.
Writing a G-buffer attachment or reading from it happens entirely on-chip, and thus costs no external memory bandwidth.
Tile-based GPUs hold their framebuffers on-chip, and can choose whether the result actually needs to be flushed out to main memory. This flush is the only real bandwidth cost associated with framebuffer rendering, and we can choose to skip it. glInvalidateFramebuffer() in OpenGL ES and STORE_OP_DONT_CARE in Vulkan express this intent, as sketched below.
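For example, in OpenGL ES, discarding the transient G-buffer attachments after the lighting pass might look like this (a minimal sketch; the attachment indices are assumptions matching the layout used earlier):

// The G-buffer and depth never need to reach main memory;
// only the light accumulation attachment (attachment 0) is kept.
const GLenum discards[] = {
    GL_COLOR_ATTACHMENT1, // Albedo
    GL_COLOR_ATTACHMENT2, // Normal
    GL_COLOR_ATTACHMENT3, // MetallicRoughness
    GL_DEPTH_ATTACHMENT,
};
// Call before rebinding the framebuffer so the flush can be skipped.
glInvalidateFramebuffer(GL_DRAW_FRAMEBUFFER, 4, discards);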
In the best-case scenario, we only pay 1 BU to write the final shaded light attachment, which yields a nice ~10x bandwidth reduction.
This kind of optimization hinges on some key assumptions. There are various scenarios which will break them, and we will have to settle for a less complete optimization as a result. The developer will need to weigh these losses against the achieved image quality.
A technique like screen-space ambient occlusion (SSAO) needs to sample depth at random offsets to build a rough approximation of true ambient occlusion. This occlusion mask then needs to modulate ambient lighting in the lighting pass.
The naïve workaround here is to split rendering into G-buffer → SSAO → Lighting, but now we are back to the 10 BU bandwidth cost, and any tile optimizations become meaningless.
Instead, we can try to split rendering into: Depth pre-pass (1 BU write) → SSAO → [G-buffer → Lighting] (1 BU depth read + 1 BU light write), where the G-buffer pass treats depth as read-only. This lessens the bandwidth hit, but requires a depth pre-pass, which has an overhead of its own.
Another idea would be to implement SSAO through reprojection of the previous frame's depth. The frame would then look something like: SSAO (previous frame's depth) → [G-buffer → Lighting] (1 BU light write + 1 BU depth write).
If we need depth in a later pass (very likely in this day and age), we cannot avoid writing it out. Fortunately, this costs just 1 BU, and is not a big deal.
We can also imagine scenarios where we need other G-buffer attachments in other rendering passes, with screen-space reflection techniques coming to mind. Similar ideas apply, where we should try to avoid splitting G-buffer and lighting passes. Essentially, to gain anything from having tile memory, we need to be able to fuse G-buffer → Lighting into one pass.
With the theory out of the way, let us implement this on Arm Mali. The three methods we will go through here all have their pros and cons, but accomplish more or less the same thing:

- Pixel local storage (GL_EXT_shader_pixel_local_storage)
- Framebuffer fetch (GL_EXT_shader_framebuffer_fetch)
- Vulkan render passes with input attachments (multipass)
As explained earlier, it is critical that we keep pixel data on-tile, so under no circumstances must we rebind framebuffers with glBindFramebuffer() (in GLES) or end the render pass with vkCmdEndRenderPass() (in Vulkan). Both commands cause pixel data to be flushed to memory.
API availability: OpenGL ES 3.0+. GPUs: Arm Mali Midgard family and later.
In this style, we have raw access to the tile buffer. We will have to limit ourselves to 128 bits per pixel, but this should not be a problem with the G-buffer layout we chose in the previous examples.
This API was introduced at a time before floating-point render targets were the norm, so HDR lighting is handled with programmable blending. At the end, we add a "resolve" pass, which converts the raw PLS data into a real color attachment.
Rather than declare color attachment outputs, we declare a view of the raw tile memory and write there:
#extension GL_EXT_shader_pixel_local_storage : require

__pixel_local_outEXT OutPLS
{
    layout(r11f_g11f_b10f) vec3 Emissive;
    layout(rgba8) vec4 Albedo;
    layout(rgb10_a2) vec4 Normal;
    layout(rgba8) vec4 MetallicRoughness;
};
Writing to these variables works like writing to any other out variable in a fragment shader. Another modification to consider is the albedo attachment. PLS does not use true hardware render targets, so to deal with sRGB, we have to add some ALU work ourselves:
// Albedo is linear (sRGB texture), and to preserve accuracy, we need to encode.
// Gamma 2.0 is a decent choice since sqrt() is significantly faster than pow().
Albedo.rgb = sqrt(clamp(Albedo.rgb, vec3(0.0), vec3(1.0)));
Fortunately, ALU is rarely the bottleneck in G-buffer shaders, so this should not matter in practice. Decoding in the lighting pass later is just a squaring operation.
Here, we need to declare the same G-buffer layout as an input and output. We also need to recover G-buffer depth, so we enable GL_ARM_shader_framebuffer_fetch_depth_stencil.
#extension GL_EXT_shader_pixel_local_storage : require
#extension GL_ARM_shader_framebuffer_fetch_depth_stencil : require

__pixel_localEXT InOutPLS
{
    layout(r11f_g11f_b10f) vec3 Light;
    layout(rgba8) vec4 Albedo;
    layout(rgb10_a2) vec4 Normal;
    layout(rgba8) vec4 MetallicRoughness;
};

// Inside main():
highp float depth = gl_LastFragDepthARM;
vec2 metallic_roughness = MetallicRoughness.xy;
vec3 normal = 2.0 * Normal.xyz - 1.0;
vec3 albedo = Albedo.rgb * Albedo.rgb; // Convert back to linear.

// Programmable blending :D
Light += compute_light(albedo, normal, metallic_roughness, world);
Before we end rendering, we must write the final result over to a real render target. As mentioned, in the days of GLES 3.0, we did not necessarily have floating-point render target support, so merely copying the HDR light data into a render target was not possible. The natural thing to do here is to do some kind of tone map, so it can fit into a UNORM or sRGB render target.
To optimize this some more, if we know that the last light we render is a “full-screen” quad, for example a directional light, we can combine tone map and lighting into one shader.
#extension GL_EXT_shader_pixel_local_storage : require
#extension GL_ARM_shader_framebuffer_fetch_depth_stencil : require

__pixel_local_inEXT InPLS
{
    layout(r11f_g11f_b10f) vec3 Light;
    layout(rgba8) vec4 Albedo;
    layout(rgb10_a2) vec4 Normal;
    layout(rgba8) vec4 MetallicRoughness;
};

// RGBA8, SRGB8, RGB10A2, ...
layout(location = 0) out vec4 RenderTarget;

void main()
{
    // ...
    vec3 Final = Light + compute_directional_light(albedo, normal, metallic_roughness, world);
    RenderTarget = tonemap_to_sdr(Final);
}
With GL_EXT_shader_framebuffer_fetch on r29 drivers, we move to a more traditional rendering setup with multiple render targets. In the shaders, we interact with real hardware render targets, and we lose the ability to freely reinterpret tile memory. For deferred shading, however, we do not need that functionality anyway, since the formats remain fixed. For effective HDR rendering, we do need floating-point render target support, which became a mandatory feature in GLES 3.2.
The G-buffer pass remains the same as in the traditional, immediate-mode style of rendering, which is convenient.
Rather than binding the G-buffer attachments as textures at this point and changing the FBO, we keep going within the same render pass.
#extension GL_EXT_shader_framebuffer_fetch : require
#extension GL_ARM_shader_framebuffer_fetch_depth_stencil : require

// inout rather than out
layout(location = 0) inout vec3 Light;
layout(location = 1) inout vec3 Albedo;
layout(location = 2) inout vec3 Normal;
layout(location = 3) inout vec2 MetallicRoughness;

// Still need the Arm extension to read G-buffer depth.
highp float depth = gl_LastFragDepthARM;

// The other attachments are fetched simply by reading the inout variables.
vec3 albedo = Albedo;                // sRGB hardware format, fetched as linear.
vec3 normal = 2.0 * Normal - 1.0;    // [0, 1] -> [-1, 1] range.
vec2 metallic_roughness = MetallicRoughness;

// Accumulate light.
Light += compute_light(albedo, normal, metallic_roughness, world);
With framebuffer fetch, we can choose whether we use programmable blending or fixed-function blending. If we opt in to API-side blending, we need to be very careful and limit blending to just the Light render target, or weird things can happen. glEnablei() can enable blending at a per-attachment level, as sketched below. In practice, we might as well just use programmable blending here.
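A minimal sketch of the API-side route (indexed blend state requires GLES 3.2 or GL_EXT_draw_buffers_indexed):

// Additive blending on the Light attachment (index 0) only.
glEnablei(GL_BLEND, 0);
glBlendFunci(0, GL_ONE, GL_ONE);

// Make sure the G-buffer attachments are left alone.
glDisablei(GL_BLEND, 1);
glDisablei(GL_BLEND, 2);
glDisablei(GL_BLEND, 3);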
Unlike pixel local storage, there is no need to "resolve" anything here, since the Light render target we accumulated into is bound to a real texture. If we want to implement on-tile tone-mapping, however, we have to be a bit creative, since we lose the ability to reinterpret tile formats. Fortunately, the albedo attachment tends to be SRGB8, and is free real estate here.
layout(location = 0) inout vec3 Light;
layout(location = 1) out vec3 Albedo;

Albedo = tonemap_to_sdr(Light);
The albedo render target is now the attachment we do not discard in the end.
In Vulkan, as it stands, we cannot directly access tile storage like in the GLES extensions. Instead, we give the driver enough information up front that it can optimize a render pass into framebuffer fetch. We also need to modify the shaders a little, which makes them look a bit like GL_EXT_shader_framebuffer_fetch. The key strength of Vulkan's abstraction here is that the implementation can be written once and run optimally on any type of GPU.
The fundamental abstraction in Vulkan is the input attachment, which looks a lot like the input variable in framebuffer fetch. In Vulkan GLSL, we represent this with a subpassInput.
#version 320 es
precision mediump float;

layout(input_attachment_index = 0, set = 0, binding = 0) uniform mediump subpassInput SomeAttachment;
layout(location = 0) out vec4 SomeOutput;

void main()
{
    SomeOutput = subpassLoad(SomeAttachment);
}
In the shader itself, we have no idea what this means yet. The compiler gives it meaning later, when we provide a VkRenderPass. A peculiarity we need to consider is that subpass inputs are potentially backed by a real texture, so we have to bind them to a descriptor set. The input_attachment_index decoration is there so that the compiler knows how to fetch the data from tile if it wants to.
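Because of that, the input attachment's image view still has to be written into the descriptor set like any other resource. A minimal sketch (device, descriptor_set and gbuffer_albedo_view are placeholders):

VkDescriptorImageInfo image_info = {
    .sampler = VK_NULL_HANDLE, // Input attachments are never sampled.
    .imageView = gbuffer_albedo_view,
    .imageLayout = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL,
};

VkWriteDescriptorSet write = {
    .sType = VK_STRUCTURE_TYPE_WRITE_DESCRIPTOR_SET,
    .dstSet = descriptor_set,
    .dstBinding = 0, // Matches binding = 0 in the shader.
    .descriptorCount = 1,
    .descriptorType = VK_DESCRIPTOR_TYPE_INPUT_ATTACHMENT,
    .pImageInfo = &image_info,
};
vkUpdateDescriptorSets(device, 1, &write, 0, NULL);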
In Vulkan, we create render pass objects which express how a render pass is laid out. A render pass can consist of multiple subpasses. In our case, we will need 2 subpasses:

- Subpass #0: renders the G-buffer (and emissive/light) attachments
- Subpass #1: accumulates lighting on top
We can set up dependencies between the subpasses. For deferred shading, the critical things to consider are:

- Subpass #1 must wait for the color and depth writes of subpass #0 before reading them as input attachments.
- The dependency is strictly per-pixel, so it should be marked VK_DEPENDENCY_BY_REGION_BIT; this is what allows a tile-based driver to keep the data on-tile instead of flushing between subpasses.
A subpass has attachments set up, so that we can express what each subpass reads and writes. For the first subpass:

- Color attachments: light, albedo, normal, metallic-roughness
- Depth-stencil attachment: depth (read/write)
For the second subpass:

- Color attachment: light (with blending enabled)
- Input attachments: albedo, normal, metallic-roughness, depth
- Depth-stencil attachment: depth (read-only)
For the attachments, we use STORE_OP_DONT_CARE for everything except the final light attachment, as sketched below.
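Putting the pieces together, a sketch of the subpass dependency and attachment setup might look like this (abbreviated; subpass descriptions, attachment references and the remaining attachments are elided):

// Subpass #1 consumes what subpass #0 produced, strictly per-pixel.
VkSubpassDependency dependency = {
    .srcSubpass = 0,
    .dstSubpass = 1,
    .srcStageMask = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT |
                    VK_PIPELINE_STAGE_LATE_FRAGMENT_TESTS_BIT,
    .dstStageMask = VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT,
    .srcAccessMask = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT |
                     VK_ACCESS_DEPTH_STENCIL_ATTACHMENT_WRITE_BIT,
    .dstAccessMask = VK_ACCESS_INPUT_ATTACHMENT_READ_BIT,
    // BY_REGION is what lets a tiler keep everything on-tile.
    .dependencyFlags = VK_DEPENDENCY_BY_REGION_BIT,
};

// A transient G-buffer attachment: never loaded, never stored.
// (With TRANSIENT_ATTACHMENT usage and LAZILY_ALLOCATED memory,
// it may never even be allocated in main memory.)
VkAttachmentDescription albedo = {
    .format = VK_FORMAT_R8G8B8A8_SRGB,
    .samples = VK_SAMPLE_COUNT_1_BIT,
    .loadOp = VK_ATTACHMENT_LOAD_OP_DONT_CARE,
    .storeOp = VK_ATTACHMENT_STORE_OP_DONT_CARE,
    .initialLayout = VK_IMAGE_LAYOUT_UNDEFINED,
    .finalLayout = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL,
};

// The light attachment is the only one we actually store.
VkAttachmentDescription light = {
    .format = VK_FORMAT_B10G11R11_UFLOAT_PACK32,
    .samples = VK_SAMPLE_COUNT_1_BIT,
    .loadOp = VK_ATTACHMENT_LOAD_OP_CLEAR,
    .storeOp = VK_ATTACHMENT_STORE_OP_STORE,
    .initialLayout = VK_IMAGE_LAYOUT_UNDEFINED,
    .finalLayout = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL,
};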
Unfortunately, we cannot be 100% sure that the driver will actually optimize this. This is the price we pay for compatibility with all devices. This pattern works well on Mali GPUs at least.
The best practices document for Arm Mali can be consulted for further details.
The G-buffer pass is the same as with framebuffer fetch; nothing changes.
The lighting pass is basically the same as with framebuffer fetch, except we use subpassInput, as mentioned earlier. One thing to note is that we do not need a special extension to fetch depth either; depth is read through an input attachment as well. Finally, we have to use fixed-function blending to accumulate light, since programmable blending is not really a thing in Vulkan yet.
#version 320 es
precision mediump float;

// B10G11R11_UFLOAT
layout(location = 0) out vec3 Light;

layout(binding = 4) uniform Parameters
{
    highp mat4 inv_view_projection;
    highp vec2 inv_resolution;
};

// Driver has the information it needs to promote this to framebuffer fetch.
layout(input_attachment_index = 0, set = 0, binding = 0) uniform mediump subpassInput GBufferAlbedo;
layout(input_attachment_index = 1, set = 0, binding = 1) uniform mediump subpassInput GBufferNormal;
layout(input_attachment_index = 2, set = 0, binding = 2) uniform mediump subpassInput GBufferMetallicRoughness;
layout(input_attachment_index = 3, set = 0, binding = 3) uniform highp subpassInput GBufferDepth;

vec3 compute_light(vec3 albedo, vec3 normal, vec2 metallic_roughness, highp vec3 world)
{
    // ...
}

void main()
{
    highp float depth = subpassLoad(GBufferDepth).x;
    vec2 metallic_roughness = subpassLoad(GBufferMetallicRoughness).xy;
    vec3 normal = 2.0 * subpassLoad(GBufferNormal).xyz - 1.0;
    vec3 albedo = subpassLoad(GBufferAlbedo).rgb;

    // Reconstruct world position.
    highp vec4 clip = vec4(2.0 * gl_FragCoord.xy * inv_resolution - 1.0, depth, 1.0);
    highp vec4 world4 = inv_view_projection * clip;
    highp vec3 world = world4.xyz / world4.w;

    // Use fixed-function blending in the API to accumulate.
    Light = compute_light(albedo, normal, metallic_roughness, world);
}
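As a sketch, the additive accumulation in the lighting pipeline's blend state could look like this:

// Light += compute_light(...) for every rendered light volume.
VkPipelineColorBlendAttachmentState blend = {
    .blendEnable = VK_TRUE,
    .srcColorBlendFactor = VK_BLEND_FACTOR_ONE,
    .dstColorBlendFactor = VK_BLEND_FACTOR_ONE,
    .colorBlendOp = VK_BLEND_OP_ADD,
    .srcAlphaBlendFactor = VK_BLEND_FACTOR_ONE,
    .dstAlphaBlendFactor = VK_BLEND_FACTOR_ONE,
    .alphaBlendOp = VK_BLEND_OP_ADD,
    .colorWriteMask = VK_COLOR_COMPONENT_R_BIT |
                      VK_COLOR_COMPONENT_G_BIT |
                      VK_COLOR_COMPONENT_B_BIT,
};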
For mobile, there are many considerations when implementing deferred shading, as the potential gains in bandwidth, power, and battery life are too great to ignore. We have outlined three ways to tap into tile storage, which should improve the overall experience of implementing deferred shading on tile-based GPUs.