Tile-based architectures, like Arm Mali and Immortalis GPUs, allow fragment shaders access to the contents of the framebuffer from tile memory. This can be beneficial if we need to implement deferred shading. OpenGL ES exposes this functionality in such mechanisms as Pixel Local storage (PLS) and Framebuffer Fetch (FBF), which are explained in detail in this blog by Roberto Lopez Mendez.
Subpasses in Vulkan provides a similar mechanism. Within a subpass, a barrier must be inserted between each color attachment write and the corresponding input attachment read. This makes techniques such as programmable blending difficult to implement because you cannot insert these barriers between primitives in the same draw call.
Additionally, we expect new Vulkan applications to prefer Dynamic Rendering, where there is only a single subpass per render pass. In this case, we need a new way to support the use-cases that previously required multiple subpasses.
For these reasons, we have added a set of new extensions to Vulkan. We will introduce these extensions in this blog post but first let's explain why they are necessary.
Imagine the following scenario, a subpass uses the same framebuffer attachment both as input and output for color or depth and stencil. That means that the subpass will be able to read and write the data from and to the same attachment. This situation is called a render pass feedback loop.
An access pattern causing a render pass feedback loop
There are exceptions if the subpass reads only those components that are not written to. For example, if it reads RG channels and writes to BA channels, then there is no feedback loop.
What is the problem with render pass feedback loops? The writes are not automatically made visible to reads. So, within the same subpass, it is a data race. The behavior in this case is undefined and the application may simply crash.
If the crash does not happen, consider the following problem related with data race. When sampling the attachment at a given coordinate, we do not know whether we will get the value updated by the previous shader invocation. It can lead to artifacts shown on the following image:
Rendering artifacts due to data races
We can avoid the render pass feedback loop by adding a subpass self-dependency. That means that we first specify a subpass dependency during render pass initialization with srcSubpass and dstSubpass values pointing to the same subpass. This enables us to later add memory barriers for the attachment using vkCmdPipelineBarrier within the render pass instance to synchronize read accesses prior to the barrier with write accesses after the barrier. One limitation of this approach, is that the barrier can only be inserted between draw calls; it does not allow us to synchronize accesses to the attachment within a draw call. This matters for use-cases such as programmable blending where triangles in a single draw call may be overlapping.
vkCmdPipelineBarrier
Let us now look at how we can synchronize attachment access without the need for subpass self-dependencies and explicit pipeline barriers.
The VK_EXT_rasterization_order_attachment_access extension provides a mechanism for accessing the contents of framebuffer attachments from one fragment to the next in rasterization order.
To enable this functionality for a color attachment, you need to add the VK_PIPELINE_COLOR_BLEND_STATE_CREATE_RASTERIZATION_ORDER_ATTACHMENT_ACCESS_BIT_EXT bit to the flags value of pColorBlendState when initializing the rendering pipeline. When using render pass objects, you also need to set the VK_SUBPASS_DESCRIPTION_RASTERIZATION_ORDER_ATTACHMENT_COLOR_ACCESS_BIT_EXT bit in the flags value of pSubpasses when creating the render pass.
VK_PIPELINE_COLOR_BLEND_STATE_CREATE_RASTERIZATION_ORDER_ATTACHMENT_ACCESS_BIT_EXT
pColorBlendState
VK_SUBPASS_DESCRIPTION_RASTERIZATION_ORDER_ATTACHMENT_COLOR_ACCESS_BIT_EXT
Similarly, you can enable it for depth and stencil attachments by adding the VK_PIPELINE_DEPTH_STENCIL_STATE_CREATE_RASTERIZATION_ORDER_ATTACHMENT_STENCIL_ACCESS_BIT_EXT bit to the flags value of pDepthStencilState.
VK_PIPELINE_DEPTH_STENCIL_STATE_CREATE_RASTERIZATION_ORDER_ATTACHMENT_STENCIL_ACCESS_BIT_EXT
pDepthStencilState.
This creates a framebuffer-local dependency for each fragment, just like PLS or FBF in OpenGL ES.
Note: If you have an older device with a Mali GPU where this extension is not exposed; it is still possible to experiment with this functionality because the implementation will synchronize the attachment accesses by default. In this case, it is still necessary to specify a subpass self-dependency, but the memory barrier is not needed.
Originally, this extension could not be used with dynamic rendering, but with the recent introduction of VK_KHR_dynamic_rendering_local_read this is now possible. We'll say more about this in the next section.
VK_KHR_dynamic_rendering_local_read
Finally, we should briefly mention the VK_ARM_rasterization_order_attachment_access extension. This provides identical functionality with the EXT extension. In fact, the two extensions are aliases of each other. There may be a small number of devices in the market that only support the ARM version. But if your device supports both extensions, there’s never a reason to use the ARM vendor extension.
The purpose of the VK_EXT_shader_tile_image extension is to provide rasterization order access to the contents of framebuffer attachments when using dynamic rendering, and to do so in a way that is explicit and guaranteed to access data from tile memory.
When designing this extension, we had a choice between two abstraction levels:
For VK_EXT_shader_tile_image we chose the first approach. The recently introduced VK_KHR_dynamic_rendering_local_read extension takes the second approach.
When VK_KHR_dynamic_rendering_local_read is used in combination with VK_EXT_rasterization_order_attachment_access it provides the same functionality as VK_EXT_shader_tile_image. For that reason, VK_KHR_dynamic_rendering_local_read targets all GPUs, Khronos decided to include it in the Vulkan Roadmap 2024 Milestone. Developers may prefer to use that extension for portability when it is available because it uses the same shader interface as subpasses. Also, it has the benefit of being compatible with the examples in the following sections. But if VK_EXT_shader_tile_image is also present on the device, you can be confident that framebuffer attachment data is kept on-chip during the dynamic render pass.
VK_EXT_rasterization_order_attachment_access
VK_EXT_shader_tile_image
Mentioned previously, it is impractical to implement programmable blending using subpasses and subpass self-dependencies. For some use cases we can rely on default blending in Vulkan, more complicated blending algorithms require doing it in shader. For example, the previous color values are loaded from i_color attachment and then updated by assigning the value to o_color, which is the same attachment:
layout(input_attachment_index = 0, binding = 0) uniform subpassInput i_color; layout(location = 0) out vec4 o_color; layout(set = 0, binding = 0) uniform BlendUniform { vec4 a; vec4 b; } blend_uniform; void main(void) { vec4 color = subpassLoad(i_color); color = lerp(color, blend_uniform.a, color.w * blend_uniform.a.w); color *= blend_uniform.b; o_color = color; }
In other words, the value of i_color at the current coordinate is overwritten and can be used in subsequent draw calls.
A detailed explanation of deferred shading implementation on Mali is covered in a previous blog from Hans-Kristian Arntzen. It explains how to use subpasses and attachments to implement this technique. Following that explanation, we split the rendering into two steps:
In the geometry pass, we render all the objects in the scene and store their properties (albedo, normal, depth) in the output attachments. These are then used in the next pass to calculate lighting. In the following image, you can see multiple lights applied in the final subpass:Example scene using deferred shading
The lighting shader looks like this:
layout(input_attachment_index = 0, binding = 0) uniform subpassInput i_depth; layout(input_attachment_index = 1, binding = 1) uniform subpassInput i_albedo; layout(input_attachment_index = 2, binding = 2) uniform subpassInput i_normal; layout(location = 0) out vec4 o_color; layout(set = 0, binding = 0) uniform LightsInfo { Light lights[MAX_LIGHT_COUNT]; } lights_info; ... void main() { // Read G-Buffer values using subpassLoad // Calculate world position and normal vec3 L = vec3(0.0); for (uint i = 0U; i < MAX_LIGHT_COUNT; ++i) { L += calculate_light(position, normal, lights_info.lights[i]); } o_color = vec4(ambient_color + L * albedo.xyz, 1.0); }
The rasterization order extension enables us to implement the same logic in a single subpass. We can avoid the need for a fixed MAX_LIGHT_COUNT value. Instead, we accumulate the final color value in the lighting pass, calling the shader for each light and adding to the previous values (probably with some custom blending logic).
MAX_LIGHT_COUNT
In this version of the shader, we are using subpassLoad to read the previous color values, and then write the updated value to o_color, which is bound to the same attachment:
layout(input_attachment_index = 0, binding = 0) uniform subpassInput i_light; layout(input_attachment_index = 1, binding = 1) uniform subpassInput i_depth; layout(input_attachment_index = 2, binding = 2) uniform subpassInput i_albedo; layout(input_attachment_index = 3, binding = 3) uniform subpassInput i_normal; layout(location = 0) out vec4 o_color; ... void main() { // Read G-Buffer values using subpassLoad // Calculate world position and normal vec3 previous_light_value = subpassLoad(i_light).xyz; // Add ambient light value, if it’s the first light at this location previous_light_value = max(previous_light_value, ambient_color); vec3 light_value = calculate_light(position, normal, light); o_color = vec4(previous_light_value + light_value * albedo.xyz, 1.0); }
Subpasses and subpass dependencies are great built-in mechanisms in Vulkan. This helps us to access tile-memory on mobile GPUs in an efficient way. Similar functionality existed in OpenGL ES using pixel local storage and framebuffer fetch extensions. However, the OpenGL ES extensions add more freedom for implementing such techniques as programmable blending.
With the extensions described in this blog, you can now do the same in Vulkan without needing to specify subpass self-dependencies.
In summary, let us recap when you should use the different extensions mentioned in this blog for framebuffer fetch: