The introduction of Vulkan brought not only a more explicit way to control GPU hardware but also a number of novel concepts that unlock new and interesting possibilities. One of those concepts is multipass rendering. Its primary function (though not the only one) is to let tile-based GPUs, like Mali, save precious bandwidth by keeping intermediate subpass results in the on-chip tile buffer, where later subpasses can consume them directly without the data ever being written back to main memory. You can read all about multipass rendering in a previous blog post.
One of the rendering techniques that benefits greatly from multipass rendering is deferred shading. Despite all the benefits that deferred shading offers, it has some glaring disadvantages, the biggest one being the lack of anti-aliasing. Technically it is possible to use MSAA with multipass and deferred shading, but that will at least quadruple the memory requirements for the on-chip tile buffer. In the best case, the increased per-pixel storage size will force the GPU to work on smaller tiles, which typically reduces performance. In the worst case, the requirement will be high enough that we hit the Vulkan specification limit. Either way, using MSAA with deferred shading is problematic.
In this article we explore an anti-aliasing method that works well with deferred shading: temporal anti-aliasing. First we present some implementation details with screenshots, and then we share some details about performance.
The main idea behind temporal AA is quite simple. Where classic MSAA computes all of the subpixels in one go and then resolves them to a single output sample, temporal AA computes a single subpixel per frame and accumulates it with the result of the previous frames [Lottes12]. In other words, it spreads the calculation of subsamples temporally (between frames) rather than spatially (within a single frame).
The implementation can be divided into two key steps: jittering the projection matrix every frame, and resolving the current frame against the history buffer.
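To make the spatial versus temporal distinction concrete, here is a minimal CPU-side sketch. It is not taken from our engine: the helper names, the vertical-edge "scene" and the running-average weights are all assumptions made for illustration. It shades the same set of sub-sample offsets either all at once (MSAA style) or one per frame folded into a history value.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical scene: a vertical edge at edgeX (in pixel units, relative to
// the pixel center). Sub-samples left of the edge are lit, the rest are dark.
inline float ShadeSubsample(float offsetX, float edgeX) {
    return offsetX < edgeX ? 1.0f : 0.0f;
}

// MSAA style: shade every sub-sample in the same frame, then average them.
inline float ResolveSpatial(const std::vector<float>& offsets, float edgeX) {
    assert(!offsets.empty());
    float sum = 0.0f;
    for (float x : offsets) {
        sum += ShadeSubsample(x, edgeX);
    }
    return sum / static_cast<float>(offsets.size());
}

// Temporal style: shade one sub-sample per frame and fold it into a history
// value. A running average is used here so that the result matches the
// spatial resolve after one full cycle (an assumption of this sketch; the
// resolve pass in this article uses a fixed blend factor instead).
inline float ResolveTemporal(const std::vector<float>& offsets, float edgeX) {
    assert(!offsets.empty());
    float history = 0.0f;
    for (std::size_t frame = 0; frame < offsets.size(); ++frame) {
        const float current = ShadeSubsample(offsets[frame], edgeX);
        const float weight = 1.0f / static_cast<float>(frame + 1);
        history += (current - history) * weight;
    }
    return history;
}
```

With four offsets and the edge through the pixel center, both functions arrive at the same 50% coverage; the temporal version simply needs four frames to get there, which is why it only works while the image is static or the history can be re-projected.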
Figure 1: Our test scene without AA
Every frame, the engine should jitter the projection matrix by a sub-pixel offset. The jittered matrix is used in the G-buffer pass as well as in the light pass (for the light volume draw calls). The following code snippet shows how we compute the jitter matrix in our internal demo engine.
    // Compute jittered matrices
    {
        // Sub-sample positions for 16x TAA
        static const Vec2 SAMPLE_LOCS_16[16] = {
            Vec2(-8.0f, 0.0f) / 8.0f,
            Vec2(-6.0f, -4.0f) / 8.0f,
            Vec2(-3.0f, -2.0f) / 8.0f,
            Vec2(-2.0f, -6.0f) / 8.0f,
            Vec2(1.0f, -1.0f) / 8.0f,
            Vec2(2.0f, -5.0f) / 8.0f,
            Vec2(6.0f, -7.0f) / 8.0f,
            Vec2(5.0f, -3.0f) / 8.0f,
            Vec2(4.0f, 1.0f) / 8.0f,
            Vec2(7.0f, 4.0f) / 8.0f,
            Vec2(3.0f, 5.0f) / 8.0f,
            Vec2(0.0f, 7.0f) / 8.0f,
            Vec2(-1.0f, 3.0f) / 8.0f,
            Vec2(-4.0f, 6.0f) / 8.0f,
            Vec2(-7.0f, 8.0f) / 8.0f,
            Vec2(-5.0f, 2.0f) / 8.0f};

        // Sub-sample positions for 8x TAA
        static const Vec2 SAMPLE_LOCS_8[8] = {
            Vec2(-7.0f, 1.0f) / 8.0f,
            Vec2(-5.0f, -5.0f) / 8.0f,
            Vec2(-1.0f, -3.0f) / 8.0f,
            Vec2(3.0f, -7.0f) / 8.0f,
            Vec2(5.0f, -1.0f) / 8.0f,
            Vec2(7.0f, 7.0f) / 8.0f,
            Vec2(1.0f, 3.0f) / 8.0f,
            Vec2(-3.0f, 5.0f) / 8.0f};

        // Let's assume that we are using 8x
    #define SAMPLE_LOCS SAMPLE_LOCS_8
    #define SAMPLE_COUNT 8

        const unsigned SubsampleIdx = m_FrameCount % SAMPLE_COUNT;

        const Vec2 TexSize(1.0f / Vec2(GBufferWidth, GBufferHeight)); // Texel size
        const Vec2 SubsampleSize = TexSize * 2.0f; // That is the size of the subsample in NDC

        const Vec2 S = SAMPLE_LOCS[SubsampleIdx]; // In [-1, 1]

        Vec2 Subsample = S * SubsampleSize; // In [-SubsampleSize, SubsampleSize] range
        Subsample *= 0.5f; // In [-SubsampleSize / 2, SubsampleSize / 2] range

        m_JitterMatrix = Mat4Identity();
        m_JitterMatrix.SetTranslationPart(Vec4(Subsample.x, Subsample.y, 0.0f, 1.0f));
        m_ViewProjectionMatrixJitter = m_ViewMatrix * m_ProjectionMatrix * m_JitterMatrix;
    }
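To sanity-check the scaling in the snippet above, the same arithmetic can be reproduced without any engine types. The sketch below is an illustration only: `Offset2`, `JitterInNdc` and `NdcToPixels` are hypothetical names, and only the 8x pattern is carried over. It confirms that the translation never moves the image by more than half a pixel.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-in for the engine's Vec2.
struct Offset2 {
    float x;
    float y;
};

// The 8x pattern from the snippet above, stored in 1/8 units.
static const Offset2 kSampleLocs8[8] = {
    {-7.0f, 1.0f}, {-5.0f, -5.0f}, {-1.0f, -3.0f}, {3.0f, -7.0f},
    {5.0f, -1.0f}, {7.0f, 7.0f},   {1.0f, 3.0f},   {-3.0f, 5.0f}};

// NDC translation for a given frame. One pixel spans 2 / size in NDC, and the
// snippet halves that range, so the offset reduces to (S / 8) * (1 / size).
inline Offset2 JitterInNdc(unsigned frameCount, float width, float height) {
    assert(width > 0.0f && height > 0.0f);
    const Offset2 s = kSampleLocs8[frameCount % 8u];
    return {s.x / 8.0f / width, s.y / 8.0f / height};
}

// Convert an NDC offset back to pixels: NDC covers 2 units across the screen.
inline Offset2 NdcToPixels(Offset2 ndc, float width, float height) {
    return {ndc.x * width * 0.5f, ndc.y * height * 0.5f};
}
```

For a 1280x720 G-buffer, frame 0 uses the location (-7, 1) / 8, which works out to a shift of -0.4375 pixels horizontally and 0.0625 pixels vertically, and the sequence repeats every 8 frames.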
The interesting part, though, is the resolve pass; this is where all the complexity lies. A naive implementation would simply blend a small fraction of the current buffer into the history buffer.
    float ModulationFactor = 1.0 / 16.0; // Weight of the current sub-pixel
    vec3 CurrentSubpixel = textureLod(CurrentBuffer, UV, 0.0).rgb;
    vec3 History = textureLod(HistoryBuffer, UV, 0.0).rgb;
    // mix(x, y, a) returns x * (1 - a) + y * a, so the history keeps most of the weight
    OutColor = mix(History, CurrentSubpixel, ModulationFactor);
Figure 2: 8x AA Ground truth
The above code produces the best possible result (ground truth) for a static image, as seen in figure 2, but it is expected to break under motion. To fix that we need to re-project the history buffer instead of just sampling it in place. The issue with re-projection is that it creates noticeable ghosting. There are several ways to remove ghosting; in this article we focus on two of them, both of which map the history color into the color range of the current sub-pixel's neighborhood. The first one clamps the history color to the bounding box of the current sub-pixel's neighborhood [Lottes11] [Malan2012]. The second one is called variance clipping [Salvi16].
Axis-aligned bounding box (AABB) clamping is pretty straightforward:
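The re-projection step is not shown in this article, so here is a hedged CPU-side sketch of the usual idea for static geometry. Everything in it is an assumption: the helper types, the function name `ReprojectUv`, and in particular the `currentToPrevious` matrix, which is presumed to be computed on the CPU as the previous frame's view-projection matrix multiplied by the inverse of the current one. It also assumes Vulkan conventions (depth in [0, 1], no Y flip between NDC and UV).

```cpp
#include <cassert>
#include <cmath>

struct Float2 { float x, y; };
struct Float4 { float x, y, z, w; };

// Row-major 4x4 matrix applied to column vectors (hypothetical helper type).
struct Matrix4 {
    float m[4][4];
};

inline Matrix4 IdentityMatrix() {
    Matrix4 r = {};
    for (int i = 0; i < 4; ++i) {
        r.m[i][i] = 1.0f;
    }
    return r;
}

inline Float4 Transform(const Matrix4& a, Float4 v) {
    const float in[4] = {v.x, v.y, v.z, v.w};
    float out[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    for (int r = 0; r < 4; ++r) {
        for (int c = 0; c < 4; ++c) {
            out[r] += a.m[r][c] * in[c];
        }
    }
    return {out[0], out[1], out[2], out[3]};
}

struct Reprojection {
    Float2 uv;   // Where to sample the history buffer
    bool valid;  // False if the point was off screen in the previous frame
};

// Map the current pixel (UV plus sampled depth) to its UV in the previous frame.
inline Reprojection ReprojectUv(Float2 uv, float depth, const Matrix4& currentToPrevious) {
    const Float4 ndc = {uv.x * 2.0f - 1.0f, uv.y * 2.0f - 1.0f, depth, 1.0f};
    const Float4 clip = Transform(currentToPrevious, ndc);
    if (clip.w <= 0.0f) {
        return {{0.0f, 0.0f}, false};
    }
    const Float2 prevUv = {clip.x / clip.w * 0.5f + 0.5f, clip.y / clip.w * 0.5f + 0.5f};
    const bool inside = prevUv.x >= 0.0f && prevUv.x <= 1.0f &&
                        prevUv.y >= 0.0f && prevUv.y <= 1.0f;
    return {prevUv, inside};
}
```

When the camera has not moved the matrix is the identity and the history is sampled in place. When `valid` is false there is no usable history, and falling back to the current sub-pixel alone is a reasonable choice. Moving objects are not handled by this sketch; they typically need per-pixel motion vectors.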
    // textureLodOffset returns a vec4, so keep only the color channels
    vec3 NearColor0 = textureLodOffset(CurrentBuffer, UV, 0.0, ivec2(1, 0)).rgb;
    vec3 NearColor1 = textureLodOffset(CurrentBuffer, UV, 0.0, ivec2(0, 1)).rgb;
    vec3 NearColor2 = textureLodOffset(CurrentBuffer, UV, 0.0, ivec2(-1, 0)).rgb;
    vec3 NearColor3 = textureLodOffset(CurrentBuffer, UV, 0.0, ivec2(0, -1)).rgb;

    vec3 BoxMin = min(CurrentSubpixel, min(NearColor0, min(NearColor1, min(NearColor2, NearColor3))));
    vec3 BoxMax = max(CurrentSubpixel, max(NearColor0, max(NearColor1, max(NearColor2, NearColor3))));

    History = clamp(History, BoxMin, BoxMax);
In the code above we sample only the four direct neighbors (left, right, up and down) of the current texel.
Figure 3: 8x AA with AABB clamping
Variance clipping, as opposed to AABB clamping, is configurable. A high VARIANCE_CLIPPING_GAMMA gives a better result overall, but at the same time the ghosting artefacts increase. A lower VARIANCE_CLIPPING_GAMMA removes more ghosting but increases jittering.
    const float VARIANCE_CLIPPING_GAMMA = 1.0;

    // textureLodOffset returns a vec4, so keep only the color channels
    vec3 NearColor0 = textureLodOffset(CurrentBuffer, UV, 0.0, ivec2(1, 0)).rgb;
    vec3 NearColor1 = textureLodOffset(CurrentBuffer, UV, 0.0, ivec2(0, 1)).rgb;
    vec3 NearColor2 = textureLodOffset(CurrentBuffer, UV, 0.0, ivec2(-1, 0)).rgb;
    vec3 NearColor3 = textureLodOffset(CurrentBuffer, UV, 0.0, ivec2(0, -1)).rgb;

    // Compute the two moments
    vec3 M1 = CurrentSubpixel + NearColor0 + NearColor1 + NearColor2 + NearColor3;
    vec3 M2 = CurrentSubpixel * CurrentSubpixel + NearColor0 * NearColor0 + NearColor1 * NearColor1
        + NearColor2 * NearColor2 + NearColor3 * NearColor3;

    vec3 MU = M1 / 5.0;
    // Rounding can push the variance slightly below zero, so guard the sqrt
    vec3 Sigma = sqrt(max(vec3(0.0), M2 / 5.0 - MU * MU));

    vec3 BoxMin = MU - VARIANCE_CLIPPING_GAMMA * Sigma;
    vec3 BoxMax = MU + VARIANCE_CLIPPING_GAMMA * Sigma;

    History = clamp(History, BoxMin, BoxMax);
Figure 4: 8x AA with variance clipping (gamma 1.0)
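To see how VARIANCE_CLIPPING_GAMMA changes the result, it helps to run the numbers for a single color channel. The following sketch is an illustration under assumptions: `ColorRange`, `VarianceBox` and `ClampHistory` are hypothetical names, and the guard against a slightly negative variance is an addition of this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Accepted range for the history color, single channel.
struct ColorRange {
    float min;
    float max;
};

// Build the range from the current sub-pixel and its four neighbors using the
// first two moments, as in the shader: mean +/- gamma * standard deviation.
inline ColorRange VarianceBox(const float (&neighborhood)[5], float gamma) {
    float m1 = 0.0f;
    float m2 = 0.0f;
    for (float c : neighborhood) {
        m1 += c;
        m2 += c * c;
    }
    const float mu = m1 / 5.0f;
    // Rounding can make the variance slightly negative, so clamp it to zero.
    const float variance = std::max(0.0f, m2 / 5.0f - mu * mu);
    const float sigma = std::sqrt(variance);
    return {mu - gamma * sigma, mu + gamma * sigma};
}

inline float ClampHistory(float history, ColorRange box) {
    return std::min(std::max(history, box.min), box.max);
}
```

For the neighborhood {0.2, 0.4, 0.4, 0.4, 0.6} the mean is 0.4 and the standard deviation is about 0.126. With a gamma of 1.0 a stale history value of 0.9 is pulled down to roughly 0.526, while a gamma of 2.0 lets it stay at roughly 0.653. The larger gamma therefore keeps more of the history, which is where both the smoother result and the extra ghosting come from.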
Both of these techniques aim to remove ghosting, and to some degree they manage to do that quite well. Unfortunately, both of them also introduce some jittering, which appears as mild flickering, when the current sub-pixel color and the history color are very different. One solution is to weight the ModulationFactor using the difference in luminance [Lottes12].
    float Lum0 = ComputeLuminance(CurrentSubpixel);
    float Lum1 = ComputeLuminance(History);

    float Diff = abs(Lum0 - Lum1) / (EPSILON + max(Lum0, max(Lum1, ComputeLuminance(BoxMax))));
    Diff = 1.0 - Diff;
    Diff *= Diff;

    ModulationFactor *= Diff;
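The snippet above relies on ComputeLuminance and EPSILON, which are not shown in this article. The CPU-side sketch below fills them in with assumptions so the behavior of the weight can be checked: the luminance uses the Rec. 709 coefficients, the epsilon is an arbitrary small constant, and `Rgb` and `WeightedModulationFactor` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

struct Rgb {
    float r, g, b;
};

// Assumption: Rec. 709 luma weights. The engine's own function may differ.
inline float ComputeLuminance(Rgb c) {
    return 0.2126f * c.r + 0.7152f * c.g + 0.0722f * c.b;
}

// Scale the base blend factor by the squared similarity in luminance between
// the current sub-pixel and the (already clamped) history color.
inline float WeightedModulationFactor(Rgb current, Rgb history, Rgb boxMax, float baseFactor) {
    const float kEpsilon = 1e-4f;  // Assumption: avoids a division by zero on black pixels
    const float lum0 = ComputeLuminance(current);
    const float lum1 = ComputeLuminance(history);
    const float denominator = kEpsilon + std::max(lum0, std::max(lum1, ComputeLuminance(boxMax)));
    const float diff = std::fabs(lum0 - lum1) / denominator;
    const float weight = 1.0f - diff;
    return baseFactor * weight * weight;
}
```

When the two colors have the same luminance the factor is left untouched, and when they are as different as black and white it drops to almost zero. Normalizing by the largest of the three luminance values keeps the difference in the [0, 1] range regardless of how bright the scene is.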
Temporal AA is quite tricky to get right simply because it's hard to strike a balance between ghosting, jittering and quality. Decreasing the ghosting increases the jittering, and decreasing the jittering decreases the overall quality.
Figure 5: A comparison of the 4 modes (8x AA)
The resolve renderpass samples the depth buffer (for re-projection), the history buffer and the current sub-pixel buffer. Its output goes to a fourth buffer. The resolve pass lies on the heavy side in terms of bandwidth consumption when compared to other post processing effects.
The fragment shader is relatively simple, though. For variance clipping (the most expensive method), the Mali offline compiler reports 29 arithmetic instructions, 1 load-store instruction and 7 texture instructions for the Mali-T880, with no register spilling.
Figure 6: Temporal AA resolve pass in Streamline performance analyzer
Running at 720p on a Galaxy S7 -which contains a Mali-T880 MP12- this pass runs in about 2.4 ms as seen in the screenshot of DS-5 Streamline performance analyzer (figure 6). Please note that this capture was taken with serialized submission to highlight the runtime of the render pass during optimization; in a production build the idle time before and after the pass would not be present.
Temporal AA is a double-edged sword. It's difficult to get it to look right, but on the other hand it's somewhat cheap even for mobile and it works quite well when MSAA cannot be used. Feel free to provide feedback and/or share your thoughts if you have tried to implement it on mobile.