Solving the Multi-sampling Problem in Deferred Shading with Temporal Anti-Aliasing

July 31, 2017

The introduction of Vulkan brought not only a more explicit way to control GPU hardware but also a number of new concepts that unlock interesting possibilities. One of those concepts is multipass rendering. Its primary function (though not the only one) is to let tile-based GPUs, like Mali, save precious bandwidth: intermediate subpass results are stored in the on-chip tile buffer and consumed directly from there, without ever being written back to main memory. Multipass rendering is covered in detail in a previous blog post.

One of the rendering techniques that benefits greatly from multipass rendering is deferred shading. Despite its benefits, deferred shading has some glaring disadvantages, the biggest being the lack of anti-aliasing. Technically it is possible to use MSAA with multipass and deferred shading, but that will at least quadruple the memory requirements for the on-chip tile buffer. In the best case, the increased per-pixel storage size forces the GPU to work on smaller tiles, which typically reduces performance. In the worst case, the requirement is high enough that we hit the Vulkan specification limit. Either way, using MSAA with deferred shading is problematic.

In this article we explore temporal anti-aliasing, an anti-aliasing method that works well with deferred shading. First we present some implementation details with screenshots, and then we share some performance figures.

Implementation details

The main idea behind temporal AA is quite simple. Where classic MSAA computes all of the subpixels in one go and then resolves them to a single output sample, temporal AA computes a single subpixel per frame and accumulates it with the result of the previous frames [Lottes12]. In other words, it spreads the calculation of subsamples temporally (between frames) rather than spatially (within a single frame).
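
To make the spatial-versus-temporal distinction concrete, here is a minimal C++ sketch. It is not taken from the demo engine: the one-dimensional "scene" (`ShadeSample`), the function names and the use of a plain running mean instead of the blend a real resolve pass performs are all assumptions made for illustration. It shows that, for a static image, shading one jittered sub-sample per frame and averaging over eight frames ends up at the same value as shading all eight sub-samples in one frame.

```cpp
#include <array>
#include <cassert>

struct Vec2 { float x, y; };

// 8 sub-sample positions in [-1, 1], relative to the pixel center
// (the same pattern as the 8x table used later in this article).
static const std::array<Vec2, 8> kSampleLocs = {{
    {-7.0f / 8.0f,  1.0f / 8.0f}, {-5.0f / 8.0f, -5.0f / 8.0f},
    {-1.0f / 8.0f, -3.0f / 8.0f}, { 3.0f / 8.0f, -7.0f / 8.0f},
    { 5.0f / 8.0f, -1.0f / 8.0f}, { 7.0f / 8.0f,  7.0f / 8.0f},
    { 1.0f / 8.0f,  3.0f / 8.0f}, {-3.0f / 8.0f,  5.0f / 8.0f}}};

// Hypothetical "scene": white (1.0) left of a vertical edge, black (0.0) right of it.
float ShadeSample(Vec2 s, float edgeX) { return s.x < edgeX ? 1.0f : 0.0f; }

// Spatial approach: shade every sub-sample in the same frame, then average.
float ResolveSpatial(float edgeX)
{
    float sum = 0.0f;
    for (const Vec2& s : kSampleLocs)
        sum += ShadeSample(s, edgeX);
    return sum / float(kSampleLocs.size());
}

// Temporal approach: shade one sub-sample per frame and keep a running mean.
float ResolveTemporal(float edgeX, unsigned frameCount)
{
    assert(frameCount > 0);
    float history = 0.0f;
    for (unsigned frame = 0; frame < frameCount; ++frame)
    {
        const Vec2 s = kSampleLocs[frame % kSampleLocs.size()];
        history += (ShadeSample(s, edgeX) - history) / float(frame + 1);
    }
    return history;
}
```

With the edge through the pixel center (`edgeX = 0.0f`), a single frame yields a hard 0 or 1, while eight accumulated frames yield the same half coverage as `ResolveSpatial`.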

The implementation can be divided into two key steps.

1. Apply a subpixel offset to every drawn primitive. This is typically done by multiplying the projection matrix with a jitter matrix. The jitter matrix is a translation matrix that holds the subpixel offset for X and Y.
2. Modulate (blend) the jittered result of the current frame (current buffer) with the result of the previous frames (history buffer or accumulation buffer). This is a render pass that is commonly called Temporal AA resolve.

Figure 1: Our test scene without AA

The engine should jitter the projection matrix that will be used in the G-buffer pass as well as in the light pass (for the light volume drawcalls). The following code snippet shows how we compute the jitter matrix in our internal demo engine.

// Compute jittered matrices
{
	// Sub-sample positions for 16x TAA
	static const Vec2 SAMPLE_LOCS_16[16] = {
		Vec2(-8.0f, 0.0f) / 8.0f,
		Vec2(-6.0f, -4.0f) / 8.0f,
		Vec2(-3.0f, -2.0f) / 8.0f,
		Vec2(-2.0f, -6.0f) / 8.0f,
		Vec2(1.0f, -1.0f) / 8.0f,
		Vec2(2.0f, -5.0f) / 8.0f,
		Vec2(6.0f, -7.0f) / 8.0f,
		Vec2(5.0f, -3.0f) / 8.0f,
		Vec2(4.0f, 1.0f) / 8.0f,
		Vec2(7.0f, 4.0f) / 8.0f,
		Vec2(3.0f, 5.0f) / 8.0f,
		Vec2(0.0f, 7.0f) / 8.0f,
		Vec2(-1.0f, 3.0f) / 8.0f,
		Vec2(-4.0f, 6.0f) / 8.0f,
		Vec2(-7.0f, 8.0f) / 8.0f,
		Vec2(-5.0f, 2.0f) / 8.0f};

	// Sub-sample positions for 8x TAA
	static const Vec2 SAMPLE_LOCS_8[8] = {
		Vec2(-7.0f, 1.0f) / 8.0f,
		Vec2(-5.0f, -5.0f) / 8.0f,
		Vec2(-1.0f, -3.0f) / 8.0f,
		Vec2(3.0f, -7.0f) / 8.0f,
		Vec2(5.0f, -1.0f) / 8.0f,
		Vec2(7.0f, 7.0f) / 8.0f,
		Vec2(1.0f, 3.0f) / 8.0f,
		Vec2(-3.0f, 5.0f) / 8.0f};
 
 	// Let's assume that we are using 8x
	#define SAMPLE_LOCS SAMPLE_LOCS_8
	#define SAMPLE_COUNT 8

	const unsigned SubsampleIdx = m_FrameCount % SAMPLE_COUNT;
	 
	const Vec2 TexSize(1.0f / Vec2(GBufferWidth, GBufferHeight)); // Texel size
	const Vec2 SubsampleSize = TexSize * 2.0f; // That is the size of the subsample in NDC
	 
	const Vec2 S = SAMPLE_LOCS[SubsampleIdx]; // In [-1, 1]
	 
	Vec2 Subsample = S * SubsampleSize; // In [-SubsampleSize, SubsampleSize] range
	Subsample *= 0.5f; // In [-SubsampleSize / 2, SubsampleSize / 2] range
 
	m_JitterMatrix = Mat4Identity();
	m_JitterMatrix.SetTranslationPart(Vec4(Subsample.x, Subsample.y, 0.0f, 1.0f));
	m_ViewProjectionMatrixJitter = m_ViewMatrix * m_ProjectionMatrix * m_JitterMatrix;
}
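
The arithmetic in the snippet above can be checked in isolation. The following is a minimal, self-contained C++ sketch; `ComputeJitterNdc` and the local `Vec2` type are hypothetical names, not part of the demo engine. It relies on the fact that NDC spans [-1, 1], so one pixel is 2 / width wide, and the chain `S * (TexSize * 2.0f) * 0.5f` collapses to `S / size`. The resulting offset never exceeds half a pixel in either direction.

```cpp
#include <cassert>

struct Vec2 { float x, y; };

// Sub-sample positions for 8x TAA, in [-1, 1] (same values as SAMPLE_LOCS_8).
static const Vec2 kSampleLocs8[8] = {
    {-7.0f / 8.0f,  1.0f / 8.0f}, {-5.0f / 8.0f, -5.0f / 8.0f},
    {-1.0f / 8.0f, -3.0f / 8.0f}, { 3.0f / 8.0f, -7.0f / 8.0f},
    { 5.0f / 8.0f, -1.0f / 8.0f}, { 7.0f / 8.0f,  7.0f / 8.0f},
    { 1.0f / 8.0f,  3.0f / 8.0f}, {-3.0f / 8.0f,  5.0f / 8.0f}};

// Returns the NDC-space translation to put in the jitter matrix for a frame.
// A pixel is 2 / width wide in NDC, so half a pixel is 1 / width and the
// offset is simply the [-1, 1] sample position divided by the target size.
Vec2 ComputeJitterNdc(unsigned frameCount, unsigned width, unsigned height)
{
    assert(width > 0 && height > 0);
    const Vec2 s = kSampleLocs8[frameCount % 8u];
    return Vec2{s.x / float(width), s.y / float(height)};
}
```

For a 1280x720 G-buffer, frame 0 gives an offset of roughly (-0.00068, 0.00017) in NDC, and the sequence repeats every eight frames.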

The interesting part though is the resolve pass; this is where all the complexity lies. A naive implementation would simply blend the color of the current buffer with the color of the history buffer.

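// Weight of the current frame in the blend; the history keeps the remaining 15/16
float ModulationFactor = 1.0 / 16.0;

vec3 CurrentSubpixel = textureLod(CurrentBuffer, UV, 0.0).rgb;
vec3 History = textureLod(HistoryBuffer, UV, 0.0).rgb;
// mix(a, b, t) returns a * (1 - t) + b * t, so the history goes first
OutColor = mix(History, CurrentSubpixel, ModulationFactor);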

Figure 2: 8x AA Ground truth

For a static camera and scene the above code produces the best possible result (ground truth), as seen in figure 2, but it is expected to break under motion. To fix that we need to re-project the history buffer instead of just sampling it in place. The issue with re-projection is that it creates some noticeable ghosting. There are several ways to remove ghosting; in this article we focus on two that revolve around mapping the history color into the range of the current sub-pixel's neighborhood. The first one is to clamp the history color to the bounding box of the current sub-pixel's neighborhood [Lottes11] [Malan2012]. The second one is called variance clipping [Salvi16].
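
The re-projection step is not shown in this article, so the following C++ sketch is only an illustration of the general idea, under several assumptions: column vectors with a column-major `Mat4`, a Vulkan-style depth range of [0, 1], static geometry, and a hypothetical `currToPrev` matrix that the application precomputes as the previous frame's view-projection multiplied by the inverse of the current one (both without jitter). All names here are made up for the example.

```cpp
#include <cassert>

struct Vec2 { float x, y; };
struct Vec4 { float x, y, z, w; };

// Column-major 4x4 matrix used with column vectors: m[column][row].
struct Mat4 { float m[4][4]; };

Mat4 Identity()
{
    return Mat4{{{1, 0, 0, 0}, {0, 1, 0, 0}, {0, 0, 1, 0}, {0, 0, 0, 1}}};
}

Vec4 Mul(const Mat4& a, Vec4 v)
{
    return Vec4{
        a.m[0][0] * v.x + a.m[1][0] * v.y + a.m[2][0] * v.z + a.m[3][0] * v.w,
        a.m[0][1] * v.x + a.m[1][1] * v.y + a.m[2][1] * v.z + a.m[3][1] * v.w,
        a.m[0][2] * v.x + a.m[1][2] * v.y + a.m[2][2] * v.z + a.m[3][2] * v.w,
        a.m[0][3] * v.x + a.m[1][3] * v.y + a.m[2][3] * v.z + a.m[3][3] * v.w};
}

// Returns the UV at which the history buffer should be sampled for the
// surface point visible at (uv, depth) in the current frame.
Vec2 ReprojectUv(Vec2 uv, float depth, const Mat4& currToPrev)
{
    // UV in [0, 1] -> NDC in [-1, 1]; depth is used as NDC z directly.
    const Vec4 ndc{uv.x * 2.0f - 1.0f, uv.y * 2.0f - 1.0f, depth, 1.0f};
    const Vec4 prevClip = Mul(currToPrev, ndc);
    assert(prevClip.w != 0.0f);
    // Perspective divide, then NDC -> UV.
    return Vec2{(prevClip.x / prevClip.w) * 0.5f + 0.5f,
                (prevClip.y / prevClip.w) * 0.5f + 0.5f};
}
```

When the camera has not moved, `currToPrev` is the identity and the history is sampled in place. Moving objects would additionally need per-pixel motion vectors, which this sketch does not cover.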

Axis-aligned bounding box (AABB) clamping is pretty straightforward:

// textureLodOffset returns a vec4, so take .rgb explicitly
vec3 NearColor0 = textureLodOffset(CurrentBuffer, UV, 0.0, ivec2(1, 0)).rgb;
vec3 NearColor1 = textureLodOffset(CurrentBuffer, UV, 0.0, ivec2(0, 1)).rgb;
vec3 NearColor2 = textureLodOffset(CurrentBuffer, UV, 0.0, ivec2(-1, 0)).rgb;
vec3 NearColor3 = textureLodOffset(CurrentBuffer, UV, 0.0, ivec2(0, -1)).rgb;

// Per-channel bounding box of the current texel and its four neighbors
vec3 BoxMin = min(CurrentSubpixel, min(NearColor0, min(NearColor1, min(NearColor2, NearColor3))));
vec3 BoxMax = max(CurrentSubpixel, max(NearColor0, max(NearColor1, max(NearColor2, NearColor3))));

History = clamp(History, BoxMin, BoxMax);

The code above samples only the four axis-aligned neighbor texels around the current sample.
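
To see what the clamp does to a stale history value, the same logic can be run on the CPU. This is a minimal C++ sketch with hypothetical helper names (`MinPerChannel`, `MaxPerChannel`, `ClampHistoryAabb`); it mirrors the shader snippet but is not the shader itself.

```cpp
#include <algorithm>
#include <cassert>

struct Vec3 { float r, g, b; };

Vec3 MinPerChannel(Vec3 a, Vec3 b)
{
    return {std::min(a.r, b.r), std::min(a.g, b.g), std::min(a.b, b.b)};
}

Vec3 MaxPerChannel(Vec3 a, Vec3 b)
{
    return {std::max(a.r, b.r), std::max(a.g, b.g), std::max(a.b, b.b)};
}

// Clamps the history color into the per-channel bounding box of the current
// texel and its four neighbors.
Vec3 ClampHistoryAabb(Vec3 history, Vec3 current, const Vec3 (&neighbors)[4])
{
    Vec3 boxMin = current;
    Vec3 boxMax = current;
    for (const Vec3& n : neighbors)
    {
        boxMin = MinPerChannel(boxMin, n);
        boxMax = MaxPerChannel(boxMax, n);
    }
    assert(boxMin.r <= boxMax.r && boxMin.g <= boxMax.g && boxMin.b <= boxMax.b);
    return MaxPerChannel(boxMin, MinPerChannel(history, boxMax));
}
```

A history color that is far from anything in the current neighborhood (a likely ghost) is pulled back to the edge of the box, while a history color already inside the box passes through unchanged.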

Figure 3: 8x AA with AABB clamping

Variance clipping, unlike AABB clamping, is configurable. A higher VARIANCE_CLIPPING_GAMMA widens the box, which gives a better result overall but increases the ghosting artifacts. A lower VARIANCE_CLIPPING_GAMMA removes more ghosting but increases jittering.

const float VARIANCE_CLIPPING_GAMMA = 1.0;

// textureLodOffset returns a vec4, so take .rgb explicitly
vec3 NearColor0 = textureLodOffset(CurrentBuffer, UV, 0.0, ivec2(1, 0)).rgb;
vec3 NearColor1 = textureLodOffset(CurrentBuffer, UV, 0.0, ivec2(0, 1)).rgb;
vec3 NearColor2 = textureLodOffset(CurrentBuffer, UV, 0.0, ivec2(-1, 0)).rgb;
vec3 NearColor3 = textureLodOffset(CurrentBuffer, UV, 0.0, ivec2(0, -1)).rgb;

// Compute the two moments
vec3 M1 = CurrentSubpixel + NearColor0 + NearColor1 + NearColor2 + NearColor3;
vec3 M2 = CurrentSubpixel * CurrentSubpixel + NearColor0 * NearColor0 + NearColor1 * NearColor1 
	+ NearColor2 * NearColor2 + NearColor3 * NearColor3;

vec3 MU = M1 / 5.0;
// Rounding can make the variance slightly negative; clamp it before the sqrt
vec3 Sigma = sqrt(max(M2 / 5.0 - MU * MU, vec3(0.0)));

vec3 BoxMin = MU - VARIANCE_CLIPPING_GAMMA * Sigma;
vec3 BoxMax = MU + VARIANCE_CLIPPING_GAMMA * Sigma;

History = clamp(History, BoxMin, BoxMax);

Figure 4: 8x AA with variance clipping (gamma 1.0)
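
The effect of the gamma parameter is easier to see with numbers. The C++ sketch below works on a single color channel and uses hypothetical names (`Bounds`, `ComputeVarianceBounds`, `ClampToBounds`); the `std::max(0.0f, ...)` guard against a slightly negative variance is a defensive choice of this sketch. The shader applies the same arithmetic to all three channels at once.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

struct Bounds { float lo, hi; };

// samples holds the current texel followed by its four neighbors (one channel).
Bounds ComputeVarianceBounds(const float (&samples)[5], float gamma)
{
    assert(gamma >= 0.0f);
    float m1 = 0.0f; // first moment: sum of the samples
    float m2 = 0.0f; // second moment: sum of the squared samples
    for (float s : samples)
    {
        m1 += s;
        m2 += s * s;
    }
    const float mu = m1 / 5.0f;
    const float variance = std::max(0.0f, m2 / 5.0f - mu * mu);
    const float sigma = std::sqrt(variance);
    return {mu - gamma * sigma, mu + gamma * sigma};
}

float ClampToBounds(float history, Bounds b)
{
    return std::min(std::max(history, b.lo), b.hi);
}
```

For the samples {0.1, 0.2, 0.3, 0.2, 0.2} the mean is 0.2 and sigma is about 0.063, so a gamma of 1.0 accepts history values in roughly [0.137, 0.263], which is tighter than the [0.1, 0.3] box AABB clamping would use. Doubling gamma widens the range to roughly [0.074, 0.326], which lets more of the history (and therefore more ghosting) through.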

Both of these techniques aim to remove ghosting, and to some degree they manage to do that quite well. Unfortunately, both of them also introduce some jittering, which appears as mild flickering, when the current sub-pixel color and the history color are very different. One solution is to weight the ModulationFactor using the difference in luminance [Lottes12].

float Lum0 = ComputeLuminance(CurrentSubpixel);
float Lum1 = ComputeLuminance(History);

float Diff = abs(Lum0 - Lum1) / (EPSILON + max(Lum0, max(Lum1, ComputeLuminance(BoxMax))));
Diff = 1.0 - Diff;
Diff *= Diff;
	
ModulationFactor *= Diff;
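
The snippet above relies on `ComputeLuminance` and `EPSILON`, which are not shown in this article. The C++ sketch below fills them in with assumptions (Rec. 709 luma weights and an epsilon of 1e-4) purely so the weighting can be run and inspected on the CPU; the function name `WeightModulationFactor` is also hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

struct Vec3 { float r, g, b; };

// Assumption: Rec. 709 luma weights. The engine's own function may differ.
float ComputeLuminance(Vec3 c)
{
    return 0.2126f * c.r + 0.7152f * c.g + 0.0722f * c.b;
}

// Scales the modulation factor down as the luminance of the current
// sub-pixel and of the history diverge.
float WeightModulationFactor(float modulationFactor, Vec3 current, Vec3 history, Vec3 boxMax)
{
    assert(modulationFactor >= 0.0f && modulationFactor <= 1.0f);
    const float kEpsilon = 1e-4f; // assumed value, avoids a division by zero
    const float lum0 = ComputeLuminance(current);
    const float lum1 = ComputeLuminance(history);
    const float maxLum = std::max(lum0, std::max(lum1, ComputeLuminance(boxMax)));

    float diff = std::fabs(lum0 - lum1) / (kEpsilon + maxLum); // in [0, 1)
    diff = 1.0f - diff; // 1 when equal, close to 0 when very different
    diff *= diff;       // squaring makes the falloff steeper
    return modulationFactor * diff;
}
```

When the two colors have the same luminance the factor is returned unchanged; for a white sub-pixel against a black history it drops to almost zero.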

Temporal AA is quite tricky to get right simply because it's hard to strike a balance between ghosting, jittering and quality. Decreasing the ghosting increases the jittering, and decreasing the jittering decreases the overall quality.

Figure 5: A comparison of the 4 modes (8x AA)

Performance

The resolve render pass samples the depth buffer (for re-projection), the history buffer and the current sub-pixel buffer, and writes its output to a fourth buffer. Compared to other post-processing effects, it is on the heavy side in terms of bandwidth consumption.

The fragment shader is relatively simple though. For variance clipping (the most expensive method) the Mali offline compiler reports, for Mali-T880, 29 arithmetic instructions, 1 load-store instruction, 7 texture instructions and no register spilling.

Figure 6: Temporal AA resolve pass in Streamline performance analyzer

Running at 720p on a Galaxy S7, which contains a Mali-T880 MP12, this pass takes about 2.4 ms, as seen in the screenshot of the DS-5 Streamline performance analyzer (figure 6). Please note that this capture was taken with serialized submission to highlight the runtime of the render pass during optimization; in a production build the idle time before and after the pass would not be present.

Conclusion

Temporal AA is a double-edged sword. It's difficult to get it to look right, but on the other hand it's somewhat cheap even for mobile and it works quite well when MSAA cannot be used. Feel free to provide feedback and/or share your thoughts if you have tried to implement it on mobile.
