Hidden Surface Removal in Immortalis-G925: The Fragment Prepass

November 28, 2024

16 minute read time.

Arm's Immortalis and Mali GPUs are all about energy efficiency, and as every lazy person knows, the best way to avoid spending energy is to avoid doing work. Arm GPUs have a lot of tricks up their sleeves to avoid doing work, and starting with the Immortalis-G925, Mali-G725 and Mali-G625, they have yet another trick: the Fragment Pre-pass.

The Fragment Pre-pass is a Hidden Surface Removal (HSR) technique that does a first pass over the fragments to find out which fragments are going to be visible in the result. When that is done, it loops back and renders only the visible ones.

Sounds familiar? That is because it is very similar in concept to the well-known Z pre-pass technique, which has been employed by applications for a long time. But to do it in hardware is quite another matter, because unlike applications, hardware has to do this transparently, and deal with all of the corner cases allowed in the API. These complexities lead to limitations in the design of the pre-pass, and for an application developer it can be useful to be mindful of these limitations to be able to extract as much benefit as possible from the pre-pass. It is worthwhile though.

When an application inserts a Z pre-pass, it has to render all the geometry twice: first to build the Z buffer, and then again to render the color buffer. Doing it in hardware, on the other hand, avoids having to submit and tile the geometry twice, keeps the intermediate data structures on-chip, and can overlap bits and pieces of the pre-pass with the main-pass.

This blog post is quite heavy on technical detail. For that reason, we have included a glossary at the end of the post. Check it out if you find some of the terms confusing.

What makes the Mali Fragment Pre-pass good?

The Mali Fragment Pre-pass is designed to go to great lengths to avoid doing work.

- It is very robust, and can perform Hidden Surface Removal in cases where one would normally expect a hardware pre-pass to give up:
  - Fragment shader side effects
  - Arbitrary stencil operations
  - Arbitrary depth compare functions
  - Late-Z
  - A useful subset of transparent draw calls
- It has favorable interactions with the Deferred Vertex Shading that was introduced in the Immortalis-G920 series:
  - Early-Z primitives that end up with no covered samples never have to shade their varyings.
- It does sample-perfect culling.
- Its culling efficiency is not sensitive to how the incoming primitives are ordered with respect to Z.
  - It is, however, sensitive to ordering in the sense that incompatible draw calls should be placed after compatible draw calls. More on that to follow.

How good is it really?

We ran a selection of modern content on an internal fixed-frequency platform, once with the Fragment Pre-pass turned on and once with it turned off, and compared the results.

Reduction in	Overall GPU cycles	Fragment read bandwidth	External read bandwidth	External write bandwidth	Floating-point arithmetic instructions	Texturing operations
Fortnite: Parachute scene	6.5%	10.7%	1.0%	-0.5%	16.1%	13.9%
Fortnite: Mountain or town scene	5.4%	8.6%	3.7%	0.2%	11.4%	9.6%
Justice Online Mobile	4.6%	6.0%	0.6%	0.4%	15.9%	11.1%
3DMark Steel Nomad Light	3.6%	3.9%	8.6%	2.4%	2.7%	3.4%
Zenless Zone Zero: lumina square scene	1.6%	2.0%	2.4%	0.3%	2.4%	1.8%
Zenless Zone Zero: battle scene	1.3%	1.4%	2.3%	0.1%	2.3%	1.5%
Roblox Towers Theme Park: Pirate Bay	1.2%	39%	2.1%	-1.4%	7.2%	12.5%
Star Rail: Admin District	0.6%	0.8%	0.7%	0.4%	1.2%	0.8%
Arena Breakout: farmland	0.6%	1.0%	3.3%	2.2%	3.1%	-1.4%

How much the Fragment Pre-pass helps clearly depends on content, but there are some very nice double-digit reductions in power-hungry FMA and texturing operations.
Also note that we only disabled the Fragment Pre-pass in these runs. Mali's other Hidden Surface Removal technologies, for example Forward Pixel Kill, were left on.

How exactly does the Mali Fragment Pre-pass work?

As of the Immortalis-G920 series, Arm GPUs are tile-based deferred renderers with Deferred Vertex Shading. In a nutshell, this means:

- During the tiling phase, Arm GPUs do not write out position data for small triangles.
- During the fragment phase, Arm GPUs will execute a full vertex shader for small triangles.

This is, on its own, a powerful technique to reduce the amount of bandwidth going off-chip between the tiling phase and the fragment phase. Fragment Pre-pass, introduced in the Immortalis-G925 series, builds on this by extending the fragment phase like this:

For every tile:

Run a pre-pass:

This pre-pass iterates over the primitives in the tile.
- For compatible (see below for the conditions that make primitives incompatible), opaque primitives, it rasterizes them and runs fragments up to the point where coverage is known. At that point, for each covered sample, the hardware records that this primitive covers this sample.
  - This includes running the vertex shader for small triangles; although only the position shader for cases where the fragment shader does not emit ZS values or impact coverage.
- For compatible, ZS-only primitives, it rasterizes them and runs fragments up to the point where the ZS values are written.
- For compatible, transparent primitives, it simply skips them
- On encountering the first incompatible draw call it terminates, ignoring any further primitives.
- At the end of this pre-pass, the hardware has a fully populated ZS buffer, and it knows which opaque primitives cover which samples.

Run a main-pass:

For each primitive:
- For compatible, opaque primitives
  - It checks whether the primitive covers any samples. If not - the primitive is culled.
  - It also checks if the primitive outputs color at all. If not - the primitive is culled; it has no further work to do.
  - Then it runs the vertex shader (for small triangles), and rasterizes. But instead of a per-sample ZS test, it does a per-sample test to check if this primitive is the one that ended up being visible at its covered sample locations.
  - Samples that survive this test run the fragment shader
- For compatible, transparent primitives, it does a per-sample test to see if this sample was overdrawn by a later primitive, and a ZS test.
- On encountering the first incompatible draw call, it disables the fragment pre-pass specific tests, falls back to regular ZS testing, and treats all subsequent primitives as incompatible.
  - These then go through regular rasterization, deferred vertex shading, Early-Z, fragment shader execution and Late-Z as usual. Note that Forward Pixel Kill is still around and will still be able to perform some level of hidden surface removal.

This figure illustrates a pre-pass for a 4x4 tile, with 3 primitives.

A pre-pass for a 4x4 tile

Pre-pass:

The orange primitive is drawn first. It is determined to be compatible, so it is rasterized and runs up until coverage is known. Then the hardware records that the orange primitive covers sample position (1,2).
The blue primitive is drawn next. It is also determined to be compatible, and the hardware ultimately records that it covers sample positions (2,1), (2,2) and (3,1).
Then the green primitive comes along. It too is compatible, and the hardware records that it covers positions (0,1), (1,1), (2,1) and (1,2). Note that this overwrites the record of the orange primitive on sample (1,2) and the record of the blue primitive on sample (2,1).

When that is done, it is time for the main-pass:

The orange primitive is determined to be compatible, but it is also determined to not have any covered samples. So, it is simply thrown away.
The blue primitive is next up. This one does have covered samples, but only samples (2,2) and (3,1). Fragment shaders are spawned for these two samples.
Finally it's the green primitive's turn. This primitive still has coverage recorded for positions (0,1), (1,1), (2,1) and (1,2), so fragment shaders are spawned for these four samples.

In this example, the fragment pre-pass is able to save:

The varying shading cost and fragment shading cost of the entire orange primitive
The fragment shading cost of sample (2,1) of the blue primitive

But what if the blue primitive was incompatible?

If the blue primitive is incompatible, then the pre-pass will behave like this:

The orange primitive is drawn first. It is determined to be compatible, so it is rasterized and runs up until coverage is known. Then the hardware records that the orange primitive covers sample position (1, 2).
The blue primitive is drawn next. It is determined to be incompatible, and the pre-pass stops here.

When that is done, it is time for the main-pass:

The orange primitive is determined to be compatible and since the pre-pass stopped before any other primitives could cover it, it is fully drawn.
The blue primitive is next up. This one was not part of the pre-pass, so we have to draw it in full in the main-pass.
Finally, it is the green primitive's turn. This primitive would have been part of the pre-pass, but because the blue primitive before it caused the pre-pass to terminate, it is not included in the pre-pass. For that reason, we have to draw it in full in the main-pass.

In this example, the fragment pre-pass cannot save anything.
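
The two walkthroughs above can be reproduced with a small software model. This is only a sketch of the bookkeeping, not of the hardware: the `Primitive` class, its `compatible` flag and the function names are made up for illustration, every fragment is assumed to pass the ZS test (as in the figure), and transparent and ZS-only primitives are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Primitive:
    name: str
    samples: frozenset       # sample positions covered after rasterization
    compatible: bool = True  # hypothetical flag: opaque and pre-pass compatible


def pre_pass(prims):
    """Return (owner map, number of primitives the pre-pass processed)."""
    owner = {}
    processed = 0
    for prim in prims:
        if not prim.compatible:
            break  # the first incompatible draw call terminates the pre-pass
        for sample in prim.samples:
            owner[sample] = prim.name  # later primitives overwrite earlier records
        processed += 1
    return owner, processed


def main_pass(prims, owner, processed):
    """Return, per primitive, the set of samples that spawn fragment shaders."""
    shaded = {}
    for index, prim in enumerate(prims):
        if index < processed:
            # Pre-pass specific test: is this primitive the recorded owner?
            shaded[prim.name] = {s for s in prim.samples if owner[s] == prim.name}
        else:
            # Fallback path: in this simplified model everything is shaded.
            shaded[prim.name] = set(prim.samples)
    return shaded


def render_tile(prims):
    owner, processed = pre_pass(prims)
    return main_pass(prims, owner, processed)


ORANGE = frozenset({(1, 2)})
BLUE = frozenset({(2, 1), (2, 2), (3, 1)})
GREEN = frozenset({(0, 1), (1, 1), (2, 1), (1, 2)})

all_compatible = render_tile([
    Primitive("orange", ORANGE),
    Primitive("blue", BLUE),
    Primitive("green", GREEN),
])
blue_incompatible = render_tile([
    Primitive("orange", ORANGE),
    Primitive("blue", BLUE, compatible=False),
    Primitive("green", GREEN),
])

for label, result in (("all compatible", all_compatible),
                      ("blue incompatible", blue_incompatible)):
    print(label, {name: len(samples) for name, samples in result.items()})
```

With all three primitives compatible, the model shades 0, 2 and 4 samples for orange, blue and green. With blue marked incompatible it shades 1, 3 and 4, which is every covered sample of every primitive.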

It is important to try and place incompatible primitives after compatible primitives, not before. With that in mind, as a developer, it is important to know the following:

What makes a draw call compatible?

So, what constitutes a compatible draw call?

Mali's Fragment Pre-pass is fairly robust and can handle a lot of cases, so it is easier to describe the cases that make a draw call incompatible.

1. A non-opaque draw call that writes Z or S.
1.1. These draw calls depend on reading a color value from the color buffer, so earlier primitives must be fragment shaded to completion. They therefore cannot be recorded as overwriting previous primitives, which means the pre-pass does not remember the outcome of their ZS test, so they would have to be ZS tested in the main-pass instead. But that is not possible either, because later draws may have altered the ZS values by the time the main-pass gets to them. Hence they are incompatible.
1.2. The hardware tracks which render targets are being written to. If it encounters a draw that does not fully overwrite all render targets that have previously been written to in the tile, then it is effectively transparent; there is something from a previous draw call that "shines through" it.
1.3. Draw calls that read the tile buffer are also considered transparent.
2. A ZS-only draw that follows a compatible non-opaque draw call is considered incompatible.
2.1. This is because the ZS-only draw call does not get recorded as opaque, but it does update the ZS buffer. This could change the outcome of the ZS test for the earlier non-opaque primitives in the main-pass.
3. Fragment shader side effects. One would think these are considered incompatible, but perhaps surprisingly the pre-pass does allow a subset of fragment shader side effects. The one thing that is incompatible here is read-write access to something. You can write something, you can read something, but you cannot read and write the same thing. One notable exception to this is that atomic updates are only incompatible if the return value from the atomic update is used in the fragment shader.
4. Anything where the rasterizer coverage is required in the fragment shader. This includes:
4.1. Centroid varyings
4.2. Reading the coverage mask
4.3. Checking if the lane is a helper lane
5. Reading from the tile buffer at sample positions for which the primitive doesn't have coverage.
5.1. This can be caused by reading the tile buffer in helper lanes and communicating the value across to adjacent threads via subgroup operations, or by using the read value as texture coordinates in a texture lookup with computed level-of-detail.
5.2. It also includes tile buffer reads with multi-sampling or Variable Rate Shading.
6. Finally, and somewhat surprisingly, draw calls that write Z or stencil where the fragment shader modifies coverage, but where the shader explicitly states that ZS testing and update must happen early.
6.1. This can result in samples where the Z or stencil values get updated without an actual opaque primitive covering that sample (because the sample gets discarded). This will interfere with non-opaque draws.
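
If you want to audit your own draw calls against this list, it can help to write the rules down as a checklist in code. The sketch below is one possible reading of the rules above, not a description of what the driver or hardware actually does. The `Draw` fields and the `Kind` enum are hypothetical names; in a real engine you would have to derive them from your pipeline state and shader reflection data.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    OPAQUE = "opaque"
    NON_OPAQUE = "non-opaque"
    ZS_ONLY = "zs-only"


@dataclass
class Draw:
    kind: Kind
    writes_zs: bool = True
    read_write_same_resource: bool = False     # reads and writes the same resource
    uses_atomic_return_value: bool = False
    needs_rasterizer_coverage: bool = False    # centroid, coverage mask, helper query
    reads_tile_buffer_outside_coverage: bool = False
    modifies_coverage: bool = False
    forces_early_zs: bool = False


def incompatibility_reason(draw, after_non_opaque):
    """Return a short reason string, or None if the draw looks compatible."""
    if draw.kind is Kind.NON_OPAQUE and draw.writes_zs:
        return "non-opaque draw that writes Z or S"
    if draw.kind is Kind.ZS_ONLY and after_non_opaque:
        return "ZS-only draw after a compatible non-opaque draw"
    if draw.read_write_same_resource or draw.uses_atomic_return_value:
        return "read-write side effect"
    if draw.needs_rasterizer_coverage:
        return "fragment shader needs rasterizer coverage"
    if draw.reads_tile_buffer_outside_coverage:
        return "tile buffer read outside coverage"
    if draw.writes_zs and draw.modifies_coverage and draw.forces_early_zs:
        return "forced early ZS with shader-modified coverage"
    return None


def first_incompatible(draws):
    """Return (index, reason) of the first incompatible draw, or (len(draws), None)."""
    after_non_opaque = False
    for index, draw in enumerate(draws):
        reason = incompatibility_reason(draw, after_non_opaque)
        if reason is not None:
            return index, reason
        if draw.kind is Kind.NON_OPAQUE:
            after_non_opaque = True
    return len(draws), None


frame = [
    Draw(Kind.OPAQUE),
    Draw(Kind.NON_OPAQUE, writes_zs=False),  # compatible transparent draw
    Draw(Kind.ZS_ONLY),                      # now incompatible (rule 2)
    Draw(Kind.OPAQUE),                       # never reaches the pre-pass
]
print(first_incompatible(frame))
```

Every draw call at or after the returned index is treated as incompatible in that tile, so the index tells you how much of the render pass can benefit from the pre-pass.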

What are the pitfalls of the Fragment Pre-pass?

While the pre-pass is generally quite robust, there are situations in which it can cause performance problems.

Avoid expensive computation of position in vertex shaders

For a simple, Early-Z draw call, the position calculation in vertex shaders is run up to three times:

Once to determine which tiles are covered, and to do back-face culling
In the fragment pre-pass, it is run to be able to rasterize the triangle and find out which samples it covers
In the fragment main-pass, if there are visible samples, the full vertex shader is run.

Therefore, optimizing the position calculation in the vertex shader is likely to yield good return on investment.

Avoid fat G-buffers

Generally speaking on Mali, as you increase the number of bytes per pixel, the tile size in number of pixels per tile has to shrink to be able to fit the data in the on-chip tile buffer.
The exact thresholds vary by GPU, but on the Immortalis-G925 series GPUs there is a threshold at 128 bits-per-pixel causing the tile size to shrink from 64x64 to 64x32.
There is another threshold at 256 bits-per-pixel, beyond which the tile size shrinks to 32x32.
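
As a quick sanity check for a G-buffer layout, the thresholds above can be written as a small helper. This is a sketch under a few assumptions: the budget is taken to count color attachments only, 128 and 256 bits-per-pixel themselves are assumed to still fit the larger tile, and the format names are informal labels rather than API enums.

```python
# Bits per pixel of a few common color formats (informal names).
FORMAT_BITS = {"RGBA8": 32, "RGB10_A2": 32, "R11G11B10F": 32, "RGBA16F": 64}


def gbuffer_bits(color_formats):
    """Total bits per pixel across all color attachments."""
    return sum(FORMAT_BITS[name] for name in color_formats)


def tile_size(bits_per_pixel):
    """Tile (width, height) on Immortalis-G925 series GPUs, per the thresholds in the text."""
    if bits_per_pixel <= 128:
        return (64, 64)
    if bits_per_pixel <= 256:
        return (64, 32)
    return (32, 32)


lean = ["RGBA8", "RGB10_A2", "RGBA8", "R11G11B10F"]  # 4 x 32 = 128 bits
fat = ["RGBA8", "RGBA16F", "RGBA16F"]                # 32 + 64 + 64 = 160 bits

for layout in (lean, fat):
    bits = gbuffer_bits(layout)
    print(bits, "bits per pixel ->", tile_size(bits))
```

In this model the lean layout keeps the full 64x64 tile, while the fat one drops to 64x32.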

Broadly speaking, shrinking the tile size has the following downsides:

1. It increases the amount of Deferred Vertex Shading re-shading.
1.1. Primitives that cover multiple tiles have to run their deferred vertex shader for every tile they are in. Shrinking the tile size means they are in more tiles.
1.2. Fragment pre-pass amplifies this downside, because the deferred vertex shader runs in both the pre-pass and the main-pass.
2. Smaller tile sizes make it harder to hide the dependency between the pre-pass and the main-pass.
2.1. You cannot start a main-pass until the pre-pass has completed. Mali is generally quite good at hiding this dependency, but that gets harder with smaller tile sizes.
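
To get a feel for how tile size and re-shading interact, here is a rough counting model. It assumes a small triangle that goes through Deferred Vertex Shading, approximates its tile coverage with a pixel bounding box (real binning may be tighter), and counts one position calculation in the tiling phase plus one per tile in the pre-pass and at most one per tile in the main-pass. The function names are made up for this sketch.

```python
def tiles_touched(bbox, tile_w, tile_h):
    """Count tiles overlapped by a pixel bounding box (x0, y0, x1, y1), max exclusive."""
    x0, y0, x1, y1 = bbox
    cols = (x1 - 1) // tile_w - x0 // tile_w + 1
    rows = (y1 - 1) // tile_h - y0 // tile_h + 1
    return cols * rows


def max_position_runs(bbox, tile_w, tile_h):
    """Upper bound: once in tiling, then pre-pass and main-pass in every tile."""
    return 1 + 2 * tiles_touched(bbox, tile_w, tile_h)


# A small triangle that happens to straddle a tile boundary.
bbox = (60, 28, 70, 36)
for tile_w, tile_h in ((64, 64), (64, 32), (32, 32)):
    print((tile_w, tile_h),
          tiles_touched(bbox, tile_w, tile_h), "tiles,",
          max_position_runs(bbox, tile_w, tile_h), "position runs at most")
```

For this particular triangle, going from 64x64 to 64x32 tiles doubles the number of tiles it lands in, from 2 to 4, and the upper bound on position calculations grows from 5 to 9.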

Avoid late-Z and keep the amount of computation needed to know coverage small

When a draw call uses late-Z, for example writing to gl_FragDepth or using discard in the fragment shader, the Fragment Pre-pass has to run a portion of the fragment shader to determine which samples are covered.
This part of the fragment shader then has to run both in the pre-pass and, for visible samples, in the main-pass, so the pre-pass can end up doing more work overall.

Additionally, because late-Z draw calls have to run a bit of the fragment shader in the pre-pass, they also get all of their varyings shaded in the deferred vertex shader in the pre-pass (as opposed to just the positions).

But sometimes you just cannot avoid it. For those cases, it is worthwhile to optimize the path to known coverage. That means:

Keep the vertex shader, including varying calculations, as simple as possible
Place any writes to gl_FragDepth, and any discards early in the fragment shader, and keep the computations leading up to them simple
If you are using alpha-to-coverage, keep the computation of the alpha value simple
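
To see why the path to known coverage matters, consider a back-of-the-envelope cost model. It is deliberately crude: costs are in arbitrary units, the comparison baseline assumes every covered late-Z fragment runs the full shader, and other Hidden Surface Removal such as Forward Pixel Kill is ignored. All names and numbers are illustrative.

```python
def late_z_fragment_cost(covered, visible, coverage_cost, full_cost):
    """Return (cost without pre-pass, cost with pre-pass) for a late-Z draw call.

    covered:       fragments produced by rasterization
    visible:       fragments that survive into the final image
    coverage_cost: shader cost up to the point where coverage is known
    full_cost:     cost of the complete fragment shader
    """
    without_pre_pass = covered * full_cost
    with_pre_pass = covered * coverage_cost + visible * full_cost
    return without_pre_pass, with_pre_pass


# Discard placed early: coverage is known after 10% of the shader.
print(late_z_fragment_cost(covered=1000, visible=400, coverage_cost=10, full_cost=100))

# Discard placed late: coverage is only known after 80% of the shader.
print(late_z_fragment_cost(covered=1000, visible=400, coverage_cost=80, full_cost=100))
```

In this model the early discard halves the fragment cost (100000 down to 50000 units), while the late discard makes the pre-pass a net loss (120000 units).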

Place incompatible primitives last or avoid them if you can

One fundamental limitation of the Fragment Pre-pass is that once it has encountered an incompatible draw call in a tile, all subsequent draw calls in that tile are considered incompatible.
In other words, if there is an incompatible primitive early on in the tile, then the pre-pass loses its ability to cull further primitives in the tile.

For that reason it is important to place incompatible primitives after the draw calls you want effective HSR for.
One important note here is about render target masks; if your draw calls do not write to all render targets that have been previously written to, then the draw call is considered transparent. If it then also writes depth or uses stencil, it is considered incompatible!

As an example, say you are building a G-buffer where some of your materials write some extra data to an extra render target.

Draw call 0: Has a standard material, writes to render targets 0, 1, 2, 3 and depth
Draw call 1: Has a special material, writes to render targets 0, 1, 2, 3, 4 and depth
Draw call 2: Has a standard material, writes to render targets 0, 1, 2, 3 and depth

This will cause Draw call 2 to be considered incompatible because it does not fully overwrite all the outputs of Draw call 1.
There are two ways to solve this particular problem:

Have all G-buffer draw calls always write to all render targets (just write zeros to render targets that are not strictly needed by that material)
Place the "special" materials last in the G-buffer rendering.
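
The render target rule is easy to check offline. The sketch below models it for a single tile that every draw call touches. The function name and the tuple layout are invented for this example, and the real hardware tracking may differ in detail.

```python
def classify_gbuffer_draws(draws):
    """Classify draws given as (render_targets_written, writes_depth) tuples."""
    written = set()            # render targets written so far in this tile
    hit_incompatible = False
    results = []
    for targets, writes_depth in draws:
        targets = set(targets)
        # Transparent if some previously written target is not overwritten.
        transparent = not written <= targets
        if hit_incompatible or (transparent and writes_depth):
            hit_incompatible = True
            results.append("incompatible")
        elif transparent:
            results.append("transparent")
        else:
            results.append("opaque")
        written |= targets
    return results


STANDARD = {0, 1, 2, 3}
SPECIAL = {0, 1, 2, 3, 4}

original = [(STANDARD, True), (SPECIAL, True), (STANDARD, True)]
write_all = [(SPECIAL, True), (SPECIAL, True), (SPECIAL, True)]
special_last = [(STANDARD, True), (STANDARD, True), (SPECIAL, True)]

for label, draws in (("original", original),
                     ("write all targets", write_all),
                     ("special last", special_last)):
    print(label, classify_gbuffer_draws(draws))
```

In this model the original order flags Draw call 2 as incompatible, while both fixes leave all three draw calls opaque.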

How do I maximize the benefits of the Fragment Pre-pass?

The perfect draw call

The "perfect" draw call for the Fragment Pre-pass looks like this:

It is Early-Z, because then only position shading is required in the Pre-pass Deferred Vertex Shading phase, and the fragment shader does not need to run in the pre-pass
It is in a render-pass using 128 bits per pixel or less
It is opaque, and writes to all render targets in the render pass
The vertex shader is simple, especially computing the position.

No need to sort by depth any more

Assuming you are able to keep incompatible primitives late in the fragment pass, the fragment pre-pass culling efficiency is otherwise not sensitive to draw order.
Traditionally, rendering objects sorted by depth has been a technique used by applications to maximize the amount of Early-Z culling. With the fragment pre-pass this is no longer necessary: you will get the same amount of culling regardless of the order.

This means you can skip the expensive CPU-side sorting and just render the opaque geometry in any order.
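
The difference from plain Early-Z can be illustrated with a toy model. Each draw below is a `(depth, samples)` pair with a less-than depth test. All draws are opaque and compatible, and Forward Pixel Kill is ignored, so the Early-Z numbers are pessimistic. The function names are illustrative only.

```python
from itertools import permutations


def early_z_shaded(draws):
    """Fragments shaded with Early-Z only: anything passing the test when drawn."""
    zbuf = {}
    shaded = 0
    for depth, samples in draws:
        for s in samples:
            if depth < zbuf.get(s, float("inf")):
                zbuf[s] = depth
                shaded += 1
    return shaded


def pre_pass_shaded(draws):
    """Fragments shaded with a pre-pass: only the final owner of each sample."""
    zbuf, owner = {}, {}
    for index, (depth, samples) in enumerate(draws):
        for s in samples:
            if depth < zbuf.get(s, float("inf")):
                zbuf[s] = depth
                owner[s] = index
    return sum(1 for index, (_, samples) in enumerate(draws)
               for s in samples if owner.get(s) == index)


draws = [(0.2, frozenset({0, 1})),
         (0.5, frozenset({1, 2})),
         (0.8, frozenset({0, 1, 2, 3}))]

early = [early_z_shaded(order) for order in permutations(draws)]
pre = [pre_pass_shaded(order) for order in permutations(draws)]
print("Early-Z only:", min(early), "to", max(early), "fragments shaded")
print("Pre-pass:    ", min(pre), "to", max(pre), "fragments shaded")
```

In this model Early-Z alone shades between 4 and 8 fragments depending on submission order, while the pre-pass shades exactly 4 in every order.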

Which GPU supports Fragment Pre-pass?

At the moment of writing this blog post, this feature is supported by Mali-G625, Mali-G725 and Immortalis-G925.

Arm GPU Best Practices

Advice about using the Fragment Pre-pass is included in the latest release of the Arm GPU Best Practices Developer Guide. There are also updates to advice about Ray Tracing Pipeline, Arm Fixed Rate Compression, Pipeline Caches, Dynamic Rendering, runtime compression and new advice about Staging Buffers among other changes. For the latest on how to get the best out of Arm GPUs, right up to the Immortalis-G925, make sure to have a read.

Glossary

What is rasterization?

Rasterization is the process of turning a primitive, usually a triangle, into individual fragments.

What is a fragment?

A fragment is one tiny part of a triangle, usually corresponding to a pixel. There can be many fragments per pixel if triangles are drawn on top of each other.

What is coverage?

Coverage is a term graphics people like to throw around.

A fragment usually covers a pixel - except if Variable Rate Shading is used, in which case it can cover multiple pixels.
A pixel might consist of multiple samples
Coverage refers to which sample of which pixels the fragment will eventually output values to.

In the simple case where a fragment covers one pixel, and the pixel only has one sample - coverage simply means whether the fragment is visible or not.

What is ZS?

In 3D Graphics, "Z" is the depth of a fragment. When the GPU does Z testing for a fragment, it looks in its Z buffer at the fragment location to see if this fragment's Z is behind what's in the buffer. If the fragment is behind what's in the Z buffer - the fragment is discarded. Otherwise, if the draw call enabled writing to the Z buffer, the Z buffer is updated with the value of this fragment, and execution continues.

"S" stands for stencil. It is a bit like Z, but far more programmable, and has different use cases.

What is Early-ZS and Late-ZS?

Early-ZS is an optimization where ZS testing is performed before the fragment shader executes. This can only be done if the fragment shader does not modify its Z or S values in any way, and does not modify its coverage.
This is a very powerful and well-known optimization that can eliminate a lot of work. "Late-ZS" refers to cases where this optimization cannot be applied.

What is Deferred Vertex Shading (DVS)?

Deferred Vertex Shading (DVS) is an optimization introduced in Mali-G720/Immortalis-G920. The gist of it is that instead of doing vertex shading up-front during the tiling phase, the GPU only does the position shading to find out which bins a particular primitive covers. When that is done, the position values are thrown away but are then re-generated in the fragment shading phase.
This saves DDR bandwidth at the expense of some extra computation, a trade-off that is increasingly favorable as computation capability is increasing far more rapidly than DDR bandwidth.

What are fragment shader side effects?

Fragment shader side effects are writes to memory that do not follow the usual pattern of writing to a render target.
Examples of this include using imageStore or storing to a buffer.
