Arm's Immortalis and Mali GPUs are all about energy efficiency, and as every lazy person knows, the best way to avoid spending energy is to avoid doing work. Arm GPUs have a lot of tricks up their sleeves to avoid doing work, and starting with the Immortalis-G925, Mali-G725, and Mali-G625, they have yet another trick: the Fragment Pre-pass.
The Fragment Pre-pass is a Hidden Surface Removal (HSR) technique that does a first pass over the fragments to find out which fragments are going to be visible in the result. When that is done, it loops back and renders only the visible ones.
Sounds familiar? That is because it is very similar in concept to the well-known Z pre-pass technique, which has been employed by applications for a long time. But to do it in hardware is quite another matter, because unlike applications, hardware has to do this transparently, and deal with all of the corner cases allowed in the API. These complexities lead to limitations in the design of the pre-pass, and for an application developer it can be useful to be mindful of these limitations to be able to extract as much benefit as possible from the pre-pass. It is worthwhile though.
When an application inserts a Z pre-pass, it has to render all the geometry twice, first to build the Z buffer, and then again to render the color buffer. Doing it in hardware, on the other hand, avoids having to submit and tile the geometry twice, keeps the intermediate data structures on-chip, and can overlap parts of the pre-pass with the main-pass.
This blog post is quite heavy on technical detail. For that reason, we have included a glossary at the end of the post. Check it out if you find some of the terms confusing.
The Mali Fragment Pre-pass is designed to go to great lengths to avoid doing work.
We ran a selection of modern content on an internal fixed-frequency platform, once with the Fragment Pre-pass turned on and once with it turned off, and compared the results.
How much the Fragment Pre-pass helps clearly depends on content, but there are some very nice double-digit reductions in power-hungry FMA and texturing operations. Also, note that we only disabled the Fragment Pre-pass in these runs. Mali's other Hidden Surface Removal technologies, for example Forward Pixel Kill, were left on.
As of the Immortalis-G920 series, Arm GPUs are tile-based deferred renderers with Deferred Vertex Shading. In a nutshell, this means:
This is, on its own, a powerful technique to reduce the amount of bandwidth going off-chip between the tiling phase and the fragment phase. Fragment Pre-pass, introduced in the Immortalis-G925 series, builds on this by extending the fragment phase like this:
For every tile:
Run a pre-pass:
Run a main-pass:
This figure illustrates a pre-pass for a 4x4 tile, with 3 primitives.
Pre-pass:
When that is done, it is time for the main-pass:
In this example, the fragment pre-pass is able to save:
If the blue primitive is incompatible, then the pre-pass will behave like this:
In this example, the fragment pre-pass cannot save anything.
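The effect of one incompatible primitive can be sketched with a toy model. This is only an illustration under stated assumptions, not a description of the hardware: the 4x4 tile is flattened to 16 pixel indices, each primitive is a hypothetical `(pixels, depth, compatible)` tuple, a smaller depth is closer, and the geometry is invented rather than taken from the figure. The rule that everything after the first incompatible primitive is also treated as incompatible comes from the limitation described later in this post.

```python
FAR = float("inf")
TILE_PIXELS = 16  # a 4x4 tile, flattened


def shaded_fragments(primitives, use_prepass=True):
    """Count fragment shader invocations for one tile in a toy model."""
    depth = [FAR] * TILE_PIXELS

    # Pre-pass: only the leading run of compatible primitives contributes.
    n_compatible = 0
    if use_prepass:
        for pixels, z, compatible in primitives:
            if not compatible:
                break  # everything from here on is treated as incompatible
            n_compatible += 1
            for p in pixels:
                depth[p] = min(depth[p], z)

    # Main-pass: draw in submission order.
    shaded = 0
    for index, (pixels, z, _) in enumerate(primitives):
        for p in pixels:
            if index < n_compatible:
                # Compatible: shade only the fragment that won the pre-pass.
                # (Depths are distinct in this example, so there are no ties.)
                if z == depth[p]:
                    shaded += 1
            elif z < depth[p]:
                # Incompatible: ordinary in-order depth test.
                depth[p] = z
                shaded += 1
    return shaded


def scene(flags):
    """Three overlapping primitives submitted back to front."""
    return [
        (range(0, 16), 0.9, flags[0]),
        (range(0, 8), 0.5, flags[1]),
        (range(4, 12), 0.1, flags[2]),
    ]


no_prepass = shaded_fragments(scene([True, True, True]), use_prepass=False)
all_compatible = shaded_fragments(scene([True, True, True]))
first_incompatible = shaded_fragments(scene([False, True, True]))
last_incompatible = shaded_fragments(scene([True, True, False]))
```

In this invented scene the model shades 32 fragments without a pre-pass, 16 when every primitive is compatible, 32 again when the first primitive is incompatible, and 24 when only the last one is. The absolute numbers mean nothing for real content; the point is that an early incompatible primitive removes the saving while a late one keeps most of it.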
It is important to try and place incompatible primitives after compatible primitives, not before. With that in mind, as a developer, it is important to know the following:
So, what constitutes a compatible draw call?
Mali's Fragment Pre-pass is fairly robust and can handle a lot of cases, so it is easier to describe the cases that make a draw call incompatible.
While the pre-pass is generally quite robust, there are situations in which it can cause performance problems.
For a simple, Early-Z draw call, the position calculation in vertex shaders is run up to three times:
Therefore, optimizing the position calculation in the vertex shader is likely to yield a good return on investment.
Generally speaking on Mali, as you increase the number of bytes per pixel, the tile size in number of pixels per tile has to shrink to be able to fit the data in the on-chip tile buffer. The exact thresholds vary by GPU, but on the Immortalis-G925 series GPUs there is a threshold at 128 bits-per-pixel causing the tile size to shrink from 64x64 to 64x32. There is another threshold at 256 bits-per-pixel, beyond which the tile size shrinks to 32x32.
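As a rough aid for budgeting render targets, the thresholds above can be written down as a small lookup. This is a sketch, not an Arm API: the function names are hypothetical, the per-format bit counts are the standard sizes of those formats, and it assumes that exactly 128 or 256 bits-per-pixel still fits the larger tile, which the text above does not state explicitly. Which attachments count towards the total on real hardware is also not covered here.

```python
# Standard storage sizes for a few common color formats, in bits per pixel.
FORMAT_BITS = {
    "RGBA8": 32,
    "RGB10_A2": 32,
    "R11G11B10F": 32,
    "RGBA16F": 64,
}


def bits_per_pixel(attachments):
    """Sum the per-pixel storage of a list of attachment format names."""
    return sum(FORMAT_BITS[name] for name in attachments)


def tile_size(bpp):
    """Tile size for the Immortalis-G925 series thresholds quoted above.

    Assumption: the thresholds are inclusive, so 128 bpp is still 64x64
    and 256 bpp is still 64x32.
    """
    if bpp <= 128:
        return (64, 64)
    if bpp <= 256:
        return (64, 32)
    return (32, 32)


slim_gbuffer = ["RGBA8", "RGBA8", "RGB10_A2", "R11G11B10F"]
fat_gbuffer = slim_gbuffer + ["RGBA16F"]

slim_tile = tile_size(bits_per_pixel(slim_gbuffer))  # 128 bpp
fat_tile = tile_size(bits_per_pixel(fat_gbuffer))    # 192 bpp
```

Under these assumptions, the four-target layout stays at 64x64, while adding a single `RGBA16F` target pushes it to 64x32.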
Broadly speaking there are three downsides to shrinking the tile size:
When a draw call uses late-Z, for example by writing to gl_FragDepth or using discard in the fragment shader, the Fragment Pre-pass has to run a portion of the fragment shader to determine which samples are covered. This part of the fragment shader then has to run both in the pre-pass and, for visible samples, in the main-pass, which can mean doing more work overall.
Additionally, because late-Z draw calls have to run a bit of the fragment shader in the pre-pass, they also get all of their varyings shaded in the deferred vertex shader in the pre-pass (as opposed to just the positions).
But sometimes you just cannot avoid it. For those cases, it is worthwhile to optimize the path to known coverage. That means:
discards
One fundamental limitation of the Fragment Pre-pass is that once it has encountered an incompatible draw call in a tile, all subsequent draw calls in that tile are considered incompatible. In other words, if there is an incompatible primitive early on in the tile, then the pre-pass loses its ability to cull further primitives in the tile.
For that reason, it is important to place incompatible primitives after the draw calls you want effective HSR for. One important note here is about render target masks: if a draw call does not write to all render targets that have been previously written to, then the draw call is considered transparent. If it then also writes depth or uses stencil, it is considered incompatible!
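The render target mask rule can be sketched as a small classifier. This is a simplified model under stated assumptions, not how the driver or hardware implements it: `classify_draws` and its tuple layout are hypothetical, the target indices are invented, written targets are tracked for the whole pass rather than per tile, and no other source of incompatibility is modelled.

```python
def classify_draws(draws):
    """Apply the render target mask rule to draw calls in submission order.

    Each draw is (name, targets_written, writes_depth, uses_stencil).
    """
    written = set()  # render targets written by earlier draw calls
    results = []
    for name, targets, writes_depth, uses_stencil in draws:
        targets = set(targets)
        if written <= targets:
            # Overwrites every previously written target.
            verdict = "overwrites all"
        elif writes_depth or uses_stencil:
            # Leaves an earlier output untouched and also touches depth/stencil.
            verdict = "incompatible"
        else:
            verdict = "transparent"
        results.append((name, verdict))
        written |= targets
    return results


draws = [
    ("draw 1", {0, 1, 2, 3}, True, False),   # material with an extra target
    ("draw 2", {0, 1, 2}, True, False),      # skips target 3, writes depth
    ("draw 3", {0, 1, 2}, False, False),     # skips target 3, no depth/stencil
]
verdicts = classify_draws(draws)
```

Here `draw 2` comes out as incompatible because it leaves target 3 untouched while writing depth, and `draw 3` is merely transparent because it does not touch depth or stencil.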
As an example, say you are building a G-buffer where some of your materials write some extra data to an extra render target.
This will cause Draw call 2 to be considered incompatible because it does not fully overwrite all the outputs of Draw call 1.
There are two ways to solve this particular problem:
The "perfect" draw call for the Fragment Pre-pass looks like this:
Assuming you are able to keep incompatible primitives late in the fragment pass, the fragment pre-pass culling efficiency is otherwise not sensitive to draw order.
Traditionally, rendering objects sorted by depth has been a technique used by applications to maximize the amount of Early-Z culling. With the fragment pre-pass this is no longer necessary: you will get the same amount of culling regardless of the order.
This means you can skip the expensive CPU-side sorting and just render the opaque geometry in any order.
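A toy comparison makes the order independence concrete. The sketch below is an illustration only: the primitives are invented, a smaller depth is closer, everything is opaque and compatible, and `early_z_shaded` and `prepass_shaded` are hypothetical helper names that count fragment shader invocations on a 16-pixel tile for every possible submission order.

```python
from itertools import permutations

FAR = float("inf")

# Hypothetical opaque primitives on a 16-pixel tile: (covered pixels, depth).
# Depths are distinct so there are no ties.
PRIMITIVES = [
    (range(0, 16), 0.9),
    (range(0, 8), 0.5),
    (range(4, 12), 0.1),
]


def early_z_shaded(primitives):
    """In-order depth test only: shade a fragment if it passes when drawn."""
    depth = [FAR] * 16
    shaded = 0
    for pixels, z in primitives:
        for p in pixels:
            if z < depth[p]:
                depth[p] = z
                shaded += 1
    return shaded


def prepass_shaded(primitives):
    """Resolve the final depth first, then shade only the winning fragments."""
    depth = [FAR] * 16
    for pixels, z in primitives:
        for p in pixels:
            depth[p] = min(depth[p], z)
    return sum(1 for pixels, z in primitives for p in pixels if z == depth[p])


orders = list(permutations(PRIMITIVES))
early_z_counts = sorted({early_z_shaded(order) for order in orders})
prepass_counts = sorted({prepass_shaded(order) for order in orders})
```

In this model the Early-Z-only count ranges from 16 (front to back) to 32 (back to front) depending on the order, while the pre-pass count is 16 for all six orders.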
At the moment of writing this blog post, this feature is supported by Mali-G625, Mali-G725 and Immortalis-G925.
Advice about using the Fragment Pre-pass is included in the latest release of the Arm GPU Best Practices Developer Guide. There are also updates to the advice about Ray Tracing Pipeline, Arm Fixed Rate Compression, Pipeline Caches, Dynamic Rendering, and runtime compression, and new advice about Staging Buffers, among other changes. For the latest on how to get the best out of Arm GPUs, right up to the Immortalis-G925, make sure to have a read.
Rasterization is the process of turning a primitive, usually a triangle, into individual fragments.
A fragment is one tiny part of a triangle, usually corresponding to a pixel. There can be many fragments per pixel if triangles are drawn on top of each other.
Coverage is a term graphics people like to throw around.
In the simple case where a fragment covers one pixel, and the pixel only has one sample - coverage simply means whether the fragment is visible or not.
In 3D Graphics, "Z" is the depth of a fragment. When the GPU does Z testing for a fragment, it looks in its Z buffer at the fragment location to see if this fragment's Z is behind what's in the buffer. If the fragment is behind what's in the Z buffer - the fragment is discarded. Otherwise, if the draw call enabled writing to the Z buffer, the Z buffer is updated with the value of this fragment, and execution continues.
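The Z test described above fits in a few lines. This is a minimal sketch, assuming that a larger Z value means further from the camera and that a fragment at exactly the stored depth passes; real APIs let the application choose the comparison, and `depth_test` is a hypothetical name.

```python
def depth_test(z_buffer, x, y, fragment_z, depth_write=True):
    """Return True if the fragment survives the Z test at (x, y)."""
    if fragment_z > z_buffer[y][x]:
        return False  # behind what is already in the buffer: discard
    if depth_write:
        z_buffer[y][x] = fragment_z  # record the new closest depth
    return True


# A 2x2 Z buffer cleared to the far plane (1.0).
z_buffer = [[1.0, 1.0], [1.0, 1.0]]

first = depth_test(z_buffer, 0, 0, 0.5)    # passes and writes 0.5
second = depth_test(z_buffer, 0, 0, 0.8)   # behind 0.5: discarded
third = depth_test(z_buffer, 1, 1, 0.3, depth_write=False)  # passes, no write
```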
"S" stands for stencil. It is a bit like Z, but far more programmable, and has different use cases.
Early-ZS is an optimization where ZS testing is performed before the fragment shader executes. This can only be done if the fragment shader does not modify its Z or S values in any way and does not modify its coverage. This is a very powerful and well-known optimization that can eliminate a lot of work. "Late-ZS" refers to cases where this optimization cannot be applied.
Deferred Vertex Shading (DVS) is an optimization introduced in Mali-G720/Immortalis-G920. The gist of it is that instead of doing vertex shading up-front during the tiling phase, the GPU only does the position shading to find out which bins a particular primitive covers. When that is done, the position values are thrown away but are then re-generated in the fragment shading phase. This saves DDR bandwidth at the expense of some extra computation, a trade-off that is increasingly favorable as computation capability is increasing far more rapidly than DDR bandwidth.
Fragment shader side effects are writes to memory that do not follow the usual pattern of writing to a render target. Examples of this include using imageStore or storing to a buffer.