Pixel Local Storage on ARM Mali GPUs

ARM has recently published a set of OpenGL ES extensions. Here we explain some of the background that led us to develop these and show how they can be used to implement some common graphics algorithms more efficiently.

Locality of reference

Many algorithms in computer science can be implemented more efficiently by exploiting locality of reference. That is, efficiency can be gained by making the memory access patterns of an algorithm more predictable. This is also true for computer graphics and is an underlying principle behind the tile-based architectures of the ARM Mali GPUs.

But the locality principle applies beyond tiles. Many graphics algorithms have locality at the level of individual pixels: a value written to a pixel in one operation may be read or modified by a later operation working at the exact same pixel location. Blending is a basic example of this principle that is supported on current GPUs. The value of a pixel is written while rendering a primitive and later read and modified while rendering another primitive. But there are also more complex examples, such as deferred shading, where this principle is not yet exploited. These more complex algorithms require storing multiple values per pixel location, which are finally combined in an application-specific way to produce the final pixel value. On today’s graphics APIs, these algorithms are typically implemented by a multi-pass approach. Pixel values are first written to a set of off-screen render targets, for example using the Multiple Render Target (MRT) support in OpenGL ES 3.0. In a second pass, these render targets are read as textures and used to compute the final pixel value that is written to the output framebuffer.

One obvious issue with the multi-pass approach is that the intermediate values must be written back to memory. This is far from ideal, since keeping memory bandwidth, and thereby power consumption, down is very important for mobile GPUs.
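To make the multi-pass approach concrete, the two shaders below are a minimal sketch of it in OpenGL ES 3.0 shading language. All variable and uniform names are hypothetical, the lighting is a single directional light chosen only for illustration, and the normal encoding assumes unsigned normalized render targets such as RGBA8. The first pass writes per-pixel properties to two off-screen render targets:

```glsl
#version 300 es
precision highp float;

// Hypothetical inputs from the vertex shader.
in vec3 vNormal;
in vec2 vTexCoord;
uniform sampler2D uDiffuseTexture;

// One output per off-screen render target (MRT).
layout(location = 0) out vec4 outColor;
layout(location = 1) out vec4 outNormal;

void main()
{
    outColor = texture(uDiffuseTexture, vTexCoord);
    // Map the normal from [-1, 1] to [0, 1] so it fits an unsigned target.
    outNormal = vec4(normalize(vNormal) * 0.5 + 0.5, 0.0);
}
```

The second pass then reads the same pixel location back from both render targets, which is where the extra memory traffic comes from:

```glsl
#version 300 es
precision highp float;

uniform sampler2D uColorTarget;   // render target 0 from the first pass
uniform sampler2D uNormalTarget;  // render target 1 from the first pass
uniform vec3 uLightDir;           // normalized direction towards the light
out vec4 fragColor;

void main()
{
    // Fetch exactly the pixel this fragment covers.
    ivec2 pixel = ivec2(gl_FragCoord.xy);
    vec4 color = texelFetch(uColorTarget, pixel, 0);
    vec3 normal = texelFetch(uNormalTarget, pixel, 0).xyz * 2.0 - 1.0;

    float ndotl = max(dot(normalize(normal), uLightDir), 0.0);
    fragColor = vec4(color.rgb * ndotl, 1.0);
}
```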

A more efficient approach is possible on the ARM Mali GPUs. As mentioned above, ARM Mali GPUs have a tile-based architecture. As described in a previous blog post by Peter Harris (The Mali GPU: An Abstract Machine, Part 2 - Tile-based Rendering), this means that we perform fragment shading on 16x16 tiles at a time. All memory required to store the framebuffer values for a tile is kept on-chip until all fragment shading for the tile is complete. This property led us to develop a set of extensions that enable applications to better exploit the locality principle, which we generally refer to as pixel local storage. The first two extensions, ARM_shader_framebuffer_fetch and ARM_shader_framebuffer_fetch_depth_stencil, add the ability to return the current color, depth, and stencil values of a pixel to the fragment shader. The third extension, EXT_shader_pixel_local_storage, enables applications to store custom data per pixel.

Shader Framebuffer Fetch

ARM_shader_framebuffer_fetch enables applications to read the current framebuffer color from the fragment shader. This is useful for techniques such as programmable blending. An example of how this could be used is shown below.

#extension GL_ARM_shader_framebuffer_fetch : enable
precision mediump float;
uniform vec4 uBlend0;
uniform vec4 uBlend1;

void main(void)
{
     // Current framebuffer color at this pixel.
     vec4 color = gl_LastFragColorARM;
     // Blend towards uBlend0, weighted by the destination alpha and uBlend0's alpha.
     color = mix(color, uBlend0, color.w * uBlend0.w);
     color *= uBlend1;

     gl_FragColor = color;
}


ARM_shader_framebuffer_fetch_depth_stencil additionally allows applications to read the current depth and stencil values from the framebuffer. This enables use cases such as programmable depth and stencil testing, modulating shadows, soft particles, and creating variance shadow maps in a single render pass. Example code for the last two use cases is included in the Bandwidth Efficient Graphics with ARM Mali GPUs white paper.
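As a rough illustration of the soft particles use case, the sketch below fades a particle out as it approaches the opaque geometry already in the depth buffer. It is not the code from the white paper. The uniform and varying names are hypothetical, and it assumes a standard OpenGL perspective projection, the default depth range of 0 to 1, and that depth writes are disabled while particles are drawn.

```glsl
#extension GL_ARM_shader_framebuffer_fetch_depth_stencil : enable
#ifdef GL_FRAGMENT_PRECISION_HIGH
precision highp float;
#else
precision mediump float;
#endif

// Hypothetical inputs, chosen for this sketch only.
uniform sampler2D uParticleTexture;
uniform float uNear;          // near plane distance
uniform float uFar;           // far plane distance
uniform float uFadeDistance;  // view-space fade range, must be > 0
varying vec2 vTexCoord;

// Convert a [0, 1] window-space depth to a view-space distance,
// assuming a standard perspective projection.
float linearDepth(float depth)
{
    float ndc = depth * 2.0 - 1.0;
    return (2.0 * uNear * uFar) / (uFar + uNear - ndc * (uFar - uNear));
}

void main(void)
{
    // Depth already stored in the framebuffer at this pixel.
    float sceneDepth = linearDepth(gl_LastFragDepthARM);
    // Depth of the particle fragment being shaded.
    float particleDepth = linearDepth(gl_FragCoord.z);

    // 0 where the particle touches the scene, 1 once it is
    // at least uFadeDistance in front of it.
    float fade = clamp((sceneDepth - particleDepth) / uFadeDistance, 0.0, 1.0);

    vec4 color = texture2D(uParticleTexture, vTexCoord);
    gl_FragColor = vec4(color.rgb, color.a * fade);
}
```

Without the extension, the same effect would typically require rendering the scene depth to a texture first and sampling it in the particle pass.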

Shader Pixel Local Storage

EXT_shader_pixel_local_storage enables applications to store and retrieve arbitrary values at a given pixel location. This is a powerful principle that enables algorithms such as deferred shading to be implemented without incurring a large bandwidth cost. The amount of storage per pixel is implementation-defined, but the extension guarantees that there is storage for at least 16 bytes per pixel.

You will notice that this is an “EXT” extension rather than a vendor-specific “ARM” extension. In OpenGL ES parlance, an “EXT” means multi-vendor. In this case, we worked with other industry players to define the extension, in order to ensure that it works well on their hardware as well as on ours.

So how does it work? Let’s look at a deferred shading example. A typical implementation of this technique using EXT_shader_pixel_local_storage splits the rendering into three passes: a G-Buffer generation pass, where the properties (diffuse color, normal, etc.) of each pixel are stored in pixel local storage; a Shading pass, where lighting is calculated based on the stored properties and accumulated in pixel local storage; and a Combination pass, which uses the values in pixel local storage to calculate the final value of the pixel. These passes are outlined below. For a complete example and further descriptions of the algorithm, refer to the code sample on GitHub.

In the G-Buffer generation pass, instead of writing to regular color outputs, the fragment shader would declare a pixel local storage output block:

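#version 300 es
#extension GL_EXT_shader_pixel_local_storage : require
precision highp float;

// Four 32-bit members: 16 bytes of pixel local storage per pixel.
__pixel_local_outEXT FragData
{
     layout(rgba8) highp vec4 Color;
     layout(rg16f) highp vec2 NormalXY;
     layout(rg16f) highp vec2 NormalZ_LightingB;
     layout(rg16f) highp vec2 LightingRG;
} gbuf;

// calcDiffuseColor() and calcNormal() are application-defined helpers (not shown).
void main()
{
     gbuf.Color = calcDiffuseColor();
     vec3 normal = calcNormal();
     gbuf.NormalXY = normal.xy;
     gbuf.NormalZ_LightingB.x = normal.z;
     // Clear the lighting accumulators so the Shading pass can add to them.
     gbuf.NormalZ_LightingB.y = 0.0;
     gbuf.LightingRG = vec2(0.0);
}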

The shader would use this block to store the G-Buffer values in the pixel local storage. The image below illustrates what the contents of the pixel local storage might look like at the end of this pass. Keep in mind that only one tile’s worth of data would be stored at any given time.

[Image: Pixel local storage pass]

In the Shading pass, the same pixel local storage block would be used to accumulate lighting. In this case, the pixel local storage block would be both read from and written to:

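#version 300 es
#extension GL_EXT_shader_pixel_local_storage : require
precision highp float;

// Same layout as in the G-Buffer pass, now both readable and writable.
__pixel_localEXT FragData
{
     layout(rgba8) highp vec4 Color;
     layout(rg16f) highp vec2 NormalXY;
     layout(rg16f) highp vec2 NormalZ_LightingB;
     layout(rg16f) highp vec2 LightingRG;
} gbuf;

// calcLighting() is an application-defined helper (not shown).
void main()
{
     vec3 lighting = calcLighting(gbuf.NormalXY.x,
                                  gbuf.NormalXY.y,
                                  gbuf.NormalZ_LightingB.x);
     // Accumulate this light's contribution on top of the stored values.
     gbuf.LightingRG += lighting.xy;
     gbuf.NormalZ_LightingB.y += lighting.z;
}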

At this point, the contents of the pixel local storage would also include the accumulated lighting (see image below):

[Image: Pixel local storage accumulated lighting]

Finally, the Combination pass would read from the pixel local storage and calculate the final pixel value:

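#version 300 es
#extension GL_EXT_shader_pixel_local_storage : require
precision highp float;

// Same layout as in the earlier passes, read-only here.
__pixel_local_inEXT FragData
{
     layout(rgba8) highp vec4 Color;
     layout(rg16f) highp vec2 NormalXY;
     layout(rg16f) highp vec2 NormalZ_LightingB;
     layout(rg16f) highp vec2 LightingRG;
} gbuf;

out highp vec4 fragColor;

// resolve() is an application-defined helper (not shown).
void main()
{
     fragColor = resolve(gbuf.Color,
                         gbuf.LightingRG.x,
                         gbuf.LightingRG.y,
                         gbuf.NormalZ_LightingB.y);
}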

We now have our final image (see below) and the pixel local storage is no longer valid.

[Image: Pixel local storage final image]

The important point here is that the pixel local storage data is never written back to memory! The memory for the pixel local storage is kept on-chip throughout and incurs no bandwidth cost. This is significantly more efficient than existing solutions that would require writing 16 bytes of data per pixel for the G-Buffer pass and subsequently reading the same amount of data back again in the Shading and Combination passes.
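To put a rough number on this, assume a 1920x1080 framebuffer rendered at 60 frames per second, with the 16-byte G-Buffer written once and read back once per frame. The resolution and frame rate are illustrative assumptions, and the result is a lower bound that ignores overdraw and repeated reads for multiple lights:

```latex
\begin{align*}
\text{G-Buffer size} &= 1920 \times 1080 \times 16\ \text{bytes}
  = 33{,}177{,}600\ \text{bytes} \approx 33.2\ \text{MB} \\
\text{traffic per frame} &= 2 \times 33.2\ \text{MB} \approx 66.4\ \text{MB}
  \quad \text{(one write, one read)} \\
\text{traffic per second} &= 66.4\ \text{MB} \times 60 \approx 3.98\ \text{GB/s} \\
\text{on-chip storage per tile} &= 16 \times 16 \times 16\ \text{bytes}
  = 4096\ \text{bytes} = 4\ \text{KiB}
\end{align*}
```

The last line is the amount of pixel local storage needed for one 16x16 tile at 16 bytes per pixel, which is what replaces the off-chip traffic above.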

It is also worth pointing out that the above example does not store the depth value in pixel local storage. This is not necessary since ARM_shader_framebuffer_fetch_depth_stencil works well in combination with pixel local storage, effectively increasing the amount of application-specific data that can be stored per pixel.
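As a sketch of how the two extensions might be combined, the Shading pass below rebuilds the world-space position of each pixel from the depth already in the framebuffer, rather than spending pixel local storage on it. The uniform names and the simple point-light model are hypothetical. The sketch assumes the viewport origin is at (0, 0), the default depth range, and that the normal was stored unencoded as in the G-Buffer pass above.

```glsl
#version 300 es
#extension GL_EXT_shader_pixel_local_storage : require
#extension GL_ARM_shader_framebuffer_fetch_depth_stencil : require
precision highp float;

__pixel_localEXT FragData
{
    layout(rgba8) highp vec4 Color;
    layout(rg16f) highp vec2 NormalXY;
    layout(rg16f) highp vec2 NormalZ_LightingB;
    layout(rg16f) highp vec2 LightingRG;
} gbuf;

// Hypothetical uniforms, chosen for this sketch only.
uniform mat4 uInvViewProj;  // inverse of the view-projection matrix
uniform vec2 uInvViewport;  // 1.0 / viewport size in pixels
uniform vec3 uLightPos;     // world-space light position
uniform vec3 uLightColor;

void main()
{
    // Pixel position and stored depth, mapped from [0, 1] to NDC [-1, 1].
    vec3 ndc = vec3(gl_FragCoord.xy * uInvViewport, gl_LastFragDepthARM) * 2.0 - 1.0;
    vec4 world = uInvViewProj * vec4(ndc, 1.0);
    vec3 position = world.xyz / world.w;

    vec3 normal = normalize(vec3(gbuf.NormalXY, gbuf.NormalZ_LightingB.x));
    vec3 toLight = uLightPos - position;
    float ndotl = max(dot(normal, normalize(toLight)), 0.0);
    // Simple inverse-square style falloff, for illustration only.
    vec3 lighting = uLightColor * ndotl / (1.0 + dot(toLight, toLight));

    // Accumulate into pixel local storage, as in the Shading pass above.
    gbuf.LightingRG += lighting.xy;
    gbuf.NormalZ_LightingB.y += lighting.z;
}
```

Depth writes would need to be disabled while the light geometry is drawn, so that the stored scene depth is preserved for later lights.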

Conclusion

We are very excited about the possibilities opened up by these extensions. These pave the way for algorithms such as deferred shading to be implemented efficiently on mobile GPUs.

And it’s not just about efficiency: these extensions allow you to express the algorithm more directly compared to using an approach based around MRTs. Support for framebuffer fetch from MRTs could avoid some of the bandwidth issues for deferred shading, but would require a more complex implementation. In addition to creating and managing the textures and render buffers for the off-screen render passes, the application would have to provide the appropriate hints, like glInvalidateFramebuffer, to prevent the off-screen render targets from being written to memory. It would also have to rely on clever driver heuristics to avoid the memory being allocated in the first place. Using the extensions presented here, these complexities go away. Everything happens in the fragment shader, allowing you to focus on the core of your algorithm rather than complex state management.

ARM will support these extensions on all GPUs based on the Midgard Architecture. Support for ARM_shader_framebuffer_fetch and ARM_shader_framebuffer_fetch_depth_stencil is also expected to be added to the Mali-400 series of GPUs later this year.

What ideas do you have for using these extensions? We'd be interested in hearing them, so let us know in the comments below.

  • Thanks janharaldfredriksen,

    As it stands, the SPLS extension is remarkably useful for many use cases, and was very forward-thinking in its implementation. I'm really looking forward to seeing how this idea evolves further! The vision to increasingly keep memory entirely on chip is a game changer, and one that may advance mobile GPU performance much faster than the evolution previously enjoyed by outlet (mains) powered devices.

    For example: if a complete row of tile buffers (e.g. 1920x16 pixels of 128 bits == ~500KB) could be stored on chip (i.e. SRAM or a partition of L2 cache), and the fragment shading for a subsequent pass was cleverly deferred until after the row had been fully resolved, a safe "radius" of buffer reads around a fragment being processed could be established without ever having to access off-chip memory for all pixels. In this way, even screen-space fragment shading that needs to sample a modest radius of nearby buffer pixels (e.g. 6 pixels all around) could be made to reside entirely on chip!

    An image is below to make the concept more clear:

    on-chip-deferred-fragment-pass.png

    One limitation of such a technique is that the deferred buffer pass could not be read by subsequent passes unless it was written out to memory. In other words, optionally keeping everything on-chip may increase fragment processing load (doing the work multiple times) in cases where subsequent passes rely upon the deferred pass. Strategies to mitigate this may include simply combining passes.

  • Hi Sean,

    I agree that this would be a quite useful ability. The extension, however, generates an error if multiple color attachments are in use while PLS is active, even if the PLS values have been resolved. Unfortunately, this prevents you from writing out values to multiple targets as you describe.

  • Hi chrisvarns,

    Certainly! My targeted use-case involves things that are pretty common: screen space post-processing bloom and motion-blur. For example, being able to do a deferred lighting pass on-chip (saving bandwidth), but also having access to per-pixel luminance information in an off-chip buffer for a stepped-down bloom and tone-mapping pass would be very useful to do. Motion blur would equally benefit from being able to write-out a buffer of per-pixel motion vectors.

    Since these effects require entire resolved buffers to query, they would have to be written out to memory. But certain intermediate buffers can remain on-chip (if it's possible), meaning a potentially significant savings even in the presence of these screen-space effects. For example, the per-pixel normal data used in deferred lighting need never be written out to memory, meaning a savings of 32 bits/pix! It's not much, but this is a reasonably naive example, and with exploration it may lend itself to more clever uses with far greater savings.

  • Hi Sean,

    I'm interested to hear about different use cases. What is your use case for using PLS, then resolving it and writing out to multiple targets?

    Thanks,

    Chris

  • Wonderful, janharaldfredriksen!

    Thanks very much for the feedback. It is very helpful for getting an idea of what's happening. I've also been re-working something that I've been writing, and I should be able to fit it into the 128 bits of the accessible tile memory, which is a relief. Thankfully, the depth/stencil does not need to be stored via PLS, which is very helpful.

    One last question:

    Would it be possible to write out to multiple targets after the PLS has been used?
