ARM has recently published a set of OpenGL ES extensions. Here we explain some of the background that led us to develop these and show how they can be used to implement some common graphics algorithms more efficiently.
Many algorithms in computer science can be implemented more efficiently by exploiting locality of reference. That is, efficiency can be gained by making the memory access patterns of an algorithm more predictable. This is also true for computer graphics and is an underlying principle behind the tile-based architectures of the ARM Mali GPUs.
But the locality principle applies beyond tiles. Many graphics algorithms have locality at the level of individual pixels: a value written to a pixel in one operation may be read or modified by a later operation working at the exact same pixel location. Blending is a basic example of this principle that is supported on current GPUs. The value of a pixel is written while rendering a primitive and later read and modified while rendering another primitive. But there are also more complex examples, such as deferred shading, where this principle is not yet exploited. These more complex algorithms require storing multiple values per pixel location, which are finally combined in an application-specific way to produce the final pixel value. On today’s graphics APIs, these algorithms are typically implemented by a multi-pass approach. Pixel values are first written to a set of off-screen render targets, for example using the Multiple Render Target (MRT) support in OpenGL ES 3.0. In a second pass, these render targets are read as textures and used to compute the final pixel value that is written to the output framebuffer.
One obvious issue with the multi-pass approach is that the intermediate values must be written back to memory. This is far from ideal, since keeping memory bandwidth (and thereby power) down is very important for mobile GPUs.
A more efficient approach is possible on the ARM Mali GPUs. As mentioned above, ARM Mali GPUs have a tile-based architecture. As described in a previous blog post by Peter Harris (The Mali GPU: An Abstract Machine, Part 2 - Tile-based Rendering), this means that we perform fragment shading on 16x16 tiles at a time. All memory required to store the framebuffer values for a tile is kept on-chip until all fragment shading for the tile is complete. This property led us to develop a set of extensions that enables applications to better exploit the locality principle, or what we generally refer to as pixel local storage. The first two extensions, ARM shader framebuffer fetch and ARM shader framebuffer fetch depth stencil, add the ability to return the current color, depth, and stencil values of a pixel to the fragment shader. The third extension, EXT shader pixel local storage, enables applications to store custom data per pixel.
ARM shader framebuffer fetch enables applications to read the current framebuffer color from the fragment shader. This is useful for techniques such as programmable blending. An example of how this would be used is shown below.
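    #extension GL_ARM_shader_framebuffer_fetch : enable
    precision mediump float;

    uniform vec4 uBlend0;
    uniform vec4 uBlend1;

    void main(void)
    {
        // Read the color currently stored in the framebuffer at this pixel.
        vec4 color = gl_LastFragColorARM;
        // Interpolate toward uBlend0, weighted by the product of the two alphas.
        color = mix(color, uBlend0, color.w * uBlend0.w);
        color *= uBlend1;
        gl_FragColor = color;
    }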
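To make the per-pixel arithmetic of a programmable blend like this concrete, here is a minimal CPU reference in C. It is only a sketch: the `Vec4` type and the `mixf` and `blend_reference` names are made up for this illustration, `dst` stands in for the value a shader would read from `gl_LastFragColorARM`, and all inputs are assumed to be normalized colors in the range 0 to 1.

```c
#include <assert.h>

/* Hypothetical RGBA color type, used only for this illustration. */
typedef struct { float x, y, z, w; } Vec4;

/* Scalar linear interpolation with the same meaning as GLSL mix(a, b, t). */
static float mixf(float a, float b, float t)
{
    return a * (1.0f - t) + b * t;
}

/*
 * dst    : color already stored in the framebuffer at this pixel
 * blend0 : constant color to interpolate toward
 * blend1 : constant color that modulates the result
 */
Vec4 blend_reference(Vec4 dst, Vec4 blend0, Vec4 blend1)
{
    /* Interpolation weight: product of destination and constant alpha. */
    float t = dst.w * blend0.w;
    Vec4 out;

    /* Assumption: inputs are normalized, so the weight stays in [0, 1]. */
    assert(t >= 0.0f && t <= 1.0f);

    out.x = mixf(dst.x, blend0.x, t) * blend1.x;
    out.y = mixf(dst.y, blend0.y, t) * blend1.y;
    out.z = mixf(dst.z, blend0.z, t) * blend1.z;
    out.w = mixf(dst.w, blend0.w, t) * blend1.w;
    return out;
}
```

For example, a half-transparent red destination (1, 0, 0, 0.5) blended toward opaque blue (0, 0, 1, 1) with a white modulation color gives (0.5, 0, 0.5, 0.75). A reference like this can be handy for checking shader output in a test harness.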
ARM shader framebuffer fetch depth stencil additionally allows applications to read the current depth and stencil values from the framebuffer. This enables use cases such as programmable depth and stencil testing, modulating shadows, soft particles, and creating variance shadow maps in a single render pass. Example code for the last two use cases is included in the Bandwidth Efficient Graphics with ARM Mali GPUs white paper.
EXT shader pixel local storage enables applications to store and retrieve arbitrary values at a given pixel location. This is a powerful principle that enables algorithms such as deferred shading to be implemented without incurring a large bandwidth cost. The amount of storage per pixel is implementation defined, but the extension guarantees that there is storage for at least 16 bytes per pixel.
You will notice that this is an “EXT” extension rather than a vendor-specific “ARM” extension. In OpenGL ES parlance, an “EXT” means multi-vendor. In this case, we worked with other industry players to define the extension, in order to ensure that it works well on their hardware as well as on ours.
So how does it work? Let’s look at a deferred shading example. A typical implementation of this technique using EXT shader pixel local storage splits the rendering into three passes: a G-Buffer generation pass where the properties (diffuse color, normal, etc.) of each pixel are stored in pixel local storage, a Shading pass where lighting is calculated based on the stored properties and accumulated in pixel local storage, and a Combination pass that uses the values in pixel local storage to calculate the final value of the pixel. These passes are outlined below. For a complete example and further descriptions of the algorithm, refer to the code sample on GitHub.
In the G-Buffer generation pass, instead of writing to regular color outputs, the fragment shader would declare a pixel local storage output block:
    __pixel_local_outEXT FragData
    {
        layout(rgba8) highp vec4 Color;
        layout(rg16f) highp vec2 NormalXY;
        layout(rg16f) highp vec2 NormalZ_LightingB;
        layout(rg16f) highp vec2 LightingRG;
    } gbuf;

    void main()
    {
        gbuf.Color = calcDiffuseColor();
        vec3 normal = calcNormal();
        gbuf.NormalXY = normal.xy;
        gbuf.NormalZ_LightingB.x = normal.z;
    }
The shader would use this block to store the G-Buffer values in the pixel local storage. The image below illustrates what the contents of the pixel local storage might look like at the end of this pass. Keep in mind that only one tile’s worth of data would be stored at any given time.
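As a rough check on the storage budget, the G-Buffer block has one rgba8 member and three rg16f members. Each of these formats occupies 32 bits, so the block adds up to 16 bytes per pixel, which is exactly the guaranteed minimum. The C sketch below tallies this and the resulting storage for one 16x16 tile. The `PlsFormat` enum and the function names are hypothetical, and the sketch assumes members are packed back to back with no padding.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tags for the two storage formats used in the G-Buffer block. */
typedef enum { PLS_RGBA8, PLS_RG16F } PlsFormat;

enum {
    PLS_GUARANTEED_BYTES = 16, /* minimum per-pixel storage the extension guarantees */
    TILE_WIDTH = 16,           /* tile size quoted in this post */
    TILE_HEIGHT = 16
};

/* Bytes occupied by one member of the given format. */
static size_t pls_format_bytes(PlsFormat format)
{
    switch (format) {
    case PLS_RGBA8: return 4; /* 4 channels x 8 bits */
    case PLS_RG16F: return 4; /* 2 channels x 16 bits */
    }
    assert(!"unknown format");
    return 0;
}

/* Total per-pixel size of a block, assuming tightly packed members. */
size_t pls_layout_bytes(const PlsFormat *members, size_t count)
{
    size_t total = 0;
    for (size_t i = 0; i < count; ++i)
        total += pls_format_bytes(members[i]);
    return total;
}

/* On-chip bytes needed to hold the block for every pixel of one tile. */
size_t pls_tile_bytes(const PlsFormat *members, size_t count)
{
    return pls_layout_bytes(members, count) * TILE_WIDTH * TILE_HEIGHT;
}
```

With the four members of the G-Buffer block this gives 16 bytes per pixel and 4096 bytes per tile. An application that wants more than the guaranteed minimum would need to query the implementation's actual limit at runtime; the extension specification defines a limit for this purpose (`GL_MAX_SHADER_PIXEL_LOCAL_STORAGE_SIZE_EXT`, to the best of my knowledge).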
In the Shading pass, the same pixel local storage block would be used to accumulate lighting. In this case, the pixel local storage block would be both read from and written to:
    __pixel_localEXT FragData
    {
        layout(rgba8) highp vec4 Color;
        layout(rg16f) highp vec2 NormalXY;
        layout(rg16f) highp vec2 NormalZ_LightingB;
        layout(rg16f) highp vec2 LightingRG;
    } gbuf;

    void main()
    {
        vec3 lighting = calclighting(gbuf.NormalXY.x,
                                     gbuf.NormalXY.y,
                                     gbuf.NormalZ_LightingB.x);
        gbuf.LightingRG += lighting.xy;
        gbuf.NormalZ_LightingB.y += lighting.z;
    }
At this point, the contents of the pixel local storage would also include the accumulated lighting (see image below):
Finally, the Combination pass would read from the pixel local storage and calculate the final pixel value:
    __pixel_local_inEXT FragData
    {
        layout(rgba8) highp vec4 Color;
        layout(rg16f) highp vec2 NormalXY;
        layout(rg16f) highp vec2 NormalZ_LightingB;
        layout(rg16f) highp vec2 LightingRG;
    } gbuf;

    out highp vec4 fragColor;

    void main()
    {
        fragColor = resolve(gbuf.Color,
                            gbuf.LightingRG.x,
                            gbuf.LightingRG.y,
                            gbuf.NormalZ_LightingB.y);
    }
We now have our final image (see below) and the pixel local storage is no longer valid.
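The data flow across the three passes can also be pictured with a small CPU-side analogy in C. This is purely illustrative and not how the GPU or driver is implemented: the `PixelLocal` struct, the function names, the single-channel "color", and the simplistic lighting term are all assumptions made to keep the sketch short. The point it tries to show is that one tile-sized scratch array is written, updated in place, and resolved, and that only the resolved result leaves it.

```c
#include <assert.h>
#include <stddef.h>

enum { TILE_W = 16, TILE_H = 16, TILE_PIXELS = TILE_W * TILE_H };

/* Hypothetical CPU stand-in for one pixel's local storage. */
typedef struct {
    float diffuse;  /* single-channel stand-in for the diffuse color */
    float normal_z; /* stand-in for the surface normal */
    float lighting; /* accumulated lighting */
} PixelLocal;

/* Pass 1: store surface properties (lighting is cleared here by assumption). */
void gbuffer_pass(PixelLocal *tile, float diffuse, float normal_z)
{
    assert(tile != NULL);
    for (int i = 0; i < TILE_PIXELS; ++i) {
        tile[i].diffuse = diffuse;
        tile[i].normal_z = normal_z;
        tile[i].lighting = 0.0f;
    }
}

/* Pass 2: read the stored normal and accumulate one light in place.
 * The light is assumed to shine straight along +z. */
void shading_pass(PixelLocal *tile, float intensity)
{
    assert(tile != NULL);
    for (int i = 0; i < TILE_PIXELS; ++i) {
        float n_dot_l = tile[i].normal_z > 0.0f ? tile[i].normal_z : 0.0f;
        tile[i].lighting += intensity * n_dot_l;
    }
}

/* Pass 3: combine the stored values into the final pixel value. */
void combine_pass(const PixelLocal *tile, float *out)
{
    assert(tile != NULL && out != NULL);
    for (int i = 0; i < TILE_PIXELS; ++i)
        out[i] = tile[i].diffuse * tile[i].lighting;
}
```

Running the G-Buffer pass once, the shading pass once per light, and then the combination pass mirrors the sequence described above; the `tile` array can be reused for the next tile because only `out` needs to survive.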
The important point here is that the pixel local storage data is never written back to memory! The memory for the pixel local storage is kept on-chip throughout and incurs no bandwidth cost. This is significantly more efficient than existing solutions, which would require writing 16 bytes of data per pixel in the G-Buffer pass and subsequently reading the same amount of data back again in the Shading and Combination passes.
It is also worth pointing out that the above example does not store the depth value in pixel local storage. This is not necessary since ARM shader framebuffer fetch depth stencil works well in combination with pixel local storage, effectively increasing the amount of application-specific data that can be stored per pixel.
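To put a rough number on the traffic a multi-pass approach would generate, the C sketch below multiplies out the figures. The 16 bytes per pixel come from the example above; the 1920x1080 resolution, the 60 frames per second, and the assumption that the G-Buffer is read back exactly once per frame are illustrative choices, not measurements. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Size of one full-screen G-Buffer in bytes. */
uint64_t gbuffer_bytes_per_frame(uint32_t width, uint32_t height,
                                 uint32_t bytes_per_pixel)
{
    return (uint64_t)width * height * bytes_per_pixel;
}

/*
 * External memory traffic per second for a multi-pass G-Buffer:
 * one full write per frame plus `read_passes` full reads per frame.
 */
uint64_t multipass_traffic_per_second(uint32_t width, uint32_t height,
                                      uint32_t bytes_per_pixel,
                                      uint32_t read_passes, uint32_t fps)
{
    assert(width > 0 && height > 0 && fps > 0);
    uint64_t per_frame = gbuffer_bytes_per_frame(width, height, bytes_per_pixel);
    return per_frame * (1u + read_passes) * fps;
}
```

Under these assumptions the G-Buffer is 33,177,600 bytes per frame, and writing it once and reading it once at 60 frames per second comes to 3,981,312,000 bytes per second, or roughly 4 GB/s. Every additional lighting pass that re-reads the G-Buffer adds to this, whereas the pixel local storage approach keeps this data on-chip.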
We are very excited about the possibilities opened up by these extensions. These pave the way for algorithms such as deferred shading to be implemented efficiently on mobile GPUs.
And it’s not just about efficiency: these extensions allow you to express the algorithm more directly compared to using an approach based around MRTs. Support for framebuffer fetch from MRTs could avoid some of the bandwidth issues for deferred shading, but would require a more complex implementation. In addition to creating and managing the textures and render buffers for the off-screen render passes, the application would have to provide the appropriate hints, like glInvalidateFramebuffer, to prevent the off-screen render targets from being written to memory. It would also have to rely on clever driver heuristics to avoid the memory being allocated in the first place. Using the extensions presented here, these complexities go away. Everything happens in the fragment shader, allowing you to focus on the core of your algorithm rather than complex state management.
ARM will support these extensions on all GPUs based on the Midgard Architecture. Support for ARM_shader_framebuffer_fetch and ARM_shader_framebuffer_fetch_depth_stencil is also expected to be added to the Mali-400 series of GPUs later this year.
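Since support varies between GPUs, an application would normally check for each extension at runtime before relying on it. The C sketch below shows one way to test for a name in a space-separated extension list, such as the string returned by `glGetString(GL_EXTENSIONS)`. The `has_extension` helper is hypothetical. It matches whole tokens only, which matters here because `GL_ARM_shader_framebuffer_fetch` is a prefix of `GL_ARM_shader_framebuffer_fetch_depth_stencil`.

```c
#include <assert.h>
#include <string.h>

/*
 * Returns 1 if `name` appears as a complete, space-delimited token in
 * `extension_list`, and 0 otherwise.
 */
int has_extension(const char *extension_list, const char *name)
{
    assert(extension_list != NULL && name != NULL);

    size_t len = strlen(name);
    if (len == 0)
        return 0;

    const char *p = extension_list;
    while ((p = strstr(p, name)) != NULL) {
        /* A match only counts if it is bounded by spaces or string ends. */
        int starts_token = (p == extension_list) || (p[-1] == ' ');
        int ends_token = (p[len] == ' ') || (p[len] == '\0');
        if (starts_token && ends_token)
            return 1;
        p += len; /* skip this partial match and keep searching */
    }
    return 0;
}
```

In an application with a current OpenGL ES context, the list would typically come from `(const char *)glGetString(GL_EXTENSIONS)`, and the result would decide whether to use the pixel local storage path or fall back to a multi-pass approach.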
What ideas do you have for using these extensions? We'd be interested in hearing them, so let us know in the comments below.
Thanks, Jan-Harald Fredriksen!
This is an incredibly informative read that shows just how effective keeping things on chip can be at reducing memory bandwidth and speeding things up! The deferred lighting example is a very good one, though I'm sure this will work for a great number of creative cases outside of deferred rendering. I hope that an exploration of alternative techniques will be shared at some point!
I have a handful of questions about Shader Pixel Local Storage.
1) Can Shader Pixel Local Storage be used in combination with writing out to buffers in physical memory? For example, some alternate effect that may use certain render components to be later composited for a final scene (eg. screen space effects). If not, how could Shader Pixel Local Storage be used in combination with rendering to a texture?
2) I understand that MRTs are also an option to accomplish something similar, and I've read a blog post stating that Mali also uses a similar on-chip strategy to deal with bandwidth. Is this incorrect, or would MRTs be a no-penalty substitute if writing less than 16 bytes of data per pixel?
3) Does reading from Shader Pixel Local storage use the Texture Pipeline or the Data Load/Store pipeline? Put another way: will reading a member of SPLS from a tile eat into the hardware limit of the number of data reads possible?
4) For modern Mali implementations like the T760, is the Shader Pixel Local Storage larger than 16 bytes per pixel? 16 bytes seems to be pretty small, leaving very little wiggle room for storing more exotic components.
Lastly, I notice that you are declaring highp for the rg16f components in the gbuf PLS struct. Is this correct? Wouldn't a vec2 of 16-bit floats be mediump? Perhaps I'm missing something (I'm rather new to all this).