Back in 2014, Arm introduced the Pixel Local Storage (PLS) extension for OpenGL ES. Since then, many things have happened that have transformed mobile graphics, particularly the release of the first Vulkan version in February 2016. Built from the ground up, Vulkan was intended to replace OpenGL as the main graphics API, after OpenGL had successfully served the industry for more than 20 years. The new graphics API was expected to provide a set of benefits across multiple platforms that the graphics community recognizes and values today.
As expected, the transition from OpenGL to Vulkan is taking several years. Although today the default API in the main game engines is Vulkan, all of them still support OpenGL ES 3.x for Android. OpenGL ES 3.1 brought compute to mobile graphics, and OpenGL ES 3.2 added the Android Extension Pack, bringing the mobile API's functionality significantly closer to its desktop counterpart – OpenGL. OpenGL ES 3.2 is supported in Android 6.0 and higher if the device itself supports this graphics pipeline, which is reflected in the high use of OpenGL ES by game developers.
Part 1 of this blog series looks at the PLS extension both from today’s perspective and from the time it was launched. We highlight the main benefits it introduced, so developers can make the most of it when coding their games using OpenGL ES. Some representative examples are described, and links to relevant publications and presentations are provided.
Saving power is always on the minds of mobile game developers. Reducing power usage while gaming saves battery and makes the gaming experience last longer. This power-saving benefit is what makes PLS so important for game developers who use OpenGL ES to implement their games. To understand this, we first need to look at how PLS works.
PLS takes advantage of the Mali GPU tile architecture. It is tile-based rendering that allows Mali GPUs to keep rendering power consumption low. Mali GPUs break up the screen into small regions of 16x16 or 32x32 pixels known as tiles. Rendering takes place in two passes (see Fig. 1 below). The first pass builds the list of geometric primitives that fall into each tile. In the second pass, each shader core executes the fragment shading tile by tile and writes each tile back to memory as it is completed.
During shading, due to the small size of the tile, it is possible to keep the whole set of working data (color, depth, and stencil) in an on-chip RAM within the shader core. This RAM is fast and tightly coupled to the GPU shader core. This allows saving valuable bandwidth and thus power.
Figure 1. Tile-based Rendering data flow
Originally, the per-pixel on-chip memory of Mali GPUs was mainly used for multisample anti-aliasing. PLS is an API extension that exposes this memory to the programmer so they can read and write their own per-pixel data. This data is preserved between draw calls and remains available as long as the framebuffer remains active. This persistent per-pixel storage was the key property that made PLS unique and changed the way graphics could be done on tile-based GPUs.
With PLS, it became possible to chain render tasks without flushing the data out to memory, which massively reduces bandwidth consumption. We would typically use this memory to build up the final pixel color progressively, using multiple shaders, with a final ‘resolve’ shader at the end to explicitly copy to the framebuffer output.
The programmer's view of the persistent per-pixel storage is very flexible. PLS allows each shader to declare its per-pixel ‘view’ of the PLS as a struct. This allows developers to re-interpret the data and change the view between shaders. The per-pixel view of the PLS is completely independent of the current framebuffer format. This means that what is flushed back to main memory in the end will still conform to the current framebuffer format. Below is an example of a PLS shader view:
Figure 2. PLS structure
The layout qualifier on the left specifies the data format of the individual PLS variables, with all formats being 32 bits in size. The precision and type specified in the middle describe the type the shader uses to read and write these variables. There is an implied conversion between this type and the layout format whenever you read or write the variable from your shader. For more detailed information, we recommend reading the extension specification.
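To make the three parts of a PLS declaration concrete, here is a minimal sketch of a fragment shader that declares its own view. The block name `ExampleView`, the instance name `pls`, and the member names and formats are hypothetical choices for illustration and are not taken from Figure 2; only the qualifier and the layout formats come from the extension itself.

```glsl
#version 300 es
#extension GL_EXT_shader_pixel_local_storage : require

// Hypothetical PLS view: three members, each occupying 32 bits per pixel.
//   left:   storage format in the PLS
//   middle: precision and type seen by the shader
//   right:  member name
__pixel_localEXT ExampleView {
    layout(rgba8) mediump vec4 albedo;    // 4 x 8-bit unsigned normalized
    layout(rg16f) mediump vec2 velocity;  // 2 x 16-bit float
    layout(r32ui) highp   uint flags;     // 1 x 32-bit unsigned integer
} pls;

void main() {
    // Values are converted from the shader type to the storage format
    // on write, and back again on read.
    pls.albedo   = vec4(1.0, 0.5, 0.25, 1.0);
    pls.velocity = vec2(0.0, -0.5);
    pls.flags    = 1u;
}
```

A later shader could declare a different block over the same 96 bits and re-interpret them. On the application side, the extension specification also requires pixel local storage to be enabled with `glEnable(GL_SHADER_PIXEL_LOCAL_STORAGE_EXT)` before drawing with such shaders.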
Before PLS, GLES 3.0 relied on Multiple Render Targets (MRT) for complex rendering tasks. MRT allows shader output to be written to more than one texture in a single render pass. These textures can then be used as inputs to other shaders. A common use of MRT in OpenGL is deferred rendering, which performs the lighting calculations for the entire 3D scene at once instead of on each individual object, as happens in forward rendering.
Deferred rendering is based on the idea that most of the heavy rendering (i.e. lighting) is deferred or postponed to a later stage. This approach significantly optimizes scenes with large numbers of lights and is commonly used on consoles. In the first render pass, known as the geometry pass, the scene is rendered once to retrieve geometrical information from the objects, which is stored in a collection of textures called the G-buffer. MRT is used to store the information for the lighting calculations in multiple render targets. These are then used after the entire scene has been drawn to calculate the final lit image. Typically, one render target stores color and surface information of objects, while another holds the surface normals and depth information. Additional render targets can be used to store, for example, ambient occlusion data.
The problem with MRT on mobile is that each render target is written to main memory and is then read back to retrieve the stored information. Although the combination of MRT with framebuffer fetch might seem like an alternative, PLS has some extra benefits: far fewer performance pitfalls; more flexible storage, as the data format is independent of the color attachment format; and a programming model that is closer to what the shader programmer wants to express.
Deferred rendering can be implemented efficiently on mobile using the on-chip memory exposed by the PLS extension. This is supported by the Mali-T760 onwards, allowing 16 bytes of local storage per pixel.
The figure below schematically summarizes how deferred rendering can be implemented using PLS in three passes. An excellent blog by Jan-Harald Fredriksen and an exhaustive article in GPU Pro 5 by Marius Bjorge, both among the original authors of the PLS extension, offer detailed explanations of this implementation. A more recent blog from Hans-Kristian Arntzen shares a comprehensive analysis of deferred shading on mobile, and compares PLS, framebuffer fetch and Vulkan multipass implementations. Note that in Vulkan’s multipass concept, it is not possible to directly access tile storage. Instead, we provide enough information up front to the driver, so that it can optimize a render pass to allow the current pass to access the pixel data stored by the previous pass at the same location, just like PLS.
Figure 3. Deferred rendering passes using PLS
In the G-Buffer pass, the fragment shader declares a PLS output block, which is then filled by the shader with the scene color and normals information. Note that the total memory used by the PLS output block is 128 bits (32 bits x 4).
// Write-only PLS block; the block name FragData is illustrative.
__pixel_local_outEXT FragData {
    layout(rgba8) highp vec4 Color;
    layout(rg16f) highp vec2 NormalXY;
    layout(rg16f) highp vec2 NormalZ_LightingB;
    layout(rg16f) highp vec2 LightingRG;
} gbuf;

void main() {
    gbuf.Color = calcDiffuseColor();
    vec3 normal = calcNormal();
    gbuf.NormalXY = normal.xy;
    gbuf.NormalZ_LightingB.x = normal.z;
}
The extension qualifier __pixel_local_outEXT means that the storage can be written and is persistent across shader invocations covering the same pixel. The storage is then read in another shader invocation declaring __pixel_localEXT or __pixel_local_inEXT storage.
In the second pass the same PLS is declared, but this time using the qualifier __pixel_localEXT to indicate that the storage can be read and written. The shader reads the stored color and normal data from the previous pass and writes the calculated lighting. At this point, the PLS block will contain the color, normal and accumulated lighting information.
vec3 lighting = calclighting(gbuf.NormalXY.x,
gbuf.LightingRG += lighting.xy;
gbuf.NormalZ_LightingB.y += lighting.z;
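A fuller sketch of such a lighting pass might look like the following. It is an assumption-laden illustration rather than the original implementation: the block name `FragData`, the uniforms `uLightDir` and `uLightColor`, and the helper `shadeLambert` (a simple Lambert term) are all hypothetical, and the sketch assumes the lighting members were cleared to zero before the first light is accumulated.

```glsl
#version 300 es
#extension GL_EXT_shader_pixel_local_storage : require
precision highp float;

// Same layout as the G-buffer pass, now declared readable and writable.
// The block name FragData is an illustrative choice.
__pixel_localEXT FragData {
    layout(rgba8) highp vec4 Color;
    layout(rg16f) highp vec2 NormalXY;
    layout(rg16f) highp vec2 NormalZ_LightingB;
    layout(rg16f) highp vec2 LightingRG;
} gbuf;

// Hypothetical light parameters, set by the application for each light.
uniform vec3 uLightDir;    // normalized direction towards the light
uniform vec3 uLightColor;

// Hypothetical helper: Lambert diffuse term tinted by the stored color.
vec3 shadeLambert(vec3 albedo, vec3 normal) {
    float nDotL = max(dot(normalize(normal), uLightDir), 0.0);
    return albedo * uLightColor * nDotL;
}

void main() {
    // Rebuild the normal from the two PLS members it was split across.
    vec3 normal = vec3(gbuf.NormalXY, gbuf.NormalZ_LightingB.x);
    vec3 lighting = shadeLambert(gbuf.Color.rgb, normal);

    // Accumulate into the PLS; nothing is written to the framebuffer here.
    gbuf.LightingRG += lighting.xy;
    gbuf.NormalZ_LightingB.y += lighting.z;
}
```

Drawing this shader once per light adds each contribution to the same on-chip storage, so the G-buffer never has to be written to or read back from main memory.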
The third pass only reads the lighting information from the PLS block and calculates the final pixel color value. It is worth noting that the PLS is automatically discarded once the tile is fully processed, so it has no impact on external memory bandwidth. The only data that goes off-chip is the data explicitly copied to the ‘out’ variables, at which point the PLS data is invalidated.
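As a minimal sketch of this resolve step, the shader below declares the same layout as a read-only view and copies the result to a regular fragment output. The block name `FragData`, the output name `fragColor`, and the clamp used as the final color calculation are assumptions made for illustration; a real resolve could apply tone mapping or any other final computation.

```glsl
#version 300 es
#extension GL_EXT_shader_pixel_local_storage : require
precision highp float;

// Read-only view of the same PLS layout; the block name is illustrative.
__pixel_local_inEXT FragData {
    layout(rgba8) highp vec4 Color;
    layout(rg16f) highp vec2 NormalXY;
    layout(rg16f) highp vec2 NormalZ_LightingB;
    layout(rg16f) highp vec2 LightingRG;
} gbuf;

// Regular fragment output: the only value that leaves the tile.
out vec4 fragColor;

void main() {
    // Gather the accumulated lighting from the two members that hold it.
    vec3 lighting = vec3(gbuf.LightingRG, gbuf.NormalZ_LightingB.y);

    // Hypothetical resolve: clamp the lighting to the displayable range.
    fragColor = vec4(clamp(lighting, 0.0, 1.0), 1.0);
}
```

The view is declared with `__pixel_local_inEXT` because the extension does not allow a single shader to write both to pixel local storage and to user-defined fragment outputs.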
At GDC 2014, Marius Bjorge and Sam Martin presented a talk on the 'Revolution of Mobile Game Graphics' to the game developer community. They shared a very interesting comparison of the bandwidth consumption of PLS versus MRT, shown in the following graph. We can see that the PLS approach consumes ~8x less bandwidth than the MRT approach, which is a significant difference. Less bandwidth consumption means less power and longer battery life to enjoy the game. It is worth noting that the impact of bandwidth reduction on power saving is probably even more significant if we consider improved thermals as part of the equation.
Figure 4. PLS vs MRT bandwidth consumption in deferred rendering
Less bandwidth consumption also means higher performance, as less time is spent on data traffic. A benchmark study shared on GitHub, devoted to on-chip memory management, shows that keeping data on-chip can result in a 40% performance improvement.
In the next blog, I will explore more advanced shading techniques that are possible with Pixel Local Storage.
 - Bandwidth Efficient Graphics with Arm Mali GPUs, Marius Bjorge, GPU Pro 5, p. 275.