In the last six years we have seen huge improvements in the performance of mobile system-on-a-chip designs, with gains spanning the CPU (100x), the GPU (300x), and the DDR memory system (20x). These improvements finally make it possible to treat high-end mobile phones as a target for the types of graphics algorithms which would previously only have been possible on games consoles and PC platforms. However, the mobile form factor places obstacles in the path of developers which must be intelligently avoided.
Figure 1: Digital Legends: Afterpulse
The most significant limitation of smartphones from a performance point of view is the form factor. Passively cooling a chip inside a sealed case is never an easy task, and the ability of a device to dissipate heat will determine how much power it can sustainably draw during gameplay. The challenge for AAA content on mobile is to get as much useful work as possible out of that thermally stable power budget, which is somewhere around 3 watts for a typical smartphone. This blog explores some of the recommended techniques for Mali GPUs, using the Afterpulse third-person multiplayer shooter from Digital Legends as a case study.
One of my favorite quotes, from Peter Drucker’s book “The Effective Executive”, is:
Efficiency is doing things right; effectiveness is doing the right things.
While originally aimed at business managers, this quote can be re-tasked for our purposes of achieving the best possible performance in game content. Whenever working on content optimization, it is human nature to focus effort on the algorithms and assets already implemented in the game – efficiency – rather than taking a step back to review if new approaches in the rendering pipeline or more significant reworking of assets could give bigger improvements – effectiveness.
What is “effectiveness” for console quality gaming on mobile? In my opinion it means spending energy on CPU cycles, GPU cycles, and DDR memory accesses which result in a visible output on the screen. Any cycle or byte we spend on something which is not visible is energy wasted and visual quality lost. Here are the top six principles which should be considered when trying to optimize content for mobile devices.
The game application is the top of the stack and the only part which has overall knowledge of the scene structure; by the time draw operations reach the graphics driver that structural knowledge has been lost and cannot be exploited. Applications must aggressively exploit their scene knowledge to cull work which can be proven to be out of camera-shot, using techniques such as scene-graph node bounding boxes, zone and portal visibility information, primitive visibility sets, etc.
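As a rough illustration of the bounding-volume tests mentioned above, here is a minimal sketch of a sphere-versus-frustum check. The `Plane`, `Sphere`, and `Frustum` types and the `makeBoxFrustum` helper are hypothetical names invented for this example, and the planes are assumed to have unit-length normals pointing into the frustum; a real engine would extract them from its own camera matrices.

```cpp
#include <array>
#include <cassert>

// Plane stored as nx*x + ny*y + nz*z + d = 0, with a unit-length normal
// that points towards the inside of the frustum (an assumption of this sketch).
struct Plane  { float nx, ny, nz, d; };
struct Sphere { float x, y, z, radius; };

// Hypothetical frustum type: six inward-facing planes.
using Frustum = std::array<Plane, 6>;

// Returns false only when the sphere lies entirely behind at least one plane,
// which proves the object is out of shot and its draw call can be skipped.
inline bool sphereMayBeVisible(const Frustum& frustum, const Sphere& s) {
    assert(s.radius >= 0.0f);
    for (const Plane& p : frustum) {
        const float signedDistance = p.nx * s.x + p.ny * s.y + p.nz * s.z + p.d;
        if (signedDistance < -s.radius) {
            return false;  // fully outside this plane
        }
    }
    return true;  // inside or straddling; must be submitted
}

// Illustration-only helper: a cube-shaped "frustum" covering
// -h <= x, y, z <= h, so the example needs no matrix maths.
inline Frustum makeBoxFrustum(float h) {
    return Frustum{{
        { 1.0f,  0.0f,  0.0f, h}, {-1.0f,  0.0f,  0.0f, h},
        { 0.0f,  1.0f,  0.0f, h}, { 0.0f, -1.0f,  0.0f, h},
        { 0.0f,  0.0f,  1.0f, h}, { 0.0f,  0.0f, -1.0f, h},
    }};
}
```

The test is conservative: a sphere that straddles a plane is kept, so the only cost of an inexact bound is a draw call that might have been avoided.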
The Karisma game engine used by Afterpulse takes proactive steps to cull occluded meshes by inserting simplified occlusion volumes into the rendering pipeline. This allows the engine to discard complex meshes in the next frame if they are occluded by geometry which is closer to the camera. The images below show the effect of occlusion culling (top) compared to the unoptimized scene (bottom) when the player’s view is blocked by a vehicle. We can see that many of the occluded models, such as the ship mast, have been dropped from the optimized render.
Figure 2: Optimized scene with occlusion culling
Figure 3: Unoptimized scene without occlusion culling
Game engine culling schemes can also be used to minimize CPU usage, freeing up more Joules of energy which can be spent rendering instead. For example, culling an entire sub-tree in the scene graph rather than testing every node in it, or evaluating logic and physics updates for off-screen game elements at a lower frequency or precision.
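The sub-tree case can be sketched as a recursive walk which stops descending as soon as a parent's bounds fail the visibility test. This is only an outline: `SceneNode`, `boundsId`, and `cullSubtree` are hypothetical names, and the visibility test is passed in as a callback so that the example does not assume any particular culling scheme.

```cpp
#include <cassert>
#include <functional>
#include <vector>

struct SceneNode {
    int boundsId;                     // hypothetical handle to this node's bounding volume
    std::vector<SceneNode> children;  // assumed: a parent's bounds enclose all of its children
};

// Appends the ids of visible leaf nodes to visibleOut and returns how many
// visibility tests were performed. A failed test on a parent rejects the
// whole sub-tree without testing any of its descendants.
inline int cullSubtree(const SceneNode& node,
                       const std::function<bool(int)>& boundsVisible,
                       std::vector<int>& visibleOut) {
    int tests = 1;
    if (!boundsVisible(node.boundsId)) {
        return tests;  // entire sub-tree culled with a single test
    }
    if (node.children.empty()) {
        visibleOut.push_back(node.boundsId);
        return tests;
    }
    for (const SceneNode& child : node.children) {
        tests += cullSubtree(child, boundsVisible, visibleOut);
    }
    return tests;
}
```

For a tree with seven nodes in which one of the two branches is off-screen, this walk performs five tests rather than seven, and the saving grows with the size of the rejected sub-tree.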
The techniques an application might consider using are discussed in more detail in an earlier blog: Mali Performance 5: An Application's Performance Responsibilities.
A game engine can remove objects which are outside the frustum relatively easily. Objects inside the frustum are much harder to remove unless a high-level check applies, such as portal visibility, which can cull every object in a room that is guaranteed to be invisible even though it lies inside the frustum. One of the major potential sources of inefficiency in applications is therefore fragment overdraw, with occluded fragments getting fragment shaded and then subsequently being overwritten by fragments closer to the camera.
Figure 4: Doorway is outside the viewing frustum in a portal culling scheme
All GPUs support early depth and stencil testing, which allows fragments which fail tests to be discarded before fragment shading occurs. To maximize the utility of early-ZS testing, first render opaque geometry in front-to-back depth order with depth testing enabled using a GL_LESS comparison. Then render transparencies in a back-to-front pass with depth testing enabled, but depth writes disabled. This will ensure the hardware can remove as much overdraw as possible, but still allow correct blending.
Figure 5: The recommended render order to maximize early-ZS rejection
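The render order described above can be sketched as a single sort over the frame's draw list. The `DrawItem` structure and its fields are hypothetical; in a real renderer the opaque pass would typically run with `glDepthFunc(GL_LESS)` and `glDepthMask(GL_TRUE)`, and the transparent pass with `glDepthMask(GL_FALSE)`.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct DrawItem {
    int   id;           // hypothetical draw handle
    float viewDepth;    // distance from the camera along the view direction
    bool  transparent;  // true if the draw needs blending
};

// Orders the list so that opaque draws come first, sorted front-to-back,
// followed by transparent draws sorted back-to-front.
inline void sortForEarlyZS(std::vector<DrawItem>& items) {
    std::stable_sort(items.begin(), items.end(),
                     [](const DrawItem& a, const DrawItem& b) {
                         if (a.transparent != b.transparent) {
                             return !a.transparent;  // opaque before transparent
                         }
                         return a.transparent ? a.viewDepth > b.viewDepth   // back-to-front
                                              : a.viewDepth < b.viewDepth;  // front-to-back
                     });
}
```

Sorting per object by a single depth value is only an approximation for large or intersecting meshes, but it is usually enough to let early-ZS testing reject most hidden opaque fragments.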
Despite our best efforts graphics drivers are not zero-cost; committing operations into the command stream consumes CPU cycles. To reduce CPU use it is important that applications batch draw operations together to make larger submissions into the graphics API. This may require the use of texture atlases and similar techniques to merge render states together, which will allow for larger batches. This advice holds true even for the new low-overhead APIs such as Vulkan; batching will help minimize CPU usage even there.
More information on batching can be found in this earlier blog from Stacy Smith: Game Set and Batch
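To make the batching idea concrete, the sketch below groups draws which share the same render state so that each group can be submitted as one larger draw. The `stateKey` is a hypothetical value standing in for whatever combination of shader, texture atlas, and blend state an engine uses, and the example assumes the merged meshes already live in a shared vertex and index buffer.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

struct MeshDraw {
    std::uint32_t stateKey;    // hypothetical hash of shader + textures + render state
    std::uint32_t indexCount;  // indices this mesh contributes
};

struct Batch {
    std::uint32_t stateKey;
    std::uint32_t indexCount;  // total indices submitted in one draw call
};

// Sorts draws by state so that compatible draws become adjacent, then merges
// each run of identical keys into a single batch.
inline std::vector<Batch> buildBatches(std::vector<MeshDraw> draws) {
    std::stable_sort(draws.begin(), draws.end(),
                     [](const MeshDraw& a, const MeshDraw& b) {
                         return a.stateKey < b.stateKey;
                     });
    std::vector<Batch> batches;
    for (const MeshDraw& d : draws) {
        if (!batches.empty() && batches.back().stateKey == d.stateKey) {
            batches.back().indexCount += d.indexCount;  // extend the current batch
        } else {
            batches.push_back(Batch{d.stateKey, d.indexCount});
        }
    }
    return batches;
}
```

Sorting purely by state competes with sorting by depth, so an engine has to choose how to balance the two; this sketch deliberately ignores that trade-off.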
GPUs are data-plane processors which consume large vertex and texture payloads, so optimizing these data streams is critical when trying to minimize DDR power. The aim here is to minimize the number of bytes we need to convey the necessary information to the GPU.
The easiest means to control geometry bandwidth is to reduce the number of triangles being used. This will have an impact on object silhouette quality, but a balance can be struck by using a reduced triangle count for meshes which are further from the camera, and normal maps to improve lighting inside the silhouette edge. In addition, per-vertex bandwidth can be minimized by using lower precision inputs, and minimizing padding and unused fields. Afterpulse makes heavy use of compact data formats, such as using GL_INT_2_10_10_10_REV for normals and GL_BYTE vectors for color values, to ensure it makes the best use of the available bandwidth.
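As an example of what a compact format saves, the sketch below packs a unit normal into the signed 10:10:10:2 layout which OpenGL ES 3.0 exposes as `GL_INT_2_10_10_10_REV`, using 4 bytes instead of the 12 needed for three floats. The function names are made up for this example, and the two-bit w component is simply left at zero.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Converts a float in [-1, 1] to a 10-bit signed normalized value,
// returned in the low 10 bits as two's complement.
inline std::uint32_t packSnorm10(float v) {
    const float clamped = std::fmax(-1.0f, std::fmin(1.0f, v));
    const std::int32_t q = static_cast<std::int32_t>(std::lround(clamped * 511.0f));
    return static_cast<std::uint32_t>(q) & 0x3FFu;
}

// Packs x into bits 0-9, y into bits 10-19, and z into bits 20-29.
// Bits 30-31 (w) are left as zero in this sketch.
inline std::uint32_t packNormal2101010Rev(float x, float y, float z) {
    return packSnorm10(x) | (packSnorm10(y) << 10) | (packSnorm10(z) << 20);
}

// CPU-side decode of one component (0 = x, 1 = y, 2 = z), useful for checking
// the quantization error; it mirrors the signed normalized conversion rule.
inline float unpackSnorm10(std::uint32_t packed, int component) {
    assert(component >= 0 && component <= 2);
    std::int32_t bits = static_cast<std::int32_t>((packed >> (10 * component)) & 0x3FFu);
    if (bits & 0x200) {
        bits -= 0x400;  // sign-extend the 10-bit value
    }
    return std::fmax(static_cast<float>(bits) / 511.0f, -1.0f);
}
```

An attribute packed this way would be described to the API with a component count of 4 and normalization enabled, and the shader still sees an ordinary floating-point vector.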
Static texture data should also be compressed using texture compression whenever possible, and should use mipmapping to dynamically match data size to fragment size. The Adaptive Scalable Texture Compression (ASTC) LDR profile provides a flexible selection of format and bitrate for both color and non-color data. It is mandatory in OpenGL ES 3.2 and a widely available optional feature for Vulkan, so it’s a great time to review your assets and the compression schemes and bitrates you are using.
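To get a feel for the bitrate choices, remember that every ASTC block occupies 128 bits whatever its footprint, so the block size alone sets the bits per texel. The helper names below are invented for this sketch, and the figures cover a single mip level only.

```cpp
#include <cassert>
#include <cstdint>

// Size in bytes of one ASTC-compressed mip level. Each block is 16 bytes,
// and partial blocks at the edges are rounded up to whole blocks.
constexpr std::uint64_t astcLevelBytes(std::uint32_t width, std::uint32_t height,
                                       std::uint32_t blockW, std::uint32_t blockH) {
    return static_cast<std::uint64_t>((width + blockW - 1) / blockW) *
           ((height + blockH - 1) / blockH) * 16u;
}

// Bits per texel for a given block footprint: 128 bits spread over the block.
constexpr double astcBitsPerTexel(std::uint32_t blockW, std::uint32_t blockH) {
    return 128.0 / (static_cast<double>(blockW) * blockH);
}

// Size in bytes of the same level stored as uncompressed 32bpp RGBA.
constexpr std::uint64_t rgba8LevelBytes(std::uint32_t width, std::uint32_t height) {
    return static_cast<std::uint64_t>(width) * height * 4u;
}
```

For a 1024x1024 texture this gives 4 MiB uncompressed, 1 MiB with 4x4 blocks (8 bits per texel), and 256 KiB with 8x8 blocks (2 bits per texel), which is why it is worth testing whether each asset tolerates a larger block size.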
OpenGL ES and Vulkan provide an abstraction which hides the underlying GPU hardware from the application, but it is always worth understanding what each underlying GPU can give in terms of additional benefits and energy efficiency.
All of the Mali GPUs are tile-based renderers which process fragment shading in small 16x16 pixel regions, using a small SRAM tightly coupled to each shader core to store the framebuffer working set. This minimizes the external memory accesses needed for blending and ZS testing.
Figure 6: Tile-based renderer data flow
See this blog for more details on tile-based renderers: The Mali GPU: An Abstract Machine, Part 2 - Tile-based Rendering
From an application point of view, the local memory has some algorithmic advantages which can be exploited.
The first use is that it can be used to provide low cost 4x multi-sample anti-aliasing (MSAA), with all of the intermediate samples kept inside the GPU and resolved into a single pixel color before writeout. This means that the additional samples never hit memory, saving bandwidth, and do not need additional work such as a call to glBlitFramebuffer() to resolve the multi-sample data into a final pixel color. EGL window surfaces support this MSAA resolve implicitly, but for offscreen framebuffer objects make sure to use the EXT_multisampled_render_to_texture extension to get the free resolve of the multi-sample data into the final pixel color.
The second use is programmatic access to the tile-buffer, allowing techniques such as programmable blending and single pass deferred lighting. Afterpulse uses the Pixel Local Storage (PLS) extensions to OpenGL ES to implement a single pass deferred lighting scheme. The first set of draw calls constructs the geometry buffer (“G-Buffer”), and the subsequent lighting draw calls read the G-Buffer data directly from the tile buffer, compute the lighting value, and accumulate the final lit output color back into the tile-buffer. This approach is similar in performance to reading the G-Buffer from a texture, but saves a huge amount of bandwidth; a full 1080p G-Buffer at 4x 32bpp requires ~32MB of bandwidth per frame which must be both written (creation) and read (use for lighting).
Figure 7: The Afterpulse G-Buffer, which can be stored inside Mali tile memory
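The G-Buffer bandwidth figure can be checked with some quick arithmetic. The helper names below are invented for this sketch, and the 60 frames per second used in the follow-up note is an assumed target rather than a number from the Afterpulse data.

```cpp
#include <cassert>
#include <cstdint>

// Bytes needed to store one full-screen G-Buffer.
constexpr std::uint64_t gbufferBytes(std::uint32_t width, std::uint32_t height,
                                     std::uint32_t targets, std::uint32_t bytesPerTarget) {
    return static_cast<std::uint64_t>(width) * height * targets * bytesPerTarget;
}

// External memory traffic per second if the G-Buffer is written once and
// read once per frame, as happens when it is stored in a texture.
constexpr std::uint64_t gbufferTrafficPerSecond(std::uint64_t bytes, std::uint32_t framesPerSecond) {
    return bytes * 2u * framesPerSecond;
}
```

For 1920x1080 with four 32bpp targets this comes to 33,177,600 bytes (about 31.6 MiB), which matches the rough figure quoted above. Written and read once per frame at an assumed 60 frames per second, that would be close to 4 GB/s of DDR traffic, all of which is avoided when the data stays in tile memory.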
Similar in-tile rendering techniques can be implemented in Vulkan using the multi-pass support integrated into the API. See the “Vulkan multi-pass case study”: https://community.arm.com/graphics/b/blog/posts/vulkan-multipass-at-gdc-2017.
The final stage of optimization is to focus on optimizing shader programs to use as few cycles and as little power as possible. Specialize shaders for each use, avoiding control flow decisions based on uniform values, and removing uniform-on-uniform and uniform-on-constant computations, which both avoid repeating the same computation for every program invocation in a draw call.
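One common way to specialize shaders is to build each variant from a shared source with a different block of `#define` lines, so that feature choices are resolved at compile time rather than by branching on uniforms. The sketch below shows only the host-side string assembly; `ShaderFeatures`, `buildPreamble`, and the macro names are hypothetical and are not taken from the Afterpulse code.

```cpp
#include <cassert>
#include <string>

// Hypothetical per-material feature set which selects a shader variant.
struct ShaderFeatures {
    bool normalMap;
    bool fog;
    int  lightCount;
};

// Builds the text which is placed ahead of the shared shader body. The
// #version directive has to come first in GLSL ES 3.00, so it is emitted
// here rather than in the shared source.
inline std::string buildPreamble(const ShaderFeatures& f) {
    assert(f.lightCount >= 0);
    std::string preamble = "#version 300 es\n";
    if (f.normalMap) {
        preamble += "#define USE_NORMAL_MAP 1\n";
    }
    if (f.fog) {
        preamble += "#define USE_FOG 1\n";
    }
    preamble += "#define LIGHT_COUNT " + std::to_string(f.lightCount) + "\n";
    return preamble;
}
```

The shared shader body can then wrap optional code in `#ifdef USE_FOG` style blocks and use `LIGHT_COUNT` as a constant loop bound, giving the compiler a fixed program to optimize for each variant.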
As with the data stream optimizations, also aim to use the lowest computational precision which works. The Afterpulse development process defaulted to using “mediump” fp16 operations in all shaders, and only increased precision to “highp” on a case-by-case basis when artefacts became visible.
The trick behind successfully deploying complex rendering pipelines onto mobile devices is simply to make sure you are spending your precious energy budget where it makes a visible difference in the final on-screen output. Make effective use of CPU and GPU cycles and DDR bandwidth by first removing the redundant work as early in the pipeline as possible, and then focus on efficiency, aiming to minimize the flops, bytes, and primitives required to render the parts of the scene which are visible on screen.
This blog is based on my GDC17 presentation with Unai Landa from Digital Legends, and Jon Kirkham from the Arm Mali tools team; the slides from GDC can be found here: GDC Presentation Slides. Check out Performance and Optimization Tutorials to see more performance and optimization developer education materials from the Mali engineering team.