The line between console and mobile game quality is blurring, with the latest smartphones featuring graphics content that would have been unthinkable a few years ago.
If consoles are what mobile gaming is trying to compete with, developers must constantly ask: what is missing from mobile graphics? For the graphics and gaming team at Arm, the most straightforward answer starts with understanding the difference in GPU budget. The greatest challenge for mobile GPUs is the device form factor, which limits both area and power consumption. It’s surprising what these tiny GPUs can achieve, but there is still a noticeable gap to console GPUs in raw computational power and memory bandwidth.
This doesn’t necessarily mean that mobile content is worse than console content, but it poses a significant challenge for the daring developer, who needs to delve deep into optimizing their game. Removing anything unnecessary is the key here. Most console games can get away with redundant draw calls, rendering occluded geometry and so on, while on mobile absolutely everything matters.
Apart from the number of polygons that can be rendered, there is another area where console games typically shine compared to mobile: polished visuals achieved through post-processing effects. It is important to point out that post-processing is not "banned" on mobile altogether; rather, we should understand why post-processing effects are not ideal for mobile GPUs and how we can work around these restrictions.
Mobile GPUs, including Arm Mali, typically use a tile-based approach to rendering, meaning that vertex and fragment shaders run on one tile of the screen at a time. This is a great workaround for the lower memory bandwidth on mobile: all the data needed for the current tile can be kept in on-chip memory, greatly reducing accesses to main memory.
Looking at the effects of this approach, you may know that forward rendering is a good fit for mobile, while deferred rendering is not. This is because traditional deferred rendering requires a first pass over all the tiles, saving the results to memory and loading them back for the next pass. A modified form of deferred rendering, however, works just as well as forward rendering on mobile. It employs Pixel Local Storage (PLS), which lets us run both deferred passes on a single tile without storing/loading data from DRAM.
You can probably see where this is going: mobile GPUs don’t really like running several render passes that need access to previously stored data. That is the very idea of post-processing: rendering a scene and then applying an effect to it.
Can we apply the same trick as for deferred rendering? Possibly, on a case-by-case basis; but let’s take blur as a counter-example (it will become relevant in a moment). To blur a pixel you need access to the adjacent pixels, including pixels in neighbouring tiles, so we can’t really use PLS there.
At any rate, this explains why standard console post-processing pipelines aren’t sustainable on mobile. As with anything in graphics, it’s a matter of cleverly adapting the post-processing effect you want to better suit the hardware, or (cleverly) achieving a similar effect without post-processing at all.
After this little bit of theory, let us see what happened when we faced a real-world problem involving post-processing.
Nordeus, a Serbia-based game studio you may already know for the incredibly popular Top Eleven, is developing a new 1-vs-1 PvP battle game called Spellsouls: Duel of Legends. It boasts stunning visuals and supports a wide range of devices.
Here’s a screenshot from the game, featuring very detailed characters and PBR shaders:
Figure 1: Scene screenshot
They were happy with the way most materials looked, but something was a bit off: the gold details on the terrain didn’t really feel like gold, and the characters’ armor should have been shinier. What was missing is a post-processing effect known as bloom, which gives all brightly lit regions of a scene a glow-like effect.
Let us have a look at the same scene with some bloom in it. This is a mock-up image where the effect is greatly exaggerated to make it clearly visible.
Figure 2: Scene with exaggerated bloom effects
The most common way to achieve bloom is to isolate the brightest parts of an image, blur them and apply the result back to the initial image.
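These three steps can be sketched offline with NumPy; the threshold, blur width and strength below are illustrative values, not the ones used in the game:

```python
import numpy as np

def bloom(image, threshold=0.8, sigma=2.0, strength=0.6):
    """Classic bloom: isolate bright pixels, blur them, add them back."""
    # 1. Keep only the parts brighter than the threshold.
    luminance = image.mean(axis=-1, keepdims=True)
    bright = np.where(luminance > threshold, image, 0.0)

    # 2. Blur with a separable Gaussian: a horizontal then a vertical pass.
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-x**2 / (2 * sigma**2))
    kernel /= kernel.sum()
    blurred = np.apply_along_axis(
        lambda row: np.convolve(row, kernel, mode="same"), 1, bright)
    blurred = np.apply_along_axis(
        lambda col: np.convolve(col, kernel, mode="same"), 0, blurred)

    # 3. Composite the glow back onto the original image.
    return np.clip(image + strength * blurred, 0.0, 1.0)
```

On a GPU the same steps map to a thresholding pass, blur passes and a compose pass, which is exactly the multi-pass structure that tile-based GPUs dislike.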
The standard Unity post-processing bloom was not suited for Spellsouls for two main reasons: it applies the effect to the whole scene, whereas Nordeus wanted bloom only on selected objects, and it was too expensive for their frame-time budget.
Figure 3: Bloom with Unity’s post-processing pipeline
They tackled both problems by optimizing the way in which they select objects and the bloom pipeline itself.
To apply bloom only to certain objects, they rendered the areas of those objects with specular greater than 1 to a separate texture using MRT (Multiple Render Targets). This gave them a separate texture to use as the input for their custom bloom pipeline.
Figure 4: MRT bloom texture
This approach also allowed them to achieve an HDR-like effect without using an HDR framebuffer. In the MRT step they used an RGB24 framebuffer and applied a scaling factor so that values greater than 1 could be represented in fixed point.
This was still costly in terms of memory bandwidth, since a full extra RGB24 texture had to be stored to and loaded from memory. For this reason, they moved to storing just the scaled luminance of the pixel in an R8 framebuffer, which significantly lowers bandwidth.
Figure 5: MRT bloom texture with an R8 framebuffer
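The scaled fixed-point trick can be sketched as follows; the scale factor of 4.0 is a hypothetical choice for illustration, not necessarily the one Nordeus used:

```python
import numpy as np

SCALE = 4.0  # assumed maximum luminance; values in [0, SCALE] map to [0, 255]

def encode_r8(luminance):
    """Pack HDR-like luminance into a single 8-bit channel."""
    return np.clip(luminance / SCALE * 255.0, 0.0, 255.0).astype(np.uint8)

def decode_r8(texel):
    """Recover the (quantized) luminance when compositing the bloom."""
    return texel.astype(np.float32) / 255.0 * SCALE

lum = np.array([0.5, 1.0, 2.5, 4.0])
roundtrip = decode_r8(encode_r8(lum))  # equals lum up to quantization error

# Bandwidth per pixel: 1 byte for R8 versus 3 bytes for RGB24.
```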
The bloom pipeline itself consists of a downscale, followed by a horizontal and a vertical blur, and a final compose. Applying a horizontal and then a vertical blur is equivalent to a full Gaussian blur but reduces the number of samples needed per pixel.
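The reason the horizontal/vertical split matches a full Gaussian blur is that the 2D Gaussian kernel is separable: it is the outer product of two 1D kernels. A quick NumPy check, with an illustrative radius of 4:

```python
import numpy as np

sigma, radius = 2.0, 4
x = np.arange(-radius, radius + 1)
g1 = np.exp(-x**2 / (2 * sigma**2))
g1 /= g1.sum()                      # normalized 1D Gaussian kernel
g2 = np.outer(g1, g1)               # the full 2D kernel: 9x9 = 81 taps

# Per-pixel tap counts: one 2D pass versus two 1D passes.
taps_2d = (2 * radius + 1) ** 2         # 81
taps_separable = 2 * (2 * radius + 1)   # 18
```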
Our collaboration with Nordeus started at this point. Their version of bloom contributed 3 ms to total frame rendering time. This is a very good result for post-processing on mobile, but it still takes up a good chunk of the budget for rendering the game at 60 FPS. They asked us to help reduce this number further.
Now, their bloom was very well optimized to begin with, and since our team works from a developer’s perspective, there is no magic we can do beyond helping mobile developers use what works best for the platform. Looking for improvements upon what they already had, we focused on the blur step, as it is the most expensive part of the pipeline. The Gaussian approach is simple but not optimal, and there are better techniques that achieve a nice blur while requiring fewer samples.
The technique we picked is Dual Filtering, which you can see in detail in this presentation by Marius Bjorge. It features optimized downscaling/upscaling filters that achieve a stronger effect, comparable to a larger Gaussian radius, at a much lower cost (a 14x performance improvement at 1080p).
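Here is a simplified CPU sketch of one dual-filter downsample step, using the weights from the presentation (a centre tap weighted 4/8 plus four diagonal taps weighted 1/8 each). The GPU version takes bilinear taps at half-pixel offsets; the block averages below only approximate that:

```python
import numpy as np

def dual_downsample(img):
    """One dual-filter downsample step on a single-channel image."""
    p = np.pad(img, 1, mode="edge")
    h, w = img.shape
    out = np.zeros((h // 2, w // 2))
    for i in range(h // 2):
        for j in range(w // 2):
            y, x = 2 * i + 1, 2 * j + 1             # top-left of the 2x2 block
            centre = p[y:y + 2, x:x + 2].mean()     # bilinear centre tap, weight 4/8
            diag = (p[y - 1, x - 1] + p[y - 1, x + 2] +
                    p[y + 2, x - 1] + p[y + 2, x + 2]) / 4.0  # 4 taps, 1/8 each
            out[i, j] = 0.5 * centre + 0.5 * diag
    return out
```

Chaining a few of these downsample steps with the matching upsample steps spreads the blur much faster than repeated Gaussian taps, which is where the performance gain comes from.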
When we applied this technique to Spellsouls, however, we didn’t see such a performance improvement: Nordeus had already optimized their Gaussian blur by downscaling the image first, so the performance difference with Dual Filtering was quite limited. Our approach still achieves a smoother bloom at the same cost, as you can see in the comparison below.
Figure 6a: Bloom with Gaussian blur
Figure 6b: Bloom with Dual Filtering blur
And that’s it for post-processing. There is no way to get it completely for free, so if we really want to save those 3 ms we must look at wilder options.
What if we ditch post-processing altogether? Can we still get bloom without it? If we’re willing to accept some trade-offs, we can get good results at minimal cost.
Since the terrain is mostly static, why don’t we try baking our bloom? Spellsouls’ PBR shaders already feature a glossiness map, which you can see below.
Figure 7: Spellsouls’ glossiness map
We can use the luminance data from the glossiness map to derive a “bloom map”, similar to the R8 texture obtained at runtime with the MRT approach. This texture can be stored in the alpha channel of the glossiness map itself, saving an extra texture access.
Figure 8: Bloom map
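A hypothetical offline bake could look like the sketch below. The Rec. 601 luma weights and the threshold remap are assumptions made for illustration; the source only states that the bloom map is derived from the glossiness map’s luminance and stored in its alpha channel:

```python
import numpy as np

def bake_bloom_map(gloss_rgb, threshold=0.6):
    """Derive a bloom mask from the glossiness map and pack it as alpha."""
    lum = gloss_rgb @ np.array([0.299, 0.587, 0.114])   # Rec. 601 luma
    bloom = np.clip((lum - threshold) / (1.0 - threshold), 0.0, 1.0)
    return np.dstack([gloss_rgb, bloom])                # RGBA texture
```

Since the shader already samples the glossiness map, reading the bloom mask from its alpha channel costs no extra texture fetch.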
In the vertex shader we compute an alignment factor between the reflected light vector and the camera vector:
// Vertex shader: alignment between the reflected light direction and the view direction
float lightObjCameraAlignment = dot(objToCam, reflLightDir);
half alignmentFactor = clamp(lightObjCameraAlignment, 0.0, 1.0);
Then, in the fragment shader, we brighten the pixel based on the value sampled from the bloom map, the alignment factor computed above and a tunable _BloomStrength factor:
// Fragment shader: the bloom mask is packed in the glossiness map's alpha channel
half bloom = rawGlossMap.a;
finalColor += finalColor * bloom * i.alignmentFactor * _BloomStrength;
Figure 9: Texture-based bloom
Figure 10: Performance comparison with different bloom approaches
All in all, a slightly less compelling effect at a greatly reduced cost – less than 1 ms. The artists were happy with this trade-off, so that’s a wrap, and happy days!
Or is it?
Well, this texture-based approach is good for the terrain, but what about the characters’ armor? We can try applying a bloom texture to a character, but the result doesn’t look like bloom at all: it lacks light bleeding, since we can only brighten pixels within the character itself.
In this case we can go for a somewhat harsher approximation: we render a plane between the camera and the character, always facing towards the camera.
Figure 11: Plane-based bloom
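A plane that always faces the camera is a standard billboard. Here is a minimal sketch of the basis construction, in NumPy for illustration (the exact setup used in Spellsouls may differ):

```python
import numpy as np

def billboard_basis(plane_pos, cam_pos, world_up=(0.0, 1.0, 0.0)):
    """Orthonormal right/up/forward vectors for a camera-facing plane."""
    forward = np.asarray(cam_pos, dtype=float) - np.asarray(plane_pos, dtype=float)
    forward /= np.linalg.norm(forward)   # plane normal points at the camera
    right = np.cross(world_up, forward)
    right /= np.linalg.norm(right)
    up = np.cross(forward, right)        # already unit length
    return right, up, forward
```

In a game engine the same result usually comes from the camera's view matrix rather than an explicit look-at construction.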
The effect is not perfect, but it feels like bloom thanks to the light bleeding. A great trade-off once again: using texture-based bloom for the terrain and plane-based bloom for the characters, the two together still account for less than 1 ms.
Figure 12: Texture-based bloom (left) and plane-based bloom (right) for characters
The takeaway here is that we can still achieve an approximation of the effect we want without post-processing. It’s not a one-size-fits-all solution, but it is the only approach that makes such effects truly cheap.
Is there anything else we can do here? We should not focus only on bloom, but rather look at the big picture: we can either reduce the cost of bloom until it is almost irrelevant, as we did above, or accept the cost of post-processing and optimize the game around it.
What matters in the end is total rendering time. If a post-processing effect takes 3 ms, we need to save at least 3 ms elsewhere. The budget for the rest of the game is around 16 ms (for 60 FPS) or 33 ms (for 30 FPS), so it might be easier to save a few milliseconds there than to squeeze an already optimized effect even further.
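The budget arithmetic is worth spelling out: a fixed 3 ms effect consumes twice the share of the frame at 60 FPS that it does at 30 FPS:

```python
def budget_ms(fps):
    """Total frame-time budget in milliseconds at a given framerate."""
    return 1000.0 / fps

bloom_ms = 3.0
share_60 = bloom_ms / budget_ms(60)   # 0.18: ~18% of a 16.7 ms frame
share_30 = bloom_ms / budget_ms(30)   # 0.09: ~9% of a 33.3 ms frame
```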
The hardest part in this case is narrowing down our options, as optimizing the rest of the game opens up lots of possibilities. You can find tips and tricks in the Optimizing for Mali developer guides on our developer platform.
Using DS-5 Streamline, we figured out that the game is GPU-bound, specifically fragment-bound. This still doesn’t tell us exactly where to look, so let’s examine a frame from the game. Can you figure out which part is the heaviest in terms of fragment shading?
Figure 13: Can you guess the heaviest fragment shading part?
It may be surprising, but the heaviest part is the terrain! The terrain covers most of the screen, so in aggregate it takes the longest to render. We found this out by building the game with one part enabled at a time and measuring frame time; we could not have figured it out just by looking at the shaders.
Now that we know we need to focus on the terrain, let’s break down the heaviest contributions within its fragment shader:
Since the terrain is a static object we could bake world-space normal maps, saving some computation at runtime.
We didn’t want to touch reflections, since one of the levels has crisp reflections and great visual quality is one of the game’s pillars. This left us with lighting as a good area for optimizations.
The first thing we did was render the lights to a lower-resolution lightmap.
We defined an “AdditiveLights” render texture with 512x512 resolution, then we rendered the lightmap in Unity in OnPreCull() using a replacement shader. This pass is performed after the shadows are rendered, but the relative ordering is not important. We used a “Terrain_AdditiveLight” Render Type to mark the game objects rendered through the lightmap.
The main PBR shader was modified to take a new texture, set to the “AdditiveLights” render texture, which is sampled at runtime to get the lighting information for each fragment. Using a lower-resolution lightmap saved 4.3 ms, and it also lets us render more dynamic lights, since the cost per light is significantly lower than with full-resolution lighting.
We can do even more, though. What if we rendered the whole terrain at a lower resolution?
It’s not like downscaling the whole game: characters and spells, the main visual focus of the game, are still rendered at full resolution. After rendering the additive lights texture in OnPreCull(), we rendered the terrain into a 720p render texture, using a layer to filter the objects to be rendered there.
In OnPreRender(), just before the main camera renders, we blitted the low-res texture to the high-res final framebuffer. We only copied the color attachment: we didn’t need depth, since the spells and units are always in front of the terrain.
Ideally, reducing the terrain resolution from 1080p to 720p would mean rendering ~55% fewer pixels. The result we got was reasonably close: terrain rendering time went down from 9.3 ms to ~5 ms.
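The ideal saving is simple pixel arithmetic:

```python
# Pixel counts for the terrain pass at the two resolutions.
full_res = 1920 * 1080                      # 2,073,600 pixels at 1080p
terrain_res = 1280 * 720                    #   921,600 pixels at 720p
reduction = 1.0 - terrain_res / full_res    # 5/9, i.e. ~55.6% fewer pixels

# Scaling the measured 9.3 ms terrain pass by the same ratio predicts
# ~4.1 ms; the measured ~5 ms is reasonably close to that ideal.
ideal_ms = 9.3 * terrain_res / full_res
```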
These optimizations saved around 10 ms, which could easily cover the cost of Nordeus’ post-processing bloom. However, by using our texture-based and plane-based bloom along with these optimizations, we could make the game run at 60 FPS on high-end devices and increase the number of devices that can run it at 30 FPS.
That’s it! We learned quite a bit from our collaboration with Nordeus, and I hope you have learned something from this blog that you can use in your own games. You probably already knew that post-processing effects are not ideal for mobile; the aim of this post was to go through the reasons why, along with some possible workarounds.
It is still possible to use post-processing on mobile by customizing the pipeline with good knowledge of the target platform. However, depending on the effect, you may get pretty good results without post-processing altogether. If you are using bloom specifically, feel free to try our approaches and Arm’s free debugging and profiling tools; they may help you spice up the visuals of your game.
Finally, never forget the big picture: any effect you add to your game contributes to the aggregate frame time, which means you can always free up some budget by optimizing the rest of the game. Try to truly leverage mobile best practices: even if your game is already running at its target framerate, you can still increase your budget for some cool visual effects.
Awesome article! I'm hitting performance issues implementing bloom for my game, and I found this. I have a further question though: is there a way to generate the "plane-based bloom" texture automatically? And if the camera direction changes, how do you ensure that the intended parts still bloom?
Glad you liked it, thanks! :)
awesome article, learning a lot! Saved for future reference.