Hi, I am Hans-Kristian Arntzen! This is my first post here. I work in the Mali use cases team where we explore the latest mobile APIs to find efficient ways of implementing modern graphics techniques on the ARM Mali architecture.
Sometimes, we create small tech demos which result in Mali SDK samples, smaller code examples which you can take inspiration from when developing your own applications.
Since August, I've been writing quite a lot of code for OpenGL ES 3.1, and in this post I will summarize what we have done with it over the last few months.
OpenGL ES 3.1 is an update to OpenGL ES 3.0 which recognizes the fact that OpenGL ES 3.0 capable hardware is already capable of much more, for example compute. OpenGL ES 3.1 now brings GPU compute support directly to OpenGL ES, so there is no longer any need to interface with external APIs to expose the compute capabilities of the hardware. The interface for compute is very clean, powerful and easy to use.
Compute support in graphics APIs means there are many more opportunities for applications to offload parallel work to the GPU than before, and being able to do this on mobile hardware is very exciting.
See Here comes OpenGL ES 3.1! for more details.
We released the r5p0 driver in December with support for OpenGL ES 3.1. The driver for Linux and Android platforms can be found on Mali Drivers.
The latest Android OpenGL ES SDK has new sample code for compute shaders.
Mali OpenGL ES SDK for Android
The samples can be built for Linux development platforms with fbdev.
There is also OpenGL ES emulator support included (OpenGL ES Emulator) so you can run the Linux fbdev samples on your desktop on Linux and Windows.
If your desktop implementation supports X11/EGL on Linux, you should be able to run the samples without the emulator by leveraging the GL_ARB_ES3_1_compatibility extension, which went into core in OpenGL 4.5.
Introduction to compute shaders
Compute is a new subject for many graphics programmers. This document tries to explain the different mindset you need to effectively use GPU compute and the new APIs found in OpenGL ES 3.1.
It goes through the major features of compute, and in-depth into some more difficult subjects like synchronization, memory ordering and execution barriers.
It is recommended that you read this before studying the examples below unless you're already familiar with compute shaders.
Particle Flow Simulation with Compute Shaders
This sample implements a modern particle system. It uses compute shaders to sort particles back-to-front which is critical to obtain correct alpha blending.
Since we can now sort on the GPU, we can offload the entire particle system to the GPU.
It also implements cool things like 4-layer opacity shadow map for some sweet volumetric shadow effects and simplex noise to add turbulence to the particles.
Combining all these techniques allows you to create a very nice particle system.
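To make the back-to-front ordering concrete, here is a minimal CPU-side sketch in Python. It only shows the sorting criterion (distance along the view direction, farthest first). It is not the sample's GPU sort, and the function and parameter names are made up for illustration.

```python
def sort_back_to_front(positions, camera_pos, view_dir):
    """Return particle positions ordered farthest-first along view_dir.

    positions:  list of (x, y, z) tuples in world space
    camera_pos: (x, y, z) of the camera
    view_dir:   unit vector the camera is looking along
    """
    def view_depth(p):
        # Distance of the particle along the viewing direction.
        return sum((p[i] - camera_pos[i]) * view_dir[i] for i in range(3))

    # Farthest particles come first so that nearer ones blend on top.
    return sorted(positions, key=view_depth, reverse=True)


particles = [(0.0, 0.0, -1.0), (0.0, 0.0, -5.0), (0.0, 0.0, -3.0)]
ordered = sort_back_to_front(particles, (0.0, 0.0, 0.0), (0.0, 0.0, -1.0))
```

On the GPU the same key would be fed to a parallel sort running in compute shaders rather than to a built-in sort.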
Occlusion Culling with Hierarchical-Z
Culling is important in complex scenes to keep vertex work down as mentioned in this blog post: Mali Performance 5: An Application's Performance Responsibilities
For game objects, there are many sophisticated CPU-based solutions which often rely on baking data structures based on how the scene is put together.
For example, in indoor scenes with separate rooms, it makes sense to only consider rendering the room you're in and objects from rooms which are visible from the room you're standing in. Doing this computation on-the-fly could get expensive, but once the information is baked, it is fairly simple.
However, when we add a large amount of "chaotic" elements to a more dynamic scene, it becomes more difficult to bake anything and we need to compute this on the fly. We have to look for some more general solutions for these scenarios.
The sample shows how you can use a low-resolution depth map and bounding spheres to efficiently cull entire instances in parallel before they are even rendered. It can also be combined with level-of-detail sorting to reduce geometry load even further.
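The core test can be sketched on the CPU as follows. This is a simplified illustration, not the sample's shader code: it assumes depth values where larger means farther away, assumes the sphere has already been projected to a rectangle of texels plus a nearest depth, and uses hypothetical function names.

```python
def downsample_max(depth):
    """Build the next, half-resolution level of a depth pyramid.

    Each output texel keeps the FARTHEST depth of its 2x2 block, so a test
    against it stays conservative. Assumes even width and height.
    """
    height, width = len(depth), len(depth[0])
    return [
        [max(depth[y][x], depth[y][x + 1],
             depth[y + 1][x], depth[y + 1][x + 1])
         for x in range(0, width, 2)]
        for y in range(0, height, 2)
    ]


def is_occluded(depth_level, texel_rect, nearest_depth):
    """True if the bounding sphere is hidden behind the depth map.

    texel_rect:    (x0, y0, x1, y1), inclusive texel bounds the sphere covers
    nearest_depth: depth of the sphere's point closest to the camera
    """
    x0, y0, x1, y1 = texel_rect
    farthest_occluder = max(
        depth_level[y][x]
        for y in range(y0, y1 + 1)
        for x in range(x0, x1 + 1)
    )
    # Hidden only if even the nearest point lies behind every covered texel.
    return nearest_depth > farthest_occluder


# A wall at depth 0.3 covers the left half; the right half is empty (1.0).
full_res = [[0.3, 0.3, 1.0, 1.0] for _ in range(4)]
low_res = downsample_max(full_res)
```

Testing against a coarse level keeps the number of depth reads per instance small, which is what makes it practical to run the test for every instance in parallel.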
Finally, the result is drawn with indirect draws, a new feature of OpenGL ES 3.1.
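For reference, the parameters of an indexed indirect draw live in a buffer as five 32-bit values: count, instanceCount, firstIndex, baseVertex and a reserved field that must be zero. The sketch below packs such a command on the CPU. The helper name is hypothetical and little-endian byte order is an assumption about the target device.

```python
import struct

# count, instanceCount, firstIndex (unsigned), baseVertex (signed),
# reservedMustBeZero (unsigned). Little-endian is assumed here.
DRAW_ELEMENTS_INDIRECT = struct.Struct("<IIIiI")


def pack_draw_elements_indirect(index_count, instance_count,
                                first_index=0, base_vertex=0):
    """Pack one indexed indirect draw command into 20 bytes."""
    return DRAW_ELEMENTS_INDIRECT.pack(
        index_count, instance_count, first_index, base_vertex, 0)


# 2500 triangles per mesh -> 7500 indices. The instance count starts at zero
# and would be filled in by the culling pass on the GPU.
command = pack_draw_elements_indirect(index_count=7500, instance_count=0)
```

Because the instance count sits in GPU memory, a culling pass can write it directly and the CPU never has to read the result back before issuing the draw.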
Using these kinds of techniques allows you to offload big "particle-like" systems to the GPU efficiently.
There are certain things you should think about when developing for Mali. During our work with OpenGL ES 3.1, we have found some general performance tips you should take into account.
Compute exposes more low-level details about the architecture, and to get optimal performance for a particular architecture, you might need some specific optimizations.
If you are experienced with compute on desktop, you might find that many general truths about performance on desktop don't necessarily apply to mobile! Sometimes, performance tips are the opposite of what you'd want to do on desktop.
If you have used OpenCL on Mali before, best practices for OpenCL also apply for compute shaders.
I presented at GDC 2015 along with Tom Olson (Chair of Khronos OpenGL ES and Vulkan working groups, Director of Graphics Research at ARM) and Dan Galpin (Developer Advocate, Google).
The presentation went through OpenGL ES 3.1 (with a focus on compute), some of the techniques I mentioned in this post, best practices for OpenGL ES 3.1 and AEP on Mali, and a small sneak peek at early Vulkan experiments on Mali.
Unfortunately, there are performance bugs affecting some features in the r5p0 release. You might stumble into them when developing for OpenGL ES 3.1.
These issues have been addressed and should be fixed in future driver releases.
I am very excited about compute shaders and culling, so much so that I wanted to create a demo for it at GDC. We do have the Occlusion Culling sample code in the SDK, but it is far too bare-bones to show at an event.
I attended GDC 2015, where I manned our tech booth most of the time and I got to show this demo to many people passing by.
Instead of dull green spheres, I went for procedurally generated asteroids. Even though they are instanced, all the asteroids look slightly different: they share a 4-component RGBA8 heightmap, and each asteroid has its own random weighting factors for it. They also have independent radii, rotation axes and rotation speeds, which makes the scene look fairly complex. Diffuse textures and normal textures are shared by all asteroids. They are also generated procedurally with Perlin noise and compressed with ASTC LDR.
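One plausible way to read the heightmap weighting is as a weighted sum of the four channels, so that a single shared texture yields a different shape for every set of weights. The sketch below assumes exactly that. The formula and the function name are illustrative guesses, not taken from the demo.

```python
def asteroid_height(texel_rgba8, weights):
    """Combine the four heightmap channels into one displacement value.

    texel_rgba8: (r, g, b, a) with each channel in 0..255
    weights:     four per-asteroid weighting factors (assumed semantics)
    """
    channels = [c / 255.0 for c in texel_rgba8]  # normalize to 0..1
    return sum(w * c for w, c in zip(weights, channels))


texel = (255, 0, 128, 64)
height_a = asteroid_height(texel, (1.0, 0.0, 0.0, 0.0))  # red channel only
height_b = asteroid_height(texel, (0.0, 0.0, 0.0, 1.0))  # alpha channel only
```

With this reading, two asteroids sampling the same texel still end up with different displacements, which is enough to hide the instancing.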
There are over 27000 asteroids in the scene here spread out across a big sphere around the camera.
At highest quality, each individual asteroid has over 2500 triangles. If we were to just naively draw this without any kind of optimization, we would get a triangle count in the ballpark of 50+ million, which is extreme overkill.
We need some culling. The first and obvious optimization is frustum culling, which can remove most of the asteroids outright. We can do this on the GPU very efficiently and in parallel, since it's just a couple of dot products per instance after all.
All the asteroids in the scene are represented as a flat linear array of per-instance data such as position, base radius, rotation axis, rotation speed and heightmap weighting factors. We combine frustum culling with the physics update (rotating the asteroids and creating a final rotation quaternion per asteroid). Since we need to update every asteroid anyway, we might as well do frustum culling while they are in cache!
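Those dot products look roughly like this. It is a minimal CPU sketch that assumes each frustum plane is stored as (nx, ny, nz, d) with a unit normal pointing into the frustum. The box-shaped "frustum" is only there to keep the example readable.

```python
def sphere_in_frustum(center, radius, planes):
    """Return False only if the sphere lies fully outside some plane.

    planes: iterable of (nx, ny, nz, d), unit normals pointing inward, so
            points inside satisfy nx*x + ny*y + nz*z + d >= 0.
    """
    for nx, ny, nz, d in planes:
        signed_distance = (nx * center[0] + ny * center[1]
                           + nz * center[2] + d)
        if signed_distance < -radius:
            return False  # completely behind this plane: cull
    return True


# Illustrative stand-in for a frustum: the cube from -1 to 1 on every axis.
BOX_PLANES = [
    (1.0, 0.0, 0.0, 1.0), (-1.0, 0.0, 0.0, 1.0),
    (0.0, 1.0, 0.0, 1.0), (0.0, -1.0, 0.0, 1.0),
    (0.0, 0.0, 1.0, 1.0), (0.0, 0.0, -1.0, 1.0),
]
```

Each instance is tested independently of all the others, which is why the work maps so well to one compute shader invocation per instance.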
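As an illustration of the physics step, the sketch below builds a rotation quaternion from a rotation axis, a rotation speed and an elapsed time. It assumes the angle is simply speed multiplied by time and that quaternions are stored as (w, x, y, z). The demo may well accumulate rotation differently.

```python
import math


def quat_from_axis_angle(axis, angle):
    """Unit quaternion (w, x, y, z) rotating by `angle` radians about `axis`."""
    length = math.sqrt(sum(a * a for a in axis))
    x, y, z = (a / length for a in axis)
    s = math.sin(angle / 2.0)
    return (math.cos(angle / 2.0), x * s, y * s, z * s)


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def rotate(q, v):
    """Rotate vector v by unit quaternion q without building a matrix."""
    w, u = q[0], q[1:]
    t = tuple(2.0 * c for c in cross(u, v))
    ut = cross(u, t)
    return tuple(v[i] + w * t[i] + ut[i] for i in range(3))


def asteroid_rotation(axis, speed, time):
    """Final rotation for one asteroid; speed is in radians per second."""
    return quat_from_axis_angle(axis, speed * time)


# A quarter turn about the Z axis takes +X to +Y.
q = asteroid_rotation((0.0, 0.0, 2.0), math.pi / 2.0, 1.0)
rotated = rotate(q, (1.0, 0.0, 0.0))
```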
Now we're looking at ~2000 asteroids being rendered, but just frustum culling is not enough! We also need LOD sorting to get the vertex count low enough.
The idea behind LOD sorting is that objects far away don't need high detail. We can add this technique to plain frustum culling and reduce the vertex count a lot. After these optimizations, we're looking at 500-600k triangles per frame, a 100x reduction from before. We can also use cheaper vertex shaders for objects far away, which reduces the vertex load even more. This can be done efficiently in compute shaders as well: it's just a question of pushing per-instance data to one of many instance buffers if it passes the frustum test.
Here we see close objects in white, and objects get darker as the LOD factor increases.
We can also use different shading for close objects. Here, close asteroids have full bling with normal mapping and specular highlights from the skydome, while objects farther away are only diffuse, with spherical harmonics for diffuse lighting.
This kept fragment shading load down quite a bit. The screenshot shows the normals visualized for debugging. The normals without normal mapping look a bit funky, but that's because the normals are computed directly in the vertex shader by sampling the heightmap multiple times. With shading applied, it looks fine, however.
But we can do even better. You might have noticed the transparent "glasslike" wall in front of the asteroids? It is supposed to be opaque. We wanted this to be a space station interior or something cool, but unfortunately we didn't have time before GDC.
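The bucketing step can be sketched like this on the CPU. The distance thresholds, the dictionary layout and the precomputed `visible` flag are assumptions made for the example. In a compute shader, appending to a buffer would typically go through an atomic counter per buffer, which is not shown here.

```python
import bisect
import math


def bucket_by_lod(instances, camera_pos, lod_distances):
    """Distribute visible instances into one list per level of detail.

    instances:     dicts with "id", "position" and "visible" (frustum result)
    lod_distances: ascending distance thresholds; N thresholds give N + 1
                   LODs, with LOD 0 being the most detailed
    """
    buckets = [[] for _ in range(len(lod_distances) + 1)]
    for inst in instances:
        if not inst["visible"]:
            continue  # failed the frustum test, never reaches any buffer
        distance = math.dist(inst["position"], camera_pos)
        lod = bisect.bisect_right(lod_distances, distance)
        buckets[lod].append(inst["id"])
    return buckets


asteroids = [
    {"id": 0, "position": (0.0, 0.0, 5.0), "visible": True},
    {"id": 1, "position": (0.0, 0.0, 60.0), "visible": True},
    {"id": 2, "position": (0.0, 0.0, 500.0), "visible": True},
    {"id": 3, "position": (0.0, 0.0, 5.0), "visible": False},
]
buckets = bucket_by_lod(asteroids, (0.0, 0.0, 0.0), [10.0, 50.0, 200.0])
```

Each bucket can then be drawn with its own mesh resolution and its own, cheaper or more expensive, shader.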
The main point here is that there is a lot of stuff going on behind the occluder in the scene. There is no reason why we should waste precious bandwidth and cycles on vertex shading asteroids which are never seen.
Enter Occlusion Culling!
We can go from this:
After this optimization we cull over half the asteroids in the scene on average, and we are looking at a very manageable 200-300k triangles.
My hope for the future is that we'll be able to easily do all kinds of scene management directly on the GPU. It's not feasible to do everything on the GPU quite yet, and the CPU is still very capable of doing things like these, but we can definitely accelerate massively instanced cases like this one.
The skydome is procedurally generated with FBM noise. It is HDR and is used for all the lighting in the scene. I compressed it with ASTC HDR instead of RGB9E5, a 32-bit shared-exponent format that would have been pretty much the only reasonable alternative if I hadn't had ASTC HDR.
I squeezed out 60 fps at 1080p/4xMSAA on a Samsung Galaxy Note 10.1 (Mali-T628 MP6) and Samsung Galaxy Note 4 (Korean version, Mali-T760 MP6) when all culling was applied, which I'm quite happy with.
I used DS-5 (An In-Depth Look at Streamline) to find bottlenecks when tuning, along with the Mali Offline Compiler to fine-tune the shaders (mediump varyings can make a lot of difference!).
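For readers who haven't met RGB9E5 before, "shared exponent" means three 9-bit mantissas and one 5-bit exponent packed into 32 bits. The sketch below follows the usual encoding rules for the format (9 mantissa bits, exponent bias 15) and is meant as an illustration rather than a reference encoder.

```python
import math

MANTISSA_BITS = 9
EXP_BIAS = 15
MAX_EXP = 31
# Largest representable value: (511 / 512) * 2^16 = 65408.
MAX_VALUE = (2**MANTISSA_BITS - 1) / 2**MANTISSA_BITS * 2.0**(MAX_EXP - EXP_BIAS)


def encode_rgb9e5(r, g, b):
    """Pack three non-negative floats into one 32-bit RGB9E5 value."""
    r, g, b = (min(max(c, 0.0), MAX_VALUE) for c in (r, g, b))
    max_c = max(r, g, b)
    # floor(log2(max_c)); frexp returns max_c = m * 2**e with 0.5 <= m < 1.
    floor_log2 = math.frexp(max_c)[1] - 1 if max_c > 0.0 else -EXP_BIAS - 1
    exp_shared = max(-EXP_BIAS - 1, floor_log2) + 1 + EXP_BIAS
    scale = 2.0 ** (exp_shared - EXP_BIAS - MANTISSA_BITS)
    # Rounding may push the largest mantissa to 512; bump the exponent then.
    if math.floor(max_c / scale + 0.5) == 2**MANTISSA_BITS:
        exp_shared += 1
        scale *= 2.0
    rs, gs, bs = (int(math.floor(c / scale + 0.5)) for c in (r, g, b))
    return rs | (gs << 9) | (bs << 18) | (exp_shared << 27)


def decode_rgb9e5(packed):
    """Unpack a 32-bit RGB9E5 value back into three floats."""
    scale = 2.0 ** ((packed >> 27) - EXP_BIAS - MANTISSA_BITS)
    return tuple(((packed >> shift) & 0x1FF) * scale for shift in (0, 9, 18))
```

Because all three channels share one exponent, a channel that is much darker than the brightest one loses precision or rounds to zero, and the format always costs 32 bits per texel.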