Previous blog in the series: Mali Performance 3: Is EGL_BUFFER_PRESERVED a good thing?
In my previous blogs in this series I have looked at the bare essentials for using a tile-based rendering architecture as efficiently as possible, in particular showing how to structure an application's use of framebuffers to minimize unnecessary memory accesses. With those basics out of the way I can now start looking in more detail at how to drive the OpenGL ES API most efficiently to get the best out of a platform using Mali, but before I do that I would like to introduce my five principles of performance optimization.
When starting an optimization activity have a clear set of goals for where you want to end up. There are many possible objectives to optimization: faster performance, lower power consumption, lower memory bandwidth, or lower CPU overhead to name the most common ones. The kinds of problems you look for when reviewing an application will vary depending on what type of improvement you are trying to make, so getting this right at the start is of critical importance.
It is also very easy to spend increasingly large amounts of time chasing smaller and smaller gains, and many optimizations will increase the complexity of your application and make longer-term maintenance problematic. Review your progress regularly during the work to determine when to say "we've done enough", and stop when you reach that point.
I am often asked by developers working with Mali how they can make a specific piece of content run faster. This type of question is often quickly followed by more detailed questions on how to squeeze a little more performance out of a specific piece of shader code, or how to tune a specific geometry mesh to best fit the Mali architecture. These are all valid questions, but in my opinion they narrow the scope of the optimization activity far too early in the process and leave many of the most promising avenues of attack unexplored.
Both of the questions above try to optimize a fixed workload, and both make the implicit assumption that the workload is necessary in the first place. In reality graphics scenes often contain a huge amount of redundancy - objects which are off screen, objects which are overdrawn by other objects, objects where half the triangles face away from the user, and so on - which contributes nothing to the final render. Optimization activities therefore need to attempt to answer two fundamental questions:

- How much of the work being submitted is redundant and can be removed entirely?
- How can the work which genuinely needs to be done be made as cheap as possible?
In short - don't just try to make something faster, try to avoid doing it at all whenever possible! Some of this "work avoidance" must be handled entirely in the application, but in many cases OpenGL ES and Mali provide tools which can help, provided you use them correctly; a simple example is sketched below. More on this in a future blog.
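As a minimal sketch of what that help can look like, the fragment below enables back-face culling and a scissor rectangle using standard OpenGL ES calls. The winding order and the scissor region are illustrative assumptions about the scene rather than recommended values, and larger wins such as frustum culling of whole objects still have to live in the application itself.

```c
#include <GLES2/gl2.h>

static void enable_redundancy_removal(void)
{
    /* Reject triangles facing away from the viewer, so roughly half of a
     * closed mesh never reaches fragment shading at all. */
    glEnable(GL_CULL_FACE);
    glFrontFace(GL_CCW);       /* assumes counter-clockwise front faces */
    glCullFace(GL_BACK);

    /* If only part of the framebuffer changes this frame, a scissor
     * rectangle stops fragments outside it from being processed. */
    glEnable(GL_SCISSOR_TEST);
    glScissor(0, 0, 256, 256); /* hypothetical dirty region */
}
```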
If you are optimizing a traditional algorithm on a CPU there is normally a right answer, and failure to produce that answer will result in a system which does not work. For graphics workloads we are simply trying to create a nice looking picture as fast as possible; if an optimized version is not bit-exact it is unlikely anyone will actually notice, so don't be afraid to play with the algorithms a little if it helps streamline performance.
Optimization activities for graphics should look at the algorithms used, and if their expense does not justify the visual benefit they bring then do not be afraid to remove them and replace them with something totally different. Real-time rendering is an art form, and optimization and performance are part of that art. In many cases a smooth framerate and fast performance are more important than a little more detail packed into a single frame.
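To make that concrete, here is a hypothetical GLSL ES fragment shader (shown as a C string) that trades exactness for speed: mediump precision and a simple diffuse term in place of a more elaborate lighting model. Whether the difference is visible depends entirely on your content, so treat it as a sketch of the idea rather than a recommendation.

```c
/* Hypothetical fragment shader: mediump precision and a cheap N.L diffuse
 * term. The varyings are assumed to be supplied (and the light direction
 * pre-normalized) by a matching vertex shader. */
static const char *cheap_fragment_shader_src =
    "precision mediump float;                                           \n"
    "varying vec3 v_normal;                                             \n"
    "varying vec3 v_light_dir;   /* assumed normalized per vertex */    \n"
    "uniform vec4 u_albedo;                                             \n"
    "void main()                                                        \n"
    "{                                                                  \n"
    "    float ndotl = max(dot(normalize(v_normal), v_light_dir), 0.0); \n"
    "    gl_FragColor = u_albedo * ndotl;                               \n"
    "}                                                                  \n";
```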
GPUs are data-plane processors, and graphics rendering performance is often dominated by data-plane problems. Many developers spend a long time looking at OpenGL ES API function call sequences to determine problems, without really looking at the data they are passing into those functions. This is nearly always a serious oversight.
OpenGL ES API call sequences are of course important, and many logic issues can be spotted by looking at these during optimization work, but remember that the format, size, and packing of data assets is of critical importance and must not be forgotten when looking for opportunities to make things faster.
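As one example of the kind of data-side change this means, the sketch below packs vertex normals and texture coordinates into normalized 8- and 16-bit attributes instead of full floats, cutting the vertex size from 32 bytes to 20. The struct layout and attribute locations are assumptions made purely for illustration.

```c
#include <stddef.h>
#include <GLES2/gl2.h>

typedef struct
{
    GLfloat position[3];   /* 12 bytes */
    GLbyte  normal[4];     /*  4 bytes, normalized, one byte of padding */
    GLshort texcoord[2];   /*  4 bytes, normalized */
} PackedVertex;            /* 20 bytes vs. 32 bytes for an all-float layout */

static void bind_packed_attributes(void)
{
    GLsizei stride = sizeof(PackedVertex);

    /* Attribute locations 0, 1, 2 are assumed to match the shader program. */
    glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, stride,
                          (const void *)offsetof(PackedVertex, position));
    glVertexAttribPointer(1, 3, GL_BYTE, GL_TRUE, stride,
                          (const void *)offsetof(PackedVertex, normal));
    glVertexAttribPointer(2, 2, GL_SHORT, GL_TRUE, stride,
                          (const void *)offsetof(PackedVertex, texcoord));
}
```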
The impact of a single draw call on scene rendering performance is often impossible to judge from the API level, and seemingly innocuous draw calls can carry some of the largest performance overheads. I have seen performance teams sink days or even weeks into optimizing something, only to belatedly realise that the shader they have been tuning contributes just 1% of the overall cost of the scene, so while they have done a fantastic job and made it 2x faster, overall performance improves by only 0.5%.
I always recommend measuring early and often, using tools such as DS-5 Streamline to get an accurate view of GPU hardware performance via the integrated hardware performance counters, and Mali Graphics Debugger to work out which draw calls are contributing to that rendering workload. Use the performance counters not only to identify hot spots to optimize, but also to sanity check what your application is doing against what you expect it to be doing. For example, manually estimate the number of pixels, texels, or memory accesses per frame and compare this estimate against the counters from the hardware. If you see twice as many pixels as expected being rendered then there are probably structural issues to investigate first, which could give much larger wins than simple shader tuning.
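The arithmetic involved is deliberately rough; something like the sketch below is enough. The read_gpu_fragment_count() hook is hypothetical, standing in for whatever mechanism your tooling (for example an export from DS-5 Streamline) gives you for reading the relevant counter.

```c
#include <stdio.h>

/* Hypothetical hook returning the fragment count reported by the GPU
 * performance counters for the last frame. */
extern unsigned long read_gpu_fragment_count(void);

static void sanity_check_fragments(unsigned width, unsigned height,
                                   double expected_overdraw)
{
    double expected = (double)width * (double)height * expected_overdraw;
    double measured = (double)read_gpu_fragment_count();

    /* Much more than the estimate usually points at a structural problem
     * (unexpected overdraw, redundant passes) rather than slow shaders. */
    if (measured > 2.0 * expected)
    {
        printf("Fragment count %.0f is %.1fx the estimate of %.0f; "
               "investigate overdraw or redundant passes first.\n",
               measured, measured / expected, expected);
    }
}
```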
The biggest optimizations in graphics are best tackled by making them a structural part of how an application presents data to the OpenGL ES API, so in my next blog I will be looking at some of the things an application might want to consider when trying very hard to not do any work at all.
TTFN, Pete
Next blog in the series now available: Mali Performance 5: An Application's Performance Responsibilities
Pete Harris is the lead performance engineer for the Mali OpenGL ES driver team at ARM. He enjoys spending his time working on a whiteboard and determining how to get the best out of combined hardware and software compute sub-systems. He spends his working days thinking about how to make the ARM Mali drivers even better.