Mali-G76: Taking High-End Graphics to the Next Level

It’s generally about this time of year that we start to get excited about the next round of premium Arm IP, and this year is better than ever. Mali-G76 is Arm’s latest premium GPU, built on the Bifrost architecture and boasting our highest ever GPU performance. Combined with the Cortex-A76 DynamIQ-based CPU and Mali-V76 premium VPU launched alongside it, as well as the Mali-D71 DPU released in late 2017, this completes Arm’s next-generation solution for premium smartphones and laptop-style devices.

Designed to deliver the best possible user experience for all the latest mobile technologies, from High Fidelity Mobile Gaming to mixed realities, Mali-G76 delivers a whopping 30% more performance density and 30% more energy efficiency, so you get that experience without blowing the silicon budget. Of course, no new GPU worth its salt would be hitting the shelves these days without packing some serious Machine Learning punch, and Mali-G76 can deliver up to 2.7x ML improvements when compared to Mali-G72.

This stellar combination of performance and efficiency, for which Arm has long been famous, means you not only get the best possible user experience, but you can also make the most of it with better than ever battery life, making sustained mobile gaming and power-hungry AR/VR more viable than they’ve ever been. 

Figure: Mali-G76 premium experiences

Bifrost Microarchitectural Improvements

Now let’s look at the technical innovations behind these impressive gains. Earlier in 2018 we introduced you to Mali-G52, our latest Mainstream GPU, and the premium Mali-G76 is based on the same advanced iteration of the Bifrost graphics architecture. The wider execution engines enable far greater compute performance throughout the pipeline by providing twice the compute performance in much less than twice the silicon area, a very important distinction for our SiP and OEM partners. Not only does this provide the performance density gains, but it also contributes greatly to the energy efficiency numbers by amortizing the overhead of the shared logic across the larger number of execution lanes, and it reduces the cost of the overall SoC.
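To see why doubling the lanes need not double the area, consider a toy model in which an execution engine is made up of some shared control logic plus a fixed cost per lane. The lane counts and area figures below are hypothetical values chosen purely for illustration; they are not Arm's numbers.

```python
def engine_area(lanes, shared_area, lane_area):
    """Area of one engine: fixed shared logic plus a cost per execution lane."""
    return shared_area + lanes * lane_area


def performance_density(lanes, shared_area, lane_area):
    """Throughput (taken as proportional to lane count) per unit of area."""
    return lanes / engine_area(lanes, shared_area, lane_area)


# Hypothetical, illustrative units: shared logic costs as much as four lanes.
SHARED_AREA = 4.0
LANE_AREA = 1.0

narrow_area = engine_area(4, SHARED_AREA, LANE_AREA)
wide_area = engine_area(8, SHARED_AREA, LANE_AREA)

# Twice the lanes, but the shared logic is only paid for once.
area_ratio = wide_area / narrow_area
density_gain = (performance_density(8, SHARED_AREA, LANE_AREA)
                / performance_density(4, SHARED_AREA, LANE_AREA))
```

With these made-up figures, eight lanes take 1.5 times the area of four and deliver roughly 1.33 times the performance per unit of area. The larger the shared logic is relative to a lane, the bigger the gain from widening.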

Figure: Mali-G76 wide execution engine

Another feature you might remember from Mali-G52 is the introduction of int8 dot product support. This is the element that has the greatest effect on ML performance, though you may be wondering why this is necessary in premium devices, as they’ll surely all have dedicated ML processors or accelerators? The answer is, they very well might, but our innovative and imaginative partners utilise premium GPUs across all sorts of amazing devices, and whilst many may opt for a dedicated ML solution, others won’t. Even those that do, however, are entering into what is still an emerging technology. We’ve barely scratched the surface of what ML has to offer, and we can’t yet tell how big and hungry the applications of the next couple of years might become. It’s Arm IP’s flexibility that many of our partners value the most, so the ability to make all sorts of personalized trade-offs and design decisions is inherent in supporting the rich breadth of the Arm ecosystem.
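For readers less familiar with the operation, here is a minimal sketch of what an int8 dot product computes: pairs of signed 8-bit values are multiplied and the products are summed into a wider accumulator. The four-element grouping and the function names are assumptions made for illustration; the post does not describe the exact form of the Mali-G76 instruction.

```python
INT8_MIN, INT8_MAX = -128, 127


def dot4_int8(a, b, acc=0):
    """Multiply four pairs of int8 values and add the products to `acc`."""
    assert len(a) == len(b) == 4
    for x, y in zip(a, b):
        assert INT8_MIN <= x <= INT8_MAX and INT8_MIN <= y <= INT8_MAX
        acc += x * y  # products are summed in a wider accumulator
    return acc


def dot_int8(a, b):
    """Dot product of longer int8 vectors, processed four elements at a time."""
    assert len(a) == len(b) and len(a) % 4 == 0
    acc = 0
    for i in range(0, len(a), 4):
        acc = dot4_int8(a[i:i + 4], b[i:i + 4], acc)
    return acc


# Example with quantized weights and activations (made-up values).
weights = [1, -2, 3, -4, 5, -6, 7, -8]
activations = [8, 7, 6, 5, 4, 3, 2, 1]
result = dot_int8(weights, activations)
```

Quantized neural network inference spends most of its time in exactly this kind of multiply-accumulate loop, which is why hardware support for it has such an effect on ML performance.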

Mali-G76 also benefits from a dual texture mapper, providing twice the throughput of the Mali-G72, and therefore a huge leap in efficiency. This means far longer sustained, premium performance for all those power-hungry high-end graphics use cases we mentioned earlier. 

In another bid to improve both performance density and power consumption, we have optimized the register file by using half the number of register banks, each of a larger size, which improves both area and energy efficiency.

Preloading optimisations

Varying preload at sample locations has traditionally presented something of a problem: varying interpolation is normally done at the pixel center, but if sample-frequency shading is enabled, it is done at the sample location. This meant that the compiler had to encode the interpolation location within the instruction without knowing whether sample-frequency shading would be used, and so had to output two different shader variants. We’ve addressed this by ensuring that the compiler can use the same instruction sequence for either the sample or the center case, which means only one shader variant is required, improving the energy efficiency of the GPU.
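As a rough illustration of the idea, the sketch below interpolates a varying across a triangle at an arbitrary position, so the same routine serves both the pixel center and a sample location. It uses plain barycentric interpolation without perspective correction, and the function names, triangle and sample offset are all hypothetical. It is not the Mali instruction sequence, which the post does not describe.

```python
def barycentric_weights(p, a, b, c):
    """Barycentric weights of 2D point p with respect to triangle (a, b, c)."""
    (px, py), (ax, ay), (bx, by), (cx, cy) = p, a, b, c
    det = (by - cy) * (ax - cx) + (cx - bx) * (ay - cy)
    w0 = ((by - cy) * (px - cx) + (cx - bx) * (py - cy)) / det
    w1 = ((cy - ay) * (px - cx) + (ax - cx) * (py - cy)) / det
    return w0, w1, 1.0 - w0 - w1


def interpolate_varying(values, triangle, position):
    """Interpolate per-vertex values at `position`.

    The position is an input rather than being baked into the code, so one
    routine covers both pixel-center and per-sample shading.
    """
    w0, w1, w2 = barycentric_weights(position, *triangle)
    return w0 * values[0] + w1 * values[1] + w2 * values[2]


# Hypothetical triangle whose varying happens to equal the x coordinate.
TRIANGLE = ((0.0, 0.0), (4.0, 0.0), (0.0, 4.0))
VALUES = (0.0, 4.0, 0.0)

at_center = interpolate_varying(VALUES, TRIANGLE, (0.5, 0.5))    # pixel center
at_sample = interpolate_varying(VALUES, TRIANGLE, (0.25, 0.75))  # one sample
```

Because the location arrives as data, nothing in the routine has to change when sample-frequency shading is switched on, which is the property that removes the need for a second variant.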

When we have to preload the depth buffer, we issue pre-frame shaders that use the texture mapper to fetch depth values and output these to the tile buffer, but fetching these depth values takes more time than is optimal due to the inherent memory latency, which can cause dependency stalls in the GPU.

Complex applications using multiple render targets and no MSAA tend to run out of colour tile buffer space before tile depth buffer space, meaning we have spare tile depth buffers. With Mali-G76, we allocate this tile depth buffer space as early as possible and run the depth preload. If we can run this early enough, then the depth preload is complete by the time normal fragments are produced, avoiding dependency stalls. This improves GPU performance for complex content.
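The benefit of starting the preload early can be shown with a simple timeline. This is a toy model with made-up cycle counts, not measured Mali-G76 behaviour: fragments stall only for the part of the preload that has not yet finished when they arrive.

```python
def fragment_stall(preload_start, preload_latency, first_fragment_time):
    """Cycles the first fragments wait for the depth preload to finish."""
    preload_done = preload_start + preload_latency
    return max(0, preload_done - first_fragment_time)


# Hypothetical cycle counts, for illustration only.
PRELOAD_LATENCY = 300
FIRST_FRAGMENT_TIME = 400

# Preload issued shortly before the fragments arrive.
late_stall = fragment_stall(350, PRELOAD_LATENCY, FIRST_FRAGMENT_TIME)

# Preload issued as soon as spare depth buffer space can be allocated.
early_stall = fragment_stall(50, PRELOAD_LATENCY, FIRST_FRAGMENT_TIME)
```

In this example the late preload leaves fragments waiting 250 cycles, while the early one hides the memory latency completely.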

Figure: Mali-G76 performance stats

Cache improvements

Thread Local Storage (TLS) is an area of the stack used for register spilling in shaders. Mali-G76 implements TLS address interleaving, enabling data for a single thread to be grouped together at the same location in the cache, whereas previously data could be in smaller amounts and in several locations. Retrieving data from a single location is more efficient and improves overall compute performance.
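The post does not spell out the address mapping, so the following is only a sketch of the general effect: the fewer cache lines a thread's spill slots span, the fewer lookups are needed to fetch them. Both layouts, the 64-byte line size and the slot counts are hypothetical.

```python
CACHE_LINE_BYTES = 64  # assumed line size, for illustration
SLOT_BYTES = 4         # one 32-bit spilled register per slot


def cache_lines_touched(addresses):
    """Number of distinct cache lines covered by a set of byte addresses."""
    return len({addr // CACHE_LINE_BYTES for addr in addresses})


def scattered_layout(thread, slots, num_threads):
    """Hypothetical layout: slot i of every thread is stored side by side."""
    return [(i * num_threads + thread) * SLOT_BYTES for i in range(slots)]


def grouped_layout(thread, slots):
    """Hypothetical layout: all slots of one thread are contiguous."""
    return [(thread * slots + i) * SLOT_BYTES for i in range(slots)]


scattered = cache_lines_touched(
    scattered_layout(thread=3, slots=16, num_threads=64))
grouped = cache_lines_touched(grouped_layout(thread=3, slots=16))
```

With these assumed parameters, the scattered layout spreads one thread's 16 slots over 16 cache lines, whereas the grouped layout fits them in a single line.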

Ordered writeback in the tiler can cause stalls when there is a miss in the µTLB (micro Translation Lookaside Buffer). In Mali-G76 we have implemented out-of-order polygon list writeback, allowing the GPU to continue executing while the miss is resolved. This helps Mali-G76 scale to larger capabilities than any of our previous-generation GPUs.
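A toy timing model shows why this helps. It assumes one write is issued per cycle, a fixed miss penalty and no limit on the number of outstanding misses; all of these are simplifying assumptions, not details of the Mali-G76 design.

```python
def in_order_cycles(writes, miss_penalty):
    """Each write must complete before the next one can start."""
    return sum(1 + (miss_penalty if miss else 0) for miss in writes)


def out_of_order_cycles(writes, miss_penalty):
    """Writes issue one per cycle; a miss resolves while later writes proceed."""
    finish_times = [
        issue_cycle + 1 + (miss_penalty if miss else 0)
        for issue_cycle, miss in enumerate(writes)
    ]
    return max(finish_times, default=0)


# True marks a write that misses in the micro-TLB (hypothetical pattern).
WRITES = [False, True, False, False, True, False]
MISS_PENALTY = 10

ordered = in_order_cycles(WRITES, MISS_PENALTY)
unordered = out_of_order_cycles(WRITES, MISS_PENALTY)
```

For this pattern the in-order version takes 26 cycles because every miss blocks the writes behind it, while the out-of-order version finishes in 15 because the two misses overlap with other work.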

Since the introduction of the Bifrost architecture with Mali-G71, subsequent iterations and enhancements have enabled significant uplifts in both performance and efficiency, as well as the flexibility to support complex graphics use cases across all tiers of device from the Ultra-efficient Mali-G31 to this brand-new, Premium Mali-G76. As we look ahead to the next generation of flagship devices to be powered by Mali-G76 and the rest of the 2018 Arm Premium Solution, we can’t wait to see what cool tech our partners will come up with next.
