The Mali-G71 GPU is the latest and greatest offering in the Mali high-performance family of GPUs. Built on the brand new Bifrost architecture, Mali-G71 represents a whole new level of high-end mobile graphics capabilities whilst still maintaining Mali’s position as a leading GPU in a highly competitive market.
Mali-G71 was developed with the advanced, and ever-advancing, use cases of high-end mobile in mind, such as Virtual Reality (VR), Augmented Reality (AR) and 3D gaming, as well as modern APIs such as Vulkan and OpenCL 2.0. It’s been a few years since the pinnacle of mobile gaming was Snake, but the industry has advanced so fast and so far since then that even today’s high-end devices could struggle with the next generation of gaming requirements. Mali-G71 aims to address this potential shortfall by looking ahead to the next level of mobile graphics and ensuring the devices it powers will be more powerful, efficient (and generally more awesome) than ever before. So much so that devices powered by the Mali-G71 GPU can even compete with mid-range laptops in terms of graphics capability.
The new Mali Bifrost architecture represents a step change in the industry and enables the future of mobile graphics. There are numerous innovations and optimizations built into the new design, but we’ll highlight just a few.
Claused shaders allow you to group sets of instructions together into defined blocks that will run to completion atomically and uninterrupted. This means we can be sure all external dependencies are in place prior to clause execution and we can design execution units to allow temporary results to bypass accesses to the register bank. This reduces the pressure on the register file, drastically decreasing the amount of power it consumes and also contributes to area reduction by simplifying the control logic in the execution units.
Claused shaders provide significant power savings
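To make the idea more concrete, here is a minimal, purely illustrative C sketch; it is not real Bifrost ISA or compiler output. Within a clause every operand is either read from the register file or forwarded from the previous instruction, and only the clause’s final result is written back.

```c
#include <stdio.h>

/* Purely illustrative model of clause execution -- not real Bifrost ISA.
 * Within a clause, each instruction reads either the register file or the
 * result forwarded from the previous instruction; intermediate values never
 * touch the register file, only the clause's final result does. */

typedef enum { SRC_REG, SRC_FWD } Src;

typedef struct {
    Src  a_src, b_src;   /* where each operand comes from        */
    int  a_reg, b_reg;   /* register indices, used when SRC_REG  */
    char op;             /* '+' or '*' for this toy example      */
} Instr;

static float regs[8];    /* toy register file */

/* Execute one clause atomically: no other work is interleaved, and only
 * the last result is written back to the register file. */
static void run_clause(const Instr *clause, int len, int dst_reg)
{
    float fwd = 0.0f;    /* forwarding path between instructions */
    for (int i = 0; i < len; ++i) {
        float a = (clause[i].a_src == SRC_REG) ? regs[clause[i].a_reg] : fwd;
        float b = (clause[i].b_src == SRC_REG) ? regs[clause[i].b_reg] : fwd;
        fwd = (clause[i].op == '*') ? a * b : a + b;
    }
    regs[dst_reg] = fwd; /* single register-file write for the whole clause */
}

int main(void)
{
    regs[0] = 2.0f; regs[1] = 3.0f; regs[2] = 4.0f;

    /* r3 = (r0 * r1) + r2, expressed as one clause of two instructions:
     * the multiply's result is forwarded straight into the add. */
    Instr clause[] = {
        { SRC_REG, SRC_REG, 0, 1, '*' },
        { SRC_FWD, SRC_REG, 0, 2, '+' },
    };
    run_clause(clause, 2, 3);
    printf("r3 = %f\n", regs[3]);   /* prints 10.0 */
    return 0;
}
```

The single write-back per clause is what relieves pressure on the register file and simplifies the surrounding control logic.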
Another innovation in the Bifrost architecture is quad-based vectorization. Midgard GPUs used SIMD vectorization, which executed one thread at a time in each pipeline stage and depended heavily on the shader code using vector instructions. Quad vectorization allows four threads to be executed together, sharing control logic. This makes it much easier to fill the execution units, achieving close to 100% utilization, and better fits recent changes in how developers write shader code.
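As a rough illustration (a simplified software model, not the actual hardware), the shift can be pictured like this: rather than one thread trying to fill a four-wide SIMD unit with vector instructions, four independent threads each occupy one lane of the quad, so even purely scalar shader code keeps the execution unit busy.

```c
#include <stdio.h>

#define QUAD_WIDTH 4   /* four threads execute together, sharing control */

/* A purely scalar "shader" -- no vector instructions required. */
static float shade(float x) { return x * 0.5f + 1.0f; }

/* Simplified model of quad execution: the same scalar instruction stream
 * is issued once for the quad, and each of the four lanes carries the data
 * for a different thread.  Utilization does not depend on the shader
 * author writing vec4 maths. */
static void run_quad(const float in[QUAD_WIDTH], float out[QUAD_WIDTH])
{
    for (int lane = 0; lane < QUAD_WIDTH; ++lane)
        out[lane] = shade(in[lane]);   /* one thread per lane */
}

int main(void)
{
    float in[QUAD_WIDTH] = { 1.0f, 2.0f, 3.0f, 4.0f };
    float out[QUAD_WIDTH];
    run_quad(in, out);
    for (int lane = 0; lane < QUAD_WIDTH; ++lane)
        printf("thread %d -> %.2f\n", lane, out[lane]);
    return 0;
}
```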
The previous generation of high-performance mobile GPUs was scalable from 1 to 16 cores. To reflect the ever-growing performance requirements of mobile devices, Mali-G71 is scalable from 1 to 32 cores. The scalability of Mali-G71 means superior graphics performance is available across a wider range of devices than ever, from DTVs through high-end smartphones right up to cutting-edge VR headsets, whether mobile-based or standalone. This flexibility, along with the 40% improvement in area efficiency, allows our partners to configure their system to their exact requirements, striking the perfect balance between power, efficiency and cost in order to position their products precisely in their target market.
Mobile gaming is fast becoming the platform of choice for gamers everywhere. In 2017 the market for mobile gaming is expected to hit over US$40 billion, up $10 billion from 2016.* This rapid growth needs to be sustainable on up-and-coming mobile devices, and with greater complexity appearing year on year, this is no mean feat. Our gaming demos from just a couple of years ago had half as many vertices as the ones we’re producing today, and this all adds up in terms of power and efficiency requirements. If applications continue to advance at this rate, the ability to scale to 32 cores could rapidly become a basic necessity for premium mobile devices. On top of this, Mali-G71 delivers 20% higher energy efficiency compared to Mali-T880 under similar conditions, translating to higher sustained device performance in thermally limited premium devices.
API advancements are something we take very seriously; after all, they define how developers interact with the underlying hardware. As a GPU and CPU company we need to meet developer needs so that end users get the best possible device experience. In recent years there’s been a move towards giving developers lower-level access to the hardware; within Khronos, this trend led to the emergence of the new Vulkan 1.0 API. In a similar vein, OpenCL 2.0 was developed to make heterogeneous compute more developer-friendly, and there are high hopes that we will see some radical new use cases popping up once OpenCL 2.0 enabled devices are shipping in the market. Mali-G71 is not only designed to support Vulkan 1.0 and OpenCL 2.0 Full Profile; it even supports fine-grained buffer shared virtual memory, enabled through full hardware coherency support. Again, this is primarily to ease software development effort, leading to better end user experiences.
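As an illustration of what fine-grained buffer SVM means for developers, here is a hedged OpenCL 2.0 host-code sketch (the context, queue and kernel are assumed to already exist, and error handling is omitted): the same pointer is valid on both CPU and GPU, so the explicit buffer copies and map/unmap calls disappear.

```c
#include <CL/cl.h>
#include <stddef.h>

/* Sketch only: assumes 'ctx', 'queue' and 'kernel' were created earlier on
 * an OpenCL 2.0 device reporting CL_DEVICE_SVM_FINE_GRAIN_BUFFER support.
 * Error handling is omitted for brevity. */
void run_with_fine_grained_svm(cl_context ctx, cl_command_queue queue,
                               cl_kernel kernel, size_t n)
{
    /* One allocation, directly usable by both CPU and GPU. */
    float *data = (float *)clSVMAlloc(
        ctx,
        CL_MEM_READ_WRITE | CL_MEM_SVM_FINE_GRAIN_BUFFER,
        n * sizeof(float), 0);

    /* CPU writes through the pointer -- no clEnqueueWriteBuffer needed. */
    for (size_t i = 0; i < n; ++i)
        data[i] = (float)i;

    /* Pass the same pointer to the kernel. */
    clSetKernelArgSVMPointer(kernel, 0, data);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &n, NULL, 0, NULL, NULL);
    clFinish(queue);

    /* CPU can read the results directly, again without an explicit copy. */
    float first = data[0];
    (void)first;

    clSVMFree(ctx, data);
}
```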
VR is what everyone’s talking about in the graphics industry at the moment: what it takes, what it needs and how to provide the very best VR experience to the user. The Mali-G71 GPU was built with just this sort of challenge in mind. The extensive performance requirements of VR mean that GPUs for high-end devices have to be more energy efficient than ever before. Not only that, but other components of the mobile device, such as cameras and displays, are advancing and performing at ever higher rates, all of which contributes to maxing out the thermal budget of the device. This puts even greater pressure on the GPU to reduce power usage wherever possible.
The Mali family of GPUs also has some great VR optimization features to allow for the best possible mobile VR experience. Front buffer rendering allows you to bypass the usual off-screen buffers and render directly to the front buffer, saving time and reducing latency. Mali also supports the ‘multiview’ API extensions, which allow the application to submit the draw commands for a frame to the driver once and have the driver instantiate the necessary work for each eye. This greatly reduces the CPU time required in both the application and the driver. On Midgard- and Bifrost-based Mali GPUs we further optimize the vertex processing work, running the parts of the vertex shader that do not depend on the eye only once and sharing the results between the two eyes. These are just some of the features that make Mali-G71 the obvious choice for the future of mobile VR.
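For illustration, here is a simplified sketch of how the multiview path looks from the application side, using the standard GL_OVR_multiview2 extension (texture setup, error handling and loading of the extension entry point via eglGetProcAddress are omitted): the draw commands are submitted once, and the per-eye work is selected with gl_ViewID_OVR.

```c
#define GL_GLEXT_PROTOTYPES 1
#include <GLES3/gl3.h>
#include <GLES2/gl2ext.h>

/* Sketch of the OVR_multiview rendering path (details simplified).  In a
 * real application glFramebufferTextureMultiviewOVR would typically be
 * resolved via eglGetProcAddress. */

/* Vertex shader: the per-eye view-projection matrix is selected with
 * gl_ViewID_OVR, so the application submits the draw commands only once. */
static const char *multiview_vs =
    "#version 300 es                                          \n"
    "#extension GL_OVR_multiview2 : require                   \n"
    "layout(num_views = 2) in;                                 \n"
    "uniform mat4 u_viewProj[2];   /* one matrix per eye */    \n"
    "in vec4 a_position;                                       \n"
    "void main() {                                             \n"
    "    gl_Position = u_viewProj[gl_ViewID_OVR] * a_position; \n"
    "}                                                         \n";

/* Attach a 2-layer texture array so a single draw targets both eyes. */
void attach_multiview_target(GLuint eye_texture_array)
{
    glFramebufferTextureMultiviewOVR(GL_DRAW_FRAMEBUFFER,
                                     GL_COLOR_ATTACHMENT0,
                                     eye_texture_array,
                                     0,    /* mip level        */
                                     0,    /* base view index  */
                                     2);   /* number of views  */
}
```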
We’re using our phones for more and more; these days many of us don’t even need a home computer or laptop because we can do everything we need on our phone, including downloading and viewing content and streaming it to other devices. The recently released Mali-DP650 display processor already has the capability to handle 4K content, and Mali-G71 allows this content to be streamed seamlessly to your TV without losing any of the quality. This means that, whilst 4K hasn’t yet taken off on mobile, you don’t need to miss out on any of the benefits when viewing the content on a separate 4K device.
Mali-G71 was designed and optimized as part of a complete system, working better together as part of the Mali Multimedia Suite, with CCI-550 providing full coherency between CPU and GPU. Mali-G71 achieves the highest possible performance for mobile graphics within the smallest possible power budget and silicon area, allowing our partners to reach the pinnacle of mobile graphics in the most scalable and customizable way. With Mali-G71 based devices expected to hit the shelves early in 2017, next-level mobile gaming and graphics is well within your grasp.
If you enjoyed this blog, why not read about memory systems and Mali-G71 below?
Memory system is key to user experience with Mali-G71: https://community.arm.com/processors/b/blog/posts/memory-system-is-key-to-user-experience-with-cortex-a73-and-mali-g71
Got it! Thanks!
I notice that there doesn't seem to be a special unit for doing complex math. Is the quad being used as a pipeline for operations like sqrt, div, trig, etc.? Or does the unit exist, but is absent from the diagrams?
All of the arithmetic processing units are wrapped up inside the "Execution Engine" blocks from a diagram point of view.
Very interesting!
And IDPS seems like it makes a lot of sense. It should save tremendous amounts of bandwidth by culling unseen geometry altogether (after initial shading) from further processing, especially given a tile-based workload that requires intermediate write-out to memory. I'm guessing that you'll cover IDPS in much more detail, so I will withhold further questions!
I'm very excited by Bifrost, and very excited to read your article!
Sean
Near-full utilization of the "quad" ALUs will undoubtedly mean that developers will be able to simplify their shaders
That's the general idea, although for the most part this change is really just reflecting the current state of the industry in terms of the shader code we are now seeing in high-end applications. Graphics and compute shaders increasingly contain more complex control flow, rather than straight-line vector DSP lighting code (for example), and control logic is nearly always scalar, which makes efficient use of SIMD maths units challenging/impossible.
It's important to note that there is still some advantage to writing vector code, in particular for memory accesses in OpenCL, as it is more efficient to make big loads from the caches than multiple smaller ones.
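As a simple OpenCL C sketch of that point (illustrative only, not taken from the original discussion), the vector version below issues one wide load and one wide store per work-item instead of four narrow ones, even though the arithmetic itself is trivially scalar:

```c
/* Scalar version: four separate loads and stores per work-item. */
__kernel void scale_scalar(__global const float *in,
                           __global float *out,
                           float k)
{
    size_t i = get_global_id(0) * 4;
    out[i + 0] = in[i + 0] * k;
    out[i + 1] = in[i + 1] * k;
    out[i + 2] = in[i + 2] * k;
    out[i + 3] = in[i + 3] * k;
}

/* Vector version: one wide load and one wide store per work-item,
 * which makes better use of the caches and the load/store unit. */
__kernel void scale_vector(__global const float *in,
                           __global float *out,
                           float k)
{
    size_t i = get_global_id(0);
    float4 v = vload4(i, in);
    vstore4(v * k, i, out);
}
```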
Oh, and it's nice to see a dedicated varying unit, which should be much more efficient than doing interpolation on the ALUs, freeing them up to do other things.
We always had a dedicated interpolator so that part isn't new, but it's been split out from the general purpose load/store unit which means we can make it much smaller / more efficient / lower result latency.
Am I correct in assuming that the compiler offsets some of this work [for clauses and scheduling]?
Yes, although I can't go into any detail on the public forums, sorry.
I'm also curious about the Index-Driven Position Shading, as there is mention that it reduces (tiling?) bandwidth, but I'm still not sure how this works or what indeed it means.
Mali has historically processed vertices first, and then worked out if they were front-facing or back-facing in the tiler as a second pass. The Index-Driven Vertex Shading flow builds triangles first, computes the on-screen position, and then only computes and writes the remaining varyings if the triangle is actually visible (inside the frustum, not killed by facing tests). In summary: ideally we shouldn't generate any intermediate varying bandwidth for triangles which are not visible, although a lot depends on the spatial locality of primitives in the models (insane geometry may not benefit as much, but I doubt this is a Mali-specific problem).
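As a rough illustration (conceptual C pseudo-code only, not how the hardware or driver is actually built), the Index-Driven Vertex Shading flow described above amounts to something like this:

```c
/* Conceptual sketch of the Index-Driven Vertex Shading flow -- not an
 * actual hardware or driver implementation.  Positions are shaded per
 * triangle first; the remaining varyings are only computed and written
 * out for triangles that survive the visibility tests. */
typedef struct { float x, y, z, w; } Vec4;
typedef struct { unsigned v[3]; } Triangle;

extern Vec4 shade_position(unsigned vertex_index);   /* position part   */
extern void shade_varyings(unsigned vertex_index);   /* remaining work  */
extern int  inside_frustum(const Vec4 pos[3]);
extern int  front_facing(const Vec4 pos[3]);

void idvs_flow(const Triangle *tris, unsigned tri_count)
{
    for (unsigned t = 0; t < tri_count; ++t) {
        Vec4 pos[3];

        /* 1. Build the triangle from indices and shade positions only. */
        for (int i = 0; i < 3; ++i)
            pos[i] = shade_position(tris[t].v[i]);

        /* 2. Visibility tests on positions alone. */
        if (!inside_frustum(pos) || !front_facing(pos))
            continue;   /* culled: no varying work, no varying bandwidth */

        /* 3. Only now shade and write out the remaining varyings. */
        for (int i = 0; i < 3; ++i)
            shade_varyings(tris[t].v[i]);
    }
}
```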
I'll be doing a performance blog on IDVS and what application developers need to do to make best use of it shortly.
Cheers, P
This is incredibly exciting! Near-full utilization of the "quad" ALUs will undoubtedly mean that developers will be able to simplify their shaders, and ARM GPUs should perform extremely well! While the implementation will most likely be extremely different, it seems very close in functionality to SIMT execution, where each SIMD lane is assigned to a different thread rather than to a single thread's operations. I expect that this design will be quite area- and power-efficient compared to designs that have separate dedicated ALUs for 16- and 32-bit math, but with the added benefit of mixing and matching precision or handling highp floats for the rare cases where they would be useful. I'm really excited to see the ALU perf of the Mali G71 -- the GeForce 940M performance claims are nothing short of extraordinary!
And the "clauses" are brilliant! The idea that the register file can be circumvented for a set of calculations is quite interesting, and having guaranteed blocks of execution should reduce power dramatically. I do wonder how beefy the quad-manager would have to be, though -- it seems like it has quite the task of determining quad workloads. Am I correct in assuming that the compiler offsets some of this work?
I'm also curious about the Index-Driven Position Shading. There is mention that it reduces (tiling?) bandwidth, but I'm still not sure how this works or what indeed it means.
I'm eagerly awaiting the day when the "big" CPU cores, "LITTLE" CPU cores, DSPs, etc. start looking like "special" GPU cores all connected by the same fabric, with all jobs being dispatched by a single job-manager. I (obviously) have zero IC design experience, but my high-level intuition tells me that this could dramatically reduce chip wiring and open up interesting possibilities (e.g. eliminating the need for job migration across cores -- drain existing cores and schedule new operations appropriately). Another interesting benefit would be multi-threaded CPU processing, or using GPU ALUs to fully replace the NEON SIMD -- it all becomes a scheduling problem!