The new Arm Mali-G710 GPU, and its smaller siblings, include several hardware changes to improve performance and rendering energy efficiency. Some of these changes alter our best practice recommendations that help developers get the best performance out of the hardware. This blog gives an overview of the new GPUs, the changes they contain, and a summary of the corresponding updates to the best practice guidelines.
This GPU generation has a similar block architecture to earlier Mali GPUs.
The GPU front-end takes work submissions from the driver and dispatches them to the relevant GPU processing units. The fixed-function tiling unit coordinates the vertex processing pipeline and handles the primitive binning that drives Mali's tile-based rendering scheme. There are one or more unified shader cores, which handle all types of shader processing, and one or more slices of level two cache, which buffer data fetched from external memory. If you have seen a Mali GPU before then this should all look familiar.
However, there are two major changes introduced in this generation of hardware hidden in this diagram …
The most visible change is a move to a larger and faster shader core, with twice the throughput of the shader core in the previous Mali-G78 GPUs. Each core can execute 64 32-bit FMAs per clock, make 8 bilinear texture samples per clock, blend 4 fragments per clock, and write 4 pixels per clock. To sustain this higher performance, the shader core can now hold 2048 threads, and the Mali tile size has increased to 32x32 pixels¹. This is the first increase in Mali tile size since the first Mali GPU, Mali-55, which was released all the way back in 2005!
The move to a larger shader core means that designs will need to use fewer cores to hit a specific performance point, so you can expect to see designs with lower "MP" shader core counts than earlier generations. However, the actual achievable performance will improve as the larger shader core chassis is more capable and brings improvements to both energy efficiency and silicon area efficiency. This is good news for developers, as it means you can get more performance out of the system while staying within the thermal constraints of a mobile device.
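As a rough illustration of the arithmetic behind these figures, the sketch below compares core counts and tile counts. The per-core FMA rate comes from the text above, and the Mali-G78 rate is derived from the "twice the throughput" statement. The 16x16 tile size is the one used by earlier Mali GPUs. The 640 FMAs-per-clock target and the 1920x1080 framebuffer are hypothetical values chosen only for the example.

```python
import math

# Per-core, per-clock figure quoted above for the largest Mali-G710 shader core.
G710_FMA_PER_CLOCK = 64
# "Twice the throughput of Mali-G78" implies half that rate per Mali-G78 core.
G78_FMA_PER_CLOCK = G710_FMA_PER_CLOCK // 2


def cores_needed(target_fma_per_clock, fma_per_core):
    """Smallest shader core count that reaches the target FMA rate per clock."""
    return math.ceil(target_fma_per_clock / fma_per_core)


def tiles_per_frame(width, height, tile_size):
    """Number of square tiles needed to cover a framebuffer."""
    return math.ceil(width / tile_size) * math.ceil(height / tile_size)


# Hypothetical design target: 640 FMAs per clock across the whole GPU.
print("Mali-G78 cores: ", cores_needed(640, G78_FMA_PER_CLOCK))
print("Mali-G710 cores:", cores_needed(640, G710_FMA_PER_CLOCK))

# Hypothetical 1920x1080 framebuffer, old 16x16 tiles versus new 32x32 tiles.
print("16x16 tiles:", tiles_per_frame(1920, 1080, 16))
print("32x32 tiles:", tiles_per_frame(1920, 1080, 32))
```

The same performance point needs half as many of the larger cores, and each frame is covered by roughly a quarter as many tiles.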
The second, and more significant, change is a completely new GPU front-end, which provides the hardware interface that the driver uses to issue work to the GPU. The Command Stream Frontend (CSF) replaces the Job Manager found on earlier products, giving Mali support for native command stream handling.
The driver for a CSF GPU writes a command stream — consisting of state changes, rendering requests, and stream synchronization operations — into a shared memory buffer that is visible to both the application and the GPU hardware. The CSF consumes the command streams written by the user-space driver, executing the commands they contain to build the complete render state, submit workloads to the GPU hardware queues, and synchronize across streams to ensure correct order of execution.
The move to the CSF gives multiple developer-visible improvements.
The biggest improvement that the CSF gives is a significant reduction in the CPU usage of the Mali driver. Nearly all operations on the high-frequency per-draw path can be handled as command stream operations, and the driver now only needs to transmit the state changes for each draw, rather than re-emitting the whole state. In addition, the CSF can handle most stream scheduling and synchronization itself, significantly reducing the workload of the Mali kernel driver too. Reducing CPU load improves battery life, but can also free up thermal budget which developers can choose to spend on more complex rendering instead.
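The difference between re-emitting the whole state and transmitting only state changes can be pictured with a toy model. This is a minimal sketch, not the real CSF command encoding: the `SET` and `DRAW` command names, the state keys, and both functions are hypothetical and exist only to count how many commands each approach would write.

```python
def emit_full(draws):
    """Model of re-emitting the complete state for every draw."""
    commands = []
    for state in draws:
        for key, value in state.items():
            commands.append(("SET", key, value))
        commands.append(("DRAW",))
    return commands


def emit_delta(draws):
    """Model of emitting only the state that changed since the last draw."""
    commands = []
    current = {}  # state already present in the stream
    for state in draws:
        for key, value in state.items():
            if current.get(key) != value:
                commands.append(("SET", key, value))
                current[key] = value
        commands.append(("DRAW",))
    return commands


# Three hypothetical draws that share most of their state.
draws = [
    {"pipeline": "opaque", "texture": "brick", "uniforms": 0},
    {"pipeline": "opaque", "texture": "brick", "uniforms": 1},
    {"pipeline": "opaque", "texture": "wood", "uniforms": 2},
]
print(len(emit_full(draws)), "commands with full state")
print(len(emit_delta(draws)), "commands with state deltas")
```

The more state consecutive draws have in common, the larger the saving on the per-draw path.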
Moving to a command stream also gives Mali native support for the parts of Vulkan that are designed around hardware that executes command streams.
For the first time, this generation of hardware can directly invoke secondary command buffers, and handle the state inheritance associated with them. Previous best practice recommendations to minimize use of secondary command buffers no longer apply.
Mali also gains native support for indirect draw calls and compute dispatches for both OpenGL ES and Vulkan. For Vulkan, Mali now supports both vkCmdDrawIndirect() with a drawCount higher than 1, and vkCmdDrawIndirectCount() where the draw count parameter is sourced indirectly at runtime from a buffer. Previous best practice recommendations to minimize use of indirect draws no longer apply.
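To make the data involved concrete, the sketch below builds the two buffers that an indirect draw with an indirect count reads. `VkDrawIndirectCommand` is four consecutive 32-bit unsigned integers (`vertexCount`, `instanceCount`, `firstVertex`, `firstInstance`), and the count buffer holds a single 32-bit unsigned integer. The helper names and draw values are hypothetical, little-endian byte order is assumed, and uploading the bytes to Vulkan buffers is left out.

```python
import struct

# VkDrawIndirectCommand: vertexCount, instanceCount, firstVertex, firstInstance.
# Little-endian byte order is an assumption about the target device.
DRAW_CMD = struct.Struct("<4I")


def pack_indirect_draws(draws):
    """Pack draw parameter tuples into tightly packed indirect buffer bytes."""
    return b"".join(DRAW_CMD.pack(*draw) for draw in draws)


def pack_draw_count(count):
    """Pack the draw count that vkCmdDrawIndirectCount() reads from a buffer."""
    return struct.pack("<I", count)


# Three hypothetical draws that the GPU would consume without CPU involvement.
draws = [(36, 1, 0, 0), (6, 10, 36, 0), (3, 1, 42, 0)]
indirect_buffer = pack_indirect_draws(draws)
count_buffer = pack_draw_count(len(draws))
stride = DRAW_CMD.size  # byte stride between consecutive commands

print(len(indirect_buffer), "bytes of draw parameters, stride", stride)
```

In a real application these bytes would usually be written by a compute shader rather than the CPU, which is what makes runtime-sourced draw counts useful for GPU-driven culling.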
The final change to the front-end is that the hardware interface that the CSF uses to submit workloads to the shader cores has been widened to include a dedicated compute workload hardware queue, making three queues in total:
This decouples submissions for vertex and compute workloads, reducing the number of instances of false dependencies, and allows developers using our Streamline profiler to distinguish the two types of workload. Note that the hardware queue capabilities are not the same as the Vulkan queue capabilities exposed at the API level, and the API queue capabilities are unchanged.
Mali-G71 and the Bifrost architecture introduced Mali's optimized index-driven vertex shading (IDVS) pipeline. Using IDVS splits the vertex shader into position and non-position parts, and only runs the non-position part if vertices are visible after culling. IDVS has been historically usable for the most common types of rendering operation, but there have been a few cases where we needed to fall back to a traditional monolithic vertex shader.
Mali-G710 improves the feature coverage of IDVS to support vertex shaders that emit multiple positions, as found in Vulkan layered rendering and OpenGL ES multi-view rendering. These features are commonly used in AR and VR use cases, which need to render a per-eye view with subtly different object positions in each. To get the best performance out of IDVS we recommend that vertex position attributes are tightly packed in a separate buffer region to the non-position attributes. This allows the position calculation to run without polluting the cache with non-position data. This significantly reduces vertex bandwidth because, typically, even in well-written content, half of the primitives are culled by the facing test.
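The recommended layout can be sketched as two separately packed streams. This is a simplified model under stated assumptions: the attribute set (a three-float position, a three-float normal, and a two-float texture coordinate) is hypothetical, culling is treated per vertex, and cache-line granularity and vertex reuse are ignored. The interleaved estimate assumes the position pass pulls in all neighbouring non-position data.

```python
import struct

POSITION_BYTES = 12  # hypothetical: 3 x 32-bit float
OTHER_BYTES = 20     # hypothetical: 3 x float normal + 2 x float UV


def split_streams(vertices):
    """Pack positions into one tight stream and other attributes into another."""
    positions = b"".join(struct.pack("<3f", *v["position"]) for v in vertices)
    others = b"".join(
        struct.pack("<3f2f", *v["normal"], *v["uv"]) for v in vertices
    )
    return positions, others


def bytes_fetched_split(vertex_count, visible_fraction):
    """Positions for every vertex, other attributes only for visible vertices."""
    visible = int(vertex_count * visible_fraction)
    return vertex_count * POSITION_BYTES + visible * OTHER_BYTES


def bytes_fetched_interleaved(vertex_count):
    """Rough upper bound when position reads drag in non-position data."""
    return vertex_count * (POSITION_BYTES + OTHER_BYTES)


vertices = [
    {"position": (0.0, 0.0, 0.0), "normal": (0.0, 0.0, 1.0), "uv": (0.0, 0.0)},
    {"position": (1.0, 0.0, 0.0), "normal": (0.0, 0.0, 1.0), "uv": (1.0, 0.0)},
]
positions, others = split_streams(vertices)
print(len(positions), len(others))

# 1000 vertices with half culled, as in the facing-test example above.
print(bytes_fetched_split(1000, 0.5), "versus", bytes_fetched_interleaved(1000))
```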
The shader core includes many hardware improvements to improve efficiency that are not developer visible, but we want to highlight two improvements that impact specific application behaviors.
The first is improved fragment thread scheduling in scenarios where there are tile access dependencies between layers of pixels. When shaders access tile memory, either through blending or programmable behavior (framebuffer fetch, pixel local storage (PLS), or subpasses), the hardware has to ensure that the shader sees the correct data. Conceptually, later fragments that read from an attachment must wait until there are no earlier writers at that coordinate, and later fragments that write to an attachment must wait until there are no earlier readers or writers at that coordinate. If a fragment attempts to access an attachment while an older fragment still has accesses pending, then the hardware will need to stall it and wait for that older fragment to complete.
In older Mali hardware we used a single dependency tracker per pixel location for color/PLS data. A fragment would stall on first tile access until it became the oldest fragment at that pixel location. For complex shaders with multiple resources stored in the tile memory (such as multiple render targets, OpenGL ES pixel local storage structures, or merged Vulkan subpass attachments) a single tracker is conservative and can result in higher levels of serialization between layers. If too many fragments get blocked on dependencies, the shader core can run out of warps to execute and shader execution performance starts to degrade.
When developers have used PLS or multiple subpasses on Mali, they have often found that bandwidth significantly improved due to in-tile data exchange, and overall energy efficiency improved due to fewer DRAM accesses, but the clock-for-clock performance was often worse due to the conservative tile access dependencies between layers.
The diagram below shows how fragment shaders for later layers must stall on tile access (orange) until earlier layers have made their final tile access:
This generation of hardware can track dependencies on a finer granularity, allowing this style of shader to make pipelined tile access across layers. This results in fewer stalls, and overall better performance for data-exchange via tile memory. The diagram below shows how the layer scheduling inside the shader core might look in Mali-G710. The good news is that this is completely automatic and there is nothing you need to change in your application, so it is a good time to try in-tile shading techniques!
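The effect of tracking granularity can be explored with a toy timing model of a single pixel location. This is only a sketch under loud assumptions: the text above does not say what the finer granularity is, so the model assumes per-attachment tracking. It also assumes that all fragments start at the same time, and the shader steps and cycle counts are hypothetical.

```python
def simulate(shader, layers, fine_grained):
    """Return the finish time of each layer's fragment at one pixel.

    shader is a list of (kind, attachment, cycles) steps, where kind is
    "alu" or "tile". All fragments start at time 0 and run in parallel.
    """
    last_access_end = {}  # attachment -> end time of the newest access so far
    prev_finish = 0       # finish time of the previous (older) fragment
    finish_times = []
    for _ in range(layers):
        t = 0
        touched_tile = False
        for kind, attachment, cycles in shader:
            if kind == "tile":
                if fine_grained:
                    # Assumed model: wait only for older accesses to this attachment.
                    t = max(t, last_access_end.get(attachment, 0))
                elif not touched_tile:
                    # Single tracker: stall on first tile access until every
                    # older fragment at this pixel has finished.
                    t = max(t, prev_finish)
                touched_tile = True
            t += cycles
            if kind == "tile":
                last_access_end[attachment] = t
        prev_finish = t
        finish_times.append(t)
    return finish_times


# Hypothetical shader: compute, access one attachment, compute, access another.
SHADER = [
    ("alu", None, 10),
    ("tile", "color0", 1),
    ("alu", None, 10),
    ("tile", "color1", 1),
]
print("single tracker:", simulate(SHADER, 3, fine_grained=False))
print("finer tracking:", simulate(SHADER, 3, fine_grained=True))
```

In this model the single tracker serializes most of each layer behind the previous one, while the finer-grained version lets the layers overlap and finish a cycle apart.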
Using multiple subpasses in Vulkan is not always an easy drop-in for an existing rendering pipeline, so to make in-tile shading more accessible we have also released a new Vulkan extension that gives SPIR-V fragment shaders framebuffer-fetch-like programmatic access to attachment data: VK_ARM_rasterization_order_attachment_access. This new extension should start rolling out in devices later in 2022.
Mali-G77, first shipped in 2020, provided our initial support for Vulkan bindless resource access, via the VK_EXT_descriptor_indexing extension. This is a technique which allows shaders to select resources at runtime based on a dynamic index provided by the shader code, instead of using constant bindings selected at draw-time. Using bindless techniques removes material resources from the static state, making it easier to reduce draw call counts by batching, which remains a useful technique to reduce CPU load.
The latest Mali drivers have implemented a number of improvements to the implementation of this extension, reducing the shader bindless lookup overhead in the common case where the index values are dynamically uniform across a warp. The changes here are mostly compiler code generation improvements, so older Mali-G77 and Mali-G78 devices may also see improvements if driver updates are available for them.
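What "dynamically uniform across a warp" means can be shown with a small check over the descriptor indices used by each thread. This is a minimal sketch: the helper names are hypothetical, the warp width of 16 is an example parameter, and the "one lookup per distinct index" cost model is an assumption made for illustration, not a description of what the compiler generates.

```python
def is_dynamically_uniform(indices):
    """True if every thread in the group uses the same descriptor index."""
    return len(set(indices)) <= 1


def distinct_lookups_per_warp(indices, warp_size):
    """Toy cost model: one descriptor lookup per distinct index in each warp."""
    return [
        len(set(indices[start:start + warp_size]))
        for start in range(0, len(indices), warp_size)
    ]


WARP_SIZE = 16  # example value

# First warp: all threads sample material 3. Second warp: two materials mixed.
indices = [3] * 16 + [3] * 8 + [7] * 8
print(is_dynamically_uniform(indices[:WARP_SIZE]))
print(is_dynamically_uniform(indices[WARP_SIZE:]))
print(distinct_lookups_per_warp(indices, WARP_SIZE))
```

Batching draws so that neighbouring fragments tend to share a material index keeps more warps in the uniform case.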
General best practice recommendations for bindless descriptor access remain otherwise unchanged. Here is a reminder:
This generation of hardware introduces AFBC lossless compression for fp16 render targets, allowing HDR rendering pipelines using floating point rendering to benefit from reduced framebuffer bandwidth.
It should be noted that compressed fp16 attachments will still be larger than compressed framebuffers using a narrower color format, so best performance and lowest bandwidth is still achieved using a 32-bpp format such as RG11B10f.
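The size gap between the formats is easy to estimate before compression. The sketch below computes uncompressed attachment sizes only and does not model AFBC compression ratios, which depend on content. The four-channel fp16 format (labelled RGBA16F here) and the 1920x1080 resolution are hypothetical example choices.

```python
# Bits per pixel for the formats being compared.
BITS_PER_PIXEL = {
    "RGBA16F": 64,   # four fp16 channels (example fp16 format)
    "RG11B10F": 32,  # packed 11/11/10-bit floating point
}


def uncompressed_bytes(width, height, fmt):
    """Size in bytes of one uncompressed color attachment."""
    return width * height * BITS_PER_PIXEL[fmt] // 8


for fmt in BITS_PER_PIXEL:
    size = uncompressed_bytes(1920, 1080, fmt)
    print(f"{fmt}: {size / (1024 * 1024):.1f} MiB")
```

The fp16 attachment starts at twice the size of the packed format, which is why the narrower format remains the lower-bandwidth choice when its precision is sufficient.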
Mali-G710 introduces some major changes that make the GPU more efficient, and which give application developers more flexible options for rendering with it. We are looking forward to seeing what you can do now the GPU is on silicon in premium smartphones.
If you would like to learn more, come and see the Oppo Find X5 Pro, powered by the MediaTek Dimensity 9000 and Mali-G710, in action at GDC. You can find us on the Arm booth at #S756 in the Moscone Center in San Francisco.
You can also find more information about Mali best practices and the free-of-charge Arm Mobile Studio profiling tools on our developer website.
1: The speed-of-light performance documented here applies to the largest configuration of the shader core. The Mali-G310 and Mali-G510 designs are configurable to allow silicon designs to be optimized for targeted use cases, and can have variable performance based on the chosen configuration.