The new Arm® Immortalis™-G715 GPU, and its smaller Arm Mali siblings, are now available in consumer devices and accessible to developers. Every new generation of Arm GPUs brings improvements, providing faster and more energy-efficient cores, as well as new features and extensions. This blog explores what is new and exciting this year, and how you can effectively use these features.
The biggest new feature in Immortalis-G715 is hardware-accelerated ray tracing. Ray tracing gives developers freedom to create effects, such as non-planar mirror reflections and multi-layer refractions, which are difficult to emulate using rasterization alone. Vulkan ray query and the full Vulkan ray tracing pipeline are supported in the hardware, although some early drivers may only expose the ray query extension.
Ray query (VK_KHR_ray_query) gives access to ray tracing functionality from within existing pipeline stages. This is the method we expect most mobile content to choose when starting out with ray tracing. It allows developers to create a single renderer that works on all devices, but which can be augmented with ray-traced effects on suitable devices.
The ray tracing pipeline (VK_KHR_ray_tracing_pipeline) replaces the existing graphics pipeline mode in the API. It allows developers to create a fully ray-traced renderer, using ray generation instead of rasterization to trigger pixel coloring. Bypassing the traditional rasterization-based pipeline gives some creative freedom, but also comes with downsides, such as loss of framebuffer compression. We recommend starting with ray query, as it is likely to give the best performance on mobile.
The ray tracing hardware is fully supported in the Arm Mobile Studio suite of profiling tools. Extensive counter instrumentation is available in our Streamline profiler, and shader best practice static analysis is available when using Mali Offline Compiler.
Ray traversal is handled on a per-warp basis, so the best performance is achieved by keeping the rays within each warp coherent. Minimize traversal divergence by keeping rays pointed in similar directions, and with similar ray origin and length. To ensure that the compiler can statically analyze traversals for warp-uniformity, we recommend performing a single rayQueryProceed() per rayQueryInitialize(). We also recommend avoiding queries inside loops or complex conditional control flow.
Traversal flags also influence ray traversal efficiency. Traversal is faster if both gl_RayFlagsSkipAABB and gl_RayFlagsOpaque are unconditionally set at compile time. For shadowing and occlusion tests, we only care about the existence of intersecting geometry, but not exactly which geometry is intersected. For these use cases, use the gl_RayFlagsTerminateOnFirstHit flag because it allows traversal to be terminated early without resolving a specific closest-hit.
The topology of the acceleration structure can impact ray tracing performance. To minimize the number of nodes that must be traversed, aim to minimize the bounding volume of each node in the tree. Minimize overlapping bottom-level acceleration structures (BLAS) by splitting bottom-level nodes that contain large spatial gaps into multiple BLAS entries. Minimize the bounding volume of nodes inside each BLAS by ensuring that geometry is axially aligned, using a BLAS transform to apply any necessary rotation.
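As an illustration of why splitting spatially separated geometry helps, a quick bounding-volume comparison shows how much empty space a single merged node can contain. This is a sketch only; the box positions and sizes are hypothetical:

```c
#include <assert.h>

/* Axis-aligned bounding box, as used conceptually for BLAS bounds. */
typedef struct { float min[3], max[3]; } Aabb;

static float aabb_volume(Aabb b) {
    return (b.max[0] - b.min[0]) *
           (b.max[1] - b.min[1]) *
           (b.max[2] - b.min[2]);
}
/* Two unit cubes at x=[0,1] and x=[9,10] total 2 units of volume when
 * bounded separately, but a single merged box [0,10]x[0,1]x[0,1] has
 * volume 10: every ray crossing the gap must still test the merged node. */
```

Splitting the two clusters into separate BLAS entries lets rays that pass through the gap skip both nodes entirely.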
The second significant feature this year is Variable Rate Shading (VRS), which is available for both Vulkan (VK_KHR_fragment_shading_rate) and OpenGL ES (GL_EXT_fragment_shading_rate). Using VRS, an application can reduce fragment shader evaluation frequency, improving performance for shader-heavy workloads. A single fragment shader invocation can write output to a region larger than a single pixel, while keeping accurate sample coverage to preserve edge accuracy.
Arm GPUs implement the full VRS feature set, equivalent to the DirectX Tier 2 feature set. This exposes shading for 1x1, 1x2, 2x1, 2x2, 2x4, 4x2, and 4x4 pixel regions. Rate selection can be provided per-draw, per-primitive, or using a screen-space rate control attachment. Logical combiners are also provided, allowing an application to merge decisions from these three sources.
The best visual results are achieved with VRS when using it to reduce shading rate for objects that contain little high frequency data. This minimizes the visual error introduced by the shading rate reduction.
One example of a less detailed draw call is a skybox. This is a good candidate for pipeline-level VRS, because the lack of detail is uniform over the whole object. However, skyboxes tend to use simple shaders, so the benefit is likely to be small.
More usefully, VRS can be used on rendered objects that are known to have dynamically less detail at runtime, based on how they are being used. VRS can be used to reduce the cost of objects that are going to be blurred anyway, for example, due to depth-of-field or motion-blur. VRS can also be controlled using a rate control image attachment that adjusts the cost based on screen location. This can be useful to reduce cost in areas that are smooth or dark, where fine detail may not be visible.
The benefit of VRS is going to depend on your content. VRS guarantees that shader evaluation frequency is reduced; however, rasterization, ZS testing, and blending are still evaluated at the original sample density. Applications using simple shaders should see an energy efficiency improvement from VRS, but may not see a performance benefit. The benefits of VRS are not linear, and it tends to become less effective with larger region sizes as non-shader costs start to become the bottleneck.
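The diminishing returns can be sketched with a simple Amdahl-style estimate, where only the fragment-shading fraction of frame cost scales with the region area. This is illustrative only; the 50% shader-cost fraction below is a hypothetical example:

```c
#include <assert.h>

/* Estimated whole-frame speedup from VRS when only the fragment-shading
 * fraction of frame cost scales down with the shading region area;
 * rasterization, ZS testing, and blending remain at full sample density. */
static double vrs_speedup(double shader_fraction, int region_w, int region_h) {
    double shaded = shader_fraction / (double)(region_w * region_h);
    return 1.0 / ((1.0 - shader_fraction) + shaded);
}
/* With shaders at 50% of frame cost: 2x2 gives ~1.6x, but 4x4 only
 * ~1.88x, because the fixed non-shader cost starts to dominate. */
```

Quadrupling the region area from 2x2 to 4x4 yields only a small extra gain in this model, matching the non-linear behavior described above.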
Content with complex layering can sometimes see an unexpected slowdown when using 2x4, 4x2, or 4x4 regions. To avoid this, try limiting your VRS implementation to region sizes of 2x2 or smaller, using a size that is statically determinable by the driver. Note that per-primitive rates set via gl_PrimitiveShadingRateEXT, and runtime combiner operations such as OP_MIN, can prevent the driver from determining the rate statically.
You can measure your actual content shading rate using hardware counters accessible in our Streamline profiler. The new shading rate counter reports the number of fragments shaded as a percentage of the number of pixels covered. A value of 100% indicates a 1:1 shading rate, and a value under 100% shows how much VRS is reducing the shading rate. Note that sample-rate shading will also show in this counter: a value over 100% indicates use of sample-rate shading on a multi-sampled framebuffer.
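The counter's arithmetic is straightforward; this sketch shows how the reported percentage relates to fragment and pixel counts as described above:

```c
#include <assert.h>

/* Shading rate as reported: fragments shaded as a percentage of the
 * pixels covered. Under 100% means VRS is merging fragments; over 100%
 * means sample-rate shading on a multi-sampled framebuffer. */
static double shading_rate_percent(unsigned long long fragments_shaded,
                                   unsigned long long pixels_covered) {
    return 100.0 * (double)fragments_shaded / (double)pixels_covered;
}
/* E.g. full-screen 2x2 VRS shades one fragment per four pixels: 25%.
 * Per-sample shading on a 4xMSAA target shades four per pixel: 400%. */
```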
This year sees our premium GPU configurations get access to Arm Fixed Rate Compression (AFRC), a visually lossless fixed bit-rate compression for LDR images.
Our existing compression scheme, Arm Frame Buffer Compression (AFBC), is lossless. This means we can enable it transparently, provided the application follows best practice guidelines around API usage. However, because it is lossless, the compression ratio is unpredictable. For game-like images, we typically achieve around 2:1 compression with AFBC, but the driver must assume no compression and always allocate sufficient storage for a full-sized image.
By comparison, AFRC is a lossy fixed-rate compression that allows up to a 4:1 compression ratio. This doubles the average compression ratio compared to AFBC. The fixed-rate nature also means that the driver can guarantee the lower storage requirement, reducing the image memory footprint in proportion to the compression ratio. AFRC is available for 1-4 component 8-bit per component UNORM images, with support for both linear and sRGB color formats. Several fixed compression ratios are available on the Arm Immortalis-G715 this year.
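Because the rate is fixed, the footprint saving can be computed up front. A small sketch, with the image size chosen purely for illustration:

```c
#include <assert.h>

/* Storage in bytes for a simple 2D image, uncompressed. */
static unsigned long long image_bytes(unsigned w, unsigned h,
                                      unsigned bytes_per_pixel) {
    return (unsigned long long)w * h * bytes_per_pixel;
}

/* Guaranteed storage with a fixed-rate ratio such as AFRC's 4:1. */
static unsigned long long fixed_rate_bytes(unsigned long long uncompressed,
                                           unsigned ratio) {
    return uncompressed / ratio;
}
/* A 1920x1080 RGBA8 attachment drops from ~8.3 MB to ~2.1 MB at 4:1. */
```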
The obvious question is what "visually lossless" means in practice. The good news is that although the compression is lossy, the achieved image quality is very high, with excellent preservation of hard edges and chroma transitions. It is very hard to see differences even when using the highest compression ratio and comparing two images side by side.
Because this is a lossy compression scheme, applications must enable it on a per-image basis using VK_EXT_image_compression_control for application-managed images, and VK_EXT_image_compression_control_swapchain for swapchain-managed images.
DRAM accesses are among the most power-intensive operations an application performs, so efficient use of bytes is critical to getting great battery life. In modern high-end content, intermediate render-to-texture passes and their subsequent use as textures are usually the largest contributor to frame memory bandwidth. We view the high compression ratio of AFRC as a significant enabler for application energy efficiency, and therefore encourage you to give it a go.
The AFRC format follows the same use case design as our AFBC format. It is only available on the fragment path for compression, and using the texture unit for decompression. To allow use of AFRC, images must conform to the following API usage restrictions:

- VK_IMAGE_TILING_OPTIMAL tiling must be used.
- VK_IMAGE_USAGE_STORAGE_BIT must not be set.
- VK_IMAGE_USAGE_FRAGMENT_SHADING_RATE_ATTACHMENT_BIT must not be set.
- VK_IMAGE_CREATE_ALIAS_BIT must not be set.
Note that AFRC may not always be the most efficient format choice. It always gives the best memory footprint and a guaranteed reduction in worst-case bandwidth, due to the fixed bitrate. However, if a compressed image is very simple, AFBC can beat AFRC for bandwidth. For render passes that write a high proportion of simple block color, you may find that AFBC gives the best memory bandwidth efficiency.
The heart of any GPU is the shader core, and this year introduces our most energy-efficient and highest performance shader core yet. The Immortalis-G715 core is an evolution of the Mali-G710 core, giving an average performance improvement of 15% across a wide range of applications.
The biggest change in the programmable core is a doubling of the fused multiply-accumulate (FMA) throughput. The new shader core supports 128 fp32 FMAs or 256 fp16 FMAs per clock cycle, helping high-complexity shaders to really shine. In addition, subgroup operations and integer bitwise operations move out of the special functions unit (SFU) and into the main pipeline, which gives four times the performance for these instruction types.
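To put the FMA numbers in perspective, peak per-core throughput scales directly with clock frequency. A sketch, noting that the 0.85 GHz clock below is purely hypothetical and that each FMA counts as two floating-point operations:

```c
#include <assert.h>

/* Peak GFLOPS per core: FMAs per clock x 2 ops per FMA x clock in GHz. */
static double peak_gflops(unsigned fmas_per_clock, double clock_ghz) {
    return 2.0 * (double)fmas_per_clock * clock_ghz;
}
/* At a hypothetical 0.85 GHz: 128 fp32 FMAs/clock -> ~217.6 GFLOPS,
 * and 256 fp16 FMAs/clock -> ~435.2 GFLOPS per core. */
```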
We are seeing more shaders using indirect loads from uniform buffers and storage buffers, for example indexing properties by material ID. This generation of hardware adds a new optimized load path for dynamically warp-uniform loads and warp-sequential loads. This path gives up to 4 times faster memory access for these common access patterns.
We have also made improvements on the texturing path for cubemap lookups, and for textureLod() lookups with a uniform level-of-detail, with up to 2x performance for both. We have seen this change give a large performance boost for content using small cubemaps to store light probe data.
In addition to the major features, there have been several smaller improvements to help performance and developer experience. These have been prioritized based on issues we have seen in the field, and using feedback from developers and our ecosystem team. Your feedback really does have an impact, please continue to share your thoughts and any problems you encounter.
We continue to see steady adoption of High Dynamic Range (HDR) rendering, where content renders into floating-point attachments for later processing and tone mapping. This generation of GPUs introduces hardware-based blending and multi-sample resolve for fp16 formats, complementing the AFBC compression support that was added last year in Mali-G710.
Draw calls with a high volume of geometry with simple shaders could end up limited by tiler binning performance. This generation of hardware supports a new tiler that is three times faster, allowing us to complete the vertex processing phase more quickly once a position result is available for binning to use.
Shader programs that use ZS attachments, and that also rely on late-ZS updates or shader access to ZS attachments, could see a high number of stalls due to false dependencies. Shader ZS dependency management has been improved, and this content will see significantly better core utilization during fragment shading.
Rasterization based on traditional sample points inside a pixel can leave data holes for some techniques when a triangle hits a pixel but misses all the sample points inside it. This can result in incorrect results when using rasterization to drive techniques that require gap-free coverage, such as light tiling in clustered renderers, or voxelization. This generation of hardware gains support for Vulkan conservative rasterization (VK_EXT_conservative_rasterization), which provides two new trigger conditions for shading. The overestimation mode shades the entire pixel if the primitive intersects the pixel. The underestimation mode shades the pixel only if the pixel is fully covered by the primitive.
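The two trigger conditions can be summarized by when a pixel generates shading work, given the fraction of the pixel's area the primitive covers. This is a simplified model that ignores sample positions:

```c
#include <assert.h>

/* coverage: fraction of the pixel's area covered by the primitive, 0..1. */
static int shades_overestimate(double coverage)  { return coverage > 0.0; }
static int shades_underestimate(double coverage) { return coverage >= 1.0; }
/* A sliver covering 10% of a pixel is shaded under overestimation but not
 * underestimation; standard rasterization would depend on whether any
 * sample point happened to fall inside the sliver. */
```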
Developers know that optimizing their shaders to use fp16 data types is important for mobile. For Arm GPUs, using lower precision allows more variables to be stored in registers concurrently and enables vec2 SIMD arithmetic operations. Historically the only way to check shaders for efficient use of fp16 has been to use the Mali Offline Compiler. This hardware generation introduces a new hardware performance counter, which allows you to directly measure the number of narrow arithmetic operations executed. This is reported in our Streamline profiler reports.
This blog has given a brief overview of the new features, and some of the best practice highlights to get the most performance out of them. For more detailed advice, our Mali Best Practices Developer Guide has been updated to include all the recommendations for the Immortalis-G715 generation of hardware.
If you are at the Games Developer Conference this year, come and see us in the expo. We can be found at South Hall #S550, where you can find out more about Arm GPUs and our Arm Mobile Studio developer tools. You can see the Arm Immortalis-G715 in action, with multiple demos running on the vivo X90 Pro smartphone, powered by the MediaTek Dimensity 9200.