The Mali GPU: An Abstract Machine, Part 4 - The Bifrost Shader Core

August 26, 2016

9 minute read time.

We have recently announced the first GPU in the Mali Bifrost architecture family, the Mali-G71. While the overall rendering model it implements is similar to previous Mali GPUs – the Bifrost family is still a deeply pipelined tile-based renderer (see the first two blogs in this series The Mali GPU: An Abstract Machine, Part 1 - Frame Pipelining and The Mali GPU: An Abstract Machine, Part 2 - Tile-based Rendering for more information) – there are sufficient changes in the programmable shader core to require a follow up to the original "Abstract Machine" blog series.

In this blog, I introduce the block-level architecture of a stereotypical Bifrost shader core, and explain what performance expectations application developers should have of the hardware when it comes to content optimization and understanding the hardware performance counters exposed via tools such as DS-5Streamline. This blog assumes you have read the first two parts in the series, so I would recommend starting with those if you have not read them already.

GPU Architecture

The top-level architecture of a Bifrost GPU is the same as the earlier Midgard GPUs.

The Shader Cores

Like Midgard, Bifrost is a unified shader core architecture, meaning that only a single class of shader core which is capable of executing all types of shader programs and compute kernels exists in the design.

The exact number of shader cores present in a particular silicon chip varies; our partners can choose how many shader cores they implement based on their performance needs and silicon area constraints. The Mali-G71 GPU can scale from a single core for low-end devices all the way up to 32 cores for the highest performance designs.

Work Dispatch

The graphics work for the GPU is queued in a pair of queues, one for vertex/tiling/compute workloads and one for fragment workloads, with all work for one render target being submitted as a single submission into each queue.

The workload in each queue is broken into smaller pieces and dynamically distributed across all of the shader cores in the GPU, or in the case of tiling workloads to a fixed function tiling unit. Workloads from both queues can be processed by a shader core at the same time; for example, vertex processing and fragment processing for different render targets can be running in parallel (see the first blog for more details on this pipelining methodology).

Level 2 Cache and Memory Bandwidth

The processing units in the system share a level 2 cache to improve performance and to reduce memory bandwidth caused by repeated data fetches. The size of the L2 cache is configurable by our silicon partners depending on their requirements, but is typically 64KB per shader core in the GPU.

The number of bus ports out of the GPU to main memory, and hence the available memory bandwidth, depends on the number of shader cores implemented. In general we aim to be able to write one 32-bit pixel per core per clock, so it would be reasonable to expect an 8-core design to have a total of 256-bits of memory bandwidth (for both read and write) per clock cycle. The maximum number of AXI ports has been increased over Midgard allowing larger configurations with more than 12 cores to access a higher peak-bandwidth per clock if the downstream memory system can support it.

Note that the available memory bandwidth depends on both the GPU (frequency, AXI port width) and the downstream memory system (frequency, AXI data width, AXI latency). In many designs the AXI clock will be lower than the GPU clock, so not all of the theoretical bandwidth of the GPU is actually available to applications.

The Bifrost Shader Core

All Mali shader cores are structured as a number of fixed-function hardware blocks wrapped around a programmable core. The programmable core is the largest area of change in the Bifrost GPU family, with a number of significant changes over the Midgard "Tripipe" design discussed in the previous blog in this series:

The Bifrost programmable Execution Core consists of one or more Execution Engines – three in the case of the Mali-G71 – and a number of shared data processing units, all linked by a messaging fabric.

The Execution Engines

The Execution Engines are responsible for actually executing the programmable shader instructions, each including a single composite arithmetic processing pipeline as well as all of the required thread state for the threads that the execution engine is processing.

The Execution Engines: Arithmetic Processing

The arithmetic units in Bifrost implement a quad-vectorization scheme to improve functional unit utilization. Threads are grouped into bundles of four, called a quad, and each quad fills the width of a 128-bit data processing unit. From the point of view of a single thread this architecture looks like a stream of scalar 32-bit operations, which makes achieving high utilization of the hardware a relative straight forward task for the shader compiler. The example below shows how a vec3 arithmetic operation may map onto a pure SIMD unit (pipeline executes one thread per clock):

... vs a quad-based unit (pipeline executes one lane per thread for four threads per clock):

The advantages in terms of the ability to keep the hardware units full of useful work, irrespective of the vector length in the program, is clearly highlighted by these diagrams. The power efficiency and performance provided by the narrower than 32-bit types is still critically important for mobile devices, so Bifrost maintains native support for int8, int16, and fp16 data types which can be packed to fill the 128-bit data width of the data unit. A single 128-bit maths unit can therefore perform 8x fp16/int16 operations per clock cycle, or 16x int8 operations per clock cycle.

The Execution Engines: Thread State

To improve performance and performance scalability for complex programs, Bifrost implements a substantially larger general-purpose register file for the shader programs to use. The Mali-G71 provides 64x 32-bit registers while still allowing the maximum thread occupancy of the GPU, removing the earlier trade off between thread count and register file usage described in this blog: ARM Mali Compute Architecture Fundamentals.

The size of the fast constant storage, used for storing OpenGL ES uniforms and Vulkan push constants, has also been increased which reduces cache-access pressure for programs using lots of constant storage.

Data Processing Unit: Load/Store Unit

The load/store unit handles all general purpose (non-texture) memory accesses, including vertex attribute fetch, varying fetch, buffer accesses, and thread stack accesses. It includes 16KB L1 data cache per core, which is backed by the shared L2 cache.

The load/store cache can access a single 64-byte cache line per clock cycle, and accesses across a thread quad are optimized to reduce the number of unique cache access requests required. For example, if all four threads in the quad access data inside the same cache line that data can be returned in a single cycle.

Note that this load/store merging functionality can significantly accelerate many data access patterns found in common OpenCL compute kernels, which are commonly memory access limited, so maximizing its utility in algorithm design is a key optimization objective. It is also with noting that even though the Mali arithmetic units are scalar, the data access patterns will still benefit from well written vector loads, so we still recommend writing vectorized shader and kernel code whenever possible.

Data Processing Unit: Varying Unit

The varying unit is a dedicated fixed-function varying interpolator. It implements a similar optimization strategy to the programmable arithmetic units; it vectorizes interpolation across the thread quad to ensure good functional unit utilization, and includes support for faster fp16 optimization.

The unit can interpolate 128-bits per quad per clock; e.g. interpolating a mediump (fp16) vec4 would take two cycles per four thread quad. Optimization to minimize varying value vector length, and aggressive use of fp16 rather than fp32 can therefore improve application performance.

Data Processing Unit: ZS/Blend

The ZS and Blend unit is responsible for handling all accesses to the tile-memory, both for built-in OpenGL ES operations such as depth/stencil testing and color blending, as well as programmatic access to the tile buffer needed for functionality such as:

https://www.khronos.org/registry/gles/extensions/EXT/EXT_shader_pixel_local_storage.txt
https://www.khronos.org/registry/gles/extensions/ARM/ARM_shader_framebuffer_fetch.txt
and the merged sub-pass functionality in Vulkan.

Unlike the earlier Midgard designs, where the LS Pipe was a monolithic pipeline handling load/store cache access, varying interpolation, and tile-buffer accesses, Bifrost has implemented three smaller and more efficient parallel data units. This means that tile-buffer access can run in parallel to varying interpolation, for example. Graphics algorithms making use of programmatic tile buffer access, which all tended to be very LS Pipe heavy on Midgard, should see a measurable reduction in contention for processing resources.

Data Processing Unit: Texture Unit

The texture unit implements all texture memory accesses. It includes 16KB L1 data cache per core, which is backed by the shared L2 cache. The architecture performance of this block in Mali-G71 is the same as the earlier Midgard GPUs; it can return one bilinear filtered (GL_LINEAR_MIPMAP_NEAREST) texel per clock. For example interpolating a bilinear texture lookup for each thread in a four thread quad would take four cycles.

Some texture access modes require multiple cycles to generate data:

Trilinear filtering (GL_LINEAR_MIPMAP_LINEAR) requires two bilinear samples per texel and so requires two cycles per texel.
Volumetric 3D textures require twice the number of cycles than a 2D texture would require; e.g. trilinear filtered 3D textures would take 4 cycles, bilinear filtered 3D textures would take 2 cycles.
Wide type texture formats (16-bits or more per color channel) may require multiple cycles per pixel.

One exception to the wide format rule, which is a new optimization in Bifrost, is depth texture sampling. Sampling from DEPTH_COMPONENT16 or DEPTH_COMPONENT24 textures, which is commonly needed for both shadow mapping techniques and deferred lighting algorithms, has been optimized and is now a single cycle lookup, doubling the performance relative to GPUs in the Midgard family.

The Bifrost Geometry Flow

In addition to the shader core change, Bifrost introduces a new Index-Driven Vertex Shading (IDVS) geometry processing pipeline. Earlier Mali GPUs processed all of the vertex shading before tiling, often resulting in wasted computation and bandwidth related to the varyings which only related to culled triangles (e.g. outside of the frustum, or failing a facing test).

The IDVS pipeline splits the vertex shader into two halves; one processing the position, and one processing the remaining varyings.

This flow provides two significant optimizations:

The index buffer is read first, and vertex shading is only submitted for small batches of vertices where at least one vertex in each batch is referenced by the index buffer. This allows vertex shading to jump spatial gaps in the index buffer.
Varying shading is only submitted for primitives with survive the clip-and-cull phase; this removes a significant amount of redundant computation and bandwidth for vertices contributing only to triangles which are culled.

To get the most benefit from the Bifrost geometry flow is it useful to deinterleave packed vertex buffers partially; place attributes contributing to position in one packed buffer, and attributes contributing to non-position varyings in a second packed buffer. This means that the non-position varyings are not pulled into the cache for vertices which are culled and never contribute to an on-screen primitive. My colleague stacysmith has written a good blog on optimizing buffer packing to exploit this type geometry processing pipeline.

Performance Counters

Like the earlier Midgard GPUs, Bifrost hardware supports a large number of performance counters to enable application developers to profile and optimize their applications. See more detail on the performance counters available to application developers for the Bifrost architecture.

Comments and questions welcomed as always,

Cheers,

Pete

Parents

Sean Lumly over 7 years ago

Hi Pete!
Drop the first "t" - it's Bifrost
That is embarrassing... Thanks for pointing out the error in spelling!
On any current ARM CPU and GPU cache lines have been 64 bytes, and have been for quite a while (not aware of any with 128 bytes). This is predominantly because AXI uses 64 bytes as it's granule size for the cache coherency protocols, so it's a natural fit with that.
This is very good to know. I recall a previous post that mentioned that ASTC's 128-bit blocks fit inside of cache line. I suspect the author was trying to communicate that ASTCs blocks aligned with caches perfectly. Or my memory could simply be wrong
Vertex shaders are a bit special because they are the only means to feed geometry into the rendering pipeline; compute shaders could do the maths, but you'd still need a vertex shader to "memcopy" the results into the pipeline, so I suspect that would be a non-starter for anything non-trivial....
Understood! But I was not trying to replace vertex/fragment shaders with compute shaders, merely wondering how fast the execution time for a piece of general code (minus fixed function contributions) would be relative to the other on the new architecture. Is there a disparity in code execution for different workloads? For example, I would imagine that scheduling pixels to be a fairly well known problem, which may give a fragment workload an efficiency advantage over a compute workload..
No change here since the later Midgard cores; it's a single logical L2 shared by all shader cores.
That is very good to know, and is quite a relief. The block diagram on the G71 GPU product page features optional segmented L2 blocks denoted by dotted lines, and the "Specification" tab lists that L2 is available in "1-4 slices" which is where my concern stemmed from.
Hard to give a precise answer here on a public forum; we can't discuss micro-architecture internals in any detail. We do expect some operations to be faster than Midgard (fp16 trancendentals in particular), the rest should be "similar" to Mali-T880 for Mali-G71.
Ahh.. My sincere apologies for mindlessly treading into restricted territory!
As always, a very informative and accessible post! I really do get a kick out of getting a [very] high level understanding of some of the features of the architecture. While much of this information doesn't have bearing on my actual work, it is nonetheless helpful in sharpening my intuition, and is incredibly interesting!
Thanks!
Sean
- Cancel
- Up 0 Down
- Reply
- More
- Cancel

Comment

Sean Lumly over 7 years ago

Hi Pete!
Drop the first "t" - it's Bifrost
That is embarrassing... Thanks for pointing out the error in spelling!
On any current ARM CPU and GPU cache lines have been 64 bytes, and have been for quite a while (not aware of any with 128 bytes). This is predominantly because AXI uses 64 bytes as it's granule size for the cache coherency protocols, so it's a natural fit with that.
This is very good to know. I recall a previous post that mentioned that ASTC's 128-bit blocks fit inside of cache line. I suspect the author was trying to communicate that ASTCs blocks aligned with caches perfectly. Or my memory could simply be wrong
Vertex shaders are a bit special because they are the only means to feed geometry into the rendering pipeline; compute shaders could do the maths, but you'd still need a vertex shader to "memcopy" the results into the pipeline, so I suspect that would be a non-starter for anything non-trivial....
Understood! But I was not trying to replace vertex/fragment shaders with compute shaders, merely wondering how fast the execution time for a piece of general code (minus fixed function contributions) would be relative to the other on the new architecture. Is there a disparity in code execution for different workloads? For example, I would imagine that scheduling pixels to be a fairly well known problem, which may give a fragment workload an efficiency advantage over a compute workload..
No change here since the later Midgard cores; it's a single logical L2 shared by all shader cores.
That is very good to know, and is quite a relief. The block diagram on the G71 GPU product page features optional segmented L2 blocks denoted by dotted lines, and the "Specification" tab lists that L2 is available in "1-4 slices" which is where my concern stemmed from.
Hard to give a precise answer here on a public forum; we can't discuss micro-architecture internals in any detail. We do expect some operations to be faster than Midgard (fp16 trancendentals in particular), the rest should be "similar" to Mali-T880 for Mali-G71.
Ahh.. My sincere apologies for mindlessly treading into restricted territory!
As always, a very informative and accessible post! I really do get a kick out of getting a [very] high level understanding of some of the features of the architecture. While much of this information doesn't have bearing on my actual work, it is nonetheless helpful in sharpening my intuition, and is incredibly interesting!
Thanks!
Sean
- Cancel
- Up 0 Down
- Reply
- More
- Cancel

Children

No Data

Graphics, Gaming, and VR blog

Coming soon in Arm Frame Advisor

Julie Gaskin

Read about our vision for future feature enhancements in Frame Advisor. We have listened to your feedback and plan to extend the kinds of analyses you can perform. Help us to create more great features…
- March 13, 2024
Using the new custom reporting features in Performance Advisor

Connor Brookes

Explaining the new custom reporting features in Performance Advisor and how to use them.
- March 4, 2024
Beyond Mobile: Arm Mobile Studio is now Arm Performance Studio

Julie Gaskin

We are proud to announce that the latest version of our profiling tool suite for mobile is now available to download and use for free. In this release, we have a few changes to tell you about.
- February 26, 2024

AI and ML blog

Announcements

Architectures and Processors blog

Automotive blog

Embedded blog

Graphics, Gaming, and VR blog

High Performance Computing (HPC) blog

Infrastructure Solutions blog

Internet of Things (IoT) blog

Operating Systems blog

SoC Design and Simulation blog

Tools, Software and IDEs blog