We have recently announced the first GPU in the Mali Bifrost architecture family, the Mali-G71. While the overall rendering model it implements is similar to previous Mali GPUs – the Bifrost family is still a deeply pipelined tile-based renderer (see the first two blogs in this series The Mali GPU: An Abstract Machine, Part 1 - Frame Pipelining and The Mali GPU: An Abstract Machine, Part 2 - Tile-based Rendering for more information) – there are sufficient changes in the programmable shader core to require a follow up to the original "Abstract Machine" blog series.
In this blog, I introduce the block-level architecture of a stereotypical Bifrost shader core, and explain what performance expectations application developers should have of the hardware when it comes to content optimization and understanding the hardware performance counters exposed via tools such as DS-5® Streamline. This blog assumes you have read the first two parts in the series, so I would recommend starting with those if you have not read them already.
The top-level architecture of a Bifrost GPU is the same as the earlier Midgard GPUs.
Like Midgard, Bifrost is a unified shader core architecture, meaning that the design contains a single class of shader core that is capable of executing all types of shader programs and compute kernels.
The exact number of shader cores present in a particular silicon chip varies; our partners can choose how many shader cores they implement based on their performance needs and silicon area constraints. The Mali-G71 GPU can scale from a single core for low-end devices all the way up to 32 cores for the highest performance designs.
The graphics work for the GPU is queued in a pair of queues, one for vertex/tiling/compute workloads and one for fragment workloads, with all work for one render target being submitted as a single submission into each queue.
The workload in each queue is broken into smaller pieces and dynamically distributed across all of the shader cores in the GPU, or in the case of tiling workloads to a fixed function tiling unit. Workloads from both queues can be processed by a shader core at the same time; for example, vertex processing and fragment processing for different render targets can be running in parallel (see the first blog for more details on this pipelining methodology).
The processing units in the system share a level 2 cache to improve performance and to reduce memory bandwidth caused by repeated data fetches. The size of the L2 cache is configurable by our silicon partners depending on their requirements, but is typically 64KB per shader core in the GPU.
The number of bus ports out of the GPU to main memory, and hence the available memory bandwidth, depends on the number of shader cores implemented. In general we aim to be able to write one 32-bit pixel per core per clock, so it would be reasonable to expect an 8-core design to have a total of 256 bits of memory bandwidth (for both read and write) per clock cycle. The maximum number of AXI ports has been increased over Midgard, allowing larger configurations with more than 12 cores to access a higher peak bandwidth per clock if the downstream memory system can support it.
Note that the available memory bandwidth depends on both the GPU (frequency, AXI port width) and the downstream memory system (frequency, AXI data width, AXI latency). In many designs the AXI clock will be lower than the GPU clock, so not all of the theoretical bandwidth of the GPU is actually available to applications.
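As a back-of-the-envelope illustration of the figures above, the one-pixel-per-core-per-clock rule of thumb can be expressed as a small calculator. This is only a sketch of the arithmetic stated in the text; the function names and the 850 MHz example clock are hypothetical, not product figures, and, as noted, the real achievable bandwidth also depends on the downstream memory system.

```python
def peak_write_bits_per_clock(num_cores, bits_per_pixel=32):
    """Theoretical per-clock write width: one 32-bit pixel per core per clock."""
    return num_cores * bits_per_pixel

def bandwidth_gbs(num_cores, gpu_mhz, bits_per_pixel=32):
    """Convert to GB/s, optimistically assuming the memory system keeps pace
    with the GPU clock (in practice the AXI clock is often slower)."""
    bits = peak_write_bits_per_clock(num_cores, bits_per_pixel)
    return bits / 8 * gpu_mhz * 1e6 / 1e9

# The 8-core example from the text: 256 bits (32 bytes) per clock
assert peak_write_bits_per_clock(8) == 256
```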
All Mali shader cores are structured as a number of fixed-function hardware blocks wrapped around a programmable core. The programmable core is the largest area of change in the Bifrost GPU family, with a number of significant changes over the Midgard "Tripipe" design discussed in the previous blog in this series:
The Bifrost programmable Execution Core consists of one or more Execution Engines – three in the case of the Mali-G71 – and a number of shared data processing units, all linked by a messaging fabric.
The Execution Engines are responsible for actually executing the programmable shader instructions, each including a single composite arithmetic processing pipeline as well as all of the required thread state for the threads that the execution engine is processing.
The arithmetic units in Bifrost implement a quad-vectorization scheme to improve functional unit utilization. Threads are grouped into bundles of four, called a quad, and each quad fills the width of a 128-bit data processing unit. From the point of view of a single thread this architecture looks like a stream of scalar 32-bit operations, which makes achieving high utilization of the hardware a relatively straightforward task for the shader compiler. The example below shows how a vec3 arithmetic operation may map onto a pure SIMD unit (pipeline executes one thread per clock):
... vs a quad-based unit (pipeline executes one lane per thread for four threads per clock):
The advantage in terms of keeping the hardware units full of useful work, irrespective of the vector length in the program, is clearly highlighted by these diagrams. The power efficiency and performance provided by narrower-than-32-bit types is still critically important for mobile devices, so Bifrost maintains native support for int8, int16, and fp16 data types, which can be packed to fill the 128-bit width of the data unit. A single 128-bit maths unit can therefore perform 8x fp16/int16 operations per clock cycle, or 16x int8 operations per clock cycle.
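The packing arithmetic above is simple enough to capture in a couple of lines. A minimal sketch (the function name is illustrative, not a real API) of how many lanes of each type the 128-bit data unit can process per clock:

```python
def ops_per_clock(type_width_bits, unit_width_bits=128):
    """Lanes of a given type that fit in the 128-bit data unit per clock."""
    return unit_width_bits // type_width_bits

assert ops_per_clock(32) == 4    # fp32 / int32
assert ops_per_clock(16) == 8    # fp16 / int16, as stated above
assert ops_per_clock(8) == 16    # int8, as stated above
```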
To improve performance and performance scalability for complex programs, Bifrost implements a substantially larger general-purpose register file for the shader programs to use. The Mali-G71 provides 64x 32-bit registers while still allowing the maximum thread occupancy of the GPU, removing the earlier trade-off between thread count and register file usage described in this blog: ARM Mali Compute Architecture Fundamentals.
The size of the fast constant storage, used for storing OpenGL ES uniforms and Vulkan push constants, has also been increased which reduces cache-access pressure for programs using lots of constant storage.
The load/store unit handles all general-purpose (non-texture) memory accesses, including vertex attribute fetch, varying fetch, buffer accesses, and thread stack accesses. It includes a 16KB L1 data cache per core, which is backed by the shared L2 cache.
The load/store cache can access a single 64-byte cache line per clock cycle, and accesses across a thread quad are optimized to reduce the number of unique cache access requests required. For example, if all four threads in the quad access data inside the same cache line that data can be returned in a single cycle.
Note that this load/store merging functionality can significantly accelerate many data access patterns found in common OpenCL compute kernels, which are commonly memory-access limited, so maximizing its utility in algorithm design is a key optimization objective. It is also worth noting that even though the Mali arithmetic units are scalar, data access patterns still benefit from well-written vector loads, so we still recommend writing vectorized shader and kernel code whenever possible.
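A toy model may help illustrate the merging behaviour. The sketch below (hypothetical, not the real hardware algorithm) simply counts how many unique 64-byte cache lines a quad's four byte addresses touch; fewer unique lines means fewer cache access cycles:

```python
CACHE_LINE_BYTES = 64

def coalesced_accesses(addresses):
    """Number of unique 64-byte cache lines touched by one quad's accesses."""
    return len({addr // CACHE_LINE_BYTES for addr in addresses})

# Four threads reading consecutive 16-byte vec4s from one line: 1 access
assert coalesced_accesses([0, 16, 32, 48]) == 1
# Strided access, one element every 256 bytes: 4 separate accesses
assert coalesced_accesses([0, 256, 512, 768]) == 4
```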
The varying unit is a dedicated fixed-function varying interpolator. It implements a similar optimization strategy to the programmable arithmetic units; it vectorizes interpolation across the thread quad to ensure good functional unit utilization, and includes support for faster fp16 interpolation.
The unit can interpolate 128 bits per quad per clock; e.g. interpolating a mediump (fp16) vec4 would take two cycles per four-thread quad. Minimizing varying vector length, and aggressively using fp16 rather than fp32, can therefore improve application performance.
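Assuming the 128-bits-per-quad-per-clock rate above, a small estimator (illustrative only; the function name is hypothetical) can predict the interpolation cost of a single varying:

```python
import math

def varying_cycles_per_quad(components, bits_per_component):
    """Cycles for the varying unit to interpolate one varying for a
    four-thread quad, at 128 interpolated bits per quad per clock."""
    total_bits = components * bits_per_component * 4  # four threads per quad
    return math.ceil(total_bits / 128)

assert varying_cycles_per_quad(4, 16) == 2  # mediump vec4, as in the text
assert varying_cycles_per_quad(4, 32) == 4  # highp vec4 costs twice as much
assert varying_cycles_per_quad(2, 16) == 1  # mediump vec2 fits in one cycle
```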
The ZS and Blend unit is responsible for handling all accesses to the tile-memory, both for built-in OpenGL ES operations such as depth/stencil testing and color blending, as well as programmatic access to the tile buffer needed for functionality such as:
Unlike the earlier Midgard designs, where the LS Pipe was a monolithic pipeline handling load/store cache access, varying interpolation, and tile-buffer accesses, Bifrost has implemented three smaller and more efficient parallel data units. This means that tile-buffer access can run in parallel to varying interpolation, for example. Graphics algorithms making use of programmatic tile buffer access, which all tended to be very LS Pipe heavy on Midgard, should see a measurable reduction in contention for processing resources.
The texture unit implements all texture memory accesses. It includes a 16KB L1 data cache per core, which is backed by the shared L2 cache. The architectural performance of this block in the Mali-G71 is the same as in the earlier Midgard GPUs; it can return one bilinear-filtered (GL_LINEAR_MIPMAP_NEAREST) texel per clock. For example, a bilinear texture lookup for each thread in a four-thread quad would take four cycles.
Some texture access modes require multiple cycles to generate data:
One exception to the wide format rule, which is a new optimization in Bifrost, is depth texture sampling. Sampling from DEPTH_COMPONENT16 or DEPTH_COMPONENT24 textures, which is commonly needed for both shadow mapping techniques and deferred lighting algorithms, has been optimized and is now a single cycle lookup, doubling the performance relative to GPUs in the Midgard family.
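Putting the bilinear numbers together, a rough per-quad cycle estimator might look like the sketch below. The two-cycles-per-texel figure used in the second example is an assumed placeholder, standing in for whichever filtering or wide-format modes need extra passes; only the one-bilinear-texel-per-clock baseline comes from the text.

```python
def texture_cycles_per_quad(threads=4, cycles_per_texel=1):
    """One bilinear-filtered texel per clock means a four-thread quad takes
    four cycles; multi-cycle access modes scale the cost linearly."""
    return threads * cycles_per_texel

assert texture_cycles_per_quad() == 4                    # bilinear, as in the text
assert texture_cycles_per_quad(cycles_per_texel=2) == 8  # assumed two-pass mode
```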
In addition to the shader core changes, Bifrost introduces a new Index-Driven Vertex Shading (IDVS) geometry processing pipeline. Earlier Mali GPUs performed all of the vertex shading before tiling, often resulting in wasted computation and bandwidth for varyings that related only to culled triangles (e.g. those outside the frustum, or failing a facing test).
The IDVS pipeline splits the vertex shader into two halves; one processing the position, and one processing the remaining varyings.
This flow provides two significant optimizations:
To get the most benefit from the Bifrost geometry flow it is useful to partially deinterleave packed vertex buffers: place attributes contributing to position in one packed buffer, and attributes contributing to non-position varyings in a second packed buffer. This means that the non-position varyings are not pulled into the cache for vertices which are culled and never contribute to an on-screen primitive. My colleague stacysmith has written a good blog on optimizing buffer packing to exploit this type of geometry processing pipeline here: Eats, Shoots and Interleaves.
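As an illustration of the deinterleaving advice (using a hypothetical vertex layout, and Python tuples in place of real GPU buffers for brevity), the idea is simply to split one interleaved stream into a position buffer and a varying buffer:

```python
# Hypothetical interleaved vertex layout: position (3 floats),
# normal (3 floats), uv (2 floats) per vertex.
interleaved = [
    (0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0),
    (1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0),
]

# Split into two packed streams, so the IDVS position pass never pulls
# non-position attributes into the cache for culled vertices.
positions = [v[0:3] for v in interleaved]  # buffer 0: position only
varyings = [v[3:8] for v in interleaved]   # buffer 1: normal + uv

assert positions[1] == (1.0, 0.0, 0.0)
assert varyings[0] == (0.0, 0.0, 1.0, 0.0, 0.0)
```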
Like the earlier Midgard GPUs, Bifrost hardware supports a large number of performance counters to enable application developers to profile and optimize their applications. More detail on the performance counters available to application developers for the Bifrost architecture can be found here:
Mali Bifrost Family Performance Counters
Comments and questions welcomed as always,
Pete Harris is the lead performance engineer for the Mali OpenGL ES driver team at ARM. He enjoys spending his time working on a whiteboard with other engineers to determine how to get the best performance out of combined hardware and software compute sub-systems.
I was pretty excited when I noticed this post!
The "Bifrost" architecture seems very interesting! I'm most intrigued by the changes made with regards to memory. For example, a 64b cacheline (vs. 128b) will ensure that caches are much more dense with useful data in cases of more random access. Having 1-cycle L1 cache access seems quite amazing, and having 64 registers is astounding for reducing cache pressure! I expect that many of the performance improvements of Bifrost GPUs will be directly attributed to the memory system.
Of course, that's not to say that the ALU blocks aren't a significant development either! I expect it's far easier to keep them full if up to 4 operations per block can be executed in a cycle, rather than needing to fill an entire line to avoid waste. I expect ALU performance will jump quite a bit! But I can appreciate the suggestion to vectorize operations -- I'm guessing that this will ensure a better distribution of math to memory access, and lead to a greater likelihood of filling the execution engine.
I have some questions (as usual), but I'll wait until I've done a bit of research to ensure that they are good ones!
EDIT: Turned Bitfrost to Bifrost
Ok, here are a few questions!
1) Do Vulkan compute threads execute on Bifrost with the same efficiency as vertex or fragment threads? Are there any disparities?
2) Can multiple components of a vector operation be scheduled across more than one lane of an execution engine in a single clock?
3) Are individual L2 cache blocks associated with a cluster of Bifrost cores? If so, what is a typical configuration? Or is a single L2 cache accessible by all cores and its size varies with respect to the number of implemented cores?
4) Have the cycle costs for complex operations (e.g. div, sqrt, trig, etc.) changed?
5) Does Bifrost still use a per-core round-robin to execute an associated list of threads?
Drop the first "t" - it's Bifrost
(From your first post) For example, a 64b cacheline (vs. 128b)
On any current ARM CPU and GPU cache lines have been 64 bytes, and have been for quite a while (not aware of any with 128 bytes). This is predominantly because AXI uses 64 bytes as its granule size for the cache coherency protocols, so it's a natural fit with that.
Vertex shaders are a bit special because they are the only means to feed geometry into the rendering pipeline; compute shaders could do the maths, but you'd still need a vertex shader to "memcopy" the results into the pipeline, so I suspect that would be a non-starter for anything non-trivial. (that said, compute shaders can do something smarter than geometry shaders and tessellation shaders for some use cases, but would then still need a simple vertex shader to feed the results of that into the pipeline).
On the fragment side of things, you do lose some things that you would get for free in fixed-function logic in fragment shaders, such as varying interpolation, but this is not new in Vulkan; OpenGL ES has exactly the same problem. As always there is some responsibility of developers to pick the correct tool for the job they are trying to do here - use the fixed function paths if you need the features they provide.
Each thread always sees 32-bit of data path processing capability, e.g. a scalar fp32 operation, a vec2 fp16 operation, or a vec4 int8. There is therefore still likely some benefit to writing vector shader code for the narrow types, although shorter vectors make life relatively easy for the compiler here.
No change here since the later Midgard cores; it's a single logical L2 shared by all shader cores.
Hard to give a precise answer here on a public forum; we can't discuss micro-architecture internals in any detail. We do expect some operations to be faster than Midgard (fp16 transcendentals in particular); the rest should be "similar" to Mali-T880 for Mali-G71.
At the 10,000 foot level, it's common to have threads blocked waiting on memory when they miss in the cache, and strict round-robin would hurt if you wasted a cycle trying to schedule a thread which could not run due to the data dependency. So to avoid that, there is some magic in the scheduling to try and keep things as busy as possible, but as above I can't really give any details here, sorry!
That is embarrassing... Thanks for pointing out the error in spelling!
On any current ARM CPU and GPU cache lines have been 64 bytes, and have been for quite a while (not aware of any with 128 bytes). This is predominantly because AXI uses 64 bytes as its granule size for the cache coherency protocols, so it's a natural fit with that.
This is very good to know. I recall a previous post that mentioned that ASTC's 128-bit blocks fit inside a cache line. I suspect the author was trying to communicate that ASTC's blocks aligned perfectly with cache lines. Or my memory could simply be wrong.
Vertex shaders are a bit special because they are the only means to feed geometry into the rendering pipeline; compute shaders could do the maths, but you'd still need a vertex shader to "memcopy" the results into the pipeline, so I suspect that would be a non-starter for anything non-trivial....
Understood! But I was not trying to replace vertex/fragment shaders with compute shaders, merely wondering how the execution speed of a piece of general code (minus fixed-function contributions) in one workload type compares to the others on the new architecture. Is there a disparity in code execution for different workloads? For example, I would imagine that scheduling pixels is a fairly well-known problem, which may give a fragment workload an efficiency advantage over a compute workload.
That is very good to know, and is quite a relief. The block diagram on the G71 GPU product page features optional segmented L2 blocks denoted by dotted lines, and the "Specification" tab lists that L2 is available in "1-4 slices" which is where my concern stemmed from.
Ahh.. My sincere apologies for mindlessly treading into restricted territory!
As always, a very informative and accessible post! I really do get a kick out of getting a [very] high level understanding of some of the features of the architecture. While much of this information doesn't have bearing on my actual work, it is nonetheless helpful in sharpening my intuition, and is incredibly interesting!
For example, I would imagine that scheduling pixels is a fairly well-known problem, which may give a fragment workload an efficiency advantage over a compute workload.
The main one compute shaders suffer from is lack of good data locality in 2D or 3D data sets; you need to take extreme algorithmic care to feed them to the GPU in a manner which won't thrash the data caches or the TLBs.
All GPUs do addressing magic on texture data to automatically block-interleave the data on upload to avoid those problems, and this is transparent to the application, but compute shaders (OpenCL especially) tend to treat big data arrays in the same way a CPU does - simply big 1D/2D/3D flat arrays in memory, which can cause problems. For example, if you get a naive implementation of something like SGEMM then you end up with really nasty TLB thrashing effects, because you only read one matrix element from each "column-wise" input cache line and each MMU page before it is evicted. There are normally "nice things" developers can do to help - like matrix transposition on the column-wise input - which is not something the driver could do automatically, as proving it is safe is difficult (and sometimes impossible) with arbitrary input code.
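To make the transposition trick concrete, here is a minimal sketch (illustrative only, in Python rather than OpenCL C for brevity): pre-transposing the column-wise input once turns the strided column reads into contiguous row reads, which is exactly the locality fix described above.

```python
def matmul_transposed(a, b):
    """Naive matrix multiply, but walking a pre-transposed copy of B so both
    inner loops read rows (contiguous memory) instead of striding down a
    column of B one element per cache line."""
    n, k, m = len(a), len(b), len(b[0])
    bt = [[b[r][c] for r in range(k)] for c in range(m)]  # one-off transpose
    return [[sum(a[i][x] * bt[j][x] for x in range(k)) for j in range(m)]
            for i in range(n)]

a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
assert matmul_transposed(a, b) == [[19.0, 22.0], [43.0, 50.0]]
```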
The more generic things get (e.g. OpenCL has pointers) the harder it is for drivers to give locality tricks for free, so the downside of GPGPU is that it pushes some burden onto application developers as a price to pay for the awesome amount of computational power they get in return.
The block diagram on the G71 GPU product page features optional segmented L2 blocks denoted by dotted lines, and the "Specification" tab lists that L2 is available in "1-4 slices" which is where my concern stemmed from.
Yup; that's purely to give more bandwidth in designs with more cores - every slice can be accessed in parallel, and can access main memory in parallel (subject to main memory system design and bandwidth limits; that's out of our control). From a behavioural point of view all slices are part of a single logical cache which all shader cores can use and share data coherently across.
The main one compute shaders suffer from is lack of good data locality in 2D or 3D data sets; you need to take extreme algorithmic care to feed them to the GPU in a manner which won't thrash the data caches or the TLBs.......but compute shaders (OpenCL especially) tend to treat big data arrays in the same way a CPU does - simply big 1D/2D/3D flat arrays in memory, which can cause problems.
This precisely corroborates my understanding, and may be why, despite ample cycles and large computational resources, ray tracers on the GPU tend to be so limited in performance. Due to the large spatial divergence of cast rays (before the first intersection), the data set that must be considered during ray propagation is far too large for local caches on all but the simplest scenes, and likely causes large stalls where the GPU is waiting on external memory to feed it data. And it's not just the scene geometry that must be considered, but also the acceleration structure used to subdivide the scene, which can grow considerably as the scene increases in complexity, and the surface characteristics (e.g. normals, shaders) that dictate the ray reflection!
A better approach could involve moving rays along the acceleration structure in "batches" that exploit spatial locality, and by deferring intersection tests until all rays have been moved and surface intersection must be tested. Thus the core solution becomes grouping the rays by some bounding structure they are moving through (for example), or by the surface that they are intersecting with, making it much more likely that the requisite data will remain on-chip during ray operations.
The problem that I am working on is also fundamentally memory limited, and I've discovered how tricky it can be to organize data so as to make good use of caches and avoid stalls!