The Mali GPU: An Abstract Machine, Part 3 - The Midgard Shader Core

EDIT: Updated March 2015 to include more information on the GPU memory system to help developers optimize compute shaders.

In the first two blogs of this series I introduced the frame-level pipelining [The Mali GPU: An Abstract Machine, Part 1 - Frame Pipelining] and tile based rendering architecture [The Mali GPU: An Abstract Machine, Part 2 - Tile-based Rendering] used by the Mali GPUs, aiming to develop a mental model which developers can use to explain the behavior of the graphics stack when optimizing the performance of their applications.

In this blog I will finish the construction of this abstract machine, forming the final component: a stereotypical Mali "Midgard" GPU programmable core.  This blog assumes you have read the first two parts in the series, so I would recommend starting with those if you have not read them already.

GPU Architecture

The "Midgard" family of Mali GPUs (the Mali-T600 and Mali-T700 series) uses a unified shader core architecture, meaning that only a single type of shader core exists in the design. This single core can execute all types of programmable shader code, including vertex shaders, fragment shaders, and compute kernels.

The exact number of shader cores present in a particular silicon chip varies; our silicon partners can choose how many shader cores they implement based on their performance needs and silicon area constraints. The Mali-T760 GPU can scale from a single core for low-end devices all the way up to 16 cores for the highest performance designs, but between 4 and 8 cores are the most common implementations.

mali-top-level.png

The graphics work for the GPU is queued in a pair of queues, one for vertex/tiling workloads and one for fragment workloads, with all work for one render target being submitted as a single submission into each queue. Workloads from both queues can be processed by the GPU at the same time, so vertex processing and fragment processing for different render targets can be running in parallel (see the first blog for more details on this pipelining methodology). The workload for a single render target is broken into smaller pieces and distributed across all of the shader cores in the GPU, or, in the case of tiling workloads (see the second blog in this series for an overview of tiling), to a fixed-function tiling unit.

The shader cores in the system share a level 2 cache to improve performance, and to reduce memory bandwidth caused by repeated data fetches. Like the number of cores, the size of the L2 is configurable by our silicon partners, but is typically in the range of 32-64KB per shader core in the GPU depending on how much silicon area is available. The number and bus width of the memory ports this cache has to external memory is configurable, again allowing our partners to tune the implementation to meet their performance, power, and area needs. In general we aim to be able to write one 32-bit pixel per core per clock, so it would be reasonable to expect an 8-core design to have a total of 256-bits of memory bandwidth (for both read and write) per clock cycle.

The Midgard Shader Core

The Mali shader core is structured as a number of fixed-function hardware blocks wrapped around a programmable "tripipe" execution core. The fixed-function units perform the setup for a shader operation - such as rasterizing triangles or performing depth testing - or handle the post-shader activities - such as blending, or writing back a whole tile's worth of data at the end of rendering. The tripipe itself is the programmable part responsible for the execution of shader programs.

mali-top-core.png

The Tripipe

There are three classes of execution pipeline in the tripipe design: one handling arithmetic operations, one handling memory load/store and varying access, and one handling texture access. There is one load/store and one texture pipe per shader core, but the number of arithmetic pipelines can vary depending on which GPU you are using; most silicon shipping today will have two arithmetic pipelines, but GPU variants with up to four pipelines are also available.

Massively Multi-threaded Machine

Unlike a traditional CPU architecture, where you will typically only have a single thread of execution at a time on a single core, the tripipe is a massively multi-threaded processing engine. There may well be hundreds of hardware threads running at the same time in the tripipe, with one thread created for each vertex or fragment which is shaded. This large number of threads exists to hide memory latency; it doesn't matter if some threads are stalled waiting for memory, because as long as at least one thread is available to execute, we maintain efficient execution.
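The latency-hiding effect can be illustrated with a toy occupancy model. This is purely illustrative arithmetic (the formula and numbers are mine, not the real hardware scheduler):

```python
# Toy latency-hiding model: each thread issues one instruction, then
# waits `latency` cycles for memory before it can issue again. With
# `latency + 1` cycles per thread round-trip, the pipeline is fully
# busy once at least that many threads are in flight.
def utilisation(threads, latency):
    return min(1.0, threads / (latency + 1))

print(utilisation(4, 100))    # few threads: pipeline mostly idle
print(utilisation(256, 100))  # many threads: fully utilised
```

With only a handful of threads a 100-cycle memory latency leaves the pipeline idle almost all the time; with hundreds of threads the same latency is completely hidden.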

Arithmetic Pipeline: Vector Core

The arithmetic pipeline (A-pipe) is a SIMD (single instruction multiple data) vector processing engine, with arithmetic units which operate on 128-bit quad-word registers. The registers can be flexibly accessed as either 2 x FP64, 4 x FP32, 8 x FP16, 2 x int64, 4 x int32, 8 x int16, or 16 x int8. It is therefore possible for a single arithmetic vector task to operate on 8 "mediump" values in a single operation, and for OpenCL kernels operating on 8-bit luminance data to process 16 pixels per SIMD unit per clock cycle.
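The lane counts above follow directly from dividing the 128-bit register width by the data type width. A quick sketch of that arithmetic (plain division, not a hardware model):

```python
# How many SIMD lanes a 128-bit quad-word register provides per type.
REGISTER_BITS = 128

def lanes(type_bits):
    return REGISTER_BITS // type_bits

for name, bits in [("FP64/int64", 64), ("FP32/int32", 32),
                   ("FP16/int16", 16), ("int8", 8)]:
    print(f"{name:>10}: {lanes(bits)} lanes")
```

This is why a mediump (FP16) operation processes 8 values at once, and an int8 luminance kernel can touch 16 pixels per SIMD unit per clock.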

While I can't disclose the internal architecture of the arithmetic pipeline, our public performance data for each GPU can be used to give some idea of the number of maths units available. For example, the Mali-T760 with 16 cores is rated at 326 FP32 GFLOPS at 600MHz. This gives a total of 34 FP32 FLOPS per clock cycle for this shader core; it has two pipelines, so that's 17 FP32 FLOPS per pipeline per clock cycle. The available performance in terms of operations will increase for FP16/int16/int8 and decrease for FP64/int64 data types.
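The back-derivation above can be checked with a few lines of arithmetic, using only the public figures quoted:

```python
# Mali-T760 MP16 public rating: 326 FP32 GFLOPS at 600 MHz, with
# 16 cores and 2 arithmetic pipelines per core.
gflops, mhz, cores, pipes_per_core = 326, 600, 16, 2

flops_per_clock = gflops * 1e9 / (mhz * 1e6)  # whole GPU, per cycle
per_core = flops_per_clock / cores
per_pipe = per_core / pipes_per_core

print(round(per_core), round(per_pipe))  # → 34 17
```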

Texture Pipeline

The texture pipeline (T-pipe) is responsible for all memory access to do with textures. The texture pipeline can return one bilinear filtered texel per clock; trilinear filtering requires us to load samples from two different mipmaps in memory, so requires a second clock cycle to complete.

Load/Store Pipeline

The load/store pipeline (LS-pipe) is responsible for all memory accesses which are not related to texturing.  For graphics workloads this means reading attributes and writing varyings during vertex shading, and reading varyings during fragment shading. In general every instruction is a single memory access operation, although like the arithmetic pipeline they are vector operations and so could load an entire "highp" vec4 varying in a single instruction.
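A rough way to estimate LS-pipe instruction counts follows from the one-vector-access-per-instruction rule. This is a simplified cost model of my own (the real compiler may pack accesses differently):

```python
import math

# One LS instruction moves up to one 128-bit vector register, so a
# value wider than 128 bits needs multiple accesses.
def ls_cycles(components, bits_per_component=32, register_bits=128):
    return math.ceil(components * bits_per_component / register_bits)

print(ls_cycles(4))      # highp vec4 (4 x FP32 = 128 bits) → 1
print(ls_cycles(4, 16))  # mediump vec4 (64 bits) → 1
print(ls_cycles(9))      # mat3 as 9 x FP32 → 3
```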

Early ZS Testing and Late ZS Testing

In the OpenGL ES specification "fragment operations" - which include depth and stencil testing - happen at the end of the pipeline, after fragment shading has completed. This makes the specification very simple, but implies that you have to spend lots of time shading something, only to throw it away at the end of the frame if it turns out to be killed by ZS testing. Coloring fragments just to discard them would waste a huge amount of performance and energy, so where possible we will do ZS testing early (i.e. before fragment shading), only falling back to late ZS testing (i.e. after fragment shading) where it is unavoidable (e.g. a dependency on a fragment which may call "discard" and as such has an indeterminate depth state until it exits the tripipe).

In addition to the traditional early-z schemes, we also have some overdraw removal capability which can stop fragments which have already been rasterized from turning into real rendering work if they do not contribute to the output scene in a useful way. My colleague seanellis has a great blog looking at this technology - Killing Pixels - A New Optimization for Shading on ARM Mali GPUs - so I won't dive into any more detail here.

Memory System

This section is an after-the-fact addition to this blog, so if you have read this blog before and don't remember this section, don't worry, you're not going crazy. We have been getting a lot of questions from developers writing OpenCL kernels and OpenGL ES compute shaders asking for more information about the GPU cache structure, as it can be really beneficial to lay out data structures and buffers to optimize cache locality. The salient facts are:

  • Two 16KB L1 data caches per shader core; one for texture access and one for generic memory access.
  • A single logical L2 which is shared by all of the shader cores. The size of this is variable and can be configured by the silicon integrator, but is typically between 32 and 64 KB per instantiated shader core.
  • Both cache levels use 64 byte cache lines.
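For example, the 64-byte line size suggests padding each work item's record so that neighbouring work items never share (and fight over) a cache line. A minimal sketch of that layout arithmetic (the helper name is mine, not an API):

```python
# Pad a per-work-item record up to a whole number of 64-byte cache
# lines so adjacent records land on separate lines.
CACHE_LINE = 64

def padded_stride(record_bytes):
    # round up to the next multiple of the cache line size
    return -(-record_bytes // CACHE_LINE) * CACHE_LINE

print(padded_stride(40))  # → 64
print(padded_stride(72))  # → 128
```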

If you are new to optimizing algorithms for massively multi-threaded architectures I would heartily recommend the SGEMM matrix multiplication video on our Mali Developer portal here:

... as the overall system behavior can be very different to what you are used to if you are coming from a traditional CPU background.

GPU Limits

Based on this simple model it is possible to outline some of the fundamental properties underpinning the GPU performance.

  • The GPU can issue one vertex per shader core per clock
  • The GPU can issue one fragment per shader core per clock
  • The GPU can retire one pixel per shader core per clock
  • We can issue one instruction per pipe per clock, so for a typical shader core we can issue four instructions in parallel if we have them available to run
    • We can achieve 17 FP32 operations per A-pipe
    • One vector load, one vector store, or one vector varying per LS-pipe
    • One bilinear filtered texel per T-pipe
  • The GPU will typically have 32-bits of DDR access (read and write) per core per clock [configurable]

If we scale this to a Mali-T760 MP8 running at 600MHz we can calculate the theoretical peak performance as:

  • Fillrate:
    • 8 pixels per clock = 4.8 GPix/s
    • That's 2314 complete 1080p frames per second!
  • Texture rate:
    • 8 bilinear texels per clock = 4.8 GTex/s
    • That's 38 bilinear filtered texture lookups per pixel for 1080p @ 60 FPS!
  • Arithmetic rate:
    • 17 FP32 FLOPS per pipe × 2 pipes × 8 cores @ 600MHz = 163 FP32 GFLOPS
    • That's 1311 FLOPS per pixel for 1080p @ 60 FPS!
  • Bandwidth:
    • 256-bits of memory access per clock = 19.2GB/s read and write bandwidth [1].
    • That's 154 bytes per pixel for 1080p @ 60 FPS!
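These peak figures can be reproduced with some simple arithmetic:

```python
# Mali-T760 MP8 @ 600 MHz peak figures, as quoted above.
cores, mhz = 8, 600
clocks_per_s = mhz * 1e6
pixels_1080p_60 = 1920 * 1080 * 60

fill = cores * clocks_per_s            # 1 pixel per core per clock
texels = cores * clocks_per_s          # 1 bilinear texel per core per clock
flops = 17 * 2 * cores * clocks_per_s  # 17 FLOPS x 2 A-pipes per core
bandwidth = cores * 4 * clocks_per_s   # 32 bits = 4 bytes per core per clock

print(fill / 1e9)                   # 4.8 GPix/s
print(texels / pixels_1080p_60)     # ~38 texels per 1080p60 pixel
print(flops / 1e9)                  # 163.2 GFLOPS
print(bandwidth / 1e9)              # 19.2 GB/s
print(bandwidth / pixels_1080p_60)  # ~154 bytes per 1080p60 pixel
```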

OpenCL and Compute

The observant reader will have noted that I've talked a lot about vertices and fragments - the staple of graphics work - but have mentioned very little about how OpenCL and RenderScript compute threads come into being inside the core. Both of these types of work behave almost identically to vertex threads - you can view running a vertex shader over an array of vertices as a 1-dimensional compute problem. So the vertex thread creator also spawns compute threads, although more accurately I would say the compute thread creator also spawns vertices.

Performance Counters

A document explaining the Midgard family performance counters, which map onto the block architecture described in this blog, can be found here:

Next Time ...

This blog concludes the first chapter of this series, developing the abstract machine which defines the basic behaviors which an application developer should expect to see for a Mali GPU in the Midgard family. Over the rest of this series I'll start to put this new knowledge to work, investigating some common application development pitfalls, and useful optimization techniques, which can be identified and debugged using the Mali integration into the ARM DS-5 Streamline profiling tools.

EDIT: Next blog now available:

Comments and questions welcomed as always,

TTFN,

Pete

Footnotes

  1. ... 19.2GB/s subject to the ability of the rest of the memory system outside of the GPU to give us data this quickly. Like most features of an ARM-based chip, the down-stream memory system is highly configurable in order to allow different vendors to tune power, performance, and silicon area according to their needs. For most SoC parts the rest of the system will throttle the available bandwidth before the GPU runs out of the ability to request data. It is unlikely you would want to sustain this kind of bandwidth for prolonged periods, but short burst performance is important.

Pete Harris is the lead performance engineer for the Mali OpenGL ES driver team at ARM. He enjoys spending his time working on a whiteboard and determining how to get the best out of combined hardware and software compute sub-systems. He spends his working days thinking about how to make the ARM Mali drivers even better.

  • Thanks peterharris,

    As promised, I have come to brighten your day with a few questions! But first a bit of praise about the Mali GPU. I'm extremely impressed that the A-Pipe is flexible enough to do 64, 32, 16, and 8-bit vec4 workloads! I'm not sure if this flexibility is normal, but it's amazing to think that a T760MP16 running a mediump workload has a peak that is not only far beyond last-gen consoles, but is actually approaching the marketed performance of current-gen consoles. In addition to that, the memory bandwidth gains enabled by ASTC, TBDR, Transaction Elimination, AFBC, Pixel Local Storage, Forward Pixel Kill, etc., mean comparisons are no longer 1:1 -- I bet, when these are taken into account, the potential is there to move significantly more data than peak bandwidth would suggest. The performance over duration delta between outlet (mains) powered hardware vs. mobile seems to be shrinking thanks to the focus of mobile engineers on efficiency, rather than just a smaller process node and a larger die. I guess this is what ARM meant by a hyper-Moore's Law curve..

    Ok, onto the questions:

    1) Does the compiler attempt to pack the Vec4 as much as possible in the event that there are different same-precision operations that can be combined? Is it possible to mix unit precision in a single vec4 operation?

    2) Do the figures for doing 1 interpolated texel per core (1 T-Pipe per core), per clock apply equally to ETC2 and ASTC for a given amount of bandwidth? Assuming no caches, I would guess that ASTC needs more data considering that the block sizes are different (8 bytes vs 4 bytes for ETC2).

    3) Is the Mali T760 MP16 suitable for high performance smartphones or tablets at 28nm (~100mm2 chip)? Is its power and area consumption suitable for the current process node? If this is the case, there must have been a drastic reduction of wiring to accommodate these performance gains!

    4) Are OpenCL (or ES31 Compute Shader) computations roughly as efficient as those of a Vertex Shader? Could geometry transformation in compute yield similar results as OpenGL?

    5) What is the L2 cache latency?

    6) How well does the A-Pipe deal with SQRTs, DIVs, TRIG, etc that typically require a number of steps?

    Thank you for considering these. Please only answer what you can.

    Sean

  • 1) Does the compiler attempt to pack the vec4 as much as possible in the event that there are different same-precision operations that can be combined?

    Yes - packing code into a vec4/8/16 is essential to getting the best out of the architecture so the compiler tries very hard to get close to an optimal packing.

    Is it possible to mix unit precision in a single vec4 operation?

    No - the core SIMD units are like a classic SIMD block that you might find in NEON - all vector lanes perform an identical operation. However there is more than one vector unit in the hardware, and they are not all constrained to being the same precision.

    2) Do the figures for doing 1 interpolated texel per core (1 T-Pipe per core), per clock apply equally to ETC2 and ASTC for a given amount of bandwidth? Assuming no caches, I would guess that ASTC needs more data considering that the block sizes are different (8 bytes vs 4 bytes for ETC2).

    Yes.

    The most stressful case for the memory system is actually 2D UI - that's still commonly 32bpp uncompressed with a 1:1 pixel-to-texture mapping (and has simple shaders so they are short with a high density of texture accesses compared to gaming). Anything compressed and mipmapped is actually relatively light on both memory system and caches.

    3) Is the Mali T760 MP16 suitable for high performance smartphones or tablets at 28nm (~100mm2 chip)? Is its power and area consumption suitable for the current process node? If this is the case, there must have been a drastic reduction of wiring to accommodate these performance gains!

    It's technically feasible on some of the upcoming process nodes - the main questions are more economic than technical (it's a hefty piece of silicon if not all of your users are heavily into gaming on their smartphones). I can't comment on partner plans - so you'll have to wait and see what emerges on the market.

    4) Are OpenCL (or ES31 Compute Shader) computations roughly as efficient as those of a Vertex Shader? Could geometry transformation in compute yield similar results as OpenGL?

    From the shader pipeline point of view the two workloads are almost identical, but there is no means in OpenGL ES to not have a vertex shader, so if you swapped it out you'd end up needing a simple vertex shader anyway to feed the graphics pipeline in a format it could understand.

    5) What is the L2 cache latency?

    It's in the same ballpark as a CPU L2 - although the usual definition you would use on CPU has little meaning on a GPU - as described in the main article we are massively multi-threaded so we can have a lot of threads queued on memory misses (L2 or main memory) - and for the most part we can absorb a moderate number of cache misses (even a hundred cycles to main memory can be hidden completely in many cases) without losing performance as long as there are other threads available to execute.

    Cheers,
    Pete

  • Awesome. This was very helpful to me, and hopefully to someone else that had similar questions!

    Sean

  • Hi Peter Harris,

    you says

    "

    • We can achieve 17 FP32 operations per A-pipe

    "

    It is exactly the situation below:

    For Mali T604 and T628, peak performance is 17 FP32 FLOPS per ALU per cycle. http://malideveloper.arm.com/downloads/OpenCL_FAQ.pdf shows this is composed of:

    • 7: dot product (4 Muls, 3 adds)
    • 1: scalar add
    • 4: vec4 add
    • 4: vec4 multiply
    • 1: scalar multiply

    But according to my measurement, I can't process dot product and vec4 mad together. How can I achieve that 17 flops?

  • chrisvarns seems to have done a good job of answering it in your other question here:

    What is the GLops of Mali T628MP6? Can't get 17 flops per pipe using OpenGL.

    ... so I won't regurgitate it here. Hope that answers your question,

    Cheers,
    Pete

  • Do stack variables require a Load/Store cycle for access? Suppose I had a 256 element array, allocated as a local variable (chosen arbitrarily as something too large to fit in registers), would random access of each element require a Load/Store operation and as such, a 1-cycle penalty? Or is there a data-cache for non-heap, non-register data per GPU-core that reduces access penalty?

  • Yes, stack access is done via the load/store pipe (stack is really just another kind of generic buffer access). Whether it visibly costs you a cycle in terms of performance overhead depends on the content - if you are not limited by load/store pipeline (i.e. arithmetic or texturing is critical path) then you may not see any performance loss. Stack writes are allocated into the cache, so read backs a few cycles later don't necessarily need to transit via main memory.

    Cheers,
    Pete

  • Thanks! This is extremely helpful.

    It's fantastic that the load/store can pull in a large amount of data (i.e. 128 bits -- a highp vec4) in a single read, and so data partitioning is something very useful to consider, even for stack variables, when trying to minimize bandwidth and maximize performance.

    It seems that at 4K resolutions and high frame-rates (e.g. 60fps), cycles available per pixel become something of a scarcity. That isn't to say that many meaningful things can't be done, but that careful consideration must be given to design choices. Memory access under these circumstances already implies a significant hike in external bandwidth, but even cache-friendly, per-pixel accesses need to be minimized -- memory accesses should be carefully balanced with ALU-maximizing math where applicable.

  • It seems that at 4K resolutions and high frame-rates (eg. 60fps), cycles available per pixel become something of a scarcity.

    Indeed - although I'm not sure many people are expecting to do AAA title 3D gaming on their phone at 4k2k resolution. There is going to have to be some pragmatic middle ground between "quality per pixel" and "number of pixels" (both in terms of pure pixel count per frame and framerate). Console GPUs are really already at this point in the cost/benefit tradeoff, with many AAA titles choosing more cycles per pixel at 720p with 30 FPS updates, rather than 1080p60, let alone 4K60 - and they are attached to a mains power connection ...

    I think you're right about more traditional mobile gaming. VR specifically demands not only high resolutions but also rock-solid sub-16ms rendering. This case prioritizes "number of pixels" over the "quality per pixel" scenario that you speak of, and is one that is getting a tremendous amount of attention on mobile specifically, thanks to Samsung and their GearVR. This is an area of large interest to me personally!

    For all other games/3d-interactive-media, I fully agree with you, targeting something akin to 1080p30 is more than adequate, and in such a case modern mobiles like the incredible Mali T760MP8 powered Galaxy S6 (huge congrats, by the by -- that is a major win!) almost seem overkill for such a task, amounting to a whopping 90 cycles per pixel at the previously stated resolution/framerate! Of course, computer graphics is more about compromises than headroom. No matter how many cycles, or how much bandwidth, building a graphics app seems to be additive for a short time, and subtractive for the majority!

    While VR is bound to be a niche market for a while to come, I'm expecting that 4K will make it into handsets by the end of this year or the end of next at the latest. I know mobiles will be up to the task as they can already drive these resolutions for arguably simple content. But I'm excited to see what's next. I would also love to get more memory units rather than more ALUs, and bigger caches. I think I would be happy to see 32 texel fetches per cycle and 32 L/Ss. Too much? Maybe at 10nm?

  • I think I would be happy to see 32 texel fetches per cycle and 32 L/Ss. Too much? Maybe at 10nm?

    I'm sure it will happen at some point - the questions are "how quickly" and "what process node". The wider ARM industry ecosystem and our physical IP team keep doing amazing stuff in this area, so if the silicon processes can give the energy per operation improvements then I'm sure our GPU team can find interesting uses for the transistors .

  • Thanks peterharris for all the detailed info but I cannot see anywhere the "No. of mul-add" and "No. of mul units"

    I was just curious how powerful this GPU is in the Exynos 7420, which has a Mali-T760 MP8 clocked at 772 MHz rather than 700 MHz. The one clocked at 700 MHz has 34.3 Gflops per core, but what will it be when the clock is 772 MHz?

    I tried to find this out, and I needed the "No. of mul-add" and "No. of mul units" and also the "SIMD units".

    Can you please provide me the above details or else the Gflops?

    Also does the 14nm FinFET process changes anything?

    AnandTech reports that it uses 200mV to 300mV less power.

    The one clocked at 700 MHz has 34.3 Gflops per core but what will it be when the clock is 772 MHz?

    As the article states, we support 17 FP32 FLOPS per pipe per clock. Multiply up from there based on pipe count, core count, and frequency.

    Also does the 14nm FinFET process changes anything?

    It's the same GPU design regardless of the silicon process - so the same behaviour per core per clock. The energy efficiency improvements will allow more cores and/or more frequency in the same power budget.

    HTH,

    Pete

  • Thanks Peter for a very informative description of the Mali Architecture.

    I have a couple of doubts regarding OpenCL compute on the shader cores. First, a little background about our work: We are trying to execute highly data-parallel OpenCL kernels on an Exynos 5422 with the Mali T628 GPU. For some of the kernels, the memory bandwidth requirement is very large compared to the others we have in the benchmark set. I tried to alleviate this problem by manipulating the workgroup size for the OpenCL computation, on the hunch that the current workgroup size might not be letting the data get stored in the cache hierarchy. Luckily, for the set of applications that we are looking at, this seems to be the cause of the excessive memory transactions. Now, since we cannot exhaustively try to find the best workgroup size, the recommended trick is to leave the parameter NULL and let the OpenCL runtime figure it out. Unfortunately, the returned work group size still requires a significant amount of memory bandwidth.

    Therefore I am trying to find a way to determine this automatically by analyzing the kernel and the cache configuration of the Mali GPU. This is where I need help. The information available regarding this is a bit confusing. Could you please let me know the L1$ and L2$ size and configuration for the Mali T628, specifically the one implemented in Exynos 5422, if possible? I am relatively new to OpenCL and Mali, so please excuse me if I am missing something very obvious here.

    Another question is regarding the thread execution itself. I would like to confirm my understanding on this. I read in the Mali OpenCL developer guide and verified that the Mali T600 GPUs can handle a max of 256 threads in a workgroup. So, when I am executing 256 threads on the shader core (which is a compute unit for OpenCL), are these 256 threads being executed one after the other? Or is the Mali job manager taking these 256 threads and repackaging them to fit nicely into the 2 arithmetic and 1 LS pipelines? Assuming all the threads have absolutely no dependency amongst them, can we estimate how many threads the compiler will take at a time, to pack them as best as possible into the arithmetic and LS pipelines?

    Thank you very much for your time!

  • Unfortunately, the returned work group size still requires a significant amount of memory bandwidth.

    Why do you think memory bandwidth has anything to do with workgroup size? The total problem space is the same number of work items, irrespective of how you choose to slice it up into workgroups, so input and output bandwidth should be very similar in all cases.

    The information available regarding this is a bit confusing. Could you please let me know the L1$ and L2$ size and configuration for the Mali T628, specifically the one implemented in Exynos 5422, if possible?

    Two 16KB L1 caches (one texture and one other data) per shader core. L2 cache size varies - typically between 32 and 64KB per core in the system. For Exynos 5422 it has a 256KB L2.

    So, when I am executing 256 threads on the shader core (which is a compute unit for OpenCL), these 256 threads are being executed one after the other?

    It's a massively multi-threaded architecture - there are up to 256 threads live in the shader core, and the shader core hardware schedules eligible threads down each of the pipelines based on availability, so "lots" of different threads can be running concurrently in different pipelines. I can't say anything about how that actually works internally on the public forums, sorry.

    Cheers,

    Pete