EDIT: Updated March 2015 to include more information on the GPU memory system to help developers optimizing compute shaders.
In the first two blogs of this series I introduced the frame-level pipelining [The Mali GPU: An Abstract Machine, Part 1 - Frame Pipelining] and tile-based rendering architecture [The Mali GPU: An Abstract Machine, Part 2 - Tile-based Rendering] used by the Mali GPUs, aiming to develop a mental model which developers can use to explain the behavior of the graphics stack when optimizing the performance of their applications.
In this blog I will finish the construction of this abstract machine, forming the final component: a stereotypical Mali "Midgard" GPU programmable core. This blog assumes you have read the first two parts in the series, so I would recommend starting with those if you have not read them already.
The "Midgard" family of Mali GPUs (the Mali-T600, Mali-T700, and Mali-T800 series) uses a unified shader core architecture, meaning that only a single type of shader core exists in the design. This single core can execute all types of programmable shader code, including vertex shaders, fragment shaders, and compute kernels.
The exact number of shader cores present in a particular silicon chip varies; our silicon partners can choose how many shader cores they implement based on their performance needs and silicon area constraints. The Mali-T760 GPU can scale from a single core for low-end devices all the way up to 16 cores for the highest performance designs, but between 4 and 8 cores are the most common implementations.
The graphics work for the GPU is queued in a pair of queues, one for vertex/tiling workloads and one for fragment workloads, with all work for one render target being submitted as a single submission into each queue. Workloads from both queues can be processed by the GPU at the same time, so vertex processing and fragment processing for different render targets can be running in parallel (see the first blog for more details on this pipelining methodology). The workload for a single render target is broken into smaller pieces and distributed across all of the shader cores in the GPU or, in the case of tiling workloads (see the second blog in this series for an overview of tiling), sent to a fixed-function tiling unit.
The shader cores in the system share a level 2 cache to improve performance, and to reduce the memory bandwidth consumed by repeated data fetches. Like the number of cores, the size of the L2 is configurable by our silicon partners, but is typically in the range of 32-64KB per shader core in the GPU depending on how much silicon area is available. The number and bus width of the memory ports this cache has to external memory are also configurable, again allowing our partners to tune the implementation to meet their performance, power, and area needs. In general we aim to be able to write one 32-bit pixel per core per clock, so it would be reasonable to expect an 8-core design to have a total of 256 bits of memory bandwidth (for both read and write) per clock cycle.
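The sizing rules of thumb in the paragraph above can be turned into a few lines of arithmetic. This is only a sketch: the function names are made up for illustration, and the 32-64KB per core and one 32-bit pixel per core per clock figures are the typical values quoted above, not guarantees for any particular chip.

```python
def l2_cache_range_kb(shader_cores, kb_per_core_min=32, kb_per_core_max=64):
    """Return the (min, max) typical L2 size in KB for a given core count."""
    return shader_cores * kb_per_core_min, shader_cores * kb_per_core_max


def memory_bits_per_clock(shader_cores, bits_per_pixel=32):
    """Bus bits per clock needed to write one pixel per core per clock."""
    return shader_cores * bits_per_pixel


print(l2_cache_range_kb(8))      # (256, 512)
print(memory_bits_per_clock(8))  # 256
```

For an 8-core design this gives a typical L2 of 256-512KB and the 256 bits per clock mentioned above.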
The Mali shader core is structured as a number of fixed-function hardware blocks wrapped around a programmable "tripipe" execution core. The fixed-function units either perform the setup for a shader operation - such as rasterizing triangles or performing depth testing - or handle the post-shader activities - such as blending, or writing back a whole tile's worth of data at the end of rendering. The tripipe itself is the programmable part responsible for the execution of shader programs.
There are three classes of execution pipeline in the tripipe design: one handling arithmetic operations, one handling memory load/store and varying access, and one handling texture access. There is one load/store and one texture pipe per shader core, but the number of arithmetic pipelines can vary depending on which GPU you are using; most silicon shipping today will have two arithmetic pipelines, but the Mali-T880 has three.
Unlike a traditional CPU architecture, where you will typically only have a single thread of execution at a time on a single core, the tripipe is a massively multi-threaded processing engine. There may well be hundreds of hardware threads running at the same time in the tripipe, with one thread created for each vertex or fragment which is shaded. This large number of threads exists to hide memory latency; it doesn't matter if some threads are stalled waiting for memory, as long as at least one thread is available to execute then we maintain efficient execution.
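To get a feel for why so many threads are needed, here is a deliberately idealized toy model, not a description of the real hardware scheduler. It assumes every thread alternates between a fixed number of busy cycles and a fixed number of stall cycles, and that the pipeline can switch threads for free. The function name and the cycle counts are hypothetical.

```python
def pipeline_utilization(threads, compute_cycles, stall_cycles):
    """Fraction of cycles in which some thread is ready to issue (toy model)."""
    if threads <= 0:
        return 0.0
    per_thread = compute_cycles / (compute_cycles + stall_cycles)
    return min(1.0, threads * per_thread)


# Hypothetical workload: 4 cycles of work, then a 196-cycle memory stall.
for n in (1, 10, 50, 100):
    print(n, pipeline_utilization(n, 4, 196))
```

With these made-up numbers a single thread keeps the pipeline busy 2% of the time, while 50 or more threads are enough to keep it fully occupied in this model.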
The arithmetic pipeline (A-pipe) is a SIMD (single instruction multiple data) vector processing engine, with arithmetic units which operate on 128-bit quad-word registers. The registers can be flexibly accessed as either 2 x FP64, 4 x FP32, 8 x FP16, 2 x int64, 4 x int32, 8 x int16, or 16 x int8. It is therefore possible for a single arithmetic vector task to operate on 8 "mediump" values in a single operation, and for OpenCL kernels operating on 8-bit luminance data to process 16 pixels per SIMD unit per clock cycle.
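The lane counts listed above follow directly from dividing the 128-bit register width by the element size. A minimal sketch (the helper name is hypothetical):

```python
REGISTER_BITS = 128  # width of one quad-word register, as described above


def simd_lanes(element_bits):
    """Number of elements of the given width that fit in one register."""
    if element_bits <= 0 or REGISTER_BITS % element_bits != 0:
        raise ValueError("element width must evenly divide the register width")
    return REGISTER_BITS // element_bits


for name, bits in [("FP64", 64), ("FP32", 32), ("FP16", 16), ("int8", 8)]:
    print(name, simd_lanes(bits))
```

This prints 2, 4, 8, and 16 lanes respectively, matching the 8 "mediump" values and 16 8-bit pixels per operation mentioned above.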
While I can't disclose the internal architecture of the arithmetic pipeline, our public performance data for each GPU can be used to give some idea of the number of maths units available. For example, the Mali-T760 with 16 cores is rated at 326 FP32 GFLOPS at 600MHz. This gives a total of 34 FP32 FLOPS per clock cycle for this shader core; it has two pipelines, so that's 17 FP32 FLOPS per pipeline per clock cycle. The available performance in terms of operations will increase for FP16/int16/int8 and decrease for FP64/int64 data types.
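The per-core and per-pipeline figures above can be reproduced from the public rating. This sketch uses only the numbers quoted in the paragraph; the function name is hypothetical.

```python
def flops_per_core_per_clock(gflops, clock_mhz, cores):
    """Convert a whole-GPU GFLOPS rating into FLOPS per core per clock."""
    flops_per_clock = (gflops * 1e9) / (clock_mhz * 1e6)
    return flops_per_clock / cores


per_core = flops_per_core_per_clock(gflops=326, clock_mhz=600, cores=16)
per_pipe = per_core / 2  # two arithmetic pipelines per core on this GPU

print(round(per_core), round(per_pipe))  # 34 17
```

The unrounded result is just under 34 FLOPS per core per clock, so the 34 and 17 figures are rounded values.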
The texture pipeline (T-pipe) is responsible for all memory access to do with textures. The texture pipeline can return one bilinear filtered texel per clock; trilinear filtering requires us to load samples from two different mipmaps in memory, so requires a second clock cycle to complete.
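As a rough way to budget texture pipe time for a shader, the two rates above can be combined as follows. This is a simplified sketch that assumes no cache misses or other stalls, and the function name is hypothetical.

```python
def texture_pipe_cycles(bilinear_samples, trilinear_samples):
    """Estimate T-pipe cycles: 1 per bilinear sample, 2 per trilinear sample."""
    return bilinear_samples * 1 + trilinear_samples * 2


# A fragment shader with 3 bilinear lookups and 2 trilinear lookups.
print(texture_pipe_cycles(3, 2))  # 7
```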
The load/store pipeline (LS-pipe) is responsible for all shader memory accesses which are not related to texturing.
For graphics workloads this means reading per-vertex attribute inputs and writing computed per-vertex outputs during vertex shading, and reading the per-vertex output values that were written by the vertex shader during fragment shading so they can be interpolated as a varying value.
In general every instruction is a single memory access operation, although, as in the arithmetic pipeline, these are vector operations and so could load an entire "highp" vec4 varying in a single cycle.
In the OpenGL ES specification "fragment operations" - which include depth and stencil testing - happen at the end of the pipeline, after fragment shading has completed. This makes the specification very simple, but implies that you have to spend lots of time shading something, only to throw it away at the end of the frame if it turns out to be killed by ZS testing. Coloring fragments just to discard them would cost a huge amount of performance and waste energy, so where possible we will do ZS testing early (i.e. before fragment shading), only falling back to late ZS testing (i.e. after fragment shading) where it is unavoidable (e.g. a dependency on a fragment which may call "discard" and as such has indeterminate depth state until it exits the tripipe).
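The saving from early ZS testing can be illustrated with a toy model of a single pixel. This is only a sketch under stated assumptions: opaque fragments, a less-than depth test, and no stencil. It says nothing about how the hardware actually implements the test, and the function name is hypothetical.

```python
def shade_count(fragment_depths, early_zs):
    """Count fragment shader runs for one pixel.

    fragment_depths: depths in submission order (smaller means closer).
    early_zs: test depth before shading if True, after shading if False.
    """
    depth_buffer = float("inf")
    shaded = 0
    for z in fragment_depths:
        if early_zs:
            if z >= depth_buffer:
                continue      # killed before any shading work is done
            shaded += 1
            depth_buffer = z
        else:
            shaded += 1       # shading cost is paid first...
            if z < depth_buffer:
                depth_buffer = z  # ...and the result is only kept if it passes
    return shaded


front_to_back = [0.1, 0.5, 0.9]
print(shade_count(front_to_back, early_zs=True))   # 1
print(shade_count(front_to_back, early_zs=False))  # 3
```

In this model the benefit depends on submission order: if the same three fragments arrive back-to-front, every one passes the early test and all three are still shaded.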
In addition to the traditional early-z schemes, we also have some overdraw removal capability which can stop fragments which have already been rasterized from turning into real rendering work if they do not contribute to the output scene in a useful way. My colleague seanellis has a great blog looking at this technology - Killing Pixels - A New Optimization for Shading on Arm Mali GPUs - so I won't dive into any more detail here.
This section is an after-the-fact addition to this blog, so if you have read this blog before and don't remember this section, don't worry, you're not going crazy. We have been getting a lot of questions from developers writing OpenCL kernels and OpenGL ES compute shaders asking for more information about the GPU cache structure, as it can be really beneficial to lay out data structures and buffers to optimize cache locality. The salient facts are:
Based on this simple model it is possible to outline some of the fundamental properties underpinning the GPU performance.
If we scale this to a Mali-T760 MP8 running at 600MHz we can calculate the theoretical peak performance as:
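As a sketch of this calculation, the snippet below uses only the per-clock figures quoted earlier in this blog: one 32-bit pixel written per core per clock, one bilinear texel per core per clock, 17 FP32 FLOPS per arithmetic pipeline with two pipelines per core, and 256 bits of memory bandwidth per clock for an 8-core design. The variable names are illustrative, and the results are theoretical peaks rather than measured performance.

```python
CORES = 8
CLOCK_HZ = 600_000_000          # 600MHz

PIXELS_PER_CORE_PER_CLOCK = 1   # one 32-bit pixel written per core per clock
TEXELS_PER_CORE_PER_CLOCK = 1   # one bilinear texel per texture pipe per clock
ARITHMETIC_PIPES_PER_CORE = 2
FLOPS_PER_PIPE_PER_CLOCK = 17   # FP32, derived from the public GFLOPS rating
BUS_BITS_PER_CLOCK = 256        # expected for an 8-core design

peak = {
    "fill_pixels_per_s": CORES * PIXELS_PER_CORE_PER_CLOCK * CLOCK_HZ,
    "bilinear_texels_per_s": CORES * TEXELS_PER_CORE_PER_CLOCK * CLOCK_HZ,
    "fp32_flops": CORES * ARITHMETIC_PIPES_PER_CORE
    * FLOPS_PER_PIPE_PER_CLOCK * CLOCK_HZ,
    "bandwidth_bytes_per_s": (BUS_BITS_PER_CLOCK // 8) * CLOCK_HZ,
}

for name, value in peak.items():
    print(f"{name}: {value / 1e9:.1f}e9")
```

This works out to 4.8 GPix/s of fill rate, 4.8 GTex/s of bilinear texturing, 163.2 FP32 GFLOPS, and 19.2GB/s of memory bandwidth.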
The observant reader will have noted that I've talked a lot about vertices and fragments - the staple of graphics work - but have mentioned very little about how OpenCL and RenderScript compute threads come into being inside the core. Both of these types of work behave almost identically to vertex threads - you can view running a vertex shader over an array of vertices as a 1-dimensional compute problem. So the vertex thread creator also spawns compute threads, although more accurately I would say the compute thread creator also spawns vertices.
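The "vertex shading is a 1-dimensional compute problem" view can be sketched in a few lines. This is a conceptual model only: the helper names are hypothetical, the "shader" is a trivial scale, and a plain loop stands in for the many threads the GPU would run in parallel.

```python
def run_1d(kernel, global_size, *buffers):
    """Invoke the kernel once per work item, as a 1-D dispatch would."""
    for global_id in range(global_size):
        kernel(global_id, *buffers)


def scale_vertex(global_id, positions, out):
    """Stand-in 'vertex shader': one thread transforms one vertex."""
    x, y, z = positions[global_id]
    out[global_id] = (2.0 * x, 2.0 * y, 2.0 * z)


positions = [(0.0, 1.0, 2.0), (3.0, 4.0, 5.0)]
out = [None] * len(positions)
run_1d(scale_vertex, len(positions), positions, out)
print(out)  # [(0.0, 2.0, 4.0), (6.0, 8.0, 10.0)]
```

Each invocation reads and writes only the element selected by its own index, which is why the same thread creation path can serve both vertex and compute work.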
A document explaining the Midgard family performance counters, which map onto the block architecture described in this blog, can be found on my blog on the Midgard family.
This blog concludes the first chapter of this series, developing the abstract machine which defines the basic behaviors which an application developer should expect to see for a Mali GPU in the Midgard family. Over the rest of this series I'll start to put this new knowledge to work, investigating some common application development pitfalls, and useful optimization techniques, which can be identified and debugged using the Mali integration into the Arm DS-5 Streamline profiling tools.
My next blog on Mali performance, [Mali Performance: Checking the Pipeline], is available at https://community.arm.com/graphics/b/blog/posts/mali-performance-1-checking-the-pipeline
Comments and questions welcomed as always,
TTFN,
Pete
Note: the 19.2GB/s peak bandwidth figure (256 bits per clock at 600MHz) is subject to the ability of the rest of the memory system outside of the GPU to give us data this quickly. Like most features of an Arm-based chip, the downstream memory system is highly configurable in order to allow different vendors to tune power, performance, and silicon area according to their needs. For most SoC parts the rest of the system will throttle the available bandwidth before the GPU runs out of ability to request data. It is unlikely you would want to sustain this kind of bandwidth for prolonged periods, but short burst performance is important.