1 2 3 Previous Next

ARM Mali Graphics

193 posts

In his book "How music works", David Byrne points out that music is created to fit a given context: music that would sound great in a symphony hall would likely sound unconvincing in a stadium.  Similarly, OpenCL programs are often created with a particular context in mind.  In particular, OpenCL kernels are often optimised for a particular compute device (e.g. a GPU in the programmer's desktop machine).


I am continuing my blog series by presenting the fundamentals of the ARM Midgard architecture underpinning the ARM® Mali-T600 and Mali-T700 GPU series.


Architecture Overview

The number of Mali cores in a system-on-chip (SoC) can be scaled to satisfy the performance requirements for this SoC.  For example, the Mali-T624 can be scaled from one to four cores. Each core contains a tri-pipe consisting of two arithmetic (A) pipelines, one load-store (LS) pipeline and one texturing (T) pipeline:


Thus, the peak throughput of each core is two A instruction words, one LS instruction word and one T instruction word per cycle.


Try as I might, I cannot refer the Midgard architecture to a single class:

  • Midgard is a Very Long Instruction Word (VLIW) architecture, such that each pipe contains multiple units and most instruction words contain instructions for multiple units.
  • Midgard is also a Single Instruction Multiple Data (SIMD) architecture, such that most instructions operate on multiple data elements packed in 128-bit vector registers.
  • Finally, Midgard is a Fine-Grain Multi-Threaded (FGMT) architecture, such that each core runs its threads in a round-robin fashion, on every cycle switching to the next ready-to-execute thread. What's interesting, each thread has its individual program counter (unlike warp-based designs, where threads in a warp share the same program counter).


Guidelines for Optimising Compute Kernels

So what do the Midgard architectural features actually mean for optimising compute kernels? I recommend:

  • Having sufficient instruction level parallelism in kernel code to allow for dense packing of instructions into instruction words by the compiler. (This addresses the VLIW-ness of the architecture.)
  • Using vector operations in kernel code to allow for straightforward mapping to vector instructions by the compiler. (I will have much more to say on vectorisation later, as it's one of my favourite topics.)
  • Having a balance between A and LS instruction words. Without cache misses, the ratio of 2:1 of A-words to LS-words would be optimal; with cache misses, a higher ratio is desirable. For example, a kernel consisting of 15 A-words and 7 LS-words is still likely to be bound by the LS-pipe.
  • Using a sufficient number of concurrently executing (or active) threads per core to hide the execution latency of instructions (which is the depth of a corresponding pipeline). The maximum number of active threads I is determined by the number of registers R that the kernel code uses: I = 256, if 0 < R ≤ 4; I = 128, if 4 < R ≤ 8; I = 64, if 8 < R ≤ 16. For example, kernel A that uses 5 registers and kernel B that uses 8 registers can both be executed by running no more than 128 threads per core. This means that it may be preferable to split complex, register-heavy kernels into a number of simpler ones. (For compiler people among us, this also means the backend may decide to spill a value to memory rather than use an extra register to hold it when the number of used registers approaches 4, 8, or 16.)


In some respects, writing high performance code for the Mali GPUs embedded in SoCs is easier than for GPUs found in desktop machines:

  • The global and local OpenCL address spaces get mapped to the same physical memory (the system RAM), backed by caches transparent to the programmer. This often removes the need for explicit data copying and associated barrier synchronisation.
  • Since all threads have individual program counters, branch divergence is less of an issue than for warp-based architectures.


Fasten your Seat Belt!

With this theoretical knowledge under the belt, we can now look at optimising some kernels! Can you guess where we'll start?.. (Or use comments to vote for your favourite kernel!)

Last month I was at Game Developer Conference (GDC) where I had a fabulous time attending various talks and roundtables, visiting exhibitors and I had a particularly good time showing and explaining to people the latest technologies developed within ARM, such as ASTC 3D HDR textures and Transaction Elimination, as well as compute shaders.


With regards to the last one, many of you have been curious about how to get this piece of technology incorporated into your software. With that in mind, I decided to write this blog to help you write a simple program with compute shaders. I hope this blog will help you to create more advanced applications based on this technology.


So, what are compute shaders? Compute shaders introduce heterogeneous GPU Compute from within the OpenGL® ES API; the same API and shading language which are used for graphics rendering. Now that compute shaders have been introduced to the API, developers do not have to learn another API in order to make use of GPU Compute. The compute shader is just another type of shader in addition to the already broadly known vertex and fragment shaders.


Compute shaders give a lot of freedom to developers to implement complex algorithms and make use of GPU parallel programming. Although the contemporary graphics pipeline is very flexible, developers still tend to stumble on some restrictions. The compute shaders feature, however, makes life easier for us to not think about pipeline stages as we are used to thinking about vertex and fragment. We are no longer restricted by the inputs and outputs of certain pipeline stages. The Shader Storage Buffer Object (SSBO) feature for instance has been introduced along with compute shaders and that gives additional possibilities for exchanging data between pipeline stages, as well as being flexible input and output for compute shaders.

Below you can find a simple example of how to implement compute shaders within your application. The example calculates a coloured circle with a given radius; the radius is a uniform parameter passed by the application and is updated every frame in order to animate the radius of the circle. The whole circle is drawn using points, which are stored as vertices within a Vertex Buffer Object (VBO). The VBO is mapped onto SSBO (without any extra copy in memory) and passed to the compute shader.


Let’s start by writing the OpenGL ES Shading Language (ESSL) compute shader code first:

#version 310 es


// The uniform paramters which is passed from application for every frame.

uniform float radius;


// Declare custom data struct, which represents either vertex or colour.

struct Vector3f


      float x;

      float y;

      float z;

      float w;



// Declare the custom data type, which represents one point of a circle.

// And this is vertex position and colour respectively.

// As you may already noticed that will define the interleaved data within

// buffer which is Vertex|Colour|Vertex|Colour|…

struct AttribData


      Vector3f v;

      Vector3f c;



// Declare input/output buffer from/to wich we will read/write data.

// In this particular shader we only write data into the buffer.

// If you do not want your data to be aligned by compiler try to use:

// packed or shared instead of std140 keyword.

// We also bind the buffer to index 0. You need to set the buffer binding

// in the range [0..3] – this is the minimum range approved by Khronos.

// Notice that various platforms might support more indices than that.

layout(std140, binding = 0) buffer destBuffer


      AttribData data[];

} outBuffer;


// Declare what size is the group. In our case is 8x8, which gives

// 64 group size.

layout (local_size_x = 8, local_size_y = 8, local_size_z = 1) in;


// Declare main program function which is executed once

// glDispatchCompute is called from the application.

void main()


      // Read current global position for this thread

      ivec2 storePos = ivec2(gl_GlobalInvocationID.xy);


      // Calculate the global number of threads (size) for this

      uint gWidth = gl_WorkGroupSize.x * gl_NumWorkGroups.x;

      uint gHeight = gl_WorkGroupSize.y * gl_NumWorkGroups.y;

      uint gSize = gWidth * gHeight;


      // Since we have 1D array we need to calculate offset.

      uint offset = storePos.y * gWidth + storePos.x;


      // Calculate an angle for the current thread

      float alpha = 2.0 * 3.14159265359 * (float(offset) / float(gSize));


      // Calculate vertex position based on the already calculate angle

      // and radius, which is given by application

      outBuffer.data[offset].v.x = sin(alpha) * radius;

      outBuffer.data[offset].v.y = cos(alpha) * radius;

      outBuffer.data[offset].v.z = 0.0;

      outBuffer.data[offset].v.w = 1.0;


      // Assign colour for the vertex

      outBuffer.data[offset].c.x = storePos.x / float(gWidth);

      outBuffer.data[offset].c.y = 0.0;

      outBuffer.data[offset].c.z = 1.0;

      outBuffer.data[offset].c.w = 1.0;


Once the compute shader code has been written, it is time to make it work in our application. Within the application you need to create a compute shader, which is just a new type of shader (GL_COMPUTE_SHADER), and the other calls related to the initialisation remain the same as for vertex and fragment shaders. See below for a snippet of code which creates the compute shader and also checks for both compilation and linking errors:

// Create th compute program, to which the compute shader will be assigned

gComputeProgram = glCreateProgram();


// Create and compile the compute shader

GLuint mComputeShader = glCreateShader(GL_COMPUTE_SHADER);

glShaderSource(mComputeShader, 1, computeShaderSrcCode, NULL);



// Check if there were any issues when compiling the shader

int rvalue;

glGetShaderiv(mComputeShader, GL_COMPILE_STATUS, &rvalue);

if (!rvalue)


       glGetShaderInfoLog(mComputeShader, LOG_MAX, &length, log);

       printf("Error: Compiler log:\n%s\n", log);

       return false;



// Attach and link the shader against to the compute program

glAttachShader(gComputeProgram, mComputeShader);



// Check if there were some issues when linking the shader.

glGetProgramiv(gComputeProgram, GL_LINK_STATUS, &rvalue);

if (!rvalue)


       glGetProgramInfoLog(gComputeProgram, LOG_MAX, &length, log);

       printf("Error: Linker log:\n%s\n", log);

       return false;



So far we have created the compute shader on the GPU. Now we need to set up handlers, which will be used for setting up inputs and outputs for the shader. In our case we need to retrieve the radius uniform handle and set the gIndexBufferBinding (the integer variable) to 0, as the binding was hardcoded within binding = 0. Using this index we will be able to bind the VBO to that index and write data from within the compute shader to the VBO:


// Bind the compute program in order to read the radius uniform location.



// Retrieve the radius uniform location

iLocRadius = glGetUniformLocation(gComputeProgram, "radius");

// See the compute shader: “layout(std140, binding = 0) buffer destBuffer”

gIndexBufferBinding = 0;


Okay, so far so good. Now we are ready to kick off the compute shader and write data to the VBO. The snippet of code below shows how to bind the VBO to the SSBO and submit a compute job to the GPU:

// Bind the compute program



// Set the radius uniform

glUniform1f(iLocRadius, (float)frameNum);


// Bind the VBO onto SSBO, which is going to filled in witin the compute

// shader.

// gIndexBufferBinding is equal to 0 (same as the compute shader binding)

glBindBufferBase(GL_SHADER_STORAGE_BUFFER, gIndexBufferBinding, gVBO);


// Submit job for the compute shader execution.



// As the result the function is called with the following parameters:

// glDispatchCompute(2, 2, 1)


                                   (NUM_VERTS_H % GROUP_SIZE_WIDTH  + NUM_VERTS_H) / GROUP_SIZE_WIDTH,

                                   (NUM_VERTS_V % GROUP_SIZE_HEIGHT + NUM_VERTS_V) / GROUP_SIZE_HEIGHT,



// Unbind the SSBO buffer.

// gIndexBufferBinding is equal to 0 (same as the compute shader binding)

glBindBufferBase(GL_SHADER_STORAGE_BUFFER, gIndexBufferBinding, 0);


As you may have already noticed, for the glDispatchCompute function we pass the number of groups rather than number of threads to be executed. In our case we execute 2x2x1  groups, which gives 4. However the real number of threads (kernels) executed will be 4 x [8 x 8] which results with the number of 256 threads. The numbers 8x8 come from the compute shader source code, as we hardcoded those numbers within the shader.


So far we have written the compute shader source code, compiled, linked, initialised handlers and dispatched the job for compute. Now it’s time to render the results on screen. However, before we do that we need to remember that all jobs are submitted and executed on the GPU in parallel, so we need to make sure the compute shader will finish the job before the actual draw command starts fetching data from the VBO buffer, which is updated by the compute shader. In this example you won't see much difference in runtime with and without synchronisation but once you implement more complex algorithms with more dependencies, you may notice how important it is to have synchronisation.


// Call this function before we submit a draw call, which uses dependency

// buffer, to the GPU



// Bind VBO

glBindBuffer( GL_ARRAY_BUFFER, gVBO );


// Bind Vertex and Fragment rendering shaders





// Draw points from VBO

glDrawArrays(GL_POINTS, 0, NUM_VERTS);


In order to present the VBO results on screen you can use vertex and fragment programs, which are shown below.


Vertex shader:


attribute vec4 a_v4Position;

attribute vec4 a_v4FillColor;

varying vec4 v_v4FillColor;

void main()


      v_v4FillColor = a_v4FillColor;

      gl_Position = a_v4Position;



Fragment shader:


varying vec4 v_v4FillColor;

void main()


      gl_FragColor = v_v4FillColor;


I think that’s all for this blog and hopefully I will be able to cover more technical details in the future. I believe you will find compute shaders friendly and easy to use in your work. I personally enjoyed implementing the Cloth Simulation demo, one of ARM’s latest technical demos, which was released at GDC. The important thing in my view is that now, once a developer is used to OpenGL ES, it is easy to move on to GPU Compute using just one API. More than that, exchanging data between graphics and compute buffers appears to be done in a clean and transparent way for developers. You shouldn’t limit your imagination to this blog’s application of how you might want to use compute shaders - this blog is only to help developers learn how to use them. I personally can see a real potential in image processing, as you can implement algorithms that will be executed on the chip using internal memory, which must reduce traffic on the bus between memory and chip.

You can also have a look at our latest Cloth Simulation demo, which has been implemented with compute shaders. See the video below:


April 1, 2014 was the date for the Israel Machine Vision Conference (IMVC) in Tel Aviv.  I’m always slightly wary of attending events held on April 1.  I never know for sure that after all the queuing, checking in, travelling, waiting for bags, finding the hotel, the venue, someone doesn’t just say “April Fool!” when you get there.  Well, not to worry… IMVC was very real.  It’s the annual get-together for the prolific Israeli computer vision community and packed in a day of fascinating talks about the latest developments in this exciting subject.  It was great to see so much innovation in evidence and to hear from companies large and small working in this area.

As with many areas of technology there was much talk about mobile.  Computer vision in mobile devices is a hot topic, particularly as the energy efficient yet powerful processors required are quickly coming of age.  Roberto Mijat and I were there to talk about ARM’s central role in this area and in particular the advantages of using the GPU to offload some of the processing required to enable these sorts of features.

     In full flow talking about GPU Compute on Mali


Devices containing the ARM® Mali™-T600 series of GPUs have been providing general purpose compute capabilities for a couple of years now and there are many examples of the benefits of using the GPU for both graphics and non-graphics computation.  I showcased a few of these in my talk, including GPU-accelerated gesture recognition from Israeli company eyeSight® Technologies and face detection and analysis from PUX (Panasonic), both of which have been optimised to run on the Mali-T604 GPU using OpenCL™.   In these and many other cases we see the GPU making sufficient difference to enable computer vision algorithms to run in real time.  Better still the GPU gives us the additional compute bandwidth which allows the use of more sophisticated algorithms that have been shown to enhance the user experience significantly.  eyeSight’s low-light gesture detection is a great example.  And equally as  important is that we can do all this whilst burning much less energy – a crucial requirement for mobile devices.



eyeSight's gesture recognition in action (as shown at CES 2014)

Another area of discussion – both in my talk and elsewhere at the conference – compared the different ways of achieving computer vision on mobile.  As well as using GPUs, ARM’s CPU processor technology already offers heterogeneous features through big.LITTLE™ and NEON™ technology, and there are custom DSPs designed for specific image processing jobs that can sit alongside the ARM CPU.  A DSP is hard to beat when it comes to area and power, but the downside is its lack of flexibility.  As new algorithms come along you need new DSPs – and this is where the programmable GPU really scores.  It allows existing hardware to take on powerful new capabilities.



    Demonstrating PUX face detection and analysis demo at IMVC

We met with many interesting companies and discussed some compelling new computer vision use cases boding well for what we’ll see emerging over the next few months.  The conference ended with some interesting stargazing from both Google and Microsoft.  And then it was all over for another year.  Our hosts and conference organiser was SagivTech Ltd, a company dedicated to computer vision research, development and education.  Our thanks to them for inviting us along, and for organising such a great event.


For more information, please visit:





Over the first few blogs in this series I have introduced the high level rendering model which the Mali "Midgard" GPU family uses. In the remainder of this series I will explain how to use DS-5 Streamline, a system-level profiling tool from ARM, to identify areas where an application is not getting the best performance out of a Mali-based system.

In this blog we will look at debugging issues around macro-scale pipelining, the means by which we keep the GPU busy all of the time, and some of the common reasons for that frame level pipeline to stall. If you are new to this series I would recommend reading at least the first blog, as it introduces the concepts which we will be investigating in more detail this time around.


Note: I'm assuming you already have DS-5 Streamline up and running on your platform. If you are yet to do this, there are some work guides posted on the community for getting set up on a variety of Mali-based consumer devices.



The examples in this blog were captured using DS-5 v5.16.


What does good content look like?


Before we dive into diagnosing performance problems it is useful to understand what we are aiming for, and what this looks like in Streamline. There are two possible "good" behaviors depending on the performance of the system and the complexity of the content.


  • One for content where the GPU is the bottleneck
  • One for content where the vsync is the bottleneck


The counters needed for this experiment are:


  • Mali Job Manager Cycles: GPU cycles
    • This counter increments any clock cycle the GPU is doing something
  • Mali Job Manager Cycles: JS0 cycles
    • This counter increments any clock cycle the GPU is fragment shading
  • Mali Job Manager Cycles: JS1 cycles
    • This counter increments any clock cycle the GPU is vertex shading or tiling


The GPU is the bottleneck


If we successfully create and maintain the frame-level rendering pipeline needed for content where the GPU is the bottleneck (e.g. the rendering is too complex to hit 60 FPS), then we would expect one of the GPU workload types (vertex or fragment processing) to be running at full capacity all of the time.


In nearly all content the fragment processing is the dominant part of the GPU execution; applications usually have one or two orders of magnitude more fragments to shade than vertices. In this scenario we would therefore expect JS0 to be active all of the time, and both the CPU and JS1 to be going idle for at least some of the time every frame.


When using Streamline to capture this set of counters we will see three activity graphs which are automatically produced by the tool, in addition to the raw counter values for GPU. We can see that the "GPU Fragment" processing is fully loaded, and that both the "CPU Activity" and the "GPU Vertex-Tiling-Compute" workloads are going idle for a portion of each frame. Note - you need to zoom in down close to the 1ms or 5ms zoom level to see this - we are talking about quite short time periods here.


The vsync signal is the bottleneck


In systems which are throttled by vsync then we would expect the CPU and the GPU to go idle every frame, as they cannot render the next frame until the vsync signal occurs and a window buffer swap happens. The graph below shows what this would look like in Streamline:




If you are a platform integrator rather than an application developer, testing cases which are running at 60FPS can be a good way to review the effectiveness of your system's DVFS frequency choices. In the example above there is a large amount of time between each burst of activity. This implies that the DVFS frequency selected is too high and that the GPU is running much faster than it needs to, which reduces energy efficiency of the platform as a whole.


Content issue #1: Limited by vsync but not hitting 60 FPS


In a double-buffered system it is possible to have content which is not hitting 60 FPS, but which is still limited by vsync. This content will look much like the graph above, except the time between workloads will be a multiple of one frame period, and the visible framerate will be an exact division of the maximum screen refresh rate (e.g. a 60 FPS panel could run at 30 FPS, 20 FPS, 15 FPS, etc).


In a double-buffered system which is running at 60 FPS the GPU successfully manages to produce frames in time for each vsync buffer swap. In the figure below we see the lifetime of the two framebuffers (FB0 and FB1), with periods where they are on-screen in green, and periods where they are being rendered by the GPU in blue.




In a system where the GPU is not running fast enough to do this, we will miss one or more vsync deadlines, so the current front-buffer will remain on screen for another vsync period. At the point of the orange line in the diagram below the front-buffer is still being displayed on the screen, and the back-buffer is queued for display, the GPU has no more buffers to render on to and goes idle. Our performance snaps down to run at 30 FPS, despite having a GPU which is fast enough to run the content at over 45 FPS.




The Android windowing system typically uses triple buffering, so avoids this problem as the GPU has a spare buffer available to render on to, but this is still seen in some X11-based Mali deployments which are double buffered. If you see this issue it is recommended that you disable vsync while doing performing optimization; it is much easier to determine what needs optimizing without additional factors clouding the issue!


Content issue #2: API Calls Break the Pipeline


The second issue which you may see is a pipeline break. In this scenario at least one of the CPU or GPU processing parts are busy at any point, but not at the same time; some form of serialization point has been introduced.


In the example below the content is fragment dominated, so we would expect the fragment processing to be active all the time, but we see an oscillating activity which is serializing GPU vertex processing and fragment processing.



The most common reason for this is the use of an OpenGL ES API function which enforces the synchronous behavior of the API, forcing the driver to flush all of the pending operations and drain the rendering pipeline in order to honor the API requirements. The most common culprits here are:


  • glFinish(): explicitly request a pipeline drain.
  • glReadPixels(): implicitly request a pipeline drain for the current surface.
  • glMapBufferRange() without GL_MAP_UNSYNCHRONIZED_BIT set: explicit pipeline drain for all pending surfaces using the data resource being mapped.


It is almost impossible to make these API calls fast due to their pipeline draining semantics, so I would suggest avoiding these specific uses wherever possible. It is worth noting that OpenGL ES 3.0 allows glReadPixels to target a Pixel Buffer Object (PBO) which can do the pixel copy asynchronously. This no longer causes a pipeline flush, but may mean you have to wait a while for your data to arrive, and the memory transfer can still be relatively expensive.


Content issue #3: Not GPU limited at all


The final issue I will talk about today is one where the GPU is not the bottleneck at all, but which often shows up as poor graphics performance.


We can only maintain the pipeline of frames if the CPU can produce new frames faster than the GPU consuming them. If the CPU takes 20ms to produce a frame which the GPU takes 5ms to render, then the pipeline will run empty each frame. In the example below the GPU is going idle every frame, but the CPU is running all of the time, which implies that the CPU cannot keep up with the GPU.


"Hang on" I hear you say, "that says the CPU is only 25% loaded". Streamline shows the total capacity of the system as 100%, so if you have 4 CPU cores in your system with one thread maxing out a single processor then this will show up as 25% load. If you click on the arrow in the top right of the "CPU Activity" graph's title box it will expand giving you separate load graphics per CPU core in the system:


As predicted we have one core maxed at 100% load, so this thread is the bottleneck in our system which is limiting the overall performance. There can be many reasons for this, but in terms of the graphics behavior rather than application inefficiency, the main two are:


  • Excessive amounts of glDraw...() calls
  • Excessive amounts of dynamic data upload


Every draw call has a cost for the driver in terms of building control structures and submitting them to the GPU. The number of draw calls per frame should minimized by batching together drawing of objects with similar render state, although there is a balance to be struck between larger batches and efficient culling of things which are not visible. In terms of a target to aim for: most high-end 3D content on mobile today uses around 100 draw calls per render target, with many 2D games coming in around 20-30.


In terms of dynamic data upload be aware that every data buffer uploaded from client memory to the graphics server requires the driver to copy that data from a client buffer into a server buffer. If this is a new resource rather than sub-buffer update then the driver has to allocate the memory for the buffer too. The most common offender here is the use of client-side vertex attributes. Where possible use static Vertex Buffer Objects (VBOs) which are stored persistently in graphics memory, and use that buffer by reference in all subsequent rendering. This allows you to pay the upload cost once, and amortize that cost over many frames of rendering.


It some cases it may not be Mali graphics stack which is limiting the performance at all. We do sometimes get support cases where the application logic itself is taking more than 16.6ms, so the application could not hit 60 FPS even if the OpenGL ES calls were infinitely fast. DS-5 Streamline contains a very capable software profiler which can help you identify precisely where the bottlenecks are in your code, as well as helping you load balance workloads across multiple CPU cores in your system if you want to parallelize your software using multiple threads, but as this is not directly related to the Mali behavior I'm not going to dwell on it this time around.


Next Time ...


Next time I will be reviewing the Mali driver's approach to render target management, and how to structure your application's use of Frame Buffer Objects (FBOs) to play nicely with this model.


Comments and questions welcome,




Pete Harris is the lead performance engineer for the Mali OpenGL ES driver team at ARM. He enjoys spending his time working on a whiteboard and determining how to get the best out of combined hardware and software compute sub-systems. He spends his working days thinking about how to make the ARM Mali drivers even better.

ARM has recently published a set of OpenGL® ES extensions. Here we explain some of the background that led us to develop these and show how they can be used to implement some common graphics algorithms more efficiently.


Locality of reference


Many algorithms in computer science can be implemented more efficiently by exploiting locality of reference. That is, efficiency can be gained by making the memory access patterns of an algorithm more predictable. This is also true for computer graphics and is an underlying principle behind the tile-based architectures of the ARM Mali GPUs.


But the locality principle applies beyond tiles. Many graphics algorithms have locality at the level of individual pixels: a value written to a pixel in one operation may be read or modified by a later operation working at the exact same pixel location. Blending is a basic example of this principle that is supported on current GPUs. The value of a pixel is written while rendering a primitive and later read and modified while rendering another primitive. But there are also more complex examples, such as deferred shading, where this principle is not yet exploited. These more complex algorithms require storing multiple values per pixel location, which are finally combined in an application-specific way to produce the final pixel value. On today’s graphics APIs, these algorithms are typically implemented by a multi-pass approach. Pixel values are first written to a set of off-screen render targets, for example using the Multiple Render Target (MRT) support in OpenGL® ES 3.0. In a second pass, these render targets are read as textures and used to compute the final pixel value that is written to the output framebuffer.


One obvious issue with the multi-pass approach is that the intermediate values must be written back to memory. This is far from ideal since keeping memory bandwidth – and thereby power - down is very important for mobile GPUs.


A more efficient approach is possible on the ARM Mali GPUs. As mentioned above, ARM Mali GPUs have a tile-based architecture. As described in a previous blog post by Peter Harris (The Mali GPU: An Abstract Machine, Part 2 - Tile-based Rendering), this means that we perform fragment shading on 16x16 tiles at a time. All memory required to store the framebuffer values for a tile is stored on-chip until all fragment shading for the tile is complete. This property led us to develop a set of extensions that enables applications to better exploit the locality principle, or what we generally refer to as pixel local storage. The first two extensions, ARM_shader_framebuffer_fetch and ARM_shader_framebuffer_fetch_depth_stencil add the ability to return the current color, depth, and stencil values of a pixel to the fragment shader. The third extension, EXT_shader_pixel_local_storage, enables applications to store custom data per pixel.

Shader Framebuffer Fetch


ARM_shader_framebuffer_fetch enables applications to read the current framebuffer color from the fragment shader. This is useful for techniques such as programmable blending. An example of how this would be used is shown below.

#extension GL_ARM_shader_framebuffer_fetch : enable
precision mediump float;
uniform vec4 uBlend0;
uniform vec4 uBlend1;

void main(void)
     vec4 color = gl_LastFragColorARM;
     color = max(color, uBlend0, Color.w * uBlend0.w) ;
     color *= uBlend1;

     gl_FragColor = color;


ARM_shader_framebuffer_fetch_depth_stencil additionally allows applications to read the current depth and stencil values from the framebuffer. This enables use-cases such as programmable depth and stencil testing, modulating shadows, soft particles and creating variance shadow maps in a single render pass. Example code for the last two uses-cases is included in the Bandwidth Efficient Graphics with ARM Mali GPUs white paper.

Shader Pixel Local Storage


EXT_shader_pixel_local_storage enables applications to store and retrieve arbitrary values at a given pixel location. This is a powerful principle that enables algorithms such as deferred shading to be implemented without incurring a large bandwidth cost. The amount of storage per pixel is implementation defined, but the extension guarantees that there is storage for at least 16 bytes per pixel.


You will notice that this is an “EXT” extension rather than a vendor-specific “ARM” extension. In OpenGL ES parlance, an “EXT” means multi-vendor. In this case, we worked with other industry players to define the extension, in order to ensure that it works well on their hardware as well as on ours.


So how does it work? Let’s look at a deferred shading example. A typical implementation of this technique using EXT_shader_pixel_local_storage splits the rendering into three passes: a G-Buffer generation pass where the properties (diffuse color, normal, etc.) of each pixel are stored in pixel local storage, a Shading pass where lighting is calculated based on the stored properties and accumulated in pixel local storage, and a Combination pass that uses the values in pixel local storage to calculate the final value of the pixel. These passes are outlined below. For a complete example and further descriptions of the algorithm, refer to the code sample at the Mali Developer Center.


In the G-Buffer generation pass, instead of writing to regular color outputs, the fragment shader would declare a pixel local storage output block:


__pixel_local_outEXT FragData
     layout(rgba8) highp vec4 Color;
     layout(rg16f) highp vec2 NormalXY;
     layout(rg16f) highp vec2 NormalZ_LightingB;
     layout(rg16f) highp vec2 LightingRG;
} gbuf;

void main()
     gbuf.Color = calcDiffuseColor();
     vec3 normal = calcNormal();
     gbuf.NormalXY = normal.xy;
     gbuf.NormalZ_LightingB.x = normal.z;


The shader would use this block to store the G-Buffer values in the pixel local storage. The image below illustrates what the contents of the pixel local storage might look like at the end of this pass. Keep in mind that that only one tile’s worth of data would be stored at any given time.



In the Shading pass, the same pixel local storage block would be used to accumulate lighting. In this case, the pixel local storage block would be both read from and written to:


__pixel_localEXT FragData
     layout(rgba8) highp vec4 Color;
     layout(rg16f) highp vec2 NormalXY;
     layout(rg16f) highp vec2 NormalZ_LightingB;
     layout(rg16f) highp vec2 LightingRG;
} gbuf;

void main()
     vec3 lighting = calclighting(gbuf.NormalXY.x,
     gbuf.LightingRG += lighting.xy;
     gbuf.NormalZ_LightingB.y += lighting.z;


At this point, the contents of the pixel local storage would also include the accumulated lighting (see image below):




Finally, the Combination pass would read from the pixel local storage and calculate the final pixel value:


__pixel_local_inEXT FragData
     layout(rgba8) highp vec4 Color;
     layout(rg16f) highp vec2 NormalXY;
     layout(rg16f) highp vec2 NormalZ_LightingB;
     layout(rg16f) highp vec2 LightingRG;
} gbuf;

out highp vec4 fragColor;

void main()
     fragColor = resolve(gbuf.Color,

We now have our final image (see below) and the pixel local storage is no longer valid.



The important point here is that the pixel local storage data is never written back to memory! The memory for the pixel local storage is kept on-chip throughout and incurs no bandwidth cost. This is significantly more efficient than existing solutions that would require writing 16 bytes of data per pixel for the G-Buffer pass and subsequently read the same amount of data back again in the Shading and Combination passes.


It is also worth pointing out that the above example does not store the depth value in pixel local storage. This is not necessary since ARM_shader_framebuffer_fetch_depth_stencil works well in combination with pixel local storage, effectively increasing the amount of application specific data that can be stored per pixel.




We are very excited about the possibilities opened up by these extensions. These pave the way for algorithms such as deferred shading to be implemented efficiently on mobile GPUs.


And it’s not just about efficiency: these extensions allow you to express the algorithm more directly compared to using an approach based around MRTs. Support for framebuffer fetch from MRTs could avoid some of the bandwidth issues for deferred shading, but would require a more complex implementation. In addition to creating and managing the textures and render buffers for the off-screen render passes, the application would have to provide the appropriate hints, like glInvalidateFramebuffer, to prevent the off-screen render targets from being written to memory. It would also have to rely on clever driver heuristics to avoid the memory being allocated in the first place. Using the extensions presented here, these complexities go away. Everything happens in the fragment shader, allowing you to focus on the core of your algorithm rather than complex state management.


ARM will support these extensions on all GPUs based on the Midgard Architecture. Support for ARM_shader_framebuffer_fetch and ARM_shader_framebuffer_fetch_depth_stencil is also expected to be added to the Mali-400 series of GPUs later this year.

What ideas do you have for using these extensions? We'd be interested in hearing, let us know in the comments below.

Anton Lokhmotov

Introducing PyOpenCL

Posted by Anton Lokhmotov Mar 27, 2014

I will start my blog series with a subseries on technology that I know and come to love best - OpenCL from the Khronos Group. OpenCL aims to provide functional portability across computer systems comprised of general-purpose processors (e.g. CPUs) and special-purpose accelerators (e.g. GPUs), to which I will now and then refer to as accelerated systems.


OpenCL is often criticised for presenting a steep learning curve for software engineers but I don't think that's fair: the verbosity of the OpenCL API is really a consequence of its generality. Once you have written a couple of OpenCL programs, you realise that most of the code running on the host processor is actually boilerplate.


I will sidestep this verbosity of OpenCL altogether by using PyOpenCL - a neat Python module written by Andreas Klöckner. (If you are reading this Andreas, keep up the good work!)


Just install PyOpenCL and NumPy - and you are ready to roll!

In [1]:

import pyopencl as cl

import numpy as np



Trivial example

Suppose we want to create an array containing integers from 0 to 15.

In [2]:
N = 16


That's trivial using NumPy:

In [3]:

np_range = np.arange(N, dtype=np.int32)



array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15], dtype=int32)


but our PyOpenCL example will fill in a similar array using OpenCL:

In [4]:

cl_range = np.zeros(N, dtype=np.int32)



array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)



Gimme some context!

Creating a context could hardly be easier:

In [5]:
In [6]:



Building and running

An OpenCL C program equivalent to np.arange(N) follows:

In [7]:

source = '''

kernel void arange(global int * buffer)


    const size_t gid = get_global_id(0); 

    buffer[gid] = convert_int(gid);




The kernel will be launched as N work-items over a one-dimensional range [0, N-1]. Each work-item will get its unique index gid in the range (that is, an integer between 0 and N-1 inclusive) and write it into argument buffer at offset gid.


Let's build the program:

In [8]:
program = cl.Program(context, source).build()


allocate a memory buffer:

In [9]:
memory_flags = cl.mem_flags.WRITE_ONLY | cl.mem_flags.ALLOC_HOST_PTR
memory = cl.Buffer(context, flags=memory_flags, size=cl_range.nbytes)


launch the kernel:

In [10]:
kernel = program.arange(queue, [N], None, memory)


and copy the results from the buffer to cl_range:

In [11]:

cl.enqueue_copy(queue, cl_range, memory, wait_for=[kernel])



array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15], dtype=int32)


Finally, let's confirm that arrays np_range and cl_range match element-wise:

In [12]:
np.all(np_range == cl_range)




Concluding remarks

I am very interested in getting feedback. I am aiming primarily at software engineers already familiar with OpenCL so I can jump straight to more advanced topics. But if you feel you like a little more explanation, just ask me: I will be happy to explain, or point you to some excellent learning resources out there, or indeed consider filling in details in my later posts...


Also, I prepared this post using the wonderful IPython Notebook environment. I haven't seen much in the way of using PyOpenCL and IPython together, so would be very grateful for any links.


In my next post, I will start discussing how to optimise OpenCL kernels for the ARM® Mali-T600 GPU series.

In this first blog of a series about GPU Compute we look at the one of the biggest challenges facing the future development and evolution of smart devices.

Isn’t technology wonderful?  It’s incredible to think that the processing power in mobile devices has increased 12 times in the last 4 years.  Screen resolutions have increased by over 13 times in the same period. And as our smart devices are capable of more and more, we’re doing more and more with them.  Study after study shows a continued shift away from desktops and laptops as internet, gaming and entertainment go increasingly mobile. But with all this innovation there’s a problem.  An engineering elephant in the room.  In the same 4 years, whilst everything else has increased by an order of magnitude, battery technology has only increased by a factor of 2.  In engineering terms this presents a massive challenge.  All that processing capacity at our finger tips cruelly grabbed away at the last minute.


tech improvements.png
Processing power information source: McKinsey&Company, “Making smartphones brilliant: ten trends” http://goo.gl/rkSP4


So if we could invent better batteries, we’d be OK, right?  Well, although better batteries would be very welcome, sadly it’s not that simple.  The bigger problem than battery power alone is the one of thermal dissipation.  Not the most glamorous subject maybe – I don’t think anyone wrote a Thermal Dissipation folk song for example – but it’s a critical issue facing mobile engineers today.   Put simply, even if we had the power to run our processors harder they would melt because there’s no way to get rid of all the heat they would generate.  This elephant is not only getting in the way, he’s about to do something unpleasant in the corner.


So to tackle this issue we have to think long and hard about processing efficiency. One way to do this is to add more CPU cores.  Indeed a mixture of faster and more energy efficient cores (big.LITTLE Processing - ARM) allows devices to ramp up and down depending on the demand.  But just adding CPU cores doesn’t scale efficiently – after a while we see diminishing returns.


The key to all this – and a very effective way to tackle processing efficiency – is to think heterogeneously.  The idea of heterogeneous computing is to spread the computing load not only across multiple processors, but across different types of processor.  That involves distributing individual parts of your programme to processors that are best suited to run them.  So, for example, general purpose programme flow would sit with the CPU, whilst a complex image processing algorithm might run on a specialist processor designed to cope efficiently with highly parallel workloads.


One such processor is of course the GPU.  Designed to process millions of vertices and pixels to create user interfaces, games and applications for modern smart devices, the GPU is a master at doing things efficiently in parallel.  Early generations of mobile GPUs were limited to graphics only, but back in November 2012 Google’s Nexus 10 – based on the ARM® Mali™-T604 GPU – became the first mobile device capable of running GPU-accelerated general purpose compute.



Google’s Nexus 10 with Mali-T604 GPU


Since then the true benefit of designing applications to run heterogeneously has been demonstrated time after time.  Not only can mobile GPUs speed up certain activities – like image processing, computer vision, video decoding etc. – they can usually do it significantly more efficiently.  And using less power to achieve the same thing is all part of tackling that elephant.


But creating applications that make good use of compute on GPUs can be daunting for software engineers used to traditional programming techniques.  It not only requires a new way of thinking, but new tools and APIs as well.  And understanding the capabilities of the processors at your disposal is a key step to getting the best out of a platform.  In this series of blogs we’ll be going into plenty of detail on this brave new elephant-banishing world.  We’ll be covering the Mali-T600 and T700 GPU architectures in detail, explaining how they do what they do so you can get the best out of them.  We’ll be looking at optimization techniques, software tools and languages that can help you along the way.  This will include blogs on Google’s RenderScript, OpenCL™, ARM NEON™ technology, and much more.


So stay tuned for more on the world of compute on Mali GPUs and let us know in the comments any particular areas of interest you would like to us to cover.


If you have a Samsung Chromebook you can try OpenCL on Mali for yourself.  Check out this guide on the Malideveloper website: http://malideveloper.arm.com/develop-for-mali/features/graphics-and-compute-development-on-samsung-chromebook/

And if you are interested in RenderScript on the Nexus10, here’s a good place to start: http://developer.android.com/guide/topics/renderscript/compute.html

If you're looking for a short summary of the ARM-related news from GDC, look no further than here! The hugely exciting Expo Floor has opened, the attendees are flooding in, the major announcements have all been made and we have a round up of it here just for you:


OpenGL® ES 3.1


OpenGL ES is the API that is used in nearly all the world's mobile devices and on Monday it was announced that the latest version, through its inclusion of features such as compute shaders and indirect drawing, will enable developers to use cutting edge features traditionally reserved for the PC space on mobile devices.Tom Olson, our Director of Graphics Research and more importantly in this matter, Chair of the OpenGL ES Working Group, has already written a great blog celebrating this subject, Here comes OpenGL® ES 3.1! and presented on the subject to a large audience at GDC yesterday in his talk "Getting the most out of OpenGL ES". For those wanting to see a visual example of what OpenGL ES 3.1 can do, ARM released this demo earlier in the week and has been showing it on the booth:



For more information on the OpenGL ES 3.1 API, check out the Khronos Website.


Geomerics Global Illumination technology is going....well, even more global


Whilst the ARM team was settling into its second day at the Smartphone Summit, our yearly warm up for the main booth, Geomerics was at the heart of one of the biggest announcements of the week for worldwide game developers. Unity 5, the award winning, multiplatform engine, announced on Tuesday that amid the number of new features that will make the creation of stunning games that much easier, its key update was the integration of Enlighten technology, the real-time lighting system developed by Geomerics. This is fantastic news for the gaming industry as it will enable the vast numbers of developers who use Unity to do things like animate lights, generate emissive material properties and control the environment lighting in real-time. Enlighten is the only real-time global illumination technology optimized to deliver fully dynamic lighting in game on today's PCs, consoles, and mobile platforms.


Geomerics' Transporter demo, being shown on the ARM Booth at GDC, showcases the latest features of Enlighten technology and the beautiful lighting effects that can be achieved on mobile.



Cocos2D-x introduces 3D


The ARM Booth is hosting a great selection of partners this year and one of them, Cocos2D-x, announced last week their new Cocos3D-x Engine. Cocos3d-x is a multi-platform game framework for building 3D games and applications using C++, based on the popular cocos2d-x engine which has played a part in over 1000 games. Its 3D application development framework is for iOS, Android, and MacOSX platforms.



Epic Games releases subscription model


We managed to catch up with Epic Games on their booth and they discussed the new subscription for Unreal Engine 4 - all of its leading-edge tools, features and complete C++ source code are becoming more readily available to the entire development community, no matter what their size, budget or aspiration.



Meanwhile, on the ARM Booth


Yesterday saw the Expo Floor open up to all attendees and the ARM Booth buzzed with activity for the day. The ARM Lecture Theater, which continues over Thursday and Friday, offered valuable development tips and ideas to attendees with many of our partners joining us to give a talk on their mobile gaming solutions. The Gaming Wall saw avid gamers compete to get the fastest lap in the Sports Car Challenge whilst beautiful designs appeared on the Artist Competition wall. All in all it was a great day on the ARM Booth and I will have more news for you tomorrow!


photo 4.JPGGoo Technologies' Victor Sand wraps up a presentation on the ARM Lecture Theatre


Three attendees discover who is the fastest in the Sports Car Challenge car


Sylwester Bala is on hand to explain the features of Geomerics' Transporter demo


The demos on the ARM wall exemplify how ARM is expanding the mobile gaming experience

GDC is always a blast – it brings technologists, gaming enthusiasts, game developers and artists under one roof. At ARM, we are thriving to ensure we have something for everyone attending GDC. My colleagues Gemma Paris and Phill Smith recently published their blogs -  ARM and the ARM® Mali™ Ecosystem are ready to rock at GDC! and Expanding the Mobile Experience at GDC 2014 highlighting ARM’s activity line-up at GDC.


To add to our massive line-up of activities at GDC, ARM in collaboration with Samsung Developers and Autodesk are running a unique competition for artists – “The Artists Competition @ GDC”.


Thanks to the innovation by ARM and the partner ecosystem, mobile devices today are extremely powerful, support appropriate tools and interface needed for artists to create designs while maintaining power efficiency that ARM has been always known for. Autodesk’s SketchbookPro application for example, helps artists create designs on-the-go on their Samsung Galaxy Note devices.


Renee Busse (@reneedicherri) from the Autodesk SketchbookPro team created some of these awesome designs on her Note.


Autodesk sketchbookpro1.png


Autodesk sketchbookpro2.png


Artists, come play and participate in the ongoing artists competition at the ARM Booth #1616. Artists have an opportunity to win a Galaxy Note 8 for a best design chosen at the end of every day through March 19-21 based on designs created on Samsung Galaxy NotePRO 12.2. based on ARM Mali-T628 with Autodesk SketchPro application. The Galaxy Note 8 are sponsored by Samsung Developers.


Look forward to some more cool designs at GDC…

Tom Olson

Here comes OpenGL® ES 3.1!

Posted by Tom Olson Mar 17, 2014

As I write this, I’m packing my bag for a week in San Francisco, where I’ll be attending the Game Developer’s Conference (GDC). GDC is always a blast, especially now that there’s a big ARM® Mali™ presence, but never more so (for me at least) than this year. The reason is simple; in my role as Khronos OpenGL ES Working Group chair, I’ll have the honor of announcing OpenGL ES 3.1, the latest version of the world’s most widely deployed 3D graphics API. It’s the culmination of a year of very hard work – mostly by other people, but I’m proud to say I was in the room when the work got done. And, of course,  I’m deeply grateful to those who did it. In this blog, I’ll give a personal perspective on what OpenGL ES 3.1 is and why it's important.

What’s in it?


OpenGL ES 3.1 isn’t as obviously a big deal as its predecessor, OpenGL ES 3.0, which added over two dozen major features, and extended both the API and the shading language in almost every possible direction. After all, ES 3.0 took five years to create, and was intended to drive hardware requirements for a new generation of mobile GPUs. ES 3.1, on the other hand, was done in about a year, and is explicitly designed to run on most if not all existing ES 3.0-capable hardware. It’s no wonder that by comparison, it looks like a relatively modest advance. But is it? Here’s my view:


Many of the features in the new API amount to filling in gaps in ES 3.0 (bitfield operations in the shading language! Multidimensional arrays!), and continuing our efforts (which began in ES 3.0) to tighten the specification, improve application portability across implementations, and reduce application and driver overhead.  Don’t get me wrong, these features are very important – they make life much better for programmers, leading ultimately to more, better, and cooler applications for everyone. And I can tell you, specifying and testing them is hard (and essential) work.  But they’re kind of hard to appreciate unless you’re a standards geek, or a graphics programmer.

However, I claim that OpenGL ES 3.1’s headline features are going to change the way we do mobile graphics, in ways that will be obvious to everyone. For my money, there are two that stand out; first, it adds compute shaders, which allow the GPU to be used for general-purpose computing, tightly coupled with GPU-based graphics rendering.  Second, it adds indirect drawing commands, which allow the GPU to read drawing command parameters from memory instead of receiving them directly from the CPU. I’ll explain why that’s important in a moment.


GPU Compute in OpenGL ES 3.1


Compute support in OpenGL ES 3.1 consists of a handful of new features that sound minor when considered individually, but have huge implications when combined.  (This happens all the time in the tech industry. Hypertext is a way of linking related documents and data (remember HyperCard™?), and the internet is a (large) group of networked computers that agree to exchange data using a standard set of protocols.  Put them together, and you get the World-Wide Web, which is a different animal altogether.) 


The first critical compute feature OpenGL ES 3.1 adds is direct access to memory: shader programs can read and write arbitrary data stored in memory buffers or texture images.  The second critical feature is a set of synchronization primitives that allow applications to control the ordering of memory accesses by different threads running in parallel on the GPU, so that results don’t depend on what order the threads run in. The third is the ability to create and dispatch compute shaders, programs for the GPU whose invocations correspond to iterations of a nested loop rather than to graphics constructs like pixels or vertices.


With these features, you can do things like this: Create a 2D array in GPU memory representing points on a piece of cloth, and global data representing objects or forces acting on the cloth. Dispatch a compute shader that creates a thread for every point in the array.  Each thread reads the position and velocity of its point on the cloth, and updates them based on the forces acting on the cloth.



Figure 1: A rather nice carpet, animated by an ES 3.1-style compute shader, has a frightening encounter with a big shiny flying donut. Photo (and demo) courtesy of Sylwester Bala, Mali Demo Team. You can watch the video here.

Indirect Drawing


Indirect drawing sounds even more innocent than the various features that support GPU computing; it just means that the GPU can accept a drawing command whose parameters (such as how many items to draw, and where to find their vertices) are stored in memory, rather than passed as function-call arguments by the CPU.  What makes this interesting is that the memory buffer containing the parameters is fully accessible to the GPU – which means that a compute shader can write them. So for example, an application can fire off a compute shader that generates geometry data into a vertex buffer object, and also fills in an indirect drawing command that describes that data. After the compute shader finishes, the GPU can proceed to render the geometry as described in the buffer, without any additional work by the application or the CPU.


Winding Up


There’s other interesting stuff in OpenGL ES 3.1, but I’m out of space to talk about it. By the time you read this, the official specification will be available in the Khronos OpenGL ES registry, and there’ll be lots of information floating around following GDC presentations by myself and my fellow Working Group members. Incidentally, if you’re attending GDC, I hope you’ll stop by the ARM booth or one of our technical talks, and/or come to the Khronos OpenGL ES session, where we’ll walk through the OpenGL ES 3.1 specification in detail.


When will you see ES 3.1 in consumer devices? It’s up to the device makers, of course; but the Khronos conformance test should be up and running by this summer, and the API is meant to run on existing OpenGL ES 3.0 hardware, so it shouldn’t be terribly long. It will certainly be supported* on the ARM Mali Midgard GPUs.

As always – got comments or questions? Drop me a line…


Tom Olson is Director of Graphics Research at ARM. After a couple of years as a musician (which he doesn't talk about), and a couple more designing digital logic for satellites, he earned a PhD and became a computer vision researcher. Around 2001 he saw the coming tidal wave of demand for graphics on mobile devices, and switched his research area to graphics.  He spends his working days thinking about what ARM GPUs will be used for in 2016 and beyond. In his spare time, he chairs the Khronos OpenGL ES Working Group.

*Based on the published Khronos specification. Product is expected to pass the Khronos Conformance Testing Process when available. Current conformance status can be found at http://www.khronos.org/conformance.

I’m asked quite a lot about how I feel about benchmarks. When I sit down to write these blogs I usually go searching for suitable quotes. For this one I found a quote that perfectly sums up my feeling. 

This is from business leadership consultant Tom Peters:

"... I hate Benchmarking! Benchmarking is Stupid!”

Yep, I’m with Tom on this one, but we may need to qualify that a bit more… back to Tom:

“Why is it stupid?"

"Because we pick the current industry leader and then we launch a five year program, the goal of which is to be as good as whoever was best five years ago, five years from now."

While this statement was originally aimed at business leadership and strategy it is equally true of any type of performance benchmarking.

I’ve spent the last three years directly involved in and most of my 20 year career indirectly involved in the mire that is synthetic benchmarking of GPU's. Everything I've seen leads me to come to the conclusion that GPU benchmarks are a reinforcement of the above statement. They do nothing but focus the attention on macroscopic subsections of performance while purporting to tell you about the holistic performance of a GPU.

It seems a logical statement to say, that in order to provide valuable input to an end consumer’s purchasing decision it is better for GPU benchmarks to reflect real-world use-cases. Understanding how readily a GPU delivers the graphics of a user’s favorite game and the length of time that they can be played at a suitable FPS would be useful information for both consumers and OEMs alike. However, is this really the data that popular benchmarks deliver at the moment?

Desktop GPU benchmarking went through a similar evolution to the one that mobile GPUs are currently undergoing. In its earliest days it consisted of extremely theoretical and somewhat woolly comparisons of architectural triangles/second and pixels/second rates. This later developed into actual applications that purportedly measured tri/s and pix/s before arbitrary spinning objects (spinning tori/donuts, Utah Teapots and Venus de Milo’s) entered the scene, which then led to the stage that the mobile GPU benchmarking scene is at currently: the stage where benchmarks consist of synthetic game scenes designed specifically to test a GPU’s maximum compute capacity. The next development, and where the PC market currently stands, is the comparison of metrics garnered by running actual content - real games - and assessing each GPU’s merits based on that. Well there’s a novel concept! Actually using the content that people are running and care about? Shocker!


What’s wrong with current benchmarks?

Before we go any further, I feel an explanation as to why current benchmarks are not the best representation of GPU performance is needed. Current popular benchmarks claim to stress-test GPUs to discover the maximum number of frames they can deliver in a certain time period. In many ways this seems reasonable – all benchmarking really requires in order to be effective is a single figure derived from a test that is the same for all contenders and maximum compute performance of the GPU fits into this category.

However, there are a number of issues with the way GPU benchmarks do this at the moment. Informing consumers that the device is capable of delivering 300+ frames of a particular content in a fixed time period may be a useful metric in certain circumstances, but it is not when there is no content that the consumer would normally use on his/her device which exercises the GPU in the way the GPU benchmarks currently do.

To the consumer, the figure delivered by benchmarks is completely arbitrary and does not correspond to any experience that he might have of the device. It is easily possible to deliver exactly the same visual experience which the benchmarks use at much higher frame rates or, more appropriate to embedded devices, at a fraction of the energy cost and computing resources if the benchmarks were coded in a more balanced way.

Surely, when the quality of graphics is the same between a benchmark and a popular game, it is better for a consumer to know how well the GPU delivers content that uses regular techniques and balanced workloads rather than an irregularly coded benchmark?

Later we'll look at my "Tao of GPU benchmarks" and discuss what guidelines a benchmark should follow, but first lets take a look under the hood of popular content and the benchmarks that are supposed to mirror them.

But benchmarks look exactly like popular games, so what’s the difference?

As an internal project, ARM has been running in excess of 1M frames of real content from top OpenGL® ES – enabled games on the App Store, including titles such as Angry Birds, Asphalt 7 and Temple Run. We analyse multiple performance areas including CPU load, frames per second, uArch data and a tonne of GPU agnostic API usage and render flow composition data.

When you look at some examples of the data we gather in this sort of analysis, the results are quite striking. Looking at say the imagery in Asphalt 7 and T-Rex HD on the same ARM® Mali™-based 1080p device, you'd see that they appear to show similar levels of graphical user experience. This would leave a user to believe that they are constructed from a broadly similar level of workload. When we look at the results which compare a selection of popular benchmarks and a selection of popular games, we see the following:


Benchmark A

Benchmark B

Benchmark C

Asphalt 7

NFS Most Wanted


Avg. Vert./Frame







Avg Tris./Frame







Avg. Frags./Frame 







Avg. Vert FLOPS/Frame  







Avg. Frag FLOPS/Frame







The first and most striking observation is that whilst the fragment count for benchmarks is similar to that of popular games, the vertex count goes through the roof!  And in fact, when we look more closely at Benchmark C, the use of vertices is in no way efficient.

“Do not use a hatchet to remove a fly from your friend's forehead” - Chinese proverb

The global average for primitive to fragment ratio in this benchmark at 1080p is 1:13.1 which is close to (but just the right side of) our low watermark of 1:10 which we defined in the “Better Living Through (Appropriate) Geometry” blog, compared to a ratio of 1:53 in Asphalt 7. However, examining the content draw call by draw call, 50% of Benchmark C draw calls have a ratio of less than 1:1 primitive to fragment and an additional 24% have a ratio of less than 1:10 - against a recommended guideline of more than 1:10! The same is true for Benchmark B where 66% of the draw calls are producing micropolygons.

Real games are more balanced and consistent with less micro triangles and the majority of draw calls handling more than ten fragments per triangle.


Benchmark providers admit that they use high vertex counts in order to stress GPUs with the justification being that it provides the users with “realistic” feedback on how their GPU will respond to future content. However, as demonstrated, such stress testing is not realistic as it doesn’t accurately reflect the balance of fragment and geometry used in applications that are being used by consumers on a daily basis. While the fragment rate and vertex rate of the real games shows variation, the ratios stay pretty consistent.


Benchmarks vs Real Apps: Bandwidth

One of the major effects of the geometry imbalance shown above is it does not take into account by far the most limiting factor in terms of mobile device performance: the bandwidth. It’s extremely easy to break the bandwidth limit in an instant with these high cost/low visual yield micro polygons (as discussed in “PHENOMENAL COSMIC POWERS! Itty-bitty living space!).

Let’s take a look at the benchmarks and see what the relative bandwidth looks like when compared to the real applications:

Test Name

Frame Buffer



Benchmark A




Benchmark B




Benchmark C




Asphalt 7




NFS Most Wanted





As you can see, again, the real world applications are more consistent in the balance of bandwidth used across the rendering. “Benchmark A” starts off pretty well, but unfortunately it goes off the rails pretty quickly. What we see here is 3-8x more bandwidth being used for the geometry (which, as discussed in “Better living through (appropriate) geometry”, is supposed to be a container for the samples) meaning there is less bandwidth available for fragment generation - which is what the user will actually see.


The future of mobile benchmarking

So, what’s the conclusion? Well, GPU benchmarks generally still have a long way to go, mobile one more so. I am looking forward to the time when, like for desktop and console games, mobile game developers release their own benchmarks using sections from real application workloads, allowing for a far more well-rounded view of the GPU.

Until then, I have a couple of suggestions that will not only make GPU benchmarking a lot more informative for consumers but it will also leave semiconductor companies with more time to worry about how to improve GPU performance for consumer content rather than how to impress customers in the next important benchmark rankings.

I have produced the following “Tao of GPU benchmarks” as a guide which I hope people will follow:


  1. Apply Moore’s Law.
    • Moore’s Law (compute potential doubles every 18 months) applies to GPUs as much as it does CPUs.
    • Year on year the average workload represented in a benchmark should not exceed double the previous year’s and it should remain balanced. This way you don’t attempt to outstrip Moore’s law.
  2. Make it a GPU test not a bandwidth test.
    • The raw bandwidth per frame at 60fps should not exceed the available bandwidth.
    • The baseline for bandwidth should be set at a typical mobile device for the next 24 months
    • Make the objective of the test as independent as possible from whether the device has high bandwidth capacity or not.
  3. Tests should use recognized techniques.
    • Techniques should be aligned with current best practice
    • These techniques should also be relevant to the mobile market
  4. Excessive geometry is not an acceptable proxy for workload.
    • Primitive to fragment ratio per draw call should be balanced.
    • Lots of benchmarks at present have far too much geometry.
    • The 10 frags/prim rule should be the lowest water mark for this.
  5. Overdraw is not an acceptable proxy for workload.
    • Keep it real! An overdraw average in excess of 2x on any surface is not representative.

Continuing on from It’s Just Criminal! Examples of performance thrown away in real Apps (Part 1)” lets look at more criminal behavior and for me what has to be the crime of the year...

“Possession is nine tenths of the law.” Or, “The Untouchables.”

Client side buffers really shouldn’t be your first choice on a mobile platform. Any buffer administered by the client has an unknown history. The driver has no idea what has been done between uses unless it scans it. This can be an extremely costly affair so is mostly not an approach that driver implementers take, preferring to recommend Vertex Buffer Objects (VBOs) instead. Because the status of a client side buffer is not deterministic in a deferred rendering GPU (which is, as previously discussed, about 90% of the mobile market) the driver will have to take a copy of the client side buffer being referenced. This has a bandwidth overhead and also a memory footprint overhead.


VBOs on the other hand have a prescribed interface in the API so it is possible (to a certain extent) to track the provenance of updates to the buffer meaning the driver will only take a copy when it needs to and can often “dirty patch” the updates so it only requires the difference between the pre and post modified buffer. This can save a lot of potential bandwidth.

One of the biggest offences we’ve seen in this category is using VBOs, but uploading the contents of a client side buffer into the VBO for each draw call. This kind of defeats the object. Similarly, overusing dynamic VBO or index buffer object updates using glBufferSubData() etc. causes the same issue. We’ve seen a couple of applications recently which are tweaking several small (in the order of 10-15 vertices) within a larger VBO, which are not localized, on each new draw call. This is not as bad as client side buffers, but if the object is that dynamic it really should be in its own draw call and VBO. 

See this blog for more details of vertex buffer objects in action.

You also need to pay similar attention to the use of glTexSubImage() updates. Remember: in a deferred renderer no draw activity happens until eglSwapBuffers() or similar is called. If you update a texture several times within a frame that means that all possible forms of that texture must exist at the time of rendering. Overuse of partial texture updates can have a detrimental effect on bandwidth and working footprint.

“Up on multiple counts.”

Multiple Render Targets (MRTs), or the ability to target more than one output buffer with a single draw call, is a common way of reducing the need to send expensive geometry multiple times to build up secondary buffers and is often used in app-side deferred render flows (not to be confused with the deferred hardware model). Since this technique is new for OpenGL® ES 3.0 I’ll apply some leniency, but of the applications so far we’ve still seen some suspicious behavior!


MRTs can be implemented very efficiently on deferred rendering GPUs, if you can keep everything in the tile. Guidance from most of the GPU vendors with deferred architectures (i.e. tile based) is to make sure that the sum of bits/fragment fits within the maximum  width on tile storage – bear in mind that each GPU provider will have different criteria, but consensus seems to be 128 bit is a good number to work to. Also keep an eye on the alignment of field for each target (it’s unlikely hardware will allow you to do arbitrary bit field assignments).

As I said there are limited numbers of OpenGL ES 3.0 applications available today, but we have seen at least a couple which use four MRTs (the primary colour buffer and three secondary buffers). In OpenGL® & OpenGL ES all the targets in a MRT need to be the same size and format as the primary. For this use case we had 4xRGBA buffers, which is fine, but when we examined the buffers only 1-2 channels from each target were being used. “So what?” you may say, “It’s all in the tile so I use a little more, big deal”, but at some point you will want to write those buffers back to memory and read them back when you perform your resolve/consolidation pass. It’s going to be a lot cheaper to pack them into two MRTs at full resolution than have to write and read back four.


If you want the optimal implementation of the deferred rendering model and you don’t mind using an extension you might want to take a look at this paper from Sam Martin of Geomerics . By using the extension described, for most cases you can eliminate the need to write back the tile and then read it back as a texture source for the resolve/consolidation pass, saving even more bandwidth. 

Obstructing the GPU in the course of its duties”

Deferred GPUs pipeline the operations required to create a frame. This means that frames move through stages which build a command stream, perform vertex shading and finally perform fragment shading and output. Which means that there are actually three frames in flight and the one you are working on app side is actually Frame N+2. Within this pipeline commands such as glReadPixels(), glCopyTexImage() and Occlusion Queries can block the pipeline and degrade performance if not used carefully… and unfortunately pretty much every app I’ve seen using these mechanisms has committed this crime.

Firstly, if using the command glReadPixels() make sure you use it with pixel buffer objects (PBOs). This schedules the actually pixel read back from the buffer asynchronously (often hardware accelerated) and the glReadPixels command returns to the calling thread immediately without stalling the application. To read the content of the buffer you need to bind and map the PBO (see glMapBuffer()). At the point at which you attempt the map operation if the rendering to the buffer isn’t complete the map operation will still stall until rendering is complete. Therefore the best advice is to pipeline these read backs where possible such that you are using the results from frame N in frame N+2 or, if that’s not possible, to separate the dependent operations as much as possible and then use fence and sync to ensure coherence.You might consider using a shared context and placing the wait for read back on an asynchronous thread. I’d also apply the same advice to glCopyTexImage().

The advice for Occlusion Queries is very similar. Polling for the result of an occlusion query immediately creates a stall (this is true on all GPUs, not just deferred). Therefore the advice is to always pipeline occlusion queries.


“Putting the squeeze on the perp.’”

Not compressing your textures is a bit like speeding. We’ve pretty much all done it, it’s easily done, we don’t think about the consequences and everyone has an excuse, but there isn’t one. However, unlike speeding I think that not compressing textures should be a capital offense. Compressing your textures has massive impact on bandwidth, reducing it 2x, 4x, 8x or more, and is an essential part of authoring for maximum performance in mobile devices.


So what’s the excuse? Back in the days of ETC1, there was the defense of “but it doesn’t do Alpha m’ lud”, that, however, could be worked around (see this Mali Developer article). With the introduction of OpenGL ES 3.0 that defense has been eliminated by the inclusion of ETC2 which now has Alpha support. However this has given rise to the “Matrix Defense”; let me explain…


Consider the “Matrix” below which shows the available compression formats in the world that developers have been used to. Only a very narrow selection of input bit rates, pixel formats and encoding bit rates can be compressed. The defence is that in the “Matrix”, developers can’t get the exact format they want…   


Time to take the red pill. With ASTC this is the new reality:


Adaptive Scalable Texture Compression, the standard developed by ARM and officially adopted by The Khronos Group as an extension to both the OpenGL and OpenGL ES graphics APIs, is the best method available, offering increased quality and fidelity, very low bit-rates and just about every input format you may want or need. Independent testing of ASTC has shown that quality levels similar to 2bits per pixel in existing compression schemes can be achieved using the next level down in ASTC, saving further bandwidth for the same level of quality. So now there is no excuse!


“(Not so) Supergrass!”

To close out this blog, I’d like to give you my personal pick of crimes against performance from 2013. We join the court as the prosecution presents its first witness…

PC Ray Caster:

“If I may present the evidence, your honour….

"Whilst proceeding in our usual routine activities we happened upon some suspicious activity. The performance analysis team regularly, as new games or graphics focused applications appear in the app stores, run them and investigate how they are using the GPU. This helps us maintain a view of trends in workload, application composition etc. which helps us shape the future direction of our roadmap.

“Our attention was attracted to this particular app when we noticed that it was consuming an unusually large amount of bandwidth for the visual complexity of the scene. “Hello, hello, hello!” we said to ourselves, “What’s all this then?” Upon further investigation the crime scene revealed itself. The crime scene consisted of a field of long grass.

“We proceeded to assess the crime scene and the grass was found to be made up of imposters*, which is what we expected as this is a fairly standard technique for rendering scrub/foliage etc. In this particular case the imposters were made up of two quads which intersected each other at the mid-point of the X access at 90°  to each other. Again, this is all fairly standard stuff.

“The forensics team used the Mali Graphics Debugger to look for evidence of foul play and immediately the full horrors of this case began to unfold. As we stepped through the first issue became immediately obvious: the imposters were being drawn back to front. We let the frame complete and then checked the stats. The overdraw map showed a peak in double digits and the texture bandwidth was criminal! The grass was accounting for more than half of the total run-time of the scene.

“Continuing the investigation we found that the texture used for the grass/shrubs was also not MIP Mapped or compressed. Given the viewing angle for the scene and distance from the viewer of each shrub imposter, that meant that most of the imposters were very small causing under sampling of the texture (the mapping between texture pixels and screen pixels was less than 1:1) which was thrashing the cache and causing the excessive bandwidth consumption.

“After some more investigation we also found that rather than using “Punch through Alpha”**, the app had turned on Alpha blending, causing all overdrawn pixels to be blended with each other which was causing the engine to force the back to front ordering (alpha blended objects need to observe back to front ordering for visual correctness).


“Once the crime scene was cleaned up your honor, the application performance improved considerably. Clearly this shows a criminal neglect your honor. That concludes the evidence for the prosecution."


*You basically replace a model with a textured 2D quad which rotates to always remain view port aligned.

Imagine a cardboard cut-out of a tree that follows you to always face you and you are there!

**Transparent texels in an RGBA texture are marked with Alpha = 0 and are discarded in the fragment shader acting as a mask.

All other texels have an Alpha of >0 and are written as opaque pixels, the alpha is not used for blending.

A cheaper way to do this is also to use only RGB texture and pick either black (0:0:0) or white (1.0:1.0:1.0) as the mask value.

Judge Hugh Harshly:


"I believe I've heard enough...

“I find the defendant guilty on the charge of fraudulent use of Alpha modes liable to cause excessive bandwidth consumption, being several times over the legal limit of overdraw while in charge of a GPU, cache abuse, extortion of bandwidth, applying a texture without due care and attention and finally failure to compress a texture... a most heinous crime.

“Do you have anything to say for yourself before I pass sentence?"



"Its a fit up! Society's to blame! What chance did I have growing up with a desktop GPU, I don't know no different do I?"

Judge Hugh Harshly:

Very well… clerk, hand me my black cap would you, there’s a good fellow..."

Here at ARM we continuously work to increase the quality of our driver software. Thorough testing is a significant proportion of our driver production process and its efficiency enables us to deliver drivers that meet and exceed our partners’ quality expectations  sooner than we would otherwise be able  to. You might have seen the announcement made in fall 2013: ARM Ltd., the leading semiconductor intellectual property (IP) supplier, expands its quality assurance processes with the adoption of the OpenCL™ and OpenGL® ES 3.0 test modules of the drawElements Quality Program – the dEQP™. This subscription was all part of ensuring that as the graphics industry evolves, the drivers that we deliver continue to be of the highest standard possible.

Based on our experience with the ARM® Mali™-400 GPU series we certainly had confidence in our test coverage for OpenGL ES 2.0 which we built up over multiple releases. Despite the fact that the ARM Mali-T600 GPU series is a radically new design comprising a unified shader architecture, the pool of test cases targeting the API  version 2.0  was easy to re-use right from the start. But for OpenGL ES 3.0, being a new API, there was barely anything out there - real world OpenGL ES 3.0 content was still to come. We based our initial testing on the conformance test package from Khronos and, to a much larger extent, on in-house testing of the new features. However, we wanted to take the quality of the driver higher than these two processes allow in order to exterminate any stubborn bugs. To do this, an external testing suite was in order.  Why? Well, it’s good that you asked.


For brand new features our in-house testing is typically what you might refer to as "white box" testing. Engineers familiar with the driver’s details develop targeted tests against new features, based on the OpenGL ES 3.0 specification from Khronos. If you want to factor in the inflow of new colleagues one might be willing to shift it into the "gray"-zone, but certainly the tests are not of the "black box" kind. Whereas such internal driver knowledge makes it possible to write tests targeting even very specific driver behaviour it ends up creating a one-sided view of driver performance. Engineers just "know more than they should" for developing black-box tests. Yet such black-box tests are vital to perform because the ultimate end-user, our partner, will not have the same intricate knowledge of ARM software as our engineers and so their actions and experience will be quite different. 


Still, one might raise the question “Your driver passed the conformance tests - what else is left to test then?” There's a short summary written up here describing how one obtains confidence on a per-feature basis from the conformance test package. But ARM is more interested in the combination of features - which is what one typically uses in real world applications – and this has less coverage. So even though we passed conformance, if we did not perform additional in-house testing there could be a higher number of bugs going out and impacting our partners and our only method for finding and fixing them would be partner feedback. Hardly an ideal situation.

So, what were our expectations when licensing an external test suite, adding more than 30,000 new test cases to our in-house testing? Pass all of them? That would have been really cool and deserved plenty of cake (our replacement for beer here in Norway). The reality was that, when running the latest Mali-T600 GPU driver on an Exynos 5250 based platform running Linux with dEQP version 2013.4.1, we happily passed 98.5% of the OpenGL ES 3.0 functional test group and an even larger part for OpenCL 1.1, although we did not pass all of them   - which, at the very least, proved to us the value in the drawElements’ testing suite.

If your testing group tells you that there are roughly a hundred new cases waiting for further investigation, your first response certainly is not "Yeah - great!". But thinking a bit more broadly, maybe it should have been.  Getting over a hundred failing test cases "in" all of a sudden certainly has an impact on daily work and schedules. But that's what we bought them for - to prevent partners and developers from discovering these issues over time. It's better to see all potential issues in one go than waiting for them to trickle in across one or two years from the partner or even the developer community.  Within ARM’s release schedule which is, due to the target market, quite different from what you might be used to from your desktop GPU vendor, there is no room for a "quick-fix" once a driver is out. So everything we find and fix in our products upfront is very important to ARM and our partners.

dEQP provides challenging test cases for a variety of areas. The ones most interesting to us are:

"Exhaustive positive and negative testing of individual API features"

The situation of positive testing is quite clear: if our driver did not allow something the specification requires, we have a severe bug. Luckily for us we passed that hurdle well.

On negative testing the situation is a bit different: In the case that our driver allows things it should not, is this really a problem? Isn't that perhaps more of a feature, given it works in a sane way? Actually, it is a problem as it causes fragmentation on the market and leads to the unfortunate situations of "But this works with Vendor A!".  Those issues will hit developers when they start to migrate from a single development platform into the wild world to test their apps. If "outside the spec" behaviour is considered to be valuable it can always be properly captured in an extension specification.

Similarly, negative testing involves testing error handling by executing API calls which are not even supposed to work due to, for example, using the wrong parameters. Even though it is not standard practice to  base  application behaviour on the specific error code returned, we recognize the importance of correct error codes to assist debugging problems during development (you might want to read further about the debugging extension which eases the situation a bit). Long story short – with the help of dEQP we greatly improved our ability to return the right error codes.


Stress tests”

There is one error an application should always be aware of: The famous GL_OUT_OF_MEMORY. This one is raised whenever there are no system resources left to successfully complete an API call. One scarce resource is the amount of available (and free to use) memory. The drawElements’ test suite covers that part by forcefully driving the system into a low memory state to check how stable the driver can handle the situation.

As we saw during testing, this is a difficult situation to operate in. The Android™ OS, for example, has a low-memory process killer triggering at a higher threshold than the one on plain Linux, sometimes not even leaving time for the application to properly close before it is killed by the system underneath.  Passing these tests on every platform is a challenge, but a challenge that we are overcoming more rapidly with the help of drawElements’ testing suite.

"Precision tests"

Due to the way precision is specified by OpenGL ES 3.0, testing for it is a challenge. Rounding behaviour and INF/NAN handling are implementation defined and only the least amount of precision to maintain is specified. We realize it is challenging to come up with stable test cases as soon as they touch any of these "implementation defined" areas. And a few tests do  touch on these areas . So when it comes to answering the question of whether unexpected (failing) test results are still valid results within the constraints of the specification we spent quite some time verifying that our driver, the GPU compiler backend and finally the GPU all treat 'mediump' and 'lowp' precision qualifiers as mandated by the specification. In the end, the effort between us and drawElements was well spent on those cases. For example, 'mediump' is a great tool for saving memory and bandwidth and to reduce cache pressure where ever possible. But bear in mind that it is up to the developer to ensure that calculations stay within the least limits. For more details I refer you to Tom Olson's detailed series.

"Cut-down real-world use cases like shadow algorithms, post-processing chains for SSAO, complex transform feedback chains"

These test case are the most difficult ones to investigate due to their complexity. We take them one by one and as we build up confidence in the earlier areas we get better and better at pointing out which assumptions might be wrong in the complex test cases.  Sometimes we might even consider test cases as "overly complex/complicated to achieve a specific goal", but the question of "why should one do this" is no excuse if they don't work on our driver.


So far ARM has closely investigated around 130 test failures reported by drawElements’ test suite for OpenGL ES 3.0 which were neither covered by ARM’s existing test set nor by the Khronos conformance test suite. Compare that number to the amount of passed tests, which is over 35000! Roughly half of these failures were real bugs in our drivers, whereas the other half we found targeting behaviour outside of the specification. And what happened with those tests we found to be in conflict with the specification? Well, there are excellent engineers working at drawElements who take feedback seriously and certainly won't accept it blindly. Brief e-mail exchange was usually enough to decide whether the fixes were needed in the test case or the implementation. If a case is really ambiguous and not easily decidable based on the current specification we can raise the problem together within the Khronos Group to find an agreement amongst all participating members.

Last but not least - such a big test suite is also valuable for automated regression testing. Whenever you pass you must remember it was just a snapshot of your code base which was okay. New features, optimizations, refactoring, bug fixes - all carry a risk of breaking something unwontedly. With dEQP now part of our regression testing safety net, ARM’s confidence in the quality of our releases is even stronger.

In the first two blogs of this series I introduced the frame-level pipelining [The Mali GPU: An Abstract Machine, Part 1 - Frame Pipelining] and tile based rendering architecture [The Mali GPU: An Abstract Machine, Part 2 - Tile-based Rendering] used by the Mali GPUs, aiming to develop a mental model which developers can use to explain the behavior of the graphics stack when optimizing the performance of their applications.


In this blog I will finish the construction of this abstract machine, forming the final component: the Mali GPU itself.  This blog assumes you have read the first two parts in the series, so I would recommend starting with those if you have not read them already.


GPU Architecture


The "Midgard" family of Mali GPUs  (the Mali-T600 and Mali-T700 series) use a unified shader core architecture, meaning that only a single type of shader core exists in the design. This single core can execute all types of programmable shader code, including vertex shaders, fragment shaders, and compute kernels.


The exact number of shader cores present in a particular silicon chip varies; our silicon partners can choose how many shader cores they implement based on their performance needs and silicon area constraints. The Mali-T760 GPU can scale from a single core for low-end devices all the way up to 16 cores for the highest performance designs, but between 4 and 8 cores are the most common implementations.



The graphics work for the GPU is queued in a pair of queues, one for vertex/tiling workloads and one for fragment workloads, with all work for one render target being submitted as a single submission into each queue. Workloads from both queues can be processed by the GPU at the same time, so vertex processing and fragment processing for different render targets can be running in parallel (see the first blog for more details on this pipelining methodology). The workload for a single render target is broken into smaller pieces and distributed across all of the shader cores in the GPU, or in the case of tiling workloads (see the second blog in this series for an overview of tiling) a fixed function tiling unit.


The shader cores in the system share a level 2 cache to improve performance, and to reduce memory bandwidth caused by repeated data fetches. Like the number of cores, the size of the L2 is configurable by our silicon partners, but is typically in the range of 32-64KB per shader core in the GPU depending on how much silicon area is available. The number and bus width of the memory ports this cache has to external memory is configurable, again allowing our partners to tune the implementation to meet their performance, power, and area needs. In general we aim to be able to write one 32-bit pixel per core per clock, so it would be reasonable to expect an 8-core design to have a total of 256-bits of memory bandwidth (for both read and write) per clock cycle.


Mali GPU Shader Core


The Mali shader core is structured as a number of fixed-function hardware blocks wrapped around a programmable "tripipe" execution core. The fixed function units perform the setup for a shader operation - such as rasterizing triangles or performing depth testing - or handling the post-shader activities - such as blending, or writing back a whole tile's worth of data at the end of rendering. The tripipe itself is the programmable part responsible for the execution of shader programs.




The Tripipe


There are three classes of execution pipeline in the tripipe design: one handling arithmetic operations, one handling memory load/store and varying access, and one handling texture access. There is one load/store and one texture pipe per shader core, but the number of arithmetic pipelines can vary depending on which GPU you are using; most silicon shipping today will have two arithmetic pipelines, but GPU variants with up to four pipelines are also available.


Massively Multi-threaded Machine


Unlike a traditional CPU architecture, where you will typically only have a single thread of execution at a time on a single core, the tripipe is a massively multi-threaded processing engine. There may well be hundreds of hardware threads running at the same time in the tripipe, with one thread created for each vertex or fragment which is shaded. This large number of threads exists to hide memory latency; it doesn't matter if some threads are stalled waiting for memory, as long as at least one thread is available to execute then we maintain efficient execution.


Arithmetic Pipeline: Vector Core


The arithmetic pipeline (A-pipe) is a SIMD (single instruction multiple data) vector processing engine, with arithmetic units which operate on 128-bit quad-word registers. The registers can be flexibly accessed as either 2 x FP64, 4 x FP32, 8 x FP16, 2 x int64, 4 x int32, 8 x int16, or 16 x int8. It is therefore possible for a single arithmetic vector task to operate on 8 "mediump" values in a single operation, and for OpenCL kernels operating on 8-bit luminance data to process 16 pixels per SIMD unit per clock cycle.


While I can't disclose the internal architecture of the arithmetic pipeline, our public performance data for each GPU can be used to give some idea of the number of maths units available. For example, the Mali-T760 with 16 cores is rated at 326 FP32 GFLOPS at 600MHz. This gives a total of 34 FP32 FLOPS per clock cycle for this shader core; it has two pipelines, so that's 17 FP32 FLOPS per pipeline per clock cycle. The available performance in terms of operations will increase for FP16/int16/int8 and decrease for FP64/int64 data types.


Texture Pipeline


The texture pipeline (T-pipe) is responsible for all memory access to do with textures. The texture pipeline can return one bilinear filtered texel per clock; trilinear filtering requires us to load samples from two different mipmaps in memory, so requires a second clock cycle to complete.


Load/Store Pipeline


The load/store pipeline (LS-pipe) is responsible for all memory accesses which are not related to texturing.  For graphics workloads this means reading attributes and writing varyings during vertex shading, and reading varyings during fragment shading. In general every instruction is a single memory access operation, although like the arithmetic pipeline they are vector operations and so could load an entire "highp" vec4 varying in a single instruction.


Early ZS Testing and Late ZS Testing


In the OpenGL ES specification "fragment operations" - which include depth and stencil testing - happen at the end of the pipeline, after fragment shading has completed. This makes the specification very simple, but implies that you have to spend lots of time shading something, only to throw it away at the end of the frame if it turns out to be killed by ZS testing. Coloring fragments just to discard them would cost a huge amount of performance and wasted energy, so where possible we will do ZS testing early (i.e. before fragment shading), only falling back to late ZS testing (i.e. after fragment shading) where it is unavoidable (e.g. a dependency on fragment which may call "discard" and as such has indeterminate depth state until it exits the tripipe).


In addition to the traditional early-z schemes, we also have some overdraw removal capability which can stop fragments which have already been rasterized from turning into real rendering work if they do not contribute to the output scene in a useful way. My colleague Sean Ellis has a great blog looking at this technology - Killing Pixels - A New Optimization for Shading on ARM Mali GPUs - so I won't dive into any more detail here.


GPU Limits


Based on this simple model it is possible to outline some of the fundamental properties underpinning the GPU performance.


  • The GPU can issue one vertex per shader core per clock
  • The GPU can issue one fragment per shader core per clock
  • The GPU can retire one pixel per shader core per clock
  • We can issue one instruction per pipe per clock, so for a typical shader core we can issue four instructions in parallel if we have them available to run
    • We can achieve 17 FP32 operations per A-pipe
    • One vector load, one vector store, or one vector varying per LS-pipe
    • One bilinear filtered texel per T-pipe
  • The GPU will typically have 32-bits of DDR access (read and write) per core per clock [configurable]


If we scale this to a Mali-T760 MP8 running at 600MHz we can calculate the theoretical peak performance as:


  • Fillrate:
    • 8 pixels per clock = 4.8 GPix/s
    • That's 2314 complete 1080p frames per second!
  • Texture rate:
    • 8 bilinear texels per clock = 4.8 GTex/s
    • That's 38 bilinear filtered texture lookups per pixel for 1080p @ 60 FPS!
  • Arithmetic rate:
    • 17 FP32 FLOPS per pipe per core = 163 FP32 GFLOPS
    • That's 1311 FLOPS per pixel for 1080p @ 60 FPS!
  • Bandwidth:
    • 256-bits of memory access per clock = 19.2GB/s read and write bandwidth1.
    • That's 154 bytes per pixel for 1080p @ 60 FPS!


OpenCL and Compute


The observant reader will have noted that I've talked a lot about vertices and fragments - the staple of graphics work - but have mentioned very little about how OpenCL and RenderScript compute threads come into being inside the core. Both of these types of work behave almost identically to vertex threads - you can view running a vertex shader over an array of vertices as a 1-dimensional compute problem. So the vertex thread creator also spawns compute threads, although more accurately I would say the compute thread creator also spawns vertices .


Next Time ...


This blog concludes the first chapter of this series, developing the abstract machine which defines the basic behaviors which an application developer should expect to see for a Mali GPU in the Midgard family. Over the rest of this series I'll start to put this new knowledge to work, investigating some common application development pitfalls, and useful optimization techniques, which can be identified and debugged using the Mali integration into the ARM DS-5 Streamline profiling tools.


EDIT: Next blog now available:


Comments and questions welcomed as always,






  1. ... 19.2GB/s subject to the ability of the rest of the memory system outside of the GPU to give us data this quickly. Like most features of an ARM-based chip, the down-stream memory system is highly configurable in order to allow different vendors to tune power, performance, and silicon area according to their needs. For most SoC parts the rest of the system will throttle the available bandwidth before the GPU runs out of an ability to request data. It is unlikely you would want to sustain this kind of bandwidth for prolonged periods, but short burst performance is important.


Pete Harris is the lead performance engineer for the Mali OpenGL ES driver team at ARM. He enjoys spending his time working on a whiteboard and determining how to get the best out of combined hardware and software compute sub-systems. He spends his working days thinking about how to make the ARM Mali drivers even better.

The annual Game Developers Conference (GDC) is fast approaching. The ARM team is in the midst of finalizing preparations to try and make this GDC our best one yet – it will definitely be our biggest! We have some great activities planned for the week and will be joined by a host of ARM partners at both the ARM booth (#1616) and the speaking sessions in West Hall Room 3014, many of whom have announcements which we can’t wait to share!


ARM GDC Booth.png

Joining us on the ARM Booth...

There is a wealth of variety in the tools, game engines and middleware available to developers who are looking to expand into the mobile gaming market.  These can help ease the effort of development, improve the performance of your app or even add differentiating features.  GDC is a great opportunity for developers to discover these and at the ARM booth this year we will be joined by an array of Ecosystem partners who offer such solutions, all of which have been optimized for ARM-based devices.  We are especially excited to be joined by several Chinese partners (Tencent, Sohu, Perfect World and Chukong/Cocos2d-x) who will display their latest gaming technology and demonstrate the huge opportunities that are emerging in the Chinese mobile gaming market. Between these partners there are some exciting developments which it is worth checking out: Sohu has recently released an open source version of their Genesis-3d engine; Chukong/Cocos2d-x have launched their new Cocos3d-x engine; the most popular Chinese 3D mobile games will be demonstrated by Tencent and Perfect World will be there with their ECHO-es engine which powers their famous 3D adventure and fantasy MMORPG.


Also on the ARM booth will be PlayCanvas and Goo Technologies who are showcasing their HTML5 solutions, optimized for mobile, which can enable you to make rich and engaging browser-based games (which could be fantastic for the DTV market, for example). Geomerics Enlighten™ technology, which has already been revolutionizing the console gaming experience and is set to rapidly accelerate the transition to photo realistic graphics in mobile, will also be on hand to explain to developers how you can get access to their solution. The Marmalade staff will detail the benefits of their SDK, which allows developers to code once and distribute cross platform, across any mobile device, as well as Marmalade Quick, Web Marmalade and Marmalade Juice which all build on top of the SDK, providing alternatives for developers using Lua, HTML or Objective-C® software.  You can also learn how to boost you code’s performance at Umbra’s pod where they will show their smart culling technology.  The Havok™ Anarchy™ engine will be on a pod and at Simplygon’s pod developers can discover how to optimize their content for ARM-based, mobile platforms to deliver the AAA game which you aimed for.  Finally, the Samsung Developers pod will host the Samsung Chord SDK which is designed to help developers who have no knowledge of networking develop local information-sharing applications and apply them to games, like Asphalt 8 by Gameloft.



Project Anarchy - vForge.png

Screenshot from Project Anarchy

Find out more about ARM technology


ARM will be presenting five times at GDC (all in West Hall Room 3014!), covering key subjects of the moment for mobile gamers. We will discuss ARM’s recent game middleware technology acquisition (the Enlighten solution by Geomerics) and how this solution is developing, along with how to efficiently use and develop for a deferred rendering architecture (the architecture mainly used for battery constrained mobile devices). Gameloft will join us for another talk to go through the different methods which you could use to optimize AAA games for mobile. The next talk will cover OpenGL® ES 3.0 and beyond, with a good sideline on ASTC Full Profile and how its flexibility and quality can make the world of difference to your games. Covered in our fourth talk will be how you can best maximize the performance of the ARM Cortex® CPU, the IP within over 95% of mobile devices, by using ARM NEON™ technology and/or multithreading. Finally our last (but not least) talk of the show covers the proven tools ARM offers to game developers and, together with our partner, EA, we will show attendees how best to use these tools to create fantastic, high-performing games.

The full talk abstracts and schedule can be found here.


Discuss the latest techniques with industry experts


For those who want to learn a little more about the solutions on the ARM booth or who can’t make our talks above, ARM engineers and our partners are going to be presenting at the ARM Lecture Theatre within our booth. They will be joined by other ARM partners such as Testin and their cloud testing platform, Autodesk live-drawing professional quality digital art using their SketchBook® Pro software optimized for Android devices and Guild Software with their renowned Vendetta Online. The schedule is packed with items of interest to mobile developers, so please check the schedule and topics here to discover what might interest you.

In the link you will also find several other talks from the ARM Developer Education team, covering best practices of developing games for mobile, optimizing your WebGL™ application for mobile, using the ARM tools for game development and the different content preview of the five ARM GDC sponsored sessions.


Game Challenge Wall.png


Show off your skills in the ARM Game Challenge!


Game developers can chill out in ARM’s Gaming Zone where hit apps which have been optimized to deliver great visual effects using the ARM tools will be available – these include such titles as VW Sports Car Challenge, Real Racing 3, and Asphalt 8. If you’re feeling lucky, take part in the Sports Car Challenge 2 for your chance to win a Samsung Galaxy Note 10.1 2014 Edition – prove yourself to be the fastest gamer of the show by getting the fastest lap time and claim your prize at the end of the day!

Discover a new way to create digital art in the ARM Artist Competition


ARM, together with Autodesk and Samsung Developers, will be highlighting the digital art capabilities of the Samsung Galaxy tablets. The Samsung Galaxy NotePRO 12.2 is a well-recognized tool for developers which enables them to design on the go, freeing themselves from their desk – and the SketchBook Pro software by Autodesk is great for this task too. Test the devices on the ARM booth to discover their potential - and if you draw the best picture of the day you will be in with a chance of winning a Samsung Galaxy Note 8!

For more information on the Artist Competition, check out Akshay’s blog, coming here soon.

For information on the ARM demos on show at GDC, check out Phill Smith’s blog Expanding the Mobile Gaming Experience at GDC 2014

Meanwhile, I need to get back to my GDC preparation – I look forward to meeting everyone next week at GDC, please drop by the ARM booth and say hello!

Filter Blog

By date:
By tag: