Get started with compute shaders

Last month I was at the Game Developers Conference (GDC), where I had a fabulous time attending various talks and roundtables and visiting exhibitors. I particularly enjoyed showing and explaining to people the latest technologies developed within ARM, such as ASTC 3D HDR textures and Transaction Elimination, as well as compute shaders.

With regards to the last of these, many of you have been curious about how to incorporate this piece of technology into your software. With that in mind, I decided to write this blog to help you write a simple program with compute shaders. I hope it will help you to create more advanced applications based on this technology.

So, what are compute shaders? Compute shaders introduce heterogeneous GPU Compute from within the OpenGL® ES API; the same API and shading language which are used for graphics rendering. Now that compute shaders have been introduced to the API, developers do not have to learn another API in order to make use of GPU Compute. The compute shader is just another type of shader in addition to the already broadly known vertex and fragment shaders.

Compute shaders give developers a lot of freedom to implement complex algorithms and make use of GPU parallel programming. Although the contemporary graphics pipeline is very flexible, developers still tend to stumble on some of its restrictions. The compute shader feature, however, frees us from thinking in terms of pipeline stages, as we are used to doing with vertex and fragment shaders; we are no longer restricted by the inputs and outputs of certain pipeline stages. The Shader Storage Buffer Object (SSBO) feature, for instance, has been introduced along with compute shaders and gives additional possibilities for exchanging data between pipeline stages, as well as acting as flexible input and output for compute shaders.

Below you can find a simple example of how to implement compute shaders within your application. The example calculates a coloured circle with a given radius; the radius is a uniform parameter passed by the application and is updated every frame in order to animate the circle. The whole circle is drawn using points, which are stored as vertices within a Vertex Buffer Object (VBO). The VBO is mapped onto an SSBO (without any extra copy in memory) and passed to the compute shader.


Let’s start by writing the OpenGL ES Shading Language (ESSL) compute shader code first:

#version 310 es

// The uniform parameter which is passed from the application every frame.
uniform float radius;

// Declare a custom data struct, which represents either a vertex or a colour.
struct Vector3f
{
      float x;
      float y;
      float z;
      float w;
};

// Declare the custom data type, which represents one point of a circle:
// the vertex position and colour respectively.
// As you may have already noticed, this defines interleaved data within
// the buffer, which is Vertex|Colour|Vertex|Colour|…
struct AttribData
{
      Vector3f v;
      Vector3f c;
};

// Declare the input/output buffer from/to which we will read/write data.
// In this particular shader we only write data into the buffer.
// If you do not want your data to be aligned by the compiler, try using
// the packed or shared keyword instead of std140.
// We also bind the buffer to index 0. You need to set the buffer binding
// in the range [0..3] – this is the minimum range approved by Khronos.
// Notice that various platforms might support more indices than that.
layout(std140, binding = 0) buffer destBuffer
{
      AttribData data[];
} outBuffer;

// Declare the work group size. In our case it is 8x8, which gives
// a group size of 64.
layout (local_size_x = 8, local_size_y = 8, local_size_z = 1) in;

// Declare the main program function, which is executed once
// glDispatchCompute is called from the application.
void main()
{
      // Read the current global position for this thread
      ivec2 storePos = ivec2(gl_GlobalInvocationID.xy);

      // Calculate the global number of threads (size) for this dispatch
      uint gWidth = gl_WorkGroupSize.x * gl_NumWorkGroups.x;
      uint gHeight = gl_WorkGroupSize.y * gl_NumWorkGroups.y;
      uint gSize = gWidth * gHeight;

      // Since we have a 1D array we need to calculate the offset.
      uint offset = uint(storePos.y) * gWidth + uint(storePos.x);

      // Calculate an angle for the current thread
      float alpha = 2.0 * 3.14159265359 * (float(offset) / float(gSize));

      // Calculate the vertex position based on the already calculated angle
      // and radius, which is given by the application
      outBuffer.data[offset].v.x = sin(alpha) * radius;
      outBuffer.data[offset].v.y = cos(alpha) * radius;
      outBuffer.data[offset].v.z = 0.0;
      outBuffer.data[offset].v.w = 1.0;

      // Assign a colour for the vertex
      outBuffer.data[offset].c.x = float(storePos.x) / float(gWidth);
      outBuffer.data[offset].c.y = 0.0;
      outBuffer.data[offset].c.z = 1.0;
      outBuffer.data[offset].c.w = 1.0;
}
Once the compute shader code has been written, it is time to make it work in our application. Within the application you need to create a compute shader, which is just a new type of shader (GL_COMPUTE_SHADER), and the other calls related to the initialisation remain the same as for vertex and fragment shaders. See below for a snippet of code which creates the compute shader and also checks for both compilation and linking errors:

// Create the compute program, to which the compute shader will be assigned
gComputeProgram = glCreateProgram();

// Create and compile the compute shader
GLuint mComputeShader = glCreateShader(GL_COMPUTE_SHADER);
glShaderSource(mComputeShader, 1, &computeShaderSrcCode, NULL);
glCompileShader(mComputeShader);

// Check if there were any issues when compiling the shader
int rvalue;
glGetShaderiv(mComputeShader, GL_COMPILE_STATUS, &rvalue);
if (!rvalue)
{
       glGetShaderInfoLog(mComputeShader, LOG_MAX, &length, log);
       printf("Error: Compiler log:\n%s\n", log);
       return false;
}

// Attach and link the shader to the compute program
glAttachShader(gComputeProgram, mComputeShader);
glLinkProgram(gComputeProgram);

// Check if there were any issues when linking the shader
glGetProgramiv(gComputeProgram, GL_LINK_STATUS, &rvalue);
if (!rvalue)
{
       glGetProgramInfoLog(gComputeProgram, LOG_MAX, &length, log);
       printf("Error: Linker log:\n%s\n", log);
       return false;
}

So far we have created the compute shader on the GPU. Now we need to set up the handles which will be used for setting up the shader's inputs and outputs. In our case we need to retrieve the radius uniform location and set gIndexBufferBinding (an integer variable) to 0, as the binding was hardcoded with binding = 0. Using this index we will be able to bind the VBO to it and write data from within the compute shader to the VBO:

// Bind the compute program in order to read the radius uniform location.
glUseProgram(gComputeProgram);

// Retrieve the radius uniform location
iLocRadius = glGetUniformLocation(gComputeProgram, "radius");

// See the compute shader: "layout(std140, binding = 0) buffer destBuffer"
gIndexBufferBinding = 0;

Okay, so far so good. Now we are ready to kick off the compute shader and write data to the VBO. The snippet of code below shows how to bind the VBO to the SSBO and submit a compute job to the GPU:

// Bind the compute program
glUseProgram(gComputeProgram);

// Set the radius uniform
glUniform1f(iLocRadius, (float)frameNum);

// Bind the VBO onto the SSBO, which is going to be filled in within the
// compute shader.
// gIndexBufferBinding is equal to 0 (same as the compute shader binding)
glBindBufferBase(GL_SHADER_STORAGE_BUFFER, gIndexBufferBinding, gVBO);

// Submit the job for compute shader execution.
// As a result the function is called with the following parameters:
// glDispatchCompute(2, 2, 1)
glDispatchCompute((NUM_VERTS_H % GROUP_SIZE_WIDTH + NUM_VERTS_H) / GROUP_SIZE_WIDTH,
                  (NUM_VERTS_V % GROUP_SIZE_HEIGHT + NUM_VERTS_V) / GROUP_SIZE_HEIGHT,
                  1);

// Unbind the SSBO buffer.
// gIndexBufferBinding is equal to 0 (same as the compute shader binding)
glBindBufferBase(GL_SHADER_STORAGE_BUFFER, gIndexBufferBinding, 0);

As you may have already noticed, for the glDispatchCompute function we pass the number of work groups rather than the number of threads to be executed. In our case we execute 2x2x1 groups, which gives 4 groups. However, the real number of threads (kernels) executed will be 4 x [8 x 8], which results in 256 threads. The numbers 8x8 come from the compute shader source code, as we hardcoded them within the shader.

So far we have written the compute shader source code, compiled and linked it, initialised the handles and dispatched the compute job. Now it's time to render the results on screen. Before we do that, however, we need to remember that jobs are submitted to and executed on the GPU in parallel, so we must make sure the compute shader has finished its job before the draw command starts fetching data from the VBO, which the compute shader updates. In this example you won't see much runtime difference with or without synchronisation, but once you implement more complex algorithms with more dependencies, you will notice how important it is to have synchronisation.

// Call this function before we submit, to the GPU, a draw call which uses
// the dependency buffer
glMemoryBarrier(GL_VERTEX_ATTRIB_ARRAY_BARRIER_BIT);

// Bind VBO
glBindBuffer(GL_ARRAY_BUFFER, gVBO);

// Bind the vertex and fragment rendering shaders
// (gRenderProgram is the linked render program; the vertex attribute
// arrays for position and colour are assumed to be set up elsewhere)
glUseProgram(gRenderProgram);

// Draw points from VBO
glDrawArrays(GL_POINTS, 0, NUM_VERTS);

In order to present the VBO results on screen you can use vertex and fragment programs, which are shown below.

Vertex shader:

attribute vec4 a_v4Position;
attribute vec4 a_v4FillColor;

varying vec4 v_v4FillColor;

void main()
{
      v_v4FillColor = a_v4FillColor;
      gl_Position = a_v4Position;
}
Fragment shader:

varying vec4 v_v4FillColor;

void main()
{
      gl_FragColor = v_v4FillColor;
}
I think that’s all for this blog, and hopefully I will be able to cover more technical details in the future. I believe you will find compute shaders friendly and easy to use in your work. I personally enjoyed implementing the Cloth Simulation demo, one of ARM’s latest technical demos, which was released at GDC. The important thing in my view is that now, once a developer is used to OpenGL ES, it is easy to move on to GPU Compute using just one API. More than that, exchanging data between graphics and compute buffers is done in a clean and transparent way for developers. Don’t limit your imagination to this blog’s example application; it is only meant to help developers learn how to use compute shaders. I personally see real potential in image processing, as you can implement algorithms that execute on the chip using internal memory, which reduces traffic on the bus between memory and the chip.

You can also have a look at our latest Cloth Simulation demo, which has been implemented with compute shaders. See the video below:

  • Very nice article. On which hardware was this tested, and which driver will a developer have to use to start coding compute shaders in OpenGL ES 3.1?

  • Hi nazcaspider, thank you for posting your comment. We have tested the demo on the Samsung Galaxy Note 10.1 development platform (the stock device with custom drivers). Compute shaders are supported in the hardware, which is based on the Mali T628, but a driver update is required to support OpenGL ES 3.1. We expect vendors to release device updates in a few months.

  • Hi!

    A little late to the party, but I hope you can answer some questions regarding OGL ES 3.1 on Mali products..

    For the broader questions it would be better if ARM could make a post answering them (i.e. what optional ES 3.1 extensions they will support, extensions on Mali T760 vs T6xx (tessellation?), etc..)

    First, it seems you are a little late vs the competition, as both Qualcomm and Imagination are shipping OGL ES 3.1 emulators and Qualcomm is even shipping some ES 3.1 samples..

    Can you provide some ETA for the ARM Mali Emulator supporting ES 3.1 and the SDK having 3.1 samples like those shown here?

    Also, having info about support for the new optional 3.1 extensions would be nice.. Intel showed at GDC that they will support basically all of them:

    1. GL_OES_sample_shading
    2. GL_OES_sample_variables
    3. GL_OES_shader_image_atomic
    4. GL_OES_shader_multisample_interpolation
    5. GL_OES_texture_stencil8
    6. GL_OES_texture_storage_multisample_2d_array

    + tessellation also..

    Nvidia with Tegra K1, we know from GFXBench reports, will support all of those + tessellation etc.. even things like GL_EXT_texture_view etc..

    Also Qualcomm has said it will support tessellation, so the only ones yet to confirm support for tessellation and geometry shaders seem to be Imagination and ARM Mali products..

    and even from Qualcomm and Imagination emulators you can infer what optional extensions they will support..

    So it seems ARM is the only vendor for which it's impossible to gather this info.. can you provide it?

    Also, from GFXBench it seems Mali T760 is coming soon; can we expect more extensions there, i.e. tessellation shaders, as Midgard doesn't seem to support them..

    and finally one little detail:

    It seems core ES 3.1 doesn't need to support image load/store (and SSBOs) in fragment shaders, so OIT algorithms rendering an A-buffer with a per-pixel linked list aren't possible in core.. there is a gl_MaxImagesFragmentShader or something similar, and I would like to know its value on both Mali T6xx and T7xx series..

    Many thanks..

  • Hi oscarbg,

    Thank you for posting all these important questions. I will do my best to answer them.

    Emulator OpenGL ES 3.1 - we are expecting the emulator to be released later this year. We have the OpenGL ES 3.0 emulator, which is conformant, whereas the OpenGL ES 3.1 conformance tests are not available yet.

    In terms of consumer devices, I have already answered that in reply to nazcaspider's question. But I would also like to add that once the conformance tests for OpenGL ES 3.1 are available, we expect the current generation of Mali GPUs to pass them.

    With regards to the extensions, we believe in driving the API to one standard and we are working with Khronos to make this happen. In general we believe that a better route to enable new features is to introduce them in core API rather than through extensions. All we need is to identify the extensions which are essential and reasonable for mobile platforms and introduce them in our implementation.



  • The use of glMemoryBarrier() here is slightly wrong. glMemoryBarrier is supposed to target how the incoherent data is read, not written. GL_VERTEX_ATTRIB_ARRAY_BARRIER_BIT would be the correct barrier here since the data is used for vertex drawing after compute (pp. 115 in ES3.1 spec).

  • Hi themaister,

    Very good spot! What you wrote is indeed the proper way of doing it.

    I have updated the snippet of code above according to your comment.

    Thank you!

  • I ended up here by accident while looking for a description of compute shaders. I take a look, and it's Sylwek! Greetings! Get in touch when you're in Szczecin.


  • Hello,

    A bit late on this but...

    I have a Galaxy Note 10.1. What do I have to do in order to start using compute shaders on it? Are the drivers available yet?



  • Hi Eyal,

    GLES driver updates for stock devices are pushed by the device vendor in the form of OTA updates. We sometimes release vanilla builds of the userspace driver for certain devices at ARM Mali Midgard GPU User Space Drivers on the Mali Developer Center, but the Note 10.1 is not currently supported directly by us.