Last month I was at the Game Developers Conference (GDC), where I had a fabulous time attending various talks and roundtables and visiting exhibitors. I particularly enjoyed showing and explaining to people the latest technologies developed within Arm, such as ASTC 3D HDR textures and Transaction Elimination, as well as compute shaders.
Regarding the last of these, many of you have been curious about how to incorporate this technology into your software. With that in mind, I decided to write this blog to help you write a simple program with compute shaders. I hope it will help you to create more advanced applications based on this technology.
Compute shaders introduce heterogeneous GPU Compute from within the OpenGL ES API; the same API and shading language which are used for graphics rendering. Now that compute shaders have been introduced to the API, developers do not have to learn another API in order to make use of GPU Compute. The compute shader is just another type of shader in addition to the already broadly known vertex and fragment shaders.
Compute shaders give developers a lot of freedom to implement complex algorithms and make use of GPU parallel programming. Although the contemporary graphics pipeline is very flexible, developers still tend to stumble over some of its restrictions. With compute shaders, however, we no longer have to think in terms of pipeline stages, as we are used to doing with vertex and fragment shaders, and we are no longer restricted by the inputs and outputs of particular stages. The Shader Storage Buffer Object (SSBO) feature, for instance, was introduced along with compute shaders; it provides additional ways of exchanging data between pipeline stages, as well as a flexible input and output for compute shaders.
Below you can find a simple example of how to use compute shaders within your application. The example computes a coloured circle with a given radius; the radius is a uniform parameter passed by the application and is updated every frame in order to animate the circle. The whole circle is drawn using points, which are stored as vertices within a Vertex Buffer Object (VBO). The VBO is bound as an SSBO (without any extra copy in memory) and written by the compute shader.
Let’s start by writing the OpenGL ES Shading Language (ESSL) compute shader code first:
#version 310 es

// The uniform parameter which is passed from the application every frame.
uniform float radius;

// Declare a custom data struct, which represents either a vertex or a colour.
// (Despite the name it holds four components, matching a vec4.)
struct Vector3f
{
    float x;
    float y;
    float z;
    float w;
};

// Declare the custom data type, which represents one point of a circle:
// the vertex position and colour respectively.
// As you may have already noticed, this defines interleaved data within
// the buffer: Vertex|Colour|Vertex|Colour|...
struct AttribData
{
    Vector3f v;
    Vector3f c;
};

// Declare the input/output buffer from/to which we will read/write data.
// In this particular shader we only write data into the buffer.
// If you do not want your data to be aligned by the compiler, use
// packed or shared instead of the std140 keyword.
// We also bind the buffer to index 0. You need to set the buffer binding
// in the range [0..3] - this is the minimum range approved by Khronos.
// Notice that various platforms might support more indices than that.
layout(std140, binding = 0) buffer destBuffer
{
    AttribData data[];
} outBuffer;

// Declare the group size. In our case it is 8x8, which gives
// a group size of 64.
layout(local_size_x = 8, local_size_y = 8, local_size_z = 1) in;

// Declare the main program function, which is executed once
// glDispatchCompute is called from the application.
void main()
{
    // Read the current global position for this thread.
    ivec2 storePos = ivec2(gl_GlobalInvocationID.xy);

    // Calculate the global number of threads (size) for this dispatch.
    uint gWidth = gl_WorkGroupSize.x * gl_NumWorkGroups.x;
    uint gHeight = gl_WorkGroupSize.y * gl_NumWorkGroups.y;
    uint gSize = gWidth * gHeight;

    // Since we have a 1D array we need to calculate the offset.
    uint offset = uint(storePos.y) * gWidth + uint(storePos.x);

    // Calculate an angle for the current thread.
    float alpha = 2.0 * 3.14159265359 * (float(offset) / float(gSize));

    // Calculate the vertex position based on the angle calculated above
    // and the radius, which is given by the application.
    outBuffer.data[offset].v.x = sin(alpha) * radius;
    outBuffer.data[offset].v.y = cos(alpha) * radius;
    outBuffer.data[offset].v.z = 0.0;
    outBuffer.data[offset].v.w = 1.0;

    // Assign a colour to the vertex.
    outBuffer.data[offset].c.x = float(storePos.x) / float(gWidth);
    outBuffer.data[offset].c.y = 0.0;
    outBuffer.data[offset].c.z = 1.0;
    outBuffer.data[offset].c.w = 1.0;
}
Once the compute shader code has been written, it is time to make it work in our application. Within the application you need to create a compute shader, which is just a new type of shader (GL_COMPUTE_SHADER), and the other calls related to the initialisation remain the same as for vertex and fragment shaders. See below for a snippet of code which creates the compute shader and also checks for both compilation and linking errors:
// Create the compute program, to which the compute shader will be assigned
gComputeProgram = glCreateProgram();

// Create and compile the compute shader
GLuint mComputeShader = glCreateShader(GL_COMPUTE_SHADER);
glShaderSource(mComputeShader, 1, &computeShaderSrcCode, NULL);
glCompileShader(mComputeShader);

// Check if there were any issues when compiling the shader
int rvalue;
glGetShaderiv(mComputeShader, GL_COMPILE_STATUS, &rvalue);
if (!rvalue)
{
    glGetShaderInfoLog(mComputeShader, LOG_MAX, &length, log);
    printf("Error: Compiler log:\n%s\n", log);
    return false;
}

// Attach the shader and link the compute program
glAttachShader(gComputeProgram, mComputeShader);
glLinkProgram(gComputeProgram);

// Check if there were any issues when linking the shader
glGetProgramiv(gComputeProgram, GL_LINK_STATUS, &rvalue);
if (!rvalue)
{
    glGetProgramInfoLog(gComputeProgram, LOG_MAX, &length, log);
    printf("Error: Linker log:\n%s\n", log);
    return false;
}
So far we have created the compute program on the GPU. Now we need to set up the handles which will be used for supplying inputs and outputs to the shader. In our case we need to retrieve the radius uniform location and set gIndexBufferBinding (an integer variable) to 0, as the binding was hardcoded with binding = 0. Using this index we will be able to bind the VBO and let the compute shader write data into it:
// Bind the compute program in order to read the radius uniform location
glUseProgram(gComputeProgram);

// Retrieve the radius uniform location
iLocRadius = glGetUniformLocation(gComputeProgram, "radius");

// See the compute shader: layout(std140, binding = 0) buffer destBuffer
gIndexBufferBinding = 0;
Okay, so far so good. Now we are ready to kick off the compute shader and write data to the VBO. The snippet of code below shows how to bind the VBO to the SSBO and submit a compute job to the GPU:
// Bind the compute program
glUseProgram(gComputeProgram);

// Set the radius uniform
glUniform1f(iLocRadius, (float)frameNum);

// Bind the VBO onto the SSBO, which is going to be filled in within
// the compute shader.
// gIndexBufferBinding is equal to 0 (same as the compute shader binding)
glBindBufferBase(GL_SHADER_STORAGE_BUFFER, gIndexBufferBinding, gVBO);

// Submit the job for compute shader execution.
// The group counts are rounded up so that any vertex count is covered.
// GROUP_SIZE_HEIGHT = GROUP_SIZE_WIDTH = 8
// NUM_VERTS_H = NUM_VERTS_V = 16
// As a result the function is called with the following parameters:
// glDispatchCompute(2, 2, 1)
glDispatchCompute((NUM_VERTS_H + GROUP_SIZE_WIDTH  - 1) / GROUP_SIZE_WIDTH,
                  (NUM_VERTS_V + GROUP_SIZE_HEIGHT - 1) / GROUP_SIZE_HEIGHT,
                  1);

// Unbind the SSBO buffer.
// gIndexBufferBinding is equal to 0 (same as the compute shader binding)
glBindBufferBase(GL_SHADER_STORAGE_BUFFER, gIndexBufferBinding, 0);
As you may have already noticed, for the glDispatchCompute function we pass the number of groups rather than the number of threads to be executed. In our case we execute 2x2x1 groups, which gives 4. However, the real number of threads (kernels) executed will be 4 x [8 x 8], which results in 256 threads. The 8x8 comes from the compute shader source code, as we hardcoded those numbers within the shader.
So far we have written the compute shader source code, compiled and linked it, initialised the handles and dispatched the compute job. Now it is time to render the results on screen. Before we do that, however, we need to remember that jobs are submitted and executed on the GPU in parallel, so we must make sure the compute shader has finished its job before the draw command starts fetching data from the VBO, which is updated by the compute shader. In this example you won't see much runtime difference with and without synchronisation, but once you implement more complex algorithms with more dependencies, you will notice how important it is to have synchronisation.
// Call this function before submitting the draw call that depends on the
// buffer written by the compute shader
glMemoryBarrier(GL_VERTEX_ATTRIB_ARRAY_BARRIER_BIT);

// Bind the VBO
glBindBuffer(GL_ARRAY_BUFFER, gVBO);

// Bind the vertex and fragment rendering shaders
glUseProgram(gProgram);
glEnableVertexAttribArray(iLocPosition);
glEnableVertexAttribArray(iLocFillColor);

// Point the attributes at the interleaved data written by the compute
// shader: each record is 32 bytes, a 16-byte position followed by a
// 16-byte colour (matching the std140 layout of AttribData)
glVertexAttribPointer(iLocPosition, 4, GL_FLOAT, GL_FALSE, 32, (void*)0);
glVertexAttribPointer(iLocFillColor, 4, GL_FLOAT, GL_FALSE, 32, (void*)16);

// Draw the points from the VBO
glDrawArrays(GL_POINTS, 0, NUM_VERTS);
In order to present the VBO results on screen you can use vertex and fragment programs, which are shown below.
Vertex shader:
attribute vec4 a_v4Position;
attribute vec4 a_v4FillColor;

varying vec4 v_v4FillColor;

void main()
{
    v_v4FillColor = a_v4FillColor;
    gl_Position = a_v4Position;
}
Fragment shader:
precision mediump float;

varying vec4 v_v4FillColor;

void main()
{
    gl_FragColor = v_v4FillColor;
}
I think that’s all for this blog and hopefully I will be able to cover more technical details in the future.
I believe you will find compute shaders friendly and easy to use in your work. I personally enjoyed implementing the Cloth Simulation demo, one of Arm's latest technical demos, which was released at GDC. The important thing, in my view, is that once a developer knows OpenGL ES, it is easy to move on to GPU Compute using just one API. More than that, exchanging data between graphics and compute buffers is done in a clean and transparent way for developers. You shouldn't limit your imagination to this blog's example of how compute shaders might be used; this blog is only meant to help developers learn how to use them. I personally see real potential in image processing, as you can implement algorithms that execute on the chip using internal memory, which should reduce traffic on the bus between memory and the chip.
Hello,
A bit late on this but...
I have a Galaxy Note 10.1. What do I have to do in order to start using compute shaders on it? Are the drivers available yet?
thanks
Eyal
I ended up here by accident while searching for a description of compute shaders. I look, and it's Sylwek! Greetings! Get in touch when you are in Szczecin.
Marek
Hi themaister,
Very good spot! What you wrote is indeed the proper way of doing it.
I have updated the snippet of code above according to your comment.
Thank you!
The use of glMemoryBarrier() here is slightly wrong. glMemoryBarrier is supposed to target how the incoherent data is read, not written. GL_VERTEX_ATTRIB_ARRAY_BARRIER_BIT would be the correct barrier here since the data is used for vertex drawing after compute (pp. 115 in ES3.1 spec).
Hi oscarbg,
Thank you for posting all these important questions. I will do my best to answer them.
Emulator OpenGL ES 3.1 - we are expecting the emulator to be released later this year. We have the OpenGL ES 3.0 emulator, which is conformant, whereas the OpenGL ES 3.1 conformance tests are not available yet.
In terms of consumer devices, I have already answered on that for the nazcaspider question. But I also would like to add that once the conformance tests for OpenGL ES 3.1 are available we expect current generation of Mali GPU to pass the conformance tests.
With regards to the extensions, we believe in driving the API to one standard and we are working with Khronos to make this happen. In general we believe that a better route to enable new features is to introduce them in core API rather than through extensions. All we need is to identify the extensions which are essential and reasonable for mobile platforms and introduce them in our implementation.
HTH,
Sylwester