# Get started with compute shaders

Last month I was at the Game Developers Conference (GDC), where I had a fabulous time attending talks and roundtables and visiting exhibitors. I particularly enjoyed showing and explaining the latest technologies developed within Arm, such as ASTC 3D HDR textures and Transaction Elimination, as well as compute shaders.

With regard to the last of these, many of you have been curious about how to incorporate this technology into your software. With that in mind, I decided to write this blog to help you write a simple program with compute shaders. I hope it will help you to create more advanced applications based on this technology.

So, what are compute shaders? Compute shaders introduce heterogeneous GPU Compute from within the OpenGL® ES API, the same API and shading language that are used for graphics rendering. Now that compute shaders are part of the API, developers do not have to learn another API in order to make use of GPU Compute. The compute shader is just another type of shader alongside the already well-known vertex and fragment shaders.

Compute shaders give developers a lot of freedom to implement complex algorithms and make use of GPU parallel programming. Although the contemporary graphics pipeline is very flexible, developers still tend to stumble on some restrictions. With compute shaders we no longer have to think in terms of pipeline stages, as we are used to doing with vertex and fragment shaders, and we are no longer restricted by the inputs and outputs of particular stages. The Shader Storage Buffer Object (SSBO) feature, for instance, was introduced along with compute shaders. It gives additional possibilities for exchanging data between pipeline stages, and it serves as flexible input and output for compute shaders.

Below you can find a simple example of how to implement compute shaders within your application. The example calculates a coloured circle with a given radius; the radius is a uniform parameter passed by the application and updated every frame in order to animate the circle. The whole circle is drawn using points, which are stored as vertices within a Vertex Buffer Object (VBO). The VBO is bound as an SSBO (without any extra copy in memory) and passed to the compute shader.
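To make the data layout concrete, here is a minimal CPU-side sketch in C of how the points in the shared buffer might be organised. It is a sketch under stated assumptions, not the demo's actual code: the `PointAttrib` struct, the interleaved position-then-colour layout, and the helper names `point_index` and `point_angle` are all hypothetical. The only things taken from the example are the 16 x 16 grid of points and the two vec4 attributes per point.

```c
#include <stddef.h>

/* Grid of points, matching NUM_VERTS_H = NUM_VERTS_V = 16 in the example. */
enum {
    NUM_VERTS_H = 16,
    NUM_VERTS_V = 16,
    NUM_VERTS   = NUM_VERTS_H * NUM_VERTS_V
};

/* Assumed interleaved layout: one position and one colour per point,
   each stored as four floats (a vec4). This is an illustration only. */
typedef struct {
    float position[4];
    float fill_color[4];
} PointAttrib;

/* Flatten a 2D global invocation ID (x, y) into an index into the
   one-dimensional array of points. */
static unsigned point_index(unsigned x, unsigned y)
{
    return y * NUM_VERTS_H + x;
}

/* Angle in radians assigned to a point, spreading all points evenly
   around the circle. A shader would feed this to sin() and cos() and
   scale the result by the radius uniform. */
static float point_angle(unsigned index)
{
    const float two_pi = 6.2831853f;
    return two_pi * (float)index / (float)NUM_VERTS;
}
```

With this layout, `sizeof(PointAttrib)` is the stride and `offsetof(PointAttrib, fill_color)` is the colour offset that an application would pass to `glVertexAttribPointer` when drawing the same buffer as a VBO.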

Let’s start by writing the OpenGL ES Shading Language (ESSL) compute shader code first:

Once the compute shader code has been written, it is time to make it work in our application. Within the application you need to create a compute shader, which is just a new type of shader (GL_COMPUTE_SHADER), and the other calls related to the initialisation remain the same as for vertex and fragment shaders. See below for a snippet of code which creates the compute shader and also checks for both compilation and linking errors:

So far we have created the compute shader on the GPU. Now we need to set up the handles that will be used to provide inputs and outputs for the shader. In our case we need to retrieve the location of the radius uniform and set gIndexBufferBinding (an integer variable) to 0, because the binding point is hardcoded in the shader's layout qualifier (binding = 0). Using this index we can bind the VBO to that binding point and write data to the VBO from within the compute shader:

    // Bind the compute program in order to read the radius uniform location.
    glUseProgram(gComputeProgram);
    // Retrieve the radius uniform location
    iLocRadius = glGetUniformLocation(gComputeProgram, "radius");
    // See the compute shader: "layout(std140, binding = 0) buffer destBuffer"
    gIndexBufferBinding = 0;

Okay, so far so good. Now we are ready to kick off the compute shader and write data to the VBO. The snippet of code below shows how to bind the VBO to the SSBO and submit a compute job to the GPU:

    // Bind the compute program
    glUseProgram(gComputeProgram);
    // Set the radius uniform
    glUniform1f(iLocRadius, (float)frameNum);
    // Bind the VBO as an SSBO, which is going to be filled in within the
    // compute shader.
    // gIndexBufferBinding is equal to 0 (same as the compute shader binding)
    glBindBufferBase(GL_SHADER_STORAGE_BUFFER, gIndexBufferBinding, gVBO);
    // Submit the job for compute shader execution.
    // GROUP_SIZE_HEIGHT = GROUP_SIZE_WIDTH = 8
    // NUM_VERTS_H = NUM_VERTS_V = 16
    // As a result the function is called with the following parameters:
    // glDispatchCompute(2, 2, 1)
    // The group counts are rounded up so that a vertex count which is not a
    // multiple of the group size is still fully covered.
    glDispatchCompute((NUM_VERTS_H + GROUP_SIZE_WIDTH - 1) / GROUP_SIZE_WIDTH,
                      (NUM_VERTS_V + GROUP_SIZE_HEIGHT - 1) / GROUP_SIZE_HEIGHT,
                      1);
    // Unbind the SSBO buffer.
    // gIndexBufferBinding is equal to 0 (same as the compute shader binding)
    glBindBufferBase(GL_SHADER_STORAGE_BUFFER, gIndexBufferBinding, 0);

As you may have already noticed, we pass glDispatchCompute the number of work groups rather than the number of threads to be executed. In our case we execute 2 x 2 x 1 groups, which gives 4 groups. The real number of threads (shader invocations) executed is 4 x [8 x 8], which comes to 256. The 8 x 8 figure comes from the compute shader source code, where we hardcoded the group size.
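The arithmetic can be checked on the CPU. The following is a minimal sketch in C, assuming a round-up division for the group count; the helper names `group_count` and `total_invocations` are hypothetical and are not part of OpenGL ES.

```c
/* Number of work groups needed to cover `items` elements when each
   group handles `group_size` of them, rounding up. */
static unsigned group_count(unsigned items, unsigned group_size)
{
    return (items + group_size - 1u) / group_size;
}

/* Total number of shader invocations launched by a dispatch of
   groups_x * groups_y * groups_z groups, each of local size
   local_x * local_y * local_z. */
static unsigned total_invocations(unsigned groups_x, unsigned groups_y,
                                  unsigned groups_z,
                                  unsigned local_x, unsigned local_y,
                                  unsigned local_z)
{
    return groups_x * groups_y * groups_z * local_x * local_y * local_z;
}
```

For the values used in this example, `group_count(16, 8)` gives 2 in each dimension and `total_invocations(2, 2, 1, 8, 8, 1)` gives 256. If the number of points were not a multiple of the group size, rounding up would launch some surplus invocations, and the shader would need a bounds check so that they do not write outside the buffer.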

So far we have written the compute shader source code, compiled and linked it, initialised the handles and dispatched the compute job. Now it’s time to render the results on screen. Before we do that, however, we need to remember that jobs submitted to the GPU execute in parallel, so we must make sure the compute shader has finished writing before the draw command starts fetching data from the VBO that it updates. In this example you won't see much difference at runtime with or without synchronisation, but once you implement more complex algorithms with more dependencies you may notice how important synchronisation is.

    // Call this function before we submit a draw call, which uses dependency
    // buffer, to the GPU
    glMemoryBarrier(GL_VERTEX_ATTRIB_ARRAY_BARRIER_BIT);
    // Bind VBO
    glBindBuffer(GL_ARRAY_BUFFER, gVBO);
    // Bind Vertex and Fragment rendering shaders
    glUseProgram(gProgram);
    glEnableVertexAttribArray(iLocPosition);
    glEnableVertexAttribArray(iLocFillColor);
    // Draw points from VBO
    glDrawArrays(GL_POINTS, 0, NUM_VERTS);

In order to present the VBO results on screen you can use vertex and fragment programs, which are shown below.

    attribute vec4 a_v4Position;
    attribute vec4 a_v4FillColor;
    varying vec4 v_v4FillColor;
    void main()
    {
          v_v4FillColor = a_v4FillColor;
          gl_Position = a_v4Position;
    }

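    // Fragment shaders in ESSL have no default float precision, so declare one.
    precision mediump float;
    varying vec4 v_v4FillColor;
    void main()
    {
          gl_FragColor = v_v4FillColor;
    }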

I think that’s all for this blog, and hopefully I will be able to cover more technical details in the future. I believe you will find compute shaders friendly and easy to use in your work. I personally enjoyed implementing the Cloth Simulation demo, one of Arm’s latest technical demos, which was released at GDC. The important thing in my view is that a developer who is used to OpenGL ES can now move on to GPU Compute using just one API. More than that, exchanging data between graphics and compute buffers is clean and transparent for developers. Don’t limit your imagination to this blog’s example when thinking about how you might use compute shaders; it is only meant to help developers learn how to use them. I personally see real potential in image processing, as you can implement algorithms that execute on the chip using internal memory, which should reduce traffic on the bus between memory and chip.

• Hi Eyal,

GLES driver updates for stock devices are pushed by the device vendor in the form of OTA updates. We sometimes release vanilla builds of the userspace driver for certain devices on the ARM Mali Midgard GPU User Space Drivers page of the Mali Developer Center, but the Note 10.1 is not currently supported directly by us.

Hth,

Chris

• Hello,

A bit late on this but...

I have a Galaxy Note 10.1. What do I have to do in order to start using compute shaders on it? Are the drivers available yet?

thanks

Eyal

• I ended up here by accident while looking for a description of compute shaders. I look, and here is Sylwek. Greetings! Get in touch when you are in Szczecin.

Marek

• Hi themaister,

Very good spot! What you wrote is indeed the proper way of doing it.

I have updated the snippet of code above according to your comment.

Thank you!

• The use of glMemoryBarrier() here is slightly wrong. glMemoryBarrier is supposed to target how the incoherent data is read, not written. GL_VERTEX_ATTRIB_ARRAY_BARRIER_BIT would be the correct barrier here since the data is used for vertex drawing after compute (p. 115 in the ES 3.1 spec).
