Last month I was at the Game Developers Conference (GDC), where I had a fabulous time attending various talks and roundtables and visiting exhibitors. I particularly enjoyed showing and explaining to people the latest technologies developed within Arm, such as ASTC 3D HDR textures and Transaction Elimination, as well as compute shaders.
Regarding the last of these, many of you have been curious about how to incorporate this technology into your software. With that in mind, I decided to write this blog to help you write a simple program with compute shaders. I hope it will help you to create more advanced applications based on this technology.
Compute shaders introduce heterogeneous GPU Compute from within the OpenGL ES API; the same API and shading language which are used for graphics rendering. Now that compute shaders have been introduced to the API, developers do not have to learn another API in order to make use of GPU Compute. The compute shader is just another type of shader in addition to the already broadly known vertex and fragment shaders.
Compute shaders give developers a lot of freedom to implement complex algorithms and make use of GPU parallel programming. Although the contemporary graphics pipeline is very flexible, developers still tend to stumble over some of its restrictions. With compute shaders, however, we no longer have to think in terms of pipeline stages, as we are used to doing with vertex and fragment shaders, and we are no longer restricted by the inputs and outputs of particular stages. The Shader Storage Buffer Object (SSBO) feature, for instance, was introduced along with compute shaders; it provides additional ways of exchanging data between pipeline stages, as well as a flexible input and output for compute shaders.
Below you can find a simple example of how to use compute shaders within your application. The example computes a coloured circle with a given radius; the radius is a uniform parameter passed by the application and is updated every frame in order to animate the circle. The whole circle is drawn using points, which are stored as vertices within a Vertex Buffer Object (VBO). The VBO is bound as an SSBO (without any extra copy in memory) and written by the compute shader.
Let’s start by writing the OpenGL ES Shading Language (ESSL) compute shader code first:
#version 310 es

// The uniform parameter which is passed from the application every frame.
uniform float radius;

// Declare a custom data struct, which represents either a vertex or a colour.
// (Despite the name it holds four components, matching a vec4.)
struct Vector3f
{
    float x;
    float y;
    float z;
    float w;
};

// Declare the custom data type, which represents one point of a circle:
// the vertex position and colour respectively.
// As you may have already noticed, this defines interleaved data within
// the buffer: Vertex|Colour|Vertex|Colour|...
struct AttribData
{
    Vector3f v;
    Vector3f c;
};

// Declare the input/output buffer from/to which we will read/write data.
// In this particular shader we only write data into the buffer.
// If you do not want your data to be aligned by the compiler, use
// packed or shared instead of the std140 keyword.
// We also bind the buffer to index 0. You need to set the buffer binding
// in the range [0..3] - this is the minimum range approved by Khronos.
// Notice that various platforms might support more indices than that.
layout(std140, binding = 0) buffer destBuffer
{
    AttribData data[];
} outBuffer;

// Declare the group size. In our case it is 8x8, which gives
// a group size of 64.
layout(local_size_x = 8, local_size_y = 8, local_size_z = 1) in;

// Declare the main program function, which is executed once
// glDispatchCompute is called from the application.
void main()
{
    // Read the current global position for this thread.
    ivec2 storePos = ivec2(gl_GlobalInvocationID.xy);

    // Calculate the global number of threads (size) for this dispatch.
    uint gWidth = gl_WorkGroupSize.x * gl_NumWorkGroups.x;
    uint gHeight = gl_WorkGroupSize.y * gl_NumWorkGroups.y;
    uint gSize = gWidth * gHeight;

    // Since we have a 1D array we need to calculate the offset.
    uint offset = uint(storePos.y) * gWidth + uint(storePos.x);

    // Calculate an angle for the current thread.
    float alpha = 2.0 * 3.14159265359 * (float(offset) / float(gSize));

    // Calculate the vertex position based on the angle calculated above
    // and the radius, which is given by the application.
    outBuffer.data[offset].v.x = sin(alpha) * radius;
    outBuffer.data[offset].v.y = cos(alpha) * radius;
    outBuffer.data[offset].v.z = 0.0;
    outBuffer.data[offset].v.w = 1.0;

    // Assign a colour to the vertex.
    outBuffer.data[offset].c.x = float(storePos.x) / float(gWidth);
    outBuffer.data[offset].c.y = 0.0;
    outBuffer.data[offset].c.z = 1.0;
    outBuffer.data[offset].c.w = 1.0;
}
Once the compute shader code has been written, it is time to make it work in our application. Within the application you need to create a compute shader, which is just a new type of shader (GL_COMPUTE_SHADER), and the other calls related to the initialisation remain the same as for vertex and fragment shaders. See below for a snippet of code which creates the compute shader and also checks for both compilation and linking errors:
// Create the compute program, to which the compute shader will be assigned
gComputeProgram = glCreateProgram();

// Create and compile the compute shader
GLuint mComputeShader = glCreateShader(GL_COMPUTE_SHADER);
glShaderSource(mComputeShader, 1, &computeShaderSrcCode, NULL);
glCompileShader(mComputeShader);

// Check if there were any issues when compiling the shader
int rvalue;
glGetShaderiv(mComputeShader, GL_COMPILE_STATUS, &rvalue);
if (!rvalue)
{
    glGetShaderInfoLog(mComputeShader, LOG_MAX, &length, log);
    printf("Error: Compiler log:\n%s\n", log);
    return false;
}

// Attach the shader and link the compute program
glAttachShader(gComputeProgram, mComputeShader);
glLinkProgram(gComputeProgram);

// Check if there were any issues when linking the shader
glGetProgramiv(gComputeProgram, GL_LINK_STATUS, &rvalue);
if (!rvalue)
{
    glGetProgramInfoLog(gComputeProgram, LOG_MAX, &length, log);
    printf("Error: Linker log:\n%s\n", log);
    return false;
}
So far we have created the compute program on the GPU. Now we need to set up the handles which will be used for supplying inputs and outputs to the shader. In our case we need to retrieve the radius uniform location and set gIndexBufferBinding (an integer variable) to 0, as the binding was hardcoded with binding = 0. Using this index we will be able to bind the VBO and let the compute shader write data into it:
// Bind the compute program in order to read the radius uniform location
glUseProgram(gComputeProgram);

// Retrieve the radius uniform location
iLocRadius = glGetUniformLocation(gComputeProgram, "radius");

// See the compute shader: layout(std140, binding = 0) buffer destBuffer
gIndexBufferBinding = 0;
Okay, so far so good. Now we are ready to kick off the compute shader and write data to the VBO. The snippet of code below shows how to bind the VBO to the SSBO and submit a compute job to the GPU:
// Bind the compute program
glUseProgram(gComputeProgram);

// Set the radius uniform
glUniform1f(iLocRadius, (float)frameNum);

// Bind the VBO onto the SSBO, which is going to be filled in within
// the compute shader.
// gIndexBufferBinding is equal to 0 (same as the compute shader binding)
glBindBufferBase(GL_SHADER_STORAGE_BUFFER, gIndexBufferBinding, gVBO);

// Submit the job for compute shader execution.
// The group counts are rounded up so that any vertex count is covered.
// GROUP_SIZE_HEIGHT = GROUP_SIZE_WIDTH = 8
// NUM_VERTS_H = NUM_VERTS_V = 16
// As a result the function is called with the following parameters:
// glDispatchCompute(2, 2, 1)
glDispatchCompute((NUM_VERTS_H + GROUP_SIZE_WIDTH  - 1) / GROUP_SIZE_WIDTH,
                  (NUM_VERTS_V + GROUP_SIZE_HEIGHT - 1) / GROUP_SIZE_HEIGHT,
                  1);

// Unbind the SSBO buffer.
// gIndexBufferBinding is equal to 0 (same as the compute shader binding)
glBindBufferBase(GL_SHADER_STORAGE_BUFFER, gIndexBufferBinding, 0);
As you may have already noticed, for the glDispatchCompute function we pass the number of groups rather than the number of threads to be executed. In our case we execute 2x2x1 groups, which gives 4. However, the real number of threads (kernels) executed will be 4 x [8 x 8], which results in 256 threads. The 8x8 comes from the compute shader source code, as we hardcoded those numbers within the shader.
So far we have written the compute shader source code, compiled and linked it, initialised the handles and dispatched the compute job. Now it is time to render the results on screen. Before we do that, however, we need to remember that jobs are submitted and executed on the GPU in parallel, so we must make sure the compute shader has finished its job before the draw command starts fetching data from the VBO, which is updated by the compute shader. In this example you won't see much runtime difference with and without synchronisation, but once you implement more complex algorithms with more dependencies, you will notice how important it is to have synchronisation.
// Call this function before submitting the draw call that depends on the
// buffer written by the compute shader
glMemoryBarrier(GL_VERTEX_ATTRIB_ARRAY_BARRIER_BIT);

// Bind the VBO
glBindBuffer(GL_ARRAY_BUFFER, gVBO);

// Bind the vertex and fragment rendering shaders
glUseProgram(gProgram);
glEnableVertexAttribArray(iLocPosition);
glEnableVertexAttribArray(iLocFillColor);

// Point the attributes at the interleaved data written by the compute
// shader: each record is 32 bytes, a 16-byte position followed by a
// 16-byte colour (matching the std140 layout of AttribData)
glVertexAttribPointer(iLocPosition, 4, GL_FLOAT, GL_FALSE, 32, (void*)0);
glVertexAttribPointer(iLocFillColor, 4, GL_FLOAT, GL_FALSE, 32, (void*)16);

// Draw the points from the VBO
glDrawArrays(GL_POINTS, 0, NUM_VERTS);
In order to present the VBO results on screen you can use vertex and fragment programs, which are shown below.
Vertex shader:
attribute vec4 a_v4Position;
attribute vec4 a_v4FillColor;

varying vec4 v_v4FillColor;

void main()
{
    v_v4FillColor = a_v4FillColor;
    gl_Position = a_v4Position;
}
Fragment shader:
precision mediump float;

varying vec4 v_v4FillColor;

void main()
{
    gl_FragColor = v_v4FillColor;
}
I think that’s all for this blog and hopefully I will be able to cover more technical details in the future.
I believe you will find compute shaders friendly and easy to use in your work. I personally enjoyed implementing the Cloth Simulation demo, one of Arm's latest technical demos, which was released at GDC. The important thing, in my view, is that once a developer knows OpenGL ES, it is easy to move on to GPU Compute using just one API. More than that, exchanging data between graphics and compute buffers is done in a clean and transparent way for developers. You shouldn't limit your imagination to this blog's example of how compute shaders might be used; this blog is only meant to help developers learn how to use them. I personally see real potential in image processing, as you can implement algorithms that execute on the chip using internal memory, which should reduce traffic on the bus between memory and the chip.
Hello,
A bit late on this but...
I have a Galaxy Note 10.1. What do I have to do in order to start using compute shaders on it? Are the drivers available yet?
thanks
Eyal
I ended up here by accident while searching for a description of compute shaders. I look, and it's Sylwek! Greetings! Get in touch when you are in Szczecin.
Marek
Hi themaister,
Very good spot! What you wrote is indeed the proper way of doing it.
I have updated the snippet of code above according to your comment.
Thank you!
The use of glMemoryBarrier() here is slightly wrong. glMemoryBarrier is supposed to target how the incoherent data is read, not written. GL_VERTEX_ATTRIB_ARRAY_BARRIER_BIT would be the correct barrier here since the data is used for vertex drawing after compute (pp. 115 in ES3.1 spec).
Hi oscarbg,
Thank you for posting all these important questions. I will do my best to answer them.
Emulator OpenGL ES 3.1 - we are expecting the emulator to be released later this year. We have the OpenGL ES 3.0 emulator, which is conformant, whereas the OpenGL ES 3.1 conformance tests are not available yet.
In terms of consumer devices, I have already answered on that for the nazcaspider question. But I also would like to add that once the conformance tests for OpenGL ES 3.1 are available we expect current generation of Mali GPU to pass the conformance tests.
With regards to the extensions, we believe in driving the API to one standard and we are working with Khronos to make this happen. In general we believe that a better route to enable new features is to introduce them in core API rather than through extensions. All we need is to identify the extensions which are essential and reasonable for mobile platforms and introduce them in our implementation.
HTH,
Sylwester