Interested in GPU Compute?  You Have Choices!

Chinese Version 中文版:[原创翻译]GPU通用计算有兴趣吗?你有多种选择!

The most notable addition to OpenGL® ES when version 3.1 was announced at GDC earlier this year was Compute Shaders. Whilst similar to vertex and fragment shaders, Compute Shaders allow much more general-purpose data access and computation. These have been available on desktop OpenGL® since version 4.3 in mid-2012, but it’s the first time they’ve been available in the mobile API. This brings another player to the compute-on-mobile-GPU game, joining the ranks of OpenCL, RenderScript and others. So what do these APIs do and when should you use them? I’ll attempt to answer these questions in this blog.

When it comes to programming the GPU for non-graphics related jobs, the various tools at our disposal share a common goal: to provide an interface between the GPU and CPU so that packets of work to be executed in parallel can be applied to the GPU’s compute resources. Designing tools that are flexible enough to do this and that allow the individual strengths of the GPU’s architecture to be exploited is a complex process. The strength of the GPU is to run small tasks on a wide range of data as far as possible in parallel, often many millions of times. This is after all what a GPU does when processing pixels. Compute on the GPU is just generalizing this capability. So inevitably there are some similarities in how these tools do what they do.

Let’s take a look at the main options…

OpenCL



Initially developed by Apple and subsequently managed by the Khronos Group, the OpenCL specification was released in late 2008. OpenCL is a flexible framework that can target many types of processor, from CPUs and GPUs to DSPs. To do so you need a conformant OpenCL driver for the processor you’re targeting. Once you have that, a properly written OpenCL application will be compatible with other suitably conformant platforms.

When I say OpenCL is flexible, I am perhaps understating it. Based on a variant of C99, it allows complex algorithms to be shaped across a wide variety of parallel computing architectures. And it has become very widespread – there are drivers available for hundreds of platforms. See this list for the products that have passed the Khronos conformance tests. ARM supports OpenCL with its family of ARM® Mali™ GPUs. For example, the Mali-T604 passed conformance in 2012.

So is there a price for all this flexibility?  Well, it can be reasonably complex to set up an OpenCL job… and there can be quite an overhead in doing so. The API breaks down access to OpenCL-compatible devices into a hierarchy of sub units.

[Figure: OpenCL execution model]

So the host computer can in theory have any number of OpenCL devices. Each of these can have any number of compute units and, in turn, each of these compute units can have any number of processing elements. OpenCL workgroups – collections of individual threads called work items – run on these processing elements. How all of this is implemented is platform dependent, as long as the end result is compliant with the OpenCL standard. As a result, the boilerplate code to set up access to OpenCL devices has to be very flexible to allow for so many potential variations, and the amount of it can seem significant, even for a minimal OpenCL application.
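To make the hierarchy a little more concrete, here is a minimal sketch in plain Python. It is a model only and does not call the OpenCL API: the names `Device`, `ComputeUnit`, `total_processing_elements` and `split_into_workgroups` are hypothetical, as are the device sizes. The one relationship borrowed from OpenCL itself is that, for a 1D range with no global offset, a work item's global ID equals its group ID times the local size plus its local ID.

```python
from dataclasses import dataclass


@dataclass
class ComputeUnit:
    processing_elements: int  # number of processing elements in this unit


@dataclass
class Device:
    name: str
    compute_units: list  # list of ComputeUnit


def total_processing_elements(devices):
    """Walk host -> devices -> compute units and count processing elements."""
    return sum(cu.processing_elements
               for dev in devices
               for cu in dev.compute_units)


def split_into_workgroups(global_size, local_size):
    """Split a 1D range of work items into workgroups of local_size items.

    Returns a list of workgroups; each workgroup is a list of
    (group_id, local_id, global_id) tuples.
    """
    # OpenCL 1.x expects the global size to be a multiple of the local size.
    if global_size % local_size != 0:
        raise ValueError("global_size must be a multiple of local_size")
    groups = []
    for group_id in range(global_size // local_size):
        groups.append([(group_id, local_id, group_id * local_size + local_id)
                       for local_id in range(local_size)])
    return groups


# Hypothetical host: one GPU with 4 compute units, one CPU with 2.
host_devices = [
    Device("example-gpu", [ComputeUnit(16) for _ in range(4)]),
    Device("example-cpu", [ComputeUnit(1) for _ in range(2)]),
]
workgroups = split_into_workgroups(global_size=8, local_size=4)
```

How workgroups are actually scheduled onto compute units is left out on purpose, since that part is platform dependent.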

There are some great samples and a tutorial available in the ARM Mali OpenCL SDK, with a mix of basic through to more complex examples.

From the earliest days of OpenCL targeting mobile GPUs, the API has shown great promise, both in accelerating performance and in reducing energy consumption. Many of the early projects have concentrated on image and video processing. For an example, see this great write-up of the latest software VP9 decoder from Ittiam.

For more examples of some of the developments using OpenCL on mobile, check out Mucho GPU Compute, amigo! from Roberto Mijat.

One of the real benefits of OpenCL, as well as its flexibility, is the huge range of research and developer activity surrounding the API. There are a large number of other languages – more than 70 at the last count – that compile down to OpenCL, easing its use and allowing its benefits to be harnessed in a more familiar environment. And there are several CL libraries and numerous frameworks exposing the OpenCL API from a wide range of languages. PyOpenCL, for example, provides access to OpenCL via Python. See Anton Lokhmotov's blog on this subject Introducing PyOpenCL.


Because of the required setup and overhead, building an OpenCL job into a pipeline is usually only worth doing when the job is big enough that this overhead becomes insignificant against the work being done. A great example of this was Ittiam Systems’ recent optimisation of their HEVC and VP9 software video decoders. As not all of the algorithm was suitable for the GPU, Ittiam had to choose how to split the workload between the CPU and GPU. They identified the motion estimation part of the algorithm as being the most likely to present enough parallel computational work to benefit from running on the GPU. The algorithm as a whole is then implemented as a heterogeneous split between the CPU and GPU, with the resulting benefits of reduced CPU workload and reduced overall power usage. See this link for more about Ittiam Systems. Like most APIs targeting a wide range of architectures, optimisations you make for one platform might need to be tweaked on another, but having the flexibility to address the low level features of a platform to take full advantage of it is one of OpenCL’s real strengths.
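The "big enough" argument can be put as simple arithmetic. If a job has a fixed setup cost and a lower per-item cost on the GPU, the GPU wins once setup + n × gpu < n × cpu. The sketch below solves that for the smallest whole n. The function name `break_even_items` and all of the timings are hypothetical, chosen only to illustrate the shape of the trade-off; real figures have to be measured on the target platform.

```python
def break_even_items(setup_ns, cpu_ns_per_item, gpu_ns_per_item):
    """Smallest item count at which GPU time (including setup) beats CPU time.

    Solves setup + n * gpu < n * cpu for the smallest integer n, using
    integer nanoseconds. Returns None if the GPU is not faster per item,
    since the fixed overhead can then never be recovered.
    """
    saving_per_item = cpu_ns_per_item - gpu_ns_per_item
    if saving_per_item <= 0:
        return None
    return setup_ns // saving_per_item + 1


# Hypothetical figures: 2 ms of setup, 1000 ns per item on the CPU,
# 200 ns per item on the GPU.
n = break_even_items(2_000_000, 1000, 200)
```

With these made-up numbers the GPU only pulls ahead beyond a couple of thousand items, which is why small jobs are often better left on the CPU.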

Recent Developments


It’s been a busy year so far for Khronos and OpenCL – there have been a number of developments.  Of particular note perhaps is the announcement of version 1.0 of WebCL™, an API that does for OpenCL what WebGL™ does for OpenGL ES by exposing the compute API to JavaScript and bringing compute access into the world of the browser. Of course, support within browsers may take some time – as it did for WebGL – but it’s a sign of OpenCL broadening its appeal.


OpenCL Summary


OpenCL provides an industry standard API that allows the developer to optimise for a supporting platform’s low level architectural features. To help you get going there is a large and growing number of developer resources from a very active community. If the platform you’re planning to develop for supports it, OpenCL can be a powerful tool.

RenderScript


RenderScript is a proprietary compute API developed by Google. It’s been an official part of Android™ OS since the Honeycomb release in 2011. Back then it was intended as both a graphics and a compute API, but the graphics part has since been deprecated. There are several similarities with OpenCL: it’s based on C99, has the same concept of organising data into 1, 2 or 3 dimensions, and so on. For a quick primer on RenderScript, see GPU Computing in Android? With ARM Mali-T604 & RenderScript Compute You Can! by rmijat or Google’s introduction to RenderScript here.

The process of developing for RenderScript is relatively straightforward. You write your RenderScript C99-based code alongside the Java that makes up the rest of your Android application. The Android SDK creates some additional Java glue to link the two together, and compiles the RenderScripts themselves into bitcode, an intermediate, device-independent format that is bundled with the APK. When the application runs on a device, Android will determine what RenderScript devices are available and capable of running the bitcode in question. This might be a GPU (e.g. Mali-T604) or a DSP. If one is found, the bitcode is passed on to a driver that creates appropriate machine-level code. If there is no suitable device, Android will fall back to running the RenderScript on the CPU.
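The selection-with-fallback behaviour described above can be modelled in a few lines. This is a sketch of the idea only, not Android's implementation: `select_backend`, the backend names and the `can_run` callback are all hypothetical.

```python
def select_backend(available_backends, can_run):
    """Pick the first accelerator able to run the script, else the CPU.

    available_backends: backend names in order of preference (hypothetical).
    can_run: function taking a backend name and returning True if that
             backend's driver accepts the bitcode.
    """
    for backend in available_backends:
        if backend != "cpu" and can_run(backend):
            return backend
    return "cpu"  # the CPU path is always available as a fallback


# A device whose GPU driver rejects the script but whose DSP accepts it.
chosen = select_backend(["gpu", "dsp"], lambda name: name == "dsp")

# A device with no usable accelerator at all.
fallback = select_backend(["gpu"], lambda name: False)
```

The point of the design is that the application code never has to know which path was taken.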

In this way RenderScript is guaranteed to run on just about any Android device, and even with fallback to the CPU it can provide a useful level of acceleration. So if you are specifically looking for compute acceleration in Android, RenderScript is a great tool.

[Figure: RenderScript]

The very first device with GPU-accelerated RenderScript was Google’s Nexus 10, which used an SoC featuring an ARM Mali-T604 GPU. Early examples of RenderScript applications have shown a significant benefit from using accelerated GPU compute.

As RenderScript is a relatively young API, knowhow and examples are not as easy to come by as they are for OpenCL, but their availability is likely to increase. There’s more detail about how to use RenderScript here.

RenderScript Summary


RenderScript is a great way to benefit from accelerated compute in the vast majority of Android devices. Whether this compute is put onto the GPU or not will depend on the device and availability of RenderScript GPU drivers, but even when that isn’t the case there should still be some benefit from running RenderScripts on the CPU. It’s a higher-level API than OpenCL, with fewer configuration options, and as such can be easier to get to grips with, particularly as RenderScript development is streamlined into the existing Android SDK. If you have this setup, you already have all the tools you need to get going.

Compute Shaders



So to the new kid on the block, OpenGL ES 3.1 compute shaders. If you’re used to using vertex and fragment shaders already with OpenGL ES, you’ll fit right in with compute shaders. They’re written in GLSL (OpenGL Shading Language) in pretty much the same way with similar status, uniforms and other properties and have access to many of the same types of data including textures, image types, atomic counters and so on. However, unlike vertex and fragment shaders they’re not built into the same program object and as such are not a part of the same rendering pipeline.

Compute shaders introduce a new general-purpose form of data buffer, the Shader Storage Buffer Object, and mirror the ideas of work items and workgroups used in OpenCL and RenderScript. Other additions to GLSL allow work items to identify their position in the data set being processed and allow the programmer to specify the size and shape of the workgroups.
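As a rough illustration of how a work item finds its position, the sketch below emulates a 2D dispatch in plain Python. It relies on the GLSL relationship that the global invocation ID is the workgroup ID times the workgroup size plus the local invocation ID. The helper names `group_count` and `dispatch_2d`, the buffer sizes and the 8x8 workgroup shape are assumptions made for the example; on a real device the shader declares its workgroup size and the application chooses how many groups to dispatch.

```python
def group_count(items, local_size):
    """Number of workgroups needed to cover `items` (ceiling division)."""
    return (items + local_size - 1) // local_size


def dispatch_2d(width, height, local_size, kernel):
    """Emulate a 2D compute dispatch, calling kernel(x, y) per work item."""
    size_x, size_y = local_size
    groups_x = group_count(width, size_x)
    groups_y = group_count(height, size_y)
    for group_y in range(groups_y):
        for group_x in range(groups_x):
            for local_y in range(size_y):
                for local_x in range(size_x):
                    # global id = workgroup id * workgroup size + local id
                    x = group_x * size_x + local_x
                    y = group_y * size_y + local_y
                    # Guard: edge workgroups may overhang the data set.
                    if x < width and y < height:
                        kernel(x, y)
    return groups_x, groups_y


# Double every element of a 10x6 buffer using 8x8 workgroups.
width, height = 10, 6
data = [[x + y * width for x in range(width)] for y in range(height)]
out = [[0] * width for _ in range(height)]


def double(x, y):
    out[y][x] = 2 * data[y][x]


groups = dispatch_2d(width, height, (8, 8), double)
```

The loops run sequentially here; on a GPU the work items within and across workgroups would run in parallel, which is why each one must be able to work out its own position independently.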

You might typically use a compute shader in advance of the main rendering pipeline, using the shader’s output as another input to the vertex or fragment stages.

[Figure: compute shader]

Though not part of a rendering pipeline, compute shaders are typically used to support one. They’re not as well suited to general purpose compute work as OpenCL or RenderScript, but assuming your use-case is suitable, compute shaders offer an easy way to access general purpose computing on the GPU.

For a great introduction to Compute Shaders, do see sylvek's recent blog Get started with compute shaders.

Compute Shaders Summary


Compute shaders are coming! How quickly depends on the roll-out and adoption of OpenGL ES 3.1, but there’s every chance this technology will find its way into a very wide range of devices as mobile GPUs capable of supporting OpenGL ES 3.1 filter down into the mid-range market over the next couple of years. The same thing happened with the move from OpenGL ES 1.1 to 2.0… nowadays you’d be hard pushed to find a phone or tablet that doesn’t support 2.0. Relative ease of use combined with growing ubiquity across multiple platforms could just be a winning combination.

See ploutgalatsopoulos' blog on ARM's recent submission for OpenGL ES 3.1 conformance for the Mali-T604, Mali-T628 and Mali-T760 GPUs - and for a great introduction to OpenGL ES 3.1 as a whole, do check out Tom Olson's blog Here comes OpenGL® ES 3.1!

One more thing…



So that’s it.  But as Columbo would say… just one more thing…

OpenGL ES 2.0 Fragment Shaders and Frame Buffer Objects


Although not seen as a power compute user’s weapon of choice, fragment shaders have for a long time been used to run some level of general compute - and they do offer one benefit unique amongst all the main approaches here: ubiquity. Any OpenGL ES 2.0-capable GPU – and that really is just about every smart device out there today – can run fragment shaders. This approach involves thinking of texture maps not necessarily as arrays of texels, but just as a 1D or 2D array of data. As long as the data to be read and written by the shader can be represented by supported texture formats, these values can be sampled and written out for each element in the array. You just set up a Frame Buffer Object and typically render a quad (two triangles making a rectangle) into it, using one or more of these data arrays as texture sources. The fragment shader can then compute more or less whatever it wants from the data in these textures, and output the computed result to the FBO. The resulting texture can then be used as a source for any other fragment shaders in the rendering pipeline.
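The texture-as-data idea can be emulated in plain Python to show the flow of a pass. This is a conceptual sketch, not OpenGL ES code: `render_pass` stands in for drawing a quad into a Frame Buffer Object, `sample_nearest` for a nearest-filtered texture lookup, and `add_shader` for a fragment shader. All three names are hypothetical, and the float values sidestep the question of which texture formats a given device supports.

```python
def render_pass(fragment_shader, width, height, textures):
    """Emulate drawing a full-screen quad into a width x height render target.

    fragment_shader(u, v, textures) is called once per output texel with
    normalised coordinates at the texel centre.
    """
    target = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            u = (x + 0.5) / width
            v = (y + 0.5) / height
            target[y][x] = fragment_shader(u, v, textures)
    return target


def sample_nearest(texture, u, v):
    """Nearest-texel lookup of a 2D list using normalised coordinates."""
    height, width = len(texture), len(texture[0])
    x = min(int(u * width), width - 1)
    y = min(int(v * height), height - 1)
    return texture[y][x]


def add_shader(u, v, textures):
    """Per-texel sum of two input 'textures'."""
    a, b = textures
    return sample_nearest(a, u, v) + sample_nearest(b, u, v)


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[10.0, 20.0], [30.0, 40.0]]

# First pass: add the two data arrays element by element.
result = render_pass(add_shader, 2, 2, [a, b])

# Second pass: the output of the first pass is used as an input texture.
second_pass = render_pass(add_shader, 2, 2, [result, a])
```

Chaining passes like this, with one pass's render target becoming the next pass's texture, is how multi-step computations are usually built with this approach.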

Summary


In this blog I’ve looked at OpenCL, RenderScript, Compute Shaders and fragment shaders as several options for using the GPU for non-graphical compute workloads. Each approach has characteristics that will suit certain applications or developers’ requirements, and all of these tools can be leveraged to both improve performance and reduce energy consumption. It’s worth noting that the story doesn’t stop here. The world of embedded and mobile heterogeneous computing is evolving fast. The good news is that the Mali GPU architecture is designed to support the latest leading compute APIs, enabling all our customers to achieve improved performance and energy efficiency on a variety of platforms and operating systems.

Anonymous
Graphics & Multimedia blog