Using DS-5 Streamline to Optimize Complex OpenCL™ Applications on Mali GPUs

Chinese version (中文版): 在 Mali GPU 上利用 DS-5 Streamline 优化复杂的 OpenCL™ 应用程序

Heterogeneous applications – those running code on multiple processors like a CPU and a GPU at the same time – are inherently difficult to optimize.  Not only do you need to consider how optimally the different parts of code that run on the different processors are performing, but you also need to take into account how well they are interacting with each other.  Is either processor waiting around unnecessarily for the other?  Are you copying large amounts of memory unnecessarily?  What level of utilisation are you making of the GPU?  Where are the bottlenecks?  The complexities of understanding all these are not for the squeamish.

Performance analysis tools are, of course, the answer, at least in part.  The DS-5 Streamline performance analyzer is one of these tools, and it recently saw the addition of some interesting new features targeting OpenCL.  Streamline is one of the components of ARM DS-5 Development Studio, the end-to-end suite of tools for software development on any ARM processor.

So, armed with DS-5 Streamline and a complex, heterogeneous application how should you go about optimization?  In this blog I aim to give you a starting point, introducing the DS-5 tool and a few concepts about optimization along the way.

DS-5 Streamline Overview


pic1.jpg

DS-5 Streamline allows you to attach to a live device and retrieve hardware counters in real time.  The counters you choose are displayed in a timeline, and this can include values from both the CPU and GPU in the same trace.  The image above, for example, shows a timeline with a number of traces.  From the top there’s the dual-core CPU activity in green, the GPU’s graphics activity in light blue and the GPU’s compute activity in red.  Following that are various hardware counter and other system traces.

As well as the timeline, on the CPU side you can drill down to the process you want to analyse and then profile performance within the various parts of the application, right down to system calls.  With Mali GPUs you can specify performance counters and graph them right alongside the CPU.  This allows you to profile both graphics and OpenCL compute jobs, allowing for highly detailed analysis of the processing being done in the cores and their components. A recently added feature, the OpenCL timeline, takes this a step further making it possible to analyse individual kernels amongst a chain of kernels.

Optimization Workflow


So with the basics described, what is the typical optimization process for complex heterogeneous applications?

When the intention is to create a combined CPU and GPU solution for a piece of software you might typically start with a CPU-only implementation.  This gets the gremlins out of the algorithms you need to implement and then acts both as a golden reference for the accuracy of computations being performed, and as a performance reference so you know the level of benefit the move to multiple processor types is giving you.

Often the next step is then to create a “naïve” port.  This is where the transition of code from CPU to GPU is functional but relatively crude. You wouldn’t necessarily expect a big – or indeed any – uplift in performance at this stage, but it’s important to establish a working heterogeneous model if nothing else.

At this point you would typically start thinking about optimization.  Profiling the naïve port is probably a good next step as this can often highlight the level of utilisation within your application and from there you can deduce where to concentrate most of your efforts.  Often what you’re looking for at this stage is a hint as to the best way to implement the parallel components of your algorithm.

Of course, to get the very best out of the hardware you’re using, it is vital to have at least a basic understanding of the architecture you are targeting.  So let’s start with a bit of architectural background for the Mali GPU.

The OpenCL Execution Model on Mali GPUs


Firstly, here’s how the OpenCL execution model maps onto Mali GPUs.

pic2.jpg

Work items are simply threads on the shader pipeline, each one with its own registers, program counter, stack pointer and private stack. Up to 256 of these can run on a core at a time, each capable of natively processing vector data.

OpenCL work groups – collections of work items – each execute on a single core. Work groups can use barriers, local atomics and cached local memory.

The ND range – the entire workload for an OpenCL job – is split into work groups, which are distributed across the available Mali GPU cores. Global atomics are supported, and global memory is cached.

As we’ll see, relative to some other GPU architectures, Mali GPU cores are sophisticated devices, capable of handling hundreds of threads in flight at any one time.

The Mali GPU Core

Let’s take a closer look inside one of these cores:

pic3.jpg

Here we see the dual ALU, the load/store and the texture pipelines. Threads come in at the top and enter one of these pipes, looping back round to the top for the next instruction until the thread completes, at which point it exits at the bottom.  We would typically have a great many threads running this way, spinning around the pipelines instruction by instruction.

Load/Store

So let’s imagine the first instruction is a load.  It enters and is executed in the load/store pipe.  If the data is available, the thread can loop round on the next cycle for the next instruction.  If the data hasn’t yet arrived from main memory, the instruction will have to wait in the pipe until it’s available.

ALUs

Imagine then that the next instruction is arithmetic.  The thread now enters one of the arithmetic pipes.  ALU instructions support SIMD – single instruction, multiple data – allowing operations on several components at a time.  The instruction format itself is VLIW – very long instruction word – supporting several operations per instruction.  This could include, for example, a vector add, a vector multiply and various scalar operations all in one instruction.  This can give the effect of certain operations appearing “as free”, because the arithmetic units within the ALU can perform many of these in parallel within a single cycle.  Finally there is a built-in function library – the “BIFL” – which has hardware acceleration for many mathematical and other operational functions.
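As an illustration, here is a hypothetical OpenCL C kernel (the kernel and buffer names are made up, and this is compiled by the GPU driver, not a host C compiler) in which a single source line hands the compiler a vector multiply and a vector add that are natural candidates for packing into the VLIW instruction stream:

```c
/* OpenCL C sketch -- illustrative only. */
__kernel void madd4(__global const float4 *a,
                    __global const float4 *b,
                    __global const float4 *c,
                    __global float4 *out)
{
    size_t i = get_global_id(0);
    /* One line: a 4-wide multiply and a 4-wide add, each operating on
       all four components of the float4 operands at once. */
    out[i] = a[i] * b[i] + c[i];
}
```

Writing kernels in terms of vector types like `float4` plays directly to this hardware: the multiply and add can execute together, giving that “as free” effect described above.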

So this is a complex and capable core, designed to keep many threads in flight at a time, and thereby hide latency.  Latency hiding is what this is ultimately all about. We don’t care if an individual thread has to wait around for some data to arrive as long as the pipelines can get on with processing other threads.

Each of these pipelines is independent of the others, and likewise each thread is entirely independent of every other thread.  The total time for a program to execute is therefore defined by the pipeline that needs the most cycles to let every thread execute all the instructions in its program. If we have predominantly load/store operations, for example, the load/store pipe will become the limiting factor.  So in order to optimize a program we need to find out which pipeline this is, allowing us to target our optimization efforts effectively.

Hardware Counters

To help determine this we need to access the GPU’s hardware counters. These will identify which parts of the cores are being exercised by a particular job.  In turn this helps target our efforts towards tackling bottlenecks in performance.

There are a large number of these hardware counters available. For example there are counters for each core and counters for individual components within a core, allowing you to peek inside and understand what is going on with the pipelines themselves.  And we have counters for the GPU as a whole, including things like the number of active cycles.

Accessing these counters is where we come back to DS-5 Streamline.  Let’s look at some screenshots of Streamline at work.

pic4.jpg

The first thing to stress is that what we see here is a whole-system view.  The vertical green bars in the top line show the CPU, the blue bars below that show the graphics part of the application running on the GPU, and the red bars show the compute-specific parts of the application on the GPU.

crop.jpg

There are all sorts of ways to customise this – I’m not going to go into huge amounts of detail here, but you can select from a wide variety of counter information for your system depending on what it is you need to measure. Streamline allows you to isolate counters against specific applications for both CPU and GPU, allowing you to focus in on what you need to see.

crop2.jpg

Looking down the screen you can see an L2 cache measurement – the blue wavy trace in the middle – and further down we’ve got a counter showing activity in the Mali GPU’s arithmetic pipelines.  We could scroll down to find more, and indeed zoom in to get a more detailed view at any point.

DS-5 Streamline can often show you very quickly where the problem lies in a particular application.  The next image was taken from a computer vision application running on the CPU and using OpenCL on the GPU.  It would run fine for a number of seconds, and then seemingly at random would suddenly slow down significantly, with the processing framerate dropping by half.

crop3.jpg

You can see the trace has captured the moment this slowdown happened. To the left of the timeline marker we can see the CPU and GPU working reasonably efficiently.  Then this suddenly lengthens out: we see a much bigger gap between the pockets of GPU work, and the CPU activity has grown significantly.  The red bars in amongst the green bars at the top represent increased system activity on the platform.  This trace and others like it were invaluable in showing that the initial problem with this application lay with how it was streaming and processing video.

One of the benefits of having the whole system on view is that we get a holistic picture of the performance of the application across multiple processors and processor types, and this was particularly useful in this example.

crop4.jpg

Here we’ve scrolled down the available counters in the timeline to show some others – in particular the various activities within the Mali GPU’s cores.  You can see counter lines for a number of things, but in particular the arithmetic, load-store and texture pipes – along with cache hits, misses etc.  Hovering over any of these graphs at any point in the timeline will show actual counter numbers.

crop5.jpg

Here for example we can see the load/store pipe instruction issues at the top, and actual instructions on the bottom.  The difference in this case is a measure of the load/store re-issues necessary at this point in the timeline – in itself a measure of efficiency of memory accesses.  What we are seeing at this point represents a reasonably healthy position in this regard.

The next trace is from the same application we were looking at a little earlier, but this time with a more complex OpenCL filter chain enabled.

crop6.jpg

If we look a little closer we can see how efficiently the application is running.  We’ve expanded the CPU trace – the green bars at the top – to show both the cores we had on this platform.  Remember the graphics elements are the blue bars, with the image processing filters represented by the red.

mag.jpg

Looking at the cycle the application is going through for each frame:

  1. Firstly there is CPU activity leading up to the compute job.
  2. Whilst the compute job then runs, the CPU is more or less idle.
  3. With the completion of the compute filters, the CPU does a small amount of processing, setting up the graphics render.
  4. The graphics job then runs, rendering the frame before the sequence starts again.

So in a snapshot we have this holistic and heterogeneous overview of the application and how it is running.  Clearly we could aim for much better performance here by pipelining the workload to avoid the idle gaps we see.  There is no reason why the CPU and GPU couldn’t be made to run more efficiently in parallel, and this trace shows that clearly.

OpenCL Timeline

There are many features of DS-5 Streamline, and I’m not going to attempt to go into them all.  But there’s one in particular I’d like to show you that links the latest Mali GPU driver release to the latest version of DS-5 (v5.20), and that’s the OpenCL Timeline.

pic1.jpg

In this image we’ve just enabled the feature – it’s the horizontal area at the bottom.  This shows the running of individual OpenCL kernels, the time they take to run, any overhead of sync-points between CPU and GPU etc.

crop7.jpg

Here we have the name of each kernel being run, along with the supporting host-side setup processes.  If we hover over any part of this timeline…

crop8.jpg

… we can see details about the individual time taken for that kernel or operation.  In terms of knowing how then to target optimizations, this is invaluable.

Here’s another view of the same feature.

pic15.jpg

We can click the “Show all dependencies” button and Streamline will show us visually how the kernels are interrelated.  Again, this is all within the timeline, fitting right in with this holistic view of the system.  Being able to do this – particularly for complex, multi-kernel OpenCL applications – is highly valuable, helping developers understand and improve the performance of ever-more demanding applications.

Optimizing Memory Accesses

So once you have these hardware counters, what sort of use should you make of them?

Generally speaking, the first thing to focus on is the use of memories. The SoC has only one programmer-controlled memory in the system – in other words, there is no separate local memory, it’s all just global.  The CPU and GPU have the same visibility of this memory and will often share a memory bus.  Overlapping memory accesses can therefore cause problems.

If we want to shift data back and forth between CPU and GPU, we don’t need to copy memory (as you might do on a desktop architecture).  Instead, we only need to do cache flushes.  These also take time and need minimising.  So we can take an overview of the program with Streamline, allowing us to see when the CPU was running and when the GPU was running, in a similar way to some of the timelines we saw earlier.  We may want to optimize our synchronisation points so that the GPU or CPU is not waiting any longer than it needs to.  Streamline is very good at visualising this.
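On such shared-memory systems the usual host-side pattern is to let the driver allocate the buffer and then map it, rather than copying with `clEnqueueWriteBuffer`. A minimal sketch (a fragment, not a complete program – it assumes `context`, `queue` and `size` already exist, needs an OpenCL SDK to build, and elides error handling):

```c
/* CL_MEM_ALLOC_HOST_PTR asks the driver to allocate memory that both the
   CPU and GPU can see directly, so mapping costs at most cache
   maintenance, never a copy. */
cl_int err;
cl_mem buf = clCreateBuffer(context,
                            CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_WRITE,
                            size, NULL, &err);

/* Map to fill from the CPU: a cache flush at most, no copy. */
float *p = (float *)clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                       0, size, 0, NULL, NULL, &err);
/* ... write input data through p ... */
clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);
/* Kernels can now read buf directly; map again after they finish to
   read results back on the CPU. */
```

The map and unmap calls are also exactly the synchronisation points that show up in the Streamline timeline, which makes it easy to see whether either processor is waiting on them for longer than it should.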

Optimizing GPU ALU Load

With memory accesses optimized, the next stage is to look more closely at the execution of your kernels.  As we’ve seen, using Streamline we can zoom into the execution of a kernel and determine what the individual pipelines are doing, and in particular determine which pipeline is the limiting factor.  The Holy Grail here – a measure of peak optimization – is for the limiting pipe to be issuing instructions every cycle.

I mentioned earlier that we have a latency-tolerant architecture because we expect to have a great many threads in the system at any one time. Pressure on register usage, however, will limit the number of threads that can be active at once: the register file is finite, so the more registers each thread needs, the fewer threads can be resident.  Once the number of threads falls far enough, latency can no longer be hidden, and this shows up as too few instructions being issued in the limiting pipe.  And if we use too many registers, the compiler will spill values back to main memory, so we’ll see additional load/store operations as a result.  The compiler manages all this, but there can be performance implications.

Excessive register usage will also reduce the maximum local work group size we can use.

The solution is to use fewer registers.  We can use smaller types where possible – switching from 32-bit to 16-bit values, for example.  Or we can split the kernel into multiple kernels, each with a reduced number of registers.  We have seen very large kernels that performed poorly but, when split into two or more kernels, performed much better overall, because each individual kernel needs fewer registers.  This allows more threads to run at the same time, and consequently gives more tolerance to latency.
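The smaller-types idea can be sketched in OpenCL C. This hypothetical kernel stores its data as 16-bit halves rather than 32-bit floats using the standard `vload_half`/`vstore_half` built-ins, which widen to `float` for the arithmetic (full 16-bit arithmetic, which helps register pressure further, needs the `cl_khr_fp16` extension):

```c
/* OpenCL C sketch -- hypothetical kernel, illustrative only. */
__kernel void scale_half(__global const half *in,
                         __global half *out,
                         const float gain)
{
    size_t i = get_global_id(0);
    float v = vload_half(i, in);     /* 16-bit load, widened to float */
    vstore_half(v * gain, i, out);   /* narrowed back to 16 bits */
}
```

Halving the size of the values halves the bandwidth they consume and lets more of them fit in the registers and caches at once.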

Optimizing Cache Usage

Finally, we look at cache usage.  If this is working badly we would see many load/store instructions spinning around the load/store pipe, waiting for the data they have requested to be returned.  This involves re-issuing instructions until the data is available.  There are GPU hardware counters that show just what we need, and DS-5 Streamline can expose them for us.

This has only been a brief look at the world of compute optimization with Mali GPUs.  There’s a lot more out there.  To get you going I’ve included some links below to malideveloper.arm.com for all sorts of useful guides, developer videos, papers and more.

Download DS-5 Streamline: ARM DS-5 Streamline – Mali Developer Center

Mali-T600 Series GPU OpenCL Developer Guide: Mali-T600 Series GPU OpenCL Developer Guide – Mali Developer Center

GPU Compute, OpenCL and RenderScript Tutorials: http://malideveloper.arm.com/develop-for-mali/opencl-renderscript-tutorials/

Anonymous