In the world of parallel computing, when an algorithm can easily be split into multiple parallel jobs, where the output of each job doesn't depend on the output of any other, it is referred to as "Embarrassingly Parallel" or "Pleasingly Parallel", whichever you prefer. This uncharacteristically colourful terminology is perhaps inspired by the huge relief such algorithms bring to the weary parallel compute developer, who otherwise has to craft delicate inter-thread communication so that parallel jobs can share their results in whatever order their algorithm defines as correct.
Let me give you a simple example of such a parallel-friendly algorithm. Convolution filters are certainly members of the embarrassingly parallel club. Imagine we have a large array of values:
An example of a convolution filter. Each transformed pixel value is created by multiplying its current value and the values of the pixels around it against a matrix of coefficients
Each pixel in the image is processed by summing a proportion of its original value with a proportion of the original values of the surrounding pixels. The proportion of each pixel usually depends on its proximity to the central pixel being processed. Crucially – and apparently embarrassingly – none of the calculations require knowledge of the result of any of the other calculations. This makes parallelizing much easier, because each of the pixel calculations can be performed in any order. Using a parallel compute API like OpenCL™, it is then easy to assign each pixel to a work item – or thread – and watch your convolution job split itself across the processing cores you have at your disposal.
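To make that concrete, here is a minimal OpenCL kernel sketch of a 3x3 convolution with one work item per pixel. This is an illustration rather than code from the original filter: the kernel name, the coeffs parameter and the assumption that the nine coefficients sum to 16 are all mine, and the image border is simply skipped.

__kernel void convolve3x3(__global const uchar *src,      // Source greyscale image
                          __global uchar       *dst,      // Destination image
                          __constant int       *coeffs,   // 9 filter coefficients (assumed to sum to 16)
                          uint                  width,
                          uint                  height)
{
    uint x = get_global_id(0);
    uint y = get_global_id(1);

    // Skip the border pixels for simplicity
    if (x == 0 || y == 0 || x >= (width - 1) || y >= (height - 1))
        return;

    int sum = 0;
    for (int dy = -1; dy <= 1; dy++)
        for (int dx = -1; dx <= 1; dx++)
            sum += coeffs[((dy + 1) * 3) + (dx + 1)] * src[((y + dy) * width) + (x + dx)];

    // No work item needs the result of any other work item
    dst[(y * width) + x] = convert_uchar_sat(sum / 16);
}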
This sort of example is a nice way to showcase parallel programming. It gets across the message of splitting your job into the smallest processing elements without getting bogged down in too many thorny issues like thread synchronization. But what about those more problematic, non-embarrassing algorithms? How should we tackle those?
Well of course, there's no single answer. Life is not that simple. So let's turn to an example to showcase the sort of methods at your disposal.
A good one I came across the other day was the Floyd-Steinberg algorithm. This is the name given to an image dithering algorithm invented by Robert W. Floyd and Louis Steinberg in 1976. It is typically used when you need to reduce the number of colours in an image and still retain a reasonable perception of the relative colour and brightness levels. This is achieved through pixel dithering: in other words, the required colour in each area of the image is approximated with a pattern of pixels from the reduced palette. The result is a trade-off: what you lose is overall level of detail, but what you gain is a reasonable colour representation of the original image.
Here's an example:
Original image on the left. On the right the 256-colour dithered version.
In our particular example, we’re going to be looking at converting a grey-scale image – where each pixel can be represented by 256 different levels of grey – to an image only using black and white pixels.
Grey scale version on the left. 2-colour Floyd-Steinberg version on the right
What you can see in this example, and what Floyd and Steinberg discovered, is the concept of error diffusion, whereby an algorithm determines a distribution of pixels from a limited palette that approximates the original image.
The algorithm itself is actually quite simple, and indeed rather elegant. What you have are three buffers: the source image, the destination image, and an error diffusion buffer of the same dimensions (initially all zeros).
The algorithm defines a method of traversing over an image and for each pixel determining a quantization error – effectively the difference between the pixel’s value and what would be the nearest suitable colour from the available palette. This determination is made by reference to both the pixel’s current colour and a value read from the error buffer – as written out by previous pixel calculations. And indeed a proportion of the error calculated for this pixel will then be propagated to neighbouring ones. Here’s how this works:
Step 1: a pixel from the source and a value from the error diffusion buffer are added. Depending on the result, a white or black pixel is written to the destination and an error value is determined.
Step 2: the value of err is split up and distributed back into the error distribution buffer into four neighbouring pixels.
The code for doing all this is actually quite simple:
for each y from top to bottom
    for each x from left to right
        val := pixel[x][y] + error_values[x][y]
        if (val > THRESHOLD)
            diff := val - THRESHOLD
            dest[x][y] := 0xff              // Write a white pixel
        else
            diff := val
            dest[x][y] := 0x0               // Write a black pixel
        error_values[x + 1][y    ] += (diff * 7) / 16
        error_values[x - 1][y + 1] += (diff * 3) / 16
        error_values[x    ][y + 1] += (diff * 5) / 16
        error_values[x + 1][y + 1] += (diff * 1) / 16
This uses these three buffers: pixel (the source image), dest (the destination image) and error_values (the error diffusion buffer).
The value of THRESHOLD would typically be set to 128.
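For reference, here is a minimal serial sketch of that pseudocode in C. This is not the original post's reference implementation: it assumes an 8-bit greyscale source and destination, an int error buffer of the same dimensions initialised to zero, and it simply skips the edge pixels rather than handling them specially.

#define THRESHOLD 128

/* Serial Floyd-Steinberg dither: greyscale in, black-and-white out. */
static void fs_dither_serial(const unsigned char *pixel,        /* source image           */
                             unsigned char       *dest,         /* destination image      */
                             int                 *error_values, /* error diffusion buffer */
                             int width, int height)
{
    for (int y = 0; y < height - 1; y++)            /* top to bottom */
    {
        for (int x = 1; x < width - 1; x++)         /* left to right */
        {
            int val = pixel[(y * width) + x] + error_values[(y * width) + x];
            int diff;

            if (val > THRESHOLD)
            {
                diff = val - THRESHOLD;
                dest[(y * width) + x] = 0xff;       /* Write a white pixel */
            }
            else
            {
                diff = val;
                dest[(y * width) + x] = 0x00;       /* Write a black pixel */
            }

            /* Distribute the quantization error to neighbouring pixels */
            error_values[(y * width) + x + 1]       += (diff * 7) / 16;
            error_values[((y + 1) * width) + x - 1] += (diff * 3) / 16;
            error_values[((y + 1) * width) + x]     += (diff * 5) / 16;
            error_values[((y + 1) * width) + x + 1] += (diff * 1) / 16;
        }
    }
}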
So I hope you can see the problem here. We can’t simply assign each pixel’s calculation to an independent work item because we cannot guarantee the order that work items will run. In OpenCL the order of execution of work items – even the order of execution of work groups – is entirely undefined. As we progress left to right across the image, and then line by line down the image, each pixel is dependent on the output of 4 previous pixel calculations.
So is there any hope for parallelization here? On its own perhaps this algorithm is better tackled by the CPU. But imagine the Floyd-Steinberg filter was part of a filter chain, where there was significant benefit from running the other filters before and after this one on a GPU like the ARM® Mali™-T604.
Any move from GPU to CPU will require cache synchronisation, introducing a level of overhead
Here we would need two CPU/GPU synchronization points either side of the Floyd-Steinberg filter. These are potentially quite expensive. Not only do the caches need to be flushed back to main memory, but the CPU needs to be ready to take on this job, which could complicate other jobs the CPU might be doing. So if it was possible to get a GPU version running somehow, even if its processing time was a little slower than the CPU, there might still be some net benefit to the GPU implementation.
Let’s look at the algorithm again and see what we might be able to do. We can see that the only thing stopping an individual pixel being processed is whether its related error buffer value has been written to by all four pixels: the one to the left, and the three above as follows.
C2 depends on the results of four previous pixel calculations, B1, C1, D1 and B2
From the diagrams we can see that if we want to process pixel C2 we have to wait until B1, C1, D1 and B2 have finished, as these all write values into C2's error buffer location.
If we have a work item per pixel, each work item would have to keep checking whether that moment has arrived, and work items representing pixels quite low down in the image, or far to the right, could simply be left waiting for a long time. And if you fill up all the available threads you can run at the same time, and they're all waiting, you reach deadlock. Nothing will proceed. Not good.
What we need to do is to impose some order on all this… some good old-fashioned sequential programming alongside the parallelism. By serializing parts of the algorithm we can reduce the number of checks a work item would need to do before it can determine it is safe to continue. One way to do this is to assign an entire row to a single work item. That way we can ensure we process the pixels in a line from left to right. The work item processing the row of pixels below then only needs to check the progress of this work item: as long as it is two pixels ahead then it is safe to proceed with the next pixel. So we would have threads progressing across their row of pixels in staggered form:
Each thread processes a horizontal line of pixels and needs to be staggered as shown here
Of course there are a few wrinkles here. First of all we need to consider workgroups. Each workgroup – a collection of work items, with each work item processing a line – needs to be run in order. So the work items in the first workgroup need to process the top set of lines, the next workgroup needs to process the set of lines below this, and so on. But there's no guarantee that workgroups are submitted to the GPU in sequential order, so simply using the OpenCL function get_group_id – which returns the numerical offset of the current workgroup – won't do as a way of determining which set of lines is processed. Instead we can use OpenCL atomics: if the first work item in each workgroup atomically increments a global value, and that value is then used to determine the actual group of lines the workgroup processes, we can guarantee the lines will be processed in order as they progress down the image.
Here’s a diagram showing how workgroups would share the load within an image:
Each workgroup processes a horizontal band of pixels. In this case the workgroup size is 128, so the band height is 128 pixels, with each work item (thread) processing a single row of pixels.
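For illustration, here is a hedged host-side sketch of how such a dispatch might look for the kernel shown later, assuming the image height is a multiple of the workgroup size. The queue and kernel variables are hypothetical, the local progress buffer is assumed to be the sixth kernel argument, and error checking is omitted.

size_t local_size  = 128;            // One workgroup covers a band of 128 rows
size_t global_size = height;         // One work item per row of pixels

// Argument index 5 is the __local progress buffer: one marker per work item in the workgroup
clSetKernelArg(kernel, 5, local_size * sizeof(cl_uint), NULL);

clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, &local_size,
                       0, NULL, NULL);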
So for each line we need a progress marker so that the line below knows which pixel it is safe to calculate up to. A work item can then sit and wait for the line above if it needs to, ensuring no pixel proceeds until the error distribution values it needs have been written out.
Here’s the rough pseudo code for what the kernel needs to do…
is this the first work item in the workgroup?
{
    atomic increment the global workgroup rider
    initialize to zero the local progress markers
}

barrier    // All work items in this workgroup wait until this point is reached

from the global workgroup rider and the local work item id,
determine the line in the image we're processing

loop through the pixels in the line we're processing
{
    wait for the work item representing the line above to have completed
    enough pixels so we can proceed

    do the Floyd-Steinberg calculation for this pixel

    update the progress marker for this line
}
You may have spotted the next wrinkle in this solution. The local progress markers are fine for ensuring that individual lines don’t get ahead of themselves – with the exception of the first work item (representing the top line in the group of lines represented by this workgroup). This first line needs to only progress once the last line of the previous workgroup has got far enough along. So we need a way of holding markers for the last line of each workgroup as well. The wait for the first work item then becomes a special case, as does the update of the marker for the last line.
Here’s the initialisation part of the kernel code:
__kernel void fs2(__global uchar         *src,                // The source greyscale image buffer
                  __global uchar         *dst,                // The destination buffer
                  __global uchar         *err_buf,            // The distribution of errors buffer
                  __global uint          *workgroup_rider,    // A single rider used to create a unique workgroup index
                  __global volatile uint *workgroup_progress, // A buffer of progress markers for each workgroup
                  __local volatile uint  *progress,           // The local buffer for each workgroup
                  uint                    width)              // The width of the image
{
    __local volatile uint workgroup_number;

    /* We need to put the workgroups in some order. This is done by the first
       work item in the workgroup atomically incrementing the global workgroup
       rider. The local progress buffer - used by the work items in this
       workgroup - also needs initialising... */
    if (get_local_id(0) == 0)                  // A job for the first work item...
    {
        // Get the global order for this workgroup...
        workgroup_number = atomic_inc(workgroup_rider);

        // Initialise the local progress markers...
        for (int i = 0; i < get_local_size(0); i++)
            progress[i] = 0;
    }

    barrier(CLK_LOCAL_MEM_FENCE);              // Wait here so we know progress buffer and
                                               // workgroup_number have been initialised
Note the use of the 'volatile' keyword when defining some of the variables here. This hints to the compiler that these values can be changed by other threads, thereby avoiding certain optimisations that might otherwise be made.
The barrier in the code is also something to highlight. There are often better approaches than using barriers, typically involving some kind of custom semaphore system. The barrier here, however, is only used as part of the initialization of the kernel and is not used subsequently within the loop. Even so, I implemented a version that used a flag for each workgroup: the first work item set the flag once its setup phase was complete, and the other work items sat checking until the flag was set. It was a useful exercise, but didn't show any noticeable difference in performance.
With initialization done, it’s time to set up the loop that will traverse across the line of pixels:
    /* The area of the image we work on depends on the workgroup_number
       determined earlier. We multiply this by the workgroup size and add the
       local id index. This gives us the y value for the row this work item
       needs to calculate. Normally we would expect to use get_global_id to
       determine this, but can't here. */
    int y = (workgroup_number * get_local_size(0)) + get_local_id(0);

    int err;
    int sum;

    for (int x = 1; x < (width - 1); x++)   // Each work item processes a line
                                            // (ignoring 1st and last pixels)...
    {
        /* Need to ensure that the data in err_buf required by this work item
           is ready. To do that we need to check the progress marker for the
           line just above us. For the first work item in this workgroup, we
           get this from the global workgroup_progress buffer. For other work
           items we can peek into the progress buffer local to this workgroup.
           In each case we need to know that the previous line has reached
           2 pixels on from our own current position... */
        if (get_local_id(0) > 0)            // For all work items other than the first in this workgroup...
        {
            while (progress[get_local_id(0) - 1] < (x + 2));
        }
        else                                // For the first work item in this workgroup...
        {
            if (workgroup_number > 0)
                while (workgroup_progress[workgroup_number - 1] < (x + 2));
        }
At the top of each loop we need to ensure the line above has got far enough ahead of where this line is. The first item in the workgroup checks on the progress of the last line in the previous workgroup, whilst the other items check on the progress of the line above.
After that, we’re finally ready to do the Floyd-Steinberg calculation for the current pixel:
        sum = src[(width * y) + x] + err_buf[(width * y) + x];

        if (sum > THRESHOLD)
        {
            err = sum - THRESHOLD;
            dst[(width * y) + x] = 0xff;
        }
        else
        {
            err = sum;
            dst[(width * y) + x] = 0x00;
        }

        // Distribute the error values...
        err_buf[(width * y) + x + 1]       += (err * 7) / 16;
        err_buf[(width * (y + 1)) + x - 1] += (err * 3) / 16;
        err_buf[(width * (y + 1)) + x]     += (err * 5) / 16;
        err_buf[(width * (y + 1)) + x + 1] += (err * 1) / 16;
The final thing to do within the main loop is to set the progress markers to reflect that this pixel is done:
        /* Set the progress marker for this line...
           If this work item is the last in the workgroup we set the global
           marker so the first item in the next workgroup will pick this up.
           For all other work items we set the local progress marker. */
        if (get_local_id(0) == (get_local_size(0) - 1))   // Last work item in this workgroup?
            workgroup_progress[workgroup_number] = x;
        else
            progress[get_local_id(0)] = x;
    }
There’s one more thing to do. We need to set the progress markers to just beyond the width of the image so subsequent lines can complete:
    /* Although this work item has now finished, subsequent lines need to be
       able to continue to their ends. So the relevant progress markers need
       bumping up... */
    if (get_local_id(0) == (get_local_size(0) - 1))   // Last work item in this workgroup?
        workgroup_progress[workgroup_number] = width + 2;
    else
        progress[get_local_id(0)] = width + 2;
}
Before I talk about performance – and risk getting too carried away – it’s worth considering again the following line:
while (progress[get_local_id(0) - 1] < (x + 2));
This loop keeps a work item waiting until a progress marker is updated, ensuring the processing for this particular line doesn’t proceed until it’s safe to do so. The progress marker is updated by the thread processing the line above. Other than the use of barriers, inter-thread communication is not specifically ruled out in the specification for OpenCL 1.1 or 1.2. But neither is it specifically advocated. In other words, it is a bit of a grey area. As such, there is a risk that behaviour might vary across different platforms.
Take wavefront (or "warp") based GPUs, for example. With wavefront architectures, threads (work items) are clustered together into small groups, each sharing a program counter. This means threads within such a group cannot diverge: they can go dormant whilst other threads follow a different conditional path, but ultimately they remain in lock-step with each other. This has some advantages when it comes to scalability, but in this case the line above can stall permanently: if a thread is waiting on another thread in the same warp, the two can never progress.
The Mali-T600, -T700 and -T800 series of GPUs are not wavefront based. With each thread having its own program counter, threads are entirely independent of each other, so the technique above runs fine. But it should be easy enough to accommodate wavefront architectures by replacing the 'while' loop with a conditional that determines whether the thread can continue. Compare the original loop structure with a wavefront-friendly alternative:
Original structure:

    for (x = 1; x < (width - 1); x++)
    {
        wait for line above to be >= 2 pixels ahead
        process pixel x
        update progress for this line
    }

Wavefront-friendly structure:

    for (x = 1; x < (width - 1); )
    {
        if line above is >= 2 pixels ahead
        {
            process pixel x
            update progress for this line
            x++
        }
    }
The second version allows the loop to iterate regardless of whether the previous line is ready or not. Note that in this version, x only increments once the pixel has actually been processed.
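Sketched against the kernel above, that wavefront-friendly loop might look something like this. This is an assumption of how it could be written, not code from a tested implementation:

    for (int x = 1; x < (width - 1); )
    {
        // Is the line above far enough ahead? Check the local or global marker as before...
        int ready;
        if (get_local_id(0) > 0)
            ready = (progress[get_local_id(0) - 1] >= (x + 2));
        else
            ready = (workgroup_number == 0) ||
                    (workgroup_progress[workgroup_number - 1] >= (x + 2));

        if (ready)
        {
            // ...Floyd-Steinberg calculation and error distribution for pixel x...
            // ...update the progress marker for this line...
            x++;                     // Only advance once the pixel has been processed
        }
    }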
It’s also worth mentioning that as all the work items in the same wavefront are in lock-step by design, once the work items have been started further checking between the threads would be unnecessary. It might be feasible to optimise a kernel for a wavefront-based GPU to take advantage of this.
Technically, the above worked, producing an identical result to the CPU reference implementation. But what about performance? The OpenCL implementation ran between 3 and 5 times faster than the CPU implementation. So there is a really useful uplift from the GPU version. It would also be possible to create a multithreaded version on the CPU, and this would potentially provide some improvement. But remember that if this filter stage was part of a chain running on the GPU, with the above solution we can now slot this right in amongst the others, further freeing the CPU and removing those pesky sync points.
Moving the CPU kernel to the GPU will remove the need for cache synchronization in the above example
And what about further room for improvement? There are all sorts of optimisation techniques we would normally advocate, and those steps have not been explored in detail here. But just for example, the byte read and writes could be optimised to load and store several bytes in one operation. There are links at the bottom of this post to some articles and other resources which go into these in more detail. With a cursory glance however it doesn’t appear that many of the usual optimisation suspects would apply easily here… but nevertheless I would be fascinated if any of you out there can see some interesting ways to speed this up further. In the meantime it is certainly encouraging to see the improvement in speed which the Mali GPU brings to the algorithm.
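To give one hedged example of the kind of thing that might be possible: single uchar reads and writes could, in principle, be replaced with OpenCL vector loads and stores, as in the fragment below. Whether this can be reconciled with the dependent error-distribution writes of this particular algorithm is not something I have explored here.

    // Read 16 source pixels in one operation rather than 16 single-byte reads...
    uchar16 pixels = vload16(0, &src[(width * y) + x]);

    // ...process the 16 values...

    // ...then write 16 destination pixels back in a single operation
    vstore16(pixels, 0, &dst[(width * y) + x]);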
For reference, the performance figures above were measured on the following platform:

CPU: ARM Cortex®-A15 running at 1.7GHz
GPU: ARM Mali-T604 MP4 running at 544MHz
For more information about optimising compute on Mali GPUs, please see the various tutorials and documents listed here:
GPU Compute, OpenCL and RenderScript Tutorials - Mali Developer Center
This work by ARM is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. However, in respect of the code snippets included in the work, ARM further grants to you a non-exclusive, non-transferable, limited license under ARM’s copyrights to Share and Adapt the code snippets for any lawful purpose (including use in projects with a commercial purpose), subject in each case also to the general terms of use on this site. No patent or trademark rights are granted in respect of the work (including the code snippets).
Hello Tim,
What about the workgroup size in this case?
I tested on an ARM Mali-G72. If the workgroup size is 1, it runs well; otherwise, here's the error message:
Time used on device for execution:-3069543701242889 us.
When the first workgroup stalls, the other workgroups are dispatched and run in dead loops until the condition is ready. So the first group may NOT get the chance to be scheduled, and the other groups' condition is never ready. Deadlock happens.
That's my view of why it fails. What's your opinion?
Thanks in advance!