In the world of parallel computing, when an algorithm can easily be split into multiple parallel jobs, where the output of each job doesn't depend on the output of any other, it is referred to as "Embarrassingly Parallel" or "Pleasingly Parallel", whichever you prefer. This uncharacteristically colourful terminology is perhaps inspired by the huge relief such algorithms bring to the weary parallel compute developer, who otherwise has to craft delicate inter-thread communication so that parallel jobs can share their results in whatever order their algorithm defines as correct.
Let me give you a simple example of such a parallel-friendly algorithm. Convolution filters are certainly members of the embarrassingly parallel club. Imagine we have a large array of values:
An example of a convolution filter. Each transformed pixel value is created by multiplying its current value and the values of the pixels around it against a matrix of coefficients
Each pixel in the image is processed by summing a proportion of its original value with a proportion of the original values of the surrounding pixels. The proportion of each pixel usually depends on its proximity to the central pixel being processed. Crucially – and apparently embarrassingly – none of the calculations require knowledge of the result of any of the other calculations. This makes parallelizing much easier, because each of the pixel calculations can be performed in any order. Using a parallel compute API like OpenCL™, it is then easy to assign each pixel to a work item – or thread – and watch your convolution job split itself across the processing cores you have at your disposal.
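To make that concrete, here is a minimal OpenCL kernel sketch of a 3x3 convolution with one work item per pixel. This is an illustration rather than code from the original filter: the kernel name, the coeffs parameter and the assumption that the nine coefficients sum to 16 are all mine, and the image border is simply skipped.

__kernel void convolve3x3(__global const uchar *src,      // Source greyscale image
                          __global uchar       *dst,      // Destination image
                          __constant int       *coeffs,   // 9 filter coefficients (assumed to sum to 16)
                          uint                  width,
                          uint                  height)
{
    uint x = get_global_id(0);
    uint y = get_global_id(1);

    // Skip the border pixels for simplicity
    if (x == 0 || y == 0 || x >= (width - 1) || y >= (height - 1))
        return;

    int sum = 0;
    for (int dy = -1; dy <= 1; dy++)
        for (int dx = -1; dx <= 1; dx++)
            sum += coeffs[((dy + 1) * 3) + (dx + 1)] * src[((y + dy) * width) + (x + dx)];

    // No work item needs the result of any other work item
    dst[(y * width) + x] = convert_uchar_sat(sum / 16);
}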
This sort of example is a nice way to showcase parallel programming. It gets across the message of splitting your job into the smallest processing elements without getting bogged down in too many thorny issues like thread synchronization. But what about those more problematic, non-embarrassing algorithms? How should we tackle those?
Well of course, there's no single answer. Life is not that simple. So let's turn to an example to showcase the sort of methods at your disposal.
A good one I came across the other day was the Floyd-Steinberg algorithm. This is the name given to an image dithering algorithm invented by Robert W. Floyd and Louis Steinberg in 1976. It is typically used when you need to reduce the number of colours in an image and still retain a reasonable perception of the relative colour and brightness levels. This is achieved through pixel dithering: in other words, the required colour in each area of the image is approximated with a pattern of pixels from the reduced palette. The result is a trade-off: what you lose is overall level of detail, but what you gain is a reasonable colour representation of the original image.
Here's an example:
Original image on the left. On the right the 256-colour dithered version.
In our particular example, we’re going to be looking at converting a grey-scale image – where each pixel can be represented by 256 different levels of grey – to an image only using black and white pixels.
Grey scale version on the left. 2-colour Floyd-Steinberg version on the right
What you can see in this example, and what Floyd and Steinberg discovered, is the concept of error diffusion, whereby an algorithm determines a distribution of pixels from a limited palette that approximates the original image.
The algorithm itself is actually quite simple, and indeed rather elegant. What you have are three buffers: the source image, the destination image, and an error diffusion buffer of the same dimensions (initially all zeros).
The algorithm defines a method of traversing over an image and for each pixel determining a quantization error – effectively the difference between the pixel’s value and what would be the nearest suitable colour from the available palette. This determination is made by reference to both the pixel’s current colour and a value read from the error buffer – as written out by previous pixel calculations. And indeed a proportion of the error calculated for this pixel will then be propagated to neighbouring ones. Here’s how this works:
Step 1: a pixel from the source and a value from the error diffusion buffer are added. Depending on the result, a white or black pixel is written to the destination and an error value is determined.
Step 2: the value of err is split up and distributed back into the error distribution buffer into four neighbouring pixels.
The code for doing all this is actually quite simple:
for each y from top to bottom
    for each x from left to right
        val := pixel[x][y] + error_values[x][y]
        if (val > THRESHOLD)
            diff := val - THRESHOLD
            dest[x][y] := 0xff              // Write a white pixel
        else
            diff := val
            dest[x][y] := 0x0               // Write a black pixel
        error_values[x + 1][y    ] += (diff * 7) / 16
        error_values[x - 1][y + 1] += (diff * 3) / 16
        error_values[x    ][y + 1] += (diff * 5) / 16
        error_values[x + 1][y + 1] += (diff * 1) / 16
This uses these three buffers: pixel (the source image), dest (the destination image) and error_values (the error diffusion buffer).
The value of THRESHOLD would typically be set to 128.
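For reference, here is a minimal serial sketch of that pseudocode in C. This is not the original post's reference implementation: it assumes an 8-bit greyscale source and destination, an int error buffer of the same dimensions initialised to zero, and it simply skips the edge pixels rather than handling them specially.

#define THRESHOLD 128

/* Serial Floyd-Steinberg dither: greyscale in, black-and-white out. */
static void fs_dither_serial(const unsigned char *pixel,        /* source image           */
                             unsigned char       *dest,         /* destination image      */
                             int                 *error_values, /* error diffusion buffer */
                             int width, int height)
{
    for (int y = 0; y < height - 1; y++)            /* top to bottom */
    {
        for (int x = 1; x < width - 1; x++)         /* left to right */
        {
            int val = pixel[(y * width) + x] + error_values[(y * width) + x];
            int diff;

            if (val > THRESHOLD)
            {
                diff = val - THRESHOLD;
                dest[(y * width) + x] = 0xff;       /* Write a white pixel */
            }
            else
            {
                diff = val;
                dest[(y * width) + x] = 0x00;       /* Write a black pixel */
            }

            /* Distribute the quantization error to neighbouring pixels */
            error_values[(y * width) + x + 1]       += (diff * 7) / 16;
            error_values[((y + 1) * width) + x - 1] += (diff * 3) / 16;
            error_values[((y + 1) * width) + x]     += (diff * 5) / 16;
            error_values[((y + 1) * width) + x + 1] += (diff * 1) / 16;
        }
    }
}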
So I hope you can see the problem here. We can’t simply assign each pixel’s calculation to an independent work item because we cannot guarantee the order that work items will run. In OpenCL the order of execution of work items – even the order of execution of work groups – is entirely undefined. As we progress left to right across the image, and then line by line down the image, each pixel is dependent on the output of 4 previous pixel calculations.
So is there any hope for parallelization here? On its own perhaps this algorithm is better tackled by the CPU. But imagine the Floyd-Steinberg filter was part of a filter chain, where there was significant benefit from running the other filters before and after this one on a GPU like the ARM® Mali™-T604.
Any move from GPU to CPU will require cache synchronisation, introducing a level of overhead
Here we would need two CPU/GPU synchronization points either side of the Floyd-Steinberg filter. These are potentially quite expensive. Not only do the caches need to be flushed back to main memory, but the CPU needs to be ready to take on this job, which could complicate other jobs the CPU might be doing. So if it was possible to get a GPU version running somehow, even if its processing time was a little slower than the CPU, there might still be some net benefit to the GPU implementation.
Let’s look at the algorithm again and see what we might be able to do. We can see that the only thing stopping an individual pixel being processed is whether its related error buffer value has been written to by all four pixels: the one to the left, and the three above as follows.
C2 depends on the results of four previous pixel calculations, B1, C1, D1 and B2
From the diagrams we can see that if we want to process pixel C2 we have to wait until B1, C1, D1 and B2 have finished, as these all write values into C2's error buffer location.
If we have a work item per pixel, each work item would have to keep checking whether that moment has arrived, and work items representing pixels quite low down in the image, or far to the right, could simply be left waiting for a long time. And if you fill up all the available threads you can run at the same time, and they're all waiting, you reach deadlock. Nothing will proceed. Not good.
What we need to do is to impose some order on all this… some good old-fashioned sequential programming alongside the parallelism. By serializing parts of the algorithm we can reduce the number of checks a work item would need to do before it can determine it is safe to continue. One way to do this is to assign an entire row to a single work item. That way we can ensure we process the pixels in a line from left to right. The work item processing the row of pixels below then only needs to check the progress of this work item: as long as it is two pixels ahead then it is safe to proceed with the next pixel. So we would have threads progressing across their row of pixels in staggered form:
Each thread processes a horizontal line of pixels and needs to be staggered as shown here
Of course there are a few wrinkles here. First of all we need to consider workgroups. Each workgroup – a collection of work items, with each work item processing a line – needs to be run in order. So the work items in the first workgroup need to process the top set of lines, the next workgroup needs to process the set of lines below this, and so on. But there's no guarantee that workgroups are submitted to the GPU in sequential order, so simply using the OpenCL function get_group_id – which returns the numerical offset of the current workgroup – won't do as a way of determining which set of lines is processed. Instead we can use OpenCL atomics: if the first work item in each workgroup atomically increments a global value, and that value is then used to determine the actual group of lines the workgroup processes, we can guarantee the lines will be processed in order as they progress down the image.
Here’s a diagram showing how workgroups would share the load within an image:
Each workgroup processes a horizontal band of pixels. In this case the workgroup size is 128, so the band height is 128 pixels, with each work item (thread) processing a single row of pixels.
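For illustration, here is a hedged host-side sketch of how such a dispatch might look for the kernel shown later, assuming the image height is a multiple of the workgroup size. The queue and kernel variables are hypothetical, the local progress buffer is assumed to be the sixth kernel argument, and error checking is omitted.

size_t local_size  = 128;            // One workgroup covers a band of 128 rows
size_t global_size = height;         // One work item per row of pixels

// Argument index 5 is the __local progress buffer: one marker per work item in the workgroup
clSetKernelArg(kernel, 5, local_size * sizeof(cl_uint), NULL);

clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, &local_size,
                       0, NULL, NULL);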
So for each line we need a progress marker so that the line below knows which pixel it is safe to calculate up to. A work item can then sit and wait for the line above if it needs to, ensuring no pixel proceeds until the error distribution values it needs have been written out.
Here’s the rough pseudo code for what the kernel needs to do…
is this the first work item in the workgroup?
{
    atomic increment the global workgroup rider
    initialize to zero the local progress markers
}

barrier    // All work items in this workgroup wait until this point is reached

from the global workgroup rider and the local work item id,
determine the line in the image we're processing

loop through the pixels in the line we're processing
{
    wait for the work item representing the line above to have completed
    enough pixels so we can proceed

    do the Floyd-Steinberg calculation for this pixel

    update the progress marker for this line
}
You may have spotted the next wrinkle in this solution. The local progress markers are fine for ensuring that individual lines don’t get ahead of themselves – with the exception of the first work item (representing the top line in the group of lines represented by this workgroup). This first line needs to only progress once the last line of the previous workgroup has got far enough along. So we need a way of holding markers for the last line of each workgroup as well. The wait for the first work item then becomes a special case, as does the update of the marker for the last line.
Here’s the initialisation part of the kernel code:
__kernel void fs2(__global uchar         *src,                // The source greyscale image buffer
                  __global uchar         *dst,                // The destination buffer
                  __global uchar         *err_buf,            // The distribution of errors buffer
                  __global uint          *workgroup_rider,    // A single rider used to create a unique workgroup index
                  __global volatile uint *workgroup_progress, // A buffer of progress markers for each workgroup
                  __local volatile uint  *progress,           // The local buffer for each workgroup
                  uint                    width)              // The width of the image
{
    __local volatile uint workgroup_number;

    /* We need to put the workgroups in some order. This is done by the first
       work item in the workgroup atomically incrementing the global workgroup
       rider. The local progress buffer - used by the work items in this
       workgroup - also needs initialising... */
    if (get_local_id(0) == 0)                  // A job for the first work item...
    {
        // Get the global order for this workgroup...
        workgroup_number = atomic_inc(workgroup_rider);

        // Initialise the local progress markers...
        for (int i = 0; i < get_local_size(0); i++)
            progress[i] = 0;
    }

    barrier(CLK_LOCAL_MEM_FENCE);              // Wait here so we know progress buffer and
                                               // workgroup_number have been initialised
Note the use of the 'volatile' keyword when defining some of the variables here. This hints to the compiler that these values can be changed by other threads, thereby avoiding certain optimisations that might otherwise be made.
The barrier in the code is also something to highlight. There are often better approaches than using barriers, typically involving some kind of custom semaphore system. The barrier here, however, is only used as part of the initialization of the kernel and is not used subsequently within the loop. Even so, I implemented a version that used a flag for each workgroup: the first work item set the flag once its setup phase was complete, and the other work items sat checking until the flag was set. It was a useful exercise, but didn't show any noticeable difference in performance.
With initialization done, it’s time to set up the loop that will traverse across the line of pixels:
    /* The area of the image we work on depends on the workgroup_number
       determined earlier. We multiply this by the workgroup size and add the
       local id index. This gives us the y value for the row this work item
       needs to calculate. Normally we would expect to use get_global_id to
       determine this, but can't here. */
    int y = (workgroup_number * get_local_size(0)) + get_local_id(0);

    int err;
    int sum;

    for (int x = 1; x < (width - 1); x++)   // Each work item processes a line
                                            // (ignoring 1st and last pixels)...
    {
        /* Need to ensure that the data in err_buf required by this work item
           is ready. To do that we need to check the progress marker for the
           line just above us. For the first work item in this workgroup, we
           get this from the global workgroup_progress buffer. For other work
           items we can peek into the progress buffer local to this workgroup.
           In each case we need to know that the previous line has reached
           2 pixels on from our own current position... */
        if (get_local_id(0) > 0)            // For all work items other than the first in this workgroup...
        {
            while (progress[get_local_id(0) - 1] < (x + 2));
        }
        else                                // For the first work item in this workgroup...
        {
            if (workgroup_number > 0)
                while (workgroup_progress[workgroup_number - 1] < (x + 2));
        }
At the top of each loop we need to ensure the line above has got far enough ahead of where this line is. The first item in the workgroup checks on the progress of the last line in the previous workgroup, whilst the other items check on the progress of the line above.
After that, we’re finally ready to do the Floyd-Steinberg calculation for the current pixel:
        sum = src[(width * y) + x] + err_buf[(width * y) + x];

        if (sum > THRESHOLD)
        {
            err = sum - THRESHOLD;
            dst[(width * y) + x] = 0xff;
        }
        else
        {
            err = sum;
            dst[(width * y) + x] = 0x00;
        }

        // Distribute the error values...
        err_buf[(width * y) + x + 1]       += (err * 7) / 16;
        err_buf[(width * (y + 1)) + x - 1] += (err * 3) / 16;
        err_buf[(width * (y + 1)) + x]     += (err * 5) / 16;
        err_buf[(width * (y + 1)) + x + 1] += (err * 1) / 16;
The final thing to do within the main loop is to set the progress markers to reflect that this pixel is done:
        /* Set the progress marker for this line...
           If this work item is the last in the workgroup we set the global
           marker so the first item in the next workgroup will pick this up.
           For all other work items we set the local progress marker. */
        if (get_local_id(0) == (get_local_size(0) - 1))   // Last work item in this workgroup?
            workgroup_progress[workgroup_number] = x;
        else
            progress[get_local_id(0)] = x;
    }
There’s one more thing to do. We need to set the progress markers to just beyond the width of the image so subsequent lines can complete:
    /* Although this work item has now finished, subsequent lines need to be
       able to continue to their ends. So the relevant progress markers need
       bumping up... */
    if (get_local_id(0) == (get_local_size(0) - 1))   // Last work item in this workgroup?
        workgroup_progress[workgroup_number] = width + 2;
    else
        progress[get_local_id(0)] = width + 2;
}
Before I talk about performance – and risk getting too carried away – it’s worth considering again the following line:
while (progress[get_local_id(0) - 1] < (x + 2));
This loop keeps a work item waiting until a progress marker is updated, ensuring the processing for this particular line doesn’t proceed until it’s safe to do so. The progress marker is updated by the thread processing the line above. Other than the use of barriers, inter-thread communication is not specifically ruled out in the specification for OpenCL 1.1 or 1.2. But neither is it specifically advocated. In other words, it is a bit of a grey area. As such, there is a risk that behaviour might vary across different platforms.
Take wavefront (or "warp") based GPUs, for example. With wavefront architectures, threads (work items) are clustered together into small groups, each sharing a program counter. This means threads within such a group cannot diverge: they can go dormant whilst other threads follow a different conditional path, but ultimately they remain in lock-step with each other. This has some advantages when it comes to scalability, but in this case the line above can stall permanently: if a thread is waiting on another thread in the same warp, the two can never progress.
The Mali-T600, -T700 and -T800 series of GPUs are not wavefront based. With each thread having its own program counter, threads are entirely independent of each other, so the technique above runs fine. But it should be easy enough to accommodate wavefront architectures by replacing the 'while' loop with a conditional that determines whether the thread can continue. Compare the original loop structure with a wavefront-friendly alternative:
Original structure:

    for (x = 1; x < (width - 1); x++)
    {
        wait for line above to be >= 2 pixels ahead
        process pixel x
        update progress for this line
    }

Wavefront-friendly structure:

    for (x = 1; x < (width - 1); )
    {
        if line above is >= 2 pixels ahead
        {
            process pixel x
            update progress for this line
            x++
        }
    }
The second version allows the loop to iterate regardless of whether the previous line is ready or not. Note that in this version, x only increments once the pixel has actually been processed.
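Sketched against the kernel above, that wavefront-friendly loop might look something like this. This is an assumption of how it could be written, not code from a tested implementation:

    for (int x = 1; x < (width - 1); )
    {
        // Is the line above far enough ahead? Check the local or global marker as before...
        int ready;
        if (get_local_id(0) > 0)
            ready = (progress[get_local_id(0) - 1] >= (x + 2));
        else
            ready = (workgroup_number == 0) ||
                    (workgroup_progress[workgroup_number - 1] >= (x + 2));

        if (ready)
        {
            // ...Floyd-Steinberg calculation and error distribution for pixel x...
            // ...update the progress marker for this line...
            x++;                     // Only advance once the pixel has been processed
        }
    }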
It’s also worth mentioning that as all the work items in the same wavefront are in lock-step by design, once the work items have been started further checking between the threads would be unnecessary. It might be feasible to optimise a kernel for a wavefront-based GPU to take advantage of this.
Technically, the above worked, producing an identical result to the CPU reference implementation. But what about performance? The OpenCL implementation ran between 3 and 5 times faster than the CPU implementation. So there is a really useful uplift from the GPU version. It would also be possible to create a multithreaded version on the CPU, and this would potentially provide some improvement. But remember that if this filter stage was part of a chain running on the GPU, with the above solution we can now slot this right in amongst the others, further freeing the CPU and removing those pesky sync points.
Moving the CPU kernel to the GPU will remove the need for cache synchronization in the above example
And what about further room for improvement? There are all sorts of optimisation techniques we would normally advocate, and those steps have not been explored in detail here. But just for example, the byte read and writes could be optimised to load and store several bytes in one operation. There are links at the bottom of this post to some articles and other resources which go into these in more detail. With a cursory glance however it doesn’t appear that many of the usual optimisation suspects would apply easily here… but nevertheless I would be fascinated if any of you out there can see some interesting ways to speed this up further. In the meantime it is certainly encouraging to see the improvement in speed which the Mali GPU brings to the algorithm.
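To give one hedged example of the kind of thing that might be possible: single uchar reads and writes could, in principle, be replaced with OpenCL vector loads and stores, as in the fragment below. Whether this can be reconciled with the dependent error-distribution writes of this particular algorithm is not something I have explored here.

    // Read 16 source pixels in one operation rather than 16 single-byte reads...
    uchar16 pixels = vload16(0, &src[(width * y) + x]);

    // ...process the 16 values...

    // ...then write 16 destination pixels back in a single operation
    vstore16(pixels, 0, &dst[(width * y) + x]);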
For reference, the performance figures above were measured on the following platform:

CPU: ARM Cortex®-A15 running at 1.7GHz
GPU: ARM Mali-T604 MP4 running at 544MHz
For more information about optimising compute on Mali GPUs, please see the various tutorials and documents listed here:
GPU Compute, OpenCL and RenderScript Tutorials - Mali Developer Center
This work by ARM is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. However, in respect of the code snippets included in the work, ARM further grants to you a non-exclusive, non-transferable, limited license under ARM’s copyrights to Share and Adapt the code snippets for any lawful purpose (including use in projects with a commercial purpose), subject in each case also to the general terms of use on this site. No patent or trademark rights are granted in respect of the work (including the code snippets).
Hello Tim,
What about the workgroup size in this case?
I tested on an ARM Mali-G72. If the workgroup size is 1, it runs well; otherwise, here's the error message:
Time used on device for execution:-3069543701242889 us.
When the first workgroup stalls, the other workgroups are dispatched and run in dead loops until the condition is ready. So the first group may NOT get the chance to be scheduled, and the other groups' condition is never ready. Deadlock happens.
That's my view of why it fails. What's your opinion?
Thanks in advance!