Optimizing TIFF image processing using AARCH64 (64-bit) Neon

October 13, 2022

15 minute read time.

Overview

This guide shows how 64-bit Neon technology can be used to improve performance in image processing applications. We use 64-bit Neon intrinsics to optimize different aspects of the open-source Tag Image File Format (TIFF) image processing library, libTIFF.

If you are not familiar with Neon, then we recommend reading this page on Arm’s website as an introduction.

Comparing 32-bit and 64-bit NEON

Previous generations of Arm instruction sets (e.g. the older Armv7-A) included 32-bit Neon instructions. However, the 64-bit variants include both 32-bit and 64-bit execution states, which means both the previous generation and the newer Neon instructions are available depending on which mode a process is being executed.

Note that the Neon unit actually operates on 128-bit registers allowing for more data to be processed than the rest of the processor.

Why libTIFF?

We chose to optimize libTIFF version 4.4.0 because it is open-source, which means we have access to the source code. It is a library which is used in many larger software projects, including the Android operating system as well as the Chrome web browser. Being an older library means that there are many areas that can be optimized for newer generations of hardware.

You may download the source code here.

And the build instructions are shown here.

What test platforms did we use?

It is noteworthy that Neon performance improvement can vary depending on the type of CPU cores and operating systems and configurations used. For the purpose of testing the optimizations presented in this guide, we have used the following smartphones as target platforms:

Galaxy S7, model SM-G930F released in 2016 with Android 7. This handset runs a Exynos 8890 Octa-Core chipset which includes Cortex-A53 cores running at up to 2.3 GHz.
Pixel 4 XL, released in 2019 with Android 10. This handset runs a Snapdragon 855 Octa-core chipset which includes Qualcomm® Kryo 485 cores running at up to 2.8 GHz.

To ensure that our measurements were not being affected by the Kernel Scheduler, which moves application processes between slower or faster cores transparently, we enabled only one of the faster cores and set the “Kernel Scaling Governor” to “Performance” mode. This forced the frequency to the maximum amount possible. We did this on the Galaxy S7 device, but performed the tests on the Pixel 4 XL without this change. The process to do this may involve building the Android operating system itself from source code and is outside the scope of this article.

How did we measure the performance and what images were used for testing?

We have used a custom Android app with two separate images. Each image causes different areas of code within libtiff to run. We process the images using the original code and our Neon-optimized version. Once the processing is complete for both images and both codes, the app displays performance statistics and the comparison.

Image 1 is a “flipped” grayscale image with black and white text. It is stored as 8-bit per pixel. Its orientation is “top right”, so the first pixel is actually the top-right of the image. This image makes use of two Neon optimizations: one for converting 8BPP to 32BPP and one for the horizontal flip.

Image 2 is a similar image, but with varying background colors behind the text. It is stored in CMYK format and makes use of the Neon optimization for converting CMYK to RGBA.

Of course, compiler options matter too when building software libraries from source. In our case, we used libTIFF’s default build options which use the -O2 optimization flag. The build command line would be too complicated to include in full; however, we have a simplified version for one of the source file here:

$ aarch64-linux-android24-clang -O2 -c SOURCE_FILE.c -fPIC -o SOURCE_FILE.o

Which areas of code did we optimize?

To start with, we picked the following areas of code to optimize:

Optimization 1: Convert Single Channel color format to RGBA color format.
Optimization 2: Flipping the entire image.
Optimization 3: Convert CMYK color format to RGBA color format.

Optimization 1: Convert Single Channel color format to RGBA color format

Neon intrinsics used for this optimization: vld1q_lane_u32, vst1q_u32

Image 1 is black and white. In this case, the data is stored in 8BPP (grayscale). When TIFFImageRGBARead extracts the image, the pixels are converted to RGBA. In this case, we are simply loading the 8 bits-per-pixel image, expanding to 32-bits per pixel (8-bits per channel) and displaying the image in the UI.

It is worth noting that the variable “BWmap” is a lookup table that converts 8-bit grayscale values to 32-bit values.

libTiff also uses this lookup table as an opportunity to easily provide inversion of grayscale images too. With our sample image, it takes an 8-bit value B and returns a 32-bit value with B copied to each 8-bit channel - with 255 inserted into the alpha channel.

Here is the original code.

static void putgreytile(TIFFRGBAImage* img, uint32_t* cp,
                        uint32_t x, uint32_t y, uint32_t w, uint32_t h, 
                        int32_t fromskew, int32_t toskew, unsigned char* pp)
{
  int samplesperpixel = img->samplesperpixel;
  uint32_t** BWmap = img->BWmap;
  // 1. For all lines of the image..
  for( ; h > 0; --h) {
    // 2. For all pixels across a line..
    for (x = w; x > 0; --x) {
      // 3. Convert the 8-bit pixel value into a 32-bit RGBA pixel (1 pixel at a time)
      *cp++ = BWmap[*pp][0];
      pp += samplesperpixel;
    }
    cp += toskew;
    pp += fromskew;
  }
}

And here is the 64-bit Neon optimized version. Note that for simplicity we will optimize the most frequent use case where the values “toskew” and “fromskew” are zero.

// 1. Does the image match our requirements for optimization and are we to use Neon? (tif_packbitsmode is a tag that is set by the app). We also check to see if there are a multiple of 4 pixels - this is because we want to process four pixels in one iteration. Each pixel is 32-bits so four pixels fit nicely into the 128-bit registers of Neon in AArch64.
uint32_t n = h * w;
if (img->tif->tif_packbitsmode == PackBits_Neon
    && toskew == 0
    && fromskew == 0
    && (n & 0x3) == 0)
{
  n >>= 2;
  uint32x4_t p = vdupq_n_u32(0);
  while(n-- > 0) {
    // 2. Use the BWmap lookup table to convert the pixel to 32-bit value.
    //    Each 32-bit pixel is loaded into one of 4 lanes of the 128-bit register.
    p = vld1q_lane_u32(BWmap[*pp], p, 0);
    pp += samplesperpixel;
    p = vld1q_lane_u32(BWmap[*pp], p, 1);
    pp += samplesperpixel;
    p = vld1q_lane_u32(BWmap[*pp], p, 2);
    pp += samplesperpixel;
    p = vld1q_lane_u32(BWmap[*pp], p, 3);
    pp += samplesperpixel;
    // 3. Write 4 pixels out with one single 128-bit write operation.
    vst1q_u32(cp, p);
    cp += 4;
  }
}

Results for Optimization 1 (Converting Grayscale to RGBA):

For 1000 iterations	Original code (ms)	Optimized code (ms)	Improvement
Galaxy S7	6.09	3.16	1.92x
Pixel 4 XL	4.27	2.53	1.68x

Optimization 2: Flipping the entire image

Neon instructions used in this optimization: vcreate_u8, vcombine_u8, vqtbl1q_u8, vld1q_u8, vst1q_u8

TIFF allows images to be stored in any orthogonal layout. Essentially, the first pixel in memory could be the top-left, top-right, bottom-left or bottom-right of the image. If an image within the TIFF is stored right-to-left and we want the pixels in a left-to-right layout (as we want with our Android Image View), then we need to perform a horizontal flip. This is very similar to a flipping operation that you might perform on your images or photos on your mobile device.

Here is the original code. It consists of two small loops that reverse the ordering of the pixels line by line. It is important to note that each pixel is 32-bits (RGBA color format), so the lines of pixels are reversed 32-bits at a time (rather than byte by byte). This ensures the order of the color channels is not disturbed.

uint32_t line;
// 1. For all lines in the image (a line consists of “w” (width) 32-bit pixels)
for (line = 0; line < h; line++) {
  uint32_t *left = raster + (line * w);
  uint32_t *right = left + w - 1;
  // 2. Swap the pixel at the beginning of the line with the pixel at the end,
  //    then work inwards, swapping as we go.
  while ( left < right ) {
    // 3. Swap two pixels on the same line
    uint32_t temp = *left;
    *left = *right;
    *right = temp;
    left++;
    right--;
  }
}

And here is the 64-bit Neon optimized version. For simplicity, we only apply the Neon optimization if the image is a multiple of 8 pixels wide.

// For all lines of the image..
for (line = 0; line < h; line++) {
    uint32_t *left = raster + (line * w);
    uint32_t *right = left + w;
    right -= 4;
    // create an index table for table instruction (vqtbl1q_u8 below).
    // these indices will swap the pixel ordering but
    // preserve the channel ordering within those pixels (remembering
    // that each pixel is 4 bytes)
    // If we have four RGBA pixels W, X, Y and Z we want to swap them to Z, Y, X and W.
    // This table helps us achieve this effect (on a little-endian Arm platform).
    // We list the decimal values below to make reading the values easier:
    // uint8_t reverseIndices[16] = {
    //   /* Fetch pixel Z’s RGBA components (indices R=12, G=13, B=14 and A=15)
    //      and place them in indices 0 to 3 instead: */
    //   [ 0] 0x0C = 12, // Z.Red   = 12th byte in the input
    //   [ 1] 0x0D = 13, // Z.Green = 13th byte in the input
    //   [ 2] 0x0E = 14, // Z.Blue  = 14th byte in the input
    //   [ 3] 0x0F = 15, // Z.Alpha = 15th byte in the input
    //   /* Fetch pixel Y’s RGBA components: */
    //   [ 4] 0x08 =  8,
    //   [ 5] 0x09 =  9,
    //   [ 6] 0x0A = 10,
    //   [ 7] 0x0B = 11,
    //   /* Fetch pixel X’s RGBA components: */
    //   [ 8] 0x04 =  4,
    //   [ 9] 0x05 =  5,
    //   [10] 0x06 =  6,
    //   [11] 0x07 =  7,
    //   /* Fetch pixel W’s RGBA components: */
    //   [12] 0x00 =  0,
    //   [13] 0x01 =  1,
    //   [14] 0x02 =  2,
    //   [15] 0x03 =  3 };
    uint8x8_t reverse1 = vcreate_u8(0x0B0A09080F0E0D0Cull);
    uint8x8_t reverse2 = vcreate_u8(0x0302010007060504ull); 
    uint8x16_t reverseIndices = vcombine_u8(reverse1, reverse2);
    // each time through the loop we will swap 4 pixels from the left with
    // 4 pixels from the right (while also reversing the order within each
    // batch of 4 pixels)
    while ( left < right ) {
        // load pixels from the left and reverse their order
        uint8x16_t leftPixels = vld1q_u8((uint8_t*)left);
        uint8x16_t reversedLeftPixels = vqtbl1q_u8(leftPixels, reverseIndices);
        // load pixels from the right and reverse their order
        uint8x16_t rightPixels = vld1q_u8((uint8_t*)right);
        uint8x16_t reversedRightPixels = vqtbl1q_u8(rightPixels, reverseIndices);
        // copy the right-hand pixels to the left and the left-hand pixels
        // to the right
        vst1q_u8((uint8_t*)left, reversedRightPixels);
        vst1q_u8((uint8_t*)right, reversedLeftPixels);
        left += 4;
        right -= 4;
    }
}

Results for Optimization 2 (Image Flip operation):

For 1000 iterations	Original code (ms)	Optimized code (ms)	Improvement
Galaxy S7	3.12	2.71	1.15x
Pixel 4 XL	8.25	6.28	1.31x

Optimization 3: Convert CMYK color format to RGBA color format

Neon intrinsics used for this optimization: vdupq_n_u16, vld1_u8, vmovl_u8, vsubq_u16, vget_low_u16, vget_high_u16, vmul_lane_u16, vget_lane_u16.

Image 2 stores pixels in CMYK (Cyan, Magenta, Yellow and Black) format. CMYK is used in printing and popular image editing programs often support loading and saving files as CMYK. TIFF supports automatic conversion of CMYK to RGBA using the TIFFReadRGBA interface (though the alpha channel A is always output as 255).

To convert CMYK to RGB, libtiff uses the following calculations:

R = 255 x (1 - C) x (1 - K)

G = 255 x (1 - M) x (1 - K)

B = 255 x (1 - Y) x (1 - K)

As with the other optimizations, we have inserted an if-statement to check whether to use the original code or the Neon code. For clarity, we do not show the if-statement here.

// The original code uses a macro UNROLL8 to unroll code and
// processes 8 pixels each iteration.
// The macro actually contains a for loop that iterates over a single line
// (the “w” parameter is the width of the image).
#define UNROLL8(w, op1, op2) { \
  uint32_t _x; \
// M1. For the whole width of the image..
  for (_x = w; _x >= 8; _x -= 8) { \
  op1; \
// M2. Repeat for 8 pixels..
  REPEAT8(op2); \
  } \
// M3. For any pixels left over..
  if (_x > 0) { \
    op1; \
    CASE8(_x,op2); \
  } \
} // end of macro UNROLL8()

// 1. For each line of the image (“h” is image height)
for( ; h > 0; --h) {
  // 2. Convert 8 pixels from CMYK to RGBA (A always 255)
  UNROLL8(w, NOP, {
    k = 255 - pp[3];
    r = (k*(255-pp[0]))/255;
    g = (k*(255-pp[1]))/255;
    b = (k*(255-pp[2]))/255;
    // 3. Write each pixel to memory one at a time
    *cp++ = PACK(r, g, b);
    pp += samplesperpixel
  });
  cp += toskew;
  pp += fromskew;
}

And here is the 64-bit Neon optimized version. Again, for simplicity, we only optimized the case where skewing is not required. In other words, the values “toskew” and “fromskew” are zero.

// We will loop over all pixels of the image
uint32_t np = w * h; 
uint32_t* endp = cp + np;

// indices for VTBL that will duplicate each pixels K value
uint8x8_t dupK1 = vcreate_u8(0xff06ff06ff06ff06ull);
uint8x8_t dupK2 = vcreate_u8(0xff0eff0eff0eff0eull);
uint8x16_t kindices = vcombine_u8(dupK1, dupK2);

// indices that will grab the final results
uint8_t resultIndices[16] = {0,1,2,-1,4,5,6,-1,8,9,10,-1,12,13,14,-1};
while(cp < endp) {
    // 16 copies of 255
    uint8x16_t v255 = vdupq_n_u8 (255);
    // load 4 pixels (each pixel is 4 bytes with the CMYK values)
    uint8x16_t src_u8 = vld1q_u8(pp);
    // perform (255 - x) on each component
    // each vsubl is working on 2 pixels
    uint16x8_t subl = vsubl_u8(vget_low_u8(v255), vget_low_u8(src_u8));
    uint16x8_t subh = vsubl_high_u8(v255, src_u8);
    // duplicate k element from each pixel in subl
    uint8x16_t kl = vqtbl1q_u8(subl, kindices);
    uint8x16_t kh = vqtbl1q_u8(subh, kindices);
    // multiply (255 - x) by (255 - k)
    uint16x8_t ml = vmulq_u16(kl, subl);
    uint16x8_t mh = vmulq_u16(kh, subh);
    // the results we need are in the low 8 bits of the uint16 elements
    // combine results and result (throwing away all the upper halves of all the uint16)
    uint8x16_t idx = vld1q_u8 (resultIndices);
    uint16x8_t resultl = ml / 255; 
    uint16x8_t resulth = mh / 255; 
    uint8x16_t packed = vuzp1q_u8 (vreinterpretq_u8_u16 (resultl),
                                   vreinterpretq_u8_u16 (resulth));
    // wherever the index is -1, we'll take the value from v255 (we return 255 in alpha)
    uint8x16_t pixels = vqtbx1q_u8 (v255, packed, idx);
    // store the four RGBA pixels and advance the pointers/counters
    vst1q_u8((uint8_t*)cp, pixels);
    cp += 4;
    pp += samplesperpixel * 4;
}

Results for Optimization 3 (converting CMYK to RGBA):

For 1000 iterations	Original code (ms)	Optimized code (ms)	Improvement
Galaxy G7	13.31	7.70	1.72x
Pixel 4 XL	11.80	5.89	2.00x

Automatic compiler optimizations

It’s worth noting that the compiler options used to build a library can have a big effect on the performance of your program. This includes some automatic Neon generation. To demonstrate this, we have tried building and testing the original non-Neon version of libTIFF with optimization option -O0 (no optimization at all) and compared this build with the default -O2 (most suitable optimizations applied).

Here are the results on the Galaxy S7:

	using -O0 (ms)	using -O2 (ms)	Improvement
Grayscale to RGBA	17.97	6.09	3.5x
Flip	10.84	3.12	3.4x
CMYK to RGBA	46.95	13.31	3.5x

In addition to the above improvements, we tried to optimize the PackBits compression formula used in some TIFF images. In this case, our optimization was only showing minimal improvements where the image data does not lend itself well to processing many bytes at a time. This was likely due to the automatic Neon generation for the main case already being optimal.

Final remarks

As you can see, results vary on different types of CPUs. However, there are big gains to be made by using Neon. It is worth pointing out that manually optimizing code in this way may not always be a good idea. You must also pay attention to the data or images being processed, as well as the optimization capabilities built into compilers.

Ramin Zaghi is a former engineer at Arm who is now the Chief Debugging Officer and Founder of VISUALSILICON.

You can learn more on the topic at the 'Efficient video encoding on the cloud with Arm servers' masterclass at Arm DevSummit 2022.

[CTAToken URL = "https://devsummit.arm.com/flow/arm/devsummit22/sessions-catalog/page/sessions/session/1656444468881001mmLQ" target="_blank" text="Learn more" class ="green"]

Architectures and Processors blog

Using SVE in C#

Alan Hayward

.NET 9 introduces SVE support on Arm, allowing users to write simplified vectorised code. This blog gives examples in C# and compares it to C++.
- November 20, 2024
Part 3: Enabling PAC and BTI on AArch64 for Linux

Bill Roberts

Supporting C++ style exceptions and DWARF for Pointer Authentication Codes (PAC) signed pointers.
- November 20, 2024
Part 2: Enabling PAC and BTI on AArch64 for Linux

Bill Roberts

Utilizing Pointer Authentication Codes (PAC) and Branch Target Instructions (BTI) together and optimizations in instruction counts.
- November 19, 2024

AI and ML blog

Announcements

Architectures and Processors blog

Automotive blog

Embedded blog

Graphics, Gaming, and VR blog

High Performance Computing (HPC) blog

Infrastructure Solutions blog

Internet of Things (IoT) blog

Operating Systems blog

SoC Design and Simulation blog

Tools, Software and IDEs blog