This guide shows how 64-bit Neon technology can be used to improve performance in image processing applications. We use 64-bit Neon intrinsics to optimize different aspects of the open-source Tag Image File Format (TIFF) image processing library, libTIFF.
If you are not familiar with Neon, then we recommend reading this page on Arm’s website as an introduction.
Previous generations of the Arm architecture (for example, the older Armv7-A) included 32-bit Neon instructions. The 64-bit Armv8-A architecture provides both a 32-bit and a 64-bit execution state (AArch32 and AArch64), so either the previous-generation or the newer Neon instructions are available, depending on the state in which a process executes.
Note that the Neon unit operates on 128-bit registers, so a single Neon instruction can process more data than an instruction working on the 64-bit general-purpose registers.
We chose to optimize libTIFF version 4.4.0 because it is open-source, which means we have access to the source code. It is a library which is used in many larger software projects, including the Android operating system as well as the Chrome web browser. Being an older library means that there are many areas that can be optimized for newer generations of hardware.
You may download the source code here.
And the build instructions are shown here.
Note that the performance improvement from Neon can vary depending on the type of CPU cores, the operating system and the configuration used. To test the optimizations presented in this guide, we used the Galaxy S7 and Pixel 4 XL smartphones as target platforms.
To ensure that our measurements were not affected by the kernel scheduler, which transparently moves application processes between slower and faster cores, we enabled only one of the faster cores and set the “Kernel Scaling Governor” to “Performance” mode, which forces the core to its maximum frequency. We did this on the Galaxy S7 only; the tests on the Pixel 4 XL were run without this change. The process may involve building the Android operating system from source code and is outside the scope of this article.
We used a custom Android app with two separate images. Each image exercises different areas of code within libtiff. We process each image with the original code and with our Neon-optimized version. Once processing is complete for both images and both versions of the code, the app displays performance statistics and a comparison.
Image 1 is a “flipped” grayscale image with black and white text. It is stored at 8 bits per pixel (8BPP). Its orientation is “top right”, so the first pixel in memory is the top-right pixel of the image. This image makes use of two Neon optimizations: one for converting 8BPP to 32BPP and one for the horizontal flip.
Image 2 is a similar image, but with varying background colors behind the text. It is stored in CMYK format and makes use of the Neon optimization for converting CMYK to RGBA.
Of course, compiler options matter too when building software libraries from source. In our case, we used libTIFF’s default build options, which use the -O2 optimization flag. The full build command line is too long to include here; however, this is a simplified version for one of the source files:
$ aarch64-linux-android24-clang -O2 -c SOURCE_FILE.c -fPIC -o SOURCE_FILE.o
To start with, we picked the following areas of code to optimize:
Neon intrinsics used for this optimization: vdupq_n_u32, vld1q_lane_u32, vst1q_u32
Image 1 is black and white, with its data stored at 8 bits per pixel (grayscale). When TIFFImageRGBARead extracts the image, the pixels are converted to RGBA: we load each 8-bit pixel, expand it to 32 bits per pixel (8 bits per channel) and display the image in the UI.
It is worth noting that the variable “BWmap” is a lookup table that converts 8-bit grayscale values to 32-bit values.
libtiff also uses this lookup table as an easy way to invert grayscale images when required. With our sample image, it takes an 8-bit value B and returns a 32-bit value with B copied to each of the three color channels and 255 inserted into the alpha channel.
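The lookup just described can be sketched in portable C. This is a minimal illustration, not libtiff's actual table-building code: the names `grey_to_rgba` and `build_grey_map` are hypothetical, the table is flat rather than an array of pointers like `BWmap`, and the byte layout assumes the packed pixel holds red in the low byte and alpha in the high byte.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: expand one 8-bit grey value B into a 32-bit pixel
 * with B in the red, green and blue channels and 255 in the alpha channel.
 * If invert is non-zero, the grey value is inverted first. */
static uint32_t grey_to_rgba(uint8_t b, int invert)
{
    uint32_t v = invert ? (uint32_t)(255 - b) : (uint32_t)b;
    return v | (v << 8) | (v << 16) | (0xFFu << 24);
}

/* Hypothetical helper: fill a 256-entry table so that the per-pixel
 * conversion becomes a single array lookup. */
static void build_grey_map(uint32_t map[256], int invert)
{
    assert(map != 0);
    for (int i = 0; i < 256; ++i)
        map[i] = grey_to_rgba((uint8_t)i, invert);
}
```

Because inversion is folded into the table, the per-pixel loop stays identical for normal and inverted images.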
Here is the original code.
    static void putgreytile(TIFFRGBAImage* img, uint32_t* cp,
                            uint32_t x, uint32_t y, uint32_t w, uint32_t h,
                            int32_t fromskew, int32_t toskew, unsigned char* pp)
    {
        int samplesperpixel = img->samplesperpixel;
        uint32_t** BWmap = img->BWmap;

        // 1. For all lines of the image..
        for ( ; h > 0; --h) {
            // 2. For all pixels across a line..
            for (x = w; x > 0; --x) {
                // 3. Convert the 8-bit pixel value into a 32-bit RGBA pixel
                //    (1 pixel at a time)
                *cp++ = BWmap[*pp][0];
                pp += samplesperpixel;
            }
            cp += toskew;
            pp += fromskew;
        }
    }
And here is the 64-bit Neon optimized version. Note that for simplicity we will optimize the most frequent use case where the values “toskew” and “fromskew” are zero.
    // 1. Does the image match our requirements for optimization and are we to
    //    use Neon? (tif_packbitsmode is a tag that is set by the app). We also
    //    check to see if there are a multiple of 4 pixels - this is because we
    //    want to process four pixels in one iteration. Each pixel is 32-bits so
    //    four pixels fit nicely into the 128-bit registers of Neon in AArch64.
    uint32_t n = h * w;
    if (img->tif->tif_packbitsmode == PackBits_Neon &&
        toskew == 0 && fromskew == 0 && (n & 0x3) == 0)
    {
        n >>= 2;
        uint32x4_t p = vdupq_n_u32(0);
        while (n-- > 0) {
            // 2. Use the BWmap lookup table to convert the pixel to 32-bit value.
            //    Each 32-bit pixel is loaded into one of 4 lanes of the 128-bit register.
            p = vld1q_lane_u32(BWmap[*pp], p, 0);
            pp += samplesperpixel;
            p = vld1q_lane_u32(BWmap[*pp], p, 1);
            pp += samplesperpixel;
            p = vld1q_lane_u32(BWmap[*pp], p, 2);
            pp += samplesperpixel;
            p = vld1q_lane_u32(BWmap[*pp], p, 3);
            pp += samplesperpixel;
            // 3. Write 4 pixels out with one single 128-bit write operation.
            vst1q_u32(cp, p);
            cp += 4;
        }
    }
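The Neon version above is only used when the pixel count is a multiple of four. One common alternative, shown here as a portable sketch rather than as the approach taken in this guide, is to process complete groups of four pixels and finish any remainder one pixel at a time. The function name `expand_grey` and the flat `map` table are assumptions made for illustration; in a Neon build, the body of the group loop would be the four lane loads and the single 128-bit store.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: convert n 8-bit grey pixels to 32-bit pixels using a
 * 256-entry lookup table, four pixels per main-loop iteration. */
static void expand_grey(uint32_t *dst, const uint8_t *src, size_t n,
                        const uint32_t map[256])
{
    assert(dst != NULL && src != NULL && map != NULL);

    size_t blocks = n / 4;   /* number of complete groups of four pixels */
    size_t tail = n % 4;     /* pixels left over after the last full group */

    while (blocks-- > 0) {
        /* One group: this is the part a Neon build would vectorize. */
        dst[0] = map[src[0]];
        dst[1] = map[src[1]];
        dst[2] = map[src[2]];
        dst[3] = map[src[3]];
        dst += 4;
        src += 4;
    }
    while (tail-- > 0)       /* scalar clean-up for the remaining pixels */
        *dst++ = map[*src++];
}
```

The trade-off is a slightly longer function in exchange for the fast path applying to images of any size.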
Neon intrinsics used for this optimization: vcreate_u8, vcombine_u8, vqtbl1q_u8, vld1q_u8, vst1q_u8
TIFF allows images to be stored in any orthogonal layout. Essentially, the first pixel in memory could be the top-left, top-right, bottom-left or bottom-right of the image. If an image within the TIFF is stored right-to-left and we want the pixels in a left-to-right layout (as we want with our Android Image View), then we need to perform a horizontal flip. This is very similar to a flipping operation that you might perform on your images or photos on your mobile device.
Here is the original code. It consists of two small loops that reverse the ordering of the pixels line by line. It is important to note that each pixel is 32-bits (RGBA color format), so the lines of pixels are reversed 32-bits at a time (rather than byte by byte). This ensures the order of the color channels is not disturbed.
    uint32_t line;
    // 1. For all lines in the image (a line consists of “w” (width) 32-bit pixels)
    for (line = 0; line < h; line++) {
        uint32_t *left = raster + (line * w);
        uint32_t *right = left + w - 1;
        // 2. Swap the pixel at the beginning of the line with the pixel at the end,
        //    then work inwards, swapping as we go.
        while ( left < right ) {
            // 3. Swap two pixels on the same line
            uint32_t temp = *left;
            *left = *right;
            *right = temp;
            left++;
            right--;
        }
    }
And here is the 64-bit Neon optimized version. For simplicity, we only apply the Neon optimization if the image width is a multiple of 8 pixels.
    // For all lines of the image..
    for (line = 0; line < h; line++) {
        uint32_t *left = raster + (line * w);
        uint32_t *right = left + w;
        right -= 4;

        // create an index table for table instruction (vqtbl1q_u8 below).
        // these indices will swap the pixel ordering but
        // preserve the channel ordering within those pixels (remembering
        // that each pixel is 4 bytes)
        // If we have four RGBA pixels W, X, Y and Z we want to swap them to Z, Y, X and W.
        // This table helps us achieve this effect (on a little-endian Arm platform).
        // We list the decimal values below to make reading the values easier:
        // uint8_t reverseIndices[16] = {
        //   /* Fetch pixel Z’s RGBA components (indices R=12, G=13, B=14 and A=15)
        //      and place them in indices 0 to 3 instead: */
        //   [ 0] 0x0C = 12, // Z.Red   = 12th byte in the input
        //   [ 1] 0x0D = 13, // Z.Green = 13th byte in the input
        //   [ 2] 0x0E = 14, // Z.Blue  = 14th byte in the input
        //   [ 3] 0x0F = 15, // Z.Alpha = 15th byte in the input
        //   /* Fetch pixel Y’s RGBA components: */
        //   [ 4] 0x08 = 8,
        //   [ 5] 0x09 = 9,
        //   [ 6] 0x0A = 10,
        //   [ 7] 0x0B = 11,
        //   /* Fetch pixel X’s RGBA components: */
        //   [ 8] 0x04 = 4,
        //   [ 9] 0x05 = 5,
        //   [10] 0x06 = 6,
        //   [11] 0x07 = 7,
        //   /* Fetch pixel W’s RGBA components: */
        //   [12] 0x00 = 0,
        //   [13] 0x01 = 1,
        //   [14] 0x02 = 2,
        //   [15] 0x03 = 3 };
        uint8x8_t reverse1 = vcreate_u8(0x0B0A09080F0E0D0Cull);
        uint8x8_t reverse2 = vcreate_u8(0x0302010007060504ull);
        uint8x16_t reverseIndices = vcombine_u8(reverse1, reverse2);

        // each time through the loop we will swap 4 pixels from the left with
        // 4 pixels from the right (while also reversing the order within each
        // batch of 4 pixels)
        while ( left < right ) {
            // load pixels from the left and reverse their order
            uint8x16_t leftPixels = vld1q_u8((uint8_t*)left);
            uint8x16_t reversedLeftPixels = vqtbl1q_u8(leftPixels, reverseIndices);
            // load pixels from the right and reverse their order
            uint8x16_t rightPixels = vld1q_u8((uint8_t*)right);
            uint8x16_t reversedRightPixels = vqtbl1q_u8(rightPixels, reverseIndices);
            // copy the right-hand pixels to the left and the left-hand pixels
            // to the right
            vst1q_u8((uint8_t*)left, reversedRightPixels);
            vst1q_u8((uint8_t*)right, reversedLeftPixels);
            left += 4;
            right -= 4;
        }
    }
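To make the effect of the index table concrete, the table lookup can be modelled in portable C. The sketch below is an emulation for illustration only: `tbl16` is a hypothetical stand-in for what `vqtbl1q_u8` computes (each output byte is the input byte selected by the index, or 0 if the index is 16 or more), and `reverse_four_pixels` is a hypothetical wrapper. On an AArch64 target the intrinsic would be used instead.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of a 16-byte table lookup: out[i] = table[idx[i]],
 * with out-of-range indices producing 0. */
static void tbl16(uint8_t out[16], const uint8_t table[16], const uint8_t idx[16])
{
    for (int i = 0; i < 16; ++i)
        out[i] = (idx[i] < 16) ? table[idx[i]] : 0;
}

/* Reverse the order of four 4-byte pixels while keeping the byte order
 * inside each pixel unchanged. */
static void reverse_four_pixels(uint32_t px[4])
{
    static const uint8_t reverse_indices[16] = {
        12, 13, 14, 15,   /* output pixel 0 <- input pixel 3 */
         8,  9, 10, 11,   /* output pixel 1 <- input pixel 2 */
         4,  5,  6,  7,   /* output pixel 2 <- input pixel 1 */
         0,  1,  2,  3    /* output pixel 3 <- input pixel 0 */
    };
    uint8_t in[16], out[16];

    assert(px != NULL);
    memcpy(in, px, sizeof in);
    tbl16(out, in, reverse_indices);
    memcpy(px, out, sizeof out);
}
```

Since whole 4-byte groups move together, the color channels inside each pixel keep their order, which is the property the byte-index table is designed to preserve.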
Neon intrinsics used for this optimization: vcreate_u8, vcombine_u8, vdupq_n_u8, vld1q_u8, vget_low_u8, vsubl_u8, vsubl_high_u8, vqtbl1q_u8, vmulq_u16, vreinterpretq_u8_u16, vuzp1q_u8, vqtbx1q_u8, vst1q_u8.
Image 2 stores pixels in CMYK (Cyan, Magenta, Yellow and Black) format. CMYK is used in printing and popular image editing programs often support loading and saving files as CMYK. TIFF supports automatic conversion of CMYK to RGBA using the TIFFReadRGBA interface (though the alpha channel A is always output as 255).
To convert CMYK to RGB, libtiff uses the following calculations, where C, M, Y and K are each normalized to the range 0 to 1:
R = 255 x (1 - C) x (1 - K)
G = 255 x (1 - M) x (1 - K)
B = 255 x (1 - Y) x (1 - K)
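With 8-bit samples, the same formulas can be evaluated in integer arithmetic. The following is a minimal scalar sketch; the function name `cmyk_to_rgba` is hypothetical, and the packing order (red in the low byte, alpha fixed at 255 in the high byte) is an assumption about the output layout.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: convert one CMYK pixel (four 8-bit samples) to a
 * packed 32-bit RGBA value. Dividing by 255 rescales the product of two
 * 0..255 factors back into the 0..255 range. */
static uint32_t cmyk_to_rgba(const uint8_t cmyk[4])
{
    assert(cmyk != 0);
    uint32_t k = 255u - cmyk[3];                  /* 255 x (1 - K) */
    uint32_t r = (k * (255u - cmyk[0])) / 255u;   /* 255 x (1 - C) x (1 - K) */
    uint32_t g = (k * (255u - cmyk[1])) / 255u;   /* 255 x (1 - M) x (1 - K) */
    uint32_t b = (k * (255u - cmyk[2])) / 255u;   /* 255 x (1 - Y) x (1 - K) */
    return r | (g << 8) | (b << 16) | (0xFFu << 24);
}
```

For example, a pixel with no ink at all (all four samples zero) maps to opaque white, and a pixel with full black ink maps to opaque black regardless of the other three samples.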
As with the other optimizations, we have inserted an if-statement to check whether to use the original code or the Neon code. For clarity, we do not show the if-statement here.
    // The original code uses a macro UNROLL8 to unroll code and
    // processes 8 pixels each iteration.
    // The macro actually contains a for loop that iterates over a single line
    // (the “w” parameter is the width of the image).
    #define UNROLL8(w, op1, op2) {                      \
        uint32_t _x;                                    \
        /* M1. For the whole width of the image.. */    \
        for (_x = w; _x >= 8; _x -= 8) {                \
            op1;                                        \
            /* M2. Repeat for 8 pixels.. */             \
            REPEAT8(op2);                               \
        }                                               \
        /* M3. For any pixels left over.. */            \
        if (_x > 0) {                                   \
            op1;                                        \
            CASE8(_x,op2);                              \
        }                                               \
    } // end of macro UNROLL8()

    // 1. For each line of the image (“h” is image height)
    for ( ; h > 0; --h) {
        // 2. Convert 8 pixels from CMYK to RGBA (A always 255)
        UNROLL8(w, NOP, {
            k = 255 - pp[3];
            r = (k*(255-pp[0]))/255;
            g = (k*(255-pp[1]))/255;
            b = (k*(255-pp[2]))/255;
            // 3. Write each pixel to memory one at a time
            *cp++ = PACK(r, g, b);
            pp += samplesperpixel;
        });
        cp += toskew;
        pp += fromskew;
    }
And here is the 64-bit Neon optimized version. Again, for simplicity, we only optimized the case where skewing is not required. In other words, the values “toskew” and “fromskew” are zero.
    // We will loop over all pixels of the image
    uint32_t np = w * h;
    uint32_t* endp = cp + np;

    // indices for the table lookup that will duplicate each pixel's K value
    uint8x8_t dupK1 = vcreate_u8(0xff06ff06ff06ff06ull);
    uint8x8_t dupK2 = vcreate_u8(0xff0eff0eff0eff0eull);
    uint8x16_t kindices = vcombine_u8(dupK1, dupK2);

    // indices that will grab the final results (0xff, i.e. -1, is out of range)
    uint8_t resultIndices[16] = {0,1,2,0xff,4,5,6,0xff,8,9,10,0xff,12,13,14,0xff};

    while (cp < endp) {
        // 16 copies of 255
        uint8x16_t v255 = vdupq_n_u8(255);
        // load 4 pixels (each pixel is 4 bytes with the CMYK values)
        uint8x16_t src_u8 = vld1q_u8(pp);
        // perform (255 - x) on each component
        // each vsubl is working on 2 pixels
        uint16x8_t subl = vsubl_u8(vget_low_u8(v255), vget_low_u8(src_u8));
        uint16x8_t subh = vsubl_high_u8(v255, src_u8);
        // duplicate the (255 - k) element of each pixel across that pixel's lanes
        uint16x8_t kl = vreinterpretq_u16_u8(
            vqtbl1q_u8(vreinterpretq_u8_u16(subl), kindices));
        uint16x8_t kh = vreinterpretq_u16_u8(
            vqtbl1q_u8(vreinterpretq_u8_u16(subh), kindices));
        // multiply (255 - x) by (255 - k)
        uint16x8_t ml = vmulq_u16(kl, subl);
        uint16x8_t mh = vmulq_u16(kh, subh);
        // divide by 255; the results we need are then in the low 8 bits
        // of the uint16 elements
        uint16x8_t resultl = ml / 255;
        uint16x8_t resulth = mh / 255;
        // combine the two result vectors (throwing away the upper halves
        // of all the uint16 elements)
        uint8x16_t packed = vuzp1q_u8(vreinterpretq_u8_u16(resultl),
                                      vreinterpretq_u8_u16(resulth));
        // wherever the index is out of range, we'll take the value from v255
        // (we return 255 in alpha)
        uint8x16_t idx = vld1q_u8(resultIndices);
        uint8x16_t pixels = vqtbx1q_u8(v255, packed, idx);
        // store the four RGBA pixels and advance the pointers/counters
        vst1q_u8((uint8_t*)cp, pixels);
        cp += 4;
        pp += samplesperpixel * 4;
    }
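The Neon version above divides by 255 using the `/` operator on vector types, which relies on compiler vector extensions rather than on a Neon intrinsic. A widely used alternative, offered here only as a sketch and not as what this guide measured, replaces the division with shifts and adds. The helper name `div255` is hypothetical; the identity holds for the range that matters here, products of two 8-bit values (0 to 65025).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: compute x / 255 without a divide instruction.
 * Valid for 0 <= x <= 65025 (255 * 255), the largest product of two
 * 8-bit factors. The intermediate sum never exceeds 65280, so the same
 * steps also fit in 16-bit lanes. */
static uint32_t div255(uint32_t x)
{
    assert(x <= 65025u);
    return (x + 1u + (x >> 8)) >> 8;
}
```

Whether this is faster than what the compiler generates for the vector division depends on the compiler and the target core, so it would need to be measured in the same way as the other optimizations.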
It’s worth noting that the compiler options used to build a library can have a big effect on the performance of your program, partly because the compiler can generate some Neon code automatically. To demonstrate this, we built and tested the original non-Neon version of libTIFF with the optimization option -O0 (no optimization at all) and compared it with the default -O2 build (most suitable optimizations applied).
Here are the results on the Galaxy S7:
In addition to the above improvements, we tried to optimize the PackBits compression scheme used in some TIFF images. Here, our optimization showed only minimal improvement where the image data did not lend itself well to processing many bytes at a time. This was likely because the Neon code generated automatically by the compiler for the main case was already close to optimal.
As you can see, results vary on different types of CPUs. However, there are big gains to be made by using Neon. It is worth pointing out that manually optimizing code in this way may not always be a good idea. You must also pay attention to the data or images being processed, as well as the optimization capabilities built into compilers.
Ramin Zaghi is a former engineer at Arm who is now the Chief Debugging Officer and Founder of VISUALSILICON.
You can learn more on the topic at the 'Efficient video encoding on the cloud with Arm servers' masterclass at Arm DevSummit 2022.