Coding for NEON - Part 4: Shifting Left and Right

This article introduces the shifting operations provided by NEON, and shows how they can be used to convert image data between commonly used color depths. Previous articles in this series:


Part 1: Loads and Stores, Part 2: Dealing with Leftovers and Part 3: Matrix Multiplication.

Shifting Vectors


A shift on NEON is very similar to shifts you may have used in scalar ARM code. The shift moves the bits in each element of a vector left or right. Bits that fall off the left or right of each element are discarded; they are not shifted to adjacent elements.

The amount to shift can be specified with a literal encoded in the instruction, or with an additional shift vector. When using a shift vector, the shift applied to each element of the input vector depends on the value of the corresponding element in the shift vector. The elements in the shift vector are treated as signed values, so left, right and zero shifts are possible, on a per-element basis.


A right shift operating on a vector of signed elements, indicated by the type attached to the instruction, will sign extend each element. This is the equivalent of an arithmetic shift you may have used in ARM code. Shifts applied to unsigned vectors do not sign extend.
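
The per-element behavior described above can be sketched in portable C. This is only a scalar model of one 16-bit lane, not NEON code; the function names are hypothetical, and the model assumes shift counts between -15 and 15, which is narrower than what the hardware accepts.

```c
#include <assert.h>
#include <stdint.h>

/* One unsigned 16-bit lane. A positive count shifts left, a negative
 * count shifts right. Bits leaving the lane are discarded. */
uint16_t shift_lane_u16(uint16_t x, int count)
{
    assert(count > -16 && count < 16);      /* assumption of this model */
    if (count >= 0)
        return (uint16_t)(x << count);
    return (uint16_t)(x >> -count);         /* zero fill from the left */
}

/* One signed 16-bit lane. Right shifts sign extend (arithmetic shift). */
int16_t shift_lane_s16(int16_t x, int count)
{
    assert(count > -16 && count < 16);      /* assumption of this model */
    uint16_t bits = (uint16_t)x;
    if (count >= 0)
        return (int16_t)(uint16_t)(bits << count);
    uint16_t shifted = (uint16_t)(bits >> -count);
    if (x < 0)                              /* fill vacated bits with ones */
        shifted |= (uint16_t)~(0xFFFFu >> -count);
    return (int16_t)shifted;
}

/* Apply a different signed shift count to each of four lanes,
 * in the way a shift vector would. */
void shift_lanes_s16(int16_t out[4], const int16_t in[4], const int8_t counts[4])
{
    for (int i = 0; i < 4; i++)
        out[i] = shift_lane_s16(in[i], counts[i]);
}
```

With these helpers, shifting the unsigned lane 0x8001 left by one gives 0x0002 (the top bit is lost rather than carried into a neighbor), while shifting the signed lane -16 right by two gives -4.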

Shifting and Inserting

NEON also supports shifts with insertion, providing a way to combine bits from two vectors. For example, shift left and insert (VSLI) shifts each element of the source vector left. The new bits inserted at the right of each element are the corresponding bits from the destination vector.
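
As a rough illustration, the insert behavior can be modeled on a single 8-bit lane in C. The helper names are hypothetical, and the second function models the mirror-image operation, shift right and insert (VSRI), which keeps the high bits of the destination instead of the low bits.

```c
#include <assert.h>
#include <stdint.h>

/* Shift left and insert: src moves left by n, and the low n bits of the
 * result come from the existing destination value. */
uint8_t sli_u8(uint8_t dest, uint8_t src, unsigned n)
{
    assert(n <= 7);
    uint8_t keep = (uint8_t)((1u << n) - 1u);    /* low n bits of dest survive */
    return (uint8_t)((src << n) | (dest & keep));
}

/* Shift right and insert: src moves right by n, and the high n bits of
 * the result come from the existing destination value. */
uint8_t sri_u8(uint8_t dest, uint8_t src, unsigned n)
{
    assert(n >= 1 && n <= 8);
    uint8_t keep = (uint8_t)(0xFFu << (8 - n));  /* high n bits of dest survive */
    return (uint8_t)((src >> n) | (dest & keep));
}
```

For example, inserting 0xAB shifted left by four into a destination of 0x0F yields 0xBF: the high nibble is the low nibble of the source, and the low nibble is preserved from the destination.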

Shifting and Accumulation

Finally, NEON supports shifting the elements of a vector right, and accumulating the results into another vector. This is useful for situations in which interim calculations are made at a high precision, before the result is combined with a lower precision accumulator.
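
A minimal scalar sketch of shift right and accumulate, assuming unsigned 16-bit lanes and a hypothetical 8.8 fixed-point layout for the interim values (neither assumption comes from the article):

```c
#include <assert.h>
#include <stdint.h>

/* One unsigned 16-bit lane: shift src right by n and add it to acc.
 * The sum wraps modulo 2^16 in this model. */
uint16_t sra_u16(uint16_t acc, uint16_t src, unsigned n)
{
    assert(n >= 1 && n <= 16);
    return (uint16_t)(acc + (src >> n));
}

/* Add four 8.8 fixed-point values (hypothetical layout) into whole-number
 * totals by dropping the eight fraction bits during accumulation. */
void accumulate_scaled(uint16_t totals[4], const uint16_t fixed[4])
{
    for (int i = 0; i < 4; i++)
        totals[i] = sra_u16(totals[i], fixed[i], 8);
}
```

Here an accumulator holding 10 combined with the interim value 0x0300 (3.0 in 8.8 format) becomes 13.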

Instruction Modifiers


Each shift instruction can take one or more modifiers. These modifiers do not change the shift operation itself, but the inputs or outputs are adjusted to remove bias or saturate to a range.


There are five shift modifiers:

  • Rounding, denoted by an R prefix, corrects for the bias caused by truncation when shifting right.
  • Narrow, denoted by an N suffix, causes the number of bits in each element of the result to be halved. It implies Q (128-bit) source and D (64-bit) destination registers.
  • Long, denoted by an L suffix, causes the number of bits in each element of the result to be doubled. It implies D source and Q destination registers.
  • Saturating, denoted by a Q prefix, sets each result element to the minimum or maximum of the representable range, if the result exceeds that range. The number of bits and sign type of the vector are used to determine the saturation range.
  • Unsigned Saturating, denoted by a Q prefix and U suffix, is similar to the saturation modifier, but the result is saturated to an unsigned range even though the inputs are signed.
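
The effect of each modifier can be sketched on a single lane in C. These are scalar models with hypothetical names, chosen to echo the modifier letters; the lane widths (8 and 16 bits) are assumptions made to keep the example short.

```c
#include <assert.h>
#include <stdint.h>

/* Plain truncating shift right, for comparison with the rounding form. */
uint16_t shr_u16(uint16_t x, unsigned n)
{
    assert(n >= 1 && n <= 15);
    return (uint16_t)(x >> n);
}

/* Rounding: add half of the discarded range before shifting. */
uint16_t rshr_u16(uint16_t x, unsigned n)
{
    assert(n >= 1 && n <= 15);
    return (uint16_t)(((uint32_t)x + (1u << (n - 1))) >> n);
}

/* Narrow: shift right, then keep only the low 8 bits of the 16-bit lane. */
uint8_t shrn_u16(uint16_t x, unsigned n)
{
    assert(n >= 1 && n <= 8);
    return (uint8_t)(x >> n);
}

/* Long: widen the 8-bit lane to 16 bits, then shift left. */
uint16_t shll_u8(uint8_t x, unsigned n)
{
    assert(n <= 8);
    return (uint16_t)((uint32_t)x << n);
}

/* Saturating: clamp a signed 8-bit left shift to [-128, 127]. */
int8_t qshl_s8(int8_t x, unsigned n)
{
    assert(n <= 7);
    int32_t wide = (int32_t)x * (1 << n);   /* multiply avoids shifting a negative value */
    if (wide > INT8_MAX) return INT8_MAX;
    if (wide < INT8_MIN) return INT8_MIN;
    return (int8_t)wide;
}

/* Unsigned saturating: signed input, result clamped to [0, 255]. */
uint8_t qshlu_s8(int8_t x, unsigned n)
{
    assert(n <= 7);
    int32_t wide = (int32_t)x * (1 << n);
    if (wide > UINT8_MAX) return UINT8_MAX;
    if (wide < 0) return 0;
    return (uint8_t)wide;
}
```

Shifting 7 right by one gives 3 when truncating and 4 when rounding. Shifting the signed value 100 left by one saturates to 127, while the unsigned saturating form returns 200 for the same input and 0 for any negative input.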


Some combinations of these modifiers do not describe useful operations, and so the instruction is not provided by NEON. For example, a saturating shift right (which would be called VQSHR) is unnecessary, as right shifting makes results smaller, and so the value cannot exceed the available range.

Table of Shifts Available


All of the shifting instructions provided by NEON are shown in the table below. They are arranged according to the modifiers mentioned earlier. If you are still unsure about what the modifier letters mean, use the table to select the instruction you need.

An Example: Converting Color Depth


Converting between color depths is a frequent operation required in graphics processing. Often, input or output data is in an RGB565 16-bit color format, but working with the data is much easier in RGB888 format. This is particularly true on NEON, as there is no native support for data types like RGB565.


However, NEON can still handle RGB565 data efficiently, and the vector shifts introduced above provide a method to do it.

From 565 to 888


First, we will look at converting RGB565 to RGB888. We assume there are eight 16-bit pixels in register q0, and we would like to separate reds, greens and blues into 8-bit elements across three registers d2 to d4.

    vshr.u8      q1, q0, #3      @ shift red elements right by three bits,
                                 @  discarding the green bits at the bottom of
                                 @  the red 8-bit elements.
    vshrn.i16    d2, q1, #5      @ shift red elements right and narrow,
                                 @  discarding the blue and green bits.
    vshrn.i16    d3, q0, #5      @ shift green elements right and narrow,
                                 @  discarding the blue bits and some red bits
                                 @  due to narrowing.
    vshl.i8      d3, d3, #2      @ shift green elements left, discarding the
                                 @  remaining red bits, and placing green bits
                                 @  in the correct place.
    vshl.i16     q0, q0, #3      @ shift blue elements left to most-significant
                                 @  bits of 8-bit color channel.
    vmovn.i16    d4, q0          @ remove remaining red and green bits by
                                 @  narrowing to 8 bits.

The effects of each instruction are described in the comments above, but in summary, the operation performed on each channel is:

  1. Remove color data for adjacent channels using shifts to push the bits off either end of the element.
  2. Use a second shift to position the color data in the most-significant bits of each element, and narrow to reduce element size from 16 to eight bits.

Note the use of element sizes in this sequence to address 8 and 16 bit elements, in order to achieve some of the masking operations.
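
To make the bit movements easier to follow, here is a scalar C model of the same sequence applied to one pixel. It is a sketch for checking the logic on any machine, not a replacement for the NEON code; the `rgb888_t` type and the function name are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint8_t r, g, b; } rgb888_t;   /* hypothetical type */

/* Mirror the NEON sequence for a single RGB565 pixel (RRRRRGGG GGGBBBBB). */
rgb888_t rgb565_to_rgb888(uint16_t p)
{
    rgb888_t out;

    /* vshr.u8 #3 acts on the high and low bytes independently. */
    uint8_t hi = (uint8_t)((p >> 8) >> 3);
    uint8_t lo = (uint8_t)((p & 0xFFu) >> 3);
    uint16_t q1 = (uint16_t)((hi << 8) | lo);

    out.r = (uint8_t)(q1 >> 5);                     /* vshrn.i16 d2, q1, #5 */
    out.g = (uint8_t)(p >> 5);                      /* vshrn.i16 d3, q0, #5 */
    out.g = (uint8_t)(out.g << 2);                  /* vshl.i8   d3, d3, #2 */
    out.b = (uint8_t)(uint16_t)((uint32_t)p << 3);  /* vshl.i16, then vmovn.i16 */
    return out;
}
```

Each channel ends up in the most-significant bits of its byte, so a pure red pixel 0xF800 becomes (0xF8, 0x00, 0x00) and a pure green pixel 0x07E0 becomes (0x00, 0xFC, 0x00).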

A small problem


You may notice that, if you use the code above to convert to RGB888 format, your whites aren't quite white. This is because, for each channel, the lowest two or three bits are zero, rather than one; a white represented in RGB565 as (0x1F, 0x3F, 0x1F) becomes (0xF8, 0xFC, 0xF8) in RGB888. This can be fixed using shift with insert to place some of the most-significant bits into the lower bits.
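
One way to picture the fix is the scalar C sketch below, which copies the top bits of a channel into its empty low bits. The function name is hypothetical. On NEON, the same effect could plausibly be obtained with a shift right and insert whose source and destination are the same register (for example `vsri.8 d2, d2, #5` for red); that instruction choice is an assumption of this sketch, not code from the article.

```c
#include <assert.h>
#include <stdint.h>

/* `value` holds a channel with `bits` significant bits at the top of the
 * byte (5 for red and blue, 6 for green). The low bits are filled with a
 * copy of the top bits, so full intensity maps to 0xFF. */
uint8_t replicate_top_bits(uint8_t value, unsigned bits)
{
    assert(bits >= 4 && bits <= 7);     /* one copy must cover the gap */
    uint8_t keep = (uint8_t)(0xFFu << (8 - bits));
    return (uint8_t)((value & keep) | (value >> bits));
}
```

With this step, the white from the example, (0xF8, 0xFC, 0xF8), becomes (0xFF, 0xFF, 0xFF), and black stays at zero.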

From 888 to 565


Now, we can look at the reverse operation, converting RGB888 to RGB565. Here, we assume that the RGB888 data is in the format produced by the code above; separated across three registers d0 to d2, with each register containing eight elements of each color. The result will be stored as eight 16-bit RGB565 elements in q2.


    vshll.u8     q2, d0, #8      @ shift red elements left to most-significant
                                 @  bits of wider 16-bit elements.
    vshll.u8     q3, d1, #8      @ shift green elements left to most-significant
                                 @  bits of wider 16-bit elements.
    vsri.16      q2, q3, #5      @ shift green elements right and insert into
                                 @  red elements.
    vshll.u8     q3, d2, #8      @ shift blue elements left to most-significant
                                 @  bits of wider 16-bit elements.
    vsri.16      q2, q3, #11     @ shift blue elements right and insert into
                                 @  red and green elements.


Again, the detail is in the comments for each instruction, but in summary, for each channel:

  1. Lengthen each element to 16 bits, and shift the color data into the most-significant bits.
  2. Use shift right with insert to position each color channel in the result register.
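
The two steps above can be traced with a scalar C model of one pixel. As before, this is a sketch with hypothetical function names, intended for checking the bit layout rather than for production use.

```c
#include <assert.h>
#include <stdint.h>

/* One 16-bit lane of shift right and insert: the top n bits of dest are
 * kept, and the remaining bits come from src shifted right by n. */
uint16_t sri_u16(uint16_t dest, uint16_t src, unsigned n)
{
    assert(n >= 1 && n <= 15);
    uint16_t keep = (uint16_t)(0xFFFFu << (16 - n));
    return (uint16_t)((dest & keep) | (src >> n));
}

/* Mirror the NEON sequence for a single pixel. */
uint16_t rgb888_to_rgb565(uint8_t r, uint8_t g, uint8_t b)
{
    uint16_t q2 = (uint16_t)(r << 8);       /* vshll.u8 q2, d0, #8  */
    uint16_t q3 = (uint16_t)(g << 8);       /* vshll.u8 q3, d1, #8  */
    q2 = sri_u16(q2, q3, 5);                /* vsri.16  q2, q3, #5  */
    q3 = (uint16_t)(b << 8);                /* vshll.u8 q3, d2, #8  */
    q2 = sri_u16(q2, q3, 11);               /* vsri.16  q2, q3, #11 */
    return q2;
}
```

The first insert keeps the top five red bits and places green below them; the second keeps the top eleven bits (red and green) and places blue in the remaining five. For example, (0x12, 0x34, 0x56) packs to 0x11AA.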

Conclusion


The powerful range of shift instructions provided by NEON allows you to:

  • Quickly divide and multiply vectors by powers of two, with rounding and saturation.
  • Shift and copy bits from one vector to another.
  • Make interim calculations at high precision and accumulate results at a lower precision.


In the next article, we will look at some of the other data processing instructions provided by NEON.

  • You can do this using the VEXT instruction. This extracts a set of contiguous bytes across a pair of registers into a third register. If the input registers are the same, it is effectively a rotate operation. In your example, if the vector is arranged such that <1> is in element 3 and <4> is in element 0 of q0, you can create the new vector <2, 3, 4, 1> using:
      VEXT.8 q0, q0, q0, #12  // Extract a new Q register at 12 byte offset.

    If you need arbitrary element permutation, use the VTBL instruction. This uses a table of values held in D registers, and a separate D register containing byte indexes. For each index, the corresponding byte in the table is looked up and written to the matching position in the destination D register.

    Other NEON element permutation instructions are VBIT/VBIF/VBSL, VZIP, VUZP, VTRN and VREV16/32/64. Details of these can be found in the ARM Architecture Reference Manual: http://infocenter.arm.com/help/topic/com.a...406b/index.html

    I'll write a post on permutation instructions soon.
  • Thanks! That was very helpful!
  • Hey, what's the best and/or fastest way to do element shifts/shuffles? So for example having a 4x32bit vector <1,2,3,4> I want to get <2,3,4,1> or <2,3,4,x> with x undefined. Is there a way to achieve this with NEON?
  • Nice article. I use a vectorizing compiler and I wonder: is it possible to have the ARM vectorizing compiler generate saturating instructions?
  • If you need a particular instruction, you can use intrinsics to insert it into your C code directly. This is compatible with ARM's own C compiler, and GCC. For example:

      int8x8_t a, b;
      a = vqadd_s8(a, b);  // Inserts VQADD.S8  Da, Da, Db

    Information on intrinsics and the ARM C compiler can be found here: http://infocenter.arm.com/help/index.jsp?t...a/BABGHIFH.html

    Alternatively, you can use the idiom recognition features of the compiler, which should identify code typically used to saturate values. For example, define a static inlined function that saturates inputs:

      static inline int sat(int x) { if (x > 127) x = 127; if (x < -128) x = -128; return x; }

    Using this in a vectorizable loop should produce saturating instructions, but it will depend on your compiler. I believe GCC isn't able to do this yet.
  • That's nice, but how can RGB565 be quickly converted to RGBA8888 without separating the color components, i.e. with two 32-bit pixels in a Dn register? As there is no alpha in a 16-bit pixel, suppose that the alpha value is 0. Thanks.

    The easiest way is to split the color components into separate D registers, as described above. The components of each pixel can be reinterleaved when storing them back to memory, which for RGBA, requires using VST4. Trying to do the same operation, but keeping RGBARGBA pixels in each D register requires more masking, shifting and combining operations.
  • Hi Martin.

    I had to use your code in one of my programs.
    Your RGB565 to RGB888 conversion

    vshr.u8  q1, q0, #3
    vshrn.i16 d2, q1, #5
    vshrn.i16 d3, q0, #5
    vshl.i8  d3, d3, #2
    vshl.i16  q0, q0, #3
    vmovn.i16 d4, q0

    takes 12 cycles on a Cortex-A8!

    You can reduce this to 6 cycles just by using an extra Qn register:

    vshr.u8  q3, q0, #3
    vshrn.i16 d3, q0, #5
    vshl.i16  q0, q0, #3
    vshrn.i16 d2, q3, #5
    vshl.i8  d3, d3, #2
    vmovn.i16 d4, q0


    I hope that helps somebody!
  • I think there is something wrong with the first picture, "VSHL.U16 d2,d1,d0".

    According to RVCT, the valid byte for the NEON register d0 is just "the least significant byte". So, U16 just means that there are 4 groups in d0 and only the least significant byte of each group is valid. Only half of d0 should be highlighted.