Coding for NEON - Part 2: Dealing With Leftovers

In the first post on NEON, about loads and stores, we looked at transferring data between the NEON processing unit and memory. In this post, we deal with a frequently encountered problem: input data whose length is not a multiple of the length of the vectors you want to process. You need to handle the leftover elements at the start or end of the array. What is the best way to do this on NEON?

Leftovers


Using NEON typically involves operating on vectors of data from four to sixteen elements in length. Frequently, you will find that the length of your array is not a multiple of the vector length, and you have to process the leftover elements separately.

For example, you want to load, process and store eight elements per iteration using NEON, but your array is 21 elements long. The first two iterations go well, but for the third, there are only five elements remaining to be processed. What do you do?
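The arithmetic behind this example can be sketched in a few lines of C. This is only an illustration: the names VEC_LEN and split_length are hypothetical, and the vector length of eight is taken from the example above.

```c
#include <assert.h>
#include <stddef.h>

#define VEC_LEN 8  /* elements per vector, as in the running example */

/* Split an array length into complete vectors and leftover elements.
 * The mask form (n & 7) is valid because VEC_LEN is a power of two. */
static void split_length(size_t n, size_t *vectors, size_t *rest) {
    assert(vectors != NULL && rest != NULL);
    *vectors = n / VEC_LEN;        /* same as n >> 3 */
    *rest = n & (VEC_LEN - 1);     /* same as n % 8  */
}
```

For a 21-element array this gives two complete vectors and five leftover elements, which is the situation the rest of this post deals with.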

Fixing Up


There are three ways to handle these leftovers. The methods vary in requirements, performance, and code size. They are listed below in order, with the fastest approach first.

Larger Arrays

If you can change the size of the arrays that you are processing, increase the length of the array to the next multiple of the vector size using padding elements. This allows you to read and write beyond the end of your data without corrupting adjacent storage.


In the example above, increasing the array size to 24 elements allows the third iteration to complete without potential data corruption.

Notes

  • Allocating larger arrays will consume more memory. The increase could be significant if many short arrays are involved.
  • The new padding elements created at the end of the array may need to be initialized to a value that does not affect the result of the calculation. For example, if you are summing an array, the new elements must be initialized to zero for the result to be unaffected. If you are finding the minimum of an array, set the new elements to the maximum value an element can take.
  • In some cases, it may not be possible to initialize the padding elements to a value that does not affect the result of a calculation - when finding the range of a set of numbers, for example.
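As a rough sketch of the padding idea, the C below rounds a length up to a whole number of vectors and fills the padding with a neutral value. It is a portable model, not NEON code: the inner eight-element loops stand in for one vector operation, and the helper names (padded_length, pad_copy, sum_padded, min_padded) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define VEC_LEN 8  /* elements per vector, a power of two */

/* Round n up to the next multiple of VEC_LEN. */
static size_t padded_length(size_t n) {
    return (n + VEC_LEN - 1) & ~(size_t)(VEC_LEN - 1);
}

/* Copy n bytes into a new buffer that is a whole number of vectors long,
 * filling the padding with `neutral`. Returns NULL if allocation fails;
 * the caller frees the result. */
static uint8_t *pad_copy(const uint8_t *src, size_t n, uint8_t neutral) {
    size_t padded = padded_length(n);
    uint8_t *dst = malloc(padded ? padded : 1);
    if (dst == NULL) return NULL;
    if (n != 0) memcpy(dst, src, n);
    memset(dst + n, neutral, padded - n);
    return dst;
}

/* Sum a padded buffer one whole vector at a time. */
static uint32_t sum_padded(const uint8_t *buf, size_t padded) {
    assert(padded % VEC_LEN == 0);
    uint32_t total = 0;
    for (size_t i = 0; i < padded; i += VEC_LEN)
        for (size_t lane = 0; lane < VEC_LEN; ++lane)
            total += buf[i + lane];
    return total;
}

/* Minimum of a padded buffer, again one whole vector at a time. */
static uint8_t min_padded(const uint8_t *buf, size_t padded) {
    assert(padded > 0 && padded % VEC_LEN == 0);
    uint8_t lowest = UINT8_MAX;
    for (size_t i = 0; i < padded; i += VEC_LEN)
        for (size_t lane = 0; lane < VEC_LEN; ++lane)
            if (buf[i + lane] < lowest) lowest = buf[i + lane];
    return lowest;
}
```

Padding with 0 leaves a sum unchanged, and padding with UINT8_MAX leaves a minimum unchanged. No single padding value works for both at once, which is why a range calculation cannot be handled this way.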

Code Fragment


@ r0 = input array pointer
@ r1 = output array pointer
@ r2 = length of data in array

@ We can assume that the array length is greater than zero, is an integer
@ number of vectors, and is greater than or equal to the length of data
@ in the array.

    add     r2, r2, #7          @ add (vector length - 1) to the data length
    lsr     r2, r2, #3          @ divide the length of the array by the length
                                @  of a vector, 8, to find the number of
                                @  vectors of data to be processed

loop:
    subs    r2, r2, #1          @ decrement the loop counter, and set flags
    vld1.8  {d0}, [r0]!         @ load eight elements from the array pointed to
                                @  by r0 into d0, and update r0 to point to the
                                @  next vector
    ...
    ...                         @ process the input in d0
    ...

    vst1.8  {d0}, [r1]!         @ write eight elements to the output array, and
                                @  update r1 to point to next vector
    bne     loop                @ if r2 is not equal to 0, loop


Overlapping


If the operation is suitable, leftover elements can be handled using overlapping. This involves processing some of the elements in the array twice.

In the example case, the first iteration would process elements zero to seven, the second processes elements five to 12, and the third 13 to 20. Notice that elements five to seven, the overlap between the first and second vectors, have been processed twice.

Notes

  • Overlapping can be used only when the operation applied to the input data does not vary with the number of times the operation is applied; the operation must be idempotent. For example, it can be used if you are trying to find the maximum element in an array. It cannot be used if you are summing an array, because the overlapped elements will be counted twice.
  • The number of elements in the array must fill at least one complete vector.
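The overlapping scheme can be modelled in portable C, using a maximum search as the idempotent operation. This is a sketch, not NEON code: max_vector stands in for one eight-lane operation, and both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VEC_LEN 8  /* elements per vector, a power of two */

/* Fold one whole vector of eight elements into the running maximum. */
static uint8_t max_vector(const uint8_t *v, uint8_t running) {
    for (size_t lane = 0; lane < VEC_LEN; ++lane)
        if (v[lane] > running) running = v[lane];
    return running;
}

/* Maximum of n elements using only whole-vector steps. */
static uint8_t max_overlapping(const uint8_t *in, size_t n) {
    assert(n >= VEC_LEN);               /* at least one complete vector */
    uint8_t running = 0;
    size_t rest = n & (VEC_LEN - 1);    /* leftover element count */
    size_t pos = 0;

    if (rest != 0) {
        /* Handle the first vector separately, then advance by the number
         * of leftovers only, so the next vector overlaps this one. */
        running = max_vector(in, running);
        pos = rest;
    }
    for (size_t v = n / VEC_LEN; v > 0; --v) {
        running = max_vector(in + pos, running);
        pos += VEC_LEN;
    }
    return running;
}
```

With n = 21 the three vector steps start at elements 0, 5 and 13, so elements five to seven are visited twice. That is harmless for a maximum, but the same loop structure applied to a sum would count those three elements twice.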

Code Fragment


@ r0 = input array pointer
@ r1 = output array pointer
@ r2 = length of data in array

@ We can assume that the operation is idempotent, and the array is greater
@ than or equal to one vector long.

    ands    r3, r2, #7          @ calculate number of elements left over after
                                @  processing complete vectors using
                                @  data length & (vector length - 1)
    beq     loopsetup           @ if the result of the ands is zero, the length
                                @  of the data is an integer number of vectors,
                                @  so there is no overlap, and processing can begin
                                @  at the loop

                                @ handle the first vector separately
    vld1.8  {d0}, [r0], r3      @ load the first eight elements from the array,
                                @  and update the pointer by the number of elements
                                @  left over
    ...
    ...                         @ process the input in d0
    ...

    vst1.8  {d0}, [r1], r3      @ write eight elements to the output array, and
                                @  update the pointer

                                @ now, set up the vector processing loop
loopsetup:
    lsr     r2, r2, #3          @ divide the length of the array by the length
                                @  of a vector, 8, to find the number of
                                @  vectors of data to be processed

                                @ the loop can now be executed as normal. the
                                @  first few elements of the first vector will
                                @  overlap with some of those processed above
loop:
    subs    r2, r2, #1          @ decrement the loop counter, and set flags
    vld1.8  {d0}, [r0]!         @ load eight elements from the array, and update
                                @  the pointer
    ...
    ...                         @ process the input in d0
    ...

    vst1.8  {d0}, [r1]!         @ write eight elements to the output array, and
                                @  update the pointer
    bne     loop                @ if r2 is not equal to 0, loop

Single Elements


NEON provides loads and stores that can operate on single elements in a vector. Using these, you can load a partial vector containing one element, operate on it, and write the element back to memory.

For the example problem, the first two iterations execute as normal, processing elements zero to seven, and eight to 15. The third iteration needs only to process five elements. They are handled in a separate loop, which loads, processes and stores single elements.

Notes

  • This approach is slower than the previous methods, as each element must be loaded, processed and stored individually.
  • Handling leftovers like this requires two loops - one for the vectors, and a second for the single elements. This can double the amount of code in the function.
  • NEON single element loads only change the value of the destination element, leaving the rest of the vector intact. If the calculation that you are vectorizing involves instructions that work across a vector, such as VPADD, the register must be initialized before loading the first single element into it.
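The two-loop structure can be sketched in portable C as follows. The per-element operation (adding one) and the names process and process_array are placeholders chosen for illustration; the inner eight-element loop stands in for one NEON vector operation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VEC_LEN 8  /* elements per vector */

/* Hypothetical per-element operation: add one, wrapping at 255. */
static uint8_t process(uint8_t x) { return (uint8_t)(x + 1); }

/* Process n elements: whole vectors first, then the leftovers one by one. */
static void process_array(const uint8_t *in, uint8_t *out, size_t n) {
    assert(in != NULL && out != NULL);
    size_t i = 0;

    /* Vector loop: one iteration per complete vector. */
    for (size_t v = n / VEC_LEN; v > 0; --v) {
        for (size_t lane = 0; lane < VEC_LEN; ++lane)
            out[i + lane] = process(in[i + lane]);
        i += VEC_LEN;
    }

    /* Single element loop: one iteration per leftover element. */
    for (size_t r = n % VEC_LEN; r > 0; --r) {
        out[i] = process(in[i]);
        ++i;
    }
}
```

Unlike the two previous methods, this one never reads or writes past element n - 1 and never touches an element twice, so it places no requirement on the operation or on the size of the arrays.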

Code Fragment


@ r0 = input array pointer
@ r1 = output array pointer
@ r2 = length of data in array

    lsrs    r3, r2, #3          @ calculate the number of complete vectors to be
                                @  processed and set flags
    beq     singlesetup         @ if there are zero complete vectors, branch to
                                @  the single element handling code

                                @ process vector loop
vectors:
    subs    r3, r3, #1          @ decrement the loop counter, and set flags
    vld1.8  {d0}, [r0]!         @ load eight elements from the array and update
                                @  the pointer
    ...
    ...                         @ process the input in d0
    ...

    vst1.8  {d0}, [r1]!         @ write eight elements to the output array, and
                                @  update the pointer
    bne     vectors             @ if r3 is not equal to zero, loop

singlesetup:
    ands    r3, r2, #7          @ calculate the number of single elements to process
    beq     exit                @ if the number of single elements is zero, branch
                                @  to exit

                                @ process single element loop
singles:
    subs    r3, r3, #1          @ decrement the loop counter, and set flags
    vld1.8  {d0[0]}, [r0]!      @ load single element into d0, and update the
                                @  pointer
    ...
    ...                         @ process the input in d0[0]
    ...

    vst1.8  {d0[0]}, [r1]!      @ write the single element to the output array,
                                @  and update the pointer
    bne     singles             @ if r3 is not equal to zero, loop

exit:

Further Considerations

Beginning or End


The overlapping and single element techniques can be applied at the start or end of processing an array. The code above can be easily adapted to fix up elements at either end, if it is more suitable for your application.

Alignment


Load and store addresses should be aligned to cache lines, allowing more efficient memory accesses.


This requires at least 16-word alignment on Cortex-A8. If you cannot align the start of your input and output arrays, you must handle elements at the beginning of processing an array (for alignment) and at the end of the array (for the incomplete final vector).
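One way to plan this split is sketched below in C, for byte-sized elements. The struct and function names are hypothetical, and the code only computes how many elements fall into each part; it does not perform the loads and stores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VEC_LEN 8  /* elements per vector */

struct split {
    size_t head;     /* single elements before the first aligned address */
    size_t vectors;  /* complete eight-element vectors */
    size_t tail;     /* single elements after the last complete vector */
};

/* Plan the processing of n byte-sized elements starting at p so that the
 * vector loop begins on an `align`-byte boundary (align is a power of two). */
static struct split split_for_alignment(const void *p, size_t align, size_t n) {
    assert(align != 0 && (align & (align - 1)) == 0);
    struct split s;

    /* Bytes needed to reach the next aligned address; zero if already aligned. */
    size_t head = (size_t)(-(uintptr_t)p & (uintptr_t)(align - 1));
    s.head = head < n ? head : n;

    size_t remaining = n - s.head;
    s.vectors = remaining / VEC_LEN;
    s.tail = remaining % VEC_LEN;
    return s;
}
```

For example, with 64-byte alignment (16 words), a pointer three bytes past an aligned address, and 100 elements, the plan is 61 head elements, four vectors, and seven tail elements. The head and tail can then be handled with the single element technique described earlier.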

When aligning memory accesses for speed, remember to use the :64, :128, or :256 address qualifiers with your load and store instructions for optimum performance. You can compare the number of cycles required to issue a load or store using the data available in the Technical Reference Manual for your target core.

Here's the relevant page in the Cortex-A8 TRM.

Using Arm to Fix Up


In the single elements case, you could use Arm instructions to operate on each element. However, storing to the same area of memory with both Arm and NEON instructions can reduce performance, as the writes from the Arm pipeline are delayed until writes from the NEON pipeline have been completed.

Generally, you should avoid writing to the same area of memory (specifically, the same cache line) from both Arm and NEON code.


In the next post, we will look at a practical application of NEON: matrix multiplication.

Anonymous
  • FYI libyuv uses both of these techniques: overlap, which I call 'last16' because I do the multiple of 16 and then redo the last 16 with overlap, and remainder. Both are done in row_any.cc.

    Caveat: the overlap technique doesn't work with overlapping pointers, e.g. in-place manipulation.

    There are 2 other techniques that can be used at a higher level:
    coalescing, which merges sequential operations, e.g. rows in an image;
    blocking, which breaks a single operation into multiple smaller ones.

    Re 6 bytes: if you really need to load/store 6 bytes, you could try three 16 bit values:

    vld3.16  {d0[0],d1[0],d2[0]}, [r0]!
    vst3.16  {d0[0],d1[0],d2[0]}, [r1]!
  • What if I only want to load 6 bytes of data from the array at R0?
    I code it like this:

    vld1.8 {D0}, [R0]

    but this instruction will load 8 bytes into D0, and then

    vst1.8 {D0}, [R1]

    will store 8 bytes into memory.

    I only want to copy 6 bytes; the other 2 bytes may overwrite useful data.
    How should I do this?
    Thanks!
  • The __attribute__((aligned(x))) attribute is for aligning variables on the stack or for aligning function addresses. I haven't seen it being used as a function flag though. Are you sure about this?
  • QUOTE (xtrawurst @ May 12 2010, 07:19 PM)
    How can I enable alignment for load/store operations in C (using the GCC compiler intrinsics)? From what I read about aligned loads and stores, you can specify alignment using additional "@16, @32, ..." flags, but that only works with assembler code, right? Is there any way to have this functionality in C as well?


    There is an attribute to do that, e.g.

    __attribute__((aligned(x))) where x is the amount of alignment you seek.

    int x __attribute__ ((aligned (16))) = 0;

    would align 'x' to a 16-byte boundary.
  • How can I enable alignment for load/store operations in C (using the GCC compiler intrinsics)? From what I read about aligned loads and stores, you can specify alignment using additional "@16, @32, ..." flags, but that only works with assembler code, right? Is there any way to have this functionality in C as well?