Coding for Neon - Part 2: Dealing With Leftovers

September 11, 2013

This blog has been updated and turned into a more formal guide on Arm Developer. You can find the latest guide here:

Coding for Neon - Leftovers

In part 1 of this series, which covered loads and stores, we looked at transferring data between the Neon processing unit and memory. In this post, we deal with a frequently encountered problem: input data whose length is not a multiple of the length of the vectors you want to process. You need to handle the leftover elements at the start or end of the array. What is the best way to do this on Neon?

Leftovers

Using Neon typically involves operating on vectors of data from four to sixteen elements in length. Frequently, you will find that your array is not a multiple of that length, and you have to process those leftover elements separately.

For example, you want to load, process and store eight elements per iteration using Neon, but your array is 21 elements long. The first two iterations go well, but for the third, there are only five elements remaining to be processed. What do you do?
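
The arithmetic behind this example can be sketched in C. This is only an illustration: the names `VECTOR_LEN`, `full_vectors` and `leftovers` are made up here, and the mask trick assumes the vector length is a power of two.

```c
#include <stddef.h>

enum { VECTOR_LEN = 8 };  /* elements per vector; assumed to be a power of two */

/* Number of complete vectors that fit in n elements. */
size_t full_vectors(size_t n) {
    return n / VECTOR_LEN;
}

/* Number of elements left over once the complete vectors are processed.
   Equivalent to n % VECTOR_LEN because VECTOR_LEN is a power of two. */
size_t leftovers(size_t n) {
    return n & (VECTOR_LEN - 1);
}
```

For the 21-element array, `full_vectors(21)` is 2 and `leftovers(21)` is 5.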

Fixing Up

There are three ways to handle these leftovers. The methods vary in requirements, performance, and code size. They are listed below in order, with the fastest approach first.

Larger Arrays

If you can change the size of the arrays that you are processing, increase the length of the array to the next multiple of the vector size using padding elements. This allows you to read and write beyond the end of your data without corrupting adjacent storage.

In the example above, increasing the array size to 24 elements allows the third iteration to complete without potential data corruption.

Padding an array to fill an integer number of vectors

Notes

Allocating larger arrays will consume more memory. The increase could be significant if many short arrays are involved.
The new padding elements created at the end of the array may need to be initialized to a value that does not affect the result of the calculation. For example, if you are summing an array, the new elements must be initialized to zero for the result to be unaffected. If you are finding the minimum of an array, set the new elements to the maximum value an element can take.
In some cases, it may not be possible to initialize the padding elements to a value that does not affect the result of a calculation - when finding the range of a set of numbers, for example.
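
As a rough illustration of the padding notes above, the following portable C sketch rounds the length up to a whole number of vectors and fills the padding with a value that leaves the result unchanged: zero for a sum, the maximum byte value for a minimum. All function names are hypothetical, the inner eight-lane loops merely stand in for a Neon operation, and the array length is assumed to be greater than zero.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

enum { VECTOR_LEN = 8 };  /* assumed power of two */

/* Round n up to the next multiple of VECTOR_LEN. */
size_t padded_length(size_t n) {
    return (n + VECTOR_LEN - 1) & ~(size_t)(VECTOR_LEN - 1);
}

/* Copy n bytes (n > 0) into a new buffer padded to a whole number of
   vectors, with the padding set to `fill`. The caller frees the result.
   Returns NULL if allocation fails. */
uint8_t *make_padded(const uint8_t *src, size_t n, uint8_t fill,
                     size_t *padded_n) {
    size_t total = padded_length(n);
    uint8_t *buf = malloc(total);
    if (buf == NULL) {
        return NULL;
    }
    memcpy(buf, src, n);
    memset(buf + n, fill, total - n);
    *padded_n = total;
    return buf;
}

/* Sum a padded buffer one whole vector at a time (zero padding is neutral). */
uint32_t sum_padded(const uint8_t *buf, size_t padded_n) {
    uint32_t total = 0;
    for (size_t i = 0; i < padded_n; i += VECTOR_LEN) {
        for (size_t lane = 0; lane < VECTOR_LEN; lane++) {
            total += buf[i + lane];
        }
    }
    return total;
}

/* Minimum of a padded buffer (UINT8_MAX padding is neutral). */
uint8_t min_padded(const uint8_t *buf, size_t padded_n) {
    uint8_t best = UINT8_MAX;
    for (size_t i = 0; i < padded_n; i += VECTOR_LEN) {
        for (size_t lane = 0; lane < VECTOR_LEN; lane++) {
            if (buf[i + lane] < best) {
                best = buf[i + lane];
            }
        }
    }
    return best;
}
```

The same buffer cannot serve both calculations: the sum needs zero padding and the minimum needs maximum-value padding, which is why a calculation such as the range has no single safe padding value.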

Code Fragment


 @ r0 = input array pointer
 @ r1 = output array pointer
 @ r2 = length of data in array
 
 @ We can assume that the array length is greater than zero, is an integer 
 @ number of vectors, and is greater than or equal to the length of data 
 @ in the array.
 
     add  r2, r2, #7      @ add (vector length-1) to the data length
     lsr  r2, r2, #3      @ divide the length of the array by the length
                             @  of a vector, 8, to find the number of
                             @  vectors of data to be processed
 
 loop:
     subs    r2, r2, #1      @ decrement the loop counter, and set flags
     vld1.8  {d0}, [r0]!  @ load eight elements from the array pointed to
                             @  by r0 into d0, and update r0 to point to the 
                             @  next vector
     ...
     ...                  @ process the input in d0
     ...
 
     vst1.8  {d0}, [r1]!  @ write eight elements to the output array, and
                             @  update r1 to point to next vector
     bne  loop            @ if r2 is not equal to 0, loop

Overlapping

If the operation is suitable, leftover elements can be handled using overlapping. This involves processing some of the elements in the array twice.

In the example case, the first iteration would process elements zero to seven, the second processes elements five to 12, and the third 13 to 20. Notice that elements five to seven, the overlap between the first and second vectors, have been processed twice.

Overlapping vectors. Elements in orange are processed twice

Notes

Overlapping can be used only when the operation applied to the input data does not vary with the number of times the operation is applied; the operation must be idempotent. For example, it can be used if you are trying to find the maximum element in an array. It cannot be used if you are summing an array, because the overlapped elements will be counted twice.
The number of elements in the array must fill at least one complete vector.
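
The difference between a suitable and an unsuitable operation can be seen in a small C model. This is a sketch only: `reduce_overlapped` and `reduction_t` are hypothetical names, and the eight-lane helper stands in for one Neon vector operation. It walks the array as overlapping vectors and computes both a maximum and a sum.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { VECTOR_LEN = 8 };  /* assumed power of two */

typedef struct {
    uint8_t  max;  /* unaffected by seeing an element twice */
    uint32_t sum;  /* counts overlapped elements twice */
} reduction_t;

/* Fold one eight-element vector into the running results. */
static void accumulate8(const uint8_t *v, reduction_t *r) {
    for (size_t lane = 0; lane < VECTOR_LEN; lane++) {
        if (v[lane] > r->max) {
            r->max = v[lane];
        }
        r->sum += v[lane];
    }
}

/* Visit the array as overlapping vectors: one vector at element 0, then
   whole vectors starting at the leftover count so the last one ends
   exactly at the end of the data. */
reduction_t reduce_overlapped(const uint8_t *a, size_t n) {
    assert(n >= VECTOR_LEN);  /* at least one complete vector is required */
    reduction_t r = { 0, 0 };
    size_t start = 0;
    size_t leftover = n & (VECTOR_LEN - 1);
    if (leftover != 0) {
        accumulate8(a, &r);
        start = leftover;
    }
    for (; start < n; start += VECTOR_LEN) {
        accumulate8(a + start, &r);
    }
    return r;
}
```

For 21 elements holding the values 0 to 20, the maximum comes out as 20, which is correct, but the sum comes out as 228 rather than 210 because elements five to seven are added twice.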

Code Fragment


 @ r0 = input array pointer
 @ r1 = output array pointer
 @ r2 = length of data in array
 
 @ We can assume that the operation is idempotent, and the array is greater
 @ than or equal to one vector long.
 
     ands    r3, r2, #7      @ calculate number of elements left over after
                             @  processing complete vectors using
                             @  data length & (vector length - 1)
     beq  loopsetup    @ if the result of the ands is zero, the length
                             @  of the data is an integer number of vectors,
                             @  so there is no overlap, and processing can begin 
                             @  at the loop
 
                             @ handle the first vector separately
     vld1.8  {d0}, [r0], r3  @ load the first eight elements from the array,
                             @  and update the pointer by the number of elements
                             @  left over
     ...
     ...                  @ process the input in d0
     ...
 
     vst1.8  {d0}, [r1], r3  @ write eight elements to the output array, and
                             @  update the pointer
 
                             @ now, set up the vector processing loop
 loopsetup:
     lsr  r2, r2, #3      @ divide the length of the array by the length
                             @  of a vector, 8, to find the number of
                             @  vectors of data to be processed
 
                             @ the loop can now be executed as normal. the
                             @  first few elements of the first vector will
                             @  overlap with some of those processed above
 loop:
     subs    r2, r2, #1      @ decrement the loop counter, and set flags
     vld1.8  {d0}, [r0]!  @ load eight elements from the array, and update
                             @  the pointer
     ...
     ...                  @ process the input in d0
     ...
 
     vst1.8  {d0}, [r1]!  @ write eight elements to the output array, and
                             @  update the pointer
     bne  loop            @ if r2 is not equal to 0, loop
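
For readers who prefer C, the control flow of the fragment above might be modelled as follows. This is a portable sketch rather than Neon code: `process8` stands in for the load, process and store of one vector, and the clamp to a hypothetical `LIMIT` of 100 is just one example of an idempotent operation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { VECTOR_LEN = 8, LIMIT = 100 };  /* LIMIT is an arbitrary example value */

/* Stand-in for vld1.8 / process / vst1.8 on one vector: clamp each
   element to LIMIT. Clamping twice gives the same result as clamping once. */
static void process8(const uint8_t *in, uint8_t *out) {
    for (size_t lane = 0; lane < VECTOR_LEN; lane++) {
        out[lane] = in[lane] > LIMIT ? (uint8_t)LIMIT : in[lane];
    }
}

/* Process n elements (n >= VECTOR_LEN) using the overlapping technique. */
void clamp_overlapped(const uint8_t *in, uint8_t *out, size_t n) {
    assert(n >= VECTOR_LEN);
    size_t leftover = n & (VECTOR_LEN - 1);
    if (leftover != 0) {
        /* Handle the first vector separately, then advance both pointers
           by the leftover count only. */
        process8(in, out);
        in += leftover;
        out += leftover;
    }
    /* The remaining data is now a whole number of vectors. */
    for (size_t count = n / VECTOR_LEN; count != 0; count--) {
        process8(in, out);
        in += VECTOR_LEN;
        out += VECTOR_LEN;
    }
}
```

Because the clamp is idempotent, this model also gives the right answer when `in` and `out` point to the same buffer; an operation such as adding one to each element would not, as the overlapped elements would be modified twice.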

Single Elements

Neon provides loads and stores that can operate on single elements in a vector. Using these, you can load a partial vector containing one element, operate on it, and write the element back to memory.

For the example problem, the first two iterations execute as normal, processing elements zero to seven, and eight to 15. The third iteration needs only to process five elements. They are handled in a separate loop, which loads, processes and stores single elements.

Processing single elements

Notes

This approach is slower than the previous methods, as each element must be loaded, processed and stored individually.
Handling leftovers like this requires two loops - one for the vectors, and a second for the single elements. This can double the amount of code in the function.
Neon single element loads only change the value of the destination element, leaving the rest of the vector intact. If the calculation that you are vectorizing involves instructions that work across a vector, such as VPADD, the register must be initialized before loading the first single element into it.
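
The last note can be illustrated with a small C model of a register. Everything here is hypothetical: `vec8_t` models a D register as eight byte lanes, `load_lane0` models a single element load into lane 0, and `sum_all_lanes` stands in for an across-vector calculation such as a series of pairwise adds.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { VECTOR_LEN = 8 };

/* Model of a 64-bit D register holding eight byte lanes. */
typedef struct {
    uint8_t lane[VECTOR_LEN];
} vec8_t;

/* Model of a single element load: only lane 0 changes. */
void load_lane0(vec8_t *v, const uint8_t *src) {
    v->lane[0] = *src;
}

/* Model of an operation that works across the whole vector. */
unsigned sum_all_lanes(const vec8_t *v) {
    unsigned total = 0;
    for (size_t i = 0; i < VECTOR_LEN; i++) {
        total += v->lane[i];
    }
    return total;
}

/* Sum `count` leftover elements one at a time, starting from whatever
   the register `reg` already holds. If clear_first is nonzero, the
   register is zeroed before the first single element load. */
unsigned sum_singles(vec8_t reg, const uint8_t *src, size_t count,
                     int clear_first) {
    unsigned total = 0;
    if (clear_first) {
        memset(&reg, 0, sizeof reg);
    }
    for (size_t i = 0; i < count; i++) {
        load_lane0(&reg, &src[i]);
        total += sum_all_lanes(&reg);
    }
    return total;
}
```

If the register still holds values from an earlier iteration, the uncleared version adds those stale lanes into the result, while the cleared version returns only the sum of the loaded elements.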

Code Fragment


 @ r0 = input array pointer
 @ r1 = output array pointer
 @ r2 = length of data in array
 
     lsrs    r3, r2, #3      @ calculate the number of complete vectors to be
                             @  processed and set flags
     beq  singlesetup  @ if there are zero complete vectors, branch to
                             @  the single element handling code
 
                             @ process vector loop
 vectors:
     subs    r3, r3, #1      @ decrement the loop counter, and set flags
     vld1.8  {d0}, [r0]!  @ load eight elements from the array and update
                             @  the pointer
     ...
     ...                  @ process the input in d0
     ...
 
     vst1.8  {d0}, [r1]!  @ write eight elements to the output array, and
                             @  update the pointer
     bne  vectors      @ if r3 is not equal to zero, loop
 
 singlesetup:
     ands    r3, r2, #7      @ calculate the number of single elements to process
     beq  exit            @ if the number of single elements is zero, branch
                             @  to exit
 
                             @ process single element loop
 singles:
     subs    r3, r3, #1      @ decrement the loop counter, and set flags
     vld1.8  {d0[0]}, [r0]!  @ load single element into d0, and update the
                             @  pointer
     ...
     ...                  @ process the input in d0[0]
     ...
 
     vst1.8  {d0[0]}, [r1]!  @ write the single element to the output array,
                             @  and update the pointer
     bne  singles      @ if r3 is not equal to zero, loop
 
 exit:
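
A C sketch of the same two-loop structure is shown below. It is illustrative only: `add_one` and `process8` are hypothetical names, and adding one to each element is simply an example of an operation that is not idempotent and so could not use overlapping. When the compiler defines `__ARM_NEON`, the vector step uses the `vld1_u8`, `vadd_u8`, `vdup_n_u8` and `vst1_u8` intrinsics from `arm_neon.h`; otherwise a plain loop stands in so the sketch still builds on other targets.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

enum { VECTOR_LEN = 8 };

/* Load, process and store one vector of eight bytes: add 1 to each lane,
   wrapping modulo 256. */
static void process8(const uint8_t *in, uint8_t *out) {
#if defined(__ARM_NEON)
    vst1_u8(out, vadd_u8(vld1_u8(in), vdup_n_u8(1)));
#else
    for (size_t lane = 0; lane < VECTOR_LEN; lane++) {
        out[lane] = (uint8_t)(in[lane] + 1);
    }
#endif
}

/* Add 1 to each of n elements: whole vectors first, then single elements. */
void add_one(const uint8_t *in, uint8_t *out, size_t n) {
    size_t vectors = n / VECTOR_LEN;        /* complete vectors */
    size_t singles = n & (VECTOR_LEN - 1);  /* leftover elements */

    for (; vectors != 0; vectors--) {
        process8(in, out);
        in += VECTOR_LEN;
        out += VECTOR_LEN;
    }
    for (; singles != 0; singles--) {
        *out++ = (uint8_t)(*in++ + 1);
    }
}
```

Unlike the padding and overlapping approaches, this never reads or writes outside the `n` elements it is given, and it works for lengths shorter than one vector.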

Further Considerations

Beginning or End

The overlapping and single element techniques can be applied at the start or end of processing an array. The code above can be easily adapted to fix up elements at either end, if it is more suitable for your application.

Alignment

Load and store addresses should be aligned to cache lines, allowing more efficient memory accesses.

This requires at least 16-word (64-byte) alignment on Cortex-A8. If you cannot align the start of your input and output arrays, you must handle elements at the beginning of the array (for alignment) and at the end of the array (for the incomplete final vector).
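
One way to plan this split is sketched below in C. The names `split_t` and `split_for_alignment` are hypothetical, elements are assumed to be one byte each, and the alignment is assumed to be a power of two (64 bytes for a 16-word cache line).

```c
#include <stddef.h>
#include <stdint.h>

/* How many elements to handle before, during and after the aligned
   vector loop. */
typedef struct {
    size_t head;  /* single elements needed to reach an aligned address */
    size_t body;  /* elements covered by whole, aligned vectors */
    size_t tail;  /* single elements left after the last whole vector */
} split_t;

/* Split n byte elements starting at p. `align` must be a power of two
   and `vector_len` must be nonzero. */
split_t split_for_alignment(const void *p, size_t n, size_t align,
                            size_t vector_len) {
    size_t misalign = (size_t)((uintptr_t)p & (align - 1));
    size_t head = misalign != 0 ? align - misalign : 0;
    if (head > n) {
        head = n;  /* the array ends before an aligned address is reached */
    }
    size_t rest = n - head;
    split_t s;
    s.head = head;
    s.tail = rest % vector_len;
    s.body = rest - s.tail;
    return s;
}
```

The head and tail counts can then be processed with the overlapping or single element techniques, and the body with the aligned vector loop.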

When aligning memory accesses for speed, remember to use the :64, :128, or :256 address qualifiers with your load and store instructions for optimum performance. You can compare the number of cycles required to issue a load or store using the data available in the Technical Reference Manual for your target core.

Here's the relevant page in the Cortex-A8 TRM.

Using Arm to Fix Up

In the single elements case, you could use Arm instructions to operate on each element. However, storing to the same area of memory with both Arm and Neon instructions can reduce performance, as the writes from the Arm pipeline are delayed until writes from the Neon pipeline have been completed.

Generally, you should avoid writing to the same area of memory (specifically, the same cache line) from both Arm and Neon code.

In the next post, we will look at a practical application of Neon: Matrix Multiplication.

Read Part 3: Matrix Multiplication

Frank Barchard over 12 years ago

FYI libyuv uses both of these techniques - overlap, I call 'last16', because I do the multiple of 16 and then redo the last 16, with overlap. And remainder. Both are done in row_any.cc.

caveat - the overlap technique doesn't work with overlapping pointers. e.g. in-place manipulation.

There are 2 other techniques that can be used at a higher level

coalescing - which merges sequential operations. e.g. rows in an image
blocking - which breaks a single operation into multiple smaller ones.

Re 6 bytes

If you really need to load/store 6 bytes, you could try three 16 bit values:

vld3.16 {d0[0],d1[0],d2[0]}, [r0]!
vst3.16 {d0[0],d1[0],d2[0]}, [r1]!

Coomy gau over 12 years ago

If I only want to load 6 bytes of data from the array at R0, I code it like this:

vld1.8 {D0}, [R0]

But this instruction loads 8 bytes into D0, and then

vst1.8 {D0}, [R1]

stores 8 bytes into memory.

I only want to copy 6 bytes; the other 2 bytes may overwrite useful data.
How should I do this?
Thanks!

Rafael Spring over 12 years ago

The __attribute__(aligned(x)) attribute is for aligning variables on the stack or for aligning function addresses. I haven't seen it being used as a function flag though. Are you sure about this?
khem khem over 12 years ago

QUOTE (xtrawurst @ May 12 2010, 07:19 PM)
How can I enable alignment for load/store operations in C (using the GCC compiler intrinsics)? From what I read about aligned loads and stores, you can specify alignment using additional "@16, @32, ..." flags, but that only works with assembler code, right? Is there any way to have this functionality in C as well?

there is attribute to do that e.g.

__attribute__((aligned(x))) where x is the amount of alignment you seek.

int x __attribute__ ((aligned (16))) = 0;

would align 'x' to 16-byte boundary
Rafael Spring over 12 years ago

How can I enable alignment for load/store operations in C (using the GCC compiler intrinsics)? From what I read about aligned loads and stores, you can specify alignment using additional "@16, @32, ..." flags, but that only works with assembler code, right? Is there any way to have this functionality in C as well?