Coding for NEON - Part 5: Rearranging Vectors

This article describes the instructions provided by NEON for rearranging data within vectors. Previous articles in this series: Part 1: Loads and Stores, Part 2: Dealing with Leftovers, Part 3: Matrix Multiplication and Part 4: Shifting Left and Right.


When writing code for NEON, you may sometimes find that the data in your registers is not quite in the correct format for your algorithm. You may need to rearrange the elements in your vectors so that subsequent arithmetic can add the correct parts together, or perhaps the data passed to your function is in a strange format, and must be reordered before your speedy SIMD code can handle it.

This reordering operation is called a permutation. Permutation instructions rearrange individual elements, selected from single or multiple registers, to form a new vector.

Before we begin

Before you dive into using the permutation instructions provided by NEON, consider whether you really need to use them. Permutation instructions are similar to move instructions, in that they often represent CPU cycles consumed preparing data, rather than processing it.

Your code is not speed-optimal until it completes its task in the fewest possible cycles, and move and permute instructions are often good targets for optimization.


How do you avoid unnecessary permutes? There are a number of options:

  • Rearrange your input data. It often costs nothing to store your data in a more appropriate format, avoiding the need to permute on load and store. However, consider data locality and its effect on cache performance before changing your data structures.
  • Redesign your algorithm. A different algorithm may be available that uses a similar number of processing steps, but can handle data in a different format.
  • Modify the previous processing stage. A small change to an earlier processing stage, adjusting the way in which data is stored to memory, may reduce or eliminate the need for permutation operations.
  • Use interleaving loads and stores. As we've seen previously, load and store instructions have the ability to interleave and deinterleave. Even if this doesn't completely eliminate the need to permute, it can reduce the number of additional instructions you need.
  • Combine approaches. Using more than one of these techniques can still be more efficient than additional permutation instructions.

If you have considered all of these, but none put your data in a more suitable format, try using the permutation instructions.


NEON provides a range of permutation instructions, from basic reversals to arbitrary vector reconstruction. Simple permutations can be achieved using instructions that take a single cycle to issue, whereas the more complex operations use multiple cycles, and may require additional registers to be set up. As always, benchmark or profile your code regularly, and check your processor's Technical Reference Manual (Cortex-A8, Cortex-A9) for performance details.

VMOV and VSWP: Move and Swap

VMOV and VSWP are the simplest permute instructions, copying the contents of an entire register to another, or swapping the values in a pair of registers.

Although you may not regard them as permute instructions, they can be used to change the values in the two D registers that make up a Q register. For example, VSWP d0, d1 swaps the most-significant and least-significant 64 bits of q0.
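
As a rough illustration of that last point, the scalar C sketch below models a Q register as two 64-bit halves and shows the effect of VSWP d0, d1 on q0. The qreg type and the model_vswp function are hypothetical names invented for this sketch; they are not NEON intrinsics, and the code only mimics the data movement on any host.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of a Q register: d[0] stands in for the low
   D register (d0 for q0) and d[1] for the high D register (d1 for q0). */
typedef struct { uint64_t d[2]; } qreg;

/* Models VSWP d0, d1: exchange the two 64-bit halves of the Q register. */
static void model_vswp(qreg *q)
{
    assert(q != NULL);
    uint64_t tmp = q->d[0];
    q->d[0] = q->d[1];
    q->d[1] = tmp;
}
```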

VREV: Reverse

VREV reverses the order of 8, 16 or 32-bit elements within a vector. There are three variants:

  • VREV16 reverses each pair of 8-bit sub-elements making up 16-bit elements within a vector.
  • VREV32 reverses the four 8-bit or two 16-bit sub-elements making up 32-bit elements within a vector.
  • VREV64 reverses the eight 8-bit, four 16-bit or two 32-bit elements making up each 64-bit doubleword of a vector.

Use VREV to reverse the endianness of data, rearrange color components or exchange channels of audio samples.
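
To make the variants concrete, here is a minimal scalar C sketch of VREV acting on a single 64-bit D register. The function name model_vrev and its region/element parameters are assumptions made for this illustration, not part of any NEON API; the comment maps each parameter pair to the instruction it is meant to imitate.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar model of VREV on a 64-bit D register held as 8 bytes
   (byte 0 is the least significant). Elements of elem_bytes each are
   reversed within every region of region_bytes:
     VREV16.8  -> region 2, elem 1
     VREV32.8  -> region 4, elem 1    VREV32.16 -> region 4, elem 2
     VREV64.8  -> region 8, elem 1    VREV64.16 -> region 8, elem 2
     VREV64.32 -> region 8, elem 4 */
static void model_vrev(uint8_t dst[8], const uint8_t src[8],
                       size_t region_bytes, size_t elem_bytes)
{
    assert(elem_bytes > 0 && elem_bytes < region_bytes);
    assert(region_bytes <= 8 && 8 % region_bytes == 0);
    assert(region_bytes % elem_bytes == 0);

    uint8_t out[8];
    size_t elems = region_bytes / elem_bytes;
    for (size_t r = 0; r < 8; r += region_bytes) {
        for (size_t e = 0; e < elems; e++) {
            /* Element e of the region moves to position elems - 1 - e. */
            memcpy(&out[r + (elems - 1 - e) * elem_bytes],
                   &src[r + e * elem_bytes], elem_bytes);
        }
    }
    memcpy(dst, out, 8);
}
```

In this model, the region 4, element 1 case (VREV32.8) swaps the bytes of each 32-bit word, which is the endianness reversal described above.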

VEXT: Extract

VEXT extracts a new vector of bytes from a pair of existing vectors. The bytes in the new vector are taken from the top of the first operand and the bottom of the second operand, with an immediate operand selecting the byte in the first operand at which extraction starts. This allows you to produce a new vector containing elements that straddle a pair of existing vectors.

VEXT can be used to implement a moving window on data from two vectors, useful in FIR filters. For permutation, it can also be used to simulate a byte-wise rotate operation, when using the same vector for both input operands.
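
The following scalar C sketch shows one way to picture VEXT.8 on D registers, including the rotate trick. The name model_vext8 is hypothetical and exists only for this illustration; it is a model of the byte selection, not an intrinsic.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar model of VEXT.8 Dd, Dn, Dm, #imm on 64-bit D
   registers: treat Dn followed by Dm as a 16-byte sequence (Dn in the low
   half) and copy the 8 bytes that start at byte position imm. */
static void model_vext8(uint8_t dd[8], const uint8_t dn[8],
                        const uint8_t dm[8], unsigned imm)
{
    assert(imm < 8);
    uint8_t out[8];
    for (unsigned i = 0; i < 8; i++) {
        unsigned j = i + imm;
        /* Low result bytes come from the top of Dn, the rest from the
           bottom of Dm. */
        out[i] = (j < 8) ? dn[j] : dm[j - 8];
    }
    memcpy(dd, out, 8);
}
```

Passing the same array as both dn and dm gives the byte-wise rotate: with imm set to 3, byte 3 of the input becomes byte 0 of the result and bytes 0 to 2 wrap around to the top.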

VTRN: Transpose

VTRN transposes 8, 16 or 32-bit elements between a pair of vectors. It treats the elements of the vectors as 2x2 matrices, and transposes each matrix.

Use multiple VTRN instructions to transpose larger matrices. For example, a 4x4 matrix consisting of 16-bit elements can be transposed using three VTRN instructions.

This is the same operation performed by VLD4 and VST4 after loading, or before storing, vectors. As they require fewer instructions, try to use these structured memory access features in preference to a sequence of VTRN instructions, where possible.
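
The three-instruction 4x4 transpose can be sketched in scalar C as follows. This assumes the four rows sit in d0 to d3, so that q0 covers rows 0 and 1 and q1 covers rows 2 and 3; the instruction sequence in the comments is one plausible choice under that assumption, and model_vtrn and transpose4x4_u16 are hypothetical helper names, not intrinsics.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of VTRN on registers viewed as 16-bit lanes.
   lanes: 16-bit lanes per register (4 for D, 8 for Q).
   width: element size in 16-bit lanes (1 for VTRN.16, 2 for VTRN.32).
   Within each pair of elements, the odd-numbered element of a is exchanged
   with the even-numbered element of b. a and b must not overlap. */
static void model_vtrn(uint16_t *a, uint16_t *b, size_t lanes, size_t width)
{
    assert(width > 0 && lanes % (2 * width) == 0);
    for (size_t e = 0; e < lanes; e += 2 * width) {
        for (size_t k = 0; k < width; k++) {
            uint16_t tmp = a[e + width + k];
            a[e + width + k] = b[e + k];
            b[e + k] = tmp;
        }
    }
}

/* Transposes a row-major 4x4 matrix of 16-bit elements in place, with rows
   0..3 standing in for d0..d3 (so q0 = rows 0-1 and q1 = rows 2-3). */
static void transpose4x4_u16(uint16_t m[16])
{
    model_vtrn(m + 0, m + 4, 4, 1);   /* VTRN.16 d0, d1 */
    model_vtrn(m + 8, m + 12, 4, 1);  /* VTRN.16 d2, d3 */
    model_vtrn(m + 0, m + 8, 8, 2);   /* VTRN.32 q0, q1 */
}
```

The two 16-bit transposes fix up each 2x2 block, and the 32-bit transpose on the Q registers then exchanges the off-diagonal blocks in a single step.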

VZIP and VUZP: Zip and Unzip

VZIP interleaves the 8, 16 or 32-bit elements of a pair of vectors. The operation is the same as that performed by VST2 before storing, so use VST2 rather than VZIP if you need to zip data immediately before writing back to memory.

VUZP is the inverse of VZIP, deinterleaving the 8, 16, or 32-bit elements of a pair of vectors. The operation is the same as that performed by VLD2 after loading from memory.
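
A small scalar C model may help to visualise the two operations on 8-bit elements in D registers. The names model_vzip8 and model_vuzp8 are hypothetical and used only in this sketch, which assumes the two operands are distinct registers.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { LANES = 8 };  /* 8-bit lanes in a D register */

/* Hypothetical scalar model of VZIP.8 Dd, Dm: interleave the elements of
   d and m, then write the low half of the result to d and the high half
   to m. */
static void model_vzip8(uint8_t d[LANES], uint8_t m[LANES])
{
    assert(d != m);  /* this model assumes two distinct registers */
    uint8_t z[2 * LANES];
    for (int i = 0; i < LANES; i++) {
        z[2 * i]     = d[i];
        z[2 * i + 1] = m[i];
    }
    memcpy(d, z, LANES);
    memcpy(m, z + LANES, LANES);
}

/* Hypothetical scalar model of VUZP.8 Dd, Dm: treat d followed by m as one
   sequence, then write its even-numbered elements to d and its
   odd-numbered elements to m. */
static void model_vuzp8(uint8_t d[LANES], uint8_t m[LANES])
{
    assert(d != m);  /* this model assumes two distinct registers */
    uint8_t cat[2 * LANES], even[LANES], odd[LANES];
    memcpy(cat, d, LANES);
    memcpy(cat + LANES, m, LANES);
    for (int i = 0; i < LANES; i++) {
        even[i] = cat[2 * i];
        odd[i]  = cat[2 * i + 1];
    }
    memcpy(d, even, LANES);
    memcpy(m, odd, LANES);
}
```

Applying model_vuzp8 to the output of model_vzip8 restores the original pair, which matches the description of VUZP as the inverse of VZIP.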

VTBL, VTBX: Table and Table Extend

VTBL constructs a new vector from a table of vectors and an index vector. It is a byte-wise table lookup operation.

The table consists of one to four adjacent D registers. Each byte in the index vector is used to index a byte in the table of vectors. The indexed value is inserted into the result vector at the position corresponding to the location of the original index in the index vector.

VTBL and VTBX differ in the way that out-of-range indexes are handled. If an index exceeds the length of the table, VTBL inserts zero at the corresponding position in the result vector, but VTBX leaves the value in the result vector unchanged.

If you use a single source vector as the table, VTBL allows you to implement an arbitrary permutation of a vector, at the expense of setting up an index register. If the operation is used in a loop, and the type of permutation doesn't change, you can initialize the index register outside the loop, and remove the setup overhead.
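
The lookup and the difference between the two instructions can be sketched in scalar C as below. The function model_vtbl, its is_vtbx flag and the idea of passing the table as one contiguous byte array are assumptions made for this illustration; they do not correspond to a real intrinsic signature.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar model of VTBL.8 / VTBX.8. The table is table_regs
   adjacent D registers (1 to 4), passed here as table_regs * 8 contiguous
   bytes. A non-zero is_vtbx selects VTBX handling of out-of-range indexes. */
static void model_vtbl(uint8_t dd[8], const uint8_t *table, size_t table_regs,
                       const uint8_t index[8], int is_vtbx)
{
    assert(table_regs >= 1 && table_regs <= 4);
    size_t table_len = table_regs * 8;
    uint8_t out[8];
    memcpy(out, dd, 8);  /* VTBX keeps these bytes for out-of-range indexes */
    for (size_t i = 0; i < 8; i++) {
        if (index[i] < table_len)
            out[i] = table[index[i]];  /* in range: byte-wise lookup */
        else if (!is_vtbx)
            out[i] = 0;                /* VTBL: out of range gives zero */
    }
    memcpy(dd, out, 8);
}
```

With a one-register table and the index bytes 7, 6, 5, 4, 3, 2, 1, 0, this model reverses the bytes of the source vector, which is a simple instance of the arbitrary permutation described above. In a loop, the index array plays the role of the index register that is set up once outside the loop.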


Although there are other methods to achieve permute-like operations, such as using load and store instructions to operate on single vector elements, the repeated memory accesses that these require make them significantly slower, and so they are not recommended.


It is wise to consider carefully whether your code really needs to permute your data. However, when your algorithm requires it, permute instructions provide an efficient method to get your data into the right format.

The topic for the next post in the Coding for NEON series has not been decided, so if you have ideas for NEON-related topics that you would like me to cover, please suggest them in the comments below.

  • Dear Martyn, I have read all your posts on NEON instructions and optimization. I want to develop more effective SIMD code for NEON with GCC. Do you have any suggestions, or could you share your email with me? I would like to have more discussions with you. Thank you.
  • Dear Martyn, one idea for a future episode of your great blog might be how to optimise a compare followed by a branch instruction. As far as I can see, if you do a compare in VFP or NEON and you want to branch, the flags must first be transferred. So there is no equivalent of e.g. SUBS R0,R0,#1 - BMI .jump. One topic might be how to avoid branches in NEON, or at least how to branch efficiently after a VFP/NEON compare...
  • Since many people (including me) write NEON code using compiler intrinsics (such as GCC's intrinsics), that might be a good topic to cover. Some best practices, and in particular how to write efficient code using intrinsics (avoiding stalls, hiding latency, etc.).