Coding for Neon - Part 5: Rearranging Vectors

September 11, 2013

6 minute read time.

This blog has been updated and formalized into a guide on Arm developer. You can find it here:

Coding for Neon - permutation - rearranging vectors

This article describes the instructions provided by Neon for rearranging data within vectors. Previous articles in this series:

Introduction

When writing code for Neon, you may find that sometimes, the data in your registers are not quite in the correct format for your algorithm. You may need to rearrange the elements in your vectors so that subsequent arithmetic can add the correct parts together, or perhaps the data passed to your function is in a strange format, and must be reordered before your speedy SIMD code can handle it.

This reordering operation is called a permutation. Permutation instructions rearrange individual elements, selected from single or multiple registers, to form a new vector.

Before we begin

Before you dive into using the permutation instructions provided by Neon, consider whether you really need to use them. Permutation instructions are similar to move instructions, in that they often represent CPU cycles consumed preparing data, rather than processing it.

Your code is not speed optimal until it uses the fewest number of cycles to complete a task; move and permute instructions are often good areas to target optimization.

Alternatives

How do you avoid unnecessary permutes? There are a number of options:

Rearrange your input data. It often costs nothing to store your data in a more appropriate format, avoiding the need to permute on load and store. However, consider data locality, and its effect on cache performance before changing your data structures.
Redesign your algorithm. A different algorithm may be available that uses a similar number of processing steps, but can handle data in a different format.
Modify the previous processing stage. A small change to an earlier processing stage, adjusting the way in which data is stored to memory, may reduce or eliminate the need for permutation operations.
Use interleaving loads and stores. As we've seen previously, load and store instructions have the ability to interleave and deinterleave. Even if this doesn't completely eliminate the need to permute, it can reduce the number of additional instructions you need.
Combine approaches. Using more than one of these techniques can be still be more efficient than additional permutation instructions.

If you have considered all of these, but none put your data in a more suitable format, try using the permutation instructions.

Instructions

Neon provides a range of permutation instructions, from basic reversals to arbitrary vector reconstruction. Simple permutations can be achieved using instructions that take a single cycle to issue, whereas the more complex operations use multiple cycles, and may require additional registers to be set up. As always, benchmark or profile your code regularly, and check your processor's Technical Reference Manual (Cortex-A8, Cortex-A9) for performance details.

VMOV and VSWP: Move and Swap

VMOV and VSWP are the simplest permute instructions, copying the contents of an entire register to another, or swapping the values in a pair of registers.

Although you may not regard them as permute instructions, they can be used to change the values in the two D registers that make up a Q register. For example, VSWP d0, d1 swaps the most and least-significant 64-bits of q0.

VREV: Reverse

VREV reverses the order of 8, 16 or 32-bit elements within a vector. There are three variants:

VREV16 reverses each pair of 8-bit sub-elements making up 16-bit elements within a vector.
VREV32 reverses the four 8-bit or two 16-bit sub-elements making up 32-bit elements within a vector.
VREV64 reverses eight 8-bit, four 16-bit or two 32-bit elements in a vector.

Use VREV to reverse the endianness of data, rearrange color components or exchange channels of audio samples.

VEXT: Extract

VEXT extracts a new vector of bytes from a pair of existing vectors. The bytes in the new vector are from the top of the first operand, and the bottom of the second operand. This allows you to produce a new vector containing elements that straddle a pair of existing vectors.

VEXT can be used to implement a moving window on data from two vectors, useful in FIR filters. For permutation, it can also be used to simulate a byte-wise rotate operation, when using the same vector for both input operands.

VTRN: Transpose

VTRN transposes 8, 16 or 32-bit elements between a pair of vectors. It treats the elements of the vectors as 2x2 matrices, and transposes each matrix.

Use multiple VTRN instructions to transpose larger matrices. For example, a 4x4 matrix consisting of 16-bit elements can be transposed using three VTRN instructions.

This is the same operation performed by VLD4 and VST4 after loading, or before storing, vectors. As they require fewer instructions, try to use these structured memory access features in preference to a sequence of VTRN instructions, where possible.

VZIP and VUZP: Zip and Unzip

VZIP interleaves the 8, 16 or 32-bit elements of a pair of vectors. The operation is the same as that performed by VST2 before storing, so use VST2 rather than VZIP if you need to zip data immediately before writing back to memory.

VUZP is the inverse of VZIP, deinterleaving the 8, 16, or 32-bit elements of a pair of vectors. The operation is the same as that performed by VLD2 after loading from memory.

VTBL, VTBX: Table and Table Extend

VTBL constructs a new vector from a table of vectors and an index vector. It is a byte-wise table lookup operation.

The table consists of one to four adjacent D registers. Each byte in the index vector is used to index a byte in the table of vectors. The indexed value is inserted into the result vector at the position corresponding to the location of the original index in the index vector.

VTBL and VTBX differ in the way that out-of-range indexes are handled. If an index exceeds the length of the table, VTBL inserts zero at the corresponding position in the result vector, but VTBX leaves the value in the result vector unchanged.

If you use a single source vector as the table, VTBL allows you to implement an arbitrary permutation of a vector, at the expense of setting up an index register. If the operation is used in a loop, and the type of permutation doesn't change, you can initialize the index register outside the loop, and remove the setup overhead.

Others

Although there are other methods to achieve permute-like operations, such as using load and store instructions to operate on single vector elements, the repeated memory accesses that these require makes them significantly slower, and so they are not recommended.

Conclusion

It is wise to consider carefully whether your code really needs to permute your data. However, when your algorithm requires it, permute instructions provide an efficient method to get your data into the right format.

We hope you enjoyed this series. Please check out Arm's Developer website for more information, guides and getting to grips with Neon.

Learn more about Neon

amirelattar over 6 years ago in reply to amirelattar

I found out, just in case some one needs the answer:

mov Vn.16b,Vm.16B //for 16 bit register
- Cancel
- Up 0 Down
- Reply
- More
- Cancel
amirelattar over 6 years ago

Hello Martyn

vmov is not defined for AARCH64 what replaces it to copy Qn to Qm

mov is not working on whole register?
- Cancel
- Up 0 Down
- Reply
- More
- Cancel
Rafael Spring over 12 years ago

Since many people (including me) write NEON code using compiler intrinsics (such as GCCs intrinsics), that might be a good topic to cover. Some best practices and in particular how to write efficient code using intrinsics (avoiding stalls, hiding latency, etc..).
- Cancel
- Up 0 Down
- Reply
- More
- Cancel
Michael Kübel over 12 years ago

Dear Martyn, one idea for a next episode of your great blog is may be the problem/optimisation to compare with following branch instructions. As far as I see if you do a compare in VFP or NEON and you want to branch then first the flags must be transfered. So there's no e.g. SUBS R0,R0,#1 - BMI .jump thing...so one thing might be how to avaid the branches in NEON or at least how to efficiently branch after a VFP/NEON compare...
- Cancel
- Up 0 Down
- Reply
- More
- Cancel
lei zhou over 12 years ago

Dear Martyn, I have read all your posts on neon instruction and optimization. I want to develop more effective SIMD for neon on GCC. Do you have any suggestions or can you share me your email? I want to have more discussions with you. Thank you.
- Cancel
- Up 0 Down
- Reply
- More
- Cancel

Architectures and Processors blog

Scalable Matrix Extension: Expanding the Arm Intrinsics Search Engine

Chris Walsh

Arm is pleased to announce that the Arm Intrinsics Search Engine has been updated to include the Scalable Matrix Extension (SME) intrinsics, including both SME and SME2 intrinsics.
- October 3, 2025
Arm A-Profile Architecture developments 2025

Martin Weidmann

Each year, Arm publishes updates to the A-Profile architecture alongside full Instruction Set and System Register documentation. In 2025, the update is Armv9.7-A.
- October 2, 2025
When a barrier does not block: The pitfalls of partial order

Wathsala Vithanage

Acquire fences aren’t always enough. See how LDAPR exposed unsafe interleavings and what we did to patch the problem.
- September 15, 2025

AI blog

Announcements

Architectures and Processors blog

Automotive blog

Embedded and Microcontrollers blog

Internet of Things (IoT) blog

Laptops and Desktops blog

Mobile, Graphics, and Gaming blog

Operating Systems blog

Servers and Cloud Computing blog

SoC Design and Simulation blog

Tools, Software and IDEs blog