This blog has been updated and formalized into a guide on Arm developer. You can find it here:
This article describes the instructions provided by Neon for rearranging data within vectors. Previous articles in this series:
When writing code for Neon, you may find that sometimes, the data in your registers are not quite in the correct format for your algorithm. You may need to rearrange the elements in your vectors so that subsequent arithmetic can add the correct parts together, or perhaps the data passed to your function is in a strange format, and must be reordered before your speedy SIMD code can handle it.
This reordering operation is called a permutation. Permutation instructions rearrange individual elements, selected from single or multiple registers, to form a new vector.
Before you dive into using the permutation instructions provided by Neon, consider whether you really need to use them. Permutation instructions are similar to move instructions, in that they often represent CPU cycles consumed preparing data, rather than processing it.
Your code is not speed-optimal until it completes its task in the fewest possible cycles, and move and permute instructions are often good targets for optimization.
How do you avoid unnecessary permutes? There are a number of options:
If you have considered all of these, but none put your data in a more suitable format, try using the permutation instructions.
Neon provides a range of permutation instructions, from basic reversals to arbitrary vector reconstruction. Simple permutations can be achieved using instructions that take a single cycle to issue, whereas the more complex operations use multiple cycles, and may require additional registers to be set up. As always, benchmark or profile your code regularly, and check your processor's Technical Reference Manual (Cortex-A8, Cortex-A9) for performance details.
VMOV and VSWP are the simplest permute instructions, copying the contents of an entire register to another, or swapping the values in a pair of registers.
Although you may not regard them as permute instructions, they can be used to change the values in the two D registers that make up a Q register. For example, VSWP d0, d1 swaps the most-significant and least-significant 64 bits of q0.
VSWP d0, d1
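To make the effect of the swap concrete, here is a minimal scalar sketch in portable C. It is a model of the behavior rather than Neon code: the `qreg_model` type and the `vswp_halves` function are hypothetical names invented for this illustration, and the model assumes d0 is the least-significant half of q0.

```c
#include <stdint.h>

/* Hypothetical scalar model of a 128-bit Q register seen as two D halves.
   d[0] stands in for d0 (least-significant), d[1] for d1 (most-significant). */
typedef struct { uint64_t d[2]; } qreg_model;

/* Models VSWP d0, d1: exchange the two 64-bit halves of q0. */
void vswp_halves(qreg_model *q)
{
    uint64_t tmp = q->d[0];
    q->d[0] = q->d[1];
    q->d[1] = tmp;
}
```

On real hardware this is a single instruction; the model only shows which bits end up where.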
VREV reverses the order of 8, 16 or 32-bit elements within each halfword, word or doubleword of a vector, depending on the variant. There are three variants:
Use VREV to reverse the endianness of data, rearrange color components or exchange channels of audio samples.
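The reversal can be sketched as a scalar model in portable C. This is an illustration under stated assumptions, not Neon code: `vrev_model` is a hypothetical helper that works on a little-endian byte image of the register, with the region size and element size passed in bytes (for example, region 4 and element 1 would correspond to VREV32.8).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: reverse the order of elem_bytes-sized elements inside
   each region_bytes-sized region of a vec_bytes-long vector, in place. */
void vrev_model(uint8_t *vec, size_t vec_bytes,
                size_t region_bytes, size_t elem_bytes)
{
    size_t per_region = region_bytes / elem_bytes;
    for (size_t r = 0; r + region_bytes <= vec_bytes; r += region_bytes) {
        for (size_t i = 0; i < per_region / 2; i++) {
            uint8_t *lo = vec + r + i * elem_bytes;
            uint8_t *hi = vec + r + (per_region - 1 - i) * elem_bytes;
            /* Swap whole elements, keeping the bytes inside each one in order. */
            for (size_t b = 0; b < elem_bytes; b++) {
                uint8_t tmp = lo[b];
                lo[b] = hi[b];
                hi[b] = tmp;
            }
        }
    }
}
```

Calling `vrev_model(buf, 8, 4, 1)` on eight bytes byte-swaps each 32-bit word, which is the endianness reversal mentioned above.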
VEXT extracts a new vector of bytes from a pair of existing vectors. The bytes in the new vector are taken from the top of the first operand and the bottom of the second operand, with an immediate operand setting the byte position at which extraction starts. This allows you to produce a new vector containing elements that straddle a pair of existing vectors.
VEXT can be used to implement a moving window on data from two vectors, useful in FIR filters. For permutation, it can also be used to simulate a byte-wise rotate operation, when using the same vector for both input operands.
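Both uses can be illustrated with a scalar model in portable C. This is a sketch of the byte selection only, assuming 8-byte D registers and an immediate in the range 0 to 7; `vext8_model` is a hypothetical name, not an intrinsic.

```c
#include <stdint.h>

/* Hypothetical model of VEXT.8 Dd, Dn, Dm, #imm with 0 <= imm <= 7.
   The result is bytes imm..7 of n followed by bytes 0..imm-1 of m. */
void vext8_model(uint8_t dst[8], const uint8_t n[8], const uint8_t m[8],
                 unsigned imm)
{
    uint8_t tmp[8];                      /* lets dst alias n or m safely */
    for (unsigned i = 0; i < 8; i++) {
        unsigned src = i + imm;
        tmp[i] = (src < 8) ? n[src] : m[src - 8];
    }
    for (unsigned i = 0; i < 8; i++)
        dst[i] = tmp[i];
}
```

Passing two different vectors gives the sliding window across their boundary, while passing the same vector as both `n` and `m` gives the byte-wise rotate described above.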
VTRN transposes 8, 16 or 32-bit elements between a pair of vectors. It treats the elements of the vectors as 2x2 matrices, and transposes each matrix.
Use multiple VTRN instructions to transpose larger matrices. For example, a 4x4 matrix consisting of 16-bit elements can be transposed using three VTRN instructions.
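The three-instruction 4x4 transpose can be checked with a scalar model in portable C. The sketch below is an assumption-laden illustration, not Neon code: `vtrn_model` and `transpose4x4_u16` are hypothetical helpers, the matrix is stored row-major with rows 0 to 3 standing in for d0 to d3, and the instruction sequence in the comments is one plausible ordering rather than the only one.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of VTRN on two vectors held as arrays of 16-bit lanes.
   lanes: 16-bit lanes per vector (4 for a D register, 8 for a Q register).
   width: 16-bit lanes per element (1 models VTRN.16, 2 models VTRN.32).
   Transposing each 2x2 block swaps the odd element of a with the even
   element of b. */
void vtrn_model(uint16_t *a, uint16_t *b, size_t lanes, size_t width)
{
    for (size_t e = 0; e + 2 * width <= lanes; e += 2 * width) {
        for (size_t k = 0; k < width; k++) {
            uint16_t tmp = a[e + width + k];
            a[e + width + k] = b[e + k];
            b[e + k] = tmp;
        }
    }
}

/* Transpose a row-major 4x4 matrix of 16-bit elements in place.
   m[0..7] stands in for q0 (d0, d1) and m[8..15] for q1 (d2, d3). */
void transpose4x4_u16(uint16_t m[16])
{
    vtrn_model(m + 0, m + 4, 4, 1);    /* models VTRN.16 d0, d1 */
    vtrn_model(m + 8, m + 12, 4, 1);   /* models VTRN.16 d2, d3 */
    vtrn_model(m + 0, m + 8, 8, 2);    /* models VTRN.32 q0, q1 */
}
```

The first two steps transpose the 2x2 blocks of 16-bit elements within each row pair, and the third exchanges the off-diagonal 2x2 blocks as 32-bit elements.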
VLD4 performs the same operation after loading vectors, and VST4 performs it before storing them. Because they need fewer instructions, prefer these structured memory access features to a sequence of VTRN instructions where possible.
VZIP interleaves the 8, 16 or 32-bit elements of a pair of vectors. The operation is the same as that performed by VST2 before storing, so use VST2 rather than VZIP if you need to zip data immediately before writing back to memory.
VUZP is the inverse of VZIP, deinterleaving the 8, 16, or 32-bit elements of a pair of vectors. The operation is the same as that performed by VLD2 after loading from memory.
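The interleave and deinterleave patterns can be sketched as scalar models in portable C. These are illustrations only, assuming 8-bit elements in a pair of 8-byte D registers that are both overwritten with the result; `vzip8_model` and `vuzp8_model` are hypothetical names.

```c
#include <stdint.h>

/* Hypothetical model of VZIP.8 on two 8-byte vectors, in place.
   Afterwards a holds a0 b0 a1 b1 a2 b2 a3 b3 and b holds a4 b4 ... a7 b7. */
void vzip8_model(uint8_t a[8], uint8_t b[8])
{
    uint8_t z[16];
    for (int i = 0; i < 8; i++) {
        z[2 * i]     = a[i];
        z[2 * i + 1] = b[i];
    }
    for (int i = 0; i < 8; i++) {
        a[i] = z[i];
        b[i] = z[8 + i];
    }
}

/* Hypothetical model of VUZP.8: the even-indexed bytes of the register
   pair go to a, and the odd-indexed bytes go to b. */
void vuzp8_model(uint8_t a[8], uint8_t b[8])
{
    uint8_t z[16];
    for (int i = 0; i < 8; i++) {
        z[i]     = a[i];
        z[8 + i] = b[i];
    }
    for (int i = 0; i < 8; i++) {
        a[i] = z[2 * i];
        b[i] = z[2 * i + 1];
    }
}
```

Applying `vuzp8_model` to the output of `vzip8_model` restores the original pair, which is the inverse relationship described above.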
VTBL constructs a new vector from a table of vectors and an index vector. It is a byte-wise table lookup operation.
The table consists of one to four adjacent D registers. Each byte in the index vector is used to index a byte in the table of vectors. The indexed value is inserted into the result vector at the position corresponding to the location of the original index in the index vector.
VTBL and VTBX differ in the way that out-of-range indexes are handled. If an index exceeds the length of the table, VTBL inserts zero at the corresponding position in the result vector, but VTBX leaves the value in the result vector unchanged.
If you use a single source vector as the table, VTBL allows you to implement an arbitrary permutation of a vector, at the expense of setting up an index register. If the operation is used in a loop, and the type of permutation doesn't change, you can initialize the index register outside the loop, and remove the setup overhead.
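The lookup, and the difference between VTBL and VTBX, can be sketched as scalar models in portable C. This is an illustration under assumptions rather than Neon code: `vtbl8_model` and `vtbx8_model` are hypothetical helpers, the table is passed as a flat byte array of 8, 16, 24 or 32 bytes to represent one to four D registers, and the destination must not overlap the table.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of VTBL.8: each index byte selects a byte from the
   table; out-of-range indexes produce zero. dst must not overlap table. */
void vtbl8_model(uint8_t dst[8], const uint8_t *table, size_t table_len,
                 const uint8_t idx[8])
{
    for (int i = 0; i < 8; i++)
        dst[i] = (idx[i] < table_len) ? table[idx[i]] : 0;
}

/* Hypothetical model of VTBX.8: the same lookup, but out-of-range indexes
   leave the existing byte in dst unchanged. */
void vtbx8_model(uint8_t dst[8], const uint8_t *table, size_t table_len,
                 const uint8_t idx[8])
{
    for (int i = 0; i < 8; i++)
        if (idx[i] < table_len)
            dst[i] = table[idx[i]];
}
```

With a single 8-byte table and an index vector of 7, 6, 5, 4, 3, 2, 1, 0, the lookup reverses the bytes of the source. Any other arrangement only needs a different index vector, which is why the index can be set up once outside a loop.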
Although there are other methods to achieve permute-like operations, such as using load and store instructions to operate on single vector elements, the repeated memory accesses that these require make them significantly slower, and so they are not recommended.
It is wise to consider carefully whether your code really needs to permute your data. However, when your algorithm requires it, permute instructions provide an efficient method to get your data into the right format.
We hope you enjoyed this series. Please check out Arm's Developer website for more information, guides, and help getting to grips with Neon.
Learn more about Neon: https://developer.arm.com/architectures/instruction-sets/simd-isas/neon