Coding for Neon - Part 5: Rearranging Vectors

September 11, 2013

6 minute read time.

This blog has been updated and formalized into a guide on Arm developer. You can find it here:

Coding for Neon - permutation - rearranging vectors

This article describes the instructions provided by Neon for rearranging data within vectors. Previous articles in this series:

Introduction

When writing code for Neon, you may find that sometimes, the data in your registers are not quite in the correct format for your algorithm. You may need to rearrange the elements in your vectors so that subsequent arithmetic can add the correct parts together, or perhaps the data passed to your function is in a strange format, and must be reordered before your speedy SIMD code can handle it.

This reordering operation is called a permutation. Permutation instructions rearrange individual elements, selected from single or multiple registers, to form a new vector.

Before we begin

Before you dive into using the permutation instructions provided by Neon, consider whether you really need to use them. Permutation instructions are similar to move instructions, in that they often represent CPU cycles consumed preparing data, rather than processing it.

Your code is not speed optimal until it uses the fewest number of cycles to complete a task; move and permute instructions are often good areas to target optimization.

Alternatives

How do you avoid unnecessary permutes? There are a number of options:

Rearrange your input data. It often costs nothing to store your data in a more appropriate format, avoiding the need to permute on load and store. However, consider data locality, and its effect on cache performance before changing your data structures.
Redesign your algorithm. A different algorithm may be available that uses a similar number of processing steps, but can handle data in a different format.
Modify the previous processing stage. A small change to an earlier processing stage, adjusting the way in which data is stored to memory, may reduce or eliminate the need for permutation operations.
Use interleaving loads and stores. As we've seen previously, load and store instructions have the ability to interleave and deinterleave. Even if this doesn't completely eliminate the need to permute, it can reduce the number of additional instructions you need.
Combine approaches. Using more than one of these techniques can be still be more efficient than additional permutation instructions.

If you have considered all of these, but none put your data in a more suitable format, try using the permutation instructions.

Instructions

Neon provides a range of permutation instructions, from basic reversals to arbitrary vector reconstruction. Simple permutations can be achieved using instructions that take a single cycle to issue, whereas the more complex operations use multiple cycles, and may require additional registers to be set up. As always, benchmark or profile your code regularly, and check your processor's Technical Reference Manual (Cortex-A8, Cortex-A9) for performance details.

VMOV and VSWP: Move and Swap

VMOV and VSWP are the simplest permute instructions, copying the contents of an entire register to another, or swapping the values in a pair of registers.

Although you may not regard them as permute instructions, they can be used to change the values in the two D registers that make up a Q register. For example, VSWP d0, d1 swaps the most and least-significant 64-bits of q0.

VREV: Reverse

VREV reverses the order of 8, 16 or 32-bit elements within a vector. There are three variants:

VREV16 reverses each pair of 8-bit sub-elements making up 16-bit elements within a vector.
VREV32 reverses the four 8-bit or two 16-bit sub-elements making up 32-bit elements within a vector.
VREV64 reverses eight 8-bit, four 16-bit or two 32-bit elements in a vector.

Use VREV to reverse the endianness of data, rearrange color components or exchange channels of audio samples.

VEXT: Extract

VEXT extracts a new vector of bytes from a pair of existing vectors. The bytes in the new vector are from the top of the first operand, and the bottom of the second operand. This allows you to produce a new vector containing elements that straddle a pair of existing vectors.

VEXT can be used to implement a moving window on data from two vectors, useful in FIR filters. For permutation, it can also be used to simulate a byte-wise rotate operation, when using the same vector for both input operands.

VTRN: Transpose

VTRN transposes 8, 16 or 32-bit elements between a pair of vectors. It treats the elements of the vectors as 2x2 matrices, and transposes each matrix.

Use multiple VTRN instructions to transpose larger matrices. For example, a 4x4 matrix consisting of 16-bit elements can be transposed using three VTRN instructions.

This is the same operation performed by VLD4 and VST4 after loading, or before storing, vectors. As they require fewer instructions, try to use these structured memory access features in preference to a sequence of VTRN instructions, where possible.

VZIP and VUZP: Zip and Unzip

VZIP interleaves the 8, 16 or 32-bit elements of a pair of vectors. The operation is the same as that performed by VST2 before storing, so use VST2 rather than VZIP if you need to zip data immediately before writing back to memory.

VUZP is the inverse of VZIP, deinterleaving the 8, 16, or 32-bit elements of a pair of vectors. The operation is the same as that performed by VLD2 after loading from memory.

VTBL, VTBX: Table and Table Extend

VTBL constructs a new vector from a table of vectors and an index vector. It is a byte-wise table lookup operation.

The table consists of one to four adjacent D registers. Each byte in the index vector is used to index a byte in the table of vectors. The indexed value is inserted into the result vector at the position corresponding to the location of the original index in the index vector.

VTBL and VTBX differ in the way that out-of-range indexes are handled. If an index exceeds the length of the table, VTBL inserts zero at the corresponding position in the result vector, but VTBX leaves the value in the result vector unchanged.

If you use a single source vector as the table, VTBL allows you to implement an arbitrary permutation of a vector, at the expense of setting up an index register. If the operation is used in a loop, and the type of permutation doesn't change, you can initialize the index register outside the loop, and remove the setup overhead.

Others

Although there are other methods to achieve permute-like operations, such as using load and store instructions to operate on single vector elements, the repeated memory accesses that these require makes them significantly slower, and so they are not recommended.

Conclusion

It is wise to consider carefully whether your code really needs to permute your data. However, when your algorithm requires it, permute instructions provide an efficient method to get your data into the right format.

We hope you enjoyed this series. Please check out Arm's Developer website for more information, guides and getting to grips with Neon.

Learn more about Neon

Parents

Michael Kübel over 10 years ago

Dear Martyn, one idea for a next episode of your great blog is may be the problem/optimisation to compare with following branch instructions. As far as I see if you do a compare in VFP or NEON and you want to branch then first the flags must be transfered. So there's no e.g. SUBS R0,R0,#1 - BMI .jump thing...so one thing might be how to avaid the branches in NEON or at least how to efficiently branch after a VFP/NEON compare...
- Cancel
- Up 0 Down
- Reply
- More
- Cancel

Comment

Michael Kübel over 10 years ago

Dear Martyn, one idea for a next episode of your great blog is may be the problem/optimisation to compare with following branch instructions. As far as I see if you do a compare in VFP or NEON and you want to branch then first the flags must be transfered. So there's no e.g. SUBS R0,R0,#1 - BMI .jump thing...so one thing might be how to avaid the branches in NEON or at least how to efficiently branch after a VFP/NEON compare...
- Cancel
- Up 0 Down
- Reply
- More
- Cancel

Children

No Data

Architectures and Processors blog

Deep dive into the PMU value of L2D_CACHE_WR on the Neoverse N2 server

Ker Liu

In-depth analysis of what the PMU of L2D_CACHE_WR counts on the Neoverse N2 server.
- April 15, 2024
Arm SPE: SoC Telemetry & Performance Analysis using Statistical Profiling Extension

Brian Jeff

We refer to the SPE performance methodology whitepaper published by Arm for details on the content of this blog.
- December 8, 2023
Implementing the WebAssembly bitmask operations on the 64-bit Arm architecture

Anton Kirilov

We discuss some of the challenges that we face when we are trying to implement the WebAssembly SIMD bitmask operations on the 64-bit Arm architecture.
- December 6, 2023

AI and ML blog

Announcements

Architectures and Processors blog

Automotive blog

Embedded blog

Graphics, Gaming, and VR blog

High Performance Computing (HPC) blog

Infrastructure Solutions blog

Internet of Things (IoT) blog

Operating Systems blog

SoC Design and Simulation blog

Tools, Software and IDEs blog