Arm Community

Coding for Neon - Part 1: Load and Stores

Martyn
September 11, 2013
5 minute read time.

This blog has been updated and turned into a more formal guide on Arm Developer. You can find the latest guide here:

  • Coding for Neon - Load and Stores

Arm's Neon technology is a 64/128-bit hybrid SIMD architecture designed to accelerate the performance of multimedia and signal processing applications, including video encoding and decoding, audio encoding and decoding, 3D graphics, speech and image processing.

This is the first part of a series of posts on how to write SIMD code for Neon using assembly language. The series will cover getting started with Neon, using it efficiently, and later, hints and tips for more experienced coders. We will begin by looking at memory operations, and how to use the flexible load and store with permute instructions.

An Example

We will start with a concrete example. You have a 24-bit RGB image, where the pixels are arranged in memory as R, G, B, R, G, B... You want to perform a simple image processing operation, like switching the red and blue channels. How can you do this efficiently using Neon?

Using a load that pulls RGB data linearly from memory into registers makes the red/blue swap awkward.

Loading RGB data with a linear load

Code to swap channels based on this input is not going to be elegant: it needs masks, shifts, and recombination of the results, and it is unlikely to be efficient.

Neon provides structure load and store instructions to help in these situations. They pull in data from memory and simultaneously separate values into different registers. For this example, you can use VLD3 to split up red, green and blue as they are loaded.

Loading RGB data with a structure load

Now switch the red and blue registers (VSWP d0, d2) and write the data back to memory, with reinterleaving, using the similarly named VST3 store instruction.
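
As a rough illustration of the load, swap, and store sequence described above, the sketch below models one loop in portable scalar C rather than in Neon code. The function name swap_red_blue, the LANES constant, and the address registers r0 and r1 in the comment are assumptions made for this example; only the VLD3, VSWP d0, d2, and VST3 steps come from the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reference assembly for one iteration (8 pixels). The address registers
 * r0 (source) and r1 (destination) are assumptions for this sketch:
 *   vld3.8  {d0, d1, d2}, [r0]!   @ d0 = 8 reds, d1 = 8 greens, d2 = 8 blues
 *   vswp    d0, d2                @ exchange the red and blue registers
 *   vst3.8  {d0, d1, d2}, [r1]!   @ reinterleave and store
 */

enum { LANES = 8 };  /* a 64-bit register holds eight 8-bit elements */

/* Scalar model of the loop above. pixel_count must be a multiple of LANES. */
void swap_red_blue(const uint8_t *src, uint8_t *dst, size_t pixel_count)
{
    assert(pixel_count % LANES == 0);

    for (size_t p = 0; p < pixel_count; p += LANES) {
        uint8_t d0[LANES], d1[LANES], d2[LANES];

        /* VLD3.8: deinterleave 24 bytes into three registers. */
        for (size_t i = 0; i < LANES; i++) {
            d0[i] = src[3 * (p + i) + 0];
            d1[i] = src[3 * (p + i) + 1];
            d2[i] = src[3 * (p + i) + 2];
        }

        /* VSWP d0, d2: exchange the contents of the two registers. */
        for (size_t i = 0; i < LANES; i++) {
            uint8_t tmp = d0[i];
            d0[i] = d2[i];
            d2[i] = tmp;
        }

        /* VST3.8: reinterleave the three registers back into memory. */
        for (size_t i = 0; i < LANES; i++) {
            dst[3 * (p + i) + 0] = d0[i];
            dst[3 * (p + i) + 1] = d1[i];
            dst[3 * (p + i) + 2] = d2[i];
        }
    }
}
```

On a Neon target, each of the three inner loops corresponds to a single instruction from the comment, handling eight pixels at a time.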

The Details

Overview

Neon structure loads read data from memory into 64-bit Neon registers, with optional deinterleaving. Stores work similarly, reinterleaving data from registers before writing it to memory.

Neon structure loads and stores

Syntax

The structure load and store instructions have a syntax consisting of five parts.

The structure load and store syntax

  • The instruction mnemonic which is either VLD for loads or VST for stores.
  • A numeric interleave pattern, the gap between corresponding elements in each structure.
  • An element type specifying the number of bits in the accessed elements.
  • A set of 64-bit Neon registers to be read or written. Up to four registers can be listed, depending on the interleave pattern.
  • An Arm address register containing the location to be accessed in memory. The address can be updated after the access.

Interleave Pattern

Instructions are available to load, store and deinterleave structures containing from one to four equally sized elements, where the elements are the usual Neon-supported widths of 8, 16 or 32 bits.

  • VLD1 is the simplest form. It loads one to four registers of data from memory, with no deinterleaving. Use this when processing an array of non-interleaved data.
  • VLD2 loads two or four registers of data, deinterleaving even and odd elements into those registers. Use this to separate stereo audio data into left and right channels.
  • VLD3 loads three registers and deinterleaves. Useful for splitting RGB pixels into channels.
  • VLD4 loads four registers and deinterleaves. Use it to process ARGB image data.

Stores support the same options, but interleave the data from registers before writing them to memory.
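
To make the VLD2 case from the list above concrete, here is a minimal scalar C model of splitting stereo audio into left and right channels and joining it again. It is not Neon code; the function names split_stereo and join_stereo, and the choice of 16-bit samples, are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of a VLD2.16 deinterleave: memory holds L, R, L, R, ...
 * Even elements go to one destination, odd elements to the other. */
void split_stereo(const int16_t *interleaved, int16_t *left, int16_t *right,
                  size_t frames)
{
    assert(frames == 0 || (interleaved && left && right));
    for (size_t i = 0; i < frames; i++) {
        left[i]  = interleaved[2 * i];      /* even elements */
        right[i] = interleaved[2 * i + 1];  /* odd elements  */
    }
}

/* Scalar model of the matching VST2.16 store: reinterleave on the way out. */
void join_stereo(const int16_t *left, const int16_t *right,
                 int16_t *interleaved, size_t frames)
{
    assert(frames == 0 || (interleaved && left && right));
    for (size_t i = 0; i < frames; i++) {
        interleaved[2 * i]     = left[i];
        interleaved[2 * i + 1] = right[i];
    }
}
```

Splitting and then joining should reproduce the original buffer, which mirrors how a load with deinterleave pairs with a store with interleave.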

Element Types

Loads and stores interleave elements based on the size specified to the instruction. For example, loading two Neon registers with VLD2.16 results in four 16-bit elements in the first register, and four 16-bit elements in the second, with adjacent pairs (even and odd) separated to each register.

Loading and deinterleaving 16-bit data

Changing the element size to 32-bits causes the same amount of data to be loaded, but now only two elements make up each vector, again separated into even and odd elements.

Loading and deinterleaving 32-bit data
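
The effect of the element size can be sketched with a byte-level model in C. The function below, with the hypothetical name vld2_model, reads the same 16 bytes and distributes whole elements alternately between two 8-byte buffers standing in for the two registers. It keeps bytes in memory order, so it is only a picture of which bytes land where, not of endianness handling.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Byte-level model of a two-register VLD2 with a given element size.
 * Even-numbered elements are copied to d0, odd-numbered elements to d1.
 * elem_bytes is 1, 2 or 4, matching the 8-, 16- and 32-bit element types. */
void vld2_model(const uint8_t src[16], size_t elem_bytes,
                uint8_t d0[8], uint8_t d1[8])
{
    assert(elem_bytes == 1 || elem_bytes == 2 || elem_bytes == 4);

    size_t per_reg = 8 / elem_bytes;  /* elements that fit in 64 bits */
    for (size_t i = 0; i < per_reg; i++) {
        memcpy(d0 + i * elem_bytes, src + (2 * i) * elem_bytes, elem_bytes);
        memcpy(d1 + i * elem_bytes, src + (2 * i + 1) * elem_bytes, elem_bytes);
    }
}
```

With source bytes numbered 0 to 15, a 16-bit element size puts bytes 0, 1, 4, 5, 8, 9, 12, 13 in the first buffer, while a 32-bit element size puts bytes 0 to 3 and 8 to 11 there. The amount of data is the same; only the grouping changes.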

Element size also affects endianness handling. In general, if you specify the correct element size to the load and store instructions, bytes will be read from memory in the appropriate order, and the same code will work on little and big-endian systems.

Finally, element size has an impact on pointer alignment. Alignment to the element size will generally give better performance, and it may be a requirement of your target operating system. For example, when loading 32-bit elements, align the address of the first element to at least 32-bits.

Single or Multiple Elements

In addition to loading multiple elements, structure loads can also read single elements from memory with deinterleaving, either to all lanes of a Neon register, or to a single lane, leaving the other lanes intact.

Loading and deinterleaving to all vector lanes

The latter form is useful when you need to construct a vector from data scattered in memory.

Loading and deinterleaving to a single vector lane

Stores are similar, providing support for writing single or multiple elements with interleaving.
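
The two single-element forms can be modelled in scalar C as follows. This is a sketch, not Neon code: the dreg8 type and the function names vld3_lane and vld3_dup are hypothetical, and the assembly in the comments uses r0 as an assumed address register.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a 64-bit register viewed as eight 8-bit lanes. */
typedef struct { uint8_t lane[8]; } dreg8;

/* Model of "VLD3.8 {d0[k], d1[k], d2[k]}, [r0]": load one 3-element
 * structure into lane k of three registers, leaving other lanes intact. */
void vld3_lane(const uint8_t *src, dreg8 regs[3], size_t k)
{
    assert(k < 8);
    for (size_t r = 0; r < 3; r++) {
        regs[r].lane[k] = src[r];
    }
}

/* Model of "VLD3.8 {d0[], d1[], d2[]}, [r0]": load one 3-element
 * structure and duplicate each element to every lane of its register. */
void vld3_dup(const uint8_t *src, dreg8 regs[3])
{
    for (size_t r = 0; r < 3; r++) {
        for (size_t i = 0; i < 8; i++) {
            regs[r].lane[i] = src[r];
        }
    }
}
```

Calling vld3_lane repeatedly with different source pointers and lane indices is the scalar equivalent of building a vector from data scattered in memory.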

Addressing

Structure load and store instructions support three formats for specifying addresses.

  • Register: [<Rn>{,:<align>}]

    This is the simplest form. Data will be loaded from or stored to the specified address.
  • Register with increment after: [<Rn>{,:<align>}]!

    Use this to update the pointer after loading or storing, ready to load or store the next elements. The increment is equal to the number of bytes read or written by the instruction.
  • Register with post-index: [<Rn>{,:<align>}], <Rm>

    After the memory access, the pointer is incremented by the value in register Rm. This is useful when reading or writing groups of elements that are separated by fixed widths, e.g. when reading a vertical line of data from an image.

You can also specify an alignment for the pointer passed in Rn, using the optional :<align> parameter, which often speeds up memory accesses.
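
The pointer updates performed by the increment-after and post-index forms can be pictured with a small scalar C model. The function names load8_writeback and load_column8 are hypothetical, the image layout (8-bit pixels, stride bytes per row) is an assumption, and the instructions in the comments are illustrative choices of VLD1 rather than examples taken from the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Increment-after form, modelled on "vld1.8 {d0}, [r0]!": copy 8 bytes,
 * then return the pointer advanced by the 8 bytes that were read. */
const uint8_t *load8_writeback(const uint8_t *rn, uint8_t d0[8])
{
    for (size_t i = 0; i < 8; i++) {
        d0[i] = rn[i];
    }
    return rn + 8;
}

/* Post-index form, modelled on eight "vld1.8 {d0[k]}, [r0], r1" loads with
 * r1 holding the row stride: read one pixel from column x of eight
 * consecutive rows. The image must contain at least eight rows. */
void load_column8(const uint8_t *image, size_t stride, size_t x,
                  uint8_t d0[8])
{
    assert(x < stride);

    size_t rn = x;  /* byte offset standing in for the address in Rn */
    for (size_t k = 0; k < 8; k++) {
        d0[k] = image[rn];  /* access at the current address      */
        rn += stride;       /* then advance by the value in Rm    */
    }
}
```

The first function suits walking along a row; the second steps down a column, which is the vertical-line case mentioned above.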

Other Loads and Stores

We have only dealt with structure loads and stores in this post. Neon also provides:

  • VLDR and VSTR
    to load or store a single register as a 64-bit value.
  • VLDM and VSTM
    to load multiple registers as 64-bit values. Useful for storing and retrieving registers from the stack.

For more details on supported load and store operations, see the Arm Architecture Reference Manual. Detailed cycle timing information for the instructions can be found in the Technical Reference Manual for each core.

In my next post, we will look at efficiently handling arrays with lengths that are not a multiple of the vector size. Read it below.

Read Part 2 - Dealing with Leftovers

  • Roberto Ferrara 9 months ago in reply to Martyn

    Hi, I have seen this example elsewhere with mov instead of swp. Why can't the registers just be swapped directly in vst? like

    vst3.8 {d2, d1, d0}, [r1]!

    Using mov or swp makes it look like vld<n> and vst<n> only load and store in the first registers in order, this is either undocumented or confusing

  • huningbo over 5 years ago

    Neon is short for what words?

  • Akshay srinivas over 7 years ago

    Thanks, nice explanation!

  • Martyn over 11 years ago

    Those loads are described under "Single or Multiple Elements", above.

    Essentially, there are three types of load and deinterleave:

    1. Load and deinterleave multiple structures to multiple lanes, eg. VLD3.8 {d0, d1, d2}, [r0]
    2. Load and deinterleave one structure to one lane leaving the others intact, eg. VLD3.8 {d0[1], d1[1], d2[1]}, [r0]
    3. Load and deinterleave one structure and duplicate to all lanes, eg. VLD3.8 {d0[], d1[], d2[]}, [r0]
  • vikky13 over 11 years ago

    Thanks a lot for your explanation - it is very useful. From it I conclude that interleave/deinterleave is always executed unconditionally if vldX/vstX is used and X>1.

    What confuses me, however, is the RVCT compiler assembler guide, which states (chapter 5) that there are different flavours of these instructions: one with (de)interleaving (n-element structures) and the other (n-element structure to one lane, which loads one n-element structure) with just a regular load. Could you please comment on that?
