Coding Using NEON Technology

September 11, 2013


ARM NEON™ technology is widely used for multimedia optimization. Its SIMD architecture makes it well suited to many compute-intensive modules in multimedia codecs, such as filtering and de-blocking. This blog explores effective coding techniques for improving the performance of an audio/video codec.

The compression algorithms used in multimedia codecs process large chunks of data. The work may consist of simple element-wise operations on arrays or matrices, or of more complex loops iterating over large data sets. In many cases, the code can be rearranged so that loop iterations execute efficiently and in parallel.

In the case of a video codec, most of the processing is done on the row pixels and column pixels of a block. For instance, the in-loop de-blocking filter processes multiple pixels on either side of vertical and horizontal edges. These computations can be accelerated using NEON technology by arranging the data in a suitable way. Consider the case of vertical filtering.

NEON load and transpose instructions are used to load all 16 pixels (8-bit depth) of one column into a Q register. As shown in the above figure, the 8 pixels P3, P2, P1, P0, Q0, Q1, Q2, Q3 of one row of the frame buffer are loaded into one D register, and the same operation is repeated for the remaining 15 rows, advancing the address by the frame width for each row. To make use of NEON instructions, the loaded row pixels need to be transposed. Vector transpose treats the elements of its operand vectors as elements of 2x2 matrices and transposes those matrices. As a first step in the transpose, the top 8 rows should be loaded into d0, d2, d4, d6, d8, d10, d12, d14 and the bottom 8 rows into d1, d3, d5, d7, d9, d11, d13, d15.

Top 8 rows:

VLD1.8 {d0}, [framebuffer], framewidth

VLD1.8 {d2}, [framebuffer], framewidth

VLD1.8 {d4}, [framebuffer], framewidth

VLD1.8 {d6}, [framebuffer], framewidth

VLD1.8 {d8}, [framebuffer], framewidth

VLD1.8 {d10}, [framebuffer], framewidth

VLD1.8 {d12}, [framebuffer], framewidth

VLD1.8 {d14}, [framebuffer], framewidth

Bottom 8 rows:

VLD1.8 {d1}, [framebuffer], framewidth

VLD1.8 {d3}, [framebuffer], framewidth

VLD1.8 {d5}, [framebuffer], framewidth

VLD1.8 {d7}, [framebuffer], framewidth

VLD1.8 {d9}, [framebuffer], framewidth

VLD1.8 {d11}, [framebuffer], framewidth

VLD1.8 {d13}, [framebuffer], framewidth

VLD1.8 {d15}, [framebuffer], framewidth

The VTRN.32, VTRN.16 and VTRN.8 instructions are then applied to the 8 Q registers (16 D registers) so that each column of pixels ends up in one Q register (8 Q registers in total), as follows:

Treat the 32-bit partitions of each pair of Q registers as the elements of 2x2 matrices, and transpose the 2 internal 2x2 matrices of the following register pairs.

VTRN.32 Q0, Q4

VTRN.32 Q1, Q5

VTRN.32 Q2, Q6

VTRN.32 Q3, Q7

Treat the 16-bit partitions of each pair of Q registers as the elements of 2x2 matrices, and transpose the 4 internal 2x2 matrices of the following register pairs.

VTRN.16 Q0, Q2

VTRN.16 Q1, Q3

VTRN.16 Q4, Q6

VTRN.16 Q5, Q7

Treat the 8-bit partitions of each pair of Q registers as the elements of 2x2 matrices, and transpose the 8 internal 2x2 matrices of the following register pairs.

VTRN.8 Q0, Q1

VTRN.8 Q2, Q3

VTRN.8 Q4, Q5

VTRN.8 Q6, Q7
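The load and transpose sequence above can be checked off-target with a small scalar model. The sketch below is plain C, not NEON code: the qreg type and the helper names vtrn, load_rows and transpose_16x8 are hypothetical, introduced only for illustration, and the model assumes that byte 0 of a register is the first pixel loaded from memory.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of one 128-bit Q register: b[0..7] is the even-numbered
   D register of the pair (low half), b[8..15] the odd-numbered one. */
typedef struct { uint8_t b[16]; } qreg;

/* Model of VTRN.<8*esz> Qa, Qb. Elements 2i and 2i+1 of a and b form a
   2x2 matrix; transposing it swaps a[2i+1] with b[2i]. */
void vtrn(qreg *a, qreg *b, int esz)
{
    assert(esz == 1 || esz == 2 || esz == 4);
    for (int e = 0; e < 16 / esz; e += 2) {
        uint8_t tmp[4];
        uint8_t *odd_a = &a->b[(e + 1) * esz];
        uint8_t *even_b = &b->b[e * esz];
        memcpy(tmp, odd_a, (size_t)esz);
        memcpy(odd_a, even_b, (size_t)esz);
        memcpy(even_b, tmp, (size_t)esz);
    }
}

/* Model of the 16 VLD1.8 loads: rows 0-7 fill d0,d2,...,d14 (low halves
   of Q0-Q7) and rows 8-15 fill d1,d3,...,d15 (high halves). */
void load_rows(qreg q[8], const uint8_t *framebuffer, int framewidth)
{
    for (int r = 0; r < 16; r++)
        memcpy(&q[r % 8].b[(r / 8) * 8], framebuffer + r * framewidth, 8);
}

/* The VTRN.32 / VTRN.16 / VTRN.8 sequence from the text. */
void transpose_16x8(qreg q[8])
{
    vtrn(&q[0], &q[4], 4); vtrn(&q[1], &q[5], 4);
    vtrn(&q[2], &q[6], 4); vtrn(&q[3], &q[7], 4);
    vtrn(&q[0], &q[2], 2); vtrn(&q[1], &q[3], 2);
    vtrn(&q[4], &q[6], 2); vtrn(&q[5], &q[7], 2);
    vtrn(&q[0], &q[1], 1); vtrn(&q[2], &q[3], 1);
    vtrn(&q[4], &q[5], 1); vtrn(&q[6], &q[7], 1);
}
```

Calling load_rows followed by transpose_16x8 leaves the pixel from row r, column c in q[c].b[r]. Under the pixel naming used above, q[0] then holds P3 for all 16 rows, q[3] holds P0 and q[4] holds Q0.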

To smooth a single edge, filter operations are performed on the 8 Q registers. Consider the equation for updating edge pixel P0.

P0 = (P2+2P1+2P0+2Q0+Q1+4)>>3

Updating one edge pixel requires 4 additions, 3 multiplications and one rounding instruction. In NEON these operations are performed on Q registers, so 16 edge pixels are updated with the same 4 additions, 3 multiplications and one rounding instruction. Implementing this module using NEON instructions gives an 80% reduction in cycles.
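As a reference for the arithmetic, the P0 update can be written as a scalar loop over the 16 lanes of the transposed registers. This is a minimal sketch in plain C rather than the NEON implementation; the function name filter_p0 and its parameters are hypothetical. It also shows why the intermediate sum needs 16-bit lanes: the largest possible sum is 8*255 + 4 = 2044, which does not fit in 8 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Applies P0 = (P2 + 2*P1 + 2*P0 + 2*Q0 + Q1 + 4) >> 3 to each lane.
   Each array holds one pixel position (one transposed register) for
   `lanes` rows of the edge. */
void filter_p0(uint8_t *p0_out,
               const uint8_t *p2, const uint8_t *p1, const uint8_t *p0,
               const uint8_t *q0, const uint8_t *q1, int lanes)
{
    assert(lanes > 0 && lanes <= 16);
    for (int i = 0; i < lanes; i++) {
        /* Widen to 16 bits before summing; the +4 rounds the >> 3. */
        uint16_t sum = (uint16_t)(p2[i] + 2 * p1[i] + 2 * p0[i]
                                  + 2 * q0[i] + q1[i] + 4);
        p0_out[i] = (uint8_t)(sum >> 3);  /* at most 2044 >> 3 = 255 */
    }
}
```

A NEON version would perform the same steps on whole registers at once. One possible mapping, stated here as an assumption rather than as the method used in the white paper, is widening adds followed by a rounding narrowing shift.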

Certain compute-intensive modules in an audio codec, such as stereo processing, FFT and filtering, can also be coded effectively using NEON. Efficient use of NEON features gives an average 80% reduction in cycles for video codecs and about 40% for audio codecs.

Since data elements in video processing are either 8-bit or 16-bit, the NEON vector instructions and large register set are well suited to processing them in parallel. Multiple data elements can be loaded or stored in fewer cycles. NEON also supports aligned loads, which can reduce cycles further. The interleaved loads and stores supported by NEON are well suited to optimizing the FFT.
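To illustrate what an interleaved load does, the sketch below models a two-way de-interleave in plain C. It assumes, purely for illustration, that the FFT input is stored as 16-bit real/imaginary pairs; the function name deinterleave2_s16 is hypothetical. A NEON two-element structure load such as VLD2.16 performs this split as part of the load itself, placing even-indexed elements in one register and odd-indexed elements in the other.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Splits n interleaved (re, im) pairs from src into separate arrays,
   which is the data movement a two-way interleaved load performs. */
void deinterleave2_s16(int16_t *re, int16_t *im,
                       const int16_t *src, size_t n)
{
    assert(re != NULL && im != NULL && src != NULL);
    for (size_t i = 0; i < n; i++) {
        re[i] = src[2 * i];      /* even-indexed elements */
        im[i] = src[2 * i + 1];  /* odd-indexed elements */
    }
}
```

With the real and imaginary parts in separate registers, each butterfly operation can work on several complex values per instruction without extra shuffling.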

For further details on different optimization techniques and how code can be arranged to use the NEON features, please refer to the white paper "Optimization of Multimedia Codecs using ARM NEON", available under Media and Downloads on SoCtronics.com.

Partner Blogger:
Vivek Arora is a Program Manager at SoCtronics Technologies Pvt. Ltd., responsible for customer program and product management. He holds a Bachelor of Technology in Computer Engineering and an MBA. He has around 12 years of experience, most of it in developing products such as mobile phones and custom embedded platforms for global companies.
