
ARM NEON optimization

Posted by yangzhang Mar 27, 2015


1 Introduction

After reading the article ARM NEON programming quick reference, you should have a basic understanding of ARM NEON programming. But when applying ARM NEON to a real-world application, there are many programming skills to observe. This article introduces some common NEON optimization skills that come from development practice, and also discusses the choice between NEON assembly and intrinsics.

2 NEON optimization skills

When using NEON to optimize applications, there are some commonly used optimization skills as follows.

2.1 Remove data dependencies

On the ARM v7-A platform, NEON instructions usually take 3 to 9 cycles, more than typical ARM instructions. To reduce instruction latency, it is best to avoid using the destination register of the current instruction as a source register of the next instruction.

The example:

C code

float SumSquareError_C(const float* src_a, const float* src_b, int count)
{
  float sse = 0.f;
  int i;
  for (i = 0; i < count; ++i) {
    float diff = src_a[i] - src_b[i];
    sse += (float)(diff * diff);
  }
  return sse;
}

NEON implementation 1:

float SumSquareError_NEON1(const float* src_a, const float* src_b, int count)
{
  float sse;

  asm volatile (

    // Clear q8, q9, q10, q11

    "veor    q8, q8, q8                            \n"

    "veor    q9, q9, q9                            \n"

    "veor    q10, q10, q10                     \n"

    "veor    q11, q11, q11                     \n"


  "1:                                                           \n"

    "vld1.32     {q0, q1}, [%[src_a]]!       \n"

    "vld1.32     {q2, q3}, [%[src_a]]!       \n"

    "vld1.32     {q12, q13}, [%[src_b]]!  \n"

    "vld1.32     {q14, q15}, [%[src_b]]!  \n"

    "subs       %[count], %[count], #16    \n"
    // q0, q1, q2, q3 are the destinations of vsub.
    // They are also the sources of vmla.

    "vsub.f32 q0, q0, q12                      \n"

    "vmla.f32   q8, q0, q0                        \n"


    "vsub.f32   q1, q1, q13                      \n"

    "vmla.f32   q9, q1, q1                       \n"


    "vsub.f32   q2, q2, q14                    \n"

    "vmla.f32   q10, q2, q2                    \n"


    "vsub.f32   q3, q3, q15                    \n"

    "vmla.f32   q11, q3, q3                    \n"

    "bgt        1b                                        \n"


    "vadd.f32   q8, q8, q9                      \n"

    "vadd.f32   q10, q10, q11               \n"

    "vadd.f32   q11, q8, q10                 \n"

    "vpadd.f32  d2, d22, d23                \n"

    "vpadd.f32  d0, d2, d2                     \n"

    "vmov.32    %3, d0[0]                      \n"

    : [src_a] "+r"(src_a),
      [src_b] "+r"(src_b),
      [count] "+r"(count),
      "=r"(sse)
    :
    : "memory", "cc", "q0", "q1", "q2", "q3", "q8", "q9", "q10", "q11",
      "q12", "q13", "q14", "q15");
  return sse;
}


NEON implementation 2:

float SumSquareError_NEON2(const float* src_a, const float* src_b, int count)
{
  float sse;

  asm volatile (

    // Clear q8, q9, q10, q11

    "veor    q8, q8, q8                            \n"

    "veor    q9, q9, q9                            \n"

    "veor    q10, q10, q10                     \n"

    "veor    q11, q11, q11                     \n"


  "1: \n"

    "vld1.32     {q0, q1}, [%[src_a]]!       \n"

    "vld1.32     {q2, q3}, [%[src_a]]!       \n"

    "vld1.32     {q12, q13}, [%[src_b]]!  \n"

    "vld1.32     {q14, q15}, [%[src_b]]!  \n"

    "subs       %[count], %[count], #16  \n"

    "vsub.f32 q0, q0, q12                      \n"

    "vsub.f32   q1, q1, q13                     \n"

    "vsub.f32   q2, q2, q14                     \n"

    "vsub.f32   q3, q3, q15                     \n"


    "vmla.f32   q8, q0, q0                      \n"

    "vmla.f32   q9, q1, q1                      \n"

    "vmla.f32   q10, q2, q2                    \n"

    "vmla.f32   q11, q3, q3                    \n"

    "bgt        1b                                         \n"


    "vadd.f32   q8, q8, q9                      \n"

    "vadd.f32   q10, q10, q11                \n"

    "vadd.f32   q11, q8, q10                  \n"

    "vpadd.f32  d2, d22, d23                 \n"

    "vpadd.f32  d0, d2, d2                      \n"

    "vmov.32    %3, d0[0]                       \n"

    : [src_a] "+r"(src_a),
      [src_b] "+r"(src_b),
      [count] "+r"(count),
      "=r"(sse)
    :
    : "memory", "cc", "q0", "q1", "q2", "q3", "q8", "q9", "q10", "q11",
      "q12", "q13", "q14", "q15");
  return sse;
}


In NEON implementation 1, the destination register is used as a source register immediately. In NEON implementation 2, the instructions are rescheduled so that each result is given as much latency as possible before it is consumed. The test result indicates that implementation 2 is about 30% faster than implementation 1, so reducing data dependencies can improve performance significantly. The good news is that the compiler can reschedule NEON intrinsics automatically to avoid data dependencies, which is one of their big advantages.
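The same dependency-breaking idea can be shown in plain C. The sketch below is illustrative (the function name and loop shape are not from the original code): four independent accumulators play the role of q8-q11, so consecutive multiply-adds do not wait on each other.

```c
/* Plain-C sketch of implementation 2's idea: four independent
 * accumulators shorten the dependency chain, so each multiply-add
 * no longer waits for the previous one to complete. */
float sum_square_error(const float *src_a, const float *src_b, int count)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    int i;
    for (i = 0; i + 4 <= count; i += 4) {
        float d0 = src_a[i]     - src_b[i];
        float d1 = src_a[i + 1] - src_b[i + 1];
        float d2 = src_a[i + 2] - src_b[i + 2];
        float d3 = src_a[i + 3] - src_b[i + 3];
        s0 += d0 * d0;   /* the four chains are independent */
        s1 += d1 * d1;
        s2 += d2 * d2;
        s3 += d3 * d3;
    }
    for (; i < count; ++i) {     /* scalar tail */
        float d = src_a[i] - src_b[i];
        s0 += d * d;
    }
    return (s0 + s1) + (s2 + s3);  /* combine once, outside the loop */
}
```

A compiler given this shape can keep the four sums in separate registers, which is exactly what the NEON version does by hand.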

2.2 Reduce branches

There is no branch jump instruction in the NEON instruction set; when a branch is needed, the jump instructions of ARM are used. ARM processors make heavy use of branch prediction, but a mispredicted branch carries a significant penalty, so it is better to avoid jump instructions where possible. In fact, logical operations can replace branches in some cases.

The example:

C implementation:

    if (flag)
    {
        dst[x * 4]     = a;
        dst[x * 4 + 1] = a;
        dst[x * 4 + 2] = a;
        dst[x * 4 + 3] = a;
    }
    else
    {
        dst[x * 4]     = b;
        dst[x * 4 + 1] = b;
        dst[x * 4 + 2] = b;
        dst[x * 4 + 3] = b;
    }

    // Branchless equivalent, where Eflag is an all-ones/all-zeros mask:
    // dst[x * 4]     = (a & Eflag) | (b & ~Eflag);
    // dst[x * 4 + 1] = (a & Eflag) | (b & ~Eflag);
    // dst[x * 4 + 2] = (a & Eflag) | (b & ~Eflag);
    // dst[x * 4 + 3] = (a & Eflag) | (b & ~Eflag);

NEON implementation:

The ARM NEON instruction set provides the VBSL (bitwise select) instruction to implement the branchless logic above in a single instruction:

    VBSL qFlag, qA, qB


Reducing branches is not specific to NEON; it is a commonly used trick. Even in a C program, it is worth the effort.
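A scalar C sketch of the same mask-and-select trick (names here are illustrative) shows how the commented expressions above work: the flag is first expanded into an all-ones or all-zeros mask, which is exactly the form VBSL expects in qFlag.

```c
#include <stdint.h>

/* Branchless select: expand the flag into an all-ones/all-zeros
 * mask, then pick a or b with pure bitwise logic - no jump. */
static uint32_t select_branchless(int flag, uint32_t a, uint32_t b)
{
    uint32_t mask = (uint32_t)-(uint32_t)(flag != 0); /* 0xFFFFFFFF or 0 */
    return (a & mask) | (b & ~mask);
}
```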

2.3 Preload data: PLD[i]

The ARM processor is a load/store architecture: apart from load/store instructions, all operations work on registers. Therefore, increasing the efficiency of load/store instructions is very important when optimizing an application.

The preload instruction allows the processor to signal the memory system that a load from an address is likely in the near future. If the data is preloaded into the cache correctly, it improves the cache hit rate, which can boost performance significantly. But preloading is not a panacea: it is very hard to use well on recent processors, and a bad preload will actually reduce performance.

PLD syntax:

    PLD{cond} [Rn {, #offset}]

    PLD{cond} [Rn, +/-Rm {, shift}]

    PLD{cond} label


cond - an optional condition code.

Rn - the register on which the memory address is based.

offset - an immediate offset. If offset is omitted, the address is the value in Rn.

Rm - contains an offset value and must not be PC (or SP, in Thumb state).

shift - an optional shift.

label - a PC-relative expression.

The PLD operation:

  • Is independent of load and store instruction execution.
  • Happens in the background while the processor continues to execute other instructions.
  • Uses an offset that has to be tuned for the real case.
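In C code compiled with GCC or Clang, the PLD hint can be issued with the `__builtin_prefetch` builtin. The sketch below is only an illustration: the look-ahead distance of 64 elements is a guess that must be tuned per micro-architecture, as noted above. Because a prefetch is semantically a no-op, a bad distance costs performance but never correctness.

```c
#include <stddef.h>

/* Sum an array while hinting the memory system about upcoming loads.
 * The 64-element look-ahead distance is illustrative and must be tuned. */
float sum_with_prefetch(const float *src, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        if (i + 64 < n)
            __builtin_prefetch(&src[i + 64], 0 /* read */, 0 /* low locality */);
        sum += src[i];
    }
    return sum;
}
```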

2.4 Misc

In ARM NEON programming, different instruction sequences can perform the same operation, but fewer instructions do not always mean better performance; it depends on the benchmark and profiling results of the specific case. Listed below are some special cases from development practice.

2.4.1 Floating-point VMLA/VMLS instruction

Usually, VMUL+VADD/VMUL+VSUB can be replaced by VMLA/VMLS because fewer instructions are used. But compared to floating-point VMUL, floating-point VMLA/VMLS has a longer result latency. If there are no other instructions that can be inserted into the delay slot, using floating-point VMUL+VADD/VMUL+VSUB shows better performance.

A real-world example is floating-point FIR function in Ne10. The code snippets are as follows:

Implementation 1: there is only one instruction (VEXT) between two VMLAs, each of which needs 9 execution cycles according to the NEON floating-point instruction timing table.

Implementation 2: there is still a data dependency on qAcc0, but VADD/VMUL needs only 5 execution cycles.

The cycle numbers in the listings roughly describe the total execution cycles. Compared to implementation 1, implementation 2 saves 6 cycles. The benchmark also shows that implementation 2 performs better.

Implementation 1: VMLA

VEXT qTemp1, qInp, qTemp, #1
VMLA qAcc0, qInp, dCoeff_0[0]    -- cycle 0

VEXT qTemp2, qInp, qTemp, #2
VMLA qAcc0, qTemp1, dCoeff_0[1]  -- cycle 9

VEXT qTemp3, qInp, qTemp, #3
VMLA qAcc0, qTemp2, dCoeff_1[0]  -- cycle 18

VMLA qAcc0, qTemp3, dCoeff_1[1]  -- cycle 27

The final result of qAcc0        -- cycle 36

Implementation 2: VMUL+VADD

VEXT qTemp1, qInp, qTemp, #1
VMLA qAcc0, qInp, dCoeff_0[0]    -- cycle 0
VMUL qAcc1, qTemp1, dCoeff_0[1]

VEXT qTemp2, qInp, qTemp, #2
VMUL qAcc2, qTemp2, dCoeff_1[0]
VADD qAcc0, qAcc0, qAcc1         -- cycle 9

VEXT qTemp3, qInp, qTemp, #3
VMUL qAcc3, qTemp3, dCoeff_1[1]
VADD qAcc0, qAcc0, qAcc2         -- cycle 14

VADD qAcc0, qAcc0, qAcc3         -- cycle 19

The final result of qAcc0        -- cycle 24

Compared to implementation 1, three extra VADDs are needed, taking 6 issue cycles, so the total is 30 cycles; that is still 6 cycles fewer than implementation 1's 36.

For more code details, see:

https://github.com/projectNe10/Ne10/commit/97c162781c83584851ea3758203f9d2aa46772d5?diff=split: modules/dsp/NE10_fir.neon.s, line 195

NEON floating-point instruction timing (the table is from appendix [ii]):

  • Cycles: instruction issue
  • Result: instruction execute

2.5 Summary

NEON optimization techniques are summarized as follows:

  • Utilize the delay slots of instructions as much as possible.
  • Avoid branches.
  • Pay attention to the cache hit rate.

3 NEON assembly and intrinsics

In “ARM NEON programming quick reference”, there is a simple comparison of the pros and cons of NEON assembly and intrinsics:


NEON assembly:

  • Always shows the best performance for the specified platform in the hands of an experienced developer.
  • Different ISAs (ARM v7-A/v8-A AArch32 and ARM v8-A AArch64) need different assembly implementations. Even for the same ISA, the assembly might need to be fine-tuned to achieve ideal performance across different micro-architectures.
  • Hard to read/write compared with C.

NEON intrinsics:

  • Performance depends heavily on the toolchain that is used.
  • Program once and run on different ISAs. The compiler may also fine-tune performance for different micro-architectures.
  • Similar to C code, so easy to read/write.

But the reality is far more complex than this table suggests, especially for ARM v7-A/v8-A cross-platform development. In the following sections, this issue is analyzed further with some examples.

3.1 Programming

For NEON beginners, intrinsics are easier to use than assembly. But experienced developers may be more familiar with NEON assembly programming and need time to adapt to intrinsics coding. Some issues that may occur in real development are described below.

3.1.1 Flexibility of instruction

From the perspective of instruction usage, assembly instructions are more flexible than intrinsics. This mainly shows up in data load/store.


Intrinsics instruction

Load data into a 64-bit register, vld1_s8/u8/s16/u16/s32…etc

Load data into a 128-bit register, vld1q_s8/u8/s16/u16/s32…etc

ARM v7-A assembly

VLD1 { Dd}, [Rn]

VLD1 { Dd, D(d+1) }, [Rn]

VLD1 { Dd, D(d+1), D(d+2)}, [Rn]

VLD1 { Dd, D(d+1), D(d+2), D(d+3) }, [Rn]

ARM v8-A assembly

LD1 { <Vt>.<T> }, [<Xn|SP>]

LD1 { <Vt>.<T>, <Vt2>.<T>}, [<Xn|SP>]

LD1 { <Vt>.<T>, <Vt2>.<T>, <Vt3>.<T> }, [<Xn|SP>]

LD1 { <Vt>.<T>, <Vt2>.<T>, <Vt3>.<T>, <Vt4>.<T> }, [<Xn|SP>]

This issue should be fixed by compiler upgrades in the future. Sometimes the compiler is already able to translate two intrinsics calls into one assembly instruction. Therefore, it is expected that intrinsics will reach the same flexibility as assembly instructions as the ARM v8 toolchain is upgraded.

3.1.2 Register allocation

When programming in NEON assembly, registers have to be allocated by the user, who must know exactly which registers are occupied. One of the benefits of programming with intrinsics is that users only need to define variables; the compiler allocates registers automatically. This is an advantage, but it can be a weakness in some cases. Practice has shown that using too many NEON registers simultaneously in intrinsics code can trigger register-allocation problems in the gcc compiler. When this happens, a lot of data is spilled to the stack, which greatly degrades performance. Therefore, users should pay attention to this issue when programming with intrinsics. When there is a performance anomaly (such as the C version outperforming the NEON version), first check the disassembly for register-allocation problems. For ARM64 there are more registers (32 128-bit NEON registers), so the impact of this issue is significantly reduced.

3.2 Performance and compiler

On a given platform, the performance of NEON assembly is decided only by the implementation and has nothing to do with the compiler. The benefit is that you can predict and control the performance when hand-tuning the code, but there are no surprises.

Conversely, the performance of NEON intrinsics depends greatly on the compiler used. Different compilers may deliver very different performance; in general, the older the compiler, the worse the performance. When compatibility with older compilers is needed, consider carefully whether intrinsics will fit the need. In addition, when fine-tuning the code, it is hard to predict how performance will change given the compiler's intervention. But there may be surprises: sometimes intrinsics deliver better performance than assembly. This is very rare, but it does occur.

Compiler will have an impact on the process of NEON optimization. The following figure describes the general process of NEON implementation and optimization.


NEON assembly and intrinsics share the same implementation process: coding, debugging and performance testing. But they differ in the optimization step.

The methods of assembly fine-tuning are:

  • Change implementations, such as changing the instructions, adjusting the parallelism.
  • Adjust the instruction sequence to reduce data dependency.
  • Try the skills described in section 2.

When fine-tuning assembly, a sophisticated approach is to:

  • Know the exact number of instructions used.
  • Get the execution cycles of the program by using the PMU (Performance Monitoring Unit).
  • Adjust the instruction sequence based on the timing of the instructions used, minimizing instruction latency as much as possible.

The disadvantage of this approach is that the changes are specific to one micro-architecture; when the platform changes, the performance improvement may be lost. It is also very time-consuming, often for comparatively small gains.

Fine-tuning of NEON intrinsics is more difficult.

  • Try the methods used in NEON assembly optimization.
  • Look at the generated assembly and check the data dependencies and register usage.
  • Check whether the performance meets the expectation. If yes, the work of optimization is done. Then the performance with other compilers needs to be verified again.

When porting ARM v7-A assembly code to intrinsics for ARM v7-A/v8-A compatibility, the performance of the assembly can be used as a reference, so it is easy to check whether the work is done. However, when intrinsics are used to optimize ARM v8-A code, there is no performance reference, and it is difficult to determine whether the performance is optimal. Based on experience with ARM v7-A, there may be a lingering doubt about whether assembly would perform better. I think the impact of this issue will become smaller and smaller as the ARM v8-A environment matures.

3.3 Cross-platform and portability

Now, most existing NEON assembly code can only run on platforms in ARM v7-A/ARM v8-A AArch32 mode. To run it in ARM v8-A AArch64 mode, the code must be rewritten, which takes a lot of work. If the code is written with NEON intrinsics instead, it can run directly in ARM v8-A AArch64 mode. This cross-platform support is one of the great advantages of intrinsics. Meanwhile, only one set of code needs to be maintained for the different platforms, which also significantly reduces the maintenance effort.

However, due to the different hardware resources on ARM v7-A/ARM v8-A platform, sometimes there still might be two sets of code even with intrinsics. The FFT implementation in Ne10 project is an example:

// radix 4 butterfly with twiddles

scratch[0].r = scratch_in[0].r;

scratch[0].i = scratch_in[0].i;

scratch[1].r = scratch_in[1].r * scratch_tw[0].r - scratch_in[1].i * scratch_tw[0].i;

scratch[1].i = scratch_in[1].i * scratch_tw[0].r + scratch_in[1].r * scratch_tw[0].i;

scratch[2].r = scratch_in[2].r * scratch_tw[1].r - scratch_in[2].i * scratch_tw[1].i;

scratch[2].i = scratch_in[2].i * scratch_tw[1].r + scratch_in[2].r * scratch_tw[1].i;

scratch[3].r = scratch_in[3].r * scratch_tw[2].r - scratch_in[3].i * scratch_tw[2].i;

scratch[3].i = scratch_in[3].i * scratch_tw[2].r + scratch_in[3].r * scratch_tw[2].i;

The above code snippet lists the basic element of the FFT: the radix-4 butterfly. From the code, the following can be concluded:

  • 20 64-bit NEON registers are needed if 2 radix4 butterflies are executed in one loop.
  • 20 128-bit NEON registers are needed if 4 radix4 butterflies are executed in one loop.

And, for ARM v7-A/v8-A AArch32 and v8-A AArch64,

  • There are 32 64-bit or 16 128-bit NEON registers for ARM v7-A/v8-A AArch32.
  • There are 32 128-bit NEON registers for ARM v8-A AArch64.

Considering the above factors, the FFT implementation of Ne10 eventually has an assembly version for ARM v7-A/v8-A AArch32, in which 2 radix-4 butterflies are executed per loop, and an intrinsics version for ARM v8-A AArch64, in which 4 radix-4 butterflies are executed per loop.

The above example illustrates that you need to pay attention to such exceptions when maintaining one set of code across ARM v7-A/v8-A platforms.

3.4 Future

Many issues about using NEON assembly and intrinsics have been discussed. But these issues are temporary. In the long term, intrinsics will be better. By using intrinsics, you can reap the benefits of hardware and compiler upgrade without reprogramming. That means some classical algorithms just need to be implemented once. The compiler will help to adjust these codes for new hardware, which reduces the workload significantly. Pffft is an example.

The following figure describes the performance of pffft and Ne10 real FFT on the ARM Cortex A9 platform with gcc. X-axis represents the length of FFT. Y-axis represents the time of execution, the smaller the better. Pffft is implemented with NEON intrinsics. Ne10 real FFT is implemented with NEON assembly. They don’t use the same algorithm, but they have similar performance.


In ARM v8-A AArch64 mode, the Ne10 real FFT was rewritten with both NEON assembly and intrinsics. Section 3.3 explained that ARM v8-A can process 4 butterflies in parallel while ARM v7-A can only process 2, so theoretically the effect of FFT optimization on ARM v8-A should be better than on ARM v7-A. However, the following figure makes clear that pffft has the best performance. The result indicates that the compiler performs very good optimizations specific to the ARM v8 architecture.


From this example it can be concluded that pffft, whose performance is not the best on ARM v7-A, shows very good performance in ARM v8-A AArch64 mode. That proves the point: the compiler can adjust intrinsics automatically for ARM v8-A to achieve good performance.

In the long run, existing NEON assembly code has to be rewritten for ARM v8, and if NEON is upgraded again in the future, the code has to be rewritten yet again. NEON intrinsics code, by contrast, can be expected to show good performance on ARM v8-A with the help of compilers; even if NEON is upgraded, you can look forward to compiler upgrades instead of rewrites.

3.5 Summary

In this section, the pros and cons of NEON assembly and intrinsics have been analyzed with some examples. The benefits of intrinsics far outweigh the drawbacks: compared to assembly, intrinsics are easier to program and have better compatibility between ARM v7 and ARM v8.

Some tips for NEON intrinsics are summarized as follows:

  • Pay attention to the number of registers used.
  • Be aware of which compiler (and version) is used.
  • Do look at the generated assembly.

4 End

This blog has discussed some common NEON optimization skills and analyzed the pros and cons of NEON assembly and intrinsics with some examples. Hopefully it can help NEON developers in their actual development.


[i] Cortex™-A Series Version: 2.0 Programmer’s Guide: 17-12 and 17-20

[ii] Cortex™-A9 NEON™ Media Processing Engine Revision: r4p1 Technical Reference Manual: 3.4.8

ARM NEON programming quick reference

1 Introduction

This article aims to introduce ARM NEON technology. Hope that beginners can get started with NEON programming quickly after reading the article. The article will also inform users which documents can be consulted if more detailed information is needed.

2 NEON overview

This section describes the NEON technology and supplies some background knowledge.

2.1 What is NEON?

NEON technology is an advanced SIMD (Single Instruction, Multiple Data) architecture for the ARM Cortex-A series processors. It can accelerate multimedia and signal processing algorithms such as video encoder/decoder, 2D/3D graphics, gaming, audio and speech processing, image processing, telephony, and sound.

NEON instructions perform "Packed SIMD" processing:

  • Registers are considered as vectors of elements of the same data type
  • Data types can be: signed/unsigned 8-bit, 16-bit, 32-bit, 64-bit, single-precision floating-point on ARM 32-bit platform, both single-precision floating-point and double-precision floating-point on ARM 64-bit platform.
  • Instructions perform the same operation in all lanes
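The lane model above can be pictured with a small scalar simulation. The types and names below are illustrative stand-ins, not real NEON intrinsics: a 128-bit Q register holds four 32-bit lanes, and one instruction applies the same operation to every lane.

```c
#include <stdint.h>

/* Scalar model of packed SIMD: a stand-in for a 128-bit Q register
 * viewed as four 32-bit integer lanes. */
typedef struct { int32_t lane[4]; } int32x4_model;

/* One "instruction": the same add performed in all four lanes. */
static int32x4_model vadd_model(int32x4_model a, int32x4_model b)
{
    int32x4_model r;
    for (int i = 0; i < 4; ++i)
        r.lane[i] = a.lane[i] + b.lane[i];
    return r;
}
```

A real NEON VADD does all four lane additions in one instruction rather than in a loop.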

2.2 History of ARM Adv SIMD

ARM v6 SIMD extension[i]:

  • Operates on 32-bit general purpose ARM registers
  • 8-bit/16-bit integer
  • 2x16-bit/4x8-bit operations per instruction

ARM v7-A NEON:

  • Separate register bank, 32x64-bit NEON registers
  • 8/16/32/64-bit integer
  • Single-precision floating point
  • Up to 16x8-bit operations per instruction

ARM v8-A NEON:

  • Separate register bank, 32x128-bit NEON registers
  • 8/16/32/64-bit integer
  • Single-precision and double-precision floating point, both IEEE compliant
  • Up to 16x8-bit operations per instruction

[i] The ARM Architecture Version 6 (ARMv6) David Brash: page 13


2.3 Why use NEON

NEON provides:

  • Support for both integer and floating point operations ensures the adaptability of a broad range of applications, from codecs to High Performance Computing to 3D graphics.
  • Tight coupling to the ARM processor provides a single instruction stream and a unified view of memory, presenting a single development platform target with a simpler tool flow[ii]

3 ARM v7/v8 comparison

ARM v8-A is a fundamental change to the ARM architecture. It supports the 64-bit execution state called “AArch64” and a new 64-bit instruction set, “A64”. To provide compatibility with the ARM v7-A (32-bit architecture) instruction set, a 32-bit variant of ARM v8-A, “AArch32”, is provided. Most existing ARM v7-A code can run in the AArch32 execution state of ARM v8-A.

This section compares the NEON-related features of both the ARM v7-A and ARM v8-A architectures. In addition, general purpose ARM registers and ARM instructions, which are used often for NEON programming, will also be mentioned. However, the focus is still on the NEON technology.

3.1 Register

ARM v7-A and AArch32 have the same general purpose ARM registers – 16 x 32-bit general purpose ARM registers (R0-R15).

ARM v7-A and AArch32 have 32 x 64-bit NEON registers (D0-D31). These registers can also be viewed as 16x128-bit registers (Q0-Q15). Each of the Q0-Q15 registers maps to a pair of D registers, as shown in the following figure.


AArch64, by comparison, has 31 64-bit general purpose ARM registers plus one special register whose name depends on the context in which it is used. These registers can be viewed either as 31 64-bit registers (X0-X30) or as 31 32-bit registers (W0-W30).

AArch64 has 32 x 128-bit NEON registers (V0-V31). These registers can also be viewed as 32-bit Sn registers or 64-bit Dn registers.


3.2 Instruction set[iii]

The following figure illustrates the relationship between ARM v7-A, ARM v8-A AArch32 and ARM v8-A AArch64 instruction set.



The ARM v8-A AArch32 instruction set consists of A32 (ARM instruction set, a 32-bit fixed length instruction set) and T32 (Thumb instruction set, a 16-bit fixed length instruction set; Thumb2 instruction set, 16 or 32-bit length instruction set). It is a superset of the ARM v7-A instruction set, so that it retains the backwards compatibility necessary to run existing software. There are some additions to A32 and T32 to maintain alignment with the A64 instruction set, including NEON division, and the Cryptographic Extension instructions.

Compared to the AArch32 instruction set, the A64 instruction set has changed a lot. However, it retains the same functionality of the A32 instruction set. NEON double precision floating point (IEEE compliance) is also supported.

3.3 NEON instruction format

This section describes the changes to the NEON instruction syntax.

3.3.1 ARM v7-A/AArch32 instruction syntax[iv]

All mnemonics for ARM v7-A/AArch32 NEON instructions (as with VFP) begin with the letter “V”. Instructions are generally able to operate on different data types, specified in the instruction encoding; the size is indicated by a suffix to the instruction, and the number of elements is implied by the specified register size and data type of the operation. Instructions have the following general format:

V{<mod>}<op>{<shape>}{<cond>}{.<dt>} <dest>, <src1>, <src2>


<mod> - modifiers

  • Q: The instruction uses saturating arithmetic, so that the result is saturated within the range of the specified data type, such as VQABS, VQSHL etc.
  • H: The instruction will halve the result. It does this by shifting right by one place (effectively a divide by two with truncation), such as VHADD, VHSUB.
  • D: The instruction doubles the result, such as VQDMULL, VQDMLAL, VQDMLSL and VQ{R}DMULH
  • R: The instruction will perform rounding on the result, equivalent to adding 0.5 to the result before truncating, such as VRHADD, VRSHR.

<op> - the operation (for example, ADD, SUB, MUL).

<shape> - Shape.

NEON data processing instructions are typically available in Normal, Long, Wide and Narrow variants.

  • Long (L): instructions operate on double-word vector operands and produce a quad-word vector result. The result elements are twice the width of the operands, and of the same type. Lengthening instructions are specified using an L appended to the instruction.

  • Wide (W): instructions operate on a double-word vector operand and a quad-word vector operand, producing a quad-word vector result. The result elements and the first operand are twice the width of the second operand elements. Widening instructions have a W appended to the instruction.

  • Narrow (N): instructions operate on quad-word vector operands, and produce a double-word vector result. The result elements are half the width of the operand elements. Narrowing instructions are specified using an N appended to the instruction.

<cond> - Condition, used with IT instruction

<.dt> - Data type, such as s8, u8, f32 etc.

<dest> - Destination

<src1> - Source operand 1

<src2> - Source operand 2

Note: {} represents an optional parameter.

For example:

VADD.I8 D0, D1, D2

VMULL.S16 Q2, D8, D9

For more information, please refer to the documents listed in the Appendix.

3.3.2 AArch64 NEON instruction syntax[v]

In the AArch64 execution state, the syntax of NEON instruction has changed. It can be described as follows:

{<prefix>}<op>{<suffix>}  Vd.<T>, Vn.<T>, Vm.<T>


<prefix> - prefix, such as using S/U/F/P to represent signed/unsigned/float/bool data type.

<op> – operation, such as ADD, AND etc.

<suffix> - suffix

  • P: “pairwise” operations, such as ADDP.
  • V: the new reduction (across-all-lanes) operations, such as FMAXV.
  • 2: the new widening/narrowing “second part” instructions, such as ADDHN2, SADDL2.

ADDHN2: add two 128-bit vectors and produce a 64-bit vector result which is stored as high 64-bit part of NEON register.

SADDL2: add two high 64-bit vectors of NEON register and produce a 128-bit vector result.

<T> - data type, 8B/16B/4H/8H/2S/4S/2D. B represents byte (8-bit). H represents half-word (16-bit). S represents word (32-bit). D represents a double-word (64-bit).

For example:

UADDLP    V0.8H, V0.16B

FADD V0.4S, V0.4S, V0.4S


For more information, please refer to the documents listed in the Appendix.

3.4 NEON instructions[vi]

The following table compares the ARM v7-A, AArch32 and AArch64 NEON instruction set.

“√” indicates that the AArch32 NEON instruction has the same format as ARM v7-A NEON instruction.

“Y” indicates that the AArch64 NEON instruction has the same functionality as ARM v7-A NEON instructions, but the format is different. Please check the ARM v8-A ISA document.

If you are familiar with the ARM v7-A NEON instructions, there is a simple way to map the NEON instructions of ARM v7-A and AArch64. It is to check the NEON intrinsics document, so that you can find the AArch64 NEON instruction according to the intrinsics instruction.

New or changed functionality is highlighted.


ARM v7-A instruction groups covered by the table:

logical and compare:

  • VAND, VBIC, VEOR, VORN, and VORR (register)
  • VBIC and VORR (immediate)
  • VMOV, VMVN (register)

general data processing:

  • VCVT (between fixed-point or integer, and floating-point)
  • VCVT (between half-precision and single-precision floating-point)
  • VMOV, VMVN (immediate)

shift:

  • VSHL, VQSHL, VQSHLU, and VSHLL (by immediate)
  • V{Q}{R}SHL (by signed variable)

general arithmetic:

  • VMUL{L}, VMLA{L}, and VMLS{L} (by scalar); note there isn’t float MLA/MLS
  • VQDMULL, VQDMLAL, and VQDMLSL (by vector or by scalar)
  • VQ{R}DMULH (by vector or by scalar)

load and store:

  • VLDn/VSTn (n = 1, 2, 3, 4)

Crypto Extension
4 NEON programming basics

There are four ways of using NEON[vii]:

  • NEON optimized libraries
  • Vectorizing compilers
  • NEON intrinsics
  • NEON assembly

4.1 Libraries

Users can call the NEON-optimized libraries directly in their programs. Currently, the following libraries are available:

  • OpenMax DL

This provides the recommended approach for accelerating AV codecs and supports signal processing and color space conversions.

  • Ne10

It is ARM’s open source project. Currently, the Ne10 library provides some math, image processing and FFT functions. The FFT implementation is faster than other open source FFT implementations.

4.2 Vectorizing compilers

Adding vectorizing options in GCC can help C code generate NEON code. GNU GCC gives you a wide range of options that aim to increase the speed, or reduce the size, of the executable files it generates. For each line in your source code there are generally many possible choices of assembly instructions that could be used. The compiler must trade off a number of resources, such as registers, stack and heap space, code size (number of instructions), compilation time, ease of debugging, and number of cycles per instruction in order to produce an optimized image file.
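As a sketch, a loop shaped for the auto-vectorizer looks like this: the iterations are independent, and `restrict` promises the compiler that the arrays do not overlap. With options such as `-O3 -mfpu=neon` (exact flags depend on the toolchain and target), GCC can typically turn this into VLD1/VADD/VST1 sequences.

```c
/* Auto-vectorizer-friendly loop: independent iterations, and
 * `restrict` guarantees dst/src1/src2 do not alias. */
void add_float(float *restrict dst, const float *restrict src1,
               const float *restrict src2, int count)
{
    for (int i = 0; i < count; ++i)
        dst[i] = src1[i] + src2[i];
}
```

Without `restrict`, the compiler may emit runtime overlap checks or fall back to scalar code, because it cannot prove the stores to dst leave src1/src2 unchanged.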

4.3 NEON intrinsics

NEON intrinsics provide a C function call interface to NEON operations, and the compiler automatically generates the relevant NEON instructions, allowing you to program once and run on either an ARM v7-A or ARM v8-A platform. If you intend to use AArch64-specific NEON instructions, you can use the __aarch64__ macro to guard that code, as in the following example.

NEON intrinsics example:

#include <arm_neon.h>

//add for float array. assumed that count is multiple of 4

void add_float_c(float* dst, float* src1, float* src2, int count)
{
     int i;
     for (i = 0; i < count; i++)
         dst[i] = src1[i] + src2[i];
}

void add_float_neon1(float* dst, float* src1, float* src2, int count)
{
     int i;
     for (i = 0; i < count; i += 4)
     {
         float32x4_t in1, in2, out;
         in1 = vld1q_f32(src1);
         src1 += 4;
         in2 = vld1q_f32(src2);
         src2 += 4;
         out = vaddq_f32(in1, in2);
         vst1q_f32(dst, out);
         dst += 4;

         // The following is only an example describing how to use AArch64 specific NEON
         // instructions.
#if defined (__aarch64__)
         float32_t tmp = vaddvq_f32(in1);
#endif
     }
}
Checking the disassembly, you will find vld1/vadd/vst1 NEON instructions on the ARM v7-A platform and ldr/fadd/str instructions on the ARM v8-A platform.

4.4 NEON assembly

There are two ways to write NEON assembly:

  • Assembly files
  • Inline assembly

4.4.1 Assembly files

You can use “.S” or “.s” as the file suffix. The difference is that the C/C++ preprocessor runs on .S files first, so C language features such as macro definitions can be used in them.

When writing NEON assembly in a separate file, you need to pay attention to saving and restoring registers. For both ARM v7-A and ARM v8-A, the following registers must be saved:


                          ARM v7-A/AArch32             AArch64

General purpose           R0-R3 parameters             X0-X7 parameters
registers                 R4-R11 need to be saved      X19-X28 need to be saved
                          R12 (IP)                     X29 (FP) need to be saved
                          R14 (LR) need to be saved    X30 (LR) need to be saved
                          R0 for return value          X0, X1 for return value

NEON registers            D8-D15 need to be saved      D part of V8-V15 need to be saved

Stack alignment           64-bit alignment             128-bit alignment[ix]

Stack push/pop            PUSH/POP Rn list             LDP/STP register pair

The following is an example of ARM v7-A and ARM v8-A NEON assembly.


void add_float_neon2(float* dst, float* src1, float* src2, int count);

//assembly code in .S file

ARM v7-A/AArch32

    .syntax unified

    .align 4
    .global add_float_neon2
    .type add_float_neon2, %function

add_float_neon2:

.L_loop:
    vld1.32  {q0}, [r1]!
    vld1.32  {q1}, [r2]!
    vadd.f32 q0, q0, q1
    subs r3, r3, #4
    vst1.32  {q0}, [r0]!
    bgt .L_loop

    bx lr

AArch64

    .align 4
    .global add_float_neon2
    .type add_float_neon2, %function

add_float_neon2:

.L_loop:
    ld1     {v0.4s}, [x1], #16
    ld1     {v1.4s}, [x2], #16
    fadd    v0.4s, v0.4s, v1.4s
    subs x3, x3, #4
    st1  {v0.4s}, [x0], #16
    bgt .L_loop

    ret

For more examples, see: https://github.com/projectNe10/Ne10/tree/master/modules/dsp

4.4.2 Inline assembly

You can use NEON inline assembly directly in C/C++ code.


Pros:

  • The procedure call standard is simple. You do not need to save registers manually.
  • You can use C/C++ variables and functions, so it can be easily integrated into C/C++ code.

Cons:

  • Inline assembly has a complex syntax.
  • NEON assembly code is embedded in C/C++ code, and it’s not easily ported to other platforms.


// ARM v7-A/AArch32

void add_float_neon3(float* dst, float* src1, float* src2, int count)
{
    asm volatile (
               "1:                                  \n"
               "vld1.32   {q0}, [%[src1]]!          \n"
               "vld1.32   {q1}, [%[src2]]!          \n"
               "vadd.f32  q0, q0, q1                \n"
               "subs      %[count], %[count], #4    \n"
               "vst1.32   {q0}, [%[dst]]!           \n"
               "bgt       1b                        \n"
               : [dst] "+r" (dst), [src1] "+r" (src1), [src2] "+r" (src2), [count] "+r" (count)
               :
               : "memory", "cc", "q0", "q1"
    );
}

// AArch64

void add_float_neon3(float* dst, float* src1, float* src2, int count)
{
    asm volatile (
               "1:                                  \n"
               "ld1      {v0.4s}, [%[src1]], #16    \n"
               "ld1      {v1.4s}, [%[src2]], #16    \n"
               "fadd     v0.4s, v0.4s, v1.4s        \n"
               "subs     %[count], %[count], #4     \n"
               "st1      {v0.4s}, [%[dst]], #16     \n"
               "bgt      1b                         \n"
               : [dst] "+r" (dst), [src1] "+r" (src1), [src2] "+r" (src2), [count] "+r" (count)
               :
               : "memory", "cc", "v0", "v1"
    );
}

Note that the pointers and the counter are modified inside the asm block, so they must be listed as read-write ("+r") operands, and "cc" belongs in the clobber list because subs updates the condition flags.

4.5 NEON intrinsics and assembly

NEON intrinsics and NEON assembly are the two most commonly used approaches. The following summarizes the pros and cons of each:

NEON assembly:

  • Always shows the best performance for the specified platform in the hands of an experienced developer.
  • The different ISAs (ARM v7-A/AArch32 and AArch64) need different assembly implementations. Even for the same ISA, the assembly might need to be fine-tuned to achieve ideal performance on different micro-architectures.
  • Hard to read/write compared to C.

NEON intrinsics:

  • Performance depends heavily on the toolchain used.
  • Program once and run on different ISAs. The compiler may also fine-tune performance for different micro-architectures.
  • Similar to C code, so it’s easy to read/write.

This is a simple summary. When applying NEON to more complex scenarios, there will be many special cases. This will be described in a future article ARM NEON Optimization.

With the above information, you can choose a NEON implementation and start your NEON programming journey.

For more reference documentation, please check the appendix.

Appendix: NEON reference document


[i] The ARM Architecture Version 6 (ARMv6) David Brash: page 13

[ii] ARM Cortex-A Series Programmer’s Guide Version 4.0: page 7-5

[iii] http://www.arm.com/zh/products/processors/instruction-set-architectures/armv8-architecture.php


[iv] ARM® Compiler toolchain Version 5.02 Assembler Reference: Chapter 4

NEON and VFP Programming

ARM Cortex™-A Series Version: 4.0 Programmer’s Guide: 7.2.4 NEON instruction set

[v] ARMv8 Instruction Set Overview: 5.8 Advanced SIMD

[vi] ARMv8 Instruction Set Overview: 5.8.25 AArch32 Equivalent Advanced SIMD Mnemonics

[vii] http://www.arm.com/zh/products/processors/technologies/neon.php

[viii] Procedure Call Standard for the ARM 64-bit Architecture (AArch64) : 5 THE BASE PROCEDURE CALL STANDARD

[ix] Procedure Call Standard for the ARM 64-bit Architecture (AArch64) : 5.2.2 The Stack

Watch the Demonstration Video!



Mobile computing has never been more powerful. Enabled in large part by ARM® technology, it has become commonplace to carry a device with more processing power than a common desktop machine would have had not so very long ago.


The Seamless Computing demonstration was conceived to explore some of the implications of this comparison. If a smartphone can computationally match a desktop, what is preventing us from using these devices in that paradigm? What functionality would a smartphone need to offer to overcome these barriers and become a true primary compute device, meeting all our needs through the day?

We decided to focus on a workplace desktop scenario – sitting in your office, using a device for typical productivity applications. Ideally, there would be a smooth transition from mobile operation to desktop mode. The user would walk into their office, sit down at their desk and almost immediately start using the device in that new context.


This scenario immediately implied a larger, separate display from the smartphone, along with a full sized keyboard and mouse. Previous commercial products have aimed at similar use cases with mixed success. More recently, some Android™ enthusiasts have also experimented in this area – this video is particularly compelling. Both of these required a dock of some description, which introduced an immediate extra step into the use case – the user must dock the phone in addition to sitting at their desk. Additionally, the two links above featured either a distinct software environment for the desktop, or simply mirrored the mobile environment. The first creates a discontinuity in workflow. The second results in over-large icons and application layouts unsuitable for desktop working – sized instead for a smaller, touch-driven display.


With this in mind we identified the primary functionality of our demonstration:

  • Wirelessly pair all peripherals (input and output devices).
  • Reconfigure the UI – The same environment & apps, but with context appropriate UI layout.
  • Trigger the context change between mobile and desktop, without physically docking the device.


The remainder of this blog deals with the technical detail behind the implementation of these functional requirements.



We assume some basic knowledge of Android and the Android SDK in order to follow the discussion below. If you wish to attempt to replicate the full functionality of the demonstration, be aware that doing so will require root access to your device, and expert level knowledge as you will need to create a non-standard Android development environment. Both of these activities are entertained at your own risk and we must recommend that you inform yourself of the impact on any warranties, etc. We provide an outline of what was done to accomplish the features seen in this demonstration, but unfortunately cannot provide a step-by-step guide or release the source code at this time.


Device Selection


We selected the Samsung™ Galaxy Note 3 as the primary platform for this demonstration. This device utilises the Samsung Exynos™ 5420 System-on-Chip, a 4+4 big.LITTLE™ design built around the ARM Cortex®-A15 and Cortex-A7 application processors with an ARM Mali®-T628 graphics processor. The device was upgraded to Android 4.4 and, alongside the powerful processing, included NFC, wireless charging capability (with an accessory pack), wireless display mirroring and a few other features we thought might be useful for this specific demonstration.


Context Change Detection

Context sensing is a topic of some current interest in the mobile device market. With the proliferation of available sensors, along with always-on connectivity to a wide variety of cloud services, the devices can more accurately recognise what is happening and adjust their functionality in response to this. For our demonstration, we needed a practical way for the device to recognise proximity to the desk and thus trigger a change to desktop mode, along with recognising the opposite transition to mobile.


Initially, we evaluated NFC as a transition trigger. A tag was placed on the surface of the desk, so the user would simply place the phone on the desk as they sat down to trigger the transition. This was relatively straightforward, as Android provides good support for NFC. However, one complication was that the version of Android being used did not publicly expose an event (an Intent within the Android SDK) for tag removal. So, we could detect when the phone was placed on the tagged desk, but not when it was removed. One can work around this with a rooted phone and 3rd party frameworks or APKs that give deeper hooks into Android.  With this we were able to achieve the desired behaviour – place the phone on the desk to enter desktop mode, pick it up to return to mobile mode.


However, our final context detection relied upon a different mechanism. As we had a phone capable of wireless charging, we constructed a desk with a wireless charger embedded in the surface. Samsung sold a wireless charging kit for the Galaxy Note 3, consisting of a replacement back plate for the phone, and a charging pad with a USB connection for power. We took the pad, and routed a depression for it in a small children’s desk. We then placed thin vinyl tiles over the desk surface. The result was a smoothly finished surface, with a ‘charging zone’ above the embedded charging pad. Detecting charging and not-charging events via the Intent framework in Android is even easier than NFC, so using this as a context trigger was very straightforward.  Additionally, the device would be charging whilst it was a desktop!


The demonstration was implemented as an Android Service, with a simple administrative Activity for manipulating some settings. The Android Intents framework was used to listen for the events described above and trigger the correct context change. This consisted of triggering the peripheral pairing and UI reconfiguration described in the following section.



Wireless Peripheral Pairing


Bluetooth® support for keyboards, mice, and other devices has long been built into Android. The Android SDK provides support for enabling, disabling and otherwise manipulating the Bluetooth functionality of the device. In theory, we could enable or disable Bluetooth according to the desktop context detection we had. But in practice, Bluetooth is already fairly good at connecting with a peripheral once it has been paired with the device, and is in range. We experimented a little bit with enabling and disabling Bluetooth but settled on just enabling it if it was not already on, and relying on it to establish connections to the keyboard and mouse when in range.


Wireless display mirroring is a little more interesting, and more difficult, than connecting a keyboard or a mouse. More recent versions of Android support the Wi-Fi Certified Miracast® standard. At the time of development of this demonstration, Miracast was included in the Samsung Galaxy Note 3 as Samsung Allshare® Cast. More recent releases of Android are rolling support of Miracast into the core of Android. Miracast is essentially a compressed video stream transmitted over Wi-Fi®.


By default, display mirroring is a feature that the user explicitly turns off and on via a settings menu option or shortcut. For the purposes of our demonstration, we wished to automate this. There is no public API to access this programmatically, either in Android or provided by the OEM (Samsung).  However, some additional research of the Android source code on GitHub reveals that from around version 19 of the Android SDK, the DisplayManager class does include methods for connecting and disconnecting Wi-Fi displays (aka Miracast), but that these functions are hidden under normal circumstances. There are a few ways to gain access here – reflection has been a popular approach for experimental Android developers, but a slightly more elegant approach is to actually obtain an Android Open Source Project jar archive where the hidden classes and methods have not been stripped out, and then replace the standard android.jar file in your build framework. Obviously the methods exposed here are not generally available, supported, or even guaranteed to work at all – this is not for general application development, but within our remit of creating an interesting technical demonstration it was a viable route forward.


Given access to the hidden functions of the DisplayManager class, it was now possible to automatically connect or disconnect from a known Miracast display – in this case a Samsung Allshare Cast dongle connected to a display on our desk.


User Interface Configuration

Simple display mirroring over Miracast is perfect for showing a movie, pictures or similar content on a larger screen. However, it is just simple display mirroring – so the interface of the device remains exactly the same. This means that, in landscape view on our remote monitor, one will see a letterboxed, portrait image of the phone screen… and that all icons and text are sized as if they were to be displayed on a screen a few inches across, rather than a desktop sized display. Two inch wide icons do not look natural, and to compound this much of the UI layout on a mobile device is also aimed at a small screen – a single scroll list or column of input fields for instance. To obtain a more natural desktop experience we employed three approaches.


First, we needed to ensure the phone transitioned to landscape display when in desktop mode. There are apps in the Google Play™ Store that will allow you to do this. Using one of these in conjunction with an automation app such as Tasker, we can automate locking of display rotation to landscape in our desktop context. From our Android Service, on entering or exiting desktop mode, we broadcast some custom Intents. Using Tasker’s ability to receive intents, we set it up to control the rotation locking app appropriately.


The orientation issue now solved, we can move on to the icon size and UI layout issue. Anyone who has developed with Android knows that there is a comprehensive framework in place to define UI layouts and assets that adjust to the wide range of display sizes found in Android devices. Whilst this framework is not generally intended to be leveraged dynamically, there are methods in a normally hidden interface within the Android framework that allow these values to be programmatically set. Whether this works will depend a little on the precise Android build and which device you are using, but if they are enabled then one can set the pixel density and display size, and leave the Android layout and resource framework to do the rest. There are some caveats here in that some applications will not pick up the new settings and refresh their layout automatically. For the purposes of our demonstration we forced some applications to restart – definitely not a recommended approach in standard Android programming, but possible with the root access we’d already obtained to implement this demonstration.


With our desktop experience now utilising more reasonably sized icons, and layouts designed for larger tablet devices (2-pane layouts, etc.), we can focus a little more attention on the home screen itself. On a mobile device, this tends to be given over to a grid of app icons and widgets, and feature multiple pages of such grids which the user can swipe through. A traditional desktop experience usually has only one page, and a few icons, usually towards the edges of the screen. The default launcher screen on our selected device did not ‘feel’ like a desktop even when locked to landscape and with its tablet layout. So, we opted to install a custom Android launcher. With this we could configure the desktop experience to appear exactly as we desired.


However, we still needed to switch between a mobile and desktop experience – i.e. change the home screen layout dynamically. A little bit of reverse engineering revealed where the settings files for our custom launcher were stored. We used something of a blunt instrument here, but with the help of a library enabling root-access shell commands, we swap out the settings files for the launcher and force it to restart on each context switch between mobile and desktop. This is by far the least elegant implementation of the demonstration, and the most prone to error, but probably went the furthest towards providing a compelling user experience upon entering desktop mode – there was a very visible transition to a User Experience that anyone who has touched a PC in the last 30 years would recognise.



Closing Words

This then concludes a brief exploration of the techniques we used to implement the Seamless Computing demonstration. One of the most interesting conclusions was not only that a mobile device has the capability to function in this desktop context, but that actually it is possible to leverage substantial portions of the existing Android software framework to provide a compelling desktop experience, and to be able to dynamically switch into and out of this. It is by no means a production-ready experience – but it was closer than we’d anticipated on commissioning the demonstration.


Whether a single device operating in this manner is the direction the world will take remains to be seen. There are other possibilities – multiple devices all providing a rich-but-thin client experience to a virtual cloud-hosted desktop or homescreen, for instance. Regardless, ARM technology is allowing our partners to experiment with all of these form factors and performance points, from extraordinary compute power in a handheld device, to capable but extremely cost-conscious tablets or clamshells. Mobile computing is a reality, and we can’t wait to see what happens next.


Ne10 v1.2.0 is released. Radix-3 and radix-5 are now supported in the floating-point complex FFT. The benchmark data below shows that NEON optimization has significantly improved the performance of the FFT.


1. Project Ne10

The Ne10 project has been set up to provide a set of common, useful functions which have been heavily optimized for the ARM Architecture and provide consistent well tested behavior that can be easily incorporated into applications. C interfaces to the functions are provided for both assembler and NEON™ implementations. The library supports static and dynamic linking and is modular, so that functionality that is not required can be discarded. For details of Ne10, please check this blog. For more details of FFT feature in Ne10, please refer this blog.


2. Benchmark

2.1. Time cost

Figure 1 shows benchmark data (time cost) for four FFT implementations: Ne10 (v1.2.0), pffft (2013), kissFFT (1.3.0), and the one inside Opus (v1.1.1-beta). Ne10 and pffft are well NEON-optimized, while kissFFT and the Opus FFT are not. All implementations are compiled with LLVM 3.5, with the -O2 flag, and have been tested on ARMv7-A (Cortex-A9, 1.0GHz) and AArch64 (Cortex-A53, 850MHz).

Figure 1

In figure 1, the x axis is the size of the FFT and the y axis is the time cost (ms); smaller is better. Each FFT has been run 2.048 x 10^6 / (size of FFT) times; for example, a 1024-point FFT is run 2000 times. pffft supports only sizes that are a multiple of 16, so its curve starts from 240. The performance boost after NEON optimization is obvious.


2.2. Mega Floating-point operations per second (MFLOPS)

Figure 2

Figure 2 shows benchmark data in MFLOPS for these four implementations. The data are calculated according to this link. MFLOPS is a measure of the performance of different algorithms in solving the same problem; bigger is better. When data are packed and processed by NEON instructions (in Ne10 and pffft), MFLOPS is much higher.


3. Usage

The FFT API is not modified. During setup, Ne10 detects whether the size of the FFT is a multiple of 3 or 5, and then selects the best algorithm to execute. For more detail, please refer to this blog.

The Android team in ARM was lucky enough to be invited to a Linux Plumbers mini-conf to talk about AArch64, porting from 32-bit to 64-bit and our experiences in working on Binder (a key Android feature which relies upon support in the Linux kernel).


Attached to this post are the raw PDFs (no video this time).


First, an introduction to the AArch64 ISA (from the lead engineer on our Javascript porting work); next, a presentation on porting between AArch32 and AArch64 code (from an engineer who did a lot of work on adding AArch64 support to Skia, a key rendering library in Android); and finally, a presentation on the changes to the Binder kernel driver needed to support 64-bit user space code, from the engineer who did that and a lot of the initial bionic porting to 64-bit for Android.


As an added bonus, I've attached the original slides for the 'From Zero to Boot' talk at Linaro, which are missing from the Linaro page on the talk.

Stephen Kyle

The ART of Fuzz Testing

Posted by Stephen Kyle Nov 26, 2014

In the newest version of Android, Lollipop (5.0), the virtual machine (VM) implementation has changed from Dalvik to ART. Like most VMs, ART has an interpreter for executing the bytecode of an application, but also uses an ahead-of-time (AOT) compiler to generate native code. This compilation takes place for the majority of Java methods in an app, when the app is initially installed. The old VM, Dalvik, only produced native code from bytecode as the app was executed, a process called just-in-time (JIT) compilation.


ART currently provides a single compiler for this AOT compilation, called the quick compiler. This backend is relatively simple for a compiler, using a 1:1 mapping from most bytecodes to set sequences of machine instructions, performing a few basic optimisations on top of this. More backends are in various stages of development, such as the portable backend and the optimizing backend. As the complexity of a backend increases, so too does its potential to introduce subtle bugs into the execution of bytecode. In the rest of this post, we will use the term "backend" to refer to the different ways in which code can be executed by ART, be it the interpreter, the quick compiler, or the optimizing compiler, and the term "quick compiler" and "quick backend" should be considered equivalent.


In this post we will consider how we can check that we aren't introducing new bugs as these backends are developed.


A test suite is useful, but is limited in size, and may only test for regressions of bugs the developers have found in the past. Some errors in the VM may not have been detected yet, and there are always rare cases arising from unexpected code sequences. While some bugs may just cause the compiler to crash, or create a program that produces slightly incorrect output, other bugs can be more malicious. Many of these bugs lurk at the fringes of what we would consider "normal" program behaviour, leaving open potential for exploits that use these fringe behaviours, leading to potential security issues.


How do we find these bugs? Fuzz testing (also commonly known as "fuzzing") can allow us to test a greater range of programs. Fuzz testing generally refers to random generation of input to stress test the capabilities of a program or API, particularly to see how it can handle erroneous input. In this case, we generate random programs to see how the backends of ART deal with verifying, compiling and executing them.  Before we discuss our fuzz testing strategy in more detail, let's look at how apps are executed in Android.


From Java code to execution on your Android device


Let's take a look at a simple Java method, and watch how this code is transformed into a sequence of A64 instructions.


public int doSomething(int a, int b) {
  if (a > b) {
    return (a * 2);
  }
  return (a + b);
}

In Android software development, all Java source files are first compiled to Java bytecode, using the standard javac tool. The Java bytecode format (JVM bytecode) used by Java VMs is not the same as the bytecode used in ART, however. The dx tool is used to translate from JVM bytecode to the executable bytecode used by ART, which is called DEX (Dalvik EXecutable, a holdover from when the VM was called Dalvik.) The DEX code for this Java code looks like:


0000: if-le v2, v3, 0005
0002: mul-int/lit8 v0, v2, #int 2
0004: return v0
0005: add-int v0, v2, v3
0007: goto 0004


In this case, the virtual registers v2 and v3 are the method's parameters, a and b, respectively. For a good reference on DEX bytecode, you can consult this document, but essentially this code compares a to b, and if a is less-than-or-equal-to b it adds a to b and returns that result. Otherwise, it multiplies a by 2 and returns that.


When ART loads this code, it typically compiles the bytecode using the quick backend. This compilation will produce a function that roughly follows the ARM Architecture Procedure Call Standard (AAPCS) used with A64 code - it will expect to find its arguments in r2 and r3*, and will return the correct result in r0. Here is the A64 code that the quick backend will produce, with some simplifications:


  // Reminder: w2 is the 32-bit view of register r2 in A64 code!
  [-- omitted saving of registers w20-w22 to the stack --]
  mov w21, w2
  mov w22, w3
  cmp w21, w22
  b.le doAdd
  lsl w20, w21, #1  // (NB: this is w21 * 2)
doLeave:
  mov w0, w20
  [-- omitted loading of registers w20-w22 from the stack --]
  ret
doAdd:
  add w20, w21, w22
  b doLeave


*(Why not r0 and r1? Because r0 is reserved for passing the context of the method that is currently being executed. r1 is used for the implicit first argument of any non-static method - the reference to the this object.)


Before code can be compiled or executed by any backend, the bytecode must always be verified.  Verification involves checking various properties of the bytecode to ensure it is safe to execute. For example, checking that the inputs to a mul-float bytecode are actually float values, or checking that a particular method can be executed from the class we are currently executing within. Many of these properties are checked when the program is compiled from Java source to DEX bytecode, resulting in compiler errors. However, it is important to perform full bytecode verification when apps are about to be executed, to defend against security exploits that target DEX manipulation.


Once verification has taken place at run time, ART will load the arguments for the method into the correct registers, and then jump straight to the native code. Alternatively, ART could use its interpreter to interpret the input DEX bytecode as Dalvik would traditionally have done before attempting JIT compilation. Any bytecode that is executed as native code should do the exact same thing when it is executed in the interpreter. This means that methods should return the same results and produce the same side-effects. We can use these requirements to test for flaws in the various backend implementations. We expect that any code that passes the initial verification should be compilable, and some aspects of compilation will actually rely on properties of the code that verification has proven. Contracts exist between the different stages of the VM, and we would like to be assured that there are no gaps between these contracts.


Fuzz testing


We have developed a fuzz tester for ART, that uses mutation-based fuzzing to create new test cases from already written Java programs. ART comes with an extensive test suite for testing the correctness of the VM, but with a mutation-based fuzz tester, we can use these provided tests as a base from which we can investigate more corner cases of the VM.


The majority of these test programs produce some kind of console output - or at the very least, output any encountered VM errors to the console. The test suite knows exactly what output each test should produce, so it runs the test, and confirms that the output has not changed. Mutation-based fuzzing means that we take a test program, and modify it slightly - this means that the output of the program may have changed, or the program may now produce an error. Since we no longer know what output to expect, we can instead use the fact that ART has multiple backends to verify that they all execute this program the same way. Note however that this approach is not foolproof, as it may be the case that all of the backends execute the program in the same, incorrect way. To overcome this, it is also possible to test program execution on the previous VM, Dalvik, as long as some known differences between the two VMs are tolerated (e.g. the messages they use to report errors.) As we increase the number of backends to test, the likelihood that they are all wrong in the same way should decrease.




This diagram shows the fuzzing and testing process. First, the fuzzer parses the DEX file format into a form such that it can apply various mutations to the code. It randomly selects a subset of the methods of the program to mutate, and for each one, it randomly selects a number of mutations to apply. The fuzzer produces a new, mutated DEX file with the mutated code, and then executes this program using the various backends of the ART VM.


Note that all backends pass through a single verifier, and that some backends have been simplified in this diagram - the quick and optimizing backends are technically split up into compilation and execution phases, while the interpreter only has an execution phase. Ultimately, the execution of the mutated DEX file should produce some kind of output from each backend, and we compare these outputs to find bugs. In this example, the fact that the optimizing backend produces "9" instead of "7" strongly suggests there is a bug with the way the optimizing backend has handled this mutated code.


So how do we do this fuzzing? A naive approach would be to take the DEX file and flip bits randomly to produce a mutated DEX file. However, this is likely to always produce a DEX file that fails verification. A large part of the verification process is checking that the structure of the DEX file format is sound, and this includes a checksum in the file's header - randomly flipping bits in the whole file will almost certainly invalidate this checksum, and will also likely break some part of the file's structure. A better approach is to focus on applying minor mutations to the sections of the program that directly represent executable code.


Some examples of these minor mutations are as follows:



  • Swap two bytecodes - Pick two bytecodes to swap with each other.
  • Change the register used by a bytecode - Pick one of the registers specified by a bytecode and change it.
  • Change an index into the type/field/method list - Some bytecodes may use an index into a list of methods, types or fields at the start of a DEX file. For example, new-instance v0, type@7 will create a new object with the type listed at index 7 of the type list and put it in v0. The mutation changes which type, field or method is selected.
  • Change the target of a branch bytecode - Make a branch bytecode point to a new target, changing control-flow.
  • Generate a random new bytecode - Generate a new random bytecode and insert it at a random position, with randomly generated values for all of its operands.


We limit our mutations to a few simple changes to bytecodes that individually are unlikely to break the verification of the DEX file, but in combination may lead to differences in the way the program executes. At the same time, we do not want to ensure that every mutation results in a legal bytecode state, because we wish to search for holes in the verification of the program. Often holes in verification may lead to a compiler making an incorrect assumption about the code it is compiling, which will manifest as differences in output between the compiler and the interpreter.
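The first of these mutations ("swap two bytecodes") can be sketched in a few lines of C. This is a minimal illustration, not ART's fuzzer: the `Insn` record and `swap_two_insns` helper are hypothetical, standing in for the fuzzer's decoded DEX instructions.

```c
#include <stdlib.h>

/* Hypothetical decoded-instruction record; the real fuzzer mutates
 * parsed DEX instructions, which carry far more state than this. */
typedef struct {
    unsigned char opcode;
    unsigned char operands[4];
} Insn;

/* "Swap two bytecodes": pick two positions at random and exchange
 * the instructions. The result may or may not still pass
 * verification - probing that boundary is the point. */
static void swap_two_insns(Insn *code, size_t count)
{
    if (count < 2)
        return;
    size_t a = (size_t)rand() % count;
    size_t b = (size_t)rand() % count;
    Insn tmp = code[a];
    code[a] = code[b];
    code[b] = tmp;
}
```

Because the mutation only reorders existing instructions, the method's instruction multiset is preserved - it is the control and data flow that changes, which is what the differential testing then exercises.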


Example of Bugs Found


Now we present one of the bugs that we have found and fixed in the Android Open Source Project's (AOSP) code base, using this fuzz testing strategy.


When presented with a bytecode that reads an instance field of an object, such as iget v0, v1, MyClass.status (this writes into v0 the value of the "status" field of the object referred to by v1) the verifier did not confirm that v1 actually contained a reference to an object.


Here's a sequence of bytecodes that creates a new MyClass instance, and sets the status field to its initial value + 1:


const v0, 1
new-instance v1, MyClass
invoke-direct {v1} void MyClass.<init>() // calling MyClass() constructor
iget v2, v1, MyClass.status
add-int v2, v0, v2
iput v2, v1, MyClass.status


If a mutation changed the v1 on line 4 to v0, then iget would now have the constant 1 currently in v0 as an input, instead of the reference to an object that was in v1.  Previously, the verifier would not report this as an error when it should, and so the compiler (which expects the iget bytecode to have been properly verified) would expect an object reference to be in the input register for iget, and just read from the value of that reference plus the offset of the status field. If an attacker ensured that an address they wanted to read from was used as the loaded constant, they could read from any memory address in the process' address space. Java removes the ability to read memory directly (without the use of some mechanism such as JNI), to ensure that, for instance, private fields of classes cannot be accessed from within Java, but this bug allowed this to happen.


While this particular bug was present in the verifier, other bugs have been found and fixed in the quick backend of ART. For some of these bugs, we have contributed patches to the AOSP code base, while other bugs have been reported to the ART team. As a result of our fuzz testing efforts, new tests have been added to ART's test suite that are buildable directly from a description of DEX bytecode, whereas previously all tests had to be built from Java source code. This was necessary because many bugs we have found arise from specially crafted pieces of bytecode that the javac and dx tools would not generate themselves. We have aimed to submit DEX bytecode tests with any patches we submit to AOSP.




In this post we have looked at how fuzz testing can help the development of new backends for a virtual machine, specifically the ART VM that now powers Android.  From the roughly 200 test programs already present in ART's test suite, we have produced a significantly larger number of new tests using fuzzing. Each additional program used for testing increases our confidence that the implementation of ART is sound.  Most of the bugs we found affected the quick backend of ART as it was being developed in AOSP, but as new bugs could arise from complicated interactions between optimisations in the optimizing backend, the use of fuzz testing will increase our chances of finding any bugs and squashing them early.


Further Reading


The initial research into fuzzing was performed by Barton Miller at UW-Madison.


Paul Sabanal fuzzed the experimental release version of ART in KitKat, and found a few crashes. He presented this work at HITB2014.


For more information about differential testing, various papers have been written about Csmith, a tool that performs differential testing to test C compilers.


Researchers at UC Davis recently presented work about Equivalence Modulo Inputs, where seed programs are fuzzed to produce new programs that are expected to produce the same output as the seed program for a given set of inputs. All produced programs are then compiled and executed, and divergences in output indicate miscompilations.

In this blog I will cover various methods of runtime feature detection on CPUs implementing the ARMv8-A architecture. These methods include using HWCAP on Linux and Android, using the NDK on Android, and using /proc/cpuinfo. I will also provide sample code to detect the new optional features introduced in the ARMv8-A architecture. Before we dig into the different methods, let us understand more about ARMv8-A CPU features.


ARMv8-A CPU features


ARMv7-A CPU features


The ARMv8-A architecture has made many optional ARMv7-A features mandatory, including advanced SIMD (also called NEON). This applies to both ARMv8-A execution states, namely AArch32 (the 32-bit execution state, backward compatible with ARMv7-A) and AArch64 (the 64-bit execution state).


New features


The ARMv8-A architecture introduces a new set of optional instructions including AES. These instructions were not available in ARMv7-A architecture. These optional instructions are grouped into various categories, as listed below.


  • CRC32 instructions - CRC32B, CRC32H, CRC32W, CRC32X, CRC32CB, CRC32CH, CRC32CW, and CRC32CX
  • SHA1 instructions - SHA1C, SHA1P, SHA1M, SHA1H, SHA1SU0, and SHA1SU1
  • SHA2 instructions - SHA256H, SHA256H2, SHA256SU0, and SHA256SU1
  • AES instructions - AESE, AESD, AESMC, and AESIMC
  • PMULL instructions that operate on 64-bit data - PMULL and PMULL2


Runtime CPU feature detection scenarios


User-space programs can detect features supported by an ARMv8-A CPU at runtime, using many mechanisms including /proc/cpuinfo, HWCAP and the Android NDK CPU feature API.  I will describe them in detail below.


Detect CPU feature using /proc/cpuinfo


Parsing /proc/cpuinfo is a popular way to detect CPU features. However, I strongly recommend not using /proc/cpuinfo for CPU feature detection on ARMv8-A, as it is not a portable mechanism. Indeed, /proc/cpuinfo reflects the characteristics of the kernel rather than of the application being executed. This means that /proc/cpuinfo is the same for both 32-bit and 64-bit processes running on an ARMv8-A 64-bit kernel. The ARMv8-A 64-bit kernel's /proc/cpuinfo output is also quite different from that of an ARMv7-A 32-bit kernel. For example, an ARMv8-A 64-bit kernel uses 'asimd' to indicate advanced SIMD support, while an ARMv7-A 32-bit kernel uses 'neon'. Thus, NEON detection code that looks for the "neon" string in /proc/cpuinfo will not work on an ARMv8-A 64-bit kernel. Applications using /proc/cpuinfo should migrate to either HWCAP or the NDK API, as these are maintained and controlled interfaces, unlike /proc/cpuinfo.
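To make the pitfall concrete, here is a sketch of the naive substring test such legacy code typically performs (the helper name is mine, not from any library):

```c
#include <string.h>

/* Naive check used by much legacy code: does the /proc/cpuinfo
 * "Features" line contain the given token? (Hypothetical helper.) */
static int features_line_has(const char *features_line, const char *token)
{
    return strstr(features_line, token) != NULL;
}
```

With a typical AArch64 Features line such as "fp asimd evtstrm aes pmull sha1 sha2 crc32", `features_line_has(line, "neon")` returns 0 - the code concludes there is no SIMD support on a CPU where advanced SIMD is mandatory.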


Detect CPU feature using HWCAP


HWCAP can be used on ARMv8-A processors to detect CPU features at runtime.


HWCAP and Auxiliary vector


First, let me give you a brief overview of HWCAP. HWCAP uses the auxiliary vector feature provided by the Linux kernel. The Linux kernel's ELF binary loader uses the auxiliary vector to pass certain OS and architecture specific information to user space programs. Each entry in the vector consists of two items: the first identifies the type of entry, the second provides the value for that type. Processes can access these auxiliary vectors through the getauxval() API call.


getauxval() is a library function available to user space programs to retrieve a value from the auxiliary vector. This function is supported by both bionic (Android's libc library) and glibc (GNU libc library).  The prototype of this function is unsigned long getauxval(unsigned long type); Given the argument type, getauxval() returns the corresponding value.


<sys/auxv.h> defines the various vector types. Amongst them, AT_HWCAP and AT_HWCAP2 are of interest to us. These auxiliary vector types specify processor capabilities. For these types, getauxval() returns a bit-mask, with different bits indicating various processor capabilities.




Let us look at how HWCAP can be used on ARMv8-A. In ARMv8-A, the values returned for AT_HWCAP and AT_HWCAP2 depend on the execution state. For AArch32 (32-bit processes), AT_HWCAP provides flags specific to ARMv7 and prior architectures (NEON, for example), while AT_HWCAP2 provides ARMv8-A related flags like AES and CRC32. In the case of AArch64, AT_HWCAP provides ARMv8-A related flags like AES, and the AT_HWCAP2 bit-space is not used.


Benefits of HWCAP


One of the main benefits of using HWCAP over other mechanisms like /proc/cpuinfo is portability. Existing ARMv7-A programs that use HWCAP to detect features like NEON will run as-is on ARMv8-A, without any change. Since getauxval() is supported on Linux (through glibc) and Android (through bionic), the same code can run on both Android and Linux.


Sample code for AArch32 state


The sample code below shows how to detect CPU features using AT_HWCAP in the AArch32 state.


#include <stdio.h>
#include <sys/auxv.h>
#include <asm/hwcap.h>

int main()
{
    long hwcaps2 = getauxval(AT_HWCAP2);

    if (hwcaps2 & HWCAP2_AES) {
        printf("AES instructions are available\n");
    }
    if (hwcaps2 & HWCAP2_CRC32) {
        printf("CRC32 instructions are available\n");
    }
    if (hwcaps2 & HWCAP2_PMULL) {
        printf("PMULL/PMULL2 instructions that operate on 64-bit data are available\n");
    }
    if (hwcaps2 & HWCAP2_SHA1) {
        printf("SHA1 instructions are available\n");
    }
    if (hwcaps2 & HWCAP2_SHA2) {
        printf("SHA2 instructions are available\n");
    }
    return 0;
}


Sample code for AArch64 state


The code below shows how to detect ARMv8-A CPU features in an AArch64 process using HWCAP.


#include <stdio.h>
#include <sys/auxv.h>
#include <asm/hwcap.h>

int main()
{
    long hwcaps = getauxval(AT_HWCAP);

    if (hwcaps & HWCAP_AES) {
        printf("AES instructions are available\n");
    }
    if (hwcaps & HWCAP_CRC32) {
        printf("CRC32 instructions are available\n");
    }
    if (hwcaps & HWCAP_PMULL) {
        printf("PMULL/PMULL2 instructions that operate on 64-bit data are available\n");
    }
    if (hwcaps & HWCAP_SHA1) {
        printf("SHA1 instructions are available\n");
    }
    if (hwcaps & HWCAP_SHA2) {
        printf("SHA2 instructions are available\n");
    }
    return 0;
}


Detect CPU feature using Android NDK CPU feature API


The Android NDK provides an API to detect the CPU architecture family and the supported features at run time.


CPU feature API


There are two main functions, android_getCpuFamily() and android_getCpuFeatures().


  • android_getCpuFamily() - Returns the CPU family
  • android_getCpuFeatures() - Returns a bitmap describing a set of supported optional CPU features. The exact flags will depend on CPU family returned by android_getCpuFamily(). These flags are defined in cpu-features.h


Support for ARMv8-A optional features


The latest NDK release (version 10b, September 2014) supports ARMv8-A CPU feature detection only for the AArch64 mode. However, the NDK project in AOSP supports both the AArch32 and the AArch64 CPU feature flags. The AArch32 feature flags were added to the AOSP in change list 106360. The NDK uses HWCAP internally to detect the CPU features.


NDK sample code to detect ARMv8-A CPU features


Detect CPU family


#include <stdio.h>
#include "cpu-features.h"

int main()
{
    AndroidCpuFamily family;
    family = android_getCpuFamily();
    if (family == ANDROID_CPU_FAMILY_ARM) {
        printf("CPU family is ANDROID_CPU_FAMILY_ARM \n");
    } else if (family == ANDROID_CPU_FAMILY_ARM64) {
        printf("CPU family is ANDROID_CPU_FAMILY_ARM64 \n");
    } else {
        printf("CPU family is %d \n", family);
    }
    return 0;
}


Detect ARMv8-A CPU features


#include <stdio.h>
#include "cpu-features.h"

void printArm64Features()
{
    uint64_t features;
    features = android_getCpuFeatures();
    if (features & ANDROID_CPU_ARM64_FEATURE_AES) {
        printf("AES instructions are available\n");
    }
    if (features & ANDROID_CPU_ARM64_FEATURE_PMULL) {
        printf("PMULL instructions, that operate on 64-bit data, are available\n");
    }
    if (features & ANDROID_CPU_ARM64_FEATURE_SHA1) {
        printf("SHA1 instructions are available\n");
    }
    if (features & ANDROID_CPU_ARM64_FEATURE_SHA2) {
        printf("SHA2 instructions are available\n");
    }
    if (features & ANDROID_CPU_ARM64_FEATURE_CRC32) {
        printf("CRC32 instructions are available\n");
    }
}

void printArmFeatures()
{
    uint64_t features;
    features = android_getCpuFeatures();
    if (features & ANDROID_CPU_ARM_FEATURE_AES) {
        printf("AES instructions are available\n");
    }
    if (features & ANDROID_CPU_ARM_FEATURE_PMULL) {
        printf("PMULL instructions, that operate on 64-bit data, are available\n");
    }
    if (features & ANDROID_CPU_ARM_FEATURE_SHA1) {
        printf("SHA1 instructions are available\n");
    }
    if (features & ANDROID_CPU_ARM_FEATURE_SHA2) {
        printf("SHA2 instructions are available\n");
    }
    if (features & ANDROID_CPU_ARM_FEATURE_CRC32) {
        printf("CRC32 instructions are available\n");
    }
}

int main()
{
    AndroidCpuFamily family;
    family = android_getCpuFamily();
    if (family == ANDROID_CPU_FAMILY_ARM) {
        printArmFeatures();
    }
    if (family == ANDROID_CPU_FAMILY_ARM64) {
        printArm64Features();
    }
    return 0;
}




The ARMv8-A architecture makes certain ARMv7-A features mandatory and introduces a new set of optional features. The popular way of detecting features at runtime by parsing /proc/cpuinfo is not portable to ARMv8-A, and existing code will not work without tricky changes. Instead, application programmers can easily use HWCAP on Linux and the NDK on Android. For detecting ARMv8-A optional features in the AArch32 mode on Android, programmers should use HWCAP, as the latest NDK does not have support for it yet.

The recent Linaro Connect (http://www.linaro.org/connect/lcu/lcu14/) saw several ARM and Linaro presentations about Android and about 64-bit. I think these might be interesting to anyone following Android, ARMv8, AArch64 or 64-bit progress in mobile.


First is Serban Constantinescu presenting the journey involved in getting AOSP running first on a 64-bit kernel (in 2012) and then booting with a 64-bit userspace - all on ARM Fast Models:

LCU14 411: From zero to booting nandroid with 64bit support - YouTube


Next is Stuart Monteith with the story of porting Dalvik to 64-bit - and how Dalvik and ART are related:

LCU14-100: Dalvik is Dead, Long Live Dalvik! OR Tuning ART - YouTube


Then a presentation by Ashok Bhat on some collaborative work between Linaro and ARM on creating multimedia tests to help with porting several Android codecs to 64-bit:

LCU14-502: Android User-Space Tests: Multimedia codec tests, Status and Open Discussions - YouTube


Finally, a presentation by Kevin Petit on ARMv8 NEON and the use of intrinsics:

LCU14-504: Taming ARMv8 NEON: from theory to benchmark results - YouTube


Hopefully, for those who prefer reading to watching, we will be able to post some blogs on the topics soon.

A few years ago (20?), I bought a programmable calculator and downloaded a program (from a "Bulletin Board" in Europe) to do symbolic Z-transform expansions for a digital signal processing test I had in college. I finished my test in a few minutes and was immediately handed it back with a "perfect!" and a 0 for a grade. When I explained that I had downloaded a program to my calculator from a site in Europe, I got "right...". After a 30-second demo (and an explanation of how the code worked), the zero had a 10 put in front of it and that professor became my advisor.


Since then, billions of people have been downloading apps through an open source VM (that I actually wrote some code for) called "Android". A couple of years ago, I decided to start working on another open source VM that I call "rekam1": mirrowrite(rekam1); I'll be demoing some consumer programmable projects with it at the World Maker Faire in NYC (check it out if you happen to be in the area), and I'll be talking about virtual machines for wirelessly connected Cortex-M devices at the upcoming TechCon conference in my talk "The Consumer Programmable IoT". If you're interested in seeing how the maker (and consumer developer) community could change how we all write and share code, check out my talk!


ARM TechCon Schedule Builder | Session: The Consumer Programmable IoT

Eirik Aavitsland at Digia has created a blog post about how you can easily make an ODROID-U3 or another device running a recent version of Android boot to Qt.


This topic has been blogged about before, but quite a few things have improved in the ease of use and breadth of support of Streamline on Android in the past few years. For starters, Mac OS X is now well supported. All three major development platforms (Linux, Windows and Mac) can run the DS-5 Community Edition debugger (gdb) and Streamline, either with the ADT Eclipse tools from Google plus DS-5 as an add-on, or pre-packaged as DS-5 CE for Windows and Linux from ARM with ADT as an add-on. Also, and most welcome, is the new gator driver. The component of Streamline that runs in Android to collect OS and processor counters used to require both a kernel module and a driver daemon, and compiling and flashing any module could be complicated depending on the availability of your Android platform's kernel headers. That requirement has been removed, and now the gator driver will run as root on many devices. This July (7/2014), an updated version of gatord will be released in DS-5 CE 5.19 that greatly expands the kernel versions supported (beyond the 3.12 kernel supported in the current DS-5 5.18 release). Finally, I've found some erroneous and dated info in blogs that claim to be up to date with DS-5 5.18 and even the yet-to-be-released 5.19. I'll try to correct that here.


Streamline is a powerful system analysis tool that will help you speed up your code, reduce your energy footprint and balance system resources. The free version in the Community Edition of DS-5 lets you view CPU and OS counters in a powerful graphical view: CPU and GPU activity, cache hits and misses, and visibility down into individual threads and modules. You can find code that is blocking or that could be optimized by multithreading or by refactoring in NEON or on the GPU. Check out more features on the optimize site.



Getting Started:


As of this writing the Android SDK Manager is Revision 22.6.4 bundled in the latest SDK for Mac, adt-bundle-mac-x86_64-20140321. The SDK is available at the Android Developer Site. The Native Development Kit (NDK) is revision 9d. Download both of these for your appropriate platform. I’m downloading the Mac OS X 64-bit versions for this guide but these instructions should work for Windows and Linux just as easily.


Once you unpack these tools, you should add some executable paths to your platform if you plan on using the terminal for anything like the Android Debug Bridge (adb). It is now possible to use all of the tools from within Eclipse without adjusting your executable paths, but for some of us old-schoolers who are wedded to the CLI: I drop my NDK folder into the SDK folder and put that folder in my Mac's /Applications directory. You can place them wherever you like on most platforms though. I then added these to my ~/.bashrc:


export PATH=$PATH:/Applications/adt-bundle-mac-x86_64-20140321/sdk/platform-tools

export PATH=$PATH:/Applications/adt-bundle-mac-x86_64-20140321/sdk/tools

export PATH=$PATH:/Applications/adt-bundle-mac-x86_64-20140321/android-ndk-r9d


You should now be able to launch common Android tools from your command line:

> which ndk-build


> which fastboot


> which adb


> which android



You can Launch the Android SDK Manager from Eclipse in the “Window” menu or via the command line by typing:

> android


From there, you can update your current SDK, install older APIs, build-tools, platform tools and in “Extras”, the Android Support Library for compatibility with older APIs.


When you run Eclipse (ADT) for the first time or change versions, you may have to tell it where to find the SDK. The Preferences dialog box is found on Macs via the ADT->Preferences menu, sub heading Android.


Setting up a demo app to analyze (if you don’t have your own app):


You probably have your own library or application you want to perform system analysis on but just in case you’re checking out the tool, I’ll step through setting up an app that is near and dear to me, ProjectNe10. You can grab the master branch archive from GitHub. For this tool demo, I’ve created a directory /workspace and unzipped the Ne10 archive inside that folder. ProjectNe10 requires the cmake utility. Fortunately, there is a Homebrew solution to install cmake from the command line:


brew install cmake


If you don’t have brew installed, install it. You’ll use it in the future, I promise. You can also just download the binary for any platform from cmake.

Now we can build the Ne10 library from the command line:


Set these to your particular paths:


export NE10PATH=/workspace/projectNe10

export ANDROID_NDK=/Applications/adt-bundle-mac-x86_64-20140321/android-ndk-r9d




cd $NE10PATH

mkdir build && cd build

cmake -DCMAKE_TOOLCHAIN_FILE=../android/android_config.cmake ..


make install


That make install line will copy libNE10_test_demo.so to your /workspace/projectNe10/android/NE10Demo equivalent. Now you can go to the File->Import menu in Eclipse and import an existing Android code base into your workspace.




If all goes well, you should be able to connect your ARM based Android Device (in my case, a Nexus 5 running Android 4.4.4 to match the current SDK at the time of this writing) and run this app from the Run menu as an Android app. As a sanity check, you should run adb devices from the command line to verify you can see your device. This app will iterate through every function in the ProjectNe10 library with both C and NEON implementations. One of the implementations should be faster. I’ll give you a hint. It is the NEON implementation.



Installing DS-5 Community Edition (Free Eclipse Plugin with enhanced ARM debug and system analysis):


Start Eclipse and go to the menu Help->Install New Software.... Click on “Add...”, and paste http://tools.arm.com/eclipse in the location text box, then click OK. Select ARM DS-5 Community Edition, as shown on the screenshot below, and click Next. Eclipse will compute the dependencies of the DS-5 CE plug-ins.



Click Next again. Read the license agreements and if you accept, hit Finish. After the install is complete, ADT will ask you to reload.

A license dialog should popup if this is a fresh install. Select "Install Community edition license" and click "Continue".


If there was no popup license message go to Help->Generate community edition license, and click "Finish".


Congratulations, you now have ARM DS-5 CE installed, with its enhanced and easy to use debugger, which you can use to debug Android NDK apps and libraries with the steps in this guide. You also have Streamline, a powerful system analysis tool which we'll cover in the next section.


Using Streamline and gator to analyze Android apps and the entire system


Before you can gather data for system analysis, you have to install a data collecting driver (daemon) in Android. gatord gathers processor and kernel counters on the Android device and streams them over to your host machine. It must run as root to do this. Any device with an unlocked boot loader is very simple to root: you usually just flash a custom recovery tool like TWRP and install SuperSU. If you have a locked bootloader, you'll have to use a device exploit, so I can't recommend this or help you, but your favorite search engine might… This is a minor inconvenience now, as older versions required a kernel module (gator.ko) which needed to be compiled against your particular device's kernel headers. Now that the Android security requirements for passing the Android CTS disallow kernel modules, you would have to compile gator into the kernel and flash it. Fortunately, the new gatord will expand its kernel version support significantly in July.


First, build gatord. Go to the menu Help->ARM Extras… this will open up a folder with several goodies in it.



I’m going to build this from the command line, so fire up your favorite terminal and cd into this directory. The easiest way in the Mac terminal app is to type “cd ” and drag the gator folder into the terminal window. OS X will fill in the path, then:


cd daemon-src

tar zxf gator-daemon.tar.gz

mv gator-daemon jni

cd jni



These steps should unzip the gatord source, and build it for Android (dynamically linked) with the output in ../libs/armeabi/gatord. Copy this binary to your Android device with your favorite method, AirDroid, scp, Droid NAS or very simply:


adb push ../libs/armeabi/gatord /sdcard/gatord


This, of course, assumes you’ve enabled developer options and debugging on your device: “On Android 4.2 and newer, Developer options is hidden by default. To make it available, go to Settings > About phone and tap Build number seven times. Return to the previous screen to find Developer options.” In Developer options, click USB debugging. If this is a new device, you may have to approve the debug link security the first time you try to use adb. You can also do this with an ARM based Android Virtual Device (AVD) in the emulator if your physical device is too ‘locked down’, but Streamline system data won’t be as useful. You may have to use “mount -o rw,remount rootfs /“ and “chmod 777 /mnt/sdcard” in your AVD to push gatord.


Now, the tricky part: you have to move this binary to an executable location in the filesystem and set executable permissions. The most reliable method I’ve used is ES File Explorer. Go into the menu, turn on Root Explorer and go to the Mount R/W option, set root “/“ as RW (read/writeable) rather than RO. Then copy and paste gatord into /system/bin in your Android filesystem. You can also set the permissions to executable in ES File Explorer by long pressing on the gatord file, then more->Properties->Permissions->Change. Give the owner and group Execute permission and press Ok.


Back in your host machine terminal you need to set up a pipe for gator to communicate over USB and then get a shell on the device to start it:


adb forward tcp:8080 tcp:8080

adb shell


Now you’ve got a shell on your Android device, you can su to root and start gatord. Type:


su

gatord &

The rest is pretty straightforward. Go to Window->Show View->Other…->DS-5->ARM Streamline Data.

Click on the gear button




In the address section, enter “localhost” if you’re streaming the capture data over USB using adb to forward the TCP port. In the Program Images box select the shared library that you want to profile (add ELF image from workspace).





You can now use the red “Start Capture” button at any time.


Other blogs and tutorials are accurate from this point forward on the features and use of Streamline so I’ll drop a few and let you get to it!

The “CAPTURING DATA AND VIEWING THE ARM STREAMLINE REPORT” section of this blog is accurate.

Events based sampling video, analyzing CPU and GPU performance and customizing charts on YouTube.


At VIA we are aiming to provide more software support for our products. Most of our ARM based boards have both Linux and Android images that potential partners can try. In our product line we have two Freescale boards, the VAB-800 (single core Cortex-A8) and the newer VAB-820 (quad core Cortex-A9). The latter just got a brand new Android Evaluation Package, fresh up on our website and ready for testing.


The Android image is based on Jelly Bean 4.2.2 (which puts this package ahead of even our other boards). Among the available features are the CAN bus driver, resistive touch screen, HDMI video and audio output, dual display and mini PCI-E support. On the developer timeline for future releases we have ADV-7180 capture, WatchDog/GPIO, and VIA Smart ETK support for embedded solutions.


Android on Freescale is still a quite new combination with a lot of potential. We are hoping that it will make device developers' lives easier on both the software and the hardware side. This evaluation package is just the beginning of this conversation.


What do you think, what would make you choose an Android system over others for your next embedded solution?


The Android Evaluation Package (as well as the Linux Evaluation Package) is available on the VIA Embedded website.

There is a very active overclocking community called HWBot, with quite a few organizers here in Taiwan. For a very long time they have been doing desktop (and laptop?) overclocking, challenging the hardware and pushing the boundaries. When they were giving a presentation about their past and future plans, they were really proud of having made the desktop computer industry care more about hardware quality.


Now they want to do the same thing for smartphone hardware. They just recently released the beta version of their Android benchmarking app, HWBot Prime, and started to gather data for different devices. My HTC Butterfly (running a quad-core Snapdragon S4, I guess) did pretty well on that (whenever I could kill enough apps not to interfere with the benchmark).


The VIA Springboard (which I'm taking care of) is a single-board computer that can run Android (4.0.3) as well as Linux. It has a single-core 800 MHz WM8950 CPU, so it is no match for the Butterfly, but its per-core-per-MHz results are better.


So far it's benchmarking only, without any overclocking yet, and I'm running the stock Android image, but that's a good baseline to start improving on. You can't manage what you can't measure, right?


The submitted result for the Springboard is on the HWBot leaderboard. I wonder if anyone else wants to benchmark, tune, and overclock their ARM devices?


The whole experience is written up in the VIA Springboard Blog.

As a result of the rapid proliferation of Android smart phones and tablets, embedded developers worldwide are increasingly adopting the operating system for a growing number of embedded systems and connected devices that leverage its rich application framework, native multimedia capabilities, massive app ecosystem, familiar user interface, and faster time to market.


However, although the benefits of adopting Android for embedded systems and devices can be great, particularly for touch-based multimedia applications, utilizing the OS also presents a number of critical challenges, including selecting the right ARM SoC platform for the target system application, porting and customizing the operating system and applications, and ensuring tight integration between the hardware and software to deliver a compelling end-user experience.




In addition to exploring the benefits and challenges of adopting Android for embedded applications, the attached white paper provides an overview of the holistic approach that VIA Embedded has established in order to enable developers to reduce product development times and speed up time to market for innovative new embedded Android systems and devices.


Holistic Approach

VIA is committed to supporting the entire product development life cycle, from defining product requirements all the way through development.

  • Best-of-breed, application-specific ARM SoC platforms, covering a comprehensive range of Freescale and VIA ARM SoCs
  • Small form factor ARM boards and systems, drawing on VIA's expertise in creating practical form factor standards
  • Android software packages and customization services (see below)
  • Longevity, with specific boards and systems supported for up to 5 years


Android Customization


VIA Embedded provides a wide range of software solution packages and customization services to facilitate the development of Android embedded systems and devices:

  • Customized applications, including system apps
  • Kernel & Framework including security and special devices
  • System management including watchdog, remote monitoring, remote power on/off, silencing app and system upgrades
  • Embedded I/O including legacy I/O


VIA Android Smart Embedded Tool Kit (ETK)


The VIA Embedded Android Smart ETK includes a set of APIs that enable Android applications to access I/O and manageability services provided by the system hardware that are not supported by the standard Android framework.


APIs include:

  • Watchdog to help applications and the system recover from failures and lock-ups
  • Scheduled power on/off and periodic reboots
  • RTC wake-up to auto power on at a specific time of the day, week, or month
  • Legacy I/O support, making RS232, GPIO, I2C, and CAN bus available to apps



More details and a case study are presented in the attached whitepaper.
