
Android Community


Ne10 v1.2.0 was released recently. This update adds a new feature: radix-3 and radix-5 fast Fourier transforms (FFT). As the benchmarks show, NEON optimization gives the FFT a large performance boost.


1. The Ne10 project

The Ne10 project aims to provide the ARM ecosystem with heavily NEON-optimized building-block functions, such as image processing, digital signal processing (DSP) and math routines. To learn more about the Ne10 project, please see this blog. To learn more about the FFT feature in Ne10, please see this blog.


2. Benchmark

2.1. Time cost

Figure 1 gives the performance data of four implementations on ARMv7-A (Cortex-A9, 1.0GHz) and AArch64 (Cortex-A53, 850MHz): Ne10 (v1.2.0), pffft (2013), kissFFT (1.3.0), and the FFT implementation inside the Opus project (v1.1.1-beta), which is based on kissFFT but further optimized. kissFFT and the Opus implementation do not use NEON, while Ne10 and pffft are heavily NEON-optimized. The compiler is LLVM 3.5 with the -O2 option.

Figure 1

In Figure 1, the x-axis is the FFT length and the y-axis is the time cost; less time means better performance. The number of iterations is 2.048 x 10^6 / (FFT length); for example, a 1024-point FFT is executed 2000 times and the total running time is recorded. Because pffft requires the FFT length to be a multiple of 16, its curve starts at 240. It is clear that NEON optimization brings an obvious performance improvement.


2.2. Millions of floating-point operations per second (MFLOPS)

 

Figure 2

Figure 2 gives the MFLOPS of the four FFT implementations; the calculation method follows this link. The x-axis is the FFT length and the y-axis is MFLOPS. MFLOPS reflects the performance of different algorithms solving the same problem; larger is better. As the figure shows, the NEON instructions "pack" the data and process it in parallel, which greatly increases MFLOPS.


3. Usage

This update does not change the FFT API. During initialization (setup), Ne10 detects whether the FFT length contains radix-3 or radix-5 factors and then selects the optimal computation method. For details, please refer to this blog.

Ne10 v1.2.0 has been released. Radix-3 and radix-5 are now supported in the floating-point complex FFT. The benchmark data below show that NEON optimization significantly improves FFT performance.

 

1. Project Ne10

The Ne10 project has been set up to provide a set of common, useful functions which have been heavily optimized for the ARM architecture and provide consistent, well-tested behavior that can be easily incorporated into applications. C interfaces to the functions are provided for both assembler and NEON™ implementations. The library supports static and dynamic linking and is modular, so that functionality that is not required can be discarded. For details of Ne10, please check this blog. For more details of the FFT feature in Ne10, please refer to this blog.

 

2. Benchmark

2.1. Time cost

Figure 1 shows benchmark data (time cost) for four FFT implementations: Ne10 (v1.2.0), pffft (2013), kissFFT (1.3.0), and the one inside Opus (v1.1.1-beta). Ne10 and pffft are heavily NEON-optimized, while kissFFT and the Opus FFT are not. All implementations were compiled with LLVM 3.5 using the -O2 flag, and all were tested on ARMv7-A (Cortex-A9, 1.0GHz) and AArch64 (Cortex-A53, 850MHz).

Figure 1

In Figure 1, the x-axis is the FFT size and the y-axis is the time cost (ms); smaller is better. Each FFT has been run 2.048 x 10^6 / (FFT size) times; for example, a 1024-point FFT is run 2000 times. pffft only supports sizes that are multiples of 16, so its curve starts at 240. The performance boost from NEON optimization is obvious.

 

2.2. Millions of floating-point operations per second (MFLOPS)

Figure 2

Figure 2 shows benchmark data in MFLOPS for these four implementations. The data are calculated according to this link. MFLOPS measures the performance of different algorithms solving the same problem; bigger is better. When data are packed and processed by NEON instructions (in Ne10 and pffft), MFLOPS is much higher.
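
For reference, a common convention for estimating complex-FFT throughput (the one used by FFTW's benchFFT; the page linked above may differ in detail) assumes roughly 5·N·log2(N) floating-point operations per N-point transform. A minimal sketch of that calculation, with a hypothetical helper name:

#include <math.h>

/* Estimated MFLOPS for an N-point complex FFT that takes time_us microseconds,
   using the common 5*N*log2(N) operation-count convention (an assumption). */
static double fft_mflops(double n, double time_us)
{
    return 5.0 * n * log2(n) / time_us;
}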

 

3. Usage

The FFT API is not modified. Ne10 detects whether the FFT size contains factors of 3 or 5, and then selects the best algorithm to execute. For more details, please refer to this blog.
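
As an illustration, here is a minimal sketch of transforming a length that contains radix-3 and radix-5 factors (480 = 2^5 x 3 x 5 is a hypothetical example). It reuses the c2c float32 API shown in the Ne10 FFT post later on this page; exact names should be checked against the v1.2.0 headers.

#include "NE10.h"

void fft_480_example(void)
{
    int fftSize = 480; /* contains factors of 3 and 5; radices are chosen during setup */
    ne10_fft_cpx_float32_t *in  = (ne10_fft_cpx_float32_t*) NE10_MALLOC (fftSize * sizeof (ne10_fft_cpx_float32_t));
    ne10_fft_cpx_float32_t *out = (ne10_fft_cpx_float32_t*) NE10_MALLOC (fftSize * sizeof (ne10_fft_cpx_float32_t));
    ne10_fft_cfg_float32_t cfg = ne10_fft_alloc_c2c_float32 (fftSize);

    /* ... fill 'in' with fftSize complex samples ... */
    ne10_fft_c2c_1d_float32_neon (out, in, cfg, 0); /* forward FFT */
    ne10_fft_c2c_1d_float32_neon (out, in, cfg, 1); /* inverse FFT */

    NE10_FREE (in);
    NE10_FREE (out);
    NE10_FREE (cfg);
}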

The Android team in ARM was lucky enough to be invited to a Linux Plumbers mini-conf to talk about AArch64, porting from 32-bit to 64-bit and our experiences in working on Binder (a key Android feature which relies upon support in the Linux kernel).

 

Attached to this post are the raw PDFs (no video this time).

 

First is an introduction to the AArch64 ISA (from the lead engineer on our JavaScript porting work); next, a presentation on porting between AArch32 and AArch64 code (from an engineer who did a lot of work adding AArch64 support to Skia, a key rendering library in Android); and finally, a presentation on the changes to the Binder kernel driver needed to support 64-bit user-space code, from the engineer who did that work along with much of the initial bionic porting to 64-bit for Android.

 

As an added bonus, I've attached the original slides for the 'From Zero to Boot' talk at Linaro, which are missing from the Linaro page on the talk.

Stephen Kyle

The ART of Fuzz Testing

Posted by Stephen Kyle Nov 26, 2014

In the newest version of Android, Lollipop (5.0), the virtual machine (VM) implementation has changed from Dalvik to ART. Like most VMs, ART has an interpreter for executing the bytecode of an application, but also uses an ahead-of-time (AOT) compiler to generate native code. This compilation takes place for the majority of Java methods in an app, when the app is initially installed. The old VM, Dalvik, only produced native code from bytecode as the app was executed, a process called just-in-time (JIT) compilation.

 

ART currently provides a single compiler for this AOT compilation, called the quick compiler. This backend is relatively simple for a compiler, using a 1:1 mapping from most bytecodes to set sequences of machine instructions, performing a few basic optimisations on top of this. More backends are in various stages of development, such as the portable backend and the optimizing backend. As the complexity of a backend increases, so too does its potential to introduce subtle bugs into the execution of bytecode. In the rest of this post, we will use the term "backend" to refer to the different ways in which code can be executed by ART, be it the interpreter, the quick compiler, or the optimizing compiler, and the term "quick compiler" and "quick backend" should be considered equivalent.

 

In this post we will consider how we can check that we aren't introducing new bugs as these backends are developed.

 

A test suite is useful, but is limited in size, and may only test for regressions of bugs the developers have found in the past. Some errors in the VM may not have been detected yet, and there are always rare cases arising from unexpected code sequences. While some bugs may just cause the compiler to crash, or create a program that produces slightly incorrect output, other bugs can be more malicious. Many of these bugs lurk at the fringes of what we would consider "normal" program behaviour, leaving open potential for exploits that use these fringe behaviours, leading to potential security issues.

 

How do we find these bugs? Fuzz testing (also commonly known as "fuzzing") can allow us to test a greater range of programs. Fuzz testing generally refers to random generation of input to stress test the capabilities of a program or API, particularly to see how it can handle erroneous input. In this case, we generate random programs to see how the backends of ART deal with verifying, compiling and executing them.  Before we discuss our fuzz testing strategy in more detail, let's look at how apps are executed in Android.

 

From Java code to execution on your Android device

 

Let's take a look at a simple Java method, and watch how this code is transformed into a sequence of A64 instructions.

 

public int doSomething(int a, int b) {
  if (a > b) {
    return (a * 2);
  }
  return (a + b);
}

 

In Android software development, all Java source files are first compiled to Java bytecode, using the standard javac tool. The Java bytecode format (JVM bytecode) used by Java VMs is not the same as the bytecode used in ART, however. The dx tool is used to translate from JVM bytecode to the executable bytecode used by ART, which is called DEX (Dalvik EXecutable, a holdover from when the VM was called Dalvik.) The DEX code for this Java code looks like:

 

0000: if-le v2, v3, 0005
0002: mul-int/lit8 v0, v2, #int 2
0004: return v0
0005: add-int v0, v2, v3
0007: goto 0004

 

In this case, the virtual registers v2 and v3 are the method's parameters, a and b, respectively. For a good reference on DEX bytecode, you can consult this document, but essentially this code compares a to b, and if a is less-than-or-equal-to b it adds a to b and returns that result. Otherwise, it multiplies a by 2 and returns that.

 

When ART loads this code, it typically compiles the bytecode using the quick backend. This compilation will produce a function that roughly follows the ARM Architecture Procedure Call Standard (AAPCS) used with A64 code - it will expect to find its arguments in r2 and r3*, and will return the correct result in r0. Here is the A64 code that the quick backend will produce, with some simplifications:

 

  // Reminder: w2 is the 32-bit view of register r2 in A64 code!
  [-- omitted saving of registers w20-w22 to the stack --]
  mov w21, w2
  mov w22, w3
  cmp w21, w22
  b.le doAdd
  lsl w20, w21, #1  // (NB: this is w21 * 2)
doLeave:
  mov w0, w20
  [-- omitted loading of registers w20-w22 from the stack --]
  ret
doAdd:
  add w20, w21, w22
  b doLeave

 

*(Why not r0 and r1? Because r0 is reserved for passing the context of the method that is currently being executed. r1 is used for the implicit first argument of any non-static method - the reference to the this object.)

 

Before code can be compiled or executed by any backend, the bytecode must always be verified.  Verification involves checking various properties of the bytecode to ensure it is safe to execute. For example, checking that the inputs to a mul-float bytecode are actually float values, or checking that a particular method can be executed from the class we are currently executing within. Many of these properties are checked when the program is compiled from Java source to DEX bytecode, resulting in compiler errors. However, it is important to perform full bytecode verification when apps are about to be executed, to defend against security exploits that target DEX manipulation.

 

Once verification has taken place at run time, ART will load the arguments for the method into the correct registers, and then jump straight to the native code. Alternatively, ART could use its interpreter to interpret the input DEX bytecode as Dalvik would traditionally have done before attempting JIT compilation. Any bytecode that is executed as native code should do the exact same thing when it is executed in the interpreter. This means that methods should return the same results and produce the same side-effects. We can use these requirements to test for flaws in the various backend implementations. We expect that any code that passes the initial verification should be compilable, and some aspects of compilation will actually rely on properties of the code that verification has proven. Contracts exist between the different stages of the VM, and we would like to be assured that there are no gaps between these contracts.

 

Fuzz testing

 

We have developed a fuzz tester for ART, that uses mutation-based fuzzing to create new test cases from already written Java programs. ART comes with an extensive test suite for testing the correctness of the VM, but with a mutation-based fuzz tester, we can use these provided tests as a base from which we can investigate more corner cases of the VM.

 

The majority of these test programs produce some kind of console output - or at the very least, output any encountered VM errors to the console. The test suite knows exactly what output each test should produce, so it runs the test, and confirms that the output has not changed. Mutation-based fuzzing means that we take a test program, and modify it slightly - this means that the output of the program may have changed, or the program may now produce an error. Since we no longer know what output to expect, we can instead use the fact that ART has multiple backends to verify that they all execute this program the same way. Note however that this approach is not foolproof, as it may be the case that all of the backends execute the program in the same, incorrect way. To overcome this, it is also possible to test program execution on the previous VM, Dalvik, as long as some known differences between the two VMs are tolerated (e.g. the messages they use to report errors.) As we increase the number of backends to test, the likelihood that they are all wrong in the same way should decrease.

 

[Diagram: the fuzzing and testing process]

 

This diagram shows the fuzzing and testing process. First, the fuzzer parses the DEX file format into a form such that it can apply various mutations to the code. It randomly selects a subset of the methods of the program to mutate, and for each one, it randomly selects a number of mutations to apply. The fuzzer produces a new, mutated DEX file with the mutated code, and then executes this program using the various backends of the ART VM.

 

Note that all backends pass through a single verifier, and that some backends have been simplified in this diagram - the quick and optimizing backends are technically split up into compilation and execution phases, while the interpreter only has an execution phase. Ultimately, the execution of the mutated DEX file should produce some kind of output from each backend, and we compare these outputs to find bugs. In this example, the fact that the optimizing backend produces "9" instead of "7" strongly suggests there is a bug with the way the optimizing backend has handled this mutated code.

 

So how do we do this fuzzing? A naive approach would be to take the DEX file and flip bits randomly to produce a mutated DEX file. However, this is likely to always produce a DEX file that fails to pass verification. A large part of the verification process is checking that the structure of the DEX file format is sound, and this includes a checksum in the file's header - randomly flipping bits in the whole file will almost certainly invalidate this checksum, and will also likely break some part of the file's structure. A better approach is to focus on applying minor mutations to the sections of the program that directly represent executable code.
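
To make the checksum point concrete, here is a minimal sketch of how a mutated in-memory DEX image's header checksum could be recomputed so that it is not rejected for that reason alone. It is an assumption-laden illustration: it relies on zlib's adler32() and on the DEX convention that the 4-byte checksum at offset 8 covers everything from offset 12 to the end of the file, and it is not necessarily how our fuzzer is implemented.

#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <zlib.h>

/* Recompute the Adler-32 header checksum of an in-memory DEX image after mutation. */
static void fix_dex_checksum(uint8_t *dex, size_t size)
{
    uLong sum = adler32(0L, Z_NULL, 0);              /* initial Adler-32 value */
    sum = adler32(sum, dex + 12, (uInt)(size - 12)); /* checksum covers offset 12..end */
    uint32_t checksum = (uint32_t) sum;
    memcpy(dex + 8, &checksum, sizeof(checksum));    /* store in header; assumes a little-endian host */
}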

 

Some examples of these minor mutations are as follows:

 

  • Swap two bytecodes: pick two bytecodes and swap them with each other.
  • Change the register used by a bytecode: pick one of the registers specified by a bytecode and change it.
  • Change an index into the type/field/method list: some bytecodes use an index into the list of methods, types or fields at the start of a DEX file. For example, new-instance v0, type@7 will create a new object with the type listed at index 7 of the type list and put it in v0. The mutation changes which type, field or method is selected.
  • Change the target of a branch bytecode: make a branch bytecode point to a new target, changing control flow.
  • Generate a random new bytecode: generate a new random bytecode and insert it at a random position, with randomly generated values for all of its operands.

 

We limit our mutations to a few simple changes to bytecodes that individually are unlikely to break the verification of the DEX file, but in combination may lead to differences in the way the program executes. At the same time, we do not want to ensure that every mutation results in a legal bytecode state, because we wish to search for holes in the verification of the program. Often holes in verification may lead to a compiler making an incorrect assumption about the code it is compiling, which will manifest as differences in output between the compiler and the interpreter.

 

Example of Bugs Found

 

Now we present one of the bugs that we have found and fixed in the Android Open Source Project's (AOSP) code base, using this fuzz testing strategy.

 

When presented with a bytecode that reads an instance field of an object, such as iget v0, v1, MyClass.status (this writes into v0 the value of the "status" field of the object referred to by v1) the verifier did not confirm that v1 actually contained a reference to an object.

 

Here's a sequence of bytecodes that creates a new MyClass instance, and sets the status field to its initial value + 1:

 

const v0, 1
new-instance v1, MyClass
invoke-direct {v1} void MyClass.<init>() // calling MyClass() constructor
iget v2, v1, MyClass.status
add-int v2, v0, v2
iput v2, v1, MyClass.status

 

If a mutation changed the v1 on line 4 to v0, then iget would now have the constant 1 currently in v0 as an input, instead of the reference to an object that was in v1.  Previously, the verifier would not report this as an error when it should, and so the compiler (which expects the iget bytecode to have been properly verified) would expect an object reference to be in the input register for iget, and just read from the value of that reference plus the offset of the status field. If an attacker ensured that an address they wanted to read from was used as the loaded constant, they could read from any memory address in the process' address space. Java removes the ability to read memory directly (without the use of some mechanism such as JNI), to ensure that, for instance, private fields of classes cannot be accessed from within Java, but this bug allowed this to happen.

 

While this particular bug was present in the verifier, other bugs have been found and fixed in the quick backend of ART. For some of these bugs, we have contributed patches to the AOSP code base, while other bugs have been reported to the ART team. As a result of our fuzz testing efforts, new tests have been added to ART's test suite that are buildable directly from a description of DEX bytecode, whereas previously all tests had to be built from Java source code. This was necessary because many bugs we have found arise from specially crafted pieces of bytecode that the javac and dx tools would not generate themselves. We have aimed to submit DEX bytecode tests with any patches we submit to AOSP.

 

Conclusion

 

In this post we have looked at how fuzz testing can help the development of new backends for a virtual machine, specifically the ART VM that now powers Android.  From the roughly 200 test programs already present in ART's test suite, we have produced a significantly larger number of new tests using fuzzing. Each additional program used for testing increases our confidence that the implementation of ART is sound.  Most of the bugs we found affected the quick backend of ART as it was being developed in AOSP, but as new bugs could arise from complicated interactions between optimisations in the optimizing backend, the use of fuzz testing will increase our chances of finding any bugs and squashing them early.

 

Further Reading

 

The initial research into fuzzing was performed by Barton Miller at UW-Madison.

 

Paul Sabanal fuzzed the experimental release version of ART in KitKat, and found a few crashes. He presented this work at HITB2014.

 

For more information about differential testing, various papers have been written about Csmith, a tool that performs differential testing to test C compilers.

 

Researchers at UC Davis recently presented work about Equivalence Modulo Inputs, where seed programs are fuzzed to produce new programs that are expected to produce the same output as the seed program for a given set of inputs. All produced programs are then compiled and executed, and divergences in output indicate miscompilations.

In this blog I will cover various methods of runtime feature detection on CPUs implementing the ARMv8-A architecture. These methods include using HWCAP on Linux and Android, using the NDK on Android, and using /proc/cpuinfo. I will also provide sample code to detect the new optional features introduced in the ARMv8-A architecture. Before we dig into the different methods, let us understand more about ARMv8-A CPU features.

 

ARMv8-A CPU features

 

ARMv7-A CPU features

 

The ARMv8-A architecture has made many ARMv7-A optional features mandatory, including Advanced SIMD (also called NEON). This applies to both ARMv8-A execution states, namely AArch32 (the 32-bit execution state, backward compatible with ARMv7-A) and AArch64 (the 64-bit execution state).

 

New features

 

The ARMv8-A architecture introduces a new set of optional instructions, including AES. These instructions were not available in the ARMv7-A architecture. They are grouped into the categories listed below.

 

  • CRC32 instructions - CRC32B, CRC32H, CRC32W, CRC32X, CRC32CB, CRC32CH, CRC32CW, and CRC32CX
  • SHA1 instructions - SHA1C, SHA1P, SHA1M, SHA1H, SHA1SU0, and SHA1SU1
  • SHA2 instructions - SHA256H, SHA256H2, SHA256SU0, and SHA256SU1
  • AES instructions - AESE, AESD, AESMC, and AESIMC
  • PMULL instructions that operate on 64-bit data - PMULL and PMULL2

 

Runtime CPU feature detection scenarios

 

User-space programs can detect features supported by an ARMv8-A CPU at runtime, using many mechanisms including /proc/cpuinfo, HWCAP and the Android NDK CPU feature API.  I will describe them in detail below.

 

Detect CPU feature using /proc/cpuinfo

 

Parsing /proc/cpuinfo is a popular way to detect CPU features. However, I strongly recommend not using /proc/cpuinfo for CPU feature detection on ARMv8-A, as it is not a portable mechanism. /proc/cpuinfo reflects the characteristics of the kernel rather than of the application being executed. This means that /proc/cpuinfo is the same for both 32-bit and 64-bit processes running on an ARMv8-A 64-bit kernel. The ARMv8-A 64-bit kernel's /proc/cpuinfo output is also quite different from that of an ARMv7-A 32-bit kernel. For example, the ARMv8-A 64-bit kernel uses 'asimd' to report Advanced SIMD support, while the ARMv7-A 32-bit kernel uses 'neon'. Thus, NEON detection code that looks for the "neon" string in /proc/cpuinfo will not work on an ARMv8-A 64-bit kernel. Applications using /proc/cpuinfo should migrate to HWCAP or the NDK API, as these are maintained and controlled interfaces, unlike /proc/cpuinfo.

 

Detect CPU feature using HWCAP

 

HWCAP can be used on ARMv8-A processors to detect CPU features at runtime.

 

HWCAP and Auxiliary vector

 

First, let me give you a brief overview of HWCAP. HWCAP uses the auxiliary vector feature provided by the Linux kernel. The Linux kernel's ELF binary loader uses the auxiliary vector to pass certain OS and architecture specific information to user space programs. Each entry in the vector consists of two items: the first identifies the type of entry, the second provides the value for that type. Processes can access these auxiliary vectors through the getauxval() API call.

 

getauxval() is a library function available to user space programs to retrieve a value from the auxiliary vector. This function is supported by both bionic (Android's libc library) and glibc (GNU libc library).  The prototype of this function is unsigned long getauxval(unsigned long type); Given the argument type, getauxval() returns the corresponding value.

 

<sys/auxv.h> defines various vector types. Amongst them, AT_HWCAP and AT_HWCAP2 are of our interest. These auxiliary vector types specify processor capabilities. For these types, getauxval() returns a bit-mask with different bits indicating various processor capabilities.

 

HWCAP and ARMv8-A

 

Let us look at how HWCAP can be used on ARMv8-A. In ARMv8-A, the values returned by AT_HWCAP and AT_HWCAP2 depend on the execution state. For AArch32 (32-bit processes), AT_HWCAP provides flags specific to ARMv7 and prior architectures, NEON for example, while AT_HWCAP2 provides ARMv8-A related flags such as AES and CRC. In the case of AArch64, AT_HWCAP provides the ARMv8-A related flags such as AES, and the AT_HWCAP2 bit-space is not used.

 

Benefits of HWCAP

 

One of the main benefits of using HWCAP over other mechanisms like /proc/cpuinfo is portability. Existing ARMv7-A programs that use HWCAP to detect features like NEON will run as is on ARMv8-A, without any change. Since getauxval() is supported on Linux (through glibc) and Android (through bionic), the same code can run on both Android and Linux.
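
As an illustration of that portability point, the kind of AArch32 NEON check such a program would contain looks like the sketch below (an assumption: the HWCAP_NEON bit comes from <asm/hwcap.h> in a 32-bit ARM Linux/Android toolchain). The same source also runs unchanged as a 32-bit process on an ARMv8-A kernel.

#include <stdio.h>
#include <sys/auxv.h>
#include <asm/hwcap.h>

int main()
{
    /* In a 32-bit ARM process, AT_HWCAP carries the ARMv7-era flags such as NEON. */
    if(getauxval(AT_HWCAP) & HWCAP_NEON){
        printf("NEON/Advanced SIMD is available\n");
    }
    return 0;
}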

 

Sample code for AArch32 state

 

The sample code below shows how to detect CPU features using AT_HWCAP in the AArch32 state.

 

#include <stdio.h>
#include <sys/auxv.h>
#include <asm/hwcap.h>

int main()
{
    long hwcaps2 = getauxval(AT_HWCAP2);

    if(hwcaps2 & HWCAP2_AES){
        printf("AES instructions are available\n");
    }
    if(hwcaps2 & HWCAP2_CRC32){
        printf("CRC32 instructions are available\n");
    }
    if(hwcaps2 & HWCAP2_PMULL){
        printf("PMULL/PMULL2 instructions that operate on 64-bit data are available\n");
    }
    if(hwcaps2 & HWCAP2_SHA1){
        printf("SHA1 instructions are available\n");
    }
    if(hwcaps2 & HWCAP2_SHA2){
        printf("SHA2 instructions are available\n");
    }
    return 0;
}

 

Sample code for AArch64 state

 

The code below shows how to detect ARMv8-A CPU features in an AArch64 process using HWCAP.

 

#include <stdio.h>
#include <sys/auxv.h>
#include <asm/hwcap.h>

int main()
{
    long hwcaps = getauxval(AT_HWCAP);

    if(hwcaps & HWCAP_AES){
        printf("AES instructions are available\n");
    }
    if(hwcaps & HWCAP_CRC32){
        printf("CRC32 instructions are available\n");
    }
    if(hwcaps & HWCAP_PMULL){
        printf("PMULL/PMULL2 instructions that operate on 64-bit data are available\n");
    }
    if(hwcaps & HWCAP_SHA1){
        printf("SHA1 instructions are available\n");
    }
    if(hwcaps & HWCAP_SHA2){
        printf("SHA2 instructions are available\n");
    }
    return 0;
}

 

Detect CPU feature using Android NDK CPU feature API

 

The Android NDK provides an API to detect the CPU architecture family and the supported features at run time.

 

CPU feature API

 

There are two main functions, android_getCpuFamily() and android_getCpuFeatures().

 

  • android_getCpuFamily() - Returns the CPU family
  • android_getCpuFeatures() - Returns a bitmap describing a set of supported optional CPU features. The exact flags will depend on CPU family returned by android_getCpuFamily(). These flags are defined in cpu-features.h

 

Support for ARMv8-A optional features

 

The latest NDK release (version 10b, September 2014) supports ARMv8-A CPU feature detection only in the AArch64 mode. However, the NDK project in AOSP supports both the AArch32 and the AArch64 CPU feature flags; the AArch32 feature flags were added to AOSP in change list 106360. The NDK uses HWCAP internally to detect the CPU features.

 

NDK sample code to detect ARMv8-A CPU features

 

Detect CPU family

 

#include <stdio.h>
#include "cpu-features.h"

int main()
{
    AndroidCpuFamily family;
    family = android_getCpuFamily();
    if(family == ANDROID_CPU_FAMILY_ARM){
        printf("CPU family is ANDROID_CPU_FAMILY_ARM \n");
    } else if(family == ANDROID_CPU_FAMILY_ARM64){
        printf("CPU family is ANDROID_CPU_FAMILY_ARM64 \n");
    } else {
        printf("CPU family is %d \n", family);
    }
    return 0;
}

 

Detect ARMv8-A CPU features

 

#include <stdio.h>
#include "cpu-features.h"

void printArm64Features(){
    uint64_t features;
    features = android_getCpuFeatures();
    if(features & ANDROID_CPU_ARM64_FEATURE_AES){
        printf("AES instructions are available\n");
    }
    if(features & ANDROID_CPU_ARM64_FEATURE_PMULL){
        printf("PMULL instructions, that operate on 64-bit data, are available\n");
    }
    if(features & ANDROID_CPU_ARM64_FEATURE_SHA1){
        printf("SHA1 instructions are available\n");
    }
    if(features & ANDROID_CPU_ARM64_FEATURE_SHA2){
        printf("SHA2 instructions are available\n");
    }
    if(features & ANDROID_CPU_ARM64_FEATURE_CRC32){
        printf("CRC32 instructions are available\n");
    }
}

void printArmFeatures(){
    uint64_t features;
    features = android_getCpuFeatures();
    if(features & ANDROID_CPU_ARM_FEATURE_AES){
        printf("AES instructions are available\n");
    }
    if(features & ANDROID_CPU_ARM_FEATURE_PMULL){
        printf("PMULL instructions, that operate on 64-bit data, are available\n");
    }
    if(features & ANDROID_CPU_ARM_FEATURE_SHA1){
        printf("SHA1 instructions are available\n");
    }
    if(features & ANDROID_CPU_ARM_FEATURE_SHA2){
        printf("SHA2 instructions are available\n");
    }
    if(features & ANDROID_CPU_ARM_FEATURE_CRC32){
        printf("CRC32 instructions are available\n");
    }
}

int main(){
    AndroidCpuFamily family;
    family = android_getCpuFamily();
    if(family == ANDROID_CPU_FAMILY_ARM){
        printArmFeatures();
    }
    if(family == ANDROID_CPU_FAMILY_ARM64){
        printArm64Features();
    }
    return 0;
}

 

Conclusion

 

The ARMv8-A architecture makes certain ARMv7-A features mandatory and introduces a new set of optional features. The popular way of detecting the features at runtime by parsing /proc/cpuinfo is not portable to ARMv8-A and existing code will not work without tricky changes. Instead, application programmers can easily use HWCAP on Linux and the NDK on Android. For detecting ARMv8-A optional features in the AArch32 mode, programmers should use HWCAP on Android as the latest NDK does not have support for it yet.

The recent Linaro Connect (http://www.linaro.org/connect/lcu/lcu14/) saw several ARM and Linaro presentations about Android and about 64-bit. I think these might be interesting to anyone following Android, ARMv8, AArch64 or 64-bit progress in mobile.

 

First is Serban Constantinescu presenting the journey involved in getting AOSP running first on a 64-bit kernel (in 2012) and then booting with a 64-bit userspace - all on ARM Fast Models:

LCU14 411: From zero to booting nandroid with 64bit support - YouTube

 

Next is Stuart Monteith with the story of porting Dalvik to 64-bit - and how Dalvik and ART are related:

LCU14-100: Dalvik is Dead, Long Live Dalvik! OR Tuning ART - YouTube

 

Then a presentation by Ashok Bhat on some collaborative work between Linaro and ARM on creating multimedia tests to help with porting several Android codecs to 64-bit:

LCU14-502: Android User-Space Tests: Multimedia codec tests, Status and Open Discussions - YouTube

 

Finally, a presentation by Kevin Petit on ARMv8 NEON and the use of intrinsics:

LCU14-504: Taming ARMv8 NEON: from theory to benchmark results - YouTube

 

Hopefully, for those who prefer reading to watching, we will be able to post some blogs on the topics soon.

A few years ago (20?), I bought a programmable calculator and downloaded a program (from a "Bulletin Board" in Europe) to do symbolic Z-transform expansions for a digital signal processing test I had in college. I finished my test in a few minutes and was immediately handed back my test with a "perfect!" and a 0 for a grade. When I explained that I had downloaded a program to my calculator from a site in Europe, I got "right...". After a 30-second demo (and an explanation of how the code worked), the zero had a 10 put in front of it and that professor became my advisor.

 

Since then, billions of people have been downloading apps through an open source VM (which I actually wrote some code for) called "Android". A couple of years ago, I decided to start working on another open source VM I call "rekam1" - mirrowrite(rekam1); I'll be demoing some consumer-programmable projects with it at the World Maker Faire in NYC (check it out if you happen to be in the area), and I'll be talking about virtual machines for wirelessly connected Cortex-M devices at the upcoming TechCon conference in my talk, "The Consumer Programmable IoT". If you're interested in seeing how the maker (and consumer developer) community could change how we all write and share code, check out my talk!

 

ARM TechCon Schedule Builder | Session: The Consumer Programmable IoT

Eirik Aavitsland at Digia has created a blog post about how you can easily make an ODROID-U3 or another device running a recent version of Android boot to Qt.

 

This blog has been written before, but quite a few things have improved in the ease of use and breadth of support of Streamline on Android in the past few years. For starters, Mac OS X is now well supported. All three major development platforms (Linux, Windows and Mac) can run the DS-5 Community Edition debugger (gdb) and Streamline, either with the ADT Eclipse tools from Google plus DS-5 CE as an add-on, or pre-packaged as DS-5 CE for Windows and Linux from ARM with ADT as an add-on. Also, and most welcome, is the new gator driver. The component of Streamline that runs on Android to collect OS and processor counters used to require both a kernel module and a driver daemon, and compiling and flashing any module could be complicated depending on the availability of your Android platform kernel headers. That requirement has been removed, and the gator daemon will now run as root on many devices. This July (7/2014), an updated version of gatord in DS-5 CE 5.19 will be released that greatly expands the kernel versions supported (beyond the 3.12 kernel version supported in the current DS-5 5.18 release). Finally, I have found some erroneous and dated info in some blogs that claim to be up to date with DS-5 5.18 and even the yet-to-be-released 5.19; I'll try to correct that here.

 

Streamline is a powerful system analysis tool that will help you speed up your code, reduce your energy footprint and balance system resources. The free version in the Community Edition of DS-5 lets you view CPU and OS counters in a powerful graphical view: CPU and GPU activity, cache hits and misses, and visibility down into individual threads and modules. You can find code that is blocking or that could be optimized by multithreading or by refactoring in NEON or on the GPU. Check out more features on the optimize site.

 

 

Getting Started:

 

As of this writing the Android SDK Manager is Revision 22.6.4 bundled in the latest SDK for Mac, adt-bundle-mac-x86_64-20140321. The SDK is available at the Android Developer Site. The Native Development Kit (NDK) is revision 9d. Download both of these for your appropriate platform. I’m downloading the Mac OS X 64-bit versions for this guide but these instructions should work for Windows and Linux just as easily.

 

Once you unpack these tools, you should add some executable paths to your platform if you plan on using the terminal for anything like the Android Debug Bridge (adb). It is now possible to use all of the tools from within Eclipse without adjusting your executable paths, but for some of us old-schoolers who are wedded to the CLI, I drop my NDK folder into the SDK folder and put that folder in my Mac's /Applications directory. You can place them wherever you like on most platforms though. I then added these to my ~/.bashrc:

 

export PATH=$PATH:/Applications/adt-bundle-mac-x86_64-20140321/sdk/platform-tools

export PATH=$PATH:/Applications/adt-bundle-mac-x86_64-20140321/sdk/tools

export PATH=$PATH:/Applications/adt-bundle-mac-x86_64-20140321/android-ndk-r9d

 

You should now be able to launch common Android tools from your command line:

> which ndk-build

/Applications/adt-bundle-mac-x86_64-20140321/android-ndk-r9d/ndk-build

> which fastboot

/Applications/adt-bundle-mac-x86_64-20140321/sdk/platform-tools/fastboot

> which adb

/Applications/adt-bundle-mac-x86_64-20140321/sdk/platform-tools/adb

> which android

/Applications/adt-bundle-mac-x86_64-20140321/sdk/tools/android

 

You can Launch the Android SDK Manager from Eclipse in the “Window” menu or via the command line by typing:

> android

 

From there, you can update your current SDK, install older APIs, build-tools, platform tools and in “Extras”, the Android Support Library for compatibility with older APIs.


When you run Eclipse (ADT) for the first time or change versions, you may have to tell it where to find the SDK. The Preferences dialog box is found on Macs via the ADT->Preferences menu, sub heading Android.


Setting up a demo app to analyze (if you don’t have your own app):

 

You probably have your own library or application you want to perform system analysis on but just in case you’re checking out the tool, I’ll step through setting up an app that is near and dear to me, ProjectNe10. You can grab the master branch archive from GitHub. For this tool demo, I’ve created a directory /workspace and unzipped the Ne10 archive inside that folder. ProjectNe10 requires the cmake utility. Fortunately, there is a Homebrew solution to install cmake from the command line:

 

brew install cmake

 

If you don’t have brew installed, install it. You’ll use it in the future, I promise. You can also just download the binary for any platform from cmake.

Now we can build the Ne10 library from the command line:

 

Set these to your particular paths:

 

export NE10PATH=/workspace/projectNe10

export ANDROID_NDK=/Applications/adt-bundle-mac-x86_64-20140321/android-ndk-r9d

 

Then:

 

cd $NE10PATH

mkdir build && cd build

cmake -DCMAKE_TOOLCHAIN_FILE=../android/android_config.cmake ..

make

make install

 

That make install line will copy libNE10_test_demo.so to your /workspace/projectNe10/android/NE10Demo equivalent. Now you can go to the File->Import menu in Eclipse and import an existing Android code base into your workspace.

 



 

If all goes well, you should be able to connect your ARM based Android Device (in my case, a Nexus 5 running Android 4.4.4 to match the current SDK at the time of this writing) and run this app from the Run menu as an Android app. As a sanity check, you should run adb devices from the command line to verify you can see your device. This app will iterate through every function in the ProjectNe10 library with both C and NEON implementations. One of the implementations should be faster. I’ll give you a hint. It is the NEON implementation.

 

 

Installing DS-5 Community Edition (Free Eclipse Plugin with enhanced ARM debug and system analysis):

 

Start Eclipse and go to the menu Help->Install New Software.... Click on "Add...", paste http://tools.arm.com/eclipse in the location text box, then click OK. Select ARM DS-5 Community Edition and click Next. Eclipse will compute the dependencies of the DS-5 CE plug-ins.


 

Click Next again. Read the license agreements and if you accept, hit Finish. After the install is complete, ADT will ask you to reload.

A license dialog should popup if this is a fresh install. Select "Install Community edition license" and click "Continue".

 

If there was no popup license message go to Help->Generate community edition license, and click "Finish".

 

Congratulations, you now have ARM DS-5 CE installed, with its enhanced and easy-to-use debugger which you can use to debug Android NDK apps and libraries with the steps in this guide. You also have Streamline, a powerful system analysis tool, which we'll cover in the next section.

 

Using Streamline and gator to analyze Android apps and the entire system

 

Before you can gather data for system analysis, you have to install a data-collecting driver (daemon) on the Android device. gatord gathers processor and kernel counters on the Android device and streams them over to your host machine. It must run as root to do this. Any device with an unlocked boot loader is very simple to root: you usually just flash a custom recovery tool like TWRP and install SuperSU. If you have a locked bootloader, you'll have to use a device exploit, so I can't recommend this or help you, but your favorite search engine might… This is a minor inconvenience now; older versions required a kernel module (gator.ko) which needed to be compiled against your particular device's kernel headers, and now that the Android security requirements for passing the Android CTS disallow loadable kernel modules, you would have to compile gator into the kernel and flash it. Fortunately, the new gatord will expand its kernel version support significantly in July.

 

First, build gatord. Go to the menu Help->ARM Extras… this will open up a folder with several goodies in it.


 

I'm going to build this from the command line, so fire up your favorite terminal and cd into this directory. The easiest way in the Mac terminal app is to type "cd " and drag the gator folder into the terminal window. OS X will fill in the path. Then:

 

cd daemon-src

tar zxf gator-daemon.tar.gz

mv gator-daemon jni

cd jni

ndk-build

 

These steps should unzip the gatord source, and build it for Android (dynamically linked) with the output in ../libs/armeabi/gatord. Copy this binary to your Android device with your favorite method, AirDroid, scp, Droid NAS or very simply:

 

adb push ../libs/armeabi/gatord /sdcard/gatord

 

This, of course, assumes you've enabled developer options and debugging on your device. "On Android 4.2 and newer, Developer options is hidden by default. To make it available, go to Settings > About phone and tap Build number seven times. Return to the previous screen to find Developer options." In Developer options, enable USB debugging. If this is a new device, you may have to approve the debug link security prompt the first time you try to use adb. You can also do this with an ARM-based Android Virtual Device (AVD) in the emulator if your physical device is too 'locked down', but the Streamline system data won't be as useful. You may have to use "mount -o rw,remount rootfs /" and "chmod 777 /mnt/sdcard" in your AVD to push gatord.

 

Now, the tricky part: you have to move this binary to an executable location in the filesystem and set executable permissions. The most reliable method I've used is ES File Explorer. Go into the menu, turn on Root Explorer, go to the Mount R/W option and set root "/" as RW (read/writable) rather than RO. Then copy and paste gatord into /system/bin in your Android filesystem. You can also set the permissions to executable in ES File Explorer by long-pressing on the gatord file, then More->Properties->Permissions->Change. Give the owner and group Execute permission and press OK.

 

Back in your host machine terminal you need to set up a pipe for gator to communicate over USB and then get a shell on the device to start it:

 

adb forward tcp:8080 tcp:8080

adb shell

 

Now you’ve got a shell on your android device, you can su to root and start gatord. Type:

 

su

/system/bin/gatord&

 

The rest is pretty straightforward. Go to Window->Show View->Other…->DS-5->ARM Streamline Data.

Click on the gear button


 

 

In the address section, enter “localhost” if you’re streaming the capture data over USB using adb to forward the TCP port. In the Program Images box select the shared library that you want to profile (add ELF image from workspace).


 

 

 

You can now use the red “Start Capture” button at any time.


Other blogs and tutorials are accurate from this point forward on the features and use of Streamline, so I'll drop a few links and let you get to it!

The “CAPTURING DATA AND VIEWING THE ARM STREAMLINE REPORT” section of this blog is accurate.

Events based sampling video, analyzing CPU and GPU performance and customizing charts on YouTube.

 

At VIA we are aiming to provide more software support for our products. Most of our ARM-based boards have both Linux and Android images that potential partners can try. On our product line we have two Freescale-based boards, the VAB-800 (single-core Cortex-A8) and the newer VAB-820 (quad-core Cortex-A9). The latter has just received a brand new Android Evaluation Package, now up on our website and ready for testing.

 

The Android image is based on Jelly Bean 4.2.2 (which puts this package ahead of even our other boards). Among the available features are the CAN bus driver, resistive touch screen, HDMI video and audio output, dual display, and mini PCI-E support. On the developer timeline for future releases we have ADV-7180 capture, watchdog/GPIO, and VIA Smart ETK support for embedded solutions.

 

Android on Freescale is still quite a new combination with a lot of potential. We are hoping that it will make device developers' lives easier on both the software and hardware side. This evaluation package is just the beginning of this conversation.


What do you think, what would make you choose an Android system over others for your next embedded solution?

 

The Android Evaluation Package (as well as the Linux Evaluation Package) is available on the VIA Embedded website.

There is a very active overclocking community called HWBot, with quite a few organizers here in Taiwan. For a very long time they have been doing desktop (and laptop?) overclocking, challenging the hardware and pushing the boundaries. When they gave a presentation about their past and future plans, they were really proud of making the desktop computer industry care more about hardware quality.

 

Now they want to do the same thing for smartphone hardware. They recently released the beta version of their Android benchmarking app, HWBot Prime, and started to gather data for different devices. My HTC Butterfly (running a quad-core Snapdragon S4, I guess) did pretty well on it (whenever I could kill enough apps not to interfere with the benchmark).

 

The VIA Springboard (which I'm taking care of) is a single-board computer that can also run Android (4.0.3) besides Linux. It has a single-core 800MHz WM8950 CPU, so it is no match for the Butterfly, but the per-core, per-MHz results are better.


So far it's benchmarking only, without any overclocking yet, and I'm running the stock Android image, but it's a good baseline to start improving on. You can't manage what you can't measure, right?

 

The submitted result for the Springboard is on the HWBot leaderboard. I wonder if anyone else wants to benchmark, tune, and overclock their ARM devices?

 

The whole experience is written up in the VIA Springboard Blog.

As a result of the rapid proliferation of Android smart phones and tablets, embedded developers worldwide are increasingly adopting the operating system for a growing number of embedded systems and connected devices that leverage its rich application framework, native multimedia capabilities, massive app ecosystem, familiar user interface, and faster time to market.

 

However, although the benefits of adopting Android for embedded systems and devices can be great, particularly for touch-based multimedia applications, utilizing the OS also presents a number of critical challenges, including selecting the right ARM SoC platform for the target system application, porting and customizing the operating system and applications, and ensuring tight integration between the hardware and software to deliver a compelling end-user experience.

 


 

In addition to exploring the benefits and challenges of adopting Android for embedded applications, the attached white paper provides an overview of the holistic approach that VIA Embedded has established in order to enable developers to reduce product development times and speed up time to market for innovative new embedded Android systems and devices.

 

Holistic Approach

VIA is committed to supporting the entire product development life cycle, from defining product requirements all the way through development:

  • Best of Breed application specific ARM SoC platforms, with a comprehensive range of Freescale and VIA ARM SoCs
  • Small form factor ARM boards and systems, using VIA's expertise in creating practical form factor standards
  • Android software packages and customization services (see below)
  • Longevity, by supporting specific boards and systems up to 5 years

 

Android Customization

 

VIA Embedded provides a wide range of software solution packages and customization services to facilitate the development of Android embedded systems and devices:

  • Customized applications, including system apps
  • Kernel & Framework including security and special devices
  • System management including watchdog, remote monitoring, remote power on/off, silencing app and system upgrades
  • Embedded I/O including legacy I/O

 

VIA Android Smart Embedded Tool Kit (ETK)

 

The VIA Embedded Android Smart ETK includes a set of APIs that enable Android applications to access I/O and manageability services provided by the system hardware that are not supported in the standard Android framework.

 

APIs include:

  • Watchdog to help applications and the system to recover from failures and breakdowns
  • Scheduled power on/off, and periodic reboots
  • RTC Wake-up to auto power on at a specific time of the day, of the week, or of the month.
  • Legacy I/O Support making RS232, GPIO, I2C, and CAN bus available for apps

 

 

More details, and a case study is presented in the attached whitepaper.

We would like to share with you an Android-based application that enables Intel software to run on ARM-based devices. To demonstrate this approach, we took the Intel version of DOOM and ran it on an Android-based device. The application is freely available on Google Play:

Original DOOM - Android Apps on Google Play


This application is a mixture of virtualization and binary translation technology that translates Intel x86 code to ARM code at run time. Along with translation, the engine applies sophisticated optimization algorithms to bring a high-performance experience to end users.


At the end of the day, we can bring desktop applications to mobile devices at no cost. In this particular case we took the original Intel x86 version of DOOM and launched it with no alterations or modifications on an ARM-based Android device. In the near future this approach could be extended to other applications.


For more details visit http://eltechs.com/exagear-mobile/


 


yangzhang

Ne10 FFT feature

Posted by yangzhang Dec 18, 2013

FFT feature in ProjectNe10

1 Introduction

Project Ne10 recently received an updated version of the FFT, which is heavily NEON-optimized for both ARM v7-A/v8-A AArch32 and v8-A AArch64 and is faster than almost all other existing open source FFT implementations, such as FFTW and the FFT routines in OpenMax DL. This article introduces it briefly.

2 Performance comparison with some other FFT’s on ARM v7-A

The following chart illustrates the benchmarking results of the complex FFT (32-bit float data type) of Ne10, FFTW and OpenMax DL. The test platform is an ARM Cortex-A9. The x-axis of the chart represents the length of the FFT; the y-axis represents the execution time of the FFT. Smaller is better.

From this chart, we can see that Ne10 is better than FFTW and OpenMax DL in most cases.

3 FFT on ARM v7-A/v8-A AArch32 and ARM v8-A AArch64

3.1 NEON usage

 

To utilize the NEON accelerator, we usually have two choices:

  • NEON assembly
  • NEON intrinsic

The following table describes the pros and cons of using assembly/intrinsic.

 

NEON assembly:

  • Performance: always shows the best performance for the specified platform.
  • Portability: different ISAs (i.e. ARM v7-A/v8-A AArch32 and ARM v8-A AArch64) need different assembly implementations. Even for the same ISA, the assembly might need to be fine-tuned to achieve ideal performance across different micro-architectures.
  • Maintainability: hard to read/write compared with C.

NEON intrinsics:

  • Performance: depends heavily on the toolchain that is used.
  • Portability: program once and run on different ISAs. The compiler may also grant performance fine-tuning for different micro-architectures.
  • Maintainability: similar to C code, easy to read/write.

3.2 ARM v7-A/v8-A AArch32 and v8-A AArch64 FFT implementations

According to the aforementioned pros/cons comparison, intrinsics are preferred for the implementation of the Ne10 library.

But for the FFT, we still have different implementations for ARM v7-A/v8-A AArch32 and ARM v8-A AArch64, for the reason described below:

// radix-4 butterfly with twiddles
scratch[0].r = scratch_in[0].r;
scratch[0].i = scratch_in[0].i;
scratch[1].r = scratch_in[1].r * scratch_tw[0].r - scratch_in[1].i * scratch_tw[0].i;
scratch[1].i = scratch_in[1].i * scratch_tw[0].r + scratch_in[1].r * scratch_tw[0].i;
scratch[2].r = scratch_in[2].r * scratch_tw[1].r - scratch_in[2].i * scratch_tw[1].i;
scratch[2].i = scratch_in[2].i * scratch_tw[1].r + scratch_in[2].r * scratch_tw[1].i;
scratch[3].r = scratch_in[3].r * scratch_tw[2].r - scratch_in[3].i * scratch_tw[2].i;
scratch[3].i = scratch_in[3].i * scratch_tw[2].r + scratch_in[3].r * scratch_tw[2].i;

The above code snippet shows the basic element of the FFT: the radix-4 butterfly. From the code, we can conclude that:

  • 20 64-bit NEON registers are needed if 2 radix4 butterflies are executed in one loop.
  • 20 128-bit NEON registers are needed if 4 radix4 butterflies are executed in one loop.

And, for ARM v7-A/v8-A AArch32 and v8-A AArch64,

  • There are 32 64-bit or 16 128-bit NEON registers for ARM v7-A/v8-A AArch32.
  • There are 32 128-bit NEON registers for ARM v8-A AArch64.

 

Considering the above factors, in practice the Ne10 implementation ends up with an assembly version for ARM v7-A/v8-A AArch32, in which 2 radix-4 butterflies are executed per loop iteration, and an intrinsic version for ARM v8-A AArch64, in which 4 radix-4 butterflies are executed per loop iteration.
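
To make the intrinsic flavour concrete, here is a minimal sketch of the twiddle multiply from the butterfly above written with NEON intrinsics, processing four complex values per call. It is only an illustration: it assumes a deinterleaved real/imaginary layout and is not Ne10's actual internal implementation.

#include <arm_neon.h>

/* c = a * tw for four complex numbers at once, split into real and imaginary vectors. */
static inline void cmul4(float32x4_t ar, float32x4_t ai,
                         float32x4_t twr, float32x4_t twi,
                         float32x4_t *cr, float32x4_t *ci)
{
    *cr = vmlsq_f32(vmulq_f32(ar, twr), ai, twi); /* ar*twr - ai*twi */
    *ci = vmlaq_f32(vmulq_f32(ai, twr), ar, twi); /* ai*twr + ar*twi */
}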

 

3.3 C/NEON performance boost

The following charts show the C/NEON performance boost in the ARM v8-A AArch32 and AArch64 modes on the same Cortex-A53 CPU on Juno. Larger is better.

All the blue bars show the data in the AArch32 mode. The NEON code is v7-A/v8-A AArch32 assembly. The toolchain used is gcc 4.9.

All the red bars show the data in the AArch64 mode. The NEON code is intrinsics. The performance of intrinsics depends greatly on the toolchain; the toolchain used here is LLVM 3.5.

From these charts, we can conclude that the float complex FFT shows a similar or better performance boost in the AArch64 mode than in the AArch32 mode. But for the int32/int16 complex FFT, the performance boost in the AArch32 mode is usually better than in the AArch64 mode (this doesn't mean the int32/int16 complex FFT runs faster in the AArch32 mode than in the AArch64 mode!).

The data from this exercise is useful for analyzing the performance boost in the ARM v8-A AArch64 mode, but we still need more data to verify and reinforce these conclusions.

3.4 AArch32/AArch64 performance boost

The following charts take the performance of the AArch32 C version as the baseline and show the performance ratios of the AArch32 NEON version, the AArch64 C version, and the AArch64 NEON version on the same Cortex-A53 CPU on Juno. Larger is better.

From these charts, we can conclude that the FFT performs faster in the AArch64 mode than in the AArch32 mode, for both C and NEON.

4 Usage

4.1 APIs

The FFT still supports the following features:

  • c2c FFT/IFFT: data type float/int32/int16, length 2^N (N is 2, 3, ...)
  • r2c FFT: data type float/int32/int16, length 2^N (N is 3, 4, ...)
  • c2r IFFT: data type float/int32/int16, length 2^N (N is 3, 4, ...)

But the APIs have changed. Existing users need to update to the latest version, v1.1.2, or to master.

For more API details, please check http://projectne10.github.io/Ne10/doc/group__C2C__FFT__IFFT.html.

 

4.2 Example

Taking the float c2c FFT/IFFT as an example, the current APIs are used as follows.

#include "NE10.h"
……
{
    fftSize = 2^N; // N is 2, 3, 4, 5, 6...; i.e. fftSize must be a power of two (2^N is notation here, not C's XOR operator)

    in  = (ne10_fft_cpx_float32_t*) NE10_MALLOC (fftSize * sizeof (ne10_fft_cpx_float32_t));
    out = (ne10_fft_cpx_float32_t*) NE10_MALLOC (fftSize * sizeof (ne10_fft_cpx_float32_t));

    ne10_fft_cfg_float32_t cfg;
    cfg = ne10_fft_alloc_c2c_float32 (fftSize);
    ……

    //FFT
    ne10_fft_c2c_1d_float32_neon (out, in, cfg, 0);
    ……

    //IFFT
    ne10_fft_c2c_1d_float32_neon (out, in, cfg, 1);
    ……

    NE10_FREE (in);
    NE10_FREE (out);
    NE10_FREE (cfg);
}

5 Conclusion

The FFT shows that you can get a significant performance boost in the ARM v8-A AArch64 mode. You may find more use cases, of course. We welcome feedback and are looking to publish use cases to cross-promote ProjectNe10 and the projects that use it.

For more details, please access http://projectne10.github.com/Ne10/

In the early days of multicore design, Intel claimed that Android devices did not benefit from multiple cores as the apps often lacked threading. A lot has changed since then, and they have, of course, released a dual-core Atom for Android. Today it's hard to find a CPU in a widespread mobile device that isn't multicore, because most apps need to be multi-threaded.

 

Most ARM partners are producing dual- and quad-core devices, more than what Intel produces, so we maintain an advantage as Android becomes more and more multicore friendly. App developers therefore need to consider threading in their designs as Android becomes more intelligent about its power and thread management (e.g. ARM big.LITTLE in 64-bit SoCs).

 

With that in mind, I teamed up with Matthew Du Puy to develop a couple of articles about Android software design and the tools ARM offers developers. The first article, Android software design for multicore systems in Dalvik, looks at the Dalvik runtime environment that many apps are developed in. Dalvik applications are developed using the Android SDK, usually in the GUI environment provided by the Android Development Tools (ADT) plugin for Eclipse. In the second, we take a look at Android multicore design options in C/C++ or assembly language development.

 

Let us know what you think.