
Android Community





Mobile computing has never been more powerful. Enabled in large part by ARM® technology, it has become commonplace to carry a device with more processing power than a typical desktop machine had not so very long ago.


The Seamless Computing demonstration was conceived to explore some of the implications of this comparison. If a smartphone can computationally match a desktop, what is preventing us from using these devices in that paradigm? What functionality would a smartphone need to offer to overcome these barriers and become a true primary compute device, meeting all our needs through the day?

We decided to focus on a workplace desktop scenario – sitting in your office, using a device for typical productivity applications. Ideally, there would be a smooth transition from mobile operation to desktop mode. The user would walk into their office, sit down at their desk and almost immediately start using the device in that new context.


This scenario immediately implied a larger display, separate from the smartphone, along with a full-sized keyboard and mouse. Previous commercial products have aimed at similar use cases with mixed success. More recently, some Android™ enthusiasts have also experimented in this area – this video is particularly compelling. Both of these required a dock of some description, which introduced an immediate extra step into the use case – the user must dock the phone in addition to sitting at their desk. Additionally, the two links above featured either a distinct software environment for the desktop, or simply mirrored the mobile environment. The first creates a discontinuity in workflow. The second results in over-large icons and application layouts unsuitable for desktop working – sized instead for a smaller, touch-driven display.


With this in mind we identified the primary functionality of our demonstration:

  • Wirelessly pair all peripherals (input and output devices).
  • Reconfigure the UI – The same environment & apps, but with context appropriate UI layout.
  • Trigger the context change between mobile and desktop, without physically docking the device.


The remainder of this blog deals with the technical detail behind the implementation of these functional requirements.



We assume some basic knowledge of Android and the Android SDK in order to follow the discussion below. If you wish to replicate the full functionality of the demonstration, be aware that doing so will require root access to your device, and expert-level knowledge, as you will need to create a non-standard Android development environment. Both of these activities are undertaken at your own risk and we must recommend that you inform yourself of the impact on any warranties, etc. We provide an outline of what was done to accomplish the features seen in this demonstration, but unfortunately cannot provide a step-by-step guide or release the source code at this time.


Device Selection


We selected the Samsung™ Galaxy Note 3 as the primary platform for this demonstration. This device utilises the Samsung Exynos™ 5420 System-on-Chip, a 4+4 big.LITTLE™ design built around the ARM Cortex®-A15 and Cortex-A7 application processors with an ARM Mali®-T628 graphics processor. The device was upgraded to Android 4.4 and, alongside its powerful processing, included NFC, wireless charging capability (with an accessory pack), wireless display mirroring and a few other features we thought might be useful for this specific demonstration.


Context Change Detection

Context sensing is a topic of some current interest in the mobile device market. With the proliferation of available sensors, along with always-on connectivity to a wide variety of cloud services, the devices can more accurately recognise what is happening and adjust their functionality in response to this. For our demonstration, we needed a practical way for the device to recognise proximity to the desk and thus trigger a change to desktop mode, along with recognising the opposite transition to mobile.


Initially, we evaluated NFC as a transition trigger. A tag was placed on the surface of the desk, so the user would simply place the phone on the desk as they sat down to trigger the transition. This was relatively straightforward, as Android provides good support for NFC. However, one complication was that the version of Android being used did not publicly expose an event (an Intent within the Android SDK) for tag removal. So, we could detect when the phone was placed on the tagged desk, but not when it was removed. One can work around this with a rooted phone and third-party frameworks or APKs that give deeper hooks into Android. With this we were able to achieve the desired behaviour – place the phone on the desk to enter desktop mode, pick it up to return to mobile mode.


However, our final context detection relied upon a different mechanism. As we had a phone capable of wireless charging, we constructed a desk with a wireless charger embedded in the surface. Samsung sold a wireless charging kit for the Galaxy Note 3, consisting of a replacement back plate for the phone, and a charging pad with a USB connection for power. We took the pad, and routed a depression for it in a small children’s desk. We then placed thin vinyl tiles over the desk surface. The result was a smoothly finished surface, with a ‘charging zone’ above the embedded charging pad. Detecting charging and not-charging events via the Intent framework in Android is even easier than NFC, so using this as a context trigger was very straightforward. Additionally, the device would be charging whilst it was being used as a desktop!
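On Android, the charging events arrive as the ACTION_POWER_CONNECTED and ACTION_POWER_DISCONNECTED broadcasts; the mode-switching logic itself is just a small state machine. The sketch below shows that logic in isolation (the class and event names are our own, and the BroadcastReceiver registration is omitted so it runs standalone):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the context-switch logic driven by charging events. On a
// device, onEvent() would be called from a BroadcastReceiver registered
// for Intent.ACTION_POWER_CONNECTED / ACTION_POWER_DISCONNECTED.
public class ContextSwitcher {
    public enum Mode { MOBILE, DESKTOP }

    private Mode mode = Mode.MOBILE;
    private final List<String> actions = new ArrayList<>();

    public void onEvent(String event) {
        if ("POWER_CONNECTED".equals(event) && mode == Mode.MOBILE) {
            mode = Mode.DESKTOP;
            // In the demonstration: connect the Miracast display, lock
            // rotation to landscape, swap the launcher settings.
            actions.add("enterDesktop");
        } else if ("POWER_DISCONNECTED".equals(event) && mode == Mode.DESKTOP) {
            mode = Mode.MOBILE;
            actions.add("enterMobile");
        }
    }

    public Mode getMode() { return mode; }
    public List<String> getActions() { return actions; }
}
```

Note that duplicate events are ignored, so a flaky charging connection cannot trigger the desktop transition twice in a row.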


The demonstration was implemented as an Android Service, with a simple administrative Activity for manipulating some settings. The Android Intents framework was used to listen for the events described above and trigger the correct context change. This consisted of triggering the peripheral pairing and UI reconfiguration described in the following section.



Wireless Peripheral Pairing


Bluetooth® support for keyboards, mice, and other devices has long been built into Android. The Android SDK provides support for enabling, disabling and otherwise manipulating the Bluetooth functionality of the device. In theory, we could enable or disable Bluetooth according to the desktop context detection we had. But in practice, Bluetooth is already fairly good at connecting with a peripheral once it has been paired with the device, and is in range. We experimented a little bit with enabling and disabling Bluetooth but settled on just enabling it if it was not already on, and relying on it to establish connections to the keyboard and mouse when in range.


Wireless display mirroring is a little more interesting, and more difficult, than connecting a keyboard or a mouse. More recent versions of Android support the Wi-Fi Certified Miracast® standard. At the time of development of this demonstration, Miracast was exposed on the Samsung Galaxy Note 3 as Samsung Allshare® Cast; more recent releases roll Miracast support into the core of Android. Miracast is essentially a compressed video stream transmitted over Wi-Fi®.


By default, display mirroring is a feature that the user explicitly turns on and off via a settings menu option or shortcut. For the purposes of our demonstration, we wished to automate this. There is no public API to access this programmatically, either in Android itself or provided by the OEM (Samsung). However, some research into the Android source code on GitHub reveals that from around version 19 of the Android SDK, the DisplayManager class does include methods for connecting and disconnecting Wi-Fi displays (aka Miracast), but that these methods are hidden under normal circumstances. There are a few ways to gain access here – reflection has been a popular approach for experimental Android developers, but a slightly more elegant approach is to obtain an Android Open Source Project jar archive in which the hidden classes and methods have not been stripped out, and then replace the standard android.jar file in your build framework. Obviously the methods exposed here are not generally available, supported, or even guaranteed to work at all – this is not for general application development, but within our remit of creating an interesting technical demonstration it was a viable route forward.
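The reflection route follows the standard Java pattern: look the hidden method up by name on the class object, make it accessible, and invoke it. The exact class and method names for the Wi-Fi display calls vary by Android build and are not part of the public SDK, so the sketch below demonstrates the pattern against an ordinary private method of its own class; on a device the same calls would be aimed at the DisplayManager instance:

```java
import java.lang.reflect.Method;

// Demonstrates the reflection pattern used to reach hidden framework
// methods. The target here is a private method of our own class, so the
// example runs standalone; on Android the same pattern is applied to
// hidden DisplayManager methods (names vary by build).
public class HiddenCall {
    private String connect(String address) {
        return "connected:" + address;
    }

    public static String invokeHidden(Object target, String methodName, String arg) {
        try {
            Method m = target.getClass().getDeclaredMethod(methodName, String.class);
            m.setAccessible(true); // bypass access checks, as for @hide methods
            return (String) m.invoke(target, arg);
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException(e);
        }
    }
}
```

The android.jar-replacement approach avoids all of this string-based lookup, which is exactly why it is the more elegant of the two: the hidden methods become ordinary compile-time-checked calls.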


Given access to the hidden functions of the DisplayManager class, it was now possible to automatically connect or disconnect from a known Miracast display – in this case a Samsung Allshare Cast dongle connected to a display on our desk.


User Interface Configuration

Simple display mirroring over Miracast is perfect for showing a movie, pictures or similar content on a larger screen. However, it is just simple display mirroring – so the interface of the device remains exactly the same. This means that, in landscape view on our remote monitor, one will see a letterboxed, portrait image of the phone screen… and that all icons and text are sized as if they were to be displayed on a screen a few inches across, rather than a desktop-sized display. Two-inch-wide icons do not look natural, and to compound this, much of the UI layout on a mobile device is also aimed at a small screen – a single scroll list or column of input fields, for instance. To obtain a more natural desktop experience we employed three approaches.


First, we needed to ensure the phone transitioned to landscape display when in desktop mode. There are apps in the Google Play™ Store that will allow you to do this. Using one of these in conjunction with an automation app such as Tasker, we can automate locking of display rotation to landscape in our desktop context. From our Android Service, on entering or exiting desktop mode, we broadcast some custom Intents. Using Tasker’s ability to receive intents, we set it up to control the rotation locking app appropriately.


The orientation issue now solved, we can move on to the icon size and UI layout issue. Anyone who has developed with Android knows that there is a comprehensive framework in place to define UI layouts and assets that adjust to the wide range of display sizes found in Android devices. Whilst this framework is not generally intended to be leveraged dynamically, there are methods in a normally hidden interface within the Android framework that allow these values to be programmatically set. Whether this works will depend a little on the precise Android build and which device you are using, but if they are enabled then one can set the pixel density and display size, and leave the Android layout and resource framework to do the rest. There are some caveats here in that some applications will not pick up the new settings and refresh their layout automatically. For the purposes of our demonstration we forced some applications to restart – definitely not a recommended approach in standard Android programming, but possible with the root access we’d already obtained to implement this demonstration.
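The density value being manipulated here is the same one the resource framework uses for every dp-to-pixel conversion, which is why changing it rescales the whole UI. Android documents the conversion as px = dp × (dpi / 160), with 160 dpi being the mdpi baseline; a small worked sketch:

```java
// Android scales density-independent pixels (dp) to physical pixels as
// px = dp * (dpi / 160), where 160 dpi is the mdpi baseline. Reporting a
// lower density in desktop mode therefore shrinks every dp-sized icon
// and layout dimension proportionally.
public class Density {
    public static int dpToPx(int dp, int dpi) {
        return Math.round(dp * (dpi / 160f));
    }
}
```

For example, a 48dp launcher icon occupies 144 physical pixels on a 480dpi phone panel, but only 48 on a monitor treated as 160dpi – which is the effect we wanted for desktop mode.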


With our desktop experience now utilising more reasonably sized icons, and layouts designed for larger tablet devices (2-pane layouts, etc.), we can focus a little more attention on the home screen itself. On a mobile device, this tends to be given over to a grid of app icons and widgets, and to feature multiple pages of such grids which the user can swipe through. A traditional desktop experience usually has only one page, and a few icons, usually towards the edges of the screen. The default launcher screen on our selected device did not ‘feel’ like a desktop even when locked to landscape and with its tablet layout. So, we opted to install a custom Android launcher. With this we could configure the desktop experience to appear exactly as we desired.


However, we still needed to switch between a mobile and desktop experience – i.e. change the home screen layout dynamically. A little bit of reverse engineering revealed where the settings files for our custom launcher were stored. We used something of a blunt instrument here, but with the help of a library enabling root-access shell commands, we swap out the settings files for the launcher and force it to restart on each context switch between mobile and desktop. This is by far the least elegant implementation of the demonstration, and the most prone to error, but probably went the furthest towards providing a compelling user experience upon entering desktop mode – there was a very visible transition to a User Experience that anyone who has touched a PC in the last 30 years would recognise.



Closing Words

This then concludes a brief exploration of the techniques we used to implement the Seamless Computing demonstration. One of the most interesting conclusions was not only that a mobile device has the capability to function in this desktop context, but that actually it is possible to leverage substantial portions of the existing Android software framework to provide a compelling desktop experience, and to be able to dynamically switch into and out of this. It is by no means a production-ready experience – but it was closer than we’d anticipated on commissioning the demonstration.


Whether a single device operating in this manner is the direction the world will take remains to be seen. There are other possibilities – multiple devices all providing a rich-but-thin client experience to a virtual cloud-hosted desktop or homescreen, for instance. Regardless, ARM technology is allowing our partners to experiment with all of these form factors and performance points, from extraordinary compute power in a handheld device, to capable but extremely cost-conscious tablets or clamshells. Mobile computing is a reality, and we can’t wait to see what happens next.


Ne10 v1.2.0 is released. Now radix-3 and radix-5 are supported in floating point complex FFT. Benchmark data below shows that NEON optimization has significantly improved performance of FFT.


1. Project Ne10

The Ne10 project has been set up to provide a set of common, useful functions which have been heavily optimized for the ARM architecture, providing consistent, well-tested behavior that can be easily incorporated into applications. C interfaces to the functions are provided for both assembler and NEON™ implementations. The library supports static and dynamic linking and is modular, so that functionality that is not required can be discarded. For details of Ne10, please check this blog. For more details of the FFT feature in Ne10, please refer to this blog.


2. Benchmark

2.1. Time cost

Figure 1 shows benchmark data (time cost) for four FFT implementations: Ne10 (v1.2.0), pffft (2013), kissFFT (1.3.0), and the one inside Opus (v1.1.1-beta). Ne10 and pffft are heavily NEON-optimized, while kissFFT and the Opus FFT are not. All implementations were compiled with LLVM 3.5 using the -O2 flag, and tested on ARMv7-A (Cortex-A9, 1.0GHz) and AArch64 (Cortex-A53, 850MHz).

Figure 1

In figure 1, the x axis is the size of the FFT and the y axis is time cost (ms); smaller is better. Each FFT has been run 2.048 × 10^6 / (size of FFT) times. For example, a 1024-point FFT is run 2000 times. pffft only supports sizes that are multiples of 16, so its curve starts from 240. The performance boost from NEON optimization is obvious.


2.2. Mega Floating-point operations per second (MFLOPS)

Figure 2

Figure 2 shows benchmark data in MFLOPS for these four implementations. Data are calculated according to this link. MFLOPS is a measure of the performance of different algorithms in solving the same problem; bigger is better. When data are packed and processed by NEON instructions (in Ne10 and pffft), MFLOPS is much higher.


3. Usage

The FFT API is not modified. During the initial setup step, Ne10 detects whether the size of the FFT contains factors of 3 or 5, and then selects the best algorithm to execute. For more detail, please refer to this blog.
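Since the API is unchanged, the radix selection happens entirely inside the setup step. A sketch of the kind of check involved (illustrative only, not Ne10's actual source): the size is repeatedly divided by the supported radices, and it is a supported mixed-radix size if nothing is left over.

```java
// Illustrative sketch of mixed-radix support detection: an FFT size is
// supported if it factors completely into the radices 2, 3 and 5. This
// mirrors the kind of check done during Ne10's setup step; it is not
// Ne10's actual code.
public class RadixCheck {
    public static boolean isSupported(int n) {
        if (n < 2) return false;
        for (int radix : new int[] {2, 3, 5}) {
            while (n % radix == 0) n /= radix;
        }
        return n == 1; // fully factored into supported radices
    }
}
```

Under this check, sizes such as 240 (2^4 × 3 × 5) become supported alongside the pure powers of two, while a size with a factor of 7 is rejected.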

The Android team in ARM was lucky enough to be invited to a Linux Plumbers mini-conf to talk about AArch64, porting from 32-bit to 64-bit and our experiences in working on Binder (a key Android feature which relies upon support in the Linux kernel).


Attached to this post are the raw PDFs (no video this time).


First, an introduction to the AArch64 ISA (from the lead engineer on our JavaScript porting work); next, a presentation on porting between AArch32 and AArch64 code (from an engineer who did a lot of work on adding AArch64 support to Skia, a key rendering library in Android); and finally, a presentation on the changes to the Binder kernel driver needed to support 64-bit user space code, from the engineer who did that and much of the initial bionic porting to 64-bit for Android.


As an added bonus, I've attached the original slides for the 'From Zero to Boot' talk at Linaro, which are missing from the Linaro page on the talk.


The ART of Fuzz Testing

Posted by Stephen Kyle Nov 26, 2014

In the newest version of Android, Lollipop (5.0), the virtual machine (VM) implementation has changed from Dalvik to ART. Like most VMs, ART has an interpreter for executing the bytecode of an application, but also uses an ahead-of-time (AOT) compiler to generate native code. This compilation takes place for the majority of Java methods in an app, when the app is initially installed. The old VM, Dalvik, only produced native code from bytecode as the app was executed, a process called just-in-time (JIT) compilation.


ART currently provides a single compiler for this AOT compilation, called the quick compiler. This backend is relatively simple for a compiler, using a 1:1 mapping from most bytecodes to set sequences of machine instructions, performing a few basic optimisations on top of this. More backends are in various stages of development, such as the portable backend and the optimizing backend. As the complexity of a backend increases, so too does its potential to introduce subtle bugs into the execution of bytecode. In the rest of this post, we will use the term "backend" to refer to the different ways in which code can be executed by ART, be it the interpreter, the quick compiler, or the optimizing compiler, and the terms "quick compiler" and "quick backend" should be considered equivalent.


In this post we will consider how we can check that we aren't introducing new bugs as these backends are developed.


A test suite is useful, but is limited in size, and may only test for regressions of bugs the developers have found in the past. Some errors in the VM may not have been detected yet, and there are always rare cases arising from unexpected code sequences. While some bugs may just cause the compiler to crash, or create a program that produces slightly incorrect output, other bugs can be more malicious. Many of these bugs lurk at the fringes of what we would consider "normal" program behaviour, leaving open potential for exploits that use these fringe behaviours, leading to potential security issues.


How do we find these bugs? Fuzz testing (also commonly known as "fuzzing") can allow us to test a greater range of programs. Fuzz testing generally refers to random generation of input to stress test the capabilities of a program or API, particularly to see how it can handle erroneous input. In this case, we generate random programs to see how the backends of ART deal with verifying, compiling and executing them.  Before we discuss our fuzz testing strategy in more detail, let's look at how apps are executed in Android.


From Java code to execution on your Android device


Let's take a look at a simple Java method, and watch how this code is transformed into a sequence of A64 instructions.


public int doSomething(int a, int b) {
  if (a > b) {
    return (a * 2);
  }
  return (a + b);
}

In Android software development, all Java source files are first compiled to Java bytecode, using the standard javac tool. The Java bytecode format (JVM bytecode) used by Java VMs is not the same as the bytecode used in ART, however. The dx tool is used to translate from JVM bytecode to the executable bytecode used by ART, which is called DEX (Dalvik EXecutable, a holdover from when the VM was called Dalvik.) The DEX code for this Java code looks like:


0000: if-le v2, v3, 0005
0002: mul-int/lit8 v0, v2, #int 2
0004: return v0
0005: add-int v0, v2, v3
0007: goto 0004


In this case, the virtual registers v2 and v3 are the method's parameters, a and b, respectively. For a good reference on DEX bytecode, you can consult this document, but essentially this code compares a to b, and if a is less-than-or-equal-to b it adds a to b and returns that result. Otherwise, it multiplies a by 2 and returns that.
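Reading the DEX back into Java terms: the branch condition is inverted relative to the source (if-le skips over the multiply), but the behaviour is identical to the original method:

```java
// Java equivalent of the DEX listing above: if-le v2, v3 branches to the
// add when a <= b; otherwise the multiply falls through and returns.
public class DexSemantics {
    public static int doSomething(int a, int b) {
        if (a <= b) {        // 0000: if-le v2, v3, 0005
            return a + b;    // 0005: add-int v0, v2, v3; goto 0004 (return)
        }
        return a * 2;        // 0002: mul-int/lit8 v0, v2, #int 2; return v0
    }
}
```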


When ART loads this code, it typically compiles the bytecode using the quick backend. This compilation will produce a function that roughly follows the ARM Architecture Procedure Call Standard (AAPCS) used with A64 code - it will expect to find its arguments in r2 and r3*, and will return the correct result in r0. Here is the A64 code that the quick backend will produce, with some simplifications:


  // Reminder: w2 is the 32-bit view of register r2 in A64 code!
  [-- omitted saving of registers w20-w22 to the stack --]
  mov w21, w2
  mov w22, w3
  cmp w21, w22
  b.le doAdd
  lsl w20, w21, #1  // (NB: this is w21 * 2)
doLeave:
  mov w0, w20
  [-- omitted loading of registers w20-w22 from the stack --]
  ret
doAdd:
  add w20, w21, w22
  b doLeave


*(Why not r0 and r1? Because r0 is reserved for passing the context of the method that is currently being executed. r1 is used for the implicit first argument of any non-static method - the reference to the this object.)


Before code can be compiled or executed by any backend, the bytecode must always be verified.  Verification involves checking various properties of the bytecode to ensure it is safe to execute. For example, checking that the inputs to a mul-float bytecode are actually float values, or checking that a particular method can be executed from the class we are currently executing within. Many of these properties are checked when the program is compiled from Java source to DEX bytecode, resulting in compiler errors. However, it is important to perform full bytecode verification when apps are about to be executed, to defend against security exploits that target DEX manipulation.


Once verification has taken place at run time, ART will load the arguments for the method into the correct registers, and then jump straight to the native code. Alternatively, ART could use its interpreter to interpret the input DEX bytecode as Dalvik would traditionally have done before attempting JIT compilation. Any bytecode that is executed as native code should do the exact same thing when it is executed in the interpreter. This means that methods should return the same results and produce the same side-effects. We can use these requirements to test for flaws in the various backend implementations. We expect that any code that passes the initial verification should be compilable, and some aspects of compilation will actually rely on properties of the code that verification has proven. Contracts exist between the different stages of the VM, and we would like to be assured that there are no gaps between these contracts.


Fuzz testing


We have developed a fuzz tester for ART, that uses mutation-based fuzzing to create new test cases from already written Java programs. ART comes with an extensive test suite for testing the correctness of the VM, but with a mutation-based fuzz tester, we can use these provided tests as a base from which we can investigate more corner cases of the VM.


The majority of these test programs produce some kind of console output - or at the very least, output any encountered VM errors to the console. The test suite knows exactly what output each test should produce, so it runs the test, and confirms that the output has not changed. Mutation-based fuzzing means that we take a test program, and modify it slightly - this means that the output of the program may have changed, or the program may now produce an error. Since we no longer know what output to expect, we can instead use the fact that ART has multiple backends to verify that they all execute this program the same way. Note however that this approach is not foolproof, as it may be the case that all of the backends execute the program in the same, incorrect way. To overcome this, it is also possible to test program execution on the previous VM, Dalvik, as long as some known differences between the two VMs are tolerated (e.g. the messages they use to report errors.) As we increase the number of backends to test, the likelihood that they are all wrong in the same way should decrease.
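The comparison at the heart of this approach is simple: run the mutated program through every backend and diff the captured outputs. A minimal sketch of that oracle, with backends represented as functions from a program to its console output (in the real fuzzer each backend is a separate ART invocation):

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Differential-testing oracle sketch: run the same input through several
// "backends" and collect the distinct outputs. In the real fuzzer each
// backend is an ART invocation (interpreter, quick, optimizing) and the
// output is the mutated program's console output.
public class DiffTest {
    public static Set<String> distinctOutputs(
            List<Function<String, String>> backends, String program) {
        Set<String> outputs = new LinkedHashSet<>();
        for (Function<String, String> backend : backends) {
            outputs.add(backend.apply(program));
        }
        return outputs; // more than one element means the backends disagree
    }
}
```

A result set larger than one flags a divergence worth investigating; as the post notes, agreement alone is weaker evidence, since all backends could share the same wrong answer.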




This diagram shows the fuzzing and testing process. First, the fuzzer parses the DEX file format into a form such that it can apply various mutations to the code. It randomly selects a subset of the methods of the program to mutate, and for each one, it randomly selects a number of mutations to apply. The fuzzer produces a new, mutated DEX file with the mutated code, and then executes this program using the various backends of the ART VM.


Note that all backends pass through a single verifier, and that some backends have been simplified in this diagram - the quick and optimizing backends are technically split up into compilation and execution phases, while the interpreter only has an execution phase. Ultimately, the execution of the mutated DEX file should produce some kind of output from each backend, and we compare these outputs to find bugs. In this example, the fact that the optimizing backend produces "9" instead of "7" strongly suggests there is a bug with the way the optimizing backend has handled this mutated code.


So how do we do this fuzzing? A naive approach would be to take the DEX file and flip bits randomly to produce a mutated DEX file. However, this is likely to always produce a DEX file that fails to pass verification. A large part of the verification process is checking that the structure of the DEX file format is sound, and this includes a checksum in the file's header – randomly flipping bits in the whole file will almost certainly cause this checksum to become invalid, but also likely break some part of the file's structure. A better approach is to focus on applying minor mutations to the sections of the program that directly represent executable code.
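The checksum in the DEX header is an Adler-32 over the rest of the file, so any blind bit-flip invalidates the file unless the checksum is recomputed afterwards. A small sketch of why random flipping fails verification (using the JDK's own Adler-32 implementation; the byte layout here is illustrative, not a real DEX file):

```java
import java.util.zip.Adler32;

// Illustrates why blind bit-flipping breaks DEX verification: the header
// stores an Adler-32 checksum of the file body, so flipping any body bit
// without recomputing the checksum produces a mismatch.
public class ChecksumDemo {
    public static long checksum(byte[] body) {
        Adler32 a = new Adler32();
        a.update(body, 0, body.length);
        return a.getValue();
    }

    public static boolean verifies(byte[] body, long storedChecksum) {
        return checksum(body) == storedChecksum;
    }
}
```

A fuzzer that mutates decoded code sections and then re-encodes the file can simply recompute this checksum, which is part of why structured mutation beats raw bit-flipping.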


Some examples of these minor mutations are as follows:



  • Swap two bytecodes – pick two bytecodes and swap them with each other.
  • Change the register used by a bytecode – pick one of the registers specified by a bytecode and change it.
  • Change an index into the type/field list – some bytecodes use an index into a list of methods, types or fields at the start of a DEX file. For example, new-instance v0, type@7 will create a new object with the type listed at index 7 of the type list and put it in v0. The mutation changes which type, field or method is selected.
  • Change the target of a branch bytecode – make a branch bytecode point to a new target, changing control-flow.
  • Generate a random new bytecode – generate a new random bytecode and insert it at a random position, with randomly generated values for all of its operands.


We limit our mutations to a few simple changes to bytecodes that individually are unlikely to break the verification of the DEX file, but in combination may lead to differences in the way the program executes. At the same time, we do not want to ensure that every mutation results in a legal bytecode state, because we wish to search for holes in the verification of the program. Often holes in verification may lead to a compiler making an incorrect assumption about the code it is compiling, which will manifest as differences in output between the compiler and the interpreter.
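The swap mutation from the list above is representative of how small these changes are. A sketch of it over a symbolic method body (the real fuzzer operates on decoded DEX instruction structures; strings are used here only to keep the example self-contained):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Sketch of the "swap two bytecodes" mutation over a symbolic method
// body. The real fuzzer mutates decoded DEX instructions, not strings.
public class SwapMutation {
    public static List<String> mutate(List<String> code, Random rng) {
        List<String> out = new ArrayList<>(code);
        if (out.size() < 2) return out;
        int i = rng.nextInt(out.size());
        int j = rng.nextInt(out.size());
        Collections.swap(out, i, j); // may be a no-op if i == j
        return out;
    }
}
```

The mutated body contains exactly the original bytecodes, possibly reordered – individually verifiable, but in combination with other mutations capable of producing the unexpected code sequences the fuzzer is looking for.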


Example of Bugs Found


Now we present one of the bugs that we have found and fixed in the Android Open Source Project's (AOSP) code base, using this fuzz testing strategy.


When presented with a bytecode that reads an instance field of an object, such as iget v0, v1, MyClass.status (this writes into v0 the value of the "status" field of the object referred to by v1) the verifier did not confirm that v1 actually contained a reference to an object.


Here's a sequence of bytecodes that creates a new MyClass instance, and sets the status field to its initial value + 1:


const v0, 1
new-instance v1, MyClass
invoke-direct {v1} void MyClass.<init>() // calling MyClass() constructor
iget v2, v1, MyClass.status
add-int v2, v0, v2
iput v2, v1, MyClass.status


If a mutation changed the v1 on line 4 to v0, then iget would now have the constant 1 currently in v0 as an input, instead of the reference to an object that was in v1.  Previously, the verifier would not report this as an error when it should, and so the compiler (which expects the iget bytecode to have been properly verified) would expect an object reference to be in the input register for iget, and just read from the value of that reference plus the offset of the status field. If an attacker ensured that an address they wanted to read from was used as the loaded constant, they could read from any memory address in the process' address space. Java removes the ability to read memory directly (without the use of some mechanism such as JNI), to ensure that, for instance, private fields of classes cannot be accessed from within Java, but this bug allowed this to happen.
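The missing check amounts to tracking what each register holds during verification and rejecting an iget whose object register is not a reference. A toy version of that check (greatly simplified; ART's real verifier tracks far richer type information per instruction):

```java
import java.util.HashMap;
import java.util.Map;

// Toy version of the verifier check that was missing: track what each
// register holds and reject an iget whose object register does not
// contain a reference.
public class TypeCheck {
    public enum RegType { CONST_INT, REFERENCE }

    private final Map<Integer, RegType> regs = new HashMap<>();

    public void constInt(int reg)    { regs.put(reg, RegType.CONST_INT); }
    public void newInstance(int reg) { regs.put(reg, RegType.REFERENCE); }

    // Returns true only if the object register of an iget holds a reference.
    public boolean verifyIget(int objectReg) {
        return regs.get(objectReg) == RegType.REFERENCE;
    }
}
```

With this check in place, the mutated iget v2, v0, MyClass.status fails verification because v0 was defined by a const, closing the arbitrary-read hole described above.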


While this particular bug was present in the verifier, other bugs have been found and fixed in the quick backend of ART. For some of these bugs, we have contributed patches to the AOSP code base, while other bugs have been reported to the ART team. As a result of our fuzz testing efforts, new tests have been added to ART's test suite that are buildable directly from a description of DEX bytecode, whereas previously all tests had to be built from Java source code. This was necessary because many bugs we have found arise from specially crafted pieces of bytecode that the javac and dx tools would not generate themselves. We have aimed to submit DEX bytecode tests with any patches we submit to AOSP.




In this post we have looked at how fuzz testing can help the development of new backends for a virtual machine, specifically the ART VM that now powers Android.  From the roughly 200 test programs already present in ART's test suite, we have produced a significantly larger number of new tests using fuzzing. Each additional program used for testing increases our confidence that the implementation of ART is sound.  Most of the bugs we found affected the quick backend of ART as it was being developed in AOSP, but as new bugs could arise from complicated interactions between optimisations in the optimizing backend, the use of fuzz testing will increase our chances of finding any bugs and squashing them early.


Further Reading


The initial research into fuzzing was performed by Barton Miller at UW-Madison.


Paul Sabanal fuzzed the experimental release of ART in KitKat, and found a few crashes. He presented this work at HITB2014.


For more information about differential testing, various papers have been written about Csmith, a tool that performs differential testing to test C compilers.


Researchers at UC Davis recently presented work about Equivalence Modulo Inputs, where seed programs are fuzzed to produce new programs that are expected to produce the same output as the seed program for a given set of inputs. All produced programs are then compiled and executed, and divergences in output indicate miscompilations.

In this blog I will cover various methods of runtime feature detection on CPUs implementing the ARMv8-A architecture. These methods include using HWCAP on Linux and Android, using the NDK on Android, and using /proc/cpuinfo. I will also provide sample code to detect the new optional features introduced in the ARMv8-A architecture. Before we dig deep into the different methods, let us understand more about ARMv8-A CPU features.


ARMv8-A CPU features


ARMv7-A CPU features


The ARMv8-A architecture makes many optional ARMv7-A features mandatory, including Advanced SIMD (also called NEON). This applies to both ARMv8-A execution states, namely AArch32 (the 32-bit execution state, backward compatible with ARMv7-A) and AArch64 (the 64-bit execution state).


New features


The ARMv8-A architecture introduces a new set of optional instructions, including AES. These instructions were not available in the ARMv7-A architecture. They are grouped into the categories listed below.


  • CRC32 instructions - CRC32B, CRC32H, CRC32W, CRC32X, CRC32CB, CRC32CH, CRC32CW, and CRC32CX
  • SHA1 instructions - SHA1C, SHA1P, SHA1M, SHA1H, SHA1SU0, and SHA1SU1
  • SHA2 instructions - SHA256H, SHA256H2, SHA256SU0, and SHA256SU1
  • AES instructions - AESE, AESD, AESMC, and AESIMC
  • PMULL instructions that operate on 64-bit data - PMULL and PMULL2


Runtime CPU feature detection scenarios


User-space programs can detect features supported by an ARMv8-A CPU at runtime, using many mechanisms including /proc/cpuinfo, HWCAP and the Android NDK CPU feature API.  I will describe them in detail below.


Detect CPU feature using /proc/cpuinfo


Parsing /proc/cpuinfo is a popular way to detect CPU features. However, I strongly recommend against using /proc/cpuinfo for CPU feature detection on ARMv8-A, as it is not portable. /proc/cpuinfo reflects the characteristics of the kernel rather than of the application being executed: its output is the same for both 32-bit and 64-bit processes running on an ARMv8-A 64-bit kernel, and the ARMv8-A 64-bit kernel's output is quite different from that of an ARMv7-A 32-bit kernel. For example, an ARMv8-A 64-bit kernel uses 'asimd' to indicate Advanced SIMD support, while an ARMv7-A 32-bit kernel uses 'neon'. Thus, NEON detection code that looks for the "neon" string in /proc/cpuinfo will not work on an ARMv8-A 64-bit kernel. Applications using /proc/cpuinfo should migrate to HWCAP or the NDK API, which are maintained and controlled interfaces, unlike /proc/cpuinfo.


Detect CPU feature using HWCAP


HWCAP can be used on ARMv8-A processors to detect CPU features at runtime.


HWCAP and Auxiliary vector


First, let me give you a brief overview of HWCAP. HWCAP uses the auxiliary vector feature provided by the Linux kernel. The Linux kernel's ELF binary loader uses the auxiliary vector to pass certain OS and architecture specific information to user space programs. Each entry in the vector consists of two items: the first identifies the type of entry, the second provides the value for that type. Processes can access these auxiliary vectors through the getauxval() API call.


getauxval() is a library function available to user space programs to retrieve a value from the auxiliary vector. This function is supported by both bionic (Android's libc library) and glibc (GNU libc library).  The prototype of this function is unsigned long getauxval(unsigned long type); Given the argument type, getauxval() returns the corresponding value.


<sys/auxv.h> defines various vector types. Amongst them, AT_HWCAP and AT_HWCAP2 are of interest to us. These auxiliary vector types specify processor capabilities. For these types, getauxval() returns a bit-mask in which different bits indicate various processor capabilities.




Let us look at how HWCAP can be used on ARMv8-A. In ARMv8-A, the values returned for AT_HWCAP and AT_HWCAP2 depend on the execution state. For AArch32 (32-bit processes), AT_HWCAP provides flags specific to ARMv7 and prior architectures (NEON, for example), while AT_HWCAP2 provides ARMv8-A related flags such as AES and CRC32. In the case of AArch64, AT_HWCAP provides the ARMv8-A related flags such as AES, and the AT_HWCAP2 bit-space is not used.


Benefits of HWCAP


One of the main benefits of using HWCAP over mechanisms like /proc/cpuinfo is portability. Existing ARMv7-A programs that use HWCAP to detect features like NEON will run as-is on ARMv8-A, without any change. And since getauxval() is supported on Linux (through glibc) and Android (through bionic), the same code can run on both.


Sample code for AArch32 state


The sample code below shows how to detect CPU features using AT_HWCAP in the AArch32 state.


#include <stdio.h>
#include <sys/auxv.h>
#include <asm/hwcap.h>

int main()
{
    unsigned long hwcaps2 = getauxval(AT_HWCAP2);

    if (hwcaps2 & HWCAP2_AES) {
        printf("AES instructions are available\n");
    }
    if (hwcaps2 & HWCAP2_CRC32) {
        printf("CRC32 instructions are available\n");
    }
    if (hwcaps2 & HWCAP2_PMULL) {
        printf("PMULL/PMULL2 instructions that operate on 64-bit data are available\n");
    }
    if (hwcaps2 & HWCAP2_SHA1) {
        printf("SHA1 instructions are available\n");
    }
    if (hwcaps2 & HWCAP2_SHA2) {
        printf("SHA2 instructions are available\n");
    }
    return 0;
}


Sample code for AArch64 state


The code below shows how to detect ARMv8-A CPU features in an AArch64 process using HWCAP.


#include <stdio.h>
#include <sys/auxv.h>
#include <asm/hwcap.h>

int main()
{
    unsigned long hwcaps = getauxval(AT_HWCAP);

    if (hwcaps & HWCAP_AES) {
        printf("AES instructions are available\n");
    }
    if (hwcaps & HWCAP_CRC32) {
        printf("CRC32 instructions are available\n");
    }
    if (hwcaps & HWCAP_PMULL) {
        printf("PMULL/PMULL2 instructions that operate on 64-bit data are available\n");
    }
    if (hwcaps & HWCAP_SHA1) {
        printf("SHA1 instructions are available\n");
    }
    if (hwcaps & HWCAP_SHA2) {
        printf("SHA2 instructions are available\n");
    }
    return 0;
}


Detect CPU feature using Android NDK CPU feature API


The Android NDK provides an API to detect the CPU architecture family and the supported features at run time.


CPU feature API


There are two main functions, android_getCpuFamily() and android_getCpuFeatures().


  • android_getCpuFamily() - Returns the CPU family
  • android_getCpuFeatures() - Returns a bitmap describing a set of supported optional CPU features. The exact flags will depend on CPU family returned by android_getCpuFamily(). These flags are defined in cpu-features.h


Support for ARMv8-A optional features


The latest NDK release (version 10b, September 2014) supports ARMv8-A CPU feature detection only for the AArch64 mode. However, the NDK project in AOSP supports both the AArch32 and AArch64 CPU feature flags. The AArch32 feature flags were added to AOSP in change list 106360. The NDK uses HWCAP internally to detect CPU features.


NDK sample code to detect ARMv8-A CPU features


Detect CPU family


#include <stdio.h>
#include "cpu-features.h"

int main()
{
    AndroidCpuFamily family;
    family = android_getCpuFamily();
    if (family == ANDROID_CPU_FAMILY_ARM) {
        printf("CPU family is ANDROID_CPU_FAMILY_ARM \n");
    } else if (family == ANDROID_CPU_FAMILY_ARM64) {
        printf("CPU family is ANDROID_CPU_FAMILY_ARM64 \n");
    } else {
        printf("CPU family is %d \n", family);
    }
    return 0;
}


Detect ARMv8-A CPU features


#include <stdio.h>
#include "cpu-features.h"

void printArm64Features(){
    uint64_t features;
    features = android_getCpuFeatures();
    if(features & ANDROID_CPU_ARM64_FEATURE_AES){
        printf("AES instructions are available\n");
    }
    if(features & ANDROID_CPU_ARM64_FEATURE_PMULL){
        printf("PMULL instructions, that operate on 64-bit data, are available\n");
    }
    if(features & ANDROID_CPU_ARM64_FEATURE_SHA1){
        printf("SHA1 instructions are available\n");
    }
    if(features & ANDROID_CPU_ARM64_FEATURE_SHA2){
        printf("SHA2 instructions are available\n");
    }
    if(features & ANDROID_CPU_ARM64_FEATURE_CRC32){
        printf("CRC32 instructions are available\n");
    }
}

void printArmFeatures(){
    uint64_t features;
    features = android_getCpuFeatures();
    if(features & ANDROID_CPU_ARM_FEATURE_AES){
        printf("AES instructions are available\n");
    }
    if(features & ANDROID_CPU_ARM_FEATURE_PMULL){
        printf("PMULL instructions, that operate on 64-bit data, are available\n");
    }
    if(features & ANDROID_CPU_ARM_FEATURE_SHA1){
        printf("SHA1 instructions are available\n");
    }
    if(features & ANDROID_CPU_ARM_FEATURE_SHA2){
        printf("SHA2 instructions are available\n");
    }
    if(features & ANDROID_CPU_ARM_FEATURE_CRC32){
        printf("CRC32 instructions are available\n");
    }
}

int main(){
    AndroidCpuFamily family;
    family = android_getCpuFamily();
    if(family == ANDROID_CPU_FAMILY_ARM){
        printArmFeatures();
    }
    if(family == ANDROID_CPU_FAMILY_ARM64){
        printArm64Features();
    }
    return 0;
}




The ARMv8-A architecture makes certain ARMv7-A optional features mandatory and introduces a new set of optional features. The popular way of detecting features at runtime, parsing /proc/cpuinfo, is not portable to ARMv8-A, and existing code will not work there without tricky changes. Instead, application programmers can easily use HWCAP on Linux and the NDK API on Android. For detecting ARMv8-A optional features in AArch32 mode on Android, programmers should use HWCAP directly, as the latest NDK release does not support this yet.

The recent Linaro Connect (http://www.linaro.org/connect/lcu/lcu14/) saw several ARM and Linaro presentations about Android and about 64-bit. I think these might be interesting to anyone following Android, ARMv8, AArch64 or 64-bit progress in mobile.


First is Serban Constantinescu presenting the journey involved in getting AOSP running first on a 64-bit kernel (in 2012) and then booting with a 64-bit userspace - all on ARM Fast Models:

LCU14 411: From zero to booting nandroid with 64bit support - YouTube


Next, Stuart Monteith tells the story of porting Dalvik to 64-bit, and how Dalvik and ART are related:

LCU14-100: Dalvik is Dead, Long Live Dalvik! OR Tuning ART - YouTube


Then a presentation by Ashok Bhat on collaborative work between Linaro and ARM on creating multimedia tests to help with porting several Android codecs to 64-bit:

LCU14-502: Android User-Space Tests: Multimedia codec tests, Status and Open Discussions - YouTube


Finally, a presentation by Kevin Petit on ARMv8 NEON and the use of intrinsics:

LCU14-504: Taming ARMv8 NEON: from theory to benchmark results - YouTube


Hopefully, for those who prefer reading to watching, we will be able to post some blogs on the topics soon.

A few years ago (20?), I bought a programmable calculator and downloaded a program (from a "Bulletin Board" in Europe) to do symbolic Z-transform expansions for a digital signal processing test I had in college. I finished my test in a few minutes and was immediately handed it back with a "perfect!" and a 0 for a grade. When I explained that I had downloaded a program to my calculator from a site in Europe, I got "right...". After a 30-second demo (and an explanation of how the code worked), the zero had a 10 put in front of it, and that professor became my advisor.


Since then, billions of people have been downloading apps through an open source VM (which I actually wrote some code for) called "Android". A couple of years ago, I decided to start working on another open source VM I call "rekam1": mirrowrite(rekam1); I'll be demoing some consumer programmable projects with it at the World Maker Faire in NYC (check it out if you happen to be in the area), and I'll be talking about virtual machines for wirelessly connected Cortex-M devices at the upcoming TechCon conference in my talk, "The Consumer Programmable IoT". If you're interested to see how the maker (and consumer developer) community could change how we all write and share code, check out my talk!


ARM TechCon Schedule Builder | Session: The Consumer Programmable IoT

Eirik Aavitsland at Digia has created a blog post about how you can easily make an ODROID-U3 or another device running a recent version of Android boot to Qt.


This blog has been written before, but quite a few things have improved in the ease of use and breadth of support of Streamline on Android in the past few years. For starters, Mac OS X is now well supported: all three major development platforms (Linux, Windows and Mac) can run the DS-5 Community Edition debugger (gdb) and Streamline, either with the ADT Eclipse tools from Google plus DS-5 CE as an add-on, or pre-packaged as DS-5 CE for Windows and Linux from ARM with ADT as an add-on. Also, and most welcome, is the new gator driver. The component of Streamline that runs on Android to collect OS and processor counters used to require both a kernel module and a driver daemon, and compiling and flashing a module could be complicated depending on the availability of your Android platform's kernel headers. That requirement has been removed, and the gator daemon will now run as root on many devices. This July (7/2014), an updated version of gatord in DS-5 CE 5.19 will be released that greatly expands the kernel versions supported (beyond the 3.12 kernel supported in the current DS-5 5.18 release). Finally, I've found some erroneous and dated info in blogs that claim to be up to date with DS-5 5.18 and even the yet-to-be-released 5.19. I'll try to correct that here.


Streamline is a powerful system analysis tool that will help you speed up your code, reduce your energy footprint and balance system resources. The free version in the Community Edition of DS-5 lets you view CPU and OS counters in a powerful graphical view: CPU and GPU activity, cache hits and misses, and visibility down into individual threads and modules. You can find code that is blocking or that could be optimized by multithreading or refactoring in NEON or on the GPU. Check out more features on the optimize site.



Getting Started:


As of this writing the Android SDK Manager is Revision 22.6.4 bundled in the latest SDK for Mac, adt-bundle-mac-x86_64-20140321. The SDK is available at the Android Developer Site. The Native Development Kit (NDK) is revision 9d. Download both of these for your appropriate platform. I’m downloading the Mac OS X 64-bit versions for this guide but these instructions should work for Windows and Linux just as easily.


Once you unpack these tools, you should add some executable paths to your platform if you plan on using the terminal for tools like the Android Debug Bridge (adb). It is now possible to use all of the tools from within Eclipse without adjusting your executable paths, but for some of us old-schoolers who are wedded to the CLI: I drop my NDK folder into the SDK folder and put that folder in my Mac's /Applications directory. You can place them wherever you like on most platforms. I then added these to my ~/.bashrc:


export PATH=$PATH:/Applications/adt-bundle-mac-x86_64-20140321/sdk/platform-tools

export PATH=$PATH:/Applications/adt-bundle-mac-x86_64-20140321/sdk/tools

export PATH=$PATH:/Applications/adt-bundle-mac-x86_64-20140321/android-ndk-r9d


You should now be able to launch common Android tools from your command line:

> which ndk-build


> which fastboot


> which adb


> which android



You can Launch the Android SDK Manager from Eclipse in the “Window” menu or via the command line by typing:

> android


From there, you can update your current SDK, install older APIs, build-tools, platform tools and in “Extras”, the Android Support Library for compatibility with older APIs.


When you run Eclipse (ADT) for the first time or change versions, you may have to tell it where to find the SDK. The Preferences dialog box is found on Macs via the ADT->Preferences menu, sub heading Android.


Setting up a demo app to analyze (if you don’t have your own app):


You probably have your own library or application you want to perform system analysis on but just in case you’re checking out the tool, I’ll step through setting up an app that is near and dear to me, ProjectNe10. You can grab the master branch archive from GitHub. For this tool demo, I’ve created a directory /workspace and unzipped the Ne10 archive inside that folder. ProjectNe10 requires the cmake utility. Fortunately, there is a Homebrew solution to install cmake from the command line:


brew install cmake


If you don’t have brew installed, install it. You’ll use it in the future, I promise. You can also just download the binary for any platform from cmake.

Now we can build the Ne10 library from the command line:


Set these to your particular paths:


export NE10PATH=/workspace/projectNe10

export ANDROID_NDK=/Applications/adt-bundle-mac-x86_64-20140321/android-ndk-r9d




cd $NE10PATH

mkdir build && cd build

cmake -DCMAKE_TOOLCHAIN_FILE=../android/android_config.cmake ..


make install


That make install line will copy libNE10_test_demo.so to your /workspace/projectNe10/android/NE10Demo equivalent. Now you can go to the File->Import menu in Eclipse and import an existing Android code base into your workspace.




If all goes well, you should be able to connect your ARM based Android Device (in my case, a Nexus 5 running Android 4.4.4 to match the current SDK at the time of this writing) and run this app from the Run menu as an Android app. As a sanity check, you should run adb devices from the command line to verify you can see your device. This app will iterate through every function in the ProjectNe10 library with both C and NEON implementations. One of the implementations should be faster. I’ll give you a hint. It is the NEON implementation.



Installing DS-5 Community Edition (Free Eclipse Plugin with enhanced ARM debug and system analysis):


Start Eclipse and go to the menu Help->Install New Software.... Click on “Add...”, and paste http://tools.arm.com/eclipse in the location text box, then click OK. Select ARM DS-5 Community Edition, as shown on the screenshot below, and click Next. Eclipse will compute the dependencies of the DS-5 CE plug-ins.

Pasted Graphic 8.jpeg


Click Next again. Read the license agreements and if you accept, hit Finish. After the install is complete, ADT will ask you to reload.

A license dialog should popup if this is a fresh install. Select "Install Community edition license" and click "Continue".


If there was no popup license message go to Help->Generate community edition license, and click "Finish".


Congratulations, you now have ARM DS-5 CE installed, with its enhanced and easy to use debugger which you can use to debug Android NDK apps and libraries with the steps in this guide. You also have Streamline, a powerful system analysis tool which we'll cover in the next section.


Using Streamline and gator to analyze Android apps and the entire system


Before you can gather data for system analysis, you have to install a data collecting daemon in Android. Gatord will gather processor and kernel counters on the Android device and stream them over to your host machine. It must run as root to do this. Any device with an unlocked bootloader is simple to root: you usually just flash a custom recovery tool like TWRP and install SuperSU. If you have a locked bootloader, you'll have to use a device exploit, so I can't recommend this or help you, but your favorite search engine might… This is a minor inconvenience now; older versions required a kernel module (gator.ko) which needed to be compiled against your particular device's kernel headers, and since the Android security requirements for passing the Android CTS disallow loading kernel modules, you'd have had to compile it into the kernel and flash it. Fortunately, the new gatord will expand its kernel version support significantly in July.


First, build gatord. Go to the menu Help->ARM Extras… this will open up a folder with several goodies in it.



I’m going to build this from the command line, so fire up your favorite terminal and cd into this directory. The easiest way in the Mac terminal app is to type “cd ” and drag the gator folder into the terminal window; OS X will fill in the path. Then:


cd daemon-src

tar zxf gator-daemon.tar.gz

mv gator-daemon jni

cd jni

ndk-build



These steps unzip the gatord source and, via ndk-build, build it for Android (dynamically linked), with the output in ../libs/armeabi/gatord. Copy this binary to your Android device with your favorite method, AirDroid, scp, Droid NAS or very simply:


adb push ../libs/armeabi/gatord /sdcard/gatord


This, of course, assumes you’ve enabled developer options and debugging on your device. “On Android 4.2 and newer, Developer options is hidden by default. To make it available, go to Settings > About phone and tap Build number seven times. Return to the previous screen to find Developer options.” In Developer options, enable USB debugging. If this is a new device, you may have to approve the debug link security prompt the first time you try to use adb. You can also do this with an ARM based Android Virtual Device (AVD) in the emulator if your physical device is too ‘locked down’, but Streamline system data won’t be as useful. You may have to use “mount -o rw,remount rootfs /” and “chmod 777 /mnt/sdcard” in your AVD to push gatord.


Now, the tricky part: you have to move this binary to an executable location in the filesystem and set executable permissions. The most reliable method I’ve used is ES File Explorer. Go into the menu, turn on Root Explorer, go to the Mount R/W option, and set root “/” as RW (read/writable) rather than RO. Then copy and paste gatord into /system/bin in your Android filesystem. You can also set the permissions to executable in ES File Explorer by long-pressing on the gatord file, then more->Properties->Permissions->Change. Give the owner and group Execute permission and press Ok.


Back in your host machine terminal you need to set up a pipe for gator to communicate over USB and then get a shell on the device to start it:


adb forward tcp:8080 tcp:8080

adb shell


Now that you’ve got a shell on your Android device, you can su to root and start gatord. Type:


su

gatord &
The rest is pretty straightforward. Go to Window->Show View->Other…->DS-5->ARM Streamline Data.

Click on the gear button




In the address section, enter “localhost” if you’re streaming the capture data over USB using adb to forward the TCP port. In the Program Images box select the shared library that you want to profile (add ELF image from workspace).





You can now use the red “Start Capture” button at any time.


Other blogs and tutorials are accurate from this point forward on the features and use of Streamline so I’ll drop a few and let you get to it!

The “CAPTURING DATA AND VIEWING THE ARM STREAMLINE REPORT” section of this blog is accurate.

Events based sampling video, analyzing CPU and GPU performance and customizing charts on YouTube.


At VIA we are aiming to provide more software support for our products. Most of our ARM based boards have both Linux and Android images that potential partners can try. In our product line we have two Freescale boards, the VAB-800 (single core Cortex-A8) and the newer VAB-820 (quad core Cortex-A9). The latter just got a brand new Android Evaluation Package, fresh up on our website and ready for testing.


The Android image is based on Jelly Bean 4.2.2 (which puts this package ahead of even our other boards). Among the available features are the CAN bus driver, resistive touch screen, HDMI video and audio output, dual display, and mini PCI-E support. On the developer timeline for future releases we have ADV-7180 capture, WatchDog/GPIO, and VIA Smart ETK support for embedded solutions.


Android on Freescale is still a quite new combination with a lot of potential. We are hoping that it will make device developers' lives easier on both the software and hardware side. This evaluation package is just the beginning of this conversation.


What do you think, what would make you choose an Android system over others for your next embedded solution?


The Android Evaluation Package (as well as the Linux Evaluation Package) is available on the VIA Embedded website.

There is a very active overclocking community called HWBot, with quite a few organizers here in Taiwan. For a very long time they were doing desktop (and laptop?) overclocking, challenging the hardware, and pushing the boundaries. When they were giving a presentation about their past, and future plans, they were really proud of making the desktop computer industry care more about hardware quality.


Now they want to do the same thing for smartphone hardware. They just recently released the beta version of their Android benchmarking app, HWBot Prime, and started to gather data for different devices. My HTC Butterfly (running a quad-core Snapdragon S4, I guess) did pretty well on it (whenever I could kill enough apps not to interfere with the benchmark).


The VIA Springboard (that I'm taking care of) is a single-board computer that can also run Android (4.0.3) besides Linux. It has a WM8950 single core 800MHz CPU, so it is not a match for the Butterfly, but the per-core-per-MHz results are better.


So far it's benchmarking only, without any overclocking yet, and I'm running the stock Android image, but it's a good baseline to start improving on. You cannot manage what you cannot measure, right?


The submitted result for the Springboard is on the HWBot leaderboard. I wonder if anyone else wants to benchmark, tune, and overclock their ARM devices?


The whole experience is written up in the VIA Springboard Blog.

As a result of the rapid proliferation of Android smart phones and tablets, embedded developers worldwide are increasingly adopting the operating system for a growing number of embedded systems and connected devices that leverage its rich application framework, native multimedia capabilities, massive app ecosystem, familiar user interface, and faster time to market.


However, although the benefits of adopting Android for embedded systems and devices can be great, particularly for touch-based multimedia applications, utilizing the OS also presents a number of critical challenges, including selecting the right ARM SoC platform for the target system application, porting and customizing the operating system and applications, and ensuring tight integration between the hardware and software to deliver a compelling end-user experience.




In addition to exploring the benefits and challenges of adopting Android for embedded applications, the attached white paper provides an overview of the holistic approach that VIA Embedded has established in order to enable developers to reduce product development times and speed up time to market for innovative new embedded Android systems and devices.


Holistic Approach

VIA is committed to supporting the entire product development life cycle, from defining product requirements all the way through development.

  • Best of Breed application specific ARM SoC platforms, with a comprehensive range of Freescale and VIA ARM SoCs
  • Small form factor ARM boards and systems, using VIA's expertise in creating practical form factor standards
  • Android software packages and customization services (see below)
  • Longevity, by supporting specific boards and systems up to 5 years


Android Customization


VIA Embedded provides a wide range of software solution packages and customization services to facilitate the development of Android embedded systems and devices:

  • Customized applications, including system apps
  • Kernel & Framework including security and special devices
  • System management including watchdog, remote monitoring, remote power on/off, silencing app and system upgrades
  • Embedded I/O including legacy I/O


VIA Android Smart Embedded Tool Kit (ETK)


The VIA Embedded Android Smart ETK includes a set of APIs that enable the Android application to access I/O and manageability services provided by the system hardware that are not supported in the Android framework.


APIs include:

  • Watchdog to help applications and the system to recover from failures and breakdowns
  • Scheduled power on/off, and periodic reboots
  • RTC Wake-up to auto power on at a specific time of the day, of the week, or of the month.
  • Legacy I/O Support making RS232, GPIO, I2C, and CAN bus available for apps



More details, and a case study, are presented in the attached white paper.

We would like to share with you an Android-based application that enables Intel software to run on ARM-based devices. To demonstrate this approach, we took the Intel version of DOOM and ran it on an Android-based device. The application is freely available on Google Play:

Original DOOM - Android Apps on Google Play

This application is a mixture of virtualization and binary translation technology that translates Intel x86 code to ARM code at runtime. Along with translation, this engine applies sophisticated optimization algorithms to bring a high performance experience to end users.

At the end of the day, we can bring desktop applications to mobile devices at no cost. In this particular case we took the original Intel x86 version of DOOM and launched it, with no alterations or modifications, on an ARM-based Android device. In the near future this approach could be extended to other applications.

For more details visit http://eltechs.com/exagear-mobile/



Ne10 FFT feature

Posted by yangzhang Dec 18, 2013

FFT feature in ProjectNe10

1 Introduction

Project Ne10 recently received an updated version of its FFT, which is heavily NEON optimized for both ARM v7-A/v8-A AArch32 and v8-A AArch64 and is faster than almost all other existing open source FFT implementations, such as FFTW and the FFT routines in OpenMax DL. This article gives a brief introduction.

2 Performance comparison with some other FFT’s on ARM v7-A

The following chart illustrates benchmark results for the complex FFT (32-bit float data type) of Ne10, FFTW and OpenMax DL. The test platform is an ARM Cortex-A9. The X-axis of the chart represents the length of the FFT; the Y-axis represents the execution time. Smaller is better.

From this chart, we can see that Ne10 is better than FFTW and OpenMax DL in most cases.

3 FFT on ARM v7-A/v8-A AArch32 and ARM v8-A AArch64

3.1 NEON usage


To utilize the NEON accelerator, we usually have two choices:

  • NEON assembly
  • NEON intrinsic

The following table describes the pros and cons of using assembly versus intrinsics.


NEON assembly:

  • Always shows the best performance for the specified platform.
  • Different ISAs (ARM v7-A/v8-A AArch32 and ARM v8-A AArch64) need different assembly implementations. Even for the same ISA, the assembly might need to be fine-tuned to achieve ideal performance across different micro-architectures.
  • Hard to read/write compared with C.


NEON intrinsics:

  • Performance depends heavily on the toolchain that is used.
  • Program once and run on different ISAs. The compiler may also grant performance fine-tuning for different micro-architectures.
  • Similar to C code, so it's easy to read/write.

3.2 ARM v7-A/v8-A AArch32 and v8-A AArch64 FFT implementations

According to the aforementioned pros and cons, intrinsics are preferred for the implementation of the Ne10 library.

But for the FFT, we still have different implementations for ARM v7-A/v8-A AArch32 and v8-A AArch64, for the reasons described below:

// Radix-4 butterfly with twiddle factors
scratch[0].r = scratch_in[0].r;
scratch[0].i = scratch_in[0].i;
scratch[1].r = scratch_in[1].r * scratch_tw[0].r - scratch_in[1].i * scratch_tw[0].i;
scratch[1].i = scratch_in[1].i * scratch_tw[0].r + scratch_in[1].r * scratch_tw[0].i;
scratch[2].r = scratch_in[2].r * scratch_tw[1].r - scratch_in[2].i * scratch_tw[1].i;
scratch[2].i = scratch_in[2].i * scratch_tw[1].r + scratch_in[2].r * scratch_tw[1].i;
scratch[3].r = scratch_in[3].r * scratch_tw[2].r - scratch_in[3].i * scratch_tw[2].i;
scratch[3].i = scratch_in[3].i * scratch_tw[2].r + scratch_in[3].r * scratch_tw[2].i;

The above code snippet lists the basic element of the FFT, the radix-4 butterfly. From the code, we can conclude that:

  • 20 64-bit NEON registers are needed if 2 radix-4 butterflies are executed per loop iteration.
  • 20 128-bit NEON registers are needed if 4 radix-4 butterflies are executed per loop iteration.

And, for ARM v7-A/v8-A AArch32 and v8-A AArch64,

  • There are 32 64-bit or 16 128-bit NEON registers for ARM v7-A/v8-A AArch32.
  • There are 32 128-bit NEON registers for ARM v8-A AArch64.


Considering these factors, in practice Ne10 eventually has an assembly version for ARM v7-A/v8-A AArch32, in which 2 radix-4 butterflies are executed per loop iteration, and an intrinsic version for ARM v8-A AArch64, in which 4 radix-4 butterflies are executed per loop iteration.
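To make the butterfly concrete, the snippet can be completed with its combine stage. The following plain scalar C sketch computes a forward 4-point DFT stage; the `cpx` type and `radix4_butterfly` function are illustrative names, not the Ne10 API, and Ne10's real implementation vectorizes this arithmetic with NEON:

```c
#include <stddef.h>

typedef struct { float r, i; } cpx;  /* complex float, like ne10_fft_cpx_float32_t */

/* Forward radix-4 butterfly: twiddle multiplies (as in the snippet above),
 * then the combine stage X[k] = sum_n s[n] * W^(n*k) with W = e^(-2*pi*i/4) = -i. */
void radix4_butterfly(cpx out[4], const cpx in[4], const cpx tw[3])
{
    cpx s[4];

    /* Twiddle multiplies -- same arithmetic as the assembly/intrinsic code. */
    s[0] = in[0];
    for (size_t n = 1; n < 4; n++) {
        s[n].r = in[n].r * tw[n - 1].r - in[n].i * tw[n - 1].i;
        s[n].i = in[n].i * tw[n - 1].r + in[n].r * tw[n - 1].i;
    }

    /* Combine stage: multiply by powers of -i and accumulate. */
    out[0].r = s[0].r + s[1].r + s[2].r + s[3].r;
    out[0].i = s[0].i + s[1].i + s[2].i + s[3].i;
    out[1].r = s[0].r + s[1].i - s[2].r - s[3].i;
    out[1].i = s[0].i - s[1].r - s[2].i + s[3].r;
    out[2].r = s[0].r - s[1].r + s[2].r - s[3].r;
    out[2].i = s[0].i - s[1].i + s[2].i - s[3].i;
    out[3].r = s[0].r - s[1].i - s[2].r + s[3].i;
    out[3].i = s[0].i + s[1].r - s[2].i - s[3].r;
}
```

With unity twiddle factors this is exactly a 4-point DFT. Every complex value in the routine is a candidate for a NEON register, which is why the number of butterflies unrolled per loop iteration is bounded by the register file sizes listed above.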


3.3 C/NEON performance boost

The following charts show the C-to-NEON performance boost in the ARM v8-A AArch32 and AArch64 modes on the same Cortex-A53 CPU of the Juno platform. Larger is better.

The blue bars show the data in the AArch32 mode. The NEON code is v7-A/v8-A AArch32 assembly. The toolchain used is GCC 4.9.

The red bars show the data in the AArch64 mode. The NEON code uses intrinsics, whose performance depends greatly on the toolchain. The toolchain used here is LLVM 3.5.

From these charts, we can conclude that the float complex FFT shows a similar or better performance boost in the AArch64 mode than in the AArch32 mode. For the int32/int16 complex FFT, however, the boost in the AArch32 mode is usually larger than in the AArch64 mode (this does not mean the int32/int16 complex FFT runs faster in the AArch32 mode than in the AArch64 mode!).

The data from this exercise is useful for analyzing the performance boost in the ARM v8-A AArch64 mode, but we still need more data to verify and reinforce these findings.

3.4 AArch32/AArch64 performance boost

The following charts take the performance of the AArch32 C version as the baseline, and show the performance ratios of the AArch32 NEON version, the AArch64 C version, and the AArch64 NEON version on the same Cortex-A53 CPU of the Juno platform. Larger is better.

From these charts, we can conclude that the FFT performs faster in the AArch64 mode than in the AArch32 mode, for both C and NEON.

4 Usage

4.1 APIs

The FFT still supports the following features:

  • c2c FFT/IFFT: lengths of 2^N (N is 2, 3, …)
  • r2c FFT: lengths of 2^N (N is 3, 4, …)
  • c2r IFFT: lengths of 2^N (N is 3, 4, …)

But the APIs have changed. Existing users need to update to the latest version, v1.1.2, or master.

For more API details, please check http://projectne10.github.io/Ne10/doc/group__C2C__FFT__IFFT.html.
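Per the size list above, the c2c transforms require power-of-two lengths with N >= 2. A caller can validate a length up front with a trivial helper (hypothetical, not part of the Ne10 API):

```c
/* Hypothetical helper (not part of Ne10): returns non-zero when n is a
 * power of two and at least 4, i.e. 2^N with N >= 2, as the c2c FFT requires. */
static int is_valid_c2c_fft_size(unsigned int n)
{
    return n >= 4 && (n & (n - 1)) == 0;
}
```

The `n & (n - 1)` trick clears the lowest set bit, so it yields zero exactly for powers of two.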


4.2 Example

Taking the float c2c FFT/IFFT as an example, the current APIs are used as follows.

#include "NE10.h"

    ne10_int32_t fftSize = 16;  /* must be 2^N (N is 2, 3, 4, 5, 6...) */
    ne10_fft_cpx_float32_t *in, *out;
    ne10_fft_cfg_float32_t cfg;

    in  = (ne10_fft_cpx_float32_t*) NE10_MALLOC (fftSize * sizeof (ne10_fft_cpx_float32_t));
    out = (ne10_fft_cpx_float32_t*) NE10_MALLOC (fftSize * sizeof (ne10_fft_cpx_float32_t));
    cfg = ne10_fft_alloc_c2c_float32 (fftSize);

    /* Forward FFT (last argument 0) */
    ne10_fft_c2c_1d_float32_neon (out, in, cfg, 0);

    /* Inverse FFT (last argument 1) */
    ne10_fft_c2c_1d_float32_neon (out, in, cfg, 1);

    NE10_FREE (in);
    NE10_FREE (out);
    NE10_FREE (cfg);


5 Conclusion

This FFT work shows that you can get a significant performance boost in the ARM v8-A AArch64 mode. You may find more use cases, of course. We welcome feedback, and are looking to publish use cases to cross-promote Project Ne10 and the projects that use it.

For more details, please access http://projectne10.github.com/Ne10/
