In the newest version of Android, Lollipop (5.0), the virtual machine (VM) implementation has changed from Dalvik to ART. Like most VMs, ART has an interpreter for executing the bytecode of an application, but it also uses an ahead-of-time (AOT) compiler to generate native code. This compilation takes place for the majority of the Java methods in an app when the app is initially installed. The old VM, Dalvik, only produced native code from bytecode as the app was executed, a process called just-in-time (JIT) compilation.
ART currently provides a single compiler for this AOT compilation, called the quick compiler. This backend is relatively simple as compilers go: it maps most bytecodes 1:1 onto fixed sequences of machine instructions, and performs a few basic optimisations on top of this. More backends are in various stages of development, such as the portable backend and the optimizing backend. As the complexity of a backend increases, so too does its potential to introduce subtle bugs into the execution of bytecode. In the rest of this post, we will use the term "backend" to refer to the different ways in which code can be executed by ART, be it the interpreter, the quick compiler, or the optimizing compiler; the terms "quick compiler" and "quick backend" should be considered equivalent.
In this post we will consider how we can check that we aren't introducing new bugs as these backends are developed.
A test suite is useful, but it is limited in size, and may only test for regressions of bugs the developers have found in the past. Some errors in the VM may not have been detected yet, and there are always rare cases arising from unexpected code sequences. While some bugs may just cause the compiler to crash, or create a program that produces slightly incorrect output, other bugs can be more dangerous. Many of them lurk at the fringes of what we would consider "normal" program behaviour, and an attacker who can deliberately trigger these fringe behaviours may be able to turn such a bug into a security exploit.
How do we find these bugs? Fuzz testing (also commonly known as "fuzzing") can allow us to test a greater range of programs. Fuzz testing generally refers to random generation of input to stress test the capabilities of a program or API, particularly to see how it can handle erroneous input. In this case, we generate random programs to see how the backends of ART deal with verifying, compiling and executing them. Before we discuss our fuzz testing strategy in more detail, let's look at how apps are executed in Android.
Let's take a look at a simple Java method, and watch how this code is transformed into a sequence of A64 instructions.
public int doSomething(int a, int b) {
    if (a > b) {
        return (a * 2);
    }
    return (a + b);
}
In Android software development, all Java source files are first compiled to Java bytecode, using the standard javac tool. However, the Java bytecode format (JVM bytecode) used by Java VMs is not the same as the bytecode used in ART. The dx tool is used to translate from JVM bytecode to the executable bytecode used by ART, which is called DEX (Dalvik EXecutable, a holdover from when the VM was called Dalvik). The DEX code for this Java code looks like:
0000: if-le v2, v3, 0005
0002: mul-int/lit8 v0, v2, #int 2
0004: return v0
0005: add-int v0, v2, v3
0007: goto 0004
In this case, the virtual registers v2 and v3 are the method's parameters, a and b, respectively. For a good reference on DEX bytecode, you can consult this document, but essentially this code compares a to b, and if a is less-than-or-equal-to b it adds a to b and returns that result. Otherwise, it multiplies a by 2 and returns that.
When ART loads this code, it typically compiles the bytecode using the quick backend. This compilation will produce a function that roughly follows the ARM Architecture Procedure Call Standard (AAPCS) used with A64 code - it will expect to find its arguments in r2 and r3*, and will return the correct result in r0. Here is the A64 code that the quick backend will produce, with some simplifications:
// Reminder: w2 is the 32-bit view of register r2 in A64 code!
[-- omitted saving of registers w20-w22 to the stack --]
        mov w21, w2
        mov w22, w3
        cmp w21, w22
        b.le doAdd
        lsl w20, w21, #1    // (NB: this is w21 * 2)
doLeave:
        mov w0, w20
[-- omitted loading of registers w20-w22 from the stack --]
        ret
doAdd:
        add w20, w21, w22
        b doLeave
*(Why not r0 and r1? Because r0 is reserved for passing the context of the method that is currently being executed. r1 is used for the implicit first argument of any non-static method - the reference to the this object.)
Before code can be compiled or executed by any backend, the bytecode must always be verified. Verification involves checking various properties of the bytecode to ensure it is safe to execute - for example, checking that the inputs to a mul-float bytecode are actually float values, or checking that a particular method can be executed from the class we are currently executing within. Many of these properties are checked when the program is compiled from Java source to DEX bytecode, and violations are reported as compiler errors. However, it is important to perform full bytecode verification when apps are about to be executed, to defend against security exploits that manipulate DEX files directly.
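To make the mul-float example concrete, here is a greatly simplified sketch of the kind of type tracking a verifier performs as it walks through a method's bytecode. The Insn and RegType types here are illustrative inventions for this post, not ART's actual data structures:

import java.util.HashMap;
import java.util.Map;

public class TypeTrackingSketch {
    enum RegType { INT, FLOAT, OBJECT_REF, UNKNOWN }

    interface Insn {
        String opcode();
        int dest();
        int[] sources();
    }

    // Walk the bytecode in order, recording what kind of value each virtual
    // register holds, and reject any bytecode whose inputs have the wrong type.
    static boolean verify(Iterable<Insn> code) {
        Map<Integer, RegType> types = new HashMap<>();
        for (Insn insn : code) {
            switch (insn.opcode()) {
                case "const":
                    types.put(insn.dest(), RegType.INT); // simplification: a const may also feed float uses
                    break;
                case "mul-float":
                    for (int src : insn.sources()) {
                        if (types.get(src) != RegType.FLOAT) {
                            return false; // input is not known to be a float
                        }
                    }
                    types.put(insn.dest(), RegType.FLOAT);
                    break;
                // ... one case per bytecode ...
            }
        }
        return true;
    }
}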
Once verification has taken place at run time, ART will load the arguments for the method into the correct registers, and then jump straight to the native code. Alternatively, ART could use its interpreter to interpret the input DEX bytecode as Dalvik would traditionally have done before attempting JIT compilation. Any bytecode that is executed as native code should do the exact same thing when it is executed in the interpreter. This means that methods should return the same results and produce the same side-effects. We can use these requirements to test for flaws in the various backend implementations. We expect that any code that passes the initial verification should be compilable, and some aspects of compilation will actually rely on properties of the code that verification has proven. Contracts exist between the different stages of the VM, and we would like to be assured that there are no gaps between these contracts.
We have developed a fuzz tester for ART that uses mutation-based fuzzing to create new test cases from already-written Java programs. ART comes with an extensive test suite for checking the correctness of the VM, and with a mutation-based fuzz tester we can use these provided tests as a base from which to investigate more of the VM's corner cases.
The majority of these test programs produce some kind of console output - or at the very least, output any encountered VM errors to the console. The test suite knows exactly what output each test should produce, so it runs the test and confirms that the output has not changed. With mutation-based fuzzing, we take a test program and modify it slightly, so the output may have changed, or the program may now produce an error. Since we no longer know what output to expect, we can instead use the fact that ART has multiple backends, and verify that they all execute the program in the same way. Note, however, that this approach is not foolproof, as it may be the case that all of the backends execute the program in the same, incorrect way. To mitigate this, it is also possible to test program execution on the previous VM, Dalvik, as long as some known differences between the two VMs are tolerated (e.g. the messages they use to report errors). As we increase the number of backends under test, the likelihood that they are all wrong in the same way should decrease.
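In code, the heart of this differential strategy is only a few lines. The sketch below assumes a hypothetical runOnBackend helper that launches ART configured to use the given backend and captures the program's console output; how ART is actually invoked is elided:

import java.util.Arrays;
import java.util.List;

public class DifferentialCheck {
    static boolean outputsAgree(String dexFile) throws Exception {
        List<String> backends = Arrays.asList("interpreter", "quick", "optimizing");
        String reference = runOnBackend(dexFile, backends.get(0));
        for (String backend : backends) {
            if (!reference.equals(runOnBackend(dexFile, backend))) {
                return false; // divergence: at least one backend is likely buggy
            }
        }
        return true; // agreement is necessary, though not proof of correctness
    }

    // Hypothetical helper: run the DEX file on one backend and return its output.
    static String runOnBackend(String dexFile, String backend) throws Exception {
        throw new UnsupportedOperationException("illustrative only");
    }
}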
This diagram shows the fuzzing and testing process. First, the fuzzer parses the DEX file into a form to which it can apply various mutations. It randomly selects a subset of the program's methods to mutate, and for each one, it randomly selects a number of mutations to apply. The fuzzer then writes out a new DEX file containing the mutated code, and executes this program using the various backends of the ART VM.
Note that all backends pass through a single verifier, and that some backends have been simplified in this diagram - the quick and optimizing backends are technically split up into compilation and execution phases, while the interpreter only has an execution phase. Ultimately, the execution of the mutated DEX file should produce some kind of output from each backend, and we compare these outputs to find bugs. In this example, the fact that the optimizing backend produces "9" instead of "7" strongly suggests there is a bug with the way the optimizing backend has handled this mutated code.
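A sketch of the fuzzer's mutation loop, as just described, might look like the following. Method and Mutation are stand-ins for the fuzzer's parsed-DEX representation, and the selection probabilities are arbitrary choices for illustration:

import java.util.List;
import java.util.Random;

public class MutationLoop {
    interface Mutation { void applyTo(Method method, Random rng); }
    static class Method { /* the parsed bytecode of one method */ }

    static void mutateProgram(List<Method> methods, List<Mutation> mutations, Random rng) {
        for (Method method : methods) {
            if (rng.nextInt(10) != 0) {
                continue; // mutate roughly 1 in 10 methods
            }
            int count = 1 + rng.nextInt(3); // apply 1 to 3 mutations to this method
            for (int i = 0; i < count; i++) {
                Mutation mutation = mutations.get(rng.nextInt(mutations.size()));
                mutation.applyTo(method, rng);
            }
        }
    }
}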
So how do we do this fuzzing? A naive approach would be to take the DEX file and flip bits randomly to produce a mutated DEX file. However, this will almost always produce a DEX file that fails verification. A large part of the verification process checks that the structure of the DEX file format is sound, and this includes a checksum in the file's header - randomly flipping bits across the whole file will almost certainly invalidate this checksum, and will likely break some part of the file's structure too. A better approach is to focus on applying minor mutations to the sections of the program that directly represent executable code.
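One consequence is that, after mutating the code, the fuzzer must rewrite the header checksum, or the file will be rejected before bytecode verification even begins. The DEX checksum is an Adler-32 over everything after the magic and checksum fields - that is, from byte offset 12 to the end of the file. Here is a sketch of patching it in place (the SHA-1 signature field that follows it may also need recomputing, which is elided here):

import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.zip.Adler32;

public class DexChecksum {
    static void fixChecksum(byte[] dex) {
        Adler32 adler = new Adler32();
        adler.update(dex, 12, dex.length - 12); // skip the 8-byte magic and 4-byte checksum
        ByteBuffer.wrap(dex)
                  .order(ByteOrder.LITTLE_ENDIAN)     // DEX files are little-endian
                  .putInt(8, (int) adler.getValue()); // the checksum field lives at offset 8
    }
}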
We limit our mutations to a few simple changes to bytecodes - for example, changing which virtual register a bytecode operates on, as in the bug we describe below. Individually, these changes are unlikely to break the verification of the DEX file, but in combination they may lead to differences in the way the program executes. At the same time, we do not want every mutation to result in a legal bytecode state, because we wish to search for holes in the verification of the program. Such a hole often leads to the compiler making an incorrect assumption about the code it is compiling, which then manifests as a difference in output between the compiler and the interpreter.
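As one concrete example, here is a sketch of a mutation that redirects a randomly chosen bytecode to read from a different virtual register - the kind of change behind the bug described below. The Insn interface is an illustrative stand-in for the fuzzer's decoded-instruction type:

import java.util.List;
import java.util.Random;

public class RegisterMutation {
    interface Insn {
        int sourceRegisterCount();
        void setSourceRegister(int index, int vreg);
    }

    static void apply(List<Insn> code, int registersInMethod, Random rng) {
        Insn insn = code.get(rng.nextInt(code.size()));
        if (insn.sourceRegisterCount() == 0) {
            return; // this bytecode reads no registers, so there is nothing to retarget
        }
        int which = rng.nextInt(insn.sourceRegisterCount());
        insn.setSourceRegister(which, rng.nextInt(registersInMethod));
    }
}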
Now we present one of the bugs that we have found and fixed in the Android Open Source Project's (AOSP) code base, using this fuzz testing strategy.
When presented with a bytecode that reads an instance field of an object, such as iget v0, v1, MyClass.status (this writes into v0 the value of the "status" field of the object referred to by v1), the verifier did not confirm that v1 actually contained a reference to an object.
Here's a sequence of bytecodes that creates a new MyClass instance, and sets the status field to its initial value + 1:
const v0, 1
new-instance v1, MyClass
invoke-direct {v1} void MyClass.<init>() // calling MyClass() constructor
iget v2, v1, MyClass.status
add-int v2, v0, v2
iput v2, v1, MyClass.status
If a mutation changed the v1 on line 4 to v0, then iget would now take the constant 1 currently in v0 as its input, instead of the reference to an object that was in v1. Previously, the verifier would not report this as an error when it should have, and so the compiler (which expects the iget bytecode to have been properly verified) would assume an object reference was in the input register for iget, and simply read from the value of that register plus the offset of the status field. If an attacker ensured that an address they wanted to read from was used as the loaded constant, they could read from any memory address in the process' address space. Java deliberately removes the ability to read memory directly (without some mechanism such as JNI), to ensure that, for instance, private fields of classes cannot be accessed from within Java - but this bug made exactly that possible.
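For clarity, here is the same sequence after such a mutation - note that iget now reads from v0, which holds the int constant 1 rather than an object reference:

const v0, 1
new-instance v1, MyClass
invoke-direct {v1} void MyClass.<init>() // calling MyClass() constructor
iget v2, v0, MyClass.status // BUG: v0 holds the constant 1, not a reference
add-int v2, v0, v2
iput v2, v1, MyClass.status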
While this particular bug was present in the verifier, other bugs have been found and fixed in the quick backend of ART. For some of these bugs, we have contributed patches to the AOSP code base, while other bugs have been reported to the ART team. As a result of our fuzz testing efforts, new tests have been added to ART's test suite that are buildable directly from a description of DEX bytecode, whereas previously all tests had to be built from Java source code. This was necessary because many bugs we have found arise from specially crafted pieces of bytecode that the javac and dx tools would not generate themselves. We have aimed to submit DEX bytecode tests with any patches we submit to AOSP.
In this post we have looked at how fuzz testing can help the development of new backends for a virtual machine, specifically the ART VM that now powers Android. From the roughly 200 test programs already present in ART's test suite, we have produced a significantly larger number of new tests using fuzzing. Each additional program used for testing increases our confidence that the implementation of ART is sound. Most of the bugs we found affected the quick backend of ART as it was being developed in AOSP, but as new bugs could arise from complicated interactions between optimisations in the optimizing backend, fuzz testing will increase our chances of finding such bugs and squashing them early.
The initial research into fuzzing was performed by Barton Miller at UW-Madison.
Paul Sabanal fuzzed the experimental version of ART that shipped in KitKat, and found a few crashes. He presented this work at HITB2014.
For more information about differential testing, see the various papers written about Csmith, a tool that uses this approach to test C compilers.
Researchers at UC Davis recently presented work about Equivalence Modulo Inputs, where seed programs are fuzzed to produce new programs that are expected to produce the same output as the seed program for a given set of inputs. All produced programs are then compiled and executed, and divergences in output indicate miscompilations.