In March this year, Arm introduced the next-generation Armv9 architecture with enhanced security and artificial intelligence (AI) capabilities. One of the most relevant improvements for security and debugging is support for the Memory Tagging Extension (MTE), a key feature of the new Armv9 CPUs. MTE aims to mitigate memory-related vulnerabilities and so improve the security of connected devices. In this blog post, we share the motivation behind MTE, an overview of the technology, and how it helps developers.
C and C++ are used for most end-user software on mobile and client devices, across fields such as operating systems, machine learning (ML) libraries, games, and embedded software. These languages are widely used because they offer features that many other programming languages do not, notably fast execution and the ability to manipulate memory addresses directly, both of which are essential for low-level programming. On the other hand, direct manipulation of memory addresses can cause system malfunctions if not done correctly, and those malfunctions can become software vulnerabilities. An attacker could exploit these vulnerabilities to access sensitive information, maliciously alter the program’s behavior, or even take full control of its control flow. The resulting damage can be extensive, and some of it cannot be measured in money alone.
The most important classes of security vulnerability in code written in C/C++ languages are violations of memory safety. According to a Google security blog post, most Android vulnerabilities are caused by use-after-free (UAF) and out-of-bounds (OOB) reads/writes.
Figure 1: Vulnerabilities by cause in Android (from this source)
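To make these two bug classes concrete, here is a minimal C sketch. It is our own illustration, not code taken from Android, and the function names (clear_correct, clear_off_by_one, canary_clobbered) are hypothetical. The off-by-one loop is deliberately run against a backing array with one spare byte, so the stray write can be observed without invoking undefined behavior. The use-after-free pattern appears only as a comment, because executing it would be undefined behavior.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Correct: clears exactly n bytes, indices 0 .. n-1. */
static void clear_correct(uint8_t *buf, size_t n)
{
    for (size_t i = 0; i < n; i++)
        buf[i] = 0;
}

/* BUG (out-of-bounds write): "<=" also touches index n, one byte past
 * the n-byte region the caller asked to clear. */
static void clear_off_by_one(uint8_t *buf, size_t n)
{
    for (size_t i = 0; i <= n; i++)
        buf[i] = 0;
}

/* Runs a clear routine on the first n bytes of a backing array that has
 * one spare "canary" byte, and reports whether the canary was hit.
 * The spare byte keeps the stray write inside a real object, so this
 * demonstration itself stays well defined. */
static int canary_clobbered(void (*clear)(uint8_t *, size_t))
{
    uint8_t backing[9];
    const size_t n = 8;               /* logical buffer: backing[0..7] */

    memset(backing, 0xAA, sizeof backing);
    clear(backing, n);
    return backing[n] != 0xAA;        /* backing[8] is the canary */
}

/* Use-after-free pattern, shown as a comment only:
 *
 *     char *p = malloc(16);
 *     free(p);
 *     p[0] = 'x';   // p is now a dangling pointer
 */
```

Calling canary_clobbered(clear_correct) reports an intact canary, while canary_clobbered(clear_off_by_one) reports that the byte just past the buffer was overwritten. In a real program that neighboring byte could belong to an unrelated object, which is what makes this class of bug exploitable.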
MTE has been developed in collaboration with Google with the aim of detecting memory safety bugs both in existing codebases and in new code as it is written. Google recently announced that the Android Open Source Project (AOSP) supports the Rust programming language for developing the OS itself, so some software components may transition to that safer language. However, MTE offers an effective solution for the vast ecosystem of existing C/C++ code. MTE can also be a great tool for application developers as well as OS developers, because it allows them to quickly find common memory bugs and speed up their production pipeline.
Memory safety bugs happen when software accidentally accesses memory outside the size or address range it was allocated. Some examples of memory safety bugs can be found here. Unfortunately, these bugs are often very difficult to detect and fix, because the erroneous state must actually be triggered while the code runs in order to be detected. Doing so requires large-scale testing efforts, which take a great deal of time and money. Bug fixing is also a long and costly process.
Figure 2: Steps for fixing bugs (from this source)
For a complex C/C++ project, it is not uncommon for only a handful of people to be able to develop and review the fix. Moreover, even when they put a lot of effort into fixing such bugs, some bugs remain unfixed. The following figure shows the age of memory safety bugs in Android. It indicates that some bugs remain undetected and unfixed for years after they were first introduced.
Figure 3: Age of memory safety bugs in Android (from this source)
Also, the later a bug is discovered, the longer and more costly the fixing process becomes. Therefore, early detection is critical to reducing the risks of these memory safety bugs. Memory safety bugs are not an Android-specific problem, either. MTE will mitigate security vulnerabilities by helping developers identify and fix them and by discouraging malicious attackers from exploiting them.
MTE provides a mechanism for detecting both use-after-free and out-of-bounds bugs. It has also been designed so that most applications require no source code modifications.
The following figure illustrates the concept of MTE. MTE’s underlying model is a "lock and key" scheme. When memory is allocated or freed, it is given a tag (the lock). All accesses to that memory must then be made through an address that carries the same tag (the key). If the lock and key do not match, the CPU raises an error.
Figure 4: Lock and key scheme in MTE
For the first two pointers in the figure, the key matches the lock of the accessed location, so accesses using these pointers succeed as normal. For the last two pointers, however, the key does not match the lock of the accessed location, and this is captured as a tag check failure. With this mechanism, hard-to-catch memory safety errors can be detected easily: even rare bugs are detected as soon as they are triggered, which also aids general debugging. More information can be found in the architecture documentation page. These two articles on AnandTech and Semiconductor Engineering, and this white paper on MTE, also provide excellent introductions to MTE. MTE can be applied to both the heap and the stack, and the two cases are handled slightly differently. In this blog, we talk mostly about heap tagging.
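The lock and key scheme described above can be sketched as a small software model in C. This is a toy simulation under stated assumptions, not how the hardware or any real allocator is implemented. The names (tagged_ptr, toy_alloc, toy_free, checked_store, checked_load) are hypothetical. Tags are handed out in rotation so the example is reproducible, and the bump allocator never reuses freed memory. The two architectural details it relies on are that MTE tags are 4 bits wide and apply to 16-byte granules of memory.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define GRANULE   16u                /* MTE tags memory per 16-byte granule */
#define ARENA     256u               /* toy heap size in bytes */
#define NGRANULES (ARENA / GRANULE)

/* Toy pointer: real hardware keeps the key in the pointer's upper bits;
 * here it is an explicit field. */
typedef struct {
    size_t  offset;                  /* start of the allocation in the arena */
    uint8_t key;                     /* 4-bit tag carried by the pointer */
} tagged_ptr;

static uint8_t arena[ARENA];         /* the memory itself */
static uint8_t locks[NGRANULES];     /* one lock per granule; 0 = never allocated */
static size_t  next_free;            /* bump-allocator cursor */
static uint8_t last_tag;             /* last tag handed out */

/* Hands out tags 1..15 in rotation, skipping the tag in `avoid`. */
static uint8_t fresh_tag(uint8_t avoid)
{
    do {
        last_tag = (uint8_t)(last_tag % 15u + 1u);
    } while (last_tag == avoid);
    return last_tag;
}

static size_t granules_for(size_t bytes)
{
    return (bytes + GRANULE - 1u) / GRANULE;
}

/* Allocation: pick a tag, lock every granule with it, return the key. */
static tagged_ptr toy_alloc(size_t bytes)
{
    size_t n = granules_for(bytes);
    size_t first = next_free / GRANULE;
    assert(bytes > 0 && first + n <= NGRANULES);  /* toy arena must not run out */

    /* Avoid the previous granule's tag so adjacent allocations differ. */
    uint8_t tag = fresh_tag(first > 0 ? locks[first - 1] : 0);
    for (size_t g = 0; g < n; g++)
        locks[first + g] = tag;

    tagged_ptr p = { next_free, tag };
    next_free += n * GRANULE;
    return p;
}

/* Free: change the lock so that stale keys no longer fit. */
static void toy_free(tagged_ptr p, size_t bytes)
{
    size_t first = p.offset / GRANULE;
    uint8_t tag = fresh_tag(p.key);
    for (size_t g = 0; g < granules_for(bytes); g++)
        locks[first + g] = tag;
}

/* The check applied to every access: does the key fit the lock? */
static bool tag_check(tagged_ptr p, size_t index)
{
    size_t addr = p.offset + index;
    return addr < ARENA && locks[addr / GRANULE] == p.key;
}

/* Accesses that return false model a tag check failure. */
static bool checked_store(tagged_ptr p, size_t index, uint8_t value)
{
    if (!tag_check(p, index))
        return false;
    arena[p.offset + index] = value;
    return true;
}

static bool checked_load(tagged_ptr p, size_t index, uint8_t *out)
{
    if (!tag_check(p, index))
        return false;
    *out = arena[p.offset + index];
    return true;
}
```

In this model, a store through a 16-byte allocation at index 16 fails because it lands in the neighboring allocation's granule, which carries a different lock (an out-of-bounds access). Any access through a pointer after toy_free fails because the granules were retagged (a use-after-free). Because tags apply to whole granules, an overflow that stays inside the last 16-byte granule of an allocation is not caught by this check.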
MTE is built into the first generation of Armv9 CPUs, which will be available next year. Software support for MTE is being introduced as part of Android 12. There is therefore no MTE-enabled hardware yet, but you can see how MTE works by simulating it in the Fixed Virtual Platform (FVP) environment. An FVP is designed to accurately simulate a complete system for software development, which is especially useful when developers are writing code for new technologies such as MTE. A prebuilt FVP model is available from the Arm Developer site under “Armv-A Base RevC AEM FVP”. To build and run Android with MTE support on the FVP, you can follow the instructions on this page. Please note that the instructions use the master branch of AOSP, which is not a release supported by Google.
To demonstrate how MTE works in a real case, we selected a vulnerability (CVE-2020-13790) in the libjpeg-turbo library that was fixed in 2020. libjpeg-turbo is a JPEG image codec used in Android. The vulnerability was a heap-based buffer overflow triggered when loading a malformed image file.
We reproduced the bug in the FVP environment and confirmed that MTE can easily detect it. When MTE discovers a bug, the process is terminated with a segmentation fault, and its crash dump can be printed with Android’s logcat command. If you are using a debugger from Android Studio, the fault is caught in a similar way to any other crash. The following is a snippet of the crash report output by the logcat command.
Output 1: An example snippet from an MTE crash report
You can see that MTE detected a buffer overflow at a specific address. In addition, the crash report shows the process ID, the thread ID, the cause of the crash, and the address of the bad memory access. This is followed by the contents of the CPU registers at the time the signal was received and the backtrace of the stack frames. Users can use this information to locate a bug more easily than before and start working on a fix. The basics of crash dumps in logcat output can be found here. For example, the function name and source file location of the bug can be easily identified using the addr2line command, as shown below. Future real Android devices may show the location in a slightly more polished way in the report.
Output 2: Identifying bug location
You may be wondering what the performance overhead of MTE is. To address this, the architecture provides both a synchronous and an asynchronous mode for reporting tag comparison failures. The following table compares the features of the two modes.
Table 1: MTE’s two modes to report tag comparison failures
Synchronous checking makes debugging simpler, because it identifies the precise instruction and address that caused the failure. However, it has a slightly larger performance impact. The overhead varies with the type of processor; it is perfectly acceptable in a development environment but might be too high for deployment. Asynchronous mode, on the other hand, is less costly, with an estimated performance overhead of 1-2% across the workloads and benchmarks tested, which makes it generally acceptable even on production systems. Although asynchronous checking provides less precise information about where the tag comparison failure occurred, it can still provide some mitigation and be used for profiling. Profiling identifies problem areas, narrowing down the search area for bugs. This flexibility lets you trade lower overhead against more precise reporting. Please note that the performance overhead figures may change as development continues.
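The difference between the two reporting modes can be sketched with a small C model. This is a simplified simulation for illustration only, and the type and function names (mte_checker, mte_access, mte_poll_async) are hypothetical. It assumes, in line with our understanding of the architecture, that a synchronous mismatch stops the access and records the exact address, whereas an asynchronous mismatch lets the access continue and only sets a sticky flag that is noticed later, for example the next time the kernel is entered.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { MTE_MODE_SYNC, MTE_MODE_ASYNC } mte_mode;

typedef struct {
    mte_mode  mode;
    bool      fault_pending;    /* async: sticky "a mismatch happened" flag */
    bool      has_fault_addr;   /* sync: a precise faulting address is known */
    uintptr_t fault_addr;       /* valid only when has_fault_addr is true */
} mte_checker;

/* Models one memory access with the given key against the given lock.
 * Returns true if the access is allowed to complete, false if it is
 * stopped by a synchronous tag check fault. */
static bool mte_access(mte_checker *c, uintptr_t addr, uint8_t key, uint8_t lock)
{
    assert(c != NULL);

    if (key == lock)
        return true;            /* key fits the lock: normal access */

    if (c->mode == MTE_MODE_SYNC) {
        /* Precise report: we know exactly which access failed. */
        c->has_fault_addr = true;
        c->fault_addr = addr;
        return false;
    }

    /* Async: remember that something went wrong, but do not record
     * where, and let execution continue. */
    c->fault_pending = true;
    return true;
}

/* Models the later point at which an asynchronous mismatch is noticed.
 * Returns true once for a pending fault, then clears the flag. */
static bool mte_poll_async(mte_checker *c)
{
    assert(c != NULL);

    bool pending = c->fault_pending;
    c->fault_pending = false;
    return pending;
}
```

In synchronous mode, mte_access on a mismatch returns false and leaves the faulting address in the checker, which is what makes debugging straightforward. In asynchronous mode the same call returns true, and the failure only surfaces at the next mte_poll_async, without an address. The per-access path does less work and the report is less precise, which mirrors the overhead versus accuracy trade-off described above.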
There are several software mitigation techniques targeting memory safety, such as AddressSanitizer (ASan) and Hardware-assisted AddressSanitizer (HWASan). However, these are costly in terms of performance, which makes them unsuitable for widespread deployment. MTE aims to work even in production code with a smaller performance drop. This is very important because some bugs that occur in production code do not occur in development code. MTE finds more bugs at a fraction of the cost.
The Android platform support for MTE will be completed by the time chips with MTE are released at the end of 2021 or the start of 2022. The supporting materials on the Arm Developer site also contain plenty of insights and information to help partners and developers enhance memory safety for security. We are really excited that MTE can contribute to increased memory safety by detecting bugs that were not easily found before. We look forward to the wider deployment of MTE in the Android software and hardware ecosystem.
Sign up to the Arm DevSummit technical sessions, where you will learn more about achieving memory safety in Android-S with MTE.