Part 3: Enabling PAC and BTI on AArch64 for Linux

November 20, 2024

10 minute read time.

This is Part 3 of a 3-part blog series. See Part 1 and Part 2.

In Part 2, we looked at enabling PAC and BTI together, optimizations, and the hint space. In Part 3, we look at C++ style exception handling, how DWARF interacts with runtimes to provide this support, and the modifications needed to support PAC. We also look at using the other PAC signing key and adding support for it in the assembly code.

Source code for the examples can be found at https://gitlab.arm.com/pac-and-bti-blog/blog-example. The repository tag for each example is given on a "Tag:" line before its source.

Exception Handling: DWARF and CFI

If you want exceptions to propagate across assembly routines, you must annotate those routines with CFI directives. CFI, or Call Frame Information, describes how to unwind each call frame; the .cfi_* assembler directives generate the DWARF data that the unwinder needs to unwind the call frames and stack when a C++ exception is thrown. DWARF itself is often described as a Turing complete, stack-based virtual machine, and the CFI directives can be thought of as programming that virtual machine. At run time, the unwinder interprets this DWARF code to recover each caller's registers while it searches for an exception handler. Let's modify our program to throw an exception and ensure it gets handled.

Tag: Example-7

Makefile:

ASFLAGS ?= $(CXXFLAGS)

OBJS := main.o \
	call_function.o

main: $(OBJS)
	$(CXX) $(CXXFLAGS) $(LDFLAGS) -o $@ $^

.PHONY: clean
clean:
	@printf "Cleaning...\n" && rm -rf $(OBJS) main

call_function.S:

#include "aarch64.h"

.section .text
.global call_function

// Function prototype
// void call_function(void (*func)())
call_function:
    .cfi_startproc
    SIGN_LR
    CFI_WINDOW_SAVE
    // Save link register and frame pointer, allocating enough space for
    // saving the return location.
    stp x29, x30, [sp, #-16]!
    .cfi_def_cfa_offset 16
    .cfi_offset 29, -16
    .cfi_offset 30, -8
    mov x29, sp

    // x0 holds the caller's first argument: branch with link to the
    // function it points to. blr places the return address in x30 (lr),
    // which is why the caller's lr was saved on the stack above.
    blr x0
return_loc:
    // Restore link register and frame pointer
    ldp x29, x30, [sp], #16

    .cfi_restore 30
    .cfi_restore 29
    .cfi_def_cfa_offset 0

    // Return from the function
    VERIFY_LR
    ret
    .cfi_endproc

main.cpp:

#include <iostream>

// Declaration of the assembly routines
extern "C" {
void call_function(void (*func)());
}

static void my_exception() {
    std::cout << "Throwing exception..." << std::endl;
    throw 42;
}

int main() {
    try {
        // Call the assembly routine indirectly, through a function pointer,
        // and pass it the function that it should call.
        void (*fn)(void (*func)()) = call_function;
        fn(my_exception);
    } catch (int e) {
        std::cout << "Caught exception: " << e << std::endl;
    }
    return 0;
}

Now compile and run the C++ example:

make clean
CXXFLAGS="-mbranch-protection=standard" make
./main
Throwing exception...
Caught exception: 42

The major difference from our previous examples is that main.c has been replaced by main.cpp so that we can use C++ exceptions; main.c is no longer needed and can be removed. We also modified call_function to call the C++ routine that throws the exception using blr rather than br, so my_jump is no longer needed. Finally, the code was augmented with the required CFI directives. Note that clang and gcc emit CFI directives in their output when generating assembly from C/C++ code with the -S option. We can now examine how an exception propagates through an assembly layer and how the various parts of the runtime make use of this information.
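For comparison, here is a hypothetical C++ counterpart of call_function (call_function_cxx, throw_42 and run_demo are illustrative names, not part of the example repository). Because the compiler generates this function, it also emits the matching .cfi_* directives itself, and with -mbranch-protection it typically adds the PAC/BTI instructions too, so compiling it with -S is one way to see what the hand-written directives should look like. At higher optimization levels the call may be turned into a tail call, in which case no frame is set up at all.

```cpp
#include <cassert>

// Hypothetical C++ equivalent of the assembly call_function: it simply
// calls the function it is given. The compiler emits the CFI for it.
extern "C" void call_function_cxx(void (*func)()) {
    func();
}

// Mirrors my_exception from main.cpp, without the output.
static void throw_42() {
    throw 42;
}

// Calls throw_42 indirectly through call_function_cxx and returns the
// caught value, or -1 if nothing was thrown.
int run_demo() {
    try {
        void (*fn)(void (*)()) = call_function_cxx;
        fn(throw_42);
    } catch (int e) {
        return e;
    }
    return -1;
}
```

The exception unwinds through the call_function_cxx frame exactly as it does through the assembly version, except that here the unwind information comes for free from the compiler.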

An important part of using CFI directives is understanding the "CFA". The CFA, or Canonical Frame Address, is the reference address that the DWARF rules, and ultimately the unwinder, use to unwind the call stack; on AArch64 it is the value of sp at the call site in the caller. Debuggers also make use of this DWARF data. In practice, each function gets its own FDE, or Frame Description Entry. Each FDE refers to a CIE, or Common Information Entry, which, as the name implies, holds information common to a set of FDEs.

By default, the CIE states that the CFA is sp plus 0, so whenever sp changes while the CFA is defined relative to it, we must tell DWARF through CFI directives. That is what .cfi_def_cfa_offset does: after the stp, it tells DWARF that the CFA is the current sp plus 16 bytes. Next, DWARF needs to know where the saved lr and fp can be found relative to the CFA. This is what .cfi_offset does: .cfi_offset 29, -16 states that the saved value of x29 (the fp) is stored at CFA-16, and .cfi_offset 30, -8 does the same for x30 (the lr) at CFA-8. In the epilogue, .cfi_restore resets the rule for a register to the state it had when .cfi_startproc was issued, that is, the rule from the CIE. After that, .cfi_def_cfa_offset 0 indicates that the CFA is once again equal to sp, and finally .cfi_endproc ends the FDE.

All of this instruments the DWARF data, which in turn is used by debuggers, runtimes and the unwinder. All of these consumers need to know that the return address saved from the lr is signed, so that they can verify it and demangle it before using it. The unwinder uses the autia1716 or autib1716 instruction to demangle the return address; both are in the hint space, as hint 12 and hint 14 respectively. The address must be demangled because signing inserts the PAC signature into its upper bits; authenticating it removes the signature and restores a valid pointer.
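The CFA rules for call_function can be modelled in a few lines of C++. This sketch hard-codes what the directives state at the blr call site (CFA = sp + 16, x29 at CFA-16, x30 at CFA-8) and reads the saved registers from a simulated stack. CallerRegs and unwind_call_function are hypothetical names, and a real unwinder derives these rules from the FDE rather than hard-coding them.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Registers recovered for the caller when unwinding one call_function frame.
struct CallerRegs {
    uint64_t cfa; // caller's sp at the call site (the CFA)
    uint64_t x29; // caller's frame pointer
    uint64_t x30; // return address; still PAC-signed if PAC is enabled
};

// sp is the value of sp at the blr in call_function; mem simulates the
// stack as a map from address to 64-bit value.
CallerRegs unwind_call_function(uint64_t sp,
                                const std::map<uint64_t, uint64_t>& mem) {
    const uint64_t cfa = sp + 16;  // .cfi_def_cfa_offset 16
    CallerRegs r{};
    r.cfa = cfa;
    r.x29 = mem.at(cfa - 16);      // .cfi_offset 29, -16
    r.x30 = mem.at(cfa - 8);       // .cfi_offset 30, -8
    return r;
}
```

With sp = 0x7000 after the stp, the frame record sits at 0x7000 (x29) and 0x7008 (x30), and the CFA, the caller's sp, is 0x7010.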

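Depending on the implementation, when authentication fails the auti(a|b)1716 instructions either return an invalid pointer, which faults when it is used, or, on processors that implement FEAT_FPAC, raise an exception immediately (Linux reports this as SIGILL).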
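To make "sign", "authenticate" and "demangle" concrete, here is a toy C++ model. It is not the real algorithm: the hardware uses a keyed cipher, and the PAC field layout depends on the virtual address size and TBI settings. The sketch assumes a 48-bit VA, a user-space address (bit 55 clear) and TBI disabled, so the PAC occupies bits 63:56 and 54:48; toy_pac, sign, strip and authenticate are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

// TOY MODEL ONLY. Assumptions: 48-bit VA, user-space address (bit 55
// clear), TBI disabled, so the PAC field is bits 63:56 and 54:48.
constexpr unsigned kVaBits = 48;
constexpr uint64_t kAddrMask = (uint64_t{1} << kVaBits) - 1;
constexpr uint64_t kPacMask = ~kAddrMask & ~(uint64_t{1} << 55);

// Hypothetical stand-in for the hardware cipher: mixes address,
// modifier (e.g. sp) and key. Not cryptographically secure.
uint64_t toy_pac(uint64_t addr, uint64_t modifier, uint64_t key) {
    uint64_t h = addr ^ (modifier * 0x9E3779B97F4A7C15ULL) ^ key;
    h ^= h >> 29;
    h *= 0xBF58476D1CE4E5B9ULL;
    h ^= h >> 32;
    return h & kPacMask; // keep only the bits of the PAC field
}

// paciasp-like: insert the PAC into the unused upper bits.
uint64_t sign(uint64_t addr, uint64_t modifier, uint64_t key) {
    const uint64_t a = addr & kAddrMask;
    return a | toy_pac(a, modifier, key);
}

// xpaci-like: drop the PAC field without checking it.
uint64_t strip(uint64_t ptr) { return ptr & kAddrMask; }

// auti*-like: check the PAC and return the plain address on success.
// On failure, without FEAT_FPAC the result is a corrupted pointer that
// faults when used; with FEAT_FPAC the CPU traps (SIGILL on Linux),
// modelled here as a C++ exception.
uint64_t authenticate(uint64_t ptr, uint64_t modifier, uint64_t key, bool fpac) {
    const uint64_t addr = strip(ptr);
    if ((ptr & kPacMask) == toy_pac(addr, modifier, key)) return addr;
    if (fpac) throw std::runtime_error("PAC authentication failed");
    return addr | (uint64_t{1} << 53); // toy corruption: non-canonical address
}
```

Flipping any bit of the PAC field in a signed pointer makes authenticate fail, which is the property the unwinder relies on when it demangles a saved lr.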

Our header files and discussions so far have indicated that PAC offers two keys for signing return addresses: the A key and the B key. The key is selected at build time through compiler options, for example by specifying -mbranch-protection=pac-ret+b-key. Let's modify our latest C++ example, namely call_function.S and aarch64.h, to support the B key, including the required DWARF directive:

Tag: Example-8

aarch64.h:

#ifndef _AARCH_64_H_
#define _AARCH_64_H_

/*
 * References:
 *  - https://developer.arm.com/documentation/101028/0012/5--Feature-test-macros
 *  - https://github.com/ARM-software/abi-aa/blob/main/aaelf64/aaelf64.rst
 */

#if defined(__ARM_FEATURE_BTI_DEFAULT) && __ARM_FEATURE_BTI_DEFAULT == 1
  #define BTI_J hint 36 /* bti j: for indirect jumps, i.e. br instructions */
  #define BTI_C hint 34 /* bti c: for indirect calls, i.e. blr instructions */
  #define GNU_PROPERTY_AARCH64_BTI 1 /* bit 0 of the GNU property note marks BTI support */
#else
  #define BTI_J
  #define BTI_C
  #define GNU_PROPERTY_AARCH64_BTI 0
#endif

#if defined(__ARM_FEATURE_PAC_DEFAULT)
  #if __ARM_FEATURE_PAC_DEFAULT & 1
    #define SIGN_LR hint 25 /* paciasp: sign with the A key */
    #define VERIFY_LR hint 29 /* autiasp: verify with the A key */
    #define CFI_B_KEY_FRAME /* empty: the B key is not used */
  #elif __ARM_FEATURE_PAC_DEFAULT & 2
    #define SIGN_LR hint 27 /* pacibsp: sign with the B key */
    #define VERIFY_LR hint 31 /* autibsp: verify with the B key */
    #define CFI_B_KEY_FRAME .cfi_b_key_frame
  #endif
  #define CFI_WINDOW_SAVE .cfi_window_save
  #define GNU_PROPERTY_AARCH64_POINTER_AUTH 2 /* bit 1 of the GNU property note marks PAC support */
#else
  #define SIGN_LR BTI_C
  #define VERIFY_LR
  #define CFI_WINDOW_SAVE
  #define CFI_B_KEY_FRAME
  #define GNU_PROPERTY_AARCH64_POINTER_AUTH 0
#endif

/* Mark BTI and/or PAC support in the GNU property note section */
#if GNU_PROPERTY_AARCH64_BTI != 0 || GNU_PROPERTY_AARCH64_POINTER_AUTH != 0
    .pushsection .note.gnu.property, "a"; /* Start a new allocatable section */
    .balign 8; /* align it on an 8-byte boundary */
    .long 4; /* size of "GNU\0" */
    .long 0x10; /* size of descriptor */
    .long 0x5; /* NT_GNU_PROPERTY_TYPE_0 */
    .asciz "GNU";
    .long 0xc0000000; /* GNU_PROPERTY_AARCH64_FEATURE_1_AND */
    .long 4; /* Four bytes of data */
    .long (GNU_PROPERTY_AARCH64_BTI|GNU_PROPERTY_AARCH64_POINTER_AUTH); /* BTI or PAC is enabled */
    .long 0; /* padding for 8 byte alignment */
    .popsection; /* end the section */
#endif

#endif
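The note emitted by aarch64.h has a fixed layout. As a cross-check, this C++ sketch builds the same bytes in memory, assuming a little-endian target as on AArch64 Linux: a 16-byte note header with the "GNU" name, followed by one 16-byte GNU_PROPERTY_AARCH64_FEATURE_1_AND property whose data word holds bit 0 for BTI and bit 1 for PAC. build_property_note and put_u32 are hypothetical helpers.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr uint32_t kNtGnuPropertyType0 = 5;    // NT_GNU_PROPERTY_TYPE_0
constexpr uint32_t kFeature1And = 0xc0000000;  // GNU_PROPERTY_AARCH64_FEATURE_1_AND
constexpr uint32_t kFeatureBti = 1u << 0;
constexpr uint32_t kFeaturePac = 1u << 1;

// Append a 32-bit value in little-endian byte order.
static void put_u32(std::vector<uint8_t>& out, uint32_t v) {
    for (int i = 0; i < 4; ++i) out.push_back(static_cast<uint8_t>(v >> (8 * i)));
}

// Builds the same 32 bytes as the .note.gnu.property block in aarch64.h.
std::vector<uint8_t> build_property_note(bool bti, bool pac) {
    std::vector<uint8_t> note;
    put_u32(note, 4);                    // namesz: "GNU\0"
    put_u32(note, 16);                   // descsz: one padded property
    put_u32(note, kNtGnuPropertyType0);  // note type
    const char name[4] = {'G', 'N', 'U', '\0'};
    note.insert(note.end(), name, name + 4);
    put_u32(note, kFeature1And);         // pr_type
    put_u32(note, 4);                    // pr_datasz: four bytes of data
    put_u32(note, (bti ? kFeatureBti : 0) | (pac ? kFeaturePac : 0));
    put_u32(note, 0);                    // pad the property to 8 bytes
    return note;
}
```

The feature word lands at byte offset 24, which is where a tool dumping the note (for example readelf -n on the object) reads the BTI and PAC flags from.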

call_function.S:

#include "aarch64.h"

.section .text
.global call_function

// Function prototype
// void call_function(void (*func)())
call_function:
    .cfi_startproc
    SIGN_LR
    CFI_WINDOW_SAVE
    CFI_B_KEY_FRAME
    // Save link register and frame pointer, allocating enough space for
    // saving the return location.
    stp x29, x30, [sp, #-16]!
    .cfi_def_cfa_offset 16
    .cfi_offset 29, -16
    .cfi_offset 30, -8
    mov x29, sp

    // x0 holds the caller's first argument: branch with link to the
    // function it points to. blr places the return address in x30 (lr),
    // which is why the caller's lr was saved on the stack above.
    blr x0
return_loc:
    // Restore link register and frame pointer
    ldp x29, x30, [sp], #16

    .cfi_restore 30
    .cfi_restore 29
    .cfi_def_cfa_offset 0

    // Return from the function
    VERIFY_LR
    ret
    .cfi_endproc

Compile and run the program:

make clean
CXXFLAGS="-mbranch-protection=pac-ret+b-key+bti" make
./main
Throwing exception...
Caught exception: 42
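Because SIGN_LR and VERIFY_LR are plain HINT immediates, a wrong number silently produces a different instruction: autibsp is HINT #31, while HINT #32 encodes a plain bti. The following C++ sketch mirrors the header's key selection and computes the instruction encodings, so they can be compared with objdump -d output for call_function.o. It assumes the ACLE meaning of __ARM_FEATURE_PAC_DEFAULT (bit 0 selects the A key, bit 1 the B key); encode_hint, PacConfig and select_pac_config are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Encodes the A64 HINT #imm instruction: imm (0-127) occupies bits [11:5]
// of the base encoding 0xD503201F (HINT #0 is NOP).
constexpr uint32_t encode_hint(uint32_t imm) {
    return 0xD503201Fu | ((imm & 0x7Fu) << 5);
}

// Mirrors the key selection in aarch64.h. pac_default stands for the
// value of __ARM_FEATURE_PAC_DEFAULT (use 0 when it is undefined).
struct PacConfig {
    bool enabled;         // is return-address signing on?
    char key;             // 'A' or 'B'; '-' when disabled
    uint32_t sign_insn;   // encoding of SIGN_LR
    uint32_t verify_insn; // encoding of VERIFY_LR
};

constexpr PacConfig select_pac_config(unsigned pac_default) {
    return (pac_default & 1u) ? PacConfig{true, 'A', encode_hint(25), encode_hint(29)}  // paciasp/autiasp
         : (pac_default & 2u) ? PacConfig{true, 'B', encode_hint(27), encode_hint(31)}  // pacibsp/autibsp
         : PacConfig{false, '-', 0, 0};  // no signing; the header falls back to BTI_C
}

// Cross-check against the encodings of these mnemonics.
static_assert(encode_hint(25) == 0xD503233Fu, "paciasp");
static_assert(encode_hint(29) == 0xD50323BFu, "autiasp");
static_assert(encode_hint(27) == 0xD503237Fu, "pacibsp");
static_assert(encode_hint(31) == 0xD50323FFu, "autibsp");
static_assert(encode_hint(32) == 0xD503241Fu, "bti, not autibsp");
```

With -mbranch-protection=pac-ret+b-key the macro's bit 1 is set, so select_pac_config picks pacibsp/autibsp, matching the B-key branch of the header.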

Debugging DWARF

As previously mentioned, DWARF is bytecode for a virtual machine. This DWARF information is embedded in sections of the generated ELF files, such as .eh_frame, for consumers like the unwinder and debuggers. It is possible to dump these DWARF instructions in a disassembled form, which is very useful for debugging. Note that we add -g to produce debug info for the addr2line example that follows.

make clean
CXXFLAGS="-mbranch-protection=pac-ret+b-key+bti -g" make
readelf --debug-dump=frames call_function.o
Contents of the .eh_frame section:


00000000 0000000000000010 00000000 CIE
  Version:               1
  Augmentation:          "zR"
  Code alignment factor: 4
  Data alignment factor: -8
  Return address column: 30
  Augmentation data:     1b
  DW_CFA_def_cfa: r31 (sp) ofs 0

00000014 0000000000000020 00000018 FDE cie=00000000 pc=0000000000000000..0000000000000014
  DW_CFA_advance_loc: 4 to 0000000000000004
  DW_CFA_def_cfa_offset: 16
  DW_CFA_offset: r29 (x29) at cfa-16
  DW_CFA_offset: r30 (x30) at cfa-8
  DW_CFA_advance_loc: 12 to 0000000000000010
  DW_CFA_restore: r30 (x30)
  DW_CFA_restore: r29 (x29)
  DW_CFA_def_cfa_offset: 0
  DW_CFA_nop
  DW_CFA_nop
  DW_CFA_nop
  DW_CFA_nop
  DW_CFA_nop
  DW_CFA_nop
  DW_CFA_nop

The first noteworthy element to look for is a "B" in the CIE's Augmentation string. Because it is part of the CIE, it is inherited by every FDE that uses that CIE. A "B" indicates that the PAC B key is used for signing; if "B" is not present, the A key is in use. Unwinders, for example, use this to choose the right instruction, autib1716 or autia1716, when demangling PAC-signed addresses. The other important item to look for is DW_CFA_AARCH64_negate_ra_state, which is what the .cfi_window_save directive produces on AArch64 (it shares its opcode with DW_CFA_GNU_window_save). This DWARF opcode toggles the return address state: it marks the lr as signed from that point on, so anything interpreting the saved lr needs to demangle it.
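To see how a consumer turns these opcodes into rules for a given pc, here is a toy C++ interpreter for the handful of call frame instructions used by call_function. It is a sketch: the program is a hand-written list rather than decoded .eh_frame bytes, advance_loc deltas are given in bytes to match the "to" addresses readelf prints, and Cfa, CfaInsn, Row and row_at are hypothetical names. NegateRaState just toggles a flag, mirroring how DW_CFA_AARCH64_negate_ra_state marks the lr as signed.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

enum class Cfa { AdvanceLoc, DefCfaOffset, Offset, Restore, NegateRaState };

struct CfaInsn {
    Cfa op;
    int64_t a = 0; // byte delta, CFA offset, or register number
    int64_t b = 0; // Offset only: CFA-relative save slot
};

struct Row {
    int64_t cfa_offset = 0;       // CFA = sp + cfa_offset
    std::map<int, int64_t> saved; // register -> CFA-relative save slot
    bool ra_signed = false;       // toggled by negate_ra_state
};

// Returns the unwind row in effect at pc, starting from the CIE's
// initial row (CFA = sp + 0, nothing saved).
Row row_at(const std::vector<CfaInsn>& fde, uint64_t pc) {
    Row row;
    const Row initial = row; // what DW_CFA_restore goes back to
    uint64_t loc = 0;
    for (const CfaInsn& i : fde) {
        switch (i.op) {
        case Cfa::AdvanceLoc:
            loc += static_cast<uint64_t>(i.a);
            if (loc > pc) return row; // later rules do not apply yet
            break;
        case Cfa::DefCfaOffset: row.cfa_offset = i.a; break;
        case Cfa::Offset: row.saved[static_cast<int>(i.a)] = i.b; break;
        case Cfa::Restore: {
            const int reg = static_cast<int>(i.a);
            auto it = initial.saved.find(reg);
            if (it == initial.saved.end()) row.saved.erase(reg);
            else row.saved[reg] = it->second;
            break;
        }
        case Cfa::NegateRaState: row.ra_signed = !row.ra_signed; break;
        }
    }
    return row;
}
```

Fed the FDE from the dump (advance 4, def_cfa_offset 16, offsets for r29 and r30, advance 12, restores, def_cfa_offset 0), row_at reports CFA = sp + 16 with x30 at CFA-8 for the blr at pc 0x8, and CFA = sp + 0 from pc 0x10 onwards.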

Each FDE refers to its CIE, shown by the cie= field, and there can be multiple CIEs. Each FDE also has an associated pc range for which it is valid, shown by the pc= field.
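A consumer holding a pc must first find the FDE whose range contains it. This C++ sketch does that with a binary search over FDEs sorted by start address; the Fde struct, find_fde and the table contents are hypothetical, and on Linux real unwinders typically use the sorted lookup table in .eh_frame_hdr for the same purpose.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

struct Fde {
    uint64_t begin; // first pc covered
    uint64_t end;   // one past the last pc covered
    uint32_t cie;   // offset of the associated CIE
    std::string name; // function name, for illustration only
};

// fdes must be sorted by begin and non-overlapping.
std::optional<Fde> find_fde(const std::vector<Fde>& fdes, uint64_t pc) {
    // First FDE starting after pc; the candidate is the one before it.
    auto it = std::upper_bound(fdes.begin(), fdes.end(), pc,
        [](uint64_t value, const Fde& f) { return value < f.begin; });
    if (it == fdes.begin()) return std::nullopt;
    --it;
    if (pc >= it->end) return std::nullopt; // pc falls in a gap
    return *it;
}
```

Using the range from the dump, pc=0000000000000000..0000000000000014, any pc from 0x0 up to 0x13 resolves to the call_function FDE.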

You can map an FDE to a function using addr2line. Note that this requires -g in the compilation flags; otherwise addr2line prints ?? instead of the function name and source location:

addr2line -f -e call_function.o 0
call_function
/home/bill/workspace/blog-example/call_function.S:10

Jumping to Functions

When an indirect transfer of control flow occurs, BTI-enabled hardware, together with a software stack built with BTI enabled, ensures that it lands on a landing pad. Put another way, direct control flow changes are not checked, because their target address is encoded in the instruction itself rather than supplied at run time in a register that an attacker might control. Consequently, it is indirect branches such as br and blr that are checked for landing on proper landing pads. Branches with link, like blr, are typically used to call functions, so they must land on a bti c or bti jc instruction. Branches that do not modify the link register, like br, are used for "jumps" and must land on a bti j or bti jc landing pad.

However, in certain scenarios where jump oriented programming models are used, a branch without link may be used to transfer control to a function that is normally called. If that function was compiled by a C or C++ compiler, its landing pad will be a bti c instruction. A br to it therefore fails the BTI check and raises a Branch Target exception (reported as SIGILL on Linux), because a branch without link expects to land on a bti j instruction. To work around this, the architecture treats br through register x16 or x17 specially: such a branch may land on either a bti c or a bti j landing pad. This is further discussed in Jump Oriented Programming.
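The rules above amount to a small truth table. This C++ sketch models the PSTATE.BTYPE value that each kind of indirect branch sets and which landing pads accept it. It assumes both the branch and its target are in guarded pages and leaves out PACIASP/PACIBSP, which can also act as landing pads; Branch, Pad, btype and lands_ok are hypothetical names.

```cpp
#include <cassert>

enum class Branch { Blr, BrX16X17, BrOther };
enum class Pad { None, Bti, BtiC, BtiJ, BtiJc };

// PSTATE.BTYPE set by each branch kind (binary 10, 01 and 11).
int btype(Branch b) {
    switch (b) {
    case Branch::Blr: return 0b10;      // indirect call
    case Branch::BrX16X17: return 0b01; // jump through x16 or x17
    case Branch::BrOther: return 0b11;  // any other indirect jump
    }
    return 0;
}

// True if landing on `pad` after `b` is allowed; otherwise the CPU raises
// a Branch Target exception (reported as SIGILL on Linux).
bool lands_ok(Branch b, Pad pad) {
    const int t = btype(b);
    switch (pad) {
    case Pad::BtiC: return t == 0b01 || t == 0b10;
    case Pad::BtiJ: return t == 0b01 || t == 0b11;
    case Pad::BtiJc: return true;
    case Pad::Bti:  // bti with no target accepts no indirect branch
    case Pad::None: return false;
    }
    return false;
}
```

The case described above is lands_ok(Branch::BrOther, Pad::BtiC), which is false, while routing the same jump through x16 or x17 makes it acceptable.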

Conclusion

This multi-part tutorial shows how to enable PAC and BTI through assembly functions, how PAC instructions can also serve as BTI landing pads, and how to handle PAC A and B keys in source. It also highlights how exception handling needs to be augmented through the use of CFI directives, and how to dump the CFI generated DWARF data.
