A long-standing limitation of the Arm A-profile architecture has been the lack of support for non-maskable interrupts (NMIs). However, as announced in Arm A-Profile Architecture Developments 2021, Arm is adding support for NMIs in both the CPU and Generic Interrupt Controller (GIC) architectures. But what exactly is an NMI, how does operating system software use these features, and why are they called non-maskable when there are several ways to mask them? This blog post explores these questions in more detail.
Let us begin by exploring why there is a need for a special class of interrupts and why we call them non-maskable. The first thing to realize is that all interrupts are maskable in certain circumstances. For example, if the interrupt controller is entirely turned off, no interrupts are delivered to the CPU. The term non-maskable interrupts actually covers a class of interrupts which can still be delivered to the CPU even when “normal” interrupts are masked. NMIs can still be masked, but through separate control state which can be less accessible to standard kernel code. The ability to mask all interrupts in some circumstances is also present on other architectures with a long history of NMIs.
There are two main reasons why operating systems software requires a separate class of interrupts which is only masked in specific situations. The first reason is simply that interrupts can be accidentally left disabled, leaving the system in an unresponsive state. At first glance, it might seem that software could easily be written to always re-enable interrupts after disabling them, but in practice this is hard to guarantee. Modern operating system kernels often span millions of lines of code which can directly manipulate interrupt flags. Interrupt handling routines can cause synchronous exceptions which result in non-linear code paths and potentially lead to deadlocks with interrupts disabled. Also consider third-party drivers, which run with full kernel privileges and are able to directly manipulate the CPU interrupt flags. It should be clear that simple code inspection or static analysis is insufficient to prevent interrupts being left disabled. At the same time, there are several scenarios in which it is crucial to be able to deliver an interrupt quickly, for example debugging, cross-PE synchronization, and hot patching.
The second reason is that operating systems rely on interrupts for performance profiling support. For example, Linux’s perf subsystem on AArch64 relies on PMU overflow interrupts. Code paths are profiled by programming the cycle counter to overflow at a specific interval, and on every overflow a PMU interrupt is delivered to the CPU. perf handles the interrupt and samples the PC of the running process. When only a single interrupt mask is available to the programmer, any critical section which disables interrupts cannot be profiled. A common complaint from users was that their profiling results showed much CPU time being spent in local_irq_enable(). This is obviously not the case, as the function is implemented by a single instruction that manipulates the PSTATE.I flag. The misleading information was caused by the PMU interrupt only being delivered once IRQs became re-enabled, and perf therefore always sampling the PC at this function.
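To make the effect concrete, here is a minimal, purely illustrative sketch of such a critical section using the standard Linux local_irq_save()/local_irq_restore() helpers; the data structure and worker function are hypothetical stand-ins, not real kernel code.

```c
#include <linux/irqflags.h>

struct some_state;                                /* hypothetical */
void do_expensive_work(struct some_state *state); /* hypothetical */

/*
 * Illustrative only. While PSTATE.I is set, a pending PMU overflow
 * interrupt cannot be delivered, so perf samples the PC at the point
 * where interrupts are re-enabled rather than inside the section that
 * actually consumed the cycles.
 */
static void update_shared_state(struct some_state *state)
{
	unsigned long flags;

	local_irq_save(flags);      /* sets PSTATE.I: the PMU interrupt is now masked */
	do_expensive_work(state);   /* cycles spent here are charged...               */
	local_irq_restore(flags);   /* ...to this unmask point by perf                */
}
```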
To work around this problem before the introduction of architectural NMI support, Linux relied on pseudo-NMIs, which use the interrupt priorities feature of the GIC architecture. Linux programs PMU overflow interrupts with a higher priority than all other interrupts, and rewrites the arm64-specific interrupt enable and disable functions to change the CPU interrupt priority mask (ICC_PMR_EL1) instead of directly manipulating the CPU IRQ exception flag (PSTATE.I). As a reminder, the Arm architecture splits the CPU exception model (IRQ and FIQ exceptions) from the handling and configuration of individual interrupts, which is done in the GIC. Although most systems today use the GIC, it is possible to build Arm-based systems with a different interrupt controller architecture.
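A simplified sketch of the idea follows. This is not the actual Linux implementation: the priority values are illustrative, the synchronization requirements around priority-mask writes are omitted, and an assembler that recognizes the GIC system register name ICC_PMR_EL1 is assumed.

```c
/* Illustrative priority values: lower numbers mean higher priority in
 * the GIC, so a mask of 0x80 blocks "normal" interrupts while letting a
 * higher-priority (numerically lower) PMU interrupt through. */
#define PRIO_IRQ_ON   0xf0UL   /* allow all normally configured priorities */
#define PRIO_IRQ_OFF  0x80UL   /* mask normal IRQs only                    */

/* "Disable interrupts" by raising the CPU interrupt priority mask
 * instead of setting PSTATE.I. */
static inline void pmr_irq_disable(void)
{
	asm volatile("msr ICC_PMR_EL1, %0" :: "r"(PRIO_IRQ_OFF) : "memory");
}

/* "Enable interrupts" by lowering the priority mask again. */
static inline void pmr_irq_enable(void)
{
	asm volatile("msr ICC_PMR_EL1, %0" :: "r"(PRIO_IRQ_ON) : "memory");
}
```

With this scheme, a PMU overflow interrupt programmed at a priority above PRIO_IRQ_OFF continues to be delivered even inside sections that have “disabled” interrupts through the priority mask.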
Using priorities to implement pseudo-NMIs works well to support profiling on Linux, but unfortunately introduces additional software complexity in critical paths of the operating system. For example, when entering guest virtual machines in the KVM built-in hypervisor, the hypervisor has to carefully perform several steps. First, the priority mask is used to mask non-profiling interrupts such that the VM entry path can be profiled. Second, before setting up the CPU return state to perform the exception return to the VM, interrupts must be masked using PSTATE.I to prevent corrupting the exception return state. Third, and finally, the priority mask must be lowered to allow any host interrupt to cause an exit from the VM at a later point in time. This flow adds overhead in the most critical path of the VM handling code and would be best avoided; it improves with hardware NMI support, as described below.
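Sketched in C, and reusing the pmr_irq_disable()/pmr_irq_enable() helpers from the previous sketch, the flow looks roughly like this; the vcpu type and the remaining helpers are hypothetical stand-ins, not KVM’s real functions.

```c
struct vcpu;                                     /* hypothetical */
void prepare_exception_return(struct vcpu *v);   /* hypothetical: writes ELR_EL2/SPSR_EL2 */
void enter_guest(struct vcpu *v);                /* hypothetical: ends in an ERET to the VM */

static void vm_entry_with_pseudo_nmi(struct vcpu *vcpu)
{
	/* 1. Raise the priority mask: normal interrupts are blocked, but the
	 *    higher-priority PMU pseudo-NMI can still profile this path. */
	pmr_irq_disable();

	/* 2. Mask everything with PSTATE.I before writing the exception
	 *    return state, so a late interrupt cannot corrupt ELR_EL2/SPSR_EL2. */
	asm volatile("msr daifset, #2" ::: "memory");
	prepare_exception_return(vcpu);

	/* 3. Lower the priority mask again so any host interrupt can force
	 *    a VM exit once the guest is running. */
	pmr_irq_enable();

	enter_guest(vcpu);
}
```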
Non-maskable is a bit of a misnomer, since NMIs can be masked. The key is rather that some interrupts must be masked separately from other interrupts, and it must be possible to prevent software from easily and directly masking specific interrupts. Some operating systems only require separate masking, and others require both separate masking and mechanisms to prevent easily masking interrupts.
On a RISC-style architecture such as Arm, it must be possible to mask all interrupts within an Exception level during exception entry and exception return. This is because exception return information is stored in the ELR_ELx and SPSR_ELx registers. If an IRQ exception were delivered sufficiently early during the exception handling path of another exception, these registers would be overwritten, and critical information would be lost with no path to recovery.
This may be a good time to remind the reader of another Arm architecture feature, PSTATE.SP. On taking an exception to ELx, the dedicated stack pointer for the target Exception level, SP_ELx, is used for accessing the stack. Software can then choose to switch to the SP_EL0 stack pointer after initial exception entry, for example if it wants to reuse the user-space thread stack pointer for kernel execution. This allows a dedicated exception entry stack to be used, separate from the stack used when executing in thread context. Switching the stack pointer is done by writing to the Special-purpose register SPSel. Some operating systems use this model and others do not. Linux, for example, always uses the dedicated stack pointer SP_EL1 or SP_EL2 when running at EL1 or EL2, respectively, and uses SP_EL0 as an additional register.
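For reference, selecting the stack pointer is a single write to SPSel, as in this minimal sketch. In real kernels this is done in the assembly entry code, before any C code runs, since changing SPSel changes the active stack.

```c
/* Illustrative only: select which stack pointer the current Exception
 * level uses. SPSel = 1 selects SP_ELx, SPSel = 0 selects SP_EL0. */
static inline void select_sp_el0(void)
{
	asm volatile("msr spsel, #0" ::: "memory");
}

static inline void select_sp_elx(void)
{
	asm volatile("msr spsel, #1" ::: "memory");
}
```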
The 2021 extensions introduce the ability to configure interrupts in the GIC with a dedicated NMI priority, which is strictly higher than all other priority levels. (This ignores priorities across multiple Security states. For example, it is still possible to configure a Secure world interrupt with a higher priority than an NMI in the Non-secure world, but an in-depth discussion of NMIs and Security states is beyond the scope of this post; we refer the reader to the Arm GIC and CPU architecture documentation for more details.) Interrupts configured with the NMI priority, from here on referred to as “NMIs”, are masked separately from other interrupts. The CPU architecture then permits the operating system to choose how NMIs are masked, using a new selection control in the SCTLR_ELx system register. Normal interrupts, without the NMI priority, are masked using the existing PSTATE.I bit.
The first masking mode leverages the existing SP selection logic. It masks all interrupts, including NMIs, during the exception entry and exit paths, and prevents the rest of the operating system kernel from masking NMIs. This mode works by masking NMIs whenever the dedicated SP_ELx stack pointer is in use. Once the operating system has completed the exception entry path and switches the stack pointer to SP_EL0, NMIs can only be masked by changing the stack pointer back to SP_ELx, which in practice prevents software from directly masking NMIs. A hypervisor running at EL2 can prevent an operating system running at EL1 from changing the stack pointer outside the exception entry path by using the fine-grained traps introduced in the Armv8.6 architecture extensions.
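Here is a hedged sketch of what this looks like from the kernel’s point of view, assuming this SP-based masking mode has been selected in SCTLR_ELx. The helpers are hypothetical, and a real kernel would write this sequence in its assembly exception entry code rather than in C.

```c
void save_exception_return_state(void);  /* hypothetical: saves ELR_ELx/SPSR_ELx */
void handle_exception(void);             /* hypothetical */

/* Hypothetical exception entry path; real kernels write this in assembly. */
void exception_entry(void)
{
	/* The exception is taken with SPSel = 1 (SP_ELx in use), so the
	 * hardware masks NMIs here and ELR_ELx/SPSR_ELx cannot be clobbered. */
	save_exception_return_state();

	/* Switch to SP_EL0: from this point NMIs can be delivered, and the
	 * kernel has no direct way to mask them again short of switching
	 * back to SP_ELx. */
	asm volatile("msr spsel, #0" ::: "memory");

	handle_exception();
}
```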
The second masking mode simply introduces a separate mask bit, PSTATE.ALLINT, which is set during exception entry in the same way as PSTATE.I. PSTATE.ALLINT is independent of the stack pointer selection and is directly set and cleared by software, although it is also possible for a hypervisor running at EL2 to trap writes from EL1 to PSTATE.ALLINT. This masking mode is expected to be used by operating systems which primarily need NMIs for profiling support and wish to avoid the complexity of implementing priority-based pseudo-NMIs. Arm expects that the Linux kernel will benefit from this masking mode. For example, the VM entry path in KVM would mask normal interrupts by setting PSTATE.I when it is about to run a VM. KVM would subsequently set PSTATE.ALLINT just prior to setting up the exception return registers to enter the VM. When the VM runs at EL1, physical interrupts targeting EL2 are not masked by the guest's PSTATE bits (those bits now mask only virtual interrupts). By using this new PSTATE.ALLINT control, the hypervisor avoids having to temporarily manage priority masks and PSTATE.I as described in the previous section.
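As a hedged sketch, and assuming an assembler that understands the FEAT_NMI form of MSR ALLINT, the VM entry path from the earlier pseudo-NMI example could then be reduced to something like the following; the helpers are again hypothetical stand-ins, not KVM’s real functions.

```c
struct vcpu;                                     /* hypothetical */
void prepare_exception_return(struct vcpu *v);   /* hypothetical: writes ELR_EL2/SPSR_EL2 */
void enter_guest(struct vcpu *v);                /* hypothetical: ends in an ERET to the VM */

static void vm_entry_with_allint(struct vcpu *vcpu)
{
	/* Mask normal interrupts only; NMIs such as the PMU overflow
	 * interrupt can still profile this path. */
	asm volatile("msr daifset, #2" ::: "memory");

	/* Mask NMIs as well, just before writing ELR_EL2/SPSR_EL2. */
	asm volatile("msr ALLINT, #1" ::: "memory");
	prepare_exception_return(vcpu);

	/* No priority-mask bookkeeping is needed: once the guest runs at
	 * EL1, physical interrupts targeting EL2 are not masked by the
	 * guest's PSTATE, so any host interrupt still forces a VM exit. */
	enter_guest(vcpu);
}
```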
The 2021 extensions also add a new interrupt acknowledgment register to the GIC, ICC_NMIAR1_EL1, which can be used to acknowledge NMIs separately from other interrupts. This functionality is introduced to avoid a situation where software unintentionally acknowledges an interrupt from a context in which it is unable to process that interrupt. For example, this situation can occur when a level-sensitive NMI is signaled during a critical section which has PSTATE.I set, but the NMI’s level is deasserted before software acknowledges the interrupt.
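A hedged sketch of how a handler might use the two acknowledge registers follows; the dispatch helper and the top-level entry point are hypothetical, and an assembler that recognizes the ICC_NMIAR1_EL1 name is assumed.

```c
#include <stdbool.h>

void dispatch_interrupt(unsigned long intid);    /* hypothetical */

/* Acknowledge a normal (non-NMI) interrupt. */
static inline unsigned long gic_read_iar1(void)
{
	unsigned long intid;
	asm volatile("mrs %0, ICC_IAR1_EL1" : "=r"(intid));
	return intid;
}

/* Acknowledge an interrupt with the NMI priority. */
static inline unsigned long gic_read_nmiar1(void)
{
	unsigned long intid;
	asm volatile("mrs %0, ICC_NMIAR1_EL1" : "=r"(intid));
	return intid;
}

/* Hypothetical top-level handler: an NMI handler acknowledges through
 * the NMI-specific register, so it cannot accidentally acknowledge a
 * normal interrupt that this context is unable to process. */
void handle_interrupt_exception(bool is_nmi)
{
	unsigned long intid = is_nmi ? gic_read_nmiar1() : gic_read_iar1();

	dispatch_interrupt(intid);
	/* Priority drop / deactivation via ICC_EOIR1_EL1 omitted. */
}
```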
The 2021 extensions support virtual NMIs for a guest operating system running in a virtual machine at EL1 under the control of a hypervisor running at EL2. The CPU architecture supports selecting the masking mode separately for EL1 and EL2. The masking mode is selected using SCTLR_EL1 and SCTLR_EL2 for EL1 and EL2, respectively, and the masking state in PSTATE is naturally saved and restored by the hypervisor through SPSR_EL2. The virtual CPU interface for the GIC gains the same abilities as the physical CPU interface for separate acknowledgment and identification of virtual NMIs.
Hypervisor software running on hardware systems with support for NMIs needs to be extended to support virtual NMIs. The 2021 extensions include virtualization support for NMIs by extending the EL2 List Registers (LRs), which hold virtual interrupts to be delivered to the VM. The LRs gain a new NMI bit to indicate that a virtual interrupt has the NMI priority when presented to the virtual CPU interface. Implementations of the GIC with support for direct injection of virtual SGIs also support configuration state to directly inject virtual SGIs with the NMI priority.
The virtual GIC Distributor and Redistributors are emulated by the hypervisor in software. The software emulation has to be expanded with the new registers used to configure the NMI priority for interrupts, and finally a small change is required to expose the support for NMIs in the virtual feature registers.
In summary, the introduction of NMIs in the 2021 architecture extensions provides a simple programming model that enables common NMI use cases. In designing this feature, we have worked with the ecosystem to ensure that we can provide the necessary flexibility to cover the different requirements for NMIs across common kernels and hypervisors.