A Beginner’s Guide on Interrupt Latency - and Interrupt Latency of the Arm Cortex-M processors

April 1, 2016

Introduction

All experienced embedded system designers know that interrupt latency is one of the key characteristics of a microcontroller, and are aware that it is crucial for many applications with real time requirements. However, the descriptions of interrupt latency in microcontroller literature often oversimplify exactly what is included in the ‘interrupt latency’ figure.

This blog will cover the basics of interrupt latency, and what users need to be aware of when selecting a microcontroller with low interrupt latency requirements.

The Definition of Interrupt Latency

The term interrupt latency refers to the number of clock cycles required for a processor to respond to an interrupt request. It is typically measured as the number of clock cycles from the assertion of the interrupt request up to the cycle where the first instruction of the interrupt handler is executed (figure 1).

Figure 1: Definition of interrupt latency

In many cases, when the clock frequency of the system is known, the interrupt latency can also be expressed in terms of time delay, for example, in µsec.
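As a small illustration of that conversion, the sketch below turns a cycle count into a time delay. The function name is made up for this example, and it returns nanoseconds rather than µsec purely so that integer arithmetic keeps sub-microsecond precision.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a latency given in clock cycles into nanoseconds.
 * clock_hz is the processor clock frequency supplied by the caller. */
static inline uint64_t latency_cycles_to_ns(uint32_t cycles, uint32_t clock_hz)
{
    assert(clock_hz > 0);
    /* Multiply before dividing so the integer result stays precise;
     * the 64-bit intermediate cannot overflow for 32-bit inputs. */
    return ((uint64_t)cycles * 1000000000ULL) / clock_hz;
}
```

For instance, with an assumed 48 MHz clock, `latency_cycles_to_ns(12, 48000000)` gives 250 ns, i.e. 0.25 µsec.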

In many processors, the exact interrupt latency depends on what the processor is executing at the time the interrupt occurs. For example, in many processor architectures, the processor starts to respond to an interrupt request only when the currently executing instruction completes, which can add a number of extra clock cycles. As a result, the interrupt latency is often quoted as a best case and a worst case value. This variation results in jitter in the interrupt response, which can be problematic in certain applications such as audio processing (where it introduces signal distortion) and motor control (where it can cause harmonics or vibration).

Ideally, a processor should have the following characteristics:

The interrupt latency should be low
The interrupt response should be deterministic, with low jitter
The interrupt handler should take as short a time to execute as possible
It can be configured to enter sleep mode on the last instruction of the interrupt service routine if no other interrupt needs service (for interrupt driven applications)

The interrupt latency itself is not the full story. A microcontroller marketing leaflet highlighting an extremely low interrupt latency doesn’t necessarily mean that the microcontroller can satisfy the real-time requirements of a product. A real embedded system might have many interrupt sources, and normally each interrupt source has an associated priority level. Many processor architectures support the nesting of interrupts, which means that during the execution of a low priority interrupt service routine (ISR), a high priority interrupt can pre-empt it: the low priority ISR is suspended, and resumes when the high priority ISR has completed (figure 2).

Figure 2: Nested Interrupt support

Many embedded systems require nested interrupt handling, and while a high priority ISR is running, the servicing of low priority interrupt requests is delayed. Thus, as would be expected, the interrupt latency is normally a lot worse for low priority interrupts.

The nested interrupt handling requirement means that the interrupt controller in the system needs to be flexible in interrupt management, and ideally provide all the essential interrupt prioritization and masking capability. In some cases this could be handled in software, but this can increase the software overhead of the interrupt processing (and code size) and increase the effective latency of serving interrupts. This is discussed in more detail later.

Cortex-M processor family and NVIC

The Nested Vectored Interrupt Controller (NVIC) in the Cortex-M processor family is an example of an interrupt controller with extremely flexible interrupt priority management. It provides programmable priority levels, automatic nested interrupt support and several interrupt masking options, whilst still being very easy for the programmer to use.

For the Cortex-M0 and Cortex-M0+ processors, the NVIC design supports up to 32 interrupt inputs plus a number of built-in system exceptions (figure 3). For each interrupt input, there are four programmable priority levels (figure 4). For the Cortex-M3 and Cortex-M4 processors the NVIC supports up to 240 interrupt inputs, with 8 to 256 programmable priority levels (also shown in figure 4). Bear in mind that in practice the number of interrupt inputs and the number of priority levels are driven by the application requirements, and are defined by silicon designers based on the needs of the chip design.
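The relationship between implemented priority bits and priority levels can be sketched as follows. This is a host-side illustration, not vendor code: the function names are invented here, and the bit placement (implemented bits at the most significant end of an 8-bit priority field) comes from the Arm architecture documentation rather than from this article.

```c
#include <assert.h>
#include <stdint.h>

/* Number of programmable priority levels for a given number of
 * implemented priority bits (2 bits on Cortex-M0/M0+, 3 to 8 bits
 * on Cortex-M3/M4, depending on the chip design). */
static inline uint32_t priority_levels(uint32_t prio_bits)
{
    assert(prio_bits >= 1u && prio_bits <= 8u);
    return 1u << prio_bits;
}

/* Encode a logical priority level (0 = highest) into the 8-bit
 * priority field, where implemented bits occupy the top of the field. */
static inline uint8_t encode_priority(uint32_t level, uint32_t prio_bits)
{
    assert(level < priority_levels(prio_bits));
    return (uint8_t)(level << (8u - prio_bits));
}
```

With two priority bits this gives the four levels quoted above, encoded as 0x00, 0x40, 0x80 and 0xC0.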

Figure 3: The NVIC in the Cortex-M processor family supports multiple interrupt and exception sources

Figure 4: Priority levels in Cortex-M processors

In addition to the interrupt requests from peripherals, the NVIC design supports internal exceptions, for example, an exception input from a 24-bit timer called SysTick, which is often used by the OS. There are also additional system exceptions to support OS operations, and a Non-Maskable Interrupt (NMI) input. The NMI and HardFault (one of the system exceptions) have fixed priority levels.

Interrupt Latency on the Cortex-M processor family

The interrupt latency of all of the Cortex-M processors is extremely low. The latency count is listed in table 1, and is the exact number of cycles from the assertion of the interrupt request up to the cycle where the first instruction of the interrupt handler is ready to be executed, in a system with zero wait state memory:

Processors	Cycles with zero wait state memory
Cortex-M0	16
Cortex-M0+	15
Cortex-M3	12
Cortex-M4	12

Table 1: Interrupt latency of Cortex-M processors with zero wait state memory systems

The interrupt latency listed in table 1 makes a number of simple assumptions:

The memory system has zero wait states (and its resources are not being used by other bus masters)
The system level design of the chip does not add delay in the interrupt signal connections between the interrupt sources and the processor
The interrupt service is not blocked by another currently running exception/interrupt service
For Cortex-M4, with FPU enabled, the lazy stacking feature is enabled (this is the default)
The currently executing instruction is not doing an unaligned transfer/bitband transfer (which can take 1 extra transfer cycle)

To make the Cortex-M devices easy to use and program, and to support the automatic handling of nested exceptions or interrupts, the interrupt response sequence includes a number of stack push operations. This enables all of the interrupt handlers to be written as normal C subroutines, and enables the ISR to start real work immediately without the need to spend time on saving current context.

The stacking operation of the Cortex-M3/M4 processor is shown in figure 5. The diagram shows that registers R0 to R3 and R12 are pushed onto the stack within the 12 cycle interrupt latency. If the processing inside the ISR needs five registers or fewer, there is no need for additional stacking.

Figure 5: Interrupt entry sequence (stacking) on the Cortex-M3 processor
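As a rough illustration, the hardware-stacked frame can be described with a C structure. The layout below follows the basic Armv7-M exception frame (no floating point context), which holds LR, the return address and xPSR in addition to R0 to R3 and R12; the type and function names are made up for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Basic exception stack frame, lowest address first. The processor
 * pushes these eight words automatically on interrupt entry. */
typedef struct {
    uint32_t r0;
    uint32_t r1;
    uint32_t r2;
    uint32_t r3;
    uint32_t r12;
    uint32_t lr;    /* link register of the interrupted code */
    uint32_t pc;    /* address at which the interrupted code resumes */
    uint32_t xpsr;  /* program status register at the time of the interrupt */
} exception_frame_t;

/* Read the resume address from a stacked frame, e.g. inside a fault
 * handler that has been given the stack pointer value at entry. */
static inline uint32_t frame_return_address(const exception_frame_t *frame)
{
    assert(frame != NULL);
    return frame->pc;
}
```

Because the caller-saved registers are already in this frame when the handler starts, a handler written as an ordinary C function needs no extra entry or exit code.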

The Myth of Interrupt Latency

‘So if I choose a processor with the lowest interrupt latency then that must be good, right?’ Unfortunately it is not as simple as that. Interrupt latency figures often describe only one aspect of interrupt handling performance, and do not give the complete picture:

Interrupt latency figures do not include any software overhead.

In a number of processor architectures, additional software wrapper code is needed for interrupt handlers to:

handle the stacking of registers, and/or
switch the register bank to a different one, and/or
check which interrupt requires servicing (shared interrupt pin), and/or
locate or branch to the start of the interrupt handler (not vectored), and/or
unstack saved registers at the end of the ISR, etc.

All of these can result in additional, often significant, delays in the processing of interrupts. For example, the 8051, which is still widely used today, has multiple register banks, so switching register banks can avoid the need to push registers onto the stack in software. However, the bank switch itself takes extra instructions, and a branch/jump instruction is also needed to reach the beginning of the ISR:

Step	8-bit (e.g. 8051)	Cortex-M
1	Interrupt latency	Interrupt latency
2	SJMP/LJMP to handler	Starting real handler code
3	PUSH PSW	-
4	ORL PSW, #00001000b	-
5	Starting real handler code	-

Table 2: Interrupt latency comparison between 8051 and Cortex-M processors

As a result, whilst an 8051 microcontroller might have a lower interrupt latency on paper, its overall interrupt latency, when the software overhead is included, is much worse than that of a Cortex-M based microcontroller.

Interrupt latency figures do not tell you how long it takes to carry out the interrupt handling task

As with any program code, ISRs take time to execute. The higher the performance of the processor, the quicker the interrupt request is serviced, and the longer the system can stay in sleep mode, thus reducing power consumption. When measured from the time an interrupt request is asserted to the time the interrupt processing is actually completed, the Cortex-M processors can be much better than other microcontrollers due to their higher performance (figure 6).

Figure 6: Interrupt latency when considering processing performance

Interrupt latency figures do not tell you the throughput / capacity of interrupt processing

Related to the total number of clock cycles of ISR execution is the maximum interrupt throughput / capacity of the system, which can be very important in heavily loaded systems. The maximum number of requests per second depends on the system clock speed as well as the number of clock cycles required for each interrupt to be processed.

Figure 7: Cortex-M based microcontrollers have a much higher interrupt handling capacity

In traditional 8-bit/16-bit systems, the run time for ISRs can be many more cycles than with Cortex-M based microcontrollers because of lower performance. When combined with the higher maximum clock speed of many Cortex-M based microcontrollers, the maximum interrupt processing capacity can be much higher than other microcontroller products.
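A back-of-the-envelope version of this capacity calculation might look like the sketch below. It assumes a simplified model in which every interrupt costs a fixed entry, handler and exit time with no overlap between them; the function name and the split into three cycle counts are choices made for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Upper bound on interrupts serviced per second when each one costs
 * entry + handler + exit cycles and nothing else runs in between. */
static inline uint32_t max_interrupts_per_second(uint32_t clock_hz,
                                                 uint32_t entry_cycles,
                                                 uint32_t handler_cycles,
                                                 uint32_t exit_cycles)
{
    uint32_t total_cycles = entry_cycles + handler_cycles + exit_cycles;
    assert(total_cycles > 0);
    return clock_hz / total_cycles;
}
```

For example, at an assumed 100 MHz with 12 entry cycles, a 76 cycle handler and 12 exit cycles (illustrative numbers only), the bound is 1,000,000 interrupts per second. Halving the handler cycle count or doubling the clock raises the bound, which is why both performance and clock speed matter here.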

Interrupt latency figures do not tell you about the jitter of interrupt response time

The jitter of interrupt response time refers to the variation (or value range) of interrupt latency cycles. In many systems, the number of interrupt latency cycles depends on what the CPU is doing when the interrupt takes place. For example, in an architecture like the 8051, if the processor is executing a multicycle instruction, the interrupt entry sequence cannot start until the instruction is finished, which can be a few cycles later. This results in a variation of the number of interrupt latency cycles, which is commonly referred to as jitter.

Figure 8: Cortex-M processors are designed to have limited jitter in interrupt response

In many applications the jitter doesn’t matter. However, in some applications, like audio or motor control, jitter can result in distortion of audio signals, or in vibration/noise of motors.
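To make the idea concrete, the following sketch models a processor that must finish the current instruction before starting interrupt entry. It is a deliberately simplified model with invented names: the best case assumes the request arrives on an instruction boundary, and the worst case assumes it arrives just after the longest instruction has started.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t best;   /* request arrives on an instruction boundary */
    uint32_t worst;  /* request arrives just after the longest instruction starts */
} latency_range_t;

/* Latency range when interrupt entry must wait for the current instruction. */
static inline latency_range_t latency_range(uint32_t base_cycles,
                                            uint32_t longest_instruction_cycles)
{
    latency_range_t range;
    assert(longest_instruction_cycles >= 1u);
    range.best = base_cycles;
    range.worst = base_cycles + (longest_instruction_cycles - 1u);
    return range;
}

/* Jitter is the spread between worst case and best case latency. */
static inline uint32_t jitter_cycles(latency_range_t range)
{
    assert(range.worst >= range.best);
    return range.worst - range.best;
}
```

In this model a processor whose instructions all take one cycle, or which can abandon a long instruction, has zero jitter from this source.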

In Cortex-M processors, if a multiple cycle instruction is being executed when an interrupt arrives, in most cases the instruction is abandoned and restarted when the ISR is completed. If the Cortex-M3/Cortex-M4 processor receives an interrupt request during a multiple load/store (memory access) instruction, the current state of the multiple transfer is automatically stored as part of the PSR (Program Status Register), and when the ISR completes, the multiple transfer resumes from where it was stalled by using the saved information in the PSR. This mechanism provides high performance processing while maintaining low jitter in the interrupt response time.

So what should I look for?

Over the years, the marketing literature from various microcontroller vendors has contained incomplete or misleading information on interrupt latency. For example, interrupt latencies are sometimes quoted in machine cycles (instead of clock cycles), and in some cases the quoted interrupt latency does not include software overhead. It’s important to investigate the details fully to understand the total work and time involved in responding to an interrupt.

What else could make a difference?

The Cortex-M processors incorporate some additional optimizations during interrupt handling to reduce overheads even further:

Tail chaining

When an ISR is completed, and if there is another ISR waiting to be served, the processor will switch to the other ISR as soon as possible by skipping some of the unstacking and stacking operations which are normally needed (figure 9). This is called Tail Chaining, and can be just six cycles in the Cortex-M3 and Cortex-M4 processors. This also makes the processor much more energy-efficient by avoiding unnecessary memory accesses.

Figure 9: Tail chaining
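The saving from tail chaining can be estimated with a small calculation like the one below. This is only a sketch: the function names are invented, and the unstacking cycle count is left as a parameter because it is not quoted in this article.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Cycles spent between the end of one ISR and the start of the next
 * pending ISR, with or without tail chaining. */
static inline uint32_t isr_handover_cycles(bool tail_chained,
                                           uint32_t tail_chain_cycles,
                                           uint32_t unstack_cycles,
                                           uint32_t stack_cycles)
{
    if (tail_chained) {
        return tail_chain_cycles;
    }
    /* Without tail chaining the context is restored and then saved again. */
    return unstack_cycles + stack_cycles;
}

/* Cycles saved per back-to-back interrupt by tail chaining. */
static inline uint32_t tail_chain_saving(uint32_t tail_chain_cycles,
                                         uint32_t unstack_cycles,
                                         uint32_t stack_cycles)
{
    uint32_t full_cycles = unstack_cycles + stack_cycles;
    assert(full_cycles >= tail_chain_cycles);
    return full_cycles - tail_chain_cycles;
}
```

Using the six cycle tail chain and 12 cycle entry figures quoted for the Cortex-M3/M4, and assuming (for illustration only) that unstacking also takes 12 cycles, the handover drops from 24 cycles to 6.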

Late Arrival

If a high priority interrupt request arrives during the stacking stage of a lower priority interrupt, the high priority interrupt will always be serviced first. This ensures high priority interrupts are serviced quickly, and avoids another level of stacking operation during the nested interrupt handling process. In addition, this saves energy (due to fewer memory accesses) and stack space.

Figure 10: Late arrival

Pop pre-emption

If an interrupt request arrives just as another ISR is exiting and the unstacking process is underway, the unstacking sequence is stopped and the ISR for the new interrupt is entered as soon as possible (figure 11). Again, this avoids unnecessary unstacking and stacking, and reduces power consumption and latency.

Figure 11: Pop pre-emption

Do banked registers make a difference?

In some architectures there are multiple register banks, and an ISR can use a different, sometimes dedicated, register bank to avoid the overhead of stacking and un-stacking. For example, the 8051 provides four register banks. In the original 8051 the banked register implementation was memory based, but newer accelerated 8051 designs use register hardware.

Figure 12: Banked registers

Banked registers can reduce the overhead of context save and restore in limited circumstances. However, they often result in larger silicon area and higher power consumption, and they do not scale to systems that need many levels of flexible nested interrupts. In some cases, like the 8051, additional software overhead is needed to switch the register bank(s). The Arm Cortex-M processors do not use banked registers, which gives much better energy efficiency and competitive performance when interrupt driven systems are compared with other microcontroller processor architectures.

Extra functionality with Cortex-M processors

Debug Support

The Cortex-M processors provide comprehensive debug features. The Cortex-M3 and Cortex-M4 processors also offer exception trace support, which allows the exception/interrupt history and timing information to be captured and examined in a debugger.

Figure 13: Exception trace in Cortex-M3 and Cortex-M4 processors

The trace information can be captured using a single pin trace interface called Serial Wire Viewer (SWV), or a multi-bit trace port interface, which has higher trace bandwidth for supporting full instruction trace with an ETM (Embedded Trace Macrocell). The trace information can be very useful for debugging.

Zero jitter support on Cortex-M0/Cortex-M0+ processors

The interrupt latency of Cortex-M processors can be affected by wait states of the on chip bus system, which can result in a small amount of jitter. The Cortex-M0 and Cortex-M0+ processors have an optional feature to force the interrupt response time to have zero jitter. This is done by forcing the interrupt latency to be the worst case (i.e. interrupt latency + wait state effect). This feature is typically not used in microcontrollers (which simply process the interrupt request as quickly as possible), but is used in some special SoC designs that demand zero jitter in interrupt responses.

Sleep-on-Exit feature

Sleep-on-Exit is a programmable feature which, when enabled, puts the processor into sleep mode when exiting an ISR if no other interrupt request needs to be serviced. This is very useful for any interrupt driven application, and can save power because it avoids the extra clock cycles in the thread (e.g. “main()” code) state, and reduces the amount of stacking and un-stacking normally needed for interrupt entry and exit. It also has a side effect (and benefit) of a shorter interrupt response time because stacking is not needed. For example, on the Cortex-M0, the wake up from Sleep-on-Exit is only 11 cycles.

Figure 14: Sleep-on-Exit can reduce interrupt latency (first instruction in ISR is SEV)

Note that this technique is particularly useful for interrupt driven applications.
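Enabling the feature is a single bit write. On Cortex-M processors Sleep-on-Exit is controlled by the SLEEPONEXIT bit (bit 1) of the System Control Register, which CMSIS exposes as SCB->SCR. The sketch below takes the register as a pointer so that it can be tried on a host with an ordinary variable; the helper names are invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* SLEEPONEXIT is bit 1 of the System Control Register (SCB->SCR). */
#define SCR_SLEEPONEXIT_MASK (UINT32_C(1) << 1)

/* Ask the processor to re-enter sleep whenever it returns from the
 * last active ISR to thread mode. */
static inline void enable_sleep_on_exit(volatile uint32_t *scr)
{
    assert(scr != NULL);
    *scr |= SCR_SLEEPONEXIT_MASK;
}

/* Return to normal behaviour, e.g. when thread code has work to do. */
static inline void disable_sleep_on_exit(volatile uint32_t *scr)
{
    assert(scr != NULL);
    *scr &= ~SCR_SLEEPONEXIT_MASK;
}
```

On a target with CMSIS headers, the call would typically be `enable_sleep_on_exit(&SCB->SCR);` followed by a WFI to enter sleep for the first time.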

Wait-for-Event (WFE) sleep

There are two instructions for entering sleep modes: WFI (Wait for Interrupt) and WFE (Wait for Event). WFE enters sleep mode conditionally, and can wake up by events including:

Interrupts
Hardware event (via an input pin called RXEV)
Debug events

The WFE sleep can be woken up quickly without invoking the interrupt/exception sequence. This can shorten the wake up time to just a few cycles. For example, in the Cortex-M0 processor, it can take just four cycles to wake up from sleep mode:

Figure 15: Wake up from WFE using event input (RXEV)

In this operation the processor resumes from where it was stalled, just after the WFE instruction. Instead of using an RXEV input, a peripheral interrupt with a different feature called SEV-ON-PEND (also a programmable feature) can be used to generate the event and wake up the processor, without the need to execute an ISR.

Once again, note that this technique is most useful for interrupt/event driven applications, and is only useful when it is known that a single interrupt/event source is being waited for. If there are other interrupt sources, the program code in thread mode must still check the reason for waking up from sleep mode.
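A wait loop that re-checks the wake-up reason could be structured as in the sketch below. It is host-runnable and hedged accordingly: the wait primitive is passed in as a function pointer so that `__WFE()` can be used on a target, the flag stands in for whatever status the application actually checks (for example a peripheral status bit), and the stub at the end exists only to exercise the loop off-target. SEVONPEND is bit 4 of the System Control Register; all other names are invented here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* SEVONPEND is bit 4 of the System Control Register (SCB->SCR). */
#define SCR_SEVONPEND_MASK (UINT32_C(1) << 4)

typedef void (*wait_fn_t)(void);

/* Let a newly pended interrupt act as a wake-up event for WFE. */
static inline void enable_sev_on_pend(volatile uint32_t *scr)
{
    assert(scr != NULL);
    *scr |= SCR_SEVONPEND_MASK;
}

/* Sleep until *flag is set, going back to sleep after unrelated
 * wake-ups. Returns how many times the processor woke up. */
static inline uint32_t wait_for_flag(volatile bool *flag, wait_fn_t wait_for_event)
{
    uint32_t wakeups = 0;
    assert(flag != NULL && wait_for_event != NULL);
    while (!*flag) {
        wait_for_event();   /* on a target this would be __WFE() */
        wakeups++;
    }
    *flag = false;          /* consume the event */
    return wakeups;
}

/* Host-side stand-in for __WFE(): the awaited event arrives on the
 * third wake-up, the first two are treated as unrelated events. */
static volatile bool event_flag;
static uint32_t stub_wakeups;

static inline void host_wfe_stub(void)
{
    if (++stub_wakeups == 3u) {
        event_flag = true;
    }
}
```

The loop form matters because WFE can return for reasons other than the awaited event, so the code checks the flag again after every wake-up rather than assuming the event has happened.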

Conclusions

The NVIC in the Cortex-M processors provides very flexible interrupt management and many useful features. One key technical advantage of the NVIC is its low interrupt latency. When this is combined with the high performance of the Cortex-M processors, interrupt requests can be processed quickly, giving high interrupt processing throughput. The interrupt latency on the Cortex-M processors is deterministic, and has none of the hidden software overhead found in many other architectures.

The Cortex-M processors are designed to be easy to use. For example, the NVIC programmer’s model is very simple, and the interrupt handlers can be programmed as normal C functions. At the same time, it is very powerful. All interrupts have programmable interrupt priority levels and support nested interrupts automatically. Furthermore, the NVIC supports vectored interrupt operations so that there is no need to use software to determine which interrupt to serve, and additional optimizations like tail chaining help reduce interrupt processing overhead and make the processor more energy efficient at the same time.

Top Comments

Florian Eibensteiner over 7 years ago

Hi Joseph,
Sorry for my late response, and thank you for your detailed answer; this was the missing piece I was looking for.
So if the 12 cycle latency also includes the NVIC, the 15 cycles I have measured can be explained as follows:
- 1 cycle is needed to hand over the interrupt request from the timer hardware to the NVIC
- 12 cycles are required for saving the registers and loading the ISR, as depicted in your explanation above
- 1 cycle is needed to execute the STR-Instruction by the execution-stage itself
- and finally the bus access takes also one cycle (this is the extra cycle hidden by the LSU, but visible at the I/O pin).
As a reminder my system setup for the latency measurement is as follows
- program code is loaded from internal flash (zero-wait-state)
- data is stored in the internal SRAM (zero-wait-state), thus fetching instruction and saving registers can be done in parallel
- GPIO controller is connected to the fast AHB, thus accessing data output register should take only 1 cycle in this case
- apart from the timer's ISR, only a "while(1);" is executed.
- no Debugger is connected to the system.
regards,
Florian