I need specific references for the hardware interrupt latency of the ARMv8 Cortex-A53: the latency from when an interrupt is triggered to when the ISR is initially invoked, not including operating system, kernel, or application latency.
For an A-profile core you'll probably find that the latency is dominated by memory system effects caused by the need to fetch instructions and data from main memory to actually run the handler. The costs involved will include both cache fetches and page-table walks to translate addresses.
I don't have any data to hand, but you are at least talking about thousands of cycles for anything remotely non-trivial if you miss in the cache. A single cache line miss will cost 500-1000 cycles of memory round trips if you miss in both the cache and the TLB (three serialized fetches: two for the page-table lookup, one to fetch the data); exactly how many depends on the ratio of the CPU clock to the memory system latencies. The architectural effects of the interrupt handling in the CPU itself are likely to be insignificant by comparison.
Thanks Pete. I'm trying to find where hardware interrupt latency is documented in a datasheet or TRM. Searching the ARMv8 and ARMv7 TRMs, I didn't see it offhand.
Would a rough data point be 12 cycles for a best-case hardware interrupt latency on the Cortex-A53? This doesn't include cache misses, TLB misses, the memory model used, etc. Using this as a best case and running at 1000 MHz, the period is 1 ns, which multiplied by 12 clock cycles gives 12 ns latency? You are saying this could be 500 to 1000 ns with cache misses, etc.?
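To sanity-check my own arithmetic, here is the calculation spelled out (the 12-cycle and 500-1000-cycle figures are the assumed numbers from this thread, not anything from a datasheet):

```python
# Back-of-the-envelope conversion of cycle counts to wall-clock time.
# All cycle counts here are assumptions from the discussion, not measured.

CLOCK_HZ = 1_000_000_000  # assumed 1 GHz (1000 MHz) core clock

def cycles_to_ns(cycles, clock_hz=CLOCK_HZ):
    """Convert a cycle count to nanoseconds at the given clock frequency."""
    return cycles * 1e9 / clock_hz

best_case = cycles_to_ns(12)    # 12-cycle architectural best case
miss_low  = cycles_to_ns(500)   # one cache+TLB miss, low estimate
miss_high = cycles_to_ns(1000)  # one cache+TLB miss, high estimate

print(f"best case: {best_case:.0f} ns")                       # 12 ns
print(f"with cache/TLB miss: {miss_low:.0f}-{miss_high:.0f} ns")  # 500-1000 ns
```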
>latency is dominated by memory system effects caused by the need to fetch instructions and data from main memory to actually run the handler. The costs involved will include both cache fetches and page-table walks to translate addresses.
Is this true for instance if one were to simply toggle a GPIO from U-boot without using the kernel or OS? For instance, directly toggling the GPIO registers and interrupt without an ISR and checking the cycles on oscilloscope? In this case, would there still be cache fetches or a page-table walk to translate addresses?
Worst case scenario depends on the ISR if we include kernel context switching, which causes me to ask how many instructions are required for a task switch for the Cortex-A53 and where is this documented?
For a PowerPC architecture, a full task context switch includes all GPRs and some SPRs: about 40 registers have to be transferred and 40 instructions executed. In the worst case, the stack is out of the data cache and the handler is out of the instruction cache. The time required to fetch/save 40 registers and 40 instructions is the duration of ~6 memory bursts, i.e. ~6*60 = 360 platform clocks (~450 ns with the platform @ 800 MHz). Complex interrupt handlers may need more time.
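Working through that estimate explicitly (burst count and clocks-per-burst are the rough figures given above, not measurements):

```python
# Worst-case PowerPC context-switch estimate from the thread:
# ~6 memory bursts of ~60 platform clocks each at an 800 MHz platform clock.

BURSTS = 6
CLOCKS_PER_BURST = 60          # assumed cost of one memory burst
PLATFORM_HZ = 800_000_000      # 800 MHz platform clock

total_clocks = BURSTS * CLOCKS_PER_BURST       # 360 platform clocks
latency_ns = total_clocks * 1e9 / PLATFORM_HZ

print(f"{total_clocks} clocks -> {latency_ns:.0f} ns")  # 360 clocks -> 450 ns
```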
Any idea what this would look like for the Cortex-A53?
> Where can I find this latency measurement for the ARMv8 Cortex-A53?
I'm not aware that such a measurement exists for the Cortex-A cores; the best case will never happen for any real software, so it's not really worth measuring, and as per my first answer the realistic and worst cases are totally dependent on memory system performance, which is not under the control of the CPU design.
If you totally ignore memory system effects, the architectural cost of just running the CPU instructions is going to be pretty small: ~32 registers to swap, plus a few system control registers (such as page-table pointers) to change. A finger-in-the-air guess would be ~50 cycles (50 ns at 1 GHz).
... but you can't sensibly ignore memory system effects on an A-profile core with caches and virtual memory. You *are* going to get some cache misses, L2 fetches, main memory fetches, branch mispredictions, memory prefetch and speculation. Even a single L1 cache miss which hits in the L2 will add 30% to the above (assuming a 15-cycle latency to L2), and a single cache miss to main memory (assuming a page-table hit) will add 250%. If you miss in both levels of the page tables too then you've just added 1000%. Once you start including the memory system effects it quickly becomes irrelevant whether the CPU side of things takes 10 cycles, 30 cycles, or 50 cycles; it's in the noise.
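To make the percentages concrete, here is the arithmetic behind those figures (the 50-cycle baseline and the per-miss penalties are the rough assumptions stated above, not measured values for any particular chip):

```python
# How assumed memory-system penalties dwarf the ~50-cycle architectural cost.
# All cycle figures are illustrative assumptions from the discussion.

BASE_CYCLES = 50  # assumed CPU-side cost of taking the interrupt

penalties = {
    "L1 miss, L2 hit":               15,   # ~15-cycle L2 latency (assumed)
    "miss to main memory, TLB hit":  125,  # one DRAM round trip (assumed)
    "miss in both page-table levels": 500, # three serialized memory fetches (assumed)
}

for scenario, extra in penalties.items():
    overhead_pct = 100 * extra / BASE_CYCLES
    print(f"{scenario}: +{extra} cycles (+{overhead_pct:.0f}%)")
# prints +30%, +250%, +1000% respectively
```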
It's worth noting that the Cortex-A profile cores are not designed to give really fast / consistent interrupt latency response because big operating systems simply don't care that much (even on older ARM11 cores running at only 250MHz the observed interrupt latency from the point of view of a Linux device driver would tend to be around 5000 CPU cycles by the time you actually got to the device driver handler routine). Conversely the Cortex-M and Cortex-R cores are designed very much with giving predictable and lower latency response times in mind.
I understand. Allow me to explain the use case. We have a hard real-time constraint from a customer of no more than 1 microsecond latency on the ARMv8 Cortex-A53. Of course, the customer can reach this goal with an FPGA/DSP design, but the end customer wants us to demonstrate it can be accomplished on an LS1043A ARMv8 Cortex-A53 MCU without an FPGA/DSP solution, since they are encountering undesirable latency with their existing FPGA/DSP solution. That latency has a number of causes, and they would prefer to move their application off the FPGA/DSP and onto the MCU in general.
Several options have been proposed to measure this latency. One is to service the interrupt on a dedicated core using affinity: the ISR is pinned to a single core, all other interrupts on that core are disabled to prevent scheduling issues, and no context switches or kernel tasks are allowed on that core, so there are no cache misses caused by context switching and only the interrupt from the device is serviced there. Another option is to attempt to keep the ISR in the L2 cache and service the interrupt from cache; I'm not exactly sure how this latter option would be implemented at the moment.
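For the dedicated-core option on Linux, the standard knobs are the `isolcpus=` kernel boot parameter (to keep the scheduler off a core) and `/proc/irq/<n>/smp_affinity` (a hex CPU bitmask steering an interrupt to specific cores). A minimal sketch of computing the mask, where the IRQ number and core choice are hypothetical:

```python
# Sketch of steering one IRQ to one reserved core on Linux.
# IRQ 57 and core 3 are hypothetical values for illustration.

def cpu_mask(core: int) -> str:
    """Hex bitmask selecting a single CPU, in the format smp_affinity expects."""
    return format(1 << core, "x")

IRQ = 57   # hypothetical device interrupt number
CORE = 3   # core reserved with isolcpus=3 on the kernel command line

print(cpu_mask(CORE))  # "8"

# To apply (as root), you would then run something like:
#   echo 8 > /proc/irq/57/smp_affinity
```

Combined with disabling or re-steering all other IRQs away from that core, this approximates the "only the device interrupt runs here" setup described above.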
So, the question would be, given the ISR is dedicated to a single core and the only interrupt on that core is servicing the ISR without any context switch, can we get below 1 microsecond latency with an ARMv8 Cortex-A53 architecture?
To what extent will we still have memory system effects and what would the memory system latency look like under this scenario? Realizing it depends on the complexity of the ISR, let’s suppose a basic read operation from the device.
This is what we are ultimately trying to measure, both on vanilla Linux and on PREEMPT_RT with full pre-emption, using a dedicated core and/or the ISR in the L2 cache; specifically, measuring from the start of the interrupt to when the interrupt handler is initially invoked. Another possibility is measuring from U-Boot over a UART to check only the hardware interrupt latency. Given the 1 microsecond hard interrupt latency constraint, is 1 microsecond possible?
It's a good question and I get where you are coming from, but I'm not sure there is a simple answer we can give from the CPU point of view other than "empirically measure it on your platform". There are many different ARM chipset vendors with very different memory systems and cache sizes, and how the OS is configured will also impact this, so there isn't a neat single answer here.
Based on my past experience with Linux and phone chipsets my gut feel is that 1us seems rather tight for a realtime deadline on an A-profile core; assuming a 1GHz CPU frequency that's only 1000 cycles and you don't need much to go wrong in terms of cache or TLB misses to violate that.
If you can play with your CPU partitioning then you may get close to this. If you dedicate a single core to running only the critical part of the interrupt handler, so it stays inside the L1 cache and uTLB, and don't run anything else on that core, then you can artificially encourage "good" cache hit rates. But that involves not using that core for anything else, and even then it will depend on how much shared data you have (memory coherency means any shared cache lines may be pulled out of the L1 by other CPUs needing that cache line).
Keeping things hot in the L2 may be possible, but it will depend on the size of the L2 versus the total size of the code and data running on all CPUs. It's a shared cache across all of the cores, so other CPUs doing other work may push the critical stuff out of the L2 cache; you're playing with statistical improvement, not guaranteed real-time performance.