I need specific references for the hardware interrupt latency of the ARMv8 Cortex-A53: the latency from when an interrupt is triggered to when the ISR is initially invoked, not including operating system, kernel, or application latency.
> Where can I find this latency measurement for the ARMv8 Cortex-A53?
I'm not aware that such a measurement exists for the Cortex-A cores; the best case will never happen for any real software, so it's not really worth measuring, and as per my first answer the realistic and worst cases depend entirely on memory system performance, which is not under the control of the CPU design.
If you totally ignore memory system effects, the architectural cost of just running the CPU instructions is going to be pretty small (~32 registers to swap, a few system control registers such as page table pointers to change). A finger-in-the-air guess would be ~50 cycles (50 ns at 1 GHz).
... but you can't sensibly ignore memory system effects in an A-profile core with caches and virtual memory. You *are* going to get some cache misses, L2 fetches, main memory fetches, branch mispredictions, memory prefetches, and speculation effects. Even a single miss in the L1 which hits in the L2 will add 30% to the above (assuming 15 cycles of latency to L2), and a single cache miss to main memory (assuming a page table hit) will add 250%. If you miss in both levels of the page tables too, then you've just added 1000%. Once you start including the memory system effects it quickly becomes irrelevant whether the CPU side of things takes 10 cycles, 30 cycles, or 50 cycles - it's in the noise.
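To make the relative costs concrete, here is a small back-of-the-envelope calculation using the figures above. All of the numbers are illustrative assumptions (~50 cycles architectural cost, ~15 cycles for an L2 hit, and the 250%/1000% additions for a main-memory fetch and a two-level table walk), not measured values for any real chip:

```python
# Back-of-the-envelope interrupt-entry cost, using the illustrative
# figures from the text (all numbers are assumptions, not measurements).
BASE_CYCLES = 50    # architectural cost of taking the interrupt
L2_HIT = 15         # one L1 miss that hits in the L2 (+30% of base)
DRAM_FETCH = 125    # one miss to main memory (+250% of base)
TABLE_WALK = 500    # miss in both page table levels (+1000% of base)

def entry_cycles(l2_hits=0, dram_fetches=0, table_walks=0):
    """Total cycles for interrupt entry under the assumed costs."""
    return (BASE_CYCLES
            + l2_hits * L2_HIT
            + dram_fetches * DRAM_FETCH
            + table_walks * TABLE_WALK)

print(entry_cycles(l2_hits=1))                      # 65  -> +30%
print(entry_cycles(dram_fetches=1))                 # 175 -> +250%
print(entry_cycles(dram_fetches=1, table_walks=1))  # 675 -> +1250%
```

Even this toy model shows the point: one DRAM fetch plus one table walk dwarfs the CPU-side cost.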
It's worth noting that the Cortex-A profile cores are not designed to give really fast / consistent interrupt latency response because big operating systems simply don't care that much (even on older ARM11 cores running at only 250MHz the observed interrupt latency from the point of view of a Linux device driver would tend to be around 5000 CPU cycles by the time you actually got to the device driver handler routine). Conversely the Cortex-M and Cortex-R cores are designed very much with giving predictable and lower latency response times in mind.
I understand. Allow me to explain the use case. We have a hard real-time constraint from a customer of no more than 1 microsecond of latency on the ARMv8 Cortex-A53. Of course, with an FPGA/DSP design the customer can reach this goal, but the end customer wants us to demonstrate that it can be accomplished on an LS1043A ARMv8 Cortex-A53 MCU without an FPGA/DSP solution, since they are encountering undesirable latency with their existing FPGA/DSP solution. That latency has a number of causes, and they would prefer to move their application off the FPGA/DSP and onto the MCU in general.
Several options have been proposed to measure this latency. One is to service the interrupt on a dedicated core using affinity: the ISR is pinned to a single core, all other interrupts on that core are disabled to prevent scheduling issues, and no context switches or kernel tasks are allowed on it. This means there are no cache misses caused by context switching and so on; only the device's interrupt can be serviced on that core. Another option is to try to keep the ISR in the L2 cache and service the interrupt from the cache, though I'm not exactly sure how this latter option would be implemented at the moment.
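For the first option, a rough sketch of the Linux-side configuration might look like the following. The IRQ number (57) and the choice of core 3 are purely hypothetical; on a real system you would also boot with `isolcpus`/`nohz_full` to keep the scheduler off that core and move all other IRQs away from it. This is a sketch under those assumptions, not a tested recipe:

```python
# Sketch (assumptions, not a tested recipe): pin one IRQ line to an
# isolated core on Linux via /proc. Requires root, and assumes the core
# was already isolated at boot (e.g. isolcpus=3 nohz_full=3).

def cpu_mask(core):
    """Hex affinity mask selecting a single CPU core."""
    return format(1 << core, "x")

def pin_irq(irq, core):
    """Route the given IRQ line to one core via /proc/irq/<n>/smp_affinity."""
    with open(f"/proc/irq/{irq}/smp_affinity", "w") as f:
        f.write(cpu_mask(core))

# Hypothetical device IRQ 57 routed to isolated core 3 (mask "8"):
# pin_irq(57, 3)
```

The `smp_affinity` file takes a hex bitmask of allowed CPUs, so core 3 is bit 3, i.e. mask `8`.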
So, the question is: given that the ISR is dedicated to a single core, and the only interrupt on that core is the one servicing the ISR, with no context switches, can we get below 1 microsecond of latency on an ARMv8 Cortex-A53?
To what extent will we still have memory system effects and what would the memory system latency look like under this scenario? Realizing it depends on the complexity of the ISR, let’s suppose a basic read operation from the device.
This is what we are ultimately trying to measure, both on vanilla Linux and on PREEMPT_RT with full preemption, using a dedicated core and/or the ISR in the L2 cache: the time from the start of the interrupt to when the interrupt handler is initially invoked. Another possibility is measuring from U-Boot over a UART to check only the hardware interrupt latency. With a 1 microsecond hard interrupt latency constraint, is 1 microsecond achievable?
It's a good question and I get where you are coming from, but I'm not sure there is a simple answer we can give from the CPU point of view other than "empirically measure it on your platform". There are many different ARM chipset vendors with very different memory systems and cache sizes, and how the OS is configured will also have an impact, so there isn't a neat single answer here.
Based on my past experience with Linux and phone chipsets, my gut feel is that 1us is rather tight for a real-time deadline on an A-profile core; assuming a 1 GHz CPU frequency that's only 1000 cycles, and not much has to go wrong in terms of cache or TLB misses to violate it.
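As a rough sanity check on that 1000-cycle budget, using the same purely illustrative miss costs as earlier in the thread (~15 cycles per L2 hit, ~125 per DRAM fetch, ~500 per two-level table walk - assumptions, not measurements), you can see how few misses it takes to burn through it:

```python
# How quickly a 1000-cycle (1 us @ 1 GHz) budget erodes, using the same
# illustrative costs as before (assumptions, not measurements).
BUDGET = 1000               # cycles available: 1 us at 1 GHz
ENTRY = 50                  # assumed architectural interrupt-entry cost
L2_HIT, DRAM, WALK = 15, 125, 500

def remaining(l2_hits, dram_fetches, table_walks):
    """Cycles left for the handler body after the assumed miss costs."""
    spent = ENTRY + l2_hits * L2_HIT + dram_fetches * DRAM + table_walks * WALK
    return BUDGET - spent

print(remaining(4, 0, 0))   # warm caches: 890 cycles left
print(remaining(4, 2, 0))   # plus two DRAM fetches: 640 left
print(remaining(4, 2, 1))   # plus one full table walk: 140 left
```

A handful of cold-cache fetches leaves almost nothing for the handler itself, which is why keeping the critical path hot matters so much.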
If you can play with your CPU partitioning then you may get close to this. If you dedicate a single core to running only the critical part of the interrupt handler, so it stays inside the L1 cache and the micro-TLB, and run nothing else on that core, then you can artificially encourage "good" cache hit rates. But that means not using the core for anything else, and even then it will depend on how much shared data you have: memory coherency means any shared cache lines may be pulled out of the L1 by other CPUs needing them.
Keeping things hot in the L2 may be possible, but it will depend on the size of the L2 versus the total size of the code and data running on all CPUs. It's a shared cache across all of the cores, so other CPUs doing other work may push the critical stuff out of the L2 - you're only making a statistical improvement, not guaranteeing real-time performance.