Interrupt Handling recommendation and spurious IRQ debugging

Jorge over 2 years ago

Hi all,

I've seen two different approaches to interrupt handling in several implementations:

(i) Using a loop that handles several IRQs until IAR returns the ID of a special/spurious IRQ.
(ii) Handling IRQs one by one, with each IRQ performing a kernel/hypervisor exit.

Q1: I would like to get your insight on the recommended approach for a software implementation running on ARMv8 with GICv3.

Today we are using option (ii) and we are getting some spurious IRQs. We don't know the root cause, and we would like to know whether the spurious IRQs are legitimate or whether we are doing something wrong.

Q2: In our setup, we use only SGIs and PPIs. Do you know a case where these two types of IRQs can trigger spurious IRQs?

I see the following in the spec:

"This value is returned in response to an interrupt acknowledge, if there is no pending interrupt with sufficient priority for it to be signaled to the PE, or if the highest priority pending interrupt is not appropriate for the:
1. Interrupt group that is associated with the sysreg
2. Current Security state"

Q3: For (1), I understand this to be the case where a Group 0 IRQ has higher priority and software tries to acknowledge it using ICC_IAR1_EL1. Is that right?

(1) is not our case, since we only deal with Group 1 IRQs. What about (2)? If the TrustZone secure world is using IRQs on the secure side, does this have any side effect on the non-secure side, even when the Group 0 IRQs are disabled?

Q4: Can you suggest a good way to debug the root cause of spurious IRQs?

Thanks,
Jorge

Martin Weidmann over 2 years ago +1 verified
Jorge said:
Q1: I would like to get your insight on the recommended approach for a software implementation running on ARMv8 with GICv3.

Today we are using option (ii) and we are getting some spurious IRQs. We don't know the root cause, and we would like to know whether the spurious IRQs are legitimate or whether we are doing something wrong.

I don't think there is a single recommended approach; both of the approaches you listed work. In part it comes down to what your interrupts look like - how often will you have multiple interrupts pending that you can consume by looping? If it's rare, re-reading ICC_IARx_EL1 probably won't win you much, and the extra instructions in the loop would just be overhead.
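To make the two options concrete, here is a minimal sketch in C. It is not taken from any particular kernel: `fake_gic`, `fake_read_iar` and `fake_write_eoir` are hypothetical stand-ins for the ICC_IAR1_EL1 read and ICC_EOIR1_EL1 write, so that the control flow can be run on a host. It assumes Group 1 interrupts and the default EOI mode, where the EOIR write also deactivates the interrupt.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define INTID_SPECIAL_FIRST 1020u /* 1020-1023 are the special INTIDs */
#define INTID_SPURIOUS      1023u
#define MAX_FAKE_PENDING    8

/* Hypothetical stand-in for the CPU interface: a queue of pending INTIDs. */
typedef struct {
    uint32_t pending[MAX_FAKE_PENDING];
    size_t   count;      /* number of valid entries in pending[] */
    size_t   next;       /* index of the next INTID to hand out */
    unsigned eoi_count;  /* number of EOIR writes observed */
} fake_gic;

/* Stand-in for reading ICC_IAR1_EL1: next pending INTID, else spurious. */
static uint32_t fake_read_iar(fake_gic *gic)
{
    assert(gic != NULL && gic->count <= MAX_FAKE_PENDING);
    if (gic->next >= gic->count)
        return INTID_SPURIOUS;
    return gic->pending[gic->next++];
}

/* Stand-in for writing ICC_EOIR1_EL1. */
static void fake_write_eoir(fake_gic *gic, uint32_t intid)
{
    (void)intid;
    gic->eoi_count++;
}

/* Option (ii): acknowledge at most one interrupt per exception entry.
 * Returns 1 if an interrupt was handled, 0 if the read was special/spurious. */
static int irq_handle_one(fake_gic *gic)
{
    uint32_t intid = fake_read_iar(gic);
    if (intid >= INTID_SPECIAL_FIRST && intid <= INTID_SPURIOUS)
        return 0;
    /* Device-specific handling for intid would go here. */
    fake_write_eoir(gic, intid);
    return 1;
}

/* Option (i): keep acknowledging until a special INTID comes back.
 * Returns the number of interrupts handled in this exception. */
static unsigned irq_handle_loop(fake_gic *gic)
{
    unsigned handled = 0;
    while (irq_handle_one(gic))
        handled++;
    return handled;
}
```

Note that with option (i) the final read of every exception returns spurious by design, so only a spurious result on the first read after taking the exception is worth investigating.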

Jorge said:
Q2: In our setup, we use only SGIs and PPIs. Do you know a case where these two types of IRQs can trigger spurious IRQs?

There are legitimate reasons why you might see spurious returned, but it's probably worth looking at as I'd expect it to be rare. Also, the reasons aren't really specific to SGIs or PPIs.

Examples:

The interrupt goes away - interrupts are asynchronous, remember.

For level-sensitive interrupts, if the source stops asserting the interrupt, then the interrupt stops being Pending. For example, the private timers are all level-sensitive, so updating the timer config might cause the interrupt to no longer be asserted. It will take some time for the change in signal to propagate to the GIC, and for the GIC then to recall a pending interrupt from the processor (if it was pending). That gives you a possible race: the interrupt is recalled after the IRQ exception is taken but before ICC_IARx_EL1 is read.

I've seen this in the past with a sequence like:

1. Clear interrupt in peripheral
2. Write ICC_EOIRx
3. ERET

The clearing of the source took long enough to take effect that the processor had already executed the write to ICC_EOIRx and the ERET. As it was level-sensitive, on the EOIR write the state machine went from Active&Pending to Pending, and the GIC re-signalled the interrupt. Then the level change made it to the GIC, at which point the state machine went from Pending to Inactive.
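The sequence above can be replayed with a toy model of the level-sensitive state machine. This is only an illustrative sketch: the type and function names are made up, it models a single interrupt, and it assumes the EOIR write performs both priority drop and deactivation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { INACTIVE, PENDING, ACTIVE, ACTIVE_PENDING } irq_state;

typedef struct {
    irq_state state;
    bool      level; /* level as currently seen by the GIC, which may lag the peripheral */
} level_irq;

/* The level change from the peripheral finally reaches the GIC. */
static void gic_set_level(level_irq *irq, bool asserted)
{
    assert(irq != NULL);
    irq->level = asserted;
    switch (irq->state) {
    case INACTIVE:       if (asserted)  irq->state = PENDING;        break;
    case PENDING:        if (!asserted) irq->state = INACTIVE;       break;
    case ACTIVE:         if (asserted)  irq->state = ACTIVE_PENDING; break;
    case ACTIVE_PENDING: if (!asserted) irq->state = ACTIVE;         break;
    }
}

/* Models the IAR read: true if acknowledged, false if spurious. */
static bool gic_acknowledge(level_irq *irq)
{
    assert(irq != NULL);
    if (irq->state != PENDING)
        return false;
    /* A level-sensitive interrupt that is still asserted stays pending. */
    irq->state = irq->level ? ACTIVE_PENDING : ACTIVE;
    return true;
}

/* Models the EOIR write (priority drop plus deactivate). */
static void gic_eoi(level_irq *irq)
{
    assert(irq != NULL);
    if (irq->state == ACTIVE_PENDING)
        irq->state = PENDING;
    else if (irq->state == ACTIVE)
        irq->state = INACTIVE;
}

/* Replays the sequence; returns true if the second IAR read is spurious. */
static bool replay_late_deassert(void)
{
    level_irq irq = { INACTIVE, false };
    gic_set_level(&irq, true);                 /* peripheral asserts */
    bool first = gic_acknowledge(&irq);        /* handler reads IAR */
    /* Handler clears the peripheral, but the GIC has not seen it yet. */
    gic_eoi(&irq);                             /* Active&Pending -> Pending */
    bool resignalled = (irq.state == PENDING); /* IRQ taken again after ERET */
    gic_set_level(&irq, false);                /* deassert arrives: -> Inactive */
    bool second = gic_acknowledge(&irq);       /* second handler reads IAR */
    return first && resignalled && !second;
}
```

In the model, moving the `gic_set_level(&irq, false)` call ahead of `gic_eoi()` removes the spurious read, which corresponds to making sure the peripheral clear has taken effect before the EOIR write.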

Change in interrupt config

Similar to above, software could re-program a pending interrupt so that it could no longer be sent. For example, clearing the individual enable or reducing the interrupt's priority.

Change in PE config

This would be "odd" but you could do something to the PE itself between taking the exception and reading ICC_IARx_EL1 that would result in the interrupt no longer being acknowledgeable. For example, changing the ICC_PMR_EL1 value. It's hard to think of a reason why you'd do this, but it is in theory possible.
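Both of the configuration cases above come down to the same check failing between the exception being taken and the acknowledge. A deliberately simplified sketch of that check follows; the `irq_cfg` struct and `can_signal` function are hypothetical, and the group enables and the running priority are ignored. Recall that a numerically lower priority value means a higher priority, and that an interrupt is only signalled if its priority is higher than the ICC_PMR_EL1 mask.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-interrupt view of the relevant GIC configuration. */
typedef struct {
    bool    enabled;   /* individual enable bit */
    bool    pending;
    uint8_t priority;  /* lower value = higher priority */
} irq_cfg;

/* True if the interrupt could still be acknowledged under the given
 * priority mask. Group enables and running priority are not modelled. */
static bool can_signal(const irq_cfg *irq, uint8_t pmr)
{
    assert(irq != NULL);
    return irq->enabled && irq->pending && irq->priority < pmr;
}
```

If `can_signal()` is true when the IRQ exception is taken but false by the time ICC_IARx_EL1 is read, because the enable was cleared, the priority value was raised, or the PMR was lowered, the read returns spurious.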

A different interrupt (which you can't see) is now the highest priority

You're in Secure state, and a S_G1 interrupt becomes pending triggering an IRQ. Before software gets to the read of ICC_IAR1_EL1, a G0 with higher priority becomes pending, and is the new HPPI. The read of IAR1 now returns spurious, because IAR1 can't ack a G0 interrupt. For this sequence to work, you'd have had to route FIQs to S_EL1 or S_EL2. Otherwise once the G0 interrupt arrived, an FIQ would have jumped you to EL3.

But again, I'd expect these circumstances to be relatively rare in typical usage.

Jorge said:
Q3: For (1), I understand this to be the case where a Group 0 IRQ has higher priority and software tries to acknowledge it using ICC_IAR1_EL1. Is that right?

A G0 interrupt would generate an FIQ, not an IRQ (assuming not in legacy mode). But otherwise - yes.

Another example could be that the highest priority pending interrupt (HPPI) belongs to the "other" world. For example, the HPPI is a S_G1 interrupt. You try to read ICC_IAR1_EL1 from Non-secure state - you'd get spurious.
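The group and Security state cases described in this reply can be summarised as a small decision function. This is a simplified model rather than the architectural pseudocode: the names are hypothetical, reads at EL3 are not covered, and it assumes the HPPI already has sufficient priority to be signalled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define INTID_SPURIOUS 1023u

/* Group of the highest priority pending interrupt (HPPI). */
typedef enum { HPPI_NONE, HPPI_G0, HPPI_S_G1, HPPI_NS_G1 } hppi_kind;

/* Security state of the PE performing the ICC_IAR1_EL1 read (EL3 excluded). */
typedef enum { STATE_SECURE, STATE_NON_SECURE } sec_state;

/* Simplified model of what a read of ICC_IAR1_EL1 returns. */
static uint32_t model_read_iar1(hppi_kind kind, sec_state reader, uint32_t intid)
{
    /* A real pending interrupt never carries the spurious INTID. */
    assert(kind == HPPI_NONE || intid != INTID_SPURIOUS);

    bool group1_for_this_world =
        (kind == HPPI_S_G1  && reader == STATE_SECURE) ||
        (kind == HPPI_NS_G1 && reader == STATE_NON_SECURE);

    /* G0 is never acknowledged through IAR1, and neither is the
     * other world's Group 1. Both cases read back as spurious. */
    return group1_for_this_world ? intid : INTID_SPURIOUS;
}
```

For the setup in the question (Non-secure, Group 1 only), the relevant row is `HPPI_S_G1` or `HPPI_G0` read from `STATE_NON_SECURE`, both of which return 1023 in this model.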

The way the IRQ/FIQ signals are used in GICv3 (non-legacy) means that you typically

Jorge said:
Q4: Can you suggest a good way to debug the root cause of spurious IRQs?

Some things I have done in the past:

On entry to the IRQ handler - before ICC_IARx - read the ISPEND and ISACTIVE registers. If ICC_IARx returns spurious, re-read the ISPEND and ISACTIVE registers, seeing if anything changed. This doesn't solve all the race conditions, but it can highlight some problems. (You'd only need to check the GICR registers, not GICD, given you're using PPIs and SGIs)
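A rough sketch of the before/after comparison, written so that it runs on a host: `fake_sgi_frame` and `frame_read` are hypothetical stand-ins for MMIO reads of the Redistributor's SGI_base frame, where GICR_ISPENDR0 sits at offset 0x0200 and GICR_ISACTIVER0 at offset 0x0300. Each register holds one bit per INTID for INTIDs 0-31, which covers the SGIs and PPIs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GICR_ISPENDR0   0x0200u /* offset within the SGI_base frame */
#define GICR_ISACTIVER0 0x0300u

/* Hypothetical stand-in for the memory-mapped Redistributor registers. */
typedef struct {
    uint32_t ispendr0;
    uint32_t isactiver0;
} fake_sgi_frame;

/* Stand-in for a 32-bit MMIO read at the given offset. */
static uint32_t frame_read(const fake_sgi_frame *f, uint32_t offset)
{
    assert(f != NULL);
    assert(offset == GICR_ISPENDR0 || offset == GICR_ISACTIVER0);
    return offset == GICR_ISPENDR0 ? f->ispendr0 : f->isactiver0;
}

/* Pending and active bitmaps for INTIDs 0-31 at one point in time. */
typedef struct {
    uint32_t pending;
    uint32_t active;
} irq_snapshot;

static irq_snapshot take_snapshot(const fake_sgi_frame *f)
{
    irq_snapshot s = { frame_read(f, GICR_ISPENDR0),
                       frame_read(f, GICR_ISACTIVER0) };
    return s;
}

/* Bitmask of INTIDs that were pending before the IAR read and are
 * neither pending nor active afterwards, i.e. candidates for an
 * interrupt that "went away". */
static uint32_t vanished_mask(irq_snapshot before, irq_snapshot after)
{
    return before.pending & ~after.pending & ~after.active;
}
```

In a handler, `take_snapshot()` would be called once on entry and again only when the acknowledge returns spurious, and a non-zero `vanished_mask()` would be logged together with a timestamp.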

In EL1 (or whichever EL you're routing the interrupts to), set the PSTATE.I/F bits and then go into WFI. The core will wake on the IRQ/FIQ arriving, but won't take an exception due to the masks. Immediately after the WFI, read ISR_EL1 and ICC_HPPIRx_EL1, then ack the interrupt. Keep repeating this process until you see spurious.
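For the WFI approach, each wake-up yields three raw values: ISR_EL1, ICC_HPPIRx_EL1 and the acknowledge result. The sketch below shows one way they might be decoded and compared. The `wake_sample` struct and helper names are hypothetical; the bit positions used are ISR_EL1.I at bit 7 and ISR_EL1.F at bit 6, with the INTID in bits [23:0] of the other two registers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ISR_EL1_F      (1u << 6)  /* FIQ pending */
#define ISR_EL1_I      (1u << 7)  /* IRQ pending */
#define INTID_MASK     0xFFFFFFu  /* INTID field, bits [23:0] */
#define INTID_SPURIOUS 1023u

/* One sample taken immediately after waking from WFI. */
typedef struct {
    bool     irq_pending; /* ISR_EL1.I */
    bool     fiq_pending; /* ISR_EL1.F */
    uint32_t hppi;        /* INTID from ICC_HPPIRx_EL1 */
    uint32_t acked;       /* INTID from ICC_IARx_EL1 */
} wake_sample;

/* Decodes the three raw register values into a sample. */
static wake_sample decode_sample(uint64_t isr_el1, uint64_t hppir, uint64_t iar)
{
    wake_sample s;
    s.irq_pending = (isr_el1 & ISR_EL1_I) != 0;
    s.fiq_pending = (isr_el1 & ISR_EL1_F) != 0;
    s.hppi  = (uint32_t)(hppir & INTID_MASK);
    s.acked = (uint32_t)(iar & INTID_MASK);
    return s;
}

/* Returns the INTID that was visible in HPPIR but gone by the time IAR
 * was read, or INTID_SPURIOUS if the sample shows no such interrupt. */
static uint32_t vanished_intid(const wake_sample *s)
{
    assert(s != NULL);
    if (s->acked == INTID_SPURIOUS && s->hppi != INTID_SPURIOUS)
        return s->hppi;
    return INTID_SPURIOUS;
}
```

Collecting the INTIDs returned by `vanished_intid()` over many wake-ups would show whether one particular source keeps disappearing.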

With both of the approaches above, what I'm interested in is which interrupts trigger an exception but then "go away" again. Is it always the same one? Is it only when I reach a certain rate of interrupts?