
Interrupt Handling recommendation and spurious IRQ debugging

Hi all,


I've seen in several implementations two different ways of Interrupt handling:

(i) Using a loop that handles several IRQs until IAR gets the ID of a special/spurious IRQ.
(ii) Handling IRQs one at a time, with each IRQ followed by a kernel/hypervisor exit.

Q1: I would like to get your insight on the recommended way to handle interrupts in a software implementation running on ARMv8 with GICv3.

Today we are using option (ii) and we are getting some spurious IRQs, but we don't know the root cause. We would like to know whether the spurious IRQs are legitimate or whether we are doing something wrong.

Q2: In our setup, we use only SGIs and PPIs. Do you know a case where these two types of IRQs can trigger spurious IRQs?

I see in the spec the following:

  • " This value is returned in response to an interrupt acknowledge, if there is no pending interrupt with
       sufficient priority for it to be signaled to the PE, or if the highest priority pending interrupt is not
       appropriate for the:
       
    1. Interrupt group that is associated with the sysreg
    2. Current Security state

Q3: For (1), I understand this as the case where a Group 0 interrupt has higher priority and software tries to acknowledge it using ICC_IAR1_EL1. Is that right?

(1) is not our case, since we only deal with Group 1 IRQs. What about (2)? If the TZ secure world is using IRQs on the secure side, does this have any side effect on the non-secure side, even when Group 0 IRQs are disabled?

Q4: Can you suggest a good way to debug the root cause of spurious IRQs?

Thanks,
Jorge


  • Q1: I would like to get your insight on the recommended way to handle interrupts in a software implementation running on ARMv8 with GICv3.

    Today we are using option (ii) and we are getting some spurious IRQs, but we don't know the root cause. We would like to know whether the spurious IRQs are legitimate or whether we are doing something wrong.

    I don't think there is a single recommended approach; both of the approaches you listed work.  In part it comes down to what your interrupts look like - how often will you have multiple interrupts pending that you can consume by looping?  If it's rare, re-reading ICC_IARx_EL1 probably won't win you much, and the extra instructions in the loop would just be overhead.
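As a rough sketch of the two options, here is a host-side model in C. The `fake_cpu_if` struct and the `read_iar1`/`write_eoir1` helpers are hypothetical stand-ins that I've made up for illustration; on real hardware they would be an `mrs` from ICC_IAR1_EL1 and an `msr` to ICC_EOIR1_EL1. The only architectural facts relied on are that the INTID sits in the low 24 bits of the IAR value and that INTIDs 1020-1023 are the reserved special values, with 1023 meaning spurious.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define INTID_MASK        0x00FFFFFFu  /* INTID field of ICC_IAR1_EL1 */
#define INTID_SPECIAL_MIN 1020u        /* 1020-1023 are reserved special INTIDs */
#define INTID_SPURIOUS    1023u

/* Hypothetical model of the CPU interface: a queue of INTIDs that will be
 * handed out in order, followed by "spurious" once the queue is empty. */
struct fake_cpu_if {
    const uint32_t *queue;
    size_t count;
    size_t next;
    unsigned iar_reads;   /* number of acknowledge reads performed */
    unsigned eoi_writes;  /* number of end-of-interrupt writes performed */
};

/* Stand-in for: mrs x0, ICC_IAR1_EL1 */
static uint32_t read_iar1(struct fake_cpu_if *c)
{
    assert(c != NULL);
    c->iar_reads++;
    if (c->next < c->count)
        return c->queue[c->next++] & INTID_MASK;
    return INTID_SPURIOUS;
}

/* Stand-in for: msr ICC_EOIR1_EL1, x0 */
static void write_eoir1(struct fake_cpu_if *c, uint32_t intid)
{
    (void)intid;
    c->eoi_writes++;
}

static int is_special(uint32_t intid)
{
    return intid >= INTID_SPECIAL_MIN && intid <= INTID_SPURIOUS;
}

/* Option (i): keep acknowledging until a special INTID comes back. */
static unsigned irq_entry_drain(struct fake_cpu_if *c)
{
    unsigned handled = 0;
    for (;;) {
        uint32_t intid = read_iar1(c);
        if (is_special(intid))
            break;               /* nothing (more) to acknowledge */
        /* dispatch to the handler for intid here */
        write_eoir1(c, intid);
        handled++;
    }
    return handled;
}

/* Option (ii): acknowledge at most one interrupt per exception entry. */
static unsigned irq_entry_single(struct fake_cpu_if *c)
{
    uint32_t intid = read_iar1(c);
    if (is_special(intid))
        return 0;                /* spurious entry: nothing to handle */
    write_eoir1(c, intid);
    return 1;
}
```

One thing the model makes visible: with option (i) the final IAR read of every entry returns 1023 by design, so a spurious counter only means something if it counts entries where the very first read was spurious. With option (ii), every 1023 is an entry that found nothing to acknowledge.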

    Q2: In our setup, we use only SGIs and PPIs. Do you know a case where these two types of IRQs can trigger spurious IRQs?

    There are legitimate reasons why you might see spurious returned, but it's probably worth looking at as I'd expect it to be rare.  Also, the reasons aren't really specific to SGIs or PPIs.  

    Examples:

    • The interrupt goes away - interrupts are asynchronous remember.
      • For level-sensitive interrupts if the source stops asserting the interrupt, then the interrupt stops being Pending.  For example, the private timers are all level sensitive, if you updated the timer config it might cause the interrupt to no longer be asserted.  It will take some time for the change in signal to propagate to the GIC, and for the GIC then to recall a pending interrupt from the processor (if it was pending).  Then you have a possible race - the interrupt being recalled after the IRQ exception is taken but before ICC_IARx_EL1 is read.
      • I've seen this in the past with a sequence like:
        • Clear interrupt in peripheral
        • Write ICC_EOIRx
        • ERET
      • The clearing of the source took long enough to take effect that the processor had already executed the write to ICC_EOIRx and the ERET.  As it was level-sensitive, on the EOIR write the state machine went from Active&Pending to Pending, and the GIC re-signalled the interrupt.  Then the level change made it to the GIC, at which point the state machine went from Pending to Idle.
    • Change in interrupt config
      • Similar to above, software could re-program a pending interrupt so that it could no longer be sent.  For example, clearing the individual enable or reducing the interrupt's priority.
    • Change in PE config
      • This would be "odd" but you could do something to the PE itself between taking the exception and reading ICC_IARx_EL1 that would result in the interrupt no longer being acknowledgeable.  For example, changing the ICC_PMR_EL1 value.  It's hard to think of a reason why you'd do this, but it is in theory possible.
    • A different interrupt (which you can't see) is now the highest priority
      • You're in Secure state, and a S_G1 interrupt becomes pending triggering an IRQ.  Before software gets to the read of ICC_IAR1_EL1, a G0 with higher priority becomes pending, and is the new HPPI.  The read of IAR1 now returns spurious, because IAR1 can't ack a G0 interrupt.  For this sequence to work, you'd have had to route FIQs to S_EL1 or S_EL2.  Otherwise once the G0 interrupt arrived, an FIQ would have jumped you to EL3.

    But again, I'd expect these circumstances to be relatively rare in typical usage.
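To make the clear/EOIR/ERET race from the first example concrete, here is a small host-side model of the level-sensitive state machine. It is a sketch under assumptions: the type and function names are mine, INTID 30 is just an illustrative PPI number, and `write_eoir` is assumed to both drop priority and deactivate (EOImode == 0). It models a single interrupt and ignores priorities and enables.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define INTID_SPURIOUS 1023u

enum irq_state { INACTIVE, PENDING, ACTIVE, ACTIVE_PENDING };

struct level_irq {
    uint32_t intid;
    bool line;              /* level of the source as seen by the GIC */
    enum irq_state state;
};

/* The level seen by the GIC changes (this may lag the peripheral write). */
static void set_line(struct level_irq *irq, bool asserted)
{
    assert(irq != NULL);
    irq->line = asserted;
    switch (irq->state) {
    case INACTIVE:       if (asserted)  irq->state = PENDING;        break;
    case PENDING:        if (!asserted) irq->state = INACTIVE;       break;
    case ACTIVE:         if (asserted)  irq->state = ACTIVE_PENDING; break;
    case ACTIVE_PENDING: if (!asserted) irq->state = ACTIVE;         break;
    }
}

/* Acknowledge: returns the INTID, or spurious if nothing is pending. */
static uint32_t read_iar(struct level_irq *irq)
{
    if (irq->state != PENDING)
        return INTID_SPURIOUS;
    irq->state = irq->line ? ACTIVE_PENDING : ACTIVE;
    return irq->intid;
}

/* End of interrupt, assumed here to include deactivation. */
static void write_eoir(struct level_irq *irq)
{
    if (irq->state == ACTIVE)
        irq->state = INACTIVE;
    else if (irq->state == ACTIVE_PENDING)
        irq->state = PENDING;
}

struct replay_result {
    bool second_exception;  /* was the IRQ signalled again after EOIR? */
    uint32_t second_iar;    /* what the second acknowledge returned */
};

/* Replays: assert, acknowledge, clear source, EOIR, with the de-assertion
 * reaching the GIC either before or after the EOIR write. */
static struct replay_result replay(bool deassert_reaches_gic_before_eoi)
{
    struct level_irq t = { 30u, false, INACTIVE };
    struct replay_result r = { false, INTID_SPURIOUS };

    set_line(&t, true);          /* Inactive -> Pending */
    (void)read_iar(&t);          /* Pending -> Active&Pending */

    /* The handler clears the source in the peripheral here. */
    if (deassert_reaches_gic_before_eoi)
        set_line(&t, false);     /* Active&Pending -> Active */

    write_eoir(&t);              /* Active -> Inactive, or A&P -> Pending */
    r.second_exception = (t.state == PENDING);

    if (!deassert_reaches_gic_before_eoi)
        set_line(&t, false);     /* late de-assert: Pending -> Inactive */

    if (r.second_exception)
        r.second_iar = read_iar(&t);
    return r;
}
```

In this model `replay(false)` takes a second exception whose acknowledge returns 1023, while `replay(true)` never re-signals, which matches the sequence described above.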

    Q3: For (1), I understand this as the case where a Group 0 interrupt has higher priority and software tries to acknowledge it using ICC_IAR1_EL1. Is that right?

    A G0 interrupt would generate an FIQ, not an IRQ (assuming you're not in legacy mode).  But otherwise - yes.  

    Another example could be that the highest priority pending interrupt (HPPI) belongs to the "other" world.  For example, the HPPI is a S.G1 interrupt.  You try to read ICC_IAR1_EL1 from Non-secure state - you'd get spurious.

    The way the IRQ/FIQ signals are used in GICv3 (non-legacy) means that you typically

    Q4: Can you suggest a good way to debug the root cause of spurious IRQs?

    Some things I have done in the past:

    On entry to the IRQ handler - before ICC_IARx - read the ISPEND and ISACTIVE registers.  If ICC_IARx returns spurious, re-read the ISPEND and ISACTIVE registers, seeing if anything changed.  This doesn't solve all the race conditions, but it can highlight some problems.  (You'd only need to check the GICR registers, not GICD, given you're using PPIs and SGIs)
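A minimal sketch of that before/after comparison, assuming `sgi_base` points at the Redistributor's SGI_base frame (RD_base + 0x10000) for the current PE. The struct and function names are my own; the offsets are those of GICR_ISPENDR0 and GICR_ISACTIVER0. Because it only reads through a pointer, it can be tried on a host against a plain array before being dropped into a handler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Byte offsets within the Redistributor SGI_base frame. */
#define GICR_ISPENDR0   0x0200u
#define GICR_ISACTIVER0 0x0300u

/* State of INTIDs 0-31 (SGIs and PPIs); bit n corresponds to INTID n. */
struct private_irq_snapshot {
    uint32_t pending;
    uint32_t active;
};

/* Take this once on handler entry (before reading ICC_IARx_EL1) and
 * again if the acknowledge returns spurious. */
static struct private_irq_snapshot
snapshot_private_irqs(const volatile uint32_t *sgi_base)
{
    struct private_irq_snapshot s;

    assert(sgi_base != NULL);
    s.pending = sgi_base[GICR_ISPENDR0 / 4u];
    s.active  = sgi_base[GICR_ISACTIVER0 / 4u];
    return s;
}

/* Bitmask of INTIDs that were pending before the acknowledge but are
 * neither pending nor active afterwards, i.e. candidates for an
 * interrupt that "went away". */
static uint32_t pending_that_vanished(struct private_irq_snapshot before,
                                      struct private_irq_snapshot after)
{
    return before.pending & ~after.pending & ~after.active;
}
```

As noted, this can't close every race (the interrupt may already be gone by the first read), but a non-zero result from `pending_that_vanished` names a suspect.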

    In EL1 (or whichever EL you're routing the interrupts to), set the PSTATE.I/F bits and then go into WFI.  The core will wake on the IRQ/FIQ arriving, but won't take an exception due to the masks.  Immediately after the WFI, read ISR_EL1 and ICC_HPPIRx_EL1, then ack the interrupt.  Keep repeating this process until you see spurious.  

    With both of the approaches above, what I'm interested in is which interrupts trigger an exception but then "go away" again.  Is it always the same one?  Is it only when I reach a certain rate of interrupts?
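For the WFI experiment, one way to answer "is it always the same one?" is to log the three register values per wake-up and tally them afterwards. The sketch below is only the post-processing half and can run on a host; the `wake_sample` layout and `tally_vanished` are hypothetical names of mine. On target, each sample would be filled right after the WFI by reading ISR_EL1, ICC_HPPIR1_EL1 and then ICC_IAR1_EL1, in that order. It assumes Group 1 interrupts and that only SGIs and PPIs (INTIDs 0-31) are in use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define INTID_MASK     0x00FFFFFFu
#define INTID_SPURIOUS 1023u
#define ISR_EL1_I      (1u << 7)   /* ISR_EL1.I: an IRQ is pending */
#define NUM_PRIVATE    32u         /* SGIs 0-15 and PPIs 16-31 */

/* One record per wake-up from WFI, captured with PSTATE.I/F masked. */
struct wake_sample {
    uint32_t isr;    /* ISR_EL1 */
    uint32_t hppir;  /* ICC_HPPIR1_EL1 */
    uint32_t iar;    /* ICC_IAR1_EL1, read last */
};

/* Counts the samples whose acknowledge was spurious. For each of those
 * where an IRQ was pending and HPPIR named a private interrupt, bumps
 * tally[INTID] so that repeat offenders stand out. */
static unsigned tally_vanished(const struct wake_sample *s, size_t n,
                               unsigned tally[NUM_PRIVATE])
{
    unsigned spurious = 0;

    assert(s != NULL && tally != NULL);
    for (size_t i = 0; i < n; i++) {
        uint32_t hppi = s[i].hppir & INTID_MASK;
        uint32_t iar  = s[i].iar & INTID_MASK;

        if (iar != INTID_SPURIOUS)
            continue;            /* acknowledged normally */
        spurious++;
        if ((s[i].isr & ISR_EL1_I) != 0u && hppi < NUM_PRIVATE)
            tally[hppi]++;       /* it was visible, then vanished */
    }
    return spurious;
}
```

If one INTID dominates the tally, I'd look at how that source gets cleared. Adding a timestamp to each sample would be an obvious extension for the rate question.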