TCPNet / Ethernet - Preventing 'interrupt storm'

Background information that is hopefully relevant:

- We are using the RL-ARM/TCPNet stack and appropriate Keil Ethernet driver for the ST Cortex-M4 in use (STM32F417).

- Using the Keil RTX.

- Providing a periodic 'tick' to several tasks by means of TIM2_IRQHandler. This tick 'wakes up' tasks via an isr_evt_set() call. We are aware that, internally, RTX uses a FIFO structure to convey these flags.

- A WDT that restarts the device if the main application task stops kicking it.

We've already started discussing this with Keil technical support; however, the actual problem isn't necessarily a fault or limitation of the Keil library, but perhaps more one of our overall design.

The problem is that if the Ethernet interface is 'flooded' with packets at a very fast rate, the number of packets that the receive ISR, ETH_IRQHandler(), has to handle leads to an 'interrupt storm', leaving the processor little or no time to run the main application task.

The test is done using the hping3 utility. For example, thousands of SYN packets can be sent like so (though the actual flags in the packets don't matter; it's the volume that does):

hping3 -i u10 -p 80 192.168.0.100

This could eventually cause the device to fall over and restart. There are two reasons why this happens:

(a) The main application task is starved of processing time severely enough that the WDT isn't kicked;

(b) A FIFO overflow occurs (OS_ERR_FIFO_OVF reported in os_error()) because various tasks weren't scheduled often enough to drain their event notifications from the FIFO.

This livelock is caused by the processor spending most or all of its time in ETH_IRQHandler().

I'm aware that this is a very common kind of problem in any system that is interrupted by external events, including Ethernet adapters. From the research I've done, I understand that Ethernet adapters may employ some kind of throttling or 'rate limiting' to prevent this. I've looked at (well, okay - scanned through!) the STM32F417xx Reference Manual (RM0090) in the hope that there may already be a hardware rate-limiting mechanism within the Ethernet peripheral that could offer a solution, but from what I can tell, no such thing exists.

A further bit of information to add (perhaps just for interest, or in case it helps anyone else) is that the problem was a lot worse until a simple modification was made to the Ethernet driver. The DMAIER register AISE and RBUIE bits were being set during initialisation, which enables an interrupt whenever a receive buffer is unavailable (i.e. the DMA runs out of receive descriptors). That condition occurs a lot during the flood attack, but Keil's ETH_IRQHandler() wasn't checking or clearing the corresponding status flags, which caused the ISR to be called absolutely non-stop. In practical terms this would lead to an OS_ERR_FIFO_OVF as soon as the hping3 flood was started. This 'fix' (if one can call it that) of not enabling the receive buffer unavailable interrupt in the first place has improved the situation from 'falls over instantly' to 'falls over after you leave it flooding for a while'.

So, back to the problem we still have: is there a straightforward software Elastoplast I can apply, or a peripheral feature I could switch on, to mitigate this further? The best solution I can imagine at the moment is to implement a rate-limiting / throttle mechanism that watches for X packets within a given period. If that threshold is exceeded, the Ethernet receive interrupt is disabled for a rest period. I know that a proper rate-limiting mechanism would use buffering to prevent packet loss; however, I don't have the resources for that, and my primary goal is to prevent the device from being knocked over. The Ethernet capability is a non-essential function, so I don't mind sudden packet loss when it's obvious the device is being flooded.
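
To illustrate, here is a minimal sketch of what I have in mind, driven from the existing TIM2 tick. It assumes the standard CMSIS calls NVIC_DisableIRQ()/NVIC_EnableIRQ() and the ETH_IRQn device interrupt number; the function name, counter variables and the threshold/rest values are placeholders, not anything from the Keil library.

    #include "stm32f4xx.h"                 /* CMSIS device header              */

    #define RX_IRQ_LIMIT   200u            /* max ETH IRQs per tick window     */
    #define RX_REST_TICKS   50u            /* ticks to keep the IRQ masked     */

    static volatile uint32_t rx_irq_count; /* bumped at top of ETH_IRQHandler  */
    static uint32_t          rx_rest_timer;

    /* Called once per TIM2 tick: each tick is one measurement window. */
    void eth_throttle_tick (void) {
      if (rx_rest_timer != 0u) {           /* currently throttled              */
        if (--rx_rest_timer == 0u) {
          rx_irq_count = 0u;
          NVIC_EnableIRQ (ETH_IRQn);       /* rest period over, listen again   */
        }
        return;
      }
      if (rx_irq_count > RX_IRQ_LIMIT) {   /* flood detected in this window    */
        NVIC_DisableIRQ (ETH_IRQn);        /* accept the packet loss           */
        rx_rest_timer = RX_REST_TICKS;
      }
      rx_irq_count = 0u;                   /* start a fresh window             */
    }

    /* The only change in the driver would be 'rx_irq_count++;' as the
       first statement of ETH_IRQHandler().                                    */

Anything received while the interrupt is masked is simply dropped, which is acceptable here given that the Ethernet function is non-essential.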

Does anyone have any other guidance please?

  • Like the old joke goes:

    Patient: "Doctor, it hurts when I do this"
    Doctor: "Well, then don't do that"

    Seriously though: if "The Ethernet capability is a non-essential function", then why use an interrupt? You could go to a polled mechanism, letting the processor control when the port is serviced. Or use a combination of both, as this author proposes (p. 12):

    www.cs.wm.edu/.../ethdriver.pdf

    Another option is to handle the SYN-storm problem in software via packet-detection mechanisms, as proposed here (p. 7 of the .pdf):

    seclab.cs.ucdavis.edu/.../DetectingSpoofed-DISCEX.pdf

    Interesting problem though.

  • Why don't you put the device behind a switch/router? Smart switches are capable of filtering out this kind of attack.

    Or limit the Ethernet speed to 10 Mbit in the Ethernet driver. A 10 Mbit link reduces the maximum packet rate by a factor of ten.
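
    In case it helps, forcing 10 Mbit is normally just a write to the PHY's Basic Control Register (IEEE 802.3 register 0) over MDIO during driver initialisation. A rough sketch, assuming a write_PHY() helper along the lines of the one in the Keil driver (the names below are illustrative, and the MAC side - the FES and DM bits in ETH->MACCR - would need to be set to match):

    #define PHY_REG_BCR    0x00            /* Basic Control Register        */
    #define BCR_SPEED_100  (1u << 13)      /* 0 = 10 Mbit, 1 = 100 Mbit     */
    #define BCR_ANEG_EN    (1u << 12)      /* auto-negotiation enable       */
    #define BCR_FULL_DUP   (1u << 8)       /* full duplex                   */

    /* During init: auto-negotiation off, 10 Mbit, full duplex. */
    write_PHY (PHY_REG_BCR, BCR_FULL_DUP); /* SPEED_100 and ANEG_EN left 0  */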

  • I agree with S V: don't use the interrupt, use polling instead.
    You either try to catch every packet, in which case you have to use the interrupt and you had better have the horsepower to deal with DoS attacks. Or you just service the network on a 'best effort' basis, in which case a DoS attack is no longer your concern.

  • I once encountered a similar problem, but with a serial port. The solution was to measure the average time between interrupts and, if it was shorter than a few microseconds, reset the peripheral / disable the interrupt line etc., at least for _some_ time. That may be good enough for you. I would not give up interrupt driven traffic unless no other option remains.
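
    A rough sketch of that measurement for the Ethernet case, using the Cortex-M4 DWT cycle counter (the helper names and thresholds are only illustrative; re-enabling the IRQ after a rest period would be done from a timer tick):

    #include "stm32f4xx.h"

    #define MIN_AVG_GAP  (168u * 10u)      /* ~10 us at a 168 MHz core     */

    static uint32_t last_stamp;            /* CYCCNT at the previous IRQ   */
    static uint32_t avg_gap = 0xFFFFFFu;   /* smoothed inter-IRQ interval  */

    /* Run once at startup so DWT->CYCCNT free-runs. */
    void cyccnt_init (void) {
      CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;
      DWT->CYCCNT       = 0u;
      DWT->CTRL        |= DWT_CTRL_CYCCNTENA_Msk;
    }

    /* Call at the top of ETH_IRQHandler(); masks the IRQ when the
       average spacing between interrupts gets suspiciously small.    */
    void eth_irq_rate_check (void) {
      uint32_t now = DWT->CYCCNT;
      uint32_t gap = now - last_stamp;     /* wrap-safe on uint32_t        */
      last_stamp   = now;
      avg_gap = avg_gap - avg_gap / 8u + gap / 8u;  /* cheap moving average */
      if (avg_gap < MIN_AVG_GAP) {
        NVIC_DisableIRQ (ETH_IRQn);        /* back off; a timer re-enables */
      }
    }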

  • I would not give up interrupt driven traffic unless no other option remains

    That's interesting. Is there a reason why you'd hold such a view?
    This is a rather categorical statement. I think your advice would only be valid for a limited range of applications.

  • What do we have on the menu? As far as I can see, only two options: polling or an interrupt/event-driven design. Let's make a decent choice. Polling is easier, and if we're operating inside a preemptive OS with a finely tuned pre-emption mechanism it can give pretty good bang for the buck. There is a catch, though: 'finely tuned' means time spent tuning, and re-tuning whenever modifications are made. Still not bad, especially if it's a one-time shot. On the other hand, interrupts require much more work - interface design, more debugging and other headaches - but once it's done, it's done. In general they give better performance but need more investment up front. These are general ideas; every case is unique, and one has to weigh the pros and cons for oneself. Sometimes the best results come from combining both approaches. Here is one example: an I2C master has many interrupt sources, one of which is the START condition. The time between issuing a START and getting the interrupt is often so small that, for that particular step, polling is worth it; but for the rest of the transfer, where the clock may be stretched by the slave, an ISR looks better.
    Please take this just as a personal opinion.

    Thank you for listening :)
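
    To illustrate that last point on an STM32F4 I2C master (register names from the standard device header; everything else is only a sketch):

    #include "stm32f4xx.h"

    /* Sketch only: clock, GPIO and I2C timing setup omitted. */
    void i2c1_start_transfer (uint8_t addr_rw) {
      /* The START condition completes within a few bus clocks, so it is
         cheaper to poll SB here than to take an interrupt for it.       */
      I2C1->CR1 |= I2C_CR1_START;
      while ((I2C1->SR1 & I2C_SR1_SB) == 0u) { /* spin briefly */ }
      I2C1->DR   = addr_rw;                /* send address; clears SB     */

      /* The slave may stretch the clock from here on, so let the event
         interrupt (I2C1_EV_IRQHandler) drive the rest of the transfer.  */
      I2C1->CR2 |= I2C_CR2_ITEVTEN;
    }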

  • Apologies for this being a bit (many months) late, but thanks to all for the feedback and ideas.

    In summary, there were really two problems. The first was that, on rare occasions, the system would suddenly crash when subjected to normal network traffic. The second, the one discussed here, was that an intentional packet flood sustained for tens of seconds would cause the 'interrupt storm' leading to a WDT reset.

    The problem of the sudden crash, not necessarily due to a flood attack, was (as previously explained) due to a bug in the Keil STM32 Ethernet driver. The final fix I've made for this is in ETH_IRQHandler(), where

    ETH->DMASR = INT_RBUIE
    

    is changed to

    ETH->DMASR = INT_AISE | INT_RBUIE
    

    This completely prevents the occasional crash: the handler was being re-entered over and over because the abnormal interrupt summary (AIS) status flag was never cleared along with the RBU flag, so the interrupt line remained asserted. That fixes the showstopper problem.
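
    For anyone hitting the same thing, the reasoning behind that line (my reading of RM0090) in comment form:

    /* DMASR bits are cleared by writing 1 to them. The receive buffer
       unavailable flag is an 'abnormal' interrupt, so the abnormal
       interrupt summary bit must be cleared as well, otherwise the
       ETH interrupt stays asserted and the handler re-enters at once. */
    ETH->DMASR = INT_AISE | INT_RBUIE;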