Background information that is hopefully relevant:
- We are using the RL-ARM/TCPNet stack and appropriate Keil Ethernet driver for the ST Cortex-M4 in use (STM32F417).
- Using the Keil RTX.
- Providing a periodic 'tick' to several tasks by means of TIM2_IRQHandler. This tick wakes the tasks up using an isr_evt_set() call. We are aware that, internally, the RTX uses a FIFO structure to convey these flags.
- A WDT to restart the device if the main application task stops kicking it.
We've already started discussing this with Keil technical support; however, the actual problem we have isn't necessarily a fault or limitation of the Keil library, but more so of our overall design perhaps.
The problem is that if the Ethernet interface is 'flooded' with packets at a very fast rate, the number of packets that the receive ISR, ETH_IRQHandler(), needs to handle leads to an 'interrupt storm'. This leaves the processor with little or no time to run the main application task.
The test is done using the hping3 utility. For example, thousands of SYN packets can be sent like so (though the actual flags in the packets don't matter; it's the volume of packets that matters):
hping3 -i u10 -p 80 192.168.0.100
This eventually causes the device to fall over and restart, for one of two reasons:
(a) The limiting of processing time for the main application is severe enough such that the WDT isn't kicked;
(b) A FIFO overflow event happens (OS_ERR_FIFO_OVF in os_error()) because various tasks weren't live enough to receive their event notifications out of the FIFO.
The livelock is caused by the processor spending most or all of its time in ETH_IRQHandler().
I'm aware that this is a very common kind of problem in any system that is interrupted by external events, including Ethernet adapters. From the research I've done, I understand that Ethernet adapters may employ some kind of throttling or 'rate limiting' to prevent this. I've looked at (well, okay - scanned through!) the STM32F417xx Reference Manual (RM0090) in the hope that there may already be a hardware rate limiting mechanism within the Ethernet peripheral that could offer a solution, but from what I can tell, no such thing exists.
A further bit of information to add (perhaps just for interest, or if it helps anyone else) is that the problem was a lot worse until a simple modification was made to the Ethernet driver. The AISE and RBUIE bits of the DMAIER register were being set during initialisation, which enables an interrupt whenever a receive buffer is unavailable. This event happens a lot during the flood attack, but Keil's ETH_IRQHandler() wasn't checking or clearing the corresponding flag bits, so the ISR was being called absolutely non-stop. In practical terms this led to an OS_ERR_FIFO_OVF as soon as the hping3 flood was started. This 'fix' (if one can call it that) of not enabling the receive buffer unavailable interrupt in the first place has improved the issue from 'falls over instantly' to 'falls over after you leave it flooding for a while'.
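For anyone who would rather keep that interrupt enabled, the alternative is to acknowledge it in the ISR. Below is a minimal sketch of the idea, not the Keil driver's code. The helper name eth_dma_ack_mask() is made up for illustration, and the bit positions are taken from the ETH_DMASR register description in the reference manual, so they should be double-checked before use. The status bits are write-1-to-clear, and the summary bit has to be cleared along with the source bit.

```c
#include <stdint.h>

/* ETH_DMASR bit positions (assumption: verify against the reference manual). */
#define DMASR_RS    (UINT32_C(1) << 6)   /* receive complete            */
#define DMASR_RBUS  (UINT32_C(1) << 7)   /* receive buffer unavailable  */
#define DMASR_AIS   (UINT32_C(1) << 15)  /* abnormal interrupt summary  */
#define DMASR_NIS   (UINT32_C(1) << 16)  /* normal interrupt summary    */

/* Hypothetical helper: given a snapshot of DMASR, return the value to
 * write back so that every source handled here, plus its summary bit,
 * is acknowledged. Bits that are not set in the snapshot are left alone. */
uint32_t eth_dma_ack_mask(uint32_t dmasr)
{
    uint32_t ack = 0;

    if (dmasr & DMASR_RS) {
        ack |= DMASR_RS | DMASR_NIS;
    }
    if (dmasr & DMASR_RBUS) {
        ack |= DMASR_RBUS | DMASR_AIS;
    }
    return ack;
}

/* Intended use inside the ISR, assuming the usual CMSIS device header:
 *     uint32_t status = ETH->DMASR;
 *     ETH->DMASR = eth_dma_ack_mask(status);
 */
```

This only stops the ISR from re-entering on a stale flag. It does nothing about the sheer volume of receive interrupts during a flood.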
So, back to the problem we still have: is there a straightforward software Elastoplast I can apply, or a peripheral feature I could switch on, to mitigate this further? The best solution I can imagine at the moment is to implement a rate limiting / throttle mechanism that watches for more than X packets in a given period of time. If that threshold is exceeded, the Ethernet receive interrupts are disabled for a rest period. I know that a proper rate limiting mechanism will employ buffers to prevent packet loss. However, I don't have the resources for that, and my primary goal is to prevent the device from being knocked over. The Ethernet capability is a non-essential function, so I don't care about sudden packet loss if it's obvious that the device is being flooded.
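To make the idea concrete, here is a rough, hardware-independent sketch of the throttle I have in mind. Every name in it (rx_throttle_t, rx_throttle_on_frame() and so on) is hypothetical, and the threshold, window and rest values are placeholders to be tuned. It assumes the frame hook is called from ETH_IRQHandler() and the tick hook from the existing periodic timer interrupt, and that the two ISRs cannot preempt each other.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical throttle state; not part of RL-ARM or the Keil driver. */
typedef struct {
    uint32_t max_frames;    /* frames allowed per window (the "X")        */
    uint32_t window_ticks;  /* window length, in timer ticks              */
    uint32_t rest_ticks;    /* how long RX interrupts stay off once hit   */
    uint32_t frames;        /* frames counted in the current window       */
    uint32_t window_left;   /* ticks left in the current window           */
    uint32_t rest_left;     /* ticks left in the rest period, 0 = active  */
} rx_throttle_t;

void rx_throttle_init(rx_throttle_t *t, uint32_t max_frames,
                      uint32_t window_ticks, uint32_t rest_ticks)
{
    t->max_frames   = max_frames;
    t->window_ticks = (window_ticks == 0) ? 1 : window_ticks; /* avoid 0 */
    t->rest_ticks   = (rest_ticks == 0) ? 1 : rest_ticks;     /* avoid 0 */
    t->frames       = 0;
    t->window_left  = t->window_ticks;
    t->rest_left    = 0;
}

/* Call once per received frame, from the receive ISR.
 * Returns true when the caller should disable receive interrupts. */
bool rx_throttle_on_frame(rx_throttle_t *t)
{
    if (t->rest_left != 0) {
        return true;                    /* already in the rest period */
    }
    t->frames++;
    if (t->frames > t->max_frames) {
        t->rest_left = t->rest_ticks;   /* threshold exceeded: start resting */
        return true;
    }
    return false;
}

/* Call once per timer tick.
 * Returns true when the caller should re-enable receive interrupts. */
bool rx_throttle_on_tick(rx_throttle_t *t)
{
    if (t->rest_left != 0) {
        if (--t->rest_left == 0) {
            t->frames      = 0;         /* rest over: start a fresh window */
            t->window_left = t->window_ticks;
            return true;
        }
        return false;
    }
    if (--t->window_left == 0) {
        t->frames      = 0;             /* window expired without tripping */
        t->window_left = t->window_ticks;
    }
    return false;
}
```

When rx_throttle_on_frame() returns true, the ISR would mask the receive interrupt (for instance with the CMSIS call NVIC_DisableIRQ(ETH_IRQn), or by clearing the receive interrupt enable bit in DMAIER), and the tick handler would unmask it again when rx_throttle_on_tick() returns true. Frames arriving during the rest period are simply lost, which is acceptable here.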
Does anyone have any other guidance please?