Over the past few months I've been doing a lot of work on a Kinetis K24 processor (a Cortex-M4) running the MQXLITE RTOS. The application also pulls in a couple of other SDKs and has a surprising level of complexity for a CM4 application. All of that leads to a frustrating number of faults, and I still have trouble catching a few of them.
I currently run with the usage fault, memmanage fault, and bus fault handlers disabled because I consider all of these fatal errors. I'm only interested in logging the cause of the fault to persistent storage and rebooting, so I force everything to escalate to hard fault.
I currently have a hard fault handler that looks like this:
    __asm volatile (
        " ldr r1, =last_fault \n"               // get the persistent data address
        " mov r2, #1 \n"                        // store the fault type
        " str r2, [r1, #28] \n"
        " tst lr, #4 \n"                        // Determine which banked stack pointer we were using when the fault occurred
        " ittee eq \n"
        " mrseq r0, msp \n"                     // Load the appropriate stack pointer
        " andeq r4, r0, #0x80000000 \n"         // And mark which one it was
        " mrsne r0, psp \n"
        " movne r4, r0 \n"
        " str r4, [r1, #16] \n"                 // put away the stack register
        " ldr r3, [r0, #20] \n"                 // stored lr
        " ldr r2, [r0, #24] \n"                 // stored pc
        " ldr r5, [r0, #0] \n"                  // stored r0
        " ldr r6, [r0, #4] \n"                  // stored r1
        " str r3, [r1, #12] \n"                 // put away the lr
        " str r2, [r1, #8] \n"                  // put away the pc
        " str r5, [r1, #20] \n"                 // put away cached r0
        " str r6, [r1, #24] \n"                 // put away cached r1
        " ldr r2, handler2_address_const \n"    // a handler that parses the fault status registers
        " blx r2 \n"
        " handler2_address_const: .word store_fault_info \n"
        " bkpt 255"                             // force a lockup and reset the chip
    );
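For readers following the offsets in the assembly, here is a rough sketch of what `last_fault` might look like as a C struct. The type name `fault_record_t` and all field names are hypothetical; the post does not show the real definition, and the first two words are never written by the assembly, so their purpose is unknown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout for last_fault, inferred only from the offsets that
 * the assembly stores to. The real definition is not shown in the post. */
typedef struct {
    uint32_t unknown[2]; /* offsets 0 and 4: not written by the asm, purpose unknown */
    uint32_t pc;         /* offset 8:  PC from the exception stack frame */
    uint32_t lr;         /* offset 12: LR from the exception stack frame */
    uint32_t sp;         /* offset 16: value the asm writes to record the active stack */
    uint32_t r0;         /* offset 20: r0 from the exception stack frame */
    uint32_t r1;         /* offset 24: r1 from the exception stack frame */
    uint32_t fault_type; /* offset 28: the asm writes 1 here for a hard fault */
} fault_record_t;

/* Compile-time checks that the struct matches the offsets used by the asm. */
static_assert(offsetof(fault_record_t, pc) == 8, "pc offset");
static_assert(offsetof(fault_record_t, lr) == 12, "lr offset");
static_assert(offsetof(fault_record_t, sp) == 16, "sp offset");
static_assert(offsetof(fault_record_t, r0) == 20, "r0 offset");
static_assert(offsetof(fault_record_t, r1) == 24, "r1 offset");
static_assert(offsetof(fault_record_t, fault_type) == 28, "fault_type offset");
```

Keeping the static assertions next to the struct means a reordered field breaks the build instead of silently corrupting the log.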
This has served me well for a lot of simple faults - null pointer dereferences, etc. The handler reads the status information, writes it to a peripheral on the K24 called the "system register file" that persists through any reboot that isn't POR or low voltage, and I read it when I boot up.
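The body of `store_fault_info` is not shown, so the following is only a sketch of the kind of decoding such a handler could do. The bit positions come from the ARMv7-M definitions of HFSR and CFSR; the function `classify_fault` and the `fault_source_t` enum are made-up names for illustration.

```c
#include <stdint.h>

/* HFSR bits (ARMv7-M): VECTTBL = bit 1, FORCED = bit 30. */
#define HFSR_VECTTBL    0x00000002u
#define HFSR_FORCED     0x40000000u

/* CFSR is three sub-registers packed into one 32-bit word. */
#define CFSR_MMFSR_MASK 0x000000FFu /* MemManage fault status, bits 0-7  */
#define CFSR_BFSR_MASK  0x0000FF00u /* BusFault status, bits 8-15        */
#define CFSR_UFSR_MASK  0xFFFF0000u /* UsageFault status, bits 16-31     */

/* Hypothetical classification type, not from the original post. */
typedef enum {
    FAULT_SRC_UNKNOWN,
    FAULT_SRC_VECTOR_TABLE,
    FAULT_SRC_MEMMANAGE,
    FAULT_SRC_BUS,
    FAULT_SRC_USAGE
} fault_source_t;

/* Classify a hard fault from raw HFSR and CFSR values. With the configurable
 * fault handlers disabled, escalated faults set HFSR.FORCED and leave the
 * original cause in CFSR. */
fault_source_t classify_fault(uint32_t hfsr, uint32_t cfsr)
{
    if (hfsr & HFSR_VECTTBL) {
        return FAULT_SRC_VECTOR_TABLE; /* bus error on a vector table read */
    }
    if (hfsr & HFSR_FORCED) {
        if (cfsr & CFSR_MMFSR_MASK) return FAULT_SRC_MEMMANAGE;
        if (cfsr & CFSR_BFSR_MASK)  return FAULT_SRC_BUS;
        if (cfsr & CFSR_UFSR_MASK)  return FAULT_SRC_USAGE;
    }
    return FAULT_SRC_UNKNOWN;
}
```

On the target, the two arguments would be read from the System Control Block (CFSR at 0xE000ED28, HFSR at 0xE000ED2C). Taking them as plain parameters keeps the decoding logic testable off-target.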
However, I still get some faults that do not appear to trigger this handler - I get a reboot, and my persistent data is uninitialized. My core question is, why does my handler sometimes not execute when a hard fault occurs? And how can I make it more general to handle this case?
Figured it out: a piece of code from our vendor was implementing a critical section by setting FAULTMASK rather than BASEPRI. Setting FAULTMASK raises the execution priority to -1, the same level as HardFault, so the hard fault handler cannot preempt (only NMI handlers may execute). Any fault inside that critical section therefore sent the core straight into LOCKUP and the device reset without my handler ever running.
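For comparison, here is a minimal sketch of a critical section built on BASEPRI, which masks only configurable-priority interrupts and leaves HardFault and NMI able to fire. This is not the vendor's code. The names `critical_enter`, `critical_exit`, and `CRITICAL_BASEPRI` are hypothetical, and the threshold value assumes 4 implemented priority bits, which should be checked against the device header. The non-ARM branch is only a host-side stand-in so the logic can be exercised off-target.

```c
#include <stdint.h>

/* Assumption: 4 priority bits in the top nibble, so 0x10 masks every
 * interrupt with priority level 1 or numerically higher (less urgent). */
#define CRITICAL_BASEPRI 0x10u

#if defined(__arm__) || defined(__thumb__)
static inline uint32_t read_basepri(void)
{
    uint32_t value;
    __asm volatile ("mrs %0, basepri" : "=r"(value));
    return value;
}
static inline void write_basepri(uint32_t value)
{
    __asm volatile ("msr basepri, %0" : : "r"(value) : "memory");
}
/* BASEPRI_MAX only accepts writes that tighten the mask. */
static inline void write_basepri_max(uint32_t value)
{
    __asm volatile ("msr basepri_max, %0" : : "r"(value) : "memory");
}
#else
/* Host stand-in that emulates the register in a variable. */
static uint32_t fake_basepri;
static inline uint32_t read_basepri(void) { return fake_basepri; }
static inline void write_basepri(uint32_t value) { fake_basepri = value; }
static inline void write_basepri_max(uint32_t value)
{
    if (value != 0u && (fake_basepri == 0u || value < fake_basepri)) {
        fake_basepri = value;
    }
}
#endif

/* Enter the critical section and return the previous mask for restoring. */
uint32_t critical_enter(void)
{
    uint32_t previous = read_basepri();
    write_basepri_max(CRITICAL_BASEPRI);
    return previous;
}

/* Leave the critical section by restoring the mask saved on entry. */
void critical_exit(uint32_t previous)
{
    write_basepri(previous);
}
```

Saving and restoring the previous value lets the sections nest, and writing through BASEPRI_MAX avoids loosening a stricter mask that an outer caller already set.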