Cortex-M4 hard fault: finding the root cause on LPC4078 (pc=0x0)

Hi everyone,

I'm getting a hard fault on my LPC4078 (using LPCXpresso) and would be very glad if you could help me find the root cause.

The µC runs FreeRTOS 8.2.2, but I'm not sure whether the hard fault has anything to do with it.

When the hard fault occurs, it hangs at this position:

The register values are:

r0 volatile uint32_t 0x1 (Hex)
r1 volatile uint32_t 0x300 (Hex)
r2 volatile uint32_t 0x0 (Hex)
r3 volatile uint32_t 0x10008a90 (Hex)
r12 volatile uint32_t 0x0 (Hex)
lr volatile uint32_t 0x12f89 (Hex)
pc volatile uint32_t 0x0 (Hex)
psr volatile uint32_t 0x0 (Hex)
SCB SCB_Type * 0xe000ed00
CPUID const volatile uint32_t 0x410fc241 (Hex)
ICSR volatile uint32_t 0x429803 (Hex)
VTOR volatile uint32_t 0x8000 (Hex)
AIRCR volatile uint32_t 0xfa050000 (Hex)
SCR volatile uint32_t 0x0 (Hex)
CCR volatile uint32_t 0x200 (Hex)
SHP volatile uint8_t [12] 0xe000ed18 (Hex)
SHCSR volatile uint32_t 0x0 (Hex)
CFSR volatile uint32_t 0x20000 (Hex)
HFSR volatile uint32_t 0x40000000 (Hex)
DFSR volatile uint32_t 0x0 (Hex)
MMFAR volatile uint32_t 0xe000edf8 (Hex)
BFAR volatile uint32_t 0xe000edf8 (Hex)
AFSR volatile uint32_t 0x0 (Hex)
PFR const volatile uint32_t [2] 0xe000ed40 (Hex)
PFR[0] const volatile uint32_t 48
PFR[1] const volatile uint32_t 512
DFR const volatile uint32_t 0x100000 (Hex)
ADR const volatile uint32_t 0x0 (Hex)
MMFR const volatile uint32_t [4] 0xe000ed50 (Hex)
MMFR[0] const volatile uint32_t 1048624
MMFR[1] const volatile uint32_t 0
MMFR[2] const volatile uint32_t 16777216
MMFR[3] const volatile uint32_t 0
ISAR const volatile uint32_t [5] 0xe000ed60 (Hex)
ISAR[0] const volatile uint32_t 17830160
ISAR[1] const volatile uint32_t 34676736
ISAR[2] const volatile uint32_t 555950641
ISAR[3] const volatile uint32_t 17895729
ISAR[4] const volatile uint32_t 19988786
RESERVED0 uint32_t [5] 0xe000ed74 (Hex)
RESERVED0[0] uint32_t 0
RESERVED0[1] uint32_t 0
RESERVED0[2] uint32_t 0
RESERVED0[3] uint32_t 0
RESERVED0[4] uint32_t 0
CPACR volatile uint32_t 0xf00000 (Hex)

Unfortunately pc is 0x0. With similar hard faults in the past, the pc value helped me a lot.

How would you proceed to find the cause? Is any information missing, or should I check any other values?

I have already searched on Google, but so far I haven't found anything useful, or what I found seemed too complex.

I'm looking forward to hearing from you with any hints or tips.

Best regards,

Daniel

Parents
  • Okay, I've got the 2015 edition of the Definitive Guide for M3 & M4 processors. Do you mean chapter 12.8? Or is there another troubleshooting chapter?

    Of course I've checked the NXP forums. My HardFault_Handler() function is already the same as described there, and of course I've got the Cortex-M4 Devices Generic User Guide. Do I really need to read the whole guide? That will take so much time.

    My call stack looks like this:

    But how does that help?

    HardFault_Handler() at main.c:125 0x1110c
    <signal handler called>() at 0xfffffffd
    prvPortStartFirstTask() at port.c:284 0x1b70c
    xPortStartScheduler() at port.c:370 0x1ba40
    0xd3cba64a

    I hoped you could give me some tips about the register values in HardFault_Handler().

    What does pc=0x0 mean? Can I use lr=0x13411 for something? What about r0, r1, r2, r3, r12 values? What about CFSR=0x20000? (INVSTATE=1?)

Children
  • If you have the paper copy of the book, it doesn't have the appendices. Because the book is too big, they moved the appendices online to the companion website: https://booksite.elsevier.com/9780124080829/

    From there you can download the appendices: https://booksite.elsevier.com/9780124080829/appendices.php and the troubleshooting guide is appendix I: https://booksite.elsevier.com/9780124080829/downloads/APP-09.pdf

    I haven't used NXP LPCXpresso for a very long time. However, if you can view the register window, you can see the LR (EXC_RETURN), and from it you can tell which stack pointer was used for exception stacking: if bit 2 of EXC_RETURN is 0, MSP was used, so check where MSP points. If it is 1, PSP was used for stacking.

    Then locate the exception stack frame based on MSP/PSP, and look at the value at address offset 24 (decimal). This is the PC value that was pushed to the stack.

    You mentioned: <signal handler called>() at 0xfffffffd

    I guess this is LR (EXC_RETURN) and it is 0xfffffffd, so the fault was triggered in Thread mode while using PSP.
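
    The two steps above (pick MSP or PSP from EXC_RETURN, then read the stacked PC at offset 24) can be sketched in plain C as follows. This is a minimal host-side sketch: the helper names and the enum are made up for illustration, and it assumes the basic 8-word frame without floating-point state.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Word indices of the basic exception stack frame (illustrative names). */
enum { FRAME_R0, FRAME_R1, FRAME_R2, FRAME_R3,
       FRAME_R12, FRAME_LR, FRAME_PC, FRAME_XPSR };

/* Returns 1 if EXC_RETURN says PSP was used for stacking (bit 2 set),
 * 0 if MSP was used. */
static int exc_return_uses_psp(uint32_t exc_return)
{
    return (exc_return & (1u << 2)) != 0;
}

/* Picks the stack pointer value that points at the exception frame. */
static uint32_t frame_address(uint32_t exc_return, uint32_t msp, uint32_t psp)
{
    return exc_return_uses_psp(exc_return) ? psp : msp;
}

/* The stacked PC sits at byte offset 24, i.e. word index 6 of the frame. */
static uint32_t stacked_pc(const uint32_t *frame)
{
    assert(frame != NULL);
    return frame[FRAME_PC];
}
```

    With EXC_RETURN = 0xfffffffd, exc_return_uses_psp() returns 1, so the frame to inspect is the one PSP points to.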

    >What about CFSR=0x20000? (INVSTATE=1?)

    As shown in appendix I of the book, INVSTATE could be caused by:

    1) Loading a branch target address into PC with the LSB equal to zero. The stacked PC should show the branch target.

    2) The LSB of a vector address in the vector table is zero. The stacked PC should show the start of the exception handler.

    3) The stacked PSR was corrupted during exception handling, so after the exception the core tries to return to the interrupted code in ARM state.

    Combining the stacked PC value, the disassembly of the code, and the CFSR information, hopefully you can work out which of the causes mentioned above is the actual one.
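
    For reading the CFSR value itself, a small decoder can help. The sketch below only covers the UsageFault flags (the upper half-word of CFSR) and the HFSR FORCED bit; the macro and function names are illustrative, not CMSIS names.

```c
#include <assert.h>
#include <stdint.h>

/* UsageFault status flags: UFSR occupies bits 31:16 of CFSR. */
#define CFSR_UNDEFINSTR (1u << 16)
#define CFSR_INVSTATE   (1u << 17)
#define CFSR_INVPC      (1u << 18)
#define CFSR_NOCP       (1u << 19)
#define CFSR_UNALIGNED  (1u << 24)
#define CFSR_DIVBYZERO  (1u << 25)

/* HFSR.FORCED: a configurable fault was escalated to HardFault. */
#define HFSR_FORCED     (1u << 30)

/* The values from the register dump: CFSR = 0x20000, HFSR = 0x40000000. */
static_assert(CFSR_INVSTATE == 0x20000u, "CFSR 0x20000 is exactly INVSTATE");
static_assert(HFSR_FORCED == 0x40000000u, "HFSR 0x40000000 is exactly FORCED");

/* Returns the name of the lowest UsageFault flag set in cfsr, or "none". */
static const char *usage_fault_name(uint32_t cfsr)
{
    if (cfsr & CFSR_UNDEFINSTR) return "UNDEFINSTR";
    if (cfsr & CFSR_INVSTATE)   return "INVSTATE";
    if (cfsr & CFSR_INVPC)      return "INVPC";
    if (cfsr & CFSR_NOCP)       return "NOCP";
    if (cfsr & CFSR_UNALIGNED)  return "UNALIGNED";
    if (cfsr & CFSR_DIVBYZERO)  return "DIVBYZERO";
    return "none";
}
```

    So CFSR=0x20000 does decode to INVSTATE=1, and HFSR=0x40000000 says the HardFault was an escalated fault rather than a vector table read error.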

    Hope this helps.

    regards,

    Joseph

  • Hi

    thanks for the information. After I enabled the other fault handlers with

    SCB->SHCSR |= SCB_SHCSR_USGFAULTENA_Msk | SCB_SHCSR_BUSFAULTENA_Msk | SCB_SHCSR_MEMFAULTENA_Msk;

    it hangs in UsageFault_Handler:

    Does that mean that the HardFault_Handler() call wasn't correct before?

    Register values are (pc=0x00008128 is this function itself):

    As you can see, lr is 0xFFFFFFFD, which means PSP is used. At the PSP stack position 0x10008A08 I see this:

    The 6th long word, 0x00013421, seems to be a valid flash address, and I find this in the disassembly:

    Does it mean there's a problem with the

    Chip_CAN_Send(CANBUS_PERIPHERAL, CAN_BUFFER_1, pMsgObj);

    call some lines above?

    If yes, this function is called many, many times and has worked reliably so far. How can I trigger this? (And what can I conclude when combining it with CFSR INVSTATE=1?)

    Regards,

    Daniel

  • Hi Daniel,

    The value at the address after the stacked PC is 0. This value should be the stacked xPSR, and the T bit in this value should be set, but it isn't.

    I guess there is a stack corruption. Please check whether you have allocated enough stack space for the Main Stack (used by interrupt handlers) and for each of the threads. I don't know whether you can get an event trace in your development tool. If yes, please check which interrupt handler was the last one triggered. If you don't have access to an event trace feature, one trick you can use is to
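
    The T bit is bit 24 of xPSR, so the check described here is a single mask. A minimal sketch, with an illustrative macro and function name:

```c
#include <assert.h>
#include <stdint.h>

/* Thumb state bit in xPSR (EPSR.T). Cortex-M only executes Thumb code,
 * so a stacked xPSR with this bit clear cannot be returned to. */
#define XPSR_T_BIT (1u << 24)

static_assert(XPSR_T_BIT == 0x01000000u, "T bit is bit 24");

/* Returns 1 if the stacked xPSR has the Thumb bit set, 0 otherwise. */
static int xpsr_thumb_bit_set(uint32_t stacked_xpsr)
{
    return (stacked_xpsr & XPSR_T_BIT) != 0;
}
```

    A stacked xPSR of 0 fails this check, which matches the INVSTATE UsageFault on exception return.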

    1) define a global variable

    2) In each interrupt, write the interrupt number into this variable

    After the HardFault, check the value of this variable to see which ISR was executing before the fault. I guess an ISR has a stack corruption and returns to 0x00013421 with xPSR equal to 0, which triggers the fault (T bit is cleared).
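
    A minimal sketch of this trick is shown here. The handler names and the numbers 5 and 26 are placeholders; in a real project the real IRQ handlers and IRQ numbers would be used, and the value 0 is reserved here to mean "no ISR has run yet" (an assumption of this sketch).

```c
#include <assert.h>
#include <stdint.h>

#define ISR_NONE 0u

/* volatile so that every write really reaches memory and the debugger
 * sees the latest value after the fault. */
static volatile uint32_t g_last_isr = ISR_NONE;

/* Records which ISR was entered most recently. */
static void isr_mark(uint32_t id)
{
    assert(id != ISR_NONE); /* 0 is reserved for "nothing ran yet" */
    g_last_isr = id;
}

/* Hypothetical handlers: the first statement records the ID. */
void Example_A_IRQHandler(void)
{
    isr_mark(5u);
    /* ... real interrupt work would go here ... */
}

void Example_B_IRQHandler(void)
{
    isr_mark(26u);
    /* ... real interrupt work would go here ... */
}
```

    After the fault, inspecting g_last_isr in the debugger shows the ID of the handler that was entered last.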

    regards,

    Joseph

  • Hi 

    what is the correct stack frame layout?

    In the Cortex-M4 User Guide I find:

    In case the 7th word is 0 I get a UsageFault with UFSR INVSTATE=1, e.g.:

    In case the 7th word is any other invalid address (this happens very rarely) I get a BusFault with BFSR IBUSERR=1 (e.g. if PC is 0x14000000).

    But you wrote about the word after the PC; isn't that the 8th word, "xPSR"? Or what is the correct layout?

    As you suggested, I defined a global variable and set it to a unique number in every interrupt. NVIC_ISER shows these enabled interrupts:

    The enum LPC40XX_IRQn_Type in cmsis_40xx.h maps the set bits in ISER[0] and ISER[1] to these interrupts:

    5: UART0_IRQn

    10: I2C0_IRQn

    22: ADC_IRQn

    25: CAN_IRQn

    26: DMA_IRQn

    38: GPIO_IRQn
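
    The mapping from set ISER bits to IRQ numbers is simply word index * 32 + bit position. The sketch below shows that arithmetic; the function name is illustrative, and the example values in the note after the code are reconstructed from the list above, not read from the original register view.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Writes the IRQ number of every set bit in iser[0..words-1] to out,
 * in ascending order, and returns how many were written.
 * Stops early if out is full. */
static size_t decode_iser(const uint32_t *iser, size_t words,
                          uint8_t *out, size_t out_cap)
{
    size_t n = 0;

    assert(iser != NULL && out != NULL);
    for (size_t w = 0; w < words; ++w) {
        for (unsigned bit = 0; bit < 32u; ++bit) {
            if (((iser[w] >> bit) & 1u) != 0u) {
                if (n == out_cap) {
                    return n;
                }
                out[n++] = (uint8_t)(w * 32u + bit);
            }
        }
    }
    return n;
}
```

    For example, ISER[0] = 0x06400420 and ISER[1] = 0x00000040 decode to IRQ numbers 5, 10, 22, 25, 26 and 38.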

    But what about these 3 FreeRTOS interrupts, which are implemented too?

    #define vPortSVCHandler SVC_Handler
    #define xPortPendSVHandler PendSV_Handler
    #define xPortSysTickHandler SysTick_Handler

    Every time I get the UsageFault (or, very rarely, the BusFault), my variable is set to 26 (DMA_IRQn) and PendSVHandler was used recently. Using counter variables at the beginning and end of DMA_IRQHandler() and PendSVHandler(), I checked whether they ran to completion the last time, and yes, the counter variables in DMA_IRQHandler() are the same. PendSVHandler() is a bit complicated because of the assembly code inside; its 2nd variable stays at 0.

    What would you suggest could cause the stacked PC to be set to 0? How can I check whether DMA_IRQHandler() sometimes causes a stack corruption?

    Is it correct that UFSR INVSTATE being set is caused by the corrupt stacked PC=0?

  • Ah, sorry! I misread your memory view. I thought 0x13421 was the stacked PC. (I need new glasses!)

    There is a possibility that the DMA handler caused a stack overflow and corrupted a task stack. As a result, the stacked PC in the exception stack frame of the task becomes 0. Because the task is not running, this doesn't trigger the fault immediately. But a bit later, FreeRTOS context switches (PendSV) into the thread that has the corrupted stack, and it crashes. So please check the size of your main stack (which is used by the exception handlers).

    Another thing to check: make sure the task stacks are double-word aligned. (Although in this case it might not be the cause of the problem.)
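
    Double-word alignment means the stack address is a multiple of 8. A minimal sketch of such a check, assuming a statically allocated stack buffer (the buffer name and size are hypothetical, and C11 alignas is used to force the alignment):

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>

/* Returns 1 if p is 8-byte (double-word) aligned, 0 otherwise. */
static int is_double_word_aligned(const void *p)
{
    return ((uintptr_t)p & 7u) == 0u;
}

/* Hypothetical task stack buffer, forced onto an 8-byte boundary. */
static alignas(8) uint32_t example_task_stack[128];

/* The size should also be a multiple of 8 bytes so that the top of the
 * stack (buffer start + size) stays double-word aligned. */
static_assert(sizeof(example_task_stack) % 8u == 0u,
              "stack size must be a multiple of 8 bytes");
```

    Calling is_double_word_aligned() on the buffer start at startup, or checking the address in the map file, is enough to rule this cause out.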

    regards,
    Joseph

    EDITED: There could be other possible reasons for the task stack corruption, e.g. incorrect DMA operations or some other bugs in the DMA handler that cause a stack corruption.

    Alternatively, the application task that crashed has a stack overflow, which ends up with the stack growing into the main stack. An ISR using the same stack region then corrupts the stack frame inside it and ends up crashing the task when it is resumed. FreeRTOS does have a stack checking feature which can help detect such issues:

    www.freertos.org/Stacks-and-stack-overflow-checking.html
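
    The general idea behind this kind of check is "stack painting": fill the stack with a known pattern, then later count how much of the pattern is still intact. The sketch below illustrates only that idea on a plain array; it is not FreeRTOS code, the function names are made up, and the fill word is an arbitrary choice.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_FILL_WORD 0xA5A5A5A5u

/* Fills the whole stack area with the known pattern. */
static void stack_paint(uint32_t *stack, size_t words)
{
    assert(stack != NULL);
    for (size_t i = 0; i < words; ++i) {
        stack[i] = STACK_FILL_WORD;
    }
}

/* The Cortex-M stack grows downward, so untouched words sit at the
 * lowest addresses. Counts pattern words from the bottom up until the
 * first overwritten word. A result of 0 means the stack was used up
 * completely (or has overflowed). */
static size_t stack_unused_words(const uint32_t *stack, size_t words)
{
    size_t n = 0;

    assert(stack != NULL);
    while (n < words && stack[n] == STACK_FILL_WORD) {
        ++n;
    }
    return n;
}
```

    The same check applies to the main stack used by the exception handlers, which the RTOS task checks do not cover.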

  • I have now found a software bug, but unfortunately I don't understand what causes the corrupted task stack frame.

    In the rare cases when a BusFault occurred instead of a UsageFault, I noticed a unique 32-bit value at the stacked PC position. I searched the whole RAM area for this value and found it, in addition to the PC position, at another RAM address. The map file shows an array around that position. I examined that array and found that in the error case its index counter is too high and exceeds the bounds of the array. A classic programming bug (luckily not implemented by me :-))

    When I increase the size of the mentioned array, the error definitely doesn't occur anymore.

    I think it's not necessary to understand the "voodoo magic" of what happens after a write exceeds the array bounds. I see that some variables after the array are set to invalid values, and there's a function call to which values of the array are passed. But that doesn't explain why the passed values are stored at the "stacked PC" position and not behind it, where local variables etc. are stored in the frame. Stepping over in the debugger isn't enough; I have to let it run until it happens after that out-of-bounds write access.

    Are there any ways to avoid out-of-bounds array write accesses? Of course you can implement if-conditions to trap them, or comply with the MISRA-C rules. But isn't there an MPU in the LPC4078 which can identify such invalid write accesses?

    Thank you for all the support. Fortunately I don't need to examine the ISRs, DMA handler etc. any further. It's good to know that I can get support in this forum if I get similar errors in the future.
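
    The "if-condition" approach mentioned here could look like the following sketch. The buffer, its length and the rejected-write counter are hypothetical; the point is only that every write goes through one function that checks the index first.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BUF_LEN 8u

static uint32_t g_buf[BUF_LEN];

/* Counts rejected writes so the bug stays visible in the debugger
 * instead of silently corrupting neighbouring memory. */
static uint32_t g_rejected_writes;

static_assert(sizeof(g_buf) / sizeof(g_buf[0]) == BUF_LEN,
              "BUF_LEN must match the array size");

/* Writes value at index if the index is in range. Returns false and
 * leaves the array untouched otherwise. */
static bool buf_write(size_t index, uint32_t value)
{
    if (index >= BUF_LEN) {
        ++g_rejected_writes;
        return false;
    }
    g_buf[index] = value;
    return true;
}
```

    A breakpoint on the rejection branch then stops at the first bad index, long before a corrupted task is resumed.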

  • Glad to know that you have made good progress.

    I guess the out-of-bounds array write has corrupted some task stack: the saved PC value of a task, which is in the exception stack frame, is located inside the task's stack.

    During context switching, the processor switches from a task running in Thread mode to an OS exception (e.g. SysTick), and the current PC of the task is saved in the exception stack frame on the task's stack. The OS code then uses the PendSV handler to context switch, and switches to another task using an exception return. If a task stack is corrupted and the OS later context switches into it, the return address in the exception stack frame is invalid (0x0 in your case), and therefore it enters the HardFault.

    FreeRTOS can utilize the MPU to help detect this kind of issue:

    www.freertos.org/FreeRTOS-MPU-memory-protection-unit.html

    However, many FreeRTOS projects don't enable the MPU. Enabling the MPU requires you to define the memory regions for each task (and for shared data variables), and the MPU region alignment requirements in Armv7-M make it a bit more complicated. This is much easier to do on Armv8-M processors (e.g. Cortex-M23, Cortex-M33).
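
    To illustrate the Armv7-M alignment requirement: a region's size must be a power of two of at least 32 bytes, and its base address must be a multiple of that size. A minimal sketch of that rule follows; the function and macro names are illustrative and not part of any MPU driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Smallest region size supported by the Armv7-M MPU. */
#define MPU_V7M_MIN_REGION_BYTES 32u

static_assert((MPU_V7M_MIN_REGION_BYTES & (MPU_V7M_MIN_REGION_BYTES - 1u)) == 0u,
              "minimum region size is itself a power of two");

/* Returns true if (base, size) can be described by one Armv7-M MPU
 * region: size is a power of two >= 32 bytes and base is aligned to
 * size. Sizes above 2 GiB are not representable in this sketch. */
static bool mpu_v7m_region_ok(uint32_t base, uint32_t size)
{
    bool size_ok = size >= MPU_V7M_MIN_REGION_BYTES &&
                   (size & (size - 1u)) == 0u;

    return size_ok && (base & (size - 1u)) == 0u;
}
```

    This is why a task stack or array usually has to be padded and placed on a size-aligned boundary before one region can guard it.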

    regards,

    Joseph