Hi everyone,
I'm getting a hard fault on my LPC4078 with LPCXpresso and would be very glad if you could help me find the root cause.
The µC runs FreeRTOS 8.2.2, but I'm not sure whether the hard fault has anything to do with it.
When the hard fault occurs it hangs on this position:
The register values are:
r0 volatile uint32_t 0x1 (Hex)
r1 volatile uint32_t 0x300 (Hex)
r2 volatile uint32_t 0x0 (Hex)
r3 volatile uint32_t 0x10008a90 (Hex)
r12 volatile uint32_t 0x0 (Hex)
lr volatile uint32_t 0x12f89 (Hex)
pc volatile uint32_t 0x0 (Hex)
psr volatile uint32_t 0x0 (Hex)
SCB SCB_Type * 0xe000ed00
CPUID const volatile uint32_t 0x410fc241 (Hex)
ICSR volatile uint32_t 0x429803 (Hex)
VTOR volatile uint32_t 0x8000 (Hex)
AIRCR volatile uint32_t 0xfa050000 (Hex)
SCR volatile uint32_t 0x0 (Hex)
CCR volatile uint32_t 0x200 (Hex)
SHP volatile uint8_t [12] 0xe000ed18 (Hex)
SHCSR volatile uint32_t 0x0 (Hex)
CFSR volatile uint32_t 0x20000 (Hex)
HFSR volatile uint32_t 0x40000000 (Hex)
DFSR volatile uint32_t 0x0 (Hex)
MMFAR volatile uint32_t 0xe000edf8 (Hex)
BFAR volatile uint32_t 0xe000edf8 (Hex)
AFSR volatile uint32_t 0x0 (Hex)
PFR const volatile uint32_t [2] 0xe000ed40 (Hex)
PFR[0] const volatile uint32_t 48
PFR[1] const volatile uint32_t 512
DFR const volatile uint32_t 0x100000 (Hex)
ADR const volatile uint32_t 0x0 (Hex)
MMFR const volatile uint32_t [4] 0xe000ed50 (Hex)
MMFR[0] const volatile uint32_t 1048624
MMFR[1] const volatile uint32_t 0
MMFR[2] const volatile uint32_t 16777216
MMFR[3] const volatile uint32_t 0
ISAR const volatile uint32_t [5] 0xe000ed60 (Hex)
ISAR[0] const volatile uint32_t 17830160
ISAR[1] const volatile uint32_t 34676736
ISAR[2] const volatile uint32_t 555950641
ISAR[3] const volatile uint32_t 17895729
ISAR[4] const volatile uint32_t 19988786
RESERVED0 uint32_t [5] 0xe000ed74 (Hex)
RESERVED0[0] uint32_t 0
RESERVED0[1] uint32_t 0
RESERVED0[2] uint32_t 0
RESERVED0[3] uint32_t 0
RESERVED0[4] uint32_t 0
CPACR volatile uint32_t 0xf00000 (Hex)
Unfortunately pc is 0x0. In similar hard fault cases the pc value helped me a lot.
How would you proceed to find the cause? Is any information missing, or should I check any other values?
I already searched on Google, but so far I didn't find anything useful, or what I found seemed too complex.
I'm looking forward to hearing from you with any hints or tips.
Best regards,
Daniel
Have you got Joseph Yiu's Definitive Guide book? It has an appendix on Troubleshooting.
See: https://community.arm.com/developer/ip-products/system/f/embedded-forum/3257/debugging-a-cortex-m0-hard-fault - which also has links to many other resources on debugging Hard Faults
Unknown said:When the hard fault occurs it hangs on this position
So that is in your Hard Fault Handler.
From there, you should be able to see where the fault itself occurred
What IDE are you using? it looks like something Eclipse-based?
Doesn't it give you a call-stack view, or other such facilities?
Hello Andy Neil
yes, I've got the Definitive Guide, but isn't it for the Cortex-M3 rather than the M4 I have?
Do you mean appendix E for troubleshooting hard faults?
Yes, my screenshots were already taken inside the hard fault handler; sorry that I didn't say so clearly.
I use LPCXpresso, which is Eclipse-based and recommended by NXP for this µC. I don't know of a window that shows the call stack. I'd like some help analyzing the register values in the hard fault handler.
E.g. pc is zero. What does that mean? 0x0 is _vStackTop of vector table.
As described in this guide
https://www.silabs.com/community/mcu/32-bit/knowledge-base.entry.html/2014/05/26/debug_a_hardfault-78gc
the INVSTATE bit in CFSR (0x20000) is set to 1 in my case. But when I look at the disassembly around the lr address I can't recognize any misbehavior.
I need to know what happened before the fault. What approach would you recommend?
Unknown said:I've got the Definitive Guide, but it's for Cortex M3 instead M4 in my case?
So time to get the M4 edition, then!
But I think the Hard Fault handling is very similar between the two.
Unknown said:I use LPCXpresso
So have you checked on NXP's forums to find out if it does have specific features to help you?
As I said, such things are pretty much standard - it seems unlikely that NXP would omit something so basic.
Unknown said:I need to know what happened before.
If you can't work back from the information in the Hard Fault handler, then instrument your code to narrow down where the fault is happening.
Unknown said: I don't know a window with call stack visible
https://www.nxp.com/design/microcontrollers-developer-resources/lpc-microcontroller-utilities/lpcxpresso-ide-v8-2-2:LPCXPRESSO?&tab=Documentation_Tab&linkline=Users-Guide
via
https://www.nxp.com/design/microcontrollers-developer-resources/lpc-microcontroller-utilities/lpcxpresso-ide-v8-2-2:LPCXPRESSO
Okay, I've got the 2015 edition of the Definitive Guide for M3 & M4 processors. Do you mean chapter 12.8? Or is there another troubleshooting chapter?
Of course I've checked NXP forums. My HardFault_Handler() - function is already the same as described and of course I've got the Cortex M4 Devices Generic User Guide. Do I really need to read the whole guide? That will take sooo much time.
My call stack looks like this:
HardFault_Handler() at main.c:125 0x1110c
<signal handler called>() at 0xfffffffd
prvPortStartFirstTask() at port.c:284 0x1b70c
xPortStartScheduler() at port.c:370 0x1ba40
0xd3cba64a
But how does that help?
I had hoped you could give me some tips about the register values in HardFault_Handler().
What does pc=0x0 mean? Can I use lr=0x13411 for something? What about r0, r1, r2, r3, r12 values? What about CFSR=0x20000? (INVSTATE=1?)
If you got the paper copy of the book, it doesn't have the appendixes. Because the book is too big they moved the appendixes online on the companion website: https://booksite.elsevier.com/9780124080829/
From there you can download the appendixes: https://booksite.elsevier.com/9780124080829/appendices.php and the trouble shooting guide is appendix I: https://booksite.elsevier.com/9780124080829/downloads/APP-09.pdf
I haven't used NXP LPCXpresso for a very long time. However, if you can view the register window, you can see the LR (exception return) value, and from it you can tell which stack pointer was used for exception stacking: if bit 2 of EXC_RETURN is 0, check where MSP points; if it is 1, PSP was used for stacking.
Then locate the exception stack frame based on MSP/PSP and look at the value at address offset 24 (decimal). This is the PC value that was pushed to the stack.
You mentioned : <signal handler called>() at 0xfffffffd
I guess this is LR (EXC_RETURN) and is 0xfffffffd, so the fault is triggered in thread mode and was using PSP.
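The two steps above (pick the stack pointer from bit 2 of EXC_RETURN, then read the word at byte offset 24) can be sketched in C roughly as follows. The function names are made up for illustration, and MSP/PSP are passed in as plain values; on the target they would come from the debugger's register view or from MRS instructions in the fault handler.

```c
#include <stdint.h>

/* Returns 1 if EXC_RETURN says the exception frame was stacked on PSP,
 * 0 if it was stacked on MSP (bit 2 of EXC_RETURN). */
int exc_return_uses_psp(uint32_t exc_return)
{
    return (exc_return & (1u << 2)) != 0;
}

/* Selects the stack pointer value that holds the exception frame. */
uint32_t frame_base(uint32_t exc_return, uint32_t msp, uint32_t psp)
{
    return exc_return_uses_psp(exc_return) ? psp : msp;
}

/* Reads the stacked PC: byte offset 24 is word index 6 of the frame. */
uint32_t stacked_pc(const uint32_t *frame)
{
    return frame[24 / sizeof(uint32_t)];
}
```

With EXC_RETURN = 0xfffffffd, bit 2 is set, so frame_base() returns the PSP value and the frame should be inspected there.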
>What about CFSR=0x20000? (INVSTATE=1?)
As shown in appendix I of the book, the INVSTATE could be caused by:
1) Loading branch target address to PC with LSB equals zero. Stacked PC should show the branch target.
2) LSB of vector address in vector table is zero. Stacked PC should show the starting of exception handler.
3) Stacked PSR corrupted during exception handling, so after the exception the core tries to return to the interrupted code in ARM state
Combining the stacked PC value, the disassembly of the code and the CFSR information, hopefully you can work out which of the causes mentioned above is the actual one.
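As a rough aid for reading a CFSR value such as 0x20000, the register can be split into its three parts in C. This is only a sketch: the macro and function names are invented here (the CMSIS headers define their own SCB_CFSR_* masks), and only a few of the flag bits are listed.

```c
#include <stdint.h>

/* CFSR packs three status registers:
 * MMFSR in bits [7:0], BFSR in bits [15:8], UFSR in bits [31:16]. */
#define CFSR_IBUSERR     (1u << 8)    /* BFSR bit 0: instruction bus error */
#define CFSR_UNDEFINSTR  (1u << 16)   /* UFSR bit 0 */
#define CFSR_INVSTATE    (1u << 17)   /* UFSR bit 1: invalid state (T bit clear) */
#define CFSR_INVPC       (1u << 18)   /* UFSR bit 2 */

uint32_t cfsr_mmfsr(uint32_t cfsr) { return cfsr & 0xFFu; }
uint32_t cfsr_bfsr(uint32_t cfsr)  { return (cfsr >> 8) & 0xFFu; }
uint32_t cfsr_ufsr(uint32_t cfsr)  { return (cfsr >> 16) & 0xFFFFu; }

/* Returns 1 if the INVSTATE flag is set in the given CFSR value. */
int cfsr_is_invstate(uint32_t cfsr)
{
    return (cfsr & CFSR_INVSTATE) != 0;
}
```

For CFSR = 0x20000 the MMFSR and BFSR parts are zero and only INVSTATE is set in the UFSR part.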
Hope this helps.
regards,
Joseph
Hi Joseph Yiu
thanks for the information. After I activated the other fault possibilities with
SCB->SHCSR |= SCB_SHCSR_USGFAULTENA_Msk | SCB_SHCSR_BUSFAULTENA_Msk | SCB_SHCSR_MEMFAULTENA_Msk;
it hangs in UsageFault_Handler:
Does that mean the HardFault_Handler() call wasn't correct before?
Register values are (pc=0x00008128 is this function itself):
As you see lr is 0xFFFFFFFD which means PSP is used. At stack position 0x10008A08 of psp I see this
6th long word 0x00013421 seems to be valid flash address and I find in Disassembly:
Does it mean there's a problem with
Chip_CAN_Send(CANBUS_PERIPHERAL, CAN_BUFFER_1, pMsgObj);
call some lines above?
If yes, this function is called many, many times and worked reliably before. How can I trigger this? (And what can I conclude when combining it with CFSR INVSTATE=1?)
Regards,
Hi Daniel,
The address after the stacked PC is 0. This value should be the stacked xPSR, and the T bit in this value should be set, but it isn't.
I guess there is a stack corruption. Please check that you have allocated enough stack space for the Main Stack (used by interrupt handlers) and for each of the threads. I don't know if there is any chance for you to get event trace in your development tool. If yes, please check which interrupt handler was the last one triggered. If you don't have access to an event trace feature, one trick you can do is to
1) define a global variable
2) In each interrupt, write the interrupt number into this variable
After the HardFault, look at the value in this variable to see which ISR was executing before the fault. I guess an ISR has a stack corruption and returned to 0x00013421 with xPSR equal to 0, which triggered the fault (T bit is cleared).
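A minimal sketch of that trick in C could look like this. The variable name, the helper and the placeholder handler bodies are assumptions for illustration; the IRQ numbers 25 and 26 are only examples and should be taken from the device header.

```c
#include <stdint.h>

/* Hypothetical breadcrumb: number of the ISR that ran most recently.
 * 0xFFFFFFFF means no ISR has run yet. */
volatile uint32_t g_last_isr = 0xFFFFFFFFu;

/* Record the interrupt number on entry to a handler. */
static inline void isr_mark(uint32_t irq_number)
{
    g_last_isr = irq_number;
}

/* Placeholder handlers; the real handler work would follow the mark. */
void DMA_IRQHandler(void)
{
    isr_mark(26u);   /* example number for the DMA interrupt */
    /* ... real handler work ... */
}

void CAN_IRQHandler(void)
{
    isr_mark(25u);   /* example number for the CAN interrupt */
    /* ... real handler work ... */
}
```

After the fault, g_last_isr can be read in the debugger's expression or memory view.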
what is the correct stack frame layout?
In Cortex-M4 User Guide I find:
In case the 7th word (the stacked PC) is 0 I get a UsageFault with UFSR_INVSTATE=1, e.g.:
In case the 7th word is any other invalid address (this happens very rarely) I get a BusFault with BFSR_IBUSERR=1 (e.g. if PC is 0x14000000).
But you wrote about the word after PC, isn't that the 8th word "xPSR", or what is the correct layout?
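For reference, the basic exception stack frame (eight words, lowest address first) can be written as a C struct. This is a sketch with invented type and function names, and it assumes the basic frame without floating-point state; with an active FPU context the frame is longer, although these first eight words keep the same order.

```c
#include <stddef.h>
#include <stdint.h>

/* Basic exception stack frame as pushed by the core on exception entry. */
typedef struct {
    uint32_t r0;
    uint32_t r1;
    uint32_t r2;
    uint32_t r3;
    uint32_t r12;
    uint32_t lr;    /* 6th word, byte offset 20 */
    uint32_t pc;    /* 7th word, byte offset 24 */
    uint32_t xpsr;  /* 8th word, byte offset 28 */
} exc_frame_t;

#define XPSR_T_BIT (1u << 24)   /* Thumb state bit in xPSR */

/* Returns 1 if the frame would resume in Thumb state, which a Cortex-M
 * requires; 0 means the exception return will raise INVSTATE. */
int frame_returns_to_thumb(const exc_frame_t *f)
{
    return (f->xpsr & XPSR_T_BIT) != 0;
}
```

A frame whose xpsr word is 0 fails this check, which matches a UsageFault with INVSTATE set.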
As you suggested I defined a global variable and set it to unique numbers in every interrupt: NVIC_ISER shows these enabled interrupts:
Enum LPC40XX_IRQn_Type in cmsis_40xx.h extracts it (set bits in ISER[0] and [1]) to these interrupts:
5: UART0_IRQn
10: I2C0_IRQn
22: ADC_IRQn
25: CAN_IRQn
26: DMA_IRQn
38: GPIO_IRQn
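Turning the set bits of the ISER words into IRQ numbers is mechanical and can be sketched in C as below. The function name is hypothetical. The values 0x06400420 and 0x00000040 used in the note afterwards are reconstructed from the list above, not copied from the original register view.

```c
#include <stddef.h>
#include <stdint.h>

/* Writes the IRQ numbers of all set bits in the ISER words to out[].
 * Bit b of word w corresponds to IRQ number w * 32 + b.
 * Returns how many bits were set; at most max_out numbers are stored. */
size_t iser_enabled_irqs(const uint32_t *iser, size_t n_words,
                         uint8_t *out, size_t max_out)
{
    size_t count = 0;
    for (size_t w = 0; w < n_words; ++w) {
        for (unsigned bit = 0; bit < 32u; ++bit) {
            if (iser[w] & (1u << bit)) {
                if (count < max_out) {
                    out[count] = (uint8_t)(w * 32u + bit);
                }
                ++count;
            }
        }
    }
    return count;
}
```

Called with ISER[0] = 0x06400420 and ISER[1] = 0x00000040, it yields the six numbers 5, 10, 22, 25, 26 and 38.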
But what about these 3 FreeRTOS-Interrupts which are implemented, too?
#define vPortSVCHandler SVC_Handler
#define xPortPendSVHandler PendSV_Handler
#define xPortSysTickHandler SysTick_Handler
Every time I get the UsageFault (or, very rarely, the BusFault) my variable is set to 26 (DMA_IRQn), and PendSVHandler was used recently. Using counting variables at the beginning and end of DMA_IRQHandler() and PendSVHandler(), I checked whether they ran to completion the last time, and yes, the two counting variables in DMA_IRQHandler() are the same. PendSVHandler() is a bit complicated because of the assembly code inside; the 2nd variable stays at 0.
What would you suggest could cause the stacked PC to be set to 0? How can I check whether DMA_IRQHandler() sometimes has a stack corruption?
Is it correct that the INVSTATE bit in UFSR is set because of the corrupt stacked PC=0?
Ah, sorry! I misread your memory view. I thought 0x13421 was the stacked PC. (I need new glasses!)
There is a possibility that the DMA handler caused a stack overflow and corrupted a task stack. As a result, the stacked PC in the exception stack frame of the task becomes 0. Because the task is not running, this doesn't trigger the fault immediately. But a bit later, the FreeRTOS context switch (PendSV) switches into the thread that has the corrupted stack, and it crashes. So please check the size of your main stack (which is used by the exception handlers).
Another thing to check: make sure the task stacks are double-word aligned. (Although in this case it might not be the cause of the problem.)
regards,
Joseph
EDITED: There could be other possible reasons for the task stack corruption, e.g. incorrect DMA operations or some other bug in the DMA handler that corrupts a stack.
Alternatively, the application task that crashed has a stack overflow, so that its stack grows into the main stack. An ISR using the same stack region then corrupts the stack frame inside it, and the task crashes when it is resumed. FreeRTOS does have a stack checking feature which can help detect such issues:
www.freertos.org/Stacks-and-stack-overflow-checking.html
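The FreeRTOS checks cover task stacks. For the main stack, one common manual approach is to fill the region with a known pattern at startup and later see how much of it is still intact. The sketch below shows the idea on a plain array; the names and the 0xA5A5A5A5 pattern are choices made for this example, and on real hardware only the part of the main stack below the current stack pointer may be painted.

```c
#include <stddef.h>
#include <stdint.h>

#define STACK_FILL 0xA5A5A5A5u   /* arbitrary pattern chosen for this sketch */

/* Fill a stack region with the pattern before it is first used. */
void stack_paint(uint32_t *base, size_t words)
{
    for (size_t i = 0; i < words; ++i) {
        base[i] = STACK_FILL;
    }
}

/* Count untouched words starting at the lowest address. Stacks grow
 * downward on Cortex-M, so a result of 0 means the stack has reached
 * (or passed) the bottom of its region. */
size_t stack_unused_words(const uint32_t *base, size_t words)
{
    size_t n = 0;
    while (n < words && base[n] == STACK_FILL) {
        ++n;
    }
    return n;
}
```

Checking stack_unused_words() from the debugger after a long run gives a rough idea of the remaining headroom.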
I have now found a software bug, but unfortunately I don't understand how it causes the corrupted task stack frame.
In the rare cases when a BusFault occurred instead of the UsageFault, I noticed a unique 32-bit value at the stacked PC position. I searched the whole RAM area for this value and found it, in addition to the PC position, at another RAM address. The map file shows an array around that position. I examined that array and found out that in the error case its index counter is too high and exceeds the bounds of the array. A classic programming bug (luckily not implemented by me :-))
When I increase the size of the mentioned array, the error definitely doesn't occur anymore.
I think it's not necessary to understand the "voodoo magic" that happens after a write beyond the array bounds. I see that some variables after the array are set to invalid values, and there's a function call where values of the array are passed. But that doesn't explain why the passed values are stored at the "stacked PC" position and not behind it, where local variables etc. are stored in the frame. Stepping over in the debugger isn't enough; I have to let the program run until it happens after that out-of-bounds write access.
Are there any ways to avoid out-of-bounds array writes? Of course you can implement if-conditions to trap them, or comply with the MISRA-C rules. But isn't there an MPU in the LPC4078 which identifies such invalid write accesses?
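The "if-condition" approach mentioned above might look like the following sketch, where every write goes through a helper that refuses to go past the end of the array. The type, the capacity of 8 and the function name are invented for the example and are not taken from the project in question.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t data[8];   /* capacity chosen arbitrarily for the sketch */
    size_t   count;     /* number of valid entries */
} msg_buf_t;

/* Appends a value; returns 0 on success, -1 if the buffer is full. */
int msg_buf_push(msg_buf_t *b, uint32_t value)
{
    if (b->count >= sizeof b->data / sizeof b->data[0]) {
        return -1;   /* would write past the end of data[]: refuse */
    }
    b->data[b->count++] = value;
    return 0;
}
```

The caller still has to decide what to do on -1 (drop, log, or assert), but the neighbouring memory stays intact.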
Joseph Yiu Thank you for all the support. Fortunately I don't need to examine more of the ISR's, DMA handler etc. It's good to know to get support in this forum if I get similar errors in future.
Glad to know that you have made good progress.
I guess the out-of-bounds array write has corrupted some task stack: the saved PC value of a task, which is in the exception stack frame, is located inside the task's stack.
During context switching, the processor switches from a task running in Thread mode to an OS exception (e.g. SysTick), and the current PC of the task is saved in the exception stack frame on the task's stack. The OS code then uses the PendSV code to context switch, and switches to another task using exception return. If a task stack is corrupted and the OS later context switches into that task, the return address in the exception stack frame is invalid (0x0 in your case), and therefore it entered the HardFault.
FreeRTOS can utilize the MPU to help detect this kind of issue:
www.freertos.org/FreeRTOS-MPU-memory-protection-unit.html
However, many FreeRTOS projects don't enable the MPU. Enabling the MPU requires you to define the memory regions for each task (and for shared data variables), and the MPU region alignment requirements in Armv7-M make it a bit more complicated. This is much easier to do on Armv8-M processors (e.g. Cortex-M23, Cortex-M33).
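The Armv7-M alignment requirement is that a region's size is a power of two of at least 32 bytes and that its base address is a multiple of that size. A small sketch of the check, with hypothetical function names, assuming sizes that fit in 32 bits (so the full 4 GB region is not covered):

```c
#include <stdint.h>

/* Returns 1 if (base, size) is a legal Armv7-M MPU region:
 * size is a power of two >= 32 and base is aligned to size. */
int mpu_region_valid(uint32_t base, uint32_t size)
{
    if (size < 32u || (size & (size - 1u)) != 0u) {
        return 0;
    }
    return (base & (size - 1u)) == 0u;
}

/* Returns the SIZE field encoding, where region size = 2^(SIZE + 1)
 * bytes, or -1 if the size is not a legal region size. */
int mpu_size_field(uint32_t size)
{
    if (size < 32u || (size & (size - 1u)) != 0u) {
        return -1;
    }
    int log2 = 0;
    while ((size >> log2) != 1u) {
        ++log2;
    }
    return log2 - 1;
}
```

For example, a 1 KB region at 0x10008A00 is rejected because the base is not 1 KB aligned, which is why task stacks and buffers usually have to be placed and sized with the MPU in mind.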
I have seen errors like this, pc being set to 0, so can offer some insight. It may be way off base. I haven't used your CPU, nor an M4 at all, but have used M3 and the two are similar enough for what I'm about to describe. Also, I use the GNU toolchain + make (no IDE) but your problem is obviously a runtime one so the build process isn't so relevant.
Let's look at the facts:
pc 0
lr 12f89
psr 0
hfsr 40000000
shcsr 0
This tells us: the fault was escalated to Hard Fault from a lesser fault, since hfsr[30] is set. A pc of 0 produces a usage fault, because the M4 can't switch to ARM mode, only Thumb, and a Thumb branch target always has bit 0 set. So, a usage fault has been escalated to Hard Fault. SHCSR[18] being clear confirms that the Usage Fault handler was not enabled at the time of the fault.
psr[8:0] being zero tells us we weren't in any system exception (SVC,PendSV,Systick) or interrupt handler at time of fault. You say you are running with an RTOS, so I would guess you were in thread mode and using Process stack at time of fault.
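The bit tests used in this reading of the dump can be written down in C as a small sketch. The macro and function names are invented here; the bit positions are the ones quoted above (hfsr[30], SHCSR[18], psr[8:0]).

```c
#include <stdint.h>

#define HFSR_FORCED        (1u << 30)  /* lesser fault escalated to HardFault */
#define SHCSR_USGFAULTENA  (1u << 18)  /* UsageFault handler enabled */
#define IPSR_MASK          0x1FFu      /* psr[8:0]: active exception number */

/* Returns 1 if the HardFault was escalated from another fault. */
int hardfault_was_escalated(uint32_t hfsr)
{
    return (hfsr & HFSR_FORCED) != 0;
}

/* Returns 1 if the UsageFault handler was enabled. */
int usagefault_enabled(uint32_t shcsr)
{
    return (shcsr & SHCSR_USGFAULTENA) != 0;
}

/* Returns the exception number held in a PSR value; 0 means Thread mode. */
uint32_t active_exception(uint32_t psr)
{
    return psr & IPSR_MASK;
}
```

With hfsr = 0x40000000, shcsr = 0 and psr = 0 this gives: escalated, UsageFault handler disabled, Thread mode.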
OK, here's my theory...
Some function A includes a call to some other function B, at 0x12f84:
12f84: bl B
12f88: next thing in A
Why do I think this? Because your lr is 12f89, and that is correct lr value for B to return to A at the instruction after A's call to B (the instruction is at 12f88 but to jump to it, PC[0] must be 1).
If you have a listing file (I would do an OBJDUMP on my .axf file to produce a .lst file), you can look for 12f84 and that will tell you both A and B.
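The arithmetic behind "lr is 12f89, so the call is at 12f84" is just clearing the Thumb bit and stepping back over the call instruction. A sketch with made-up function names, assuming the call was a 32-bit BL (4 bytes); a BLX through a register is only 2 bytes, so the result is a starting point for reading the listing rather than a guarantee.

```c
#include <stdint.h>

/* Clears the Thumb bit to get the address execution would return to. */
uint32_t return_address(uint32_t lr)
{
    return lr & ~1u;
}

/* Address of the calling instruction, assuming a 4-byte BL. */
uint32_t bl_call_site(uint32_t lr)
{
    return return_address(lr) - 4u;
}
```

For lr = 0x12f89 this gives a return address of 0x12f88 and a call site of 0x12f84.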
So, A has called B. The standard function prolog is
push {r7, lr}
A function does this so that it preserves the caller's r7, lr in order to use them itself. r7 (aka fp) is the frame pointer used to refer to the function's own local variables. lr needs saving if B wants to make further calls, i.e. B calls C. r7 will be pushed first, lr second.
As well as the prolog above, a function ends with an epilog that re-instates its caller' s r7 and lr and of course returning to it. This is
pop {r7, lr}
bx lr
or, the shorter equivalent
pop {r7, pc}
Now, let's imagine B looks something like
void B(void) {
    int X[2];
    X[2] = 0;   /* out of bounds: only X[0] and X[1] exist */
}
B's prolog saved A's lr one slot on the stack ABOVE X[1]. Of course only X[0] and X[1] are valid, but our code overran the array bounds and did X[2] = 0.
This has trashed the slot on the stack that will be popped into pc by the epilog. The code above will indeed set pc to 0, and would fault. I think that your compiled epilog would have been the "pop {r7, pc}" variant, since if it had been the "pop {r7, lr}; bx lr" variant then you would have had lr = pc = 0 at time of fault, and your lr was not 0.
A tell-tale sign of this kind of error is to examine r7 too. Your dump didn't include it, but if r7 and pc are related, that's a clue. If B had also done
X[3] = 1;
we'd see r7 = 1 at time of fault, since B's epilog would pop the 1 into r7 and the 0 into pc (or into lr which is then xferred to pc via bx lr).
Note here that this is not a stack overflow, you haven't run too far DOWN in memory. It's actually the opposite, you've written HIGHER in memory than your function's own local variable space.
I see that you also mention Bus Faults. If instead of
X[2] = 0
the code was
X[2] = BIG
then you have loaded BIG into PC and BIG may not be present in the address space, so the processor can't go fetch the instruction there, and if I recall, that is a Bus Fault.
I learned all of the above from Yiu's amazing Def Guide to M3/M4, 3rd ed, oh and of course by solving my own pc=0 situations!
Wow, what a lot of information tobermory
I didn't catch all of it and since I solved the issue in my device I'm currently not working on this topic. But that may help on similar issues and for experience on this kind of failure. Thank you!
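__attribute__((naked)) void FaultHandler(void) {
    __asm__(
        "TST   LR, #4          \n"   /* EXC_RETURN bit 2: 0 = MSP, 1 = PSP */
        "ITE   EQ              \n"
        "MRSEQ r1, MSP         \n"   /* r1 = stack pointer holding the frame */
        "MRSNE r1, PSP         \n"
        "MOV   r2, LR          \n"   /* r2 = EXC_RETURN */
        "MOV   r0, r7          \n"   /* r0 = r7 at time of fault */
        "B     FaultHandler_C  \n"
    );
}

/* 'used' keeps the compiler from discarding a function that is only
   referenced from the inline assembly above. */
__attribute__((used)) static void FaultHandler_C( uint32_t r7, uint32_t* stack, uint32_t excRet ) {
    ...
}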
You are welcome. I've included my own Fault Handler impl, that grabs r7 as I suggested, and also r14, which holds excReturn at time of fault.
And to follow up on the bit about the prolog and epilog, the prolog pushes TO the stack FROM r7, lr. The epilog expects to pop those SAME values FROM the stack TO r7, lr. The array out-of-bounds write shows how surprisingly easy it is to destroy that sequence of stack operations.