Hello everyone,
today I'm asking for hints on a tricky problem. We have firmware that uses the RTX kernel running on an NXP LPC2368. The device this firmware was written for is getting a new LC display, and my task is to change the firmware so that it can use the new display.
I have spent some weeks on this during the year, and from time to time I have had the problem that the controller resets shortly after start-up, again and again...
Every time this behaviour occurred, I deleted one or more obsolete variables (mostly global) or functions. In most cases I solved the problem by searching for further obsolete variables and deleting them from the source code - trial and error. That is really time-consuming.
While testing the firmware on Wednesday, I tried to make the adapted and modified routine for writing data to display RAM a little faster. I moved a global unsigned int into the function and changed it to static unsigned char, because the value it has to carry is 0x0D at most.
After flashing the firmware, the controller hung after a short, random time.
Yesterday I tried to solve this random hang and found that it happens when no task is running: the OS calls os_idle_demon() and is not able to return from it. I found a workaround on the web: creating an empty low-priority task that does not use any os_wait functions, which prevents the OS from calling the idle task. (It has something to do with incorrect interrupt states on returning from the idle task.)
Today I tried to make the display writing function faster still and changed two unsigned char variables inside the function from static to non-static. After flashing this firmware, the controller resets again and again. I will now try to find out why the controller behaves this way.
What I have found so far is that no watchdog is enabled by the user code (is it part of the OS?). Neither os_stk_overflow nor os_idle_demon is called by the OS. I debug the firmware using a ULINK2.
Any ideas where to look for the problem?
Best regards
It would be up to you to enable any watchdog.
The RTOS can't do it, because the RTOS would not know when to kick the watchdog. A program that makes use of a watchdog should make a lot of attempts to verify that the program is really, really behaving well before deciding to kick. The RTOS can only figure out if it is working ok - not if running threads are doing what they are expected to, or if sleeping threads are really expected to be sleeping.
It sounds like you have uninitialized variables, a stack overflow or a buffer overflow (memory overwrites) somewhere in the program. Adding or removing global variables, or changing what is placed on the stack, changes the behaviour you see because your code changes also move the location of lots of variables and change the total amount of stack space needed.
Have you started by making sure you compile your code at maximum warning level?
Have you tried to fill the stacks with a pattern and check how much of each stack is actually getting used?
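The stack-fill check mentioned above can be sketched in plain C so that it can be tried off-target. This is only an illustration: the names stack_paint and stack_high_water and the fill value 0xA5 are made up here, and the sketch assumes a descending stack (the task starts at the top of the array and grows towards index 0).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define STACK_FILL 0xA5u   /* hypothetical fill pattern */

/* Paint a stack area with the fill pattern before its task is started. */
void stack_paint(uint8_t *stk, size_t size)
{
    assert(stk != NULL);
    memset(stk, STACK_FILL, size);
}

/* Return how many bytes of the stack have been written at some point.
   Assumption: descending stack, so untouched bytes sit at the low end. */
size_t stack_high_water(const uint8_t *stk, size_t size)
{
    size_t untouched = 0;

    assert(stk != NULL);
    while (untouched < size && stk[untouched] == STACK_FILL) {
        untouched++;
    }
    return size - untouched;
}
```

If the reported usage is close to the full size, the stack is too small. A stacked value that happens to equal the fill byte at the deepest position makes the result slightly too low, so leave some margin.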
Today's work is coming to an end now, and here are the current results.
First I checked the compiler settings in the project and found, in the 'C/C++' tab, the option 'Warnings' set to 'All warnings'. That should meet my needs. The next step was to look at the compiler control string. Among other entries that define include directories, global macros, generation of listing files and so on, I found the optimization level set to 0 (-O0). That should also be fine for debugging purposes. Then I added --strict to the string and got hundreds and hundreds of warnings / errors because of the use of // to comment code out. Nice experiment - I removed --strict again.
The next step was to check the stack usage. After reading the µV4 help for a while, I found the --callgraph output generated by the linker. Opening the callgraph, I found the entry Maximum stack usage = 592 bytes + Unknown (Functions without stacksize, Cycles, Untraceable Function Pointers). Since a stack size of only 274 bytes is defined for each task, this is definitely one source of data corruption.
So tomorrow I will have a closer look at the functions that use a large amount of stack and try to optimize them. I will also try to implement user-defined stack sizes for each task.
Do you think I am on the right track? Any comments or hints?
I agree. (@Per - for many chips I think you could technically write a reset handler that properly re-initializes all your peripherals and the CPU, but that is certainly not common and, in my opinion, not good practice. As you point out, though, I may be wrong in that thinking.)
I once found a website with excellent examples of exception handlers (for debug purposes) for ARM7 and Cortex-M3. Does anyone have links to good examples?
I have seen one too, and later saw links to it on this forum.
A bit of Google skills should be able to pick up on it - was showing code to walk the exception stack.
Not applicable to ARM7 but for future reference here is a good one for Cortex M3/0
support.code-red-tech.com/.../DebugHardFault
Thank you very much again, Per and Marc, for all your answers and for pointing me in the right direction.
I am currently reading 'The Insider's Guide To The NXP LPC 2300/2400 Based Microcontrollers'. I have the feeling that I should have done this earlier, but I did not know that such a document existed.
I will report here if I can make the reset happen again and I will try to find the handler that leads to the reset.
Good morning everyone.
Things are a little clearer to me now. I am still reading the guide. If I understand all your latest posts correctly and look at the startup code of our firmware, I realize that in our firmware all protection exceptions (Undef, PAbt, DAbt) lead to a reset of the device.
That means that if any of these exceptions occurs, the firmware forces a reset - again and again, if the source of the exception is an error in our firmware's source code. And because the firmware resets on every exception, I am not able to find any such error...
Am I right so far?
And additionally: The included RealTime-Agent is not able to work like it should, because of this line:
DAbt_Addr DCD Reset_Handler ;DAbt_Handler
Right?
I changed back the handlers like Marc suggested, since I know now what this part of the startup file is doing.
I left the DAbt_Handler unchanged for using the RealTime-Agent.
A question regarding the DAbt_Handler: In the following code sequence, what DAbt_Handler would be jumped to in case of a data abort exception?
                IMPORT  SWI_Handler
                EXTERN  DAbt_Handler

Undef_Handler   B       Undef_Handler
;SWI_Handler    B       SWI_Handler
PAbt_Handler    B       PAbt_Handler
DAbt_Handler    B       DAbt_Handler
IRQ_Handler     B       IRQ_Handler
FIQ_Handler     B       FIQ_Handler

; Reset Handler
                EXPORT  Reset_Handler
Reset_Handler
Would a DAbt force a jump to the external handler or to the endless loop? I would guess the jump goes to the EXTERN DAbt_Handler, simply because that statement is located earlier in the code.
A second question: what exactly does EXPORT Reset_Handler mean?
I found out on my own that the default DAbt_Handler has to be commented out. This should always be done when an external label is imported.
www.keil.com/.../armasmref_Babcjehh.htm
IMPORT imports the symbol unconditionally. EXTERN imports the symbol only if it is referred to in the current assembly.
[EXTERN in assembly] is different from [extern in C].
My understanding is:
Assume that, for some reason, your firmware pushes or pops data on one of the stacks and this causes a Data Abort. The processor then performs the
LDR PC, DAbt_Addr
and since
DAbt_Addr DCD Reset_Handler
the processor runs the Reset_Handler once again and then carries on doing something else. If that "something else" does not cause another Data Abort, you will not notice anything of the earlier Data Abort. However, the system is already messed up.
A reset runs the Reset_Handler, but re-running the Reset_Handler is not a reset, for the reasons Per has explained.
Hello John,
thank you for the link to the Assembler Reference. I found the explanation for EXPORT there, but it is not fully clear to me why this directive is used for the Reset_Handler symbol in the startup file.
Reading through my thread again, I understand the following:
1. Any program exception (DAbt, PAbt, Undef, Reset) leads to a call of the reset handler in the firmware.
2. Calling the reset handler simply equals a jump to the start address of the firmware without setting any reset conditions in the device.
3. Because of this I have to implement a mechanism to force a real reset.
Someone tell me please if I'm right or wrong.
Your code does contain jumps to the reset address.
Most startup files do not. It's normal to either supply a real exception handler, or have just a busy-loop like:
PAbt_Handler B PAbt_Handler
For programs that have the watchdog enabled, the above busy-loop will hang the processor in the loop until the watchdog generates a real reset, one that does not just jump to the reset address but first performs a full reset of the processor. Full reset here means that all registers get their default values (except the boot reason bits, which will report that it was a watchdog reset) and all internal state machines are reset.
So a program should never make an intentional jump to the reset address. If the detected problem can't be solved by explicit code, then the program should let the watchdog force a reset.
Thank you for confirming my assumptions Per. It finally sunk in!
I feel a little bit stupid for taking so long until I realized what you have meant.
Now that I know what the problem is, and while I am working through the guide, I have found some interesting lines of code in the firmware.
While searching for enabled interrupt sources (to get a better overview of the firmware) I found an attempt to implement the watchdog in reset mode:
__irq void watchdog(void)
{
}

void Init_Watchdog(void)
{
    // RSIR|= 0x04;
    VICVectAddr1= (unsigned long)watchdog;   // set interrupt vector in 0
    VICIntEnable= VICIntEnable | 0x00000001;
    WDTC= 0x00000FFF;
    WDCLKSEL= 0x00000001,
    WDCLKSEL= 0x00000001;
    WDMOD|= 0x3;
    os_dly_wait(100);
    WDFEED= 0xAA;
    os_dly_wait(100);
    WDFEED= 0xEE;
}
Because of some errors and unnecessary lines of code (marked in red), I suppose the watchdog was never running. Furthermore, it is pointless to set up the watchdog as a vectored interrupt when it is configured in reset mode.
Now I will try to implement the watchdog, including the original endless loops called on program exceptions to get a real reset!
Any further suggestions?
Hi Robert,
Don't you still need to determine the cause of your exception/reset?
Before enabling any watchdog, I would recommend implementing proper exception handlers (even if they are simple while(1) loops) and observing your RSID value on reset.
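A small sketch of how the RSID value could be decoded at start-up. The bit positions below are how I read the LPC23xx user manual (POR = bit 0, EXTR = bit 1, WDTR = bit 2, BODR = bit 3); please check them against your manual. The enum, the function name decode_rsid and the priority order are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* RSID bit masks - assumed from the LPC23xx user manual, please verify. */
#define RSID_POR   (1u << 0)   /* power-on reset  */
#define RSID_EXTR  (1u << 1)   /* external reset  */
#define RSID_WDTR  (1u << 2)   /* watchdog reset  */
#define RSID_BODR  (1u << 3)   /* brown-out reset */

enum reset_cause {
    RESET_NONE,        /* no flag set: no hardware reset was recorded */
    RESET_POWER_ON,
    RESET_EXTERNAL,
    RESET_WATCHDOG,
    RESET_BROWN_OUT
};

/* Map a raw RSID value to a single cause. The order of the checks is a
   design choice of this sketch, not something the hardware prescribes. */
enum reset_cause decode_rsid(uint32_t rsid)
{
    if (rsid & RSID_WDTR) return RESET_WATCHDOG;
    if (rsid & RSID_BODR) return RESET_BROWN_OUT;
    if (rsid & RSID_EXTR) return RESET_EXTERNAL;
    if (rsid & RSID_POR)  return RESET_POWER_ON;
    return RESET_NONE;
}
```

If the program clears the flags after reading them, a later start-up that reports RESET_NONE would suggest that the code merely jumped to the reset address instead of going through a hardware reset.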
Also, you should note that implementing a watchdog in firmware based on OS tasks is not as straightforward as you might expect.
You will need a way for each task (or at least each relevant task) to set an intermediate watchdog flag before you actually feed the hardware watchdog. Otherwise the watchdog is meaningless, or you are only 'watching' a single task.
I highly recommend figuring out if you have any issues and implementing proper handlers before enabling a watchdog.
M
Hello Marc.
You are right, I still need to find the cause of the exceptions / resets. But at the moment I am not able to make the firmware misbehave! ;-) Whatever forced the exceptions / resets has apparently gone away for the time being.
And you are right once more when you tell me to implement proper exception handlers before enabling the watchdog. That's why I wrote 'Now I will try to implement the watchdog, including the original endless loops called on program exceptions to get a real reset!' in my latest post.
Could you explain in a little more detail why I need an intermediate watchdog flag when I use an RTOS? I plan to reload the WD in every active task, including my idle task. I thought this would be a good plan, because in case of an exception and a jump into an endless loop, no task is able to reload the WD. Are there mistakes in this plan, or is there something more that I should consider?
Best regards Robert
If you 'feed' (kick, reload...) the watchdog in all tasks, then you will not know if only one of your tasks gets stuck.
If you use the watchdog as you describe, then you are only using it to reset your device when an exception occurs. Generally, I think watchdogs serve a bigger purpose than simply a reset on an exception.
However, this would work as you describe.
If you have access to the external reset I would suggest this as a better mechanism for resetting your device on an exception and I would use the watchdog to ensure all tasks are properly executing.
Regards,
Marc
A normal way to implement the watchdog function is to have only the lowest-prioritized task kick the watchdog.
This proves that you have enough CPU capacity that you don't starve this low-prio task.
But to verify that all the other tasks work, you normally have them kick internal counters. The low-prio task checks that these counters all get updated - if a counter has stood still for too long, then all kicking of the physical watchdog is stopped.
In some situations, you may have to create a table where the low-prio task knows the max time allowed for the different high-prio tasks to update their individual counters.
Also, tasks may no longer perform infinite waits, but should specify a timeout, just so that they can update their counters even if, say, a serial listener never gets any data to process from an external serial port.
The next step up is to also perform dynamic checking of the contents of important configuration and data structures, and to stop kicking the watchdog if any errors are detected.
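The counter scheme described above could look roughly like this. It is a host-compilable sketch, not RTX code: task_alive, supervisor_check, the number of tasks and the limits in max_silent are all invented for illustration. The real low-prio task would kick the hardware watchdog only while supervisor_check() returns true.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_SUPERVISED 3u   /* hypothetical number of supervised tasks */

/* Each supervised task bumps its own counter when it makes progress. */
static volatile uint32_t alive_count[NUM_SUPERVISED];

/* Max number of supervisor periods each task may stay silent (made up). */
static const uint8_t max_silent[NUM_SUPERVISED] = { 2, 5, 10 };

static uint32_t last_seen[NUM_SUPERVISED];
static uint8_t  silent[NUM_SUPERVISED];
static bool     healthy = true;   /* latched: never returns to true */

/* Called by task 'id' each time it completes a unit of work. */
void task_alive(unsigned id)
{
    assert(id < NUM_SUPERVISED);
    alive_count[id]++;
}

/* Called once per period by the lowest-priority task.
   Returns true while the hardware watchdog may still be kicked. */
bool supervisor_check(void)
{
    for (unsigned i = 0; i < NUM_SUPERVISED; i++) {
        uint32_t now = alive_count[i];

        if (now != last_seen[i]) {
            last_seen[i] = now;       /* progress seen: restart the count */
            silent[i] = 0;
        } else if (silent[i] < UINT8_MAX) {
            silent[i]++;              /* one more period without progress */
        }
        if (silent[i] > max_silent[i]) {
            healthy = false;          /* stop kicking from now on */
        }
    }
    return healthy;
}
```

The failure is latched on purpose: once a task has been silent for too long, the watchdog is left to expire even if that task wakes up again later.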
Good morning Marc, good morning Per.
At first I want to thank you again for all the great input.
I understand what you both have explained to me and I decided to implement your recommendations stepwise.
=> The first step is to implement the watchdog to force a real reset on error exceptions. This is better behaviour than simply jumping to the start address - that's what I have learned. ;-)
This step is nearly finished and it should be enough for the moment, since this is much more than I was instructed to do with the firmware. With this step I should be able to better debug the firmware in case of an error exception.
=> The second step is to implement the following mechanism:
Active tasks do not wait indefinitely, and every task has its own timer.
The watchdog is kicked only by the lowest-priority task (currently my idle task) to ensure that enough CPU capacity is available. Furthermore, this task checks whether all other tasks are running by checking their timers. If a task gets stuck, the checking task can restart it. If restarting the deadlocked task is not enough (depending on the task's function), the checking task could reset the whole device via external reset (I'm sure I can get access to it) after error information has been saved to memory (task name, register states, ...).
If a program error exception is detected, the device would be restarted via external reset once the error conditions have been saved to memory by the relevant handlers.
Is the second step consistent to that what you both have suggested?
I have another question regarding the user defined stack size for every task.
The RTOS is initialized by os_sys_init(task1). I want to have task1 its own user defined stack size using os_sys_init_user(task1, ...), but I do not want to reserve memory for its stack permanently.
How can I provide memory dynamically for task1 that is finished by os_tsk_delete_self()?
Looking forward to your answers.
Robert,
Don't use dynamic memory if you can avoid it. It will lead to memory fragmentation - unless you allocate a big chunk at program startup and distribute it at will (memory manager...). This can indeed reduce your binary size significantly if you have enough RAM.
"... the checking task can restart it."
In my opinion this is a bad design. You should test to make sure no hang-ups can occur. And how will you "release" a task? What if it's sitting in a
for (;;) ;
? It is better to reduce resource consumption and defer to the common, trusted and working solution of maintaining a bitmap in which each bit indicates "task n is alive". If this regularly tested bitmap is not set entirely to 1 (or 0), one task, or preferably the hardware abstraction layer, will reset the device. This way the watchdog is serviced from one place only, reducing the chance of faulty code servicing it _even_ in the event of "task failure".
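A minimal sketch of such an "alive" bitmap, written so that it compiles on a host. The names task_report_alive and all_tasks_alive and the task count are hypothetical. On the target, the read-modify-write on the bitmap has to be protected against preemption, which this sketch does not do.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_TASKS 4u                            /* hypothetical task count */
#define ALL_ALIVE ((1u << NUM_TASKS) - 1u)      /* one bit per task        */

static volatile uint32_t alive_bits;

/* Called by task 'id' from its main loop to report "I am alive".
   Note: not atomic; on the target this needs protection. */
void task_report_alive(unsigned id)
{
    assert(id < NUM_TASKS);
    alive_bits |= (1u << id);
}

/* Called from the single place that services the watchdog.
   Returns true, and clears the bitmap for the next round, only if
   every task has reported since the last successful check. */
bool all_tasks_alive(void)
{
    if (alive_bits != ALL_ALIVE) {
        return false;
    }
    alive_bits = 0;
    return true;
}
```

The watchdog would then be serviced only when all_tasks_alive() returns true, so a single stuck task is enough to let it expire.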
Of course, deleting a task and re-spawning it is always possible in an attempt to introduce a "self correcting" system. Note, however, that this could lead to inter-task synchronization problems.
Hello Tamir.
Thank you for your explanations. I will implement the stack for task1 as permanent memory reservation.
Additionally, I will think over my plans for step 2 when the time comes to implement it.
I have an urgent question regarding the watchdog. My device resets after a macro for kicking the watchdog is called.
I implemented the watchdog as follows:
// macro for kicking watchdog (in project.h):
#define WDT_KICK { WDFEED= 0xAA; WDFEED= 0x55; }

// watchdog initialization (in project.c):
__task void task1(void)
{
    // ...
    WDCLKSEL= 0x00000001;
    WDTC= 0x00000FFF;
    WDMOD= 0x03;
    WDT_KICK    // <= execution jumps to DAbt_Handler in startup file?
    // ...
}

int main(void)
{
    // ...
    os_sys_init(task1);
}

// startup file:
; Part 1: Physical vector table with Load Register Instructions (LDR) on each vector,
; loads values in constants table (32bit-wide memory locations) to program counter (PC) forcing a jump to memory location
CDCVectors      LDR     PC, Reset_Addr
                LDR     PC, Undef_Addr
                LDR     PC, SWI_Addr
                LDR     PC, PAbt_Addr
                LDR     PC, DAbt_Addr
                NOP                            ; Reserved Vector
;               LDR     PC, IRQ_Addr
                LDR     PC, [PC, #-0x0120]     ; Vector from VicVectAddr
                LDR     PC, FIQ_Addr

; Part 2: Constants table with addresses of jump targets
Reset_Addr      DCD     Reset_Handler
Undef_Addr      DCD     Undef_Handler          ; RoS| 15.12.11: For watchdog (to get real reset) use Undef_Handler instead of Reset_Handler! Was: Reset if OP code isn't ARM or THUMB
SWI_Addr        DCD     SWI_Handler
PAbt_Addr       DCD     PAbt_Handler           ; RoS| 15.12.11: For watchdog (to get real reset) use PAbt_Handler instead of Reset_Handler! Was: Reset on prefetch abort exception
DAbt_Addr       DCD     DAbt_Handler           ; RoS| 15.12.11: For watchdog (to get real reset) use DAbt_Handler instead of Reset_Handler! Was: Reset on data abort exception
                DCD     0                      ; Reserved Address
IRQ_Addr        DCD     IRQ_Handler
FIQ_Addr        DCD     FIQ_Handler            ; RoS| 13.12.11: Jump target for the one and only fast interrupt; Actually: endless loop - means "not defined"

                IMPORT  SWI_Handler            ; RoS| imported SWI label
                EXTERN  DAbt_Handler           ; RoS| 29.11.11: for RTA usage (see http://www.keil.com/support/man/docs/ulink2/ulink2_ra_modifying_startup.htm) - external DAbt_Handler

; Part 3: Labels to which program jumps to if exception occurs,
; and then endless jumps to label
Undef_Handler   B       Undef_Handler
;SWI_Handler    B       SWI_Handler            ; RoS| 13.12.11: Another SWI_Handler is imported
PAbt_Handler    B       PAbt_Handler
;DAbt_Handler   B       DAbt_Handler           ; RoS| 29.11.11: for RTA usage (see http://www.keil.com/support/man/docs/ulink2/ulink2_ra_modifying_startup.htm) - endless loop obsolete
IRQ_Handler     B       IRQ_Handler
FIQ_Handler     B       FIQ_Handler

; Reset Handler
                EXPORT  Reset_Handler
Reset_Handler
What may be going on there? DAbt_Handler is part of the Real-Time Agent. How can I use this to find out what the error is?
Interrupts _must_ be disabled while servicing the watchdog !
Use
__disable_irq() ;
Or, even better - make the servicing function a SWI function.
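To illustrate the idea, here is a host-side simulation of a feed sequence that is protected against interrupts. Nothing in it touches real hardware: wdfeed_write, irq_save_disable and irq_restore are made-up stand-ins for the WDFEED register and for whatever interrupt control your toolchain offers. Only the shape of wdt_kick() is the point.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simulated target state, so the sketch can run on a host. */
uint8_t  feed_log[4];         /* bytes written to the simulated WDFEED */
unsigned feed_count;
bool     irq_enabled = true;  /* simulated IRQ enable state            */

/* Stand-in for a write to WDFEED. */
static void wdfeed_write(uint8_t value)
{
    assert(!irq_enabled);     /* the feed bytes must not be interruptible */
    if (feed_count < sizeof feed_log) {
        feed_log[feed_count++] = value;
    }
}

/* Stand-ins for saving/disabling and restoring the IRQ state. */
static bool irq_save_disable(void)
{
    bool was_enabled = irq_enabled;
    irq_enabled = false;
    return was_enabled;
}

static void irq_restore(bool was_enabled)
{
    irq_enabled = was_enabled;
}

/* Feed sequence 0xAA, 0x55 with interrupts held off in between. */
void wdt_kick(void)
{
    bool was_enabled = irq_save_disable();

    wdfeed_write(0xAA);
    wdfeed_write(0x55);
    irq_restore(was_enabled);
}
```

Restoring the previous state, rather than enabling unconditionally, keeps the function safe to call from code that already runs with interrupts disabled.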
*AAAARRRGGGHHH*
I read about disabling interrupts... LPC23xx user manual:
"Interrupts should be disabled during the feed sequence. An abort condition will occur if an interrupt happens during the feed sequence."
Why don't they mark such important sentences with an exclamation mark in their user manuals?
Thank you for helping out immediately Tamir!
PS: I will re-read the insiders guide to check if I am able to implement the feeding of the predator via SWI function.
I implemented kicking the watchdog as a SWI function, first called in task1:
// project.h:
extern void __swi(12) WDT_KICK(void);   // RoS| 15.12.2011: feeds the watchdog

// watchdog kicking and initialization (in project.c):
void __SWI_12 (void)
{
    WDFEED= 0xAA;
    WDFEED= 0x55;
}

__task void task1(void)
{
    // ...
    WDCLKSEL= 0x00000001;
    WDTC= 0x00000FFF;
    WDMOD= 0x03;
    WDT_KICK();    // <= execution STILL jumps to DAbt_Handler in startup file?
    // ...
}

int main(void)
{
    // ...
    os_sys_init_user(task1, 1, &id1_stk, sizeof(id1_stk));
}

// SWI_Table.s:
// ...
; Import user SWI functions here.
                IMPORT  __SWI_8
                IMPORT  __SWI_9
                IMPORT  __SWI_10
                IMPORT  __SWI_11
                IMPORT  __SWI_12

                EXPORT  SWI_Table
SWI_Table
                DCD     __SWI_0     ; SWI 0 used by RTL
                DCD     __SWI_1     ; SWI 1 used by RTL
                DCD     __SWI_2     ; SWI 2 used by RTL
                DCD     __SWI_3     ; SWI 3 used by RTL
                DCD     __SWI_4     ; SWI 4 used by RTL
                DCD     __SWI_5     ; SWI 5 used by RTL
                DCD     __SWI_6     ; SWI 6 used by RTL
                DCD     __SWI_7     ; SWI 7 used by RTL

; Insert user SWI functions here. SWI 0..7 are used by RTL Kernel.
                DCD     __SWI_8     ; SWI 8 User Function
                DCD     __SWI_9     ; SWI 9 User Function
                DCD     __SWI_10    ; SWI 10 User Function
                DCD     __SWI_11    ; SWI 11
                DCD     __SWI_12    ; SWI 12 User function - Kick Watchdog with interrupts disabled (except FIQ)
The device resets again and again... It still jumps to DAbt_Handler when executing WDFEED= 0xAA;... Since there is no FIQ used, I have no idea why this is not working! :(
Could this have something to do with the RealTime Agent?