Hello everyone,
today I'm asking for hints on a tricky problem. We have a firmware that uses the RTX kernel running on an NXP LPC2368. The device the firmware is written for is getting a new LC display, and my job is to change the firmware to use the new display.
I've spent some weeks on this during the year, and from time to time I've had the problem that the controller resets shortly after start, again and again...
Every time this behaviour occurred, I deleted one or more obsolete variables (mostly global) or functions. In most cases I solved the problem by searching for other obsolete variables and deleting them from the source code: trial and error. That is really time-consuming.
While testing the firmware on Wednesday, I tried to make the adapted and modified routine for writing data to display RAM a little faster. I moved a global unsigned int into the function and changed it to static unsigned char, because the value it has to carry is 0x0D at most.
After flashing the firmware into the controller, the controller hung after a short, random time.
Yesterday I was trying to solve the problem of the firmware hanging at random times and found that it happens when no task is running: the OS calls os_idle_demon() and is not able to return from it. I found a workaround on the web: creating an empty low-priority task that does not use any os_wait functions, which prevents the OS from calling the idle task. (It has something to do with incorrect interrupt states on returning from the idle task.)
Today I tried further to make the display writing function faster and changed two unsigned char variables inside the function from static to non-static. After flashing this firmware the controller resets again and again. I will now try to find out why the controller behaves this way.
What I have found out so far is that no watchdog is enabled by the user code (is it part of the OS?). os_stk_overflow and os_idle_demon are not called by the OS. I debug the firmware using ULINK2.
Any ideas where to look for the problem?
Best regards
Hello John,
thank you for the link to the Assembler Reference. I found the explanation for EXPORT there, but it is not fully clear to me why this directive is used for the Reset_Handler symbol in the startup file.
Reading my thread here again, I understand the following:
1. Any program exception (DAbt, PAbt, Undef, Reset) leads to a call of the reset handler in the firmware.
2. Calling the reset handler simply equals a jump to the start address of the firmware without setting any reset conditions in the device.
3. Because of this I have to implement a mechanism to force a real reset.
Someone tell me please if I'm right or wrong.
Your code does contain jumps to the reset address.
Most startup files do not. It's normal to either supply a real exception handler, or have just a busy-loop like:
PAbt_Handler B PAbt_Handler
For programs that have the watchdog enabled, the above busy-loop will hang the processor in the loop until the watchdog generates a real reset, which does not just jump to the reset address but first performs a full reset of the processor. Full reset here means that all registers get their default values (except the boot reason bits, which will show that it was a watchdog reset) and all internal state machines get reset.
So a program should never make an intentional jump to the reset address. If the detected problem can't be solved by explicit code, then the program should let the watchdog force a reset.
Thank you for confirming my assumptions Per. It finally sunk in!
I feel a little bit stupid for taking so long to realize what you meant.
Now that I know what the problem is and am working my way through the guide, I have found some interesting lines of code in the firmware.
While searching for enabled interrupt sources (to get a better overview of the firmware) I found an attempt to implement the watchdog in reset mode:
__irq void watchdog(void)
{
}

void Init_Watchdog(void)
{
    // RSIR|= 0x04;
    VICVectAddr1= (unsigned long)watchdog; // set interrupt vector in 0
    VICIntEnable= VICIntEnable | 0x00000001;
    WDTC= 0x00000FFF;
    WDCLKSEL= 0x00000001, WDCLKSEL= 0x00000001;
    WDMOD|= 0x3;
    os_dly_wait(100);
    WDFEED= 0xAA;
    os_dly_wait(100);
    WDFEED= 0xEE;
}
Because of some errors and unnecessary lines of code (marked red), the watchdog was never running, I suppose. Furthermore, there is no need to set up the watchdog as a vectored interrupt when it is used in reset mode.
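For comparison, here is a minimal sketch of a reset-mode initialization without the VIC setup and with the two feed bytes written back to back. It is only an illustration under assumptions: `wdt_regs_t`, `wdt_feed` and `wdt_init_reset_mode` are hypothetical names, and the struct stands in for the WDMOD, WDTC, WDFEED and WDCLKSEL registers so that the code can be compiled off-target.

```c
#include <stdint.h>

/* Hypothetical register view (assumption, not the device header).
   On the LPC2368 these fields correspond to WDMOD, WDTC, WDFEED, WDCLKSEL. */
typedef struct {
    volatile uint32_t mod;     /* WDMOD: bit 0 = WDEN, bit 1 = WDRESET */
    volatile uint32_t tc;      /* WDTC: timer reload value             */
    volatile uint32_t feed;    /* WDFEED: feed sequence register       */
    volatile uint32_t clksel;  /* WDCLKSEL: clock source select        */
} wdt_regs_t;

enum { WDT_WDEN = 1u << 0, WDT_WDRESET = 1u << 1 };

/* Feed sequence: 0xAA followed directly by 0x55, with nothing in between.
   On the target, interrupts should be disabled around these two writes. */
static void wdt_feed(wdt_regs_t *w)
{
    w->feed = 0xAA;
    w->feed = 0x55;
}

/* Select the clock, set the reload value, enable reset mode,
   then feed once so that the watchdog actually starts counting. */
static void wdt_init_reset_mode(wdt_regs_t *w, uint32_t clksel, uint32_t reload)
{
    w->clksel = clksel;
    w->tc     = reload;
    w->mod    = WDT_WDEN | WDT_WDRESET;
    wdt_feed(w);
}
```

With the values from the post this would be called as `wdt_init_reset_mode(&regs, 0x1, 0x0FFF)`. No interrupt vector is needed because the watchdog resets the chip instead of raising an interrupt.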
Now I will try to implement the watchdog, including the original endless loops called on program exceptions to get a real reset!
Any further suggestions?
Hi Robert,
Don't you still need to determine the cause of your exception/reset?
Before enabling any watchdog I would recommend implementing proper exception handlers (even if they are simple while(1) loops) and observing your RSIR (reset source) value on reset.
Also, you should note that implementing a watchdog in an OS task based firmware is not as straightforward as you might expect.
You will need a way for each task (or the relevant tasks, anyway) to set an intermediate watchdog flag before you actually feed the hardware watchdog. (Otherwise it is meaningless, or you are only 'watching' a single task.)
I highly recommend figuring out if you have any issues and implementing proper handlers before enabling a watchdog.
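As an illustration of observing the reset source, the sketch below decodes the LPC23xx RSIR bits (POR, EXTR, WDTR, BODR) into a readable cause. The function name `reset_cause` and the priority order of the checks are assumptions made for this example.

```c
#include <stdint.h>

/* RSIR bit positions on the LPC23xx. */
enum {
    RSIR_POR  = 1u << 0,  /* power-on reset */
    RSIR_EXTR = 1u << 1,  /* external reset */
    RSIR_WDTR = 1u << 2,  /* watchdog reset */
    RSIR_BODR = 1u << 3   /* brown-out reset */
};

/* Hypothetical helper: map a raw RSIR value to a short description.
   The check order is an assumption: power-on is reported first. */
static const char *reset_cause(uint32_t rsir)
{
    if (rsir & RSIR_POR)  return "power-on";
    if (rsir & RSIR_BODR) return "brown-out";
    if (rsir & RSIR_WDTR) return "watchdog";
    if (rsir & RSIR_EXTR) return "external";
    /* No flag set: no hardware reset was recorded, e.g. the code
       merely jumped to the reset address. */
    return "none recorded";
}
```

If the flags are cleared after each boot, a result of "none recorded" on the next start would point to a plain jump to the reset address rather than a real reset.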
M
Hello Marc.
You are right, I still need to find the cause of the exceptions / resets. But at the moment I am not able to make the firmware misbehave like that! ;-) Whatever forced the exceptions / resets has apparently gone away for now.
And you are right once more, if you tell me to implement proper exception handlers before enabling watchdog. That's why I wrote 'Now I will try to implement the watchdog, including the original endless loops called on program exceptions to get a real reset!' in my latest post.
Could you explain in a little more detail why I need an intermediate watchdog flag when I use an RTOS? I plan to reload the WD in every active task, including my idle task. I thought this would be a good plan, because in case of an exception and a jump into an endless loop no task is able to reload the WD. Are there mistakes in this plan, or is there something more that I should consider?
Best regards Robert
If you 'feed' (kick, reload...) the watchdog in all tasks, then you will not know if only one of your tasks gets stuck.
If you use the watchdog as you describe, then you are only using it to reset your device when an exception occurs. Generally I think watchdogs provide a bigger function than simply a reset on an exception.
However, this would work as you describe.
If you have access to the external reset I would suggest this as a better mechanism for resetting your device on an exception and I would use the watchdog to ensure all tasks are properly executing.
Regards,
Marc
A normal way to implement the watchdog function is to have only the lowest-prioritized task kick the watchdog.
This proves that you have enough CPU capacity that you don't starve this low-prio task.
But to verify that all the other tasks work, you normally have them kick internal counters. The low-prio task checks that these counters all get updated. If a counter has stood still for too long, then all kicking of the physical watchdog is stopped.
In some situations, you may have to create a table where the low-prio task knows the max time allowed for the different high-prio tasks to update their individual counters.
Also, tasks may no longer perform infinite waits, but should specify a timeout. Just so that they can update their counters even if a serial listener never gets any data to process from an external serial port.
The next step up is to also perform dynamic checking of the contents of important configuration and data structures, and to stop kicking the watchdog if any errors are detected.
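The counter scheme described above could be sketched as follows. This is only an illustration: `task_watch_t`, `task_checkin`, `supervisor_tick` and `NUM_TASKS` are hypothetical names, the per-task limits are counted in supervisor passes rather than real time, and the failure is latched so that kicking never resumes once a task has been silent for too long.

```c
#include <stdint.h>
#include <stdbool.h>

#define NUM_TASKS 3u  /* hypothetical number of supervised tasks */

typedef struct {
    volatile uint32_t counter;  /* incremented by the supervised task      */
    uint32_t last_seen;         /* counter value at the last observed move */
    uint32_t stale_ticks;       /* supervisor passes without a change      */
    uint32_t max_stale_ticks;   /* allowed silence for this task           */
} task_watch_t;

static task_watch_t watch[NUM_TASKS];
static bool tripped;            /* latched once any task is overdue */

/* Called by task `id` each time it completes a unit of work. */
static void task_checkin(unsigned id)
{
    watch[id].counter++;
}

/* Called periodically by the lowest-priority task.
   Returns true if the physical watchdog may still be kicked. */
static bool supervisor_tick(void)
{
    for (unsigned i = 0; i < NUM_TASKS; i++) {
        uint32_t now = watch[i].counter;
        if (now != watch[i].last_seen) {
            watch[i].last_seen = now;
            watch[i].stale_ticks = 0;
        } else {
            if (watch[i].stale_ticks < UINT32_MAX) {
                watch[i].stale_ticks++;   /* saturate instead of wrapping */
            }
            if (watch[i].stale_ticks > watch[i].max_stale_ticks) {
                tripped = true;
            }
        }
    }
    return !tripped;
}
```

The low-prio task would call `supervisor_tick()` on each pass and feed the hardware watchdog only while it returns true; the `max_stale_ticks` fields play the role of the table of allowed times per task.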
Good morning Marc, good morning Per.
At first I want to thank you again for all the great input.
I understand what you both have explained to me and I decided to implement your recommendations stepwise.
=> The first step is to implement the watchdog to force a real reset on error exceptions. This is better behaviour than simply jumping to the start address - that's what I have learned. ;-)
This step is nearly finished and it should be enough for the moment, since this is much more than I was instructed to do with the firmware. With this step I should be able to better debug the firmware in case of an error exception.
=> The second step is to implement the following mechanism:
Active tasks do not wait indefinitely, and every task has its own timer.
The watchdog is kicked by the lowest-priority task only (currently my idle task) to ensure that enough CPU capacity is available. Furthermore, this task checks whether all other tasks are running by checking their timers. If a task gets stuck, the checking task can restart it. If restarting the deadlocked task is not enough (depending on the task's function), the checking task could reset the whole device via external reset (I'm sure I can get access to it) after the error information has been saved to memory (task name, register states, ...).
If a program error exception is detected, the device would be restarted via external reset once the error conditions have been saved to memory by the respective handlers.
Is the second step consistent with what you both have suggested?
I have another question regarding the user defined stack size for every task.
The RTOS is initialized by os_sys_init(task1). I want to give task1 its own user-defined stack size using os_sys_init_user(task1, ...), but I do not want to reserve memory for its stack permanently.
How can I provide memory dynamically for task1 that is finished by os_tsk_delete_self()?
Looking forward to your answers.
Robert,
Don't use dynamic memory if you can avoid it. It will lead to memory fragmentation - unless you allocate a big chunk at program startup and distribute it at will (memory manager...). This can indeed reduce your binary size significantly if you have enough RAM.
"the checking task can restart it."
In my opinion this is a bad design. You should test to make sure no hangups can occur. And how will you "release" a task? What if it's sitting in a
for (;;) ;
? Better to reduce resource consumption and defer to the common, trusted and working solution of maintaining a bitmap, each bit indicating "task n is alive". If this regularly tested bitmap is not set entirely to 1 (or 0), one task, or preferably the hardware abstraction layer, will reset the device. This way, the watchdog is serviced from one place only, reducing the chance of faulty code servicing it _even_ in the event of "task failure".
Of course, deleting a task and re-spawning it is always possible in an attempt to introduce a "self correcting" system. Note, however, that this could lead to inter-task synchronization problems.
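A minimal sketch of the "task n is alive" bitmap could look like this. The names `alive_bits`, `task_report_alive`, `check_and_clear_alive` and `TASK_COUNT` are hypothetical, and the sketch ignores that on a preemptive kernel the read-modify-write of the bitmap would have to be protected against task switches.

```c
#include <stdint.h>
#include <stdbool.h>

#define TASK_COUNT 4u                          /* hypothetical task count */
#define ALL_ALIVE  ((1u << TASK_COUNT) - 1u)   /* one bit per task        */

static volatile uint32_t alive_bits;

/* Called by task n once per cycle to report that it is still running.
   Assumption: the caller makes this update atomic on the real target. */
static void task_report_alive(unsigned n)
{
    alive_bits |= (1u << n);
}

/* Called from the single place that services the watchdog.
   Returns true only if every task has reported since the last check,
   then clears the bitmap for the next interval. */
static bool check_and_clear_alive(void)
{
    bool ok = (alive_bits & ALL_ALIVE) == ALL_ALIVE;
    alive_bits = 0;
    return ok;
}
```

The servicing code would feed the watchdog only when `check_and_clear_alive()` returns true, so a single silent task is enough to let the watchdog expire.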
Hello Tamir.
Thank you for your explanations. I will implement the stack for task1 as permanent memory reservation.
Additionally, I will think over my plans for step 2 when the time comes to implement it.
I have an urgent question regarding the watchdog. My device resets after a macro for kicking the watchdog is called.
I implemented the watchdog as follows:
// macro for kicking watchdog (in project.h):
#define WDT_KICK { WDFEED= 0xAA; WDFEED= 0x55; }

// watchdog initialization (in project.c):
__task void task1(void)
{
    // ...
    WDCLKSEL= 0x00000001;
    WDTC= 0x00000FFF;
    WDMOD= 0x03;
    WDT_KICK // <= execution jumps to DAbt_Handler in startup file?
    // ...
}

int main(void)
{
    // ...
    os_sys_init(task1);
}

// startup file:
; Part 1: Physical vector table with Load Register Instructions (LDR) on each vector,
; loads values in constants table (32bit-wide memory locations) to program counter (PC) forcing a jump to memory location
CDCVectors      LDR     PC, Reset_Addr
                LDR     PC, Undef_Addr
                LDR     PC, SWI_Addr
                LDR     PC, PAbt_Addr
                LDR     PC, DAbt_Addr
                NOP                            ; Reserved Vector
;               LDR     PC, IRQ_Addr
                LDR     PC, [PC, #-0x0120]     ; Vector from VicVectAddr
                LDR     PC, FIQ_Addr

; Part 2: Constants table with addresses of jump targets
Reset_Addr      DCD     Reset_Handler
Undef_Addr      DCD     Undef_Handler          ; RoS| 15.12.11: For watchdog (to get real reset) use Undef_Handler instead of Reset_Handler! Was: Reset if OP code isn't ARM or THUMB
SWI_Addr        DCD     SWI_Handler
PAbt_Addr       DCD     PAbt_Handler           ; RoS| 15.12.11: For watchdog (to get real reset) use PAbt_Handler instead of Reset_Handler! Was: Reset on prefetch abort exception
DAbt_Addr       DCD     DAbt_Handler           ; RoS| 15.12.11: For watchdog (to get real reset) use DAbt_Handler instead of Reset_Handler! Was: Reset on data abort exception
                DCD     0                      ; Reserved Address
IRQ_Addr        DCD     IRQ_Handler
FIQ_Addr        DCD     FIQ_Handler            ; RoS| 13.12.11: Jump target for the one and only fast interrupt; Actually: endless loop - means "not defined"

                IMPORT  SWI_Handler            ; RoS| imported SWI label
                EXTERN  DAbt_Handler           ; RoS| 29.11.11: for RTA usage (see http://www.keil.com/support/man/docs/ulink2/ulink2_ra_modifying_startup.htm) - external DAbt_Handler

; Part 3: Labels to which program jumps to if exception occurs,
; and then endless jumps to label
Undef_Handler   B       Undef_Handler
;SWI_Handler    B       SWI_Handler            ; RoS| 13.12.11: Another SWI_Handler is imported
PAbt_Handler    B       PAbt_Handler
;DAbt_Handler   B       DAbt_Handler           ; RoS| 29.11.11: for RTA usage (see http://www.keil.com/support/man/docs/ulink2/ulink2_ra_modifying_startup.htm) - endless loop obsolete
IRQ_Handler     B       IRQ_Handler
FIQ_Handler     B       FIQ_Handler

; Reset Handler
                EXPORT  Reset_Handler
Reset_Handler
What may be going on there? DAbt_Handler is part of the Real-Time Agent. How can I use this to find out what the error is?
Interrupts _must_ be disabled while servicing the watchdog !
Use
__disable_irq() ;
Or, even better - make the servicing function a SWI function.
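The required ordering can be sketched as below: mask interrupts, write 0xAA and 0x55 back to back, then unmask. The `wdt_port_t` hooks and the `mock_*` functions are hypothetical and exist only so that the call order can be checked off-target; on the device the hooks would presumably map to `__disable_irq()` / `__enable_irq()` and a write to WDFEED, or the whole sequence would run inside an SWI function.

```c
#include <stdint.h>

/* Hypothetical hardware hooks (assumption, not a Keil or NXP API). */
typedef struct {
    void (*irq_disable)(void);          /* mask interrupts         */
    void (*irq_restore)(void);          /* unmask interrupts again */
    void (*write_feed)(uint8_t value);  /* write one byte to WDFEED */
} wdt_port_t;

/* Feed the watchdog with nothing able to interrupt the two writes. */
static void wdt_feed_atomic(const wdt_port_t *port)
{
    port->irq_disable();
    port->write_feed(0xAA);
    port->write_feed(0x55);
    port->irq_restore();
}

/* Mock port that records the order of operations in a small log:
   'D' = interrupts disabled, 'a' = 0xAA, '5' = 0x55, 'E' = enabled. */
static char log_buf[8];
static unsigned log_len;

static void log_event(char c)
{
    if (log_len < sizeof log_buf - 1) {
        log_buf[log_len++] = c;
    }
}

static void mock_irq_disable(void) { log_event('D'); }
static void mock_irq_restore(void) { log_event('E'); }
static void mock_write_feed(uint8_t v)
{
    log_event(v == 0xAA ? 'a' : (v == 0x55 ? '5' : '?'));
}

static const wdt_port_t mock_port = {
    mock_irq_disable, mock_irq_restore, mock_write_feed
};
```

Calling `wdt_feed_atomic(&mock_port)` leaves the sequence disable, 0xAA, 0x55, restore in `log_buf`, which is the order the feed sequence needs.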
*AAAARRRGGGHHH*
I read about disabling interrupts... LPC23xx user manual:
"Interrupts should be disabled during the feed sequence. An abort condition will occur if an interrupt happens during the feed sequence."
Why don't they mark such important sentences with an exclamation mark in their user manuals?
Thank you for helping out immediately Tamir!
PS: I will re-read the insiders guide to check if I am able to implement the feeding of the predator via SWI function.
I implemented kicking the watchdog as an SWI function, first called in task1:
// project.h:
extern void __swi(12) WDT_KICK(void); // RoS| 15.12.2011: feeds the watchdog

// watchdog kicking and initialization (in project.c):
void __SWI_12 (void)
{
    WDFEED= 0xAA;
    WDFEED= 0x55;
}

__task void task1(void)
{
    // ...
    WDCLKSEL= 0x00000001;
    WDTC= 0x00000FFF;
    WDMOD= 0x03;
    WDT_KICK(); // <= execution STILL jumps to DAbt_Handler in startup file?
    // ...
}

int main(void)
{
    // ...
    os_sys_init_user(task1, 1, &id1_stk, sizeof(id1_stk));
}

// SWI_Table.s:
// ...
; Import user SWI functions here.
                IMPORT  __SWI_8
                IMPORT  __SWI_9
                IMPORT  __SWI_10
                IMPORT  __SWI_11
                IMPORT  __SWI_12

                EXPORT  SWI_Table
SWI_Table
                DCD     __SWI_0                ; SWI 0 used by RTL
                DCD     __SWI_1                ; SWI 1 used by RTL
                DCD     __SWI_2                ; SWI 2 used by RTL
                DCD     __SWI_3                ; SWI 3 used by RTL
                DCD     __SWI_4                ; SWI 4 used by RTL
                DCD     __SWI_5                ; SWI 5 used by RTL
                DCD     __SWI_6                ; SWI 6 used by RTL
                DCD     __SWI_7                ; SWI 7 used by RTL

; Insert user SWI functions here. SWI 0..7 are used by RTL Kernel.
                DCD     __SWI_8                ; SWI 8 User Function
                DCD     __SWI_9                ; SWI 9 User Function
                DCD     __SWI_10               ; SWI 10 User Function
                DCD     __SWI_11               ; SWI 11
                DCD     __SWI_12               ; SWI 12 User function - Kick Watchdog with interrupts disabled (except FIQ)
The device resets again and again... It still jumps to DAbt_Handler when executing WDFEED= 0xAA;... Since there is no FIQ used, I have no idea why this is not working! :(
Could this have something to do with the RealTime Agent?