Hello everyone,
today I'm asking for hints on a tricky problem. We have firmware that uses the RTX kernel running on an NXP LPC2368. The device the firmware is written for is to get a new LC display, and my job is to change the firmware so that it uses the new display.
I've spent some weeks this year doing so, and at times I've had the problem that the controller resets shortly after start, again and again...
Every time this behaviour occurred, I deleted one or more obsolete variables (mostly global) or functions. In most cases I solved the problem by searching for other obsolete variables and deleting them from the source code - trial and error. That is really time-consuming.
While testing the firmware on Wednesday, I tried to make the adapted and modified routine for writing data to display RAM a little faster. I moved a global unsigned int into the function and changed it to static unsigned char, because the value it has to carry is 0x0D at most.
After flashing the firmware into the controller, the controller hung after a short, random time.
Yesterday I tried to track down these random hangs and found that the problem occurs when no task is running: the OS calls os_idle_demon() and is not able to return from it. I found a workaround on the web: creating an empty low-priority task that does not use any os_wait functions, which prevents the OS from calling the idle task. (It has something to do with incorrect interrupt states on returning from the idle task.)
Today I further tried to make the display writing function faster and changed two unsigned char inside the function from static to non-static. After flashing this firmware the controller resets again and again. I will now try to find out why the controller behaves this way.
What I found out is that no watchdog is enabled by the user code (is it part of the OS?). os_stk_overflow and os_idle_demon are not called by the OS. I debug the firmware using a ULINK2.
Any ideas where to look for the problem?
Best regards
It would be up to you to enable any watchdog.
The RTOS can't do it, because the RTOS would not know when to kick the watchdog. A program that makes use of a watchdog should make a lot of attempts to verify that the program is really, really behaving well before deciding to kick. The RTOS can only figure out if it is working ok - not if running threads are doing what they are expected to, or if sleeping threads are really expected to be sleeping.
It sounds like you have uninitialized variables, a stack overflow or a buffer overflow (memory overwrites) somewhere in the program. Adding or removing global variables, or changing the contents of the stack, changes the behaviour you see because your code changes also move the location of lots of variables and change the total amount of stack space needed.
Have you started by making sure you compile your code at maximum warning level?
Have you tried filling the stacks with a known pattern and checking how much of each stack actually gets used?
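As a rough illustration of that stack check, here is a minimal sketch in C. It assumes each task stack is a plain array of 32-bit words that the application hands to the kernel, and that the stack grows downwards as it does on ARM. STACK_FILL, stack_paint and stack_used_words are names made up for this sketch; they are not RTX functions.

```c
#include <stddef.h>
#include <stdint.h>

/* Arbitrary fill pattern; any value unlikely to occur in real data will do. */
#define STACK_FILL 0xDEADBEEFu

/* Paint a whole stack area before the task that owns it is started. */
void stack_paint(uint32_t *stack, size_t words)
{
    for (size_t i = 0; i < words; i++) {
        stack[i] = STACK_FILL;
    }
}

/* Return the number of words that no longer hold the pattern.
   The stack grows downwards, so untouched words sit at the low end. */
size_t stack_used_words(const uint32_t *stack, size_t words)
{
    size_t untouched = 0;
    while (untouched < words && stack[untouched] == STACK_FILL) {
        untouched++;
    }
    return words - untouched;
}
```

Paint each stack before its task is created, run the firmware through its worst-case paths, then read the result in the debugger. A result equal to the full stack size means the stack has most likely overflowed.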
Today's work is coming to an end now and here are the current results.
First I checked the compiler settings in the project and found, on the 'C/C++' tab, the 'Warnings' option set to 'All warnings'. That should meet my needs. The next step was to look at the compiler control string. Among other entries that define include directories, global macros, generation of listing files and so on, I found the optimization level set to 0 (-O0). That should also be fine for debugging. Then I added --strict to the string and got hundreds and hundreds of warnings / errors because // is used to comment code out. Nice experiment - I removed --strict again.
The next step was to check the stack usage. After reading the µV4 help for a while, I found the --callgraph output generated by the linker. In the callgraph I found the entry Maximum stack usage = 592 bytes + Unknown (Functions without stacksize, Cycles, Untraceable Function Pointers). Since a stack size of 274 bytes is defined for each task, this definitely is one source of data corruption.
So tomorrow I will have a closer look at the functions that use a large amount of stack and try to optimize them. Further, I will try to implement user-defined stack sizes for each task.
Do you think I am on the right track? Any comments or hints?
I think Marc Crandall is right. And maybe it is not a Reset, it just looks like a Reset.
Hi Robert Suess,
Did you implement any of the
Undef_Handler SWI_Handler PAbt_Handler DAbt_Handler
or some kind of software reset functionality?
I see, thank you for all the answers.
@S Steve: Ok, thank you for sharing your refreshing ideas. :) I was not quite sure if you were trolling around.
The general consensus is that there are no serious bugs in the linker.
Should I go with the flow or should I gather my own experience, what do you think?
@Per: Thank you for enlightening me about S(tunned) Steve. ;)
@Marc: Welcome to my thread and thank you very much for your input! I will try to provoke the reset again and, if successful, I will check the value of the RSID after the reset as you suggested.
To answer John's question: It is very difficult for me to find all pieces of the puzzle since I did not write the firmware myself. It is a crackbrained mix of a very old firmware written for an 8-bit controller, an outdated RTX USBCDC example project written for the 'Keil MCB2300' and a patchwork of code snippets to make the firmware behave like it should. And there are no comments in the code. Do not get me wrong, it is a great achievement that the firmware runs as expected. But for me as a programmer it is hard to find errors if they occur now.
The information I can give at this moment, if it helps to answer your question, John:
; Exception Vectors
;  Mapped to Address 0.
;  Absolute addressing mode must be used.
;  Dummy Handlers are implemented as infinite loops which can be modified.

CDCVectors      LDR     PC, Reset_Addr
                LDR     PC, Undef_Addr
                LDR     PC, SWI_Addr
                LDR     PC, PAbt_Addr
                LDR     PC, DAbt_Addr
                NOP                            ; Reserved Vector
;               LDR     PC, IRQ_Addr
                LDR     PC, [PC, #-0x0120]     ; Vector from VicVectAddr
                LDR     PC, FIQ_Addr

Reset_Addr      DCD     Reset_Handler
Undef_Addr      DCD     Reset_Handler          ;Undef_Handler
SWI_Addr        DCD     SWI_Handler
PAbt_Addr       DCD     Reset_Handler          ;PAbt_Handler
DAbt_Addr       DCD     Reset_Handler          ;DAbt_Handler
                DCD     0                      ; Reserved Address
IRQ_Addr        DCD     IRQ_Handler
FIQ_Addr        DCD     FIQ_Handler

                IMPORT  SWI_Handler
                EXTERN  DAbt_Handler           ; RoS| 29.11.11: for RT-Agent (http://www.keil.com/support/man/docs/ulink2/ulink2_ra_modifying_startup.htm)

Undef_Handler   B       Undef_Handler
;SWI_Handler    B       SWI_Handler
PAbt_Handler    B       PAbt_Handler
;DAbt_Handler   B       DAbt_Handler           ; RoS| 29.11.11: for RT-Agent
IRQ_Handler     B       IRQ_Handler
FIQ_Handler     B       FIQ_Handler

; Reset Handler

                EXPORT  Reset_Handler
Reset_Handler
This is part of the modified startup file lpc2300.s. Hope this helps.
It's a pretty safe bet to say that there are no serious bugs (and probably very few minor ones).
There is one person who recently claimed there was a deadly bug in the linker. However, the evidence given was stunningly flawed.
I see.
I cannot make the firmware reset permanently at startup today. I will keep trying on Monday.
Thanks again for all hints and comments!
I only reply to technical statements made by the stunned among us. Now that he finally made one (!) he will get an answer:
First off, the linker, at least the one provided with MDK 4.14, is not perfect. Do you work with CM0 devices (LPC1114)? If you do, you might have encountered a failure to link with less than 0x2000 bytes of RAM available in the scatter file - fixed _ONLY_ by changing 0x2000 to 0x5000, linking, and then going back to the original setting (0x2000), which then links as well (!).
Now that he finally made one (!) he will get an answer:
Are you trying to take the proverbial? I replied to your mis-interpretation with an appropriate and valid question in:
http://www.keil.com/forum/19955/
Look carefully at how you answered it.
Professional ... I think not.
@Robert
Your fault handlers are simple while(1)s. Maybe to proceed you could implement more informative handlers to see if you can gather more information about where this fault (if any) is coming from.
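One way such a more informative handler could start is by recording where the exception came from. The sketch below is only the C side and rests on assumptions: a small assembly stub (not shown) would pass the banked link register to it, the faulting code runs in ARM state (Thumb code needs a different offset for undefined instructions), and the record lives in RAM that the startup code does not clear. All names here are hypothetical.

```c
#include <stdint.h>

typedef enum { EXC_UNDEF, EXC_PABT, EXC_DABT } exc_kind_t;

typedef struct {
    uint32_t   magic;    /* marks the record as valid */
    exc_kind_t kind;     /* which exception was taken */
    uint32_t   address;  /* address of the offending instruction */
} fault_record_t;

#define FAULT_MAGIC 0xFA017EC0u  /* arbitrary marker value */

/* On ARM7, LR in abort mode points 8 bytes past the instruction that
   caused a data abort; for prefetch abort and undefined instruction
   (ARM state) the offset is 4 bytes. */
uint32_t fault_address(exc_kind_t kind, uint32_t lr)
{
    return (kind == EXC_DABT) ? (lr - 8u) : (lr - 4u);
}

/* Store the fault details so they can be inspected with the debugger
   or reported after the next start-up. */
void fault_record_store(fault_record_t *rec, exc_kind_t kind, uint32_t lr)
{
    rec->kind    = kind;
    rec->address = fault_address(kind, lr);
    rec->magic   = FAULT_MAGIC;
}
```

After storing the record the handler can still end in the usual endless loop, so a debugger halts at a known place and the map file shows which function the recorded address belongs to.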
M
Actually I didn't examine your source properly. It looks like all of your handlers are pointing to the Reset handler.
For starters put the while(1)'s back and see if you get stuck in one of these handlers.
Like so:
; Exception Vectors
;  Mapped to Address 0.
;  Absolute addressing mode must be used.
;  Dummy Handlers are implemented as infinite loops which can be modified.

CDCVectors      LDR     PC, Reset_Addr
                LDR     PC, Undef_Addr
                LDR     PC, SWI_Addr
                LDR     PC, PAbt_Addr
                LDR     PC, DAbt_Addr
                NOP                            ; Reserved Vector
;               LDR     PC, IRQ_Addr
                LDR     PC, [PC, #-0x0120]     ; Vector from VicVectAddr
                LDR     PC, FIQ_Addr

Reset_Addr      DCD     Reset_Handler
Undef_Addr      DCD     Undef_Handler
SWI_Addr        DCD     SWI_Handler
PAbt_Addr       DCD     PAbt_Handler
DAbt_Addr       DCD     DAbt_Handler
                DCD     0                      ; Reserved Address
IRQ_Addr        DCD     IRQ_Handler
FIQ_Addr        DCD     FIQ_Handler

                IMPORT  SWI_Handler
                EXTERN  DAbt_Handler           ; RoS| 29.11.11: for RT-Agent (http://www.keil.com/support/man/docs/ulink2/ulink2_ra_modifying_startup.htm)

Undef_Handler   B       Undef_Handler
;SWI_Handler    B       SWI_Handler
PAbt_Handler    B       PAbt_Handler
DAbt_Handler    B       DAbt_Handler           ; RoS| 29.11.11: for RT-Agent
IRQ_Handler     B       IRQ_Handler
FIQ_Handler     B       FIQ_Handler

; Reset Handler

                EXPORT  Reset_Handler
Reset_Handler
Never ever jump to the reset vector/start address.
The processor has not been properly reset, and there is a big likelihood that the startup code and application code make assumptions about the current processor state that aren't true.
Most processors can make use of an internal watchdog handler to force a real reset.
Some processors have an internal register bit that can be written to to force a reset.
For most other processors, the hw designer should add external circuitry giving the processor an ability to force-reset itself.
Note that a real reset doesn't only put known content into all different registers. It also resets state machines inside the processor - often resetting state information not possible to reach using software. So a jump to the reset vector could result in a processor with a hung UART, interrupt controller or similar, and no way to get the chip back into working order again with less than a power cycle by the user.
I agree. (@Per - for many chips I think you could technically write a Reset Handler that properly resets / re-initializes all your peripherals and the CPU, but it is certainly not common or, in my opinion, good practice. But as you point out, I may be wrong in that thinking.)
I once found a website that had excellent examples of handlers (for debug purposes) for ARM7 and Cortex-M3. Does anyone have links to good examples?
I have seen one too, and later saw links to it on this forum.
A bit of Googling should find it - it showed code to walk the exception stack.
Not applicable to ARM7 but for future reference here is a good one for Cortex M3/0
support.code-red-tech.com/.../DebugHardFault
Thank you very much again, Per and Marc, for all your answers that got me moving in the right direction.
I am currently reading 'The Insider's Guide To The NXP LPC 2300/2400 Based Microcontrollers'. I have the feeling that I should have done this earlier, but I did not know that such a document existed.
I will report here if I can make the reset happen again, and I will try to find the handler that leads to the reset.
Good morning everyone.
Things appear a little bit clearer to me now. I am still reading the guide. If I understand all your latest posts correctly and review the startup code of our firmware, I realize that in our firmware all protection exceptions (Undef, PAbt, DAbt) lead to a reset of the device.
That means if any of these exceptions occurs, the firmware forces a reset. Again and again, if the exception's source is an error in the source code of our firmware. Because of this, I am not able to find any error, since the firmware resets on every exception...
Am I right so far?
And additionally: The included RealTime-Agent is not able to work like it should, because of this line:
DAbt_Addr DCD Reset_Handler ;DAbt_Handler
Right?
I changed the handlers back as Marc suggested, since I now know what this part of the startup file is doing.
I left the DAbt_Handler unchanged so that I can use the RealTime-Agent.
A question regarding the DAbt_Handler: In the following code sequence, which DAbt_Handler would be jumped to in case of a data abort exception?
                IMPORT  SWI_Handler
                EXTERN  DAbt_Handler

Undef_Handler   B       Undef_Handler
;SWI_Handler    B       SWI_Handler
PAbt_Handler    B       PAbt_Handler
DAbt_Handler    B       DAbt_Handler
IRQ_Handler     B       IRQ_Handler
FIQ_Handler     B       FIQ_Handler

; Reset Handler

                EXPORT  Reset_Handler
Reset_Handler
Would a DAbt force a jump to the external handler or to the endless loop? I would guess the jump goes to EXTERN DAbt_Handler simply because the statement is located earlier in the code.
A second question: What exactly does EXPORT Reset_Handler mean?
I found out on my own that the default DAbt_Handler has to be commented out. This should always be done if an external label is imported.
www.keil.com/.../armasmref_Babcjehh.htm
IMPORT imports the symbol unconditionally. EXTERN imports the symbol only if it is referred to in the current assembly.
[EXTERN in assembly] is different from [extern in C].
My understanding is:
Assume that, for some reason, your firmware pushes or pops data on one of the stacks and this causes a Data Abort. The processor then performs the
LDR PC, DAbt_Addr
and since
DAbt_Addr DCD Reset_Handler
the processor runs the Reset_Handler once again and goes on to do something else. If that "something else" does not cause another Data Abort, you will not notice anything of the earlier Data Abort. However, the system is already messed up.
A Reset runs the Reset_Handler. But re-running the Reset_Handler is not a Reset, for the reasons Per has explained.
Hello John,
thank you for the link to the Assembler Reference. I found the explanation for EXPORT there, but it is not fully clear to me why this directive is used for the Reset_Handler symbol in the startup file.
Reading my thread here again, I understand the following:
1. Any program exception (DAbt, PAbt, Undef, Reset) leads to a call of the reset handler in the firmware.
2. Calling the reset handler simply equals a jump to the start address of the firmware without setting any reset conditions in the device.
3. Because of this I have to implement a mechanism to force a real reset.
Someone tell me please if I'm right or wrong.
Your code does contain jumps to the reset address.
Most startup files do not. It's normal to either supply a real exception handler, or have just a busy-loop like:
PAbt_Handler B PAbt_Handler
For programs that have the watchdog enabled, the above busy-loop will hang the processor in the loop until the watchdog generates a real reset, one that does not only jump to the reset address but first performs a full reset of the processor. Full reset here means that all registers get their default values (except the boot reason bits, which will report that it was a watchdog reset), and all internal state machines get reset.
So a program should never make an intentional jump to the reset address. If the detected problem can't be solved by explicit code, then the program should let the watchdog force a reset.
Thank you for confirming my assumptions Per. It finally sunk in!
I feel a little bit stupid for taking so long to realize what you meant.
Now that I know what the problem is and am working my way through the guide, I found some interesting lines of code in the firmware.
While searching for enabled interrupt sources (to get a better overview of the firmware) I found an attempt to implement the watchdog in reset mode:
__irq void watchdog(void)
{
}

void Init_Watchdog(void)
{
  // RSIR|= 0x04;
  VICVectAddr1= (unsigned long)watchdog;   // set interrupt vector in 0
  VICIntEnable= VICIntEnable | 0x00000001;
  WDTC= 0x00000FFF;
  WDCLKSEL= 0x00000001, WDCLKSEL= 0x00000001;
  WDMOD|= 0x3;
  os_dly_wait(100);
  WDFEED= 0xAA;
  os_dly_wait(100);
  WDFEED= 0xEE;
}
Because of some errors and unnecessary code lines (marked in red), I suppose the watchdog was never running. Furthermore, there is no need to set up the watchdog as a vectored interrupt when it is used in reset mode.
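One of the errors is likely the feed sequence itself. As far as I remember the LPC23xx user manual (please verify against it), the watchdog is only started or reloaded by writing 0xAA and then 0x55 to WDFEED, with no other watchdog register access in between. The code above writes 0xAA, waits, and then writes 0xEE. A minimal sketch of a feed helper follows; wdt_feed is a made-up name, and the register is passed in as a pointer only so the function can be tried on a PC as well.

```c
/* Write the feed sequence to the watchdog feed register.
   'wdfeed' is expected to point at WDFEED on the target; the two
   writes must follow each other directly, so on a real system
   interrupts and task switches should be blocked around this call. */
void wdt_feed(volatile unsigned long *wdfeed)
{
    *wdfeed = 0xAAUL;  /* first half of the feed sequence  */
    *wdfeed = 0x55UL;  /* second half of the feed sequence */
}
```

On the target this would be called as wdt_feed(&WDFEED), assuming the device header defines WDFEED as a volatile unsigned long register.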
Now I will try to implement the watchdog, including the original endless loops called on program exceptions to get a real reset!
Any further suggestions?
Hi Robert,
Don't you still need to determine the cause of your exception/reset?
Before enabling any watchdog I would recommend implementing proper exception handlers (even if they are simple while(1)) and observing your RSID value on reset.
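To make the RSID check concrete, here is a small decoding sketch. The bit positions (POR = bit 0, EXTR = bit 1, WDTR = bit 2, BODR = bit 3) are quoted from memory of the LPC23xx user manual and should be checked there; the enum and function names are invented for this example.

```c
#include <stdint.h>

/* Reset Source Identification bits (assumed layout, see text). */
#define RSID_POR   (1u << 0)   /* power-on reset   */
#define RSID_EXTR  (1u << 1)   /* external reset   */
#define RSID_WDTR  (1u << 2)   /* watchdog reset   */
#define RSID_BODR  (1u << 3)   /* brown-out reset  */

typedef enum {
    RESET_NONE,       /* no flag set: no hardware reset was recorded */
    RESET_POWER_ON,
    RESET_EXTERNAL,
    RESET_WATCHDOG,
    RESET_BROWN_OUT
} reset_cause_t;

/* Translate a raw RSID value into a single cause.
   Power-on is checked first because it overrides the other flags. */
reset_cause_t reset_cause(uint32_t rsid)
{
    if (rsid & RSID_POR)  return RESET_POWER_ON;
    if (rsid & RSID_WDTR) return RESET_WATCHDOG;
    if (rsid & RSID_BODR) return RESET_BROWN_OUT;
    if (rsid & RSID_EXTR) return RESET_EXTERNAL;
    return RESET_NONE;
}
```

If the flags are cleared after each start-up and the firmware later restarts with RESET_NONE, the startup code was most likely re-entered by a branch (for example through an exception vector pointing at Reset_Handler) rather than by a real reset.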
Also, you should note that implementing a watchdog in an OS task based firmware is not as straightforward as you might expect.
You will need a way for each task (or the relevant tasks, anyway) to set an intermediate watchdog flag before you actually feed the hardware watchdog. (Otherwise it is meaningless, or you are only 'watching' a single task.)
I highly recommend figuring out if you have any issues and implementing proper handlers before enabling a watchdog.
Hello Marc.
You are right, I still need to find the cause of the exceptions / resets. But at the moment I am not able to make the firmware misbehave like that! ;-) Whatever forced the exceptions / resets has apparently gone away for now.
And you are right once more in telling me to implement proper exception handlers before enabling the watchdog. That's why I wrote 'Now I will try to implement the watchdog, including the original endless loops called on program exceptions to get a real reset!' in my latest post.
Could you explain in a little more detail why I need an intermediate watchdog flag when I use an RTOS? I plan to reload the WD in every active task, including my idle task. I thought this would be a good plan, because in case of an exception and a call to an endless jump loop no task is able to reload the WD. Are there mistakes in this plan, or is there something more that I should consider?
Best regards Robert
If you 'feed' (kick, reload...) the watchdog in all tasks, then you will not know if only one of your tasks gets stuck.
If you use the watchdog as you describe, then you are only using it to reset your device when an exception occurs. Generally I think watchdogs serve a bigger purpose than simply a reset on an exception.
However, this would work as you describe.
If you have access to the external reset I would suggest this as a better mechanism for resetting your device on an exception and I would use the watchdog to ensure all tasks are properly executing.
Regards,
Marc
A normal way to implement the watchdog function is to have only the lowest-prioritized task kick the watchdog.
This proves that you have enough CPU capacity that you don't starve this low-prio task.
But to verify that all the other tasks work, you normally have them kick internal counters. The low-prio task checks that these counters all get updated - if a counter has stood still for too long, then all kicking of the physical watchdog is stopped.
In some situations, you may have to create a table where the low-prio task knows the max time allowed for the different high-prio tasks to update their individual counters.
Also, tasks may no longer perform infinite waits, but should specify a timeout. Just so that they can update their counters even if a serial listener never gets any data to process from an external serial port.
The next step up is to also perform dynamic checking of the contents of important configuration and data structures, and stop kicking the watchdog if any errors are detected.
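The counter scheme described above could be sketched roughly like this. It is a plain C model, not tied to RTX: task_monitor_t, task_alive and supervisor_tick are hypothetical names, and one "tick" stands for one pass of the lowest-priority task.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    volatile uint32_t counter;  /* incremented by the supervised task      */
    uint32_t last_seen;         /* counter value at the last observed change */
    uint32_t stalled_ticks;     /* supervisor passes without any change    */
    uint32_t max_ticks;         /* per-task limit, i.e. the "table" entry  */
} task_monitor_t;

/* Called by a supervised task every time it completes a work cycle
   (also after a wait that ended by timeout). */
void task_alive(task_monitor_t *m)
{
    m->counter++;
}

/* Called once per pass by the lowest-priority task.
   Returns true if the hardware watchdog may be kicked. */
bool supervisor_tick(task_monitor_t *mon, uint32_t count)
{
    bool all_ok = true;
    for (uint32_t i = 0; i < count; i++) {
        uint32_t now = mon[i].counter;
        if (now != mon[i].last_seen) {
            mon[i].last_seen = now;      /* task made progress */
            mon[i].stalled_ticks = 0;
        } else {
            if (mon[i].stalled_ticks < UINT32_MAX) {
                mon[i].stalled_ticks++;
            }
            if (mon[i].stalled_ticks > mon[i].max_ticks) {
                all_ok = false;          /* counter stood still too long */
            }
        }
    }
    return all_ok;
}
```

The low-priority task would kick the hardware watchdog only while supervisor_tick returns true. A stricter variant could latch the failure so that a task which recovers late cannot re-enable kicking.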
Good morning Marc, good morning Per.
At first I want to thank you again for all the great input.
I understand what you both have explained to me and I decided to implement your recommendations step by step.
=> The first step is to implement the watchdog to force a real reset on error exceptions. This is better behaviour than simply jumping to the start address - that's what I have learned. ;-)
This step is nearly finished and it should be enough for the moment, since this is much more than I was instructed to do with the firmware. With this step I should be able to better debug the firmware in case of an error exception.
=> The second step is to implement the following mechanism:
Active tasks do not wait infinitely and every task has its own timer.
The watchdog is kicked by the lowest-prioritized task only (currently my idle task) to ensure that enough CPU capacity is available. Furthermore, this task checks whether all other tasks are running by checking their timers. If a task gets stuck, the checking task can restart it. If restarting the deadlocked task is not enough (depending on the task's function), the checking task could reset the whole device via external reset (I'm sure I can get access to it) after error information has been saved to memory (task name, register states, ...).
If a program error exception is detected, the device would be restarted via external reset once the error conditions have been saved to memory by the respective handlers.
Is the second step consistent with what you both have suggested?
I have another question regarding the user defined stack size for every task.
The RTOS is initialized by os_sys_init(task1). I want task1 to have its own user-defined stack size using os_sys_init_user(task1, ...), but I do not want to reserve memory for its stack permanently.
How can I provide memory dynamically for task1, which ends itself with os_tsk_delete_self()?
Looking forward to your answers.
Robert,
Don't use dynamic memory if you can avoid it. It will lead to memory fragmentation - unless you allocate a big chunk at program startup and distribute it at will (memory manager...). This can indeed reduce your binary size significantly if you have enough RAM.
the checking task can restart it.
In my opinion this is a bad design. You should test to make sure no hangups can occur. And how will you "release" a task? What if it's sitting in a
for (;;) ;
? Better to reduce resource consumption and defer to the common, trusted and working solution of maintaining a bitmap, each bit indicating "task n is alive". If this regularly tested bitmap is not set entirely to 1 (or 0) - one task, or preferably the hardware abstraction layer, will reset the device. This way, the watchdog is serviced from one place only, reducing the chance of faulty code servicing it _even_ in the event of "task failure".
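A minimal sketch of such an alive bitmap, under the assumption of four supervised tasks numbered 0 to 3. The names are invented for illustration; on a preemptive kernel the read-modify-write in task_report_alive has to be protected (for instance by briefly locking the scheduler or disabling interrupts), which is left out here.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_TASKS  4u
#define ALL_ALIVE  ((1u << NUM_TASKS) - 1u)   /* one bit per task */

static volatile uint32_t alive_bits;

/* Called by task 'task_id' once per work cycle.
   NOTE: not atomic as written; protect it on a preemptive system. */
void task_report_alive(unsigned task_id)
{
    alive_bits |= (1u << task_id);
}

/* Called periodically from the single place that services the watchdog.
   Returns true only if every task reported since the previous check,
   then clears the bitmap for the next period. */
bool watchdog_check(void)
{
    bool ok = (alive_bits == ALL_ALIVE);
    alive_bits = 0;
    return ok;
}
```

The caller feeds the hardware watchdog only when watchdog_check returns true. The check period must be longer than the slowest task's reporting interval, otherwise a healthy task looks dead.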
Of course, deleting a task and re-spawning it is always possible in an attempt to introduce a "self correcting" system. Note, however, that this could lead to inter-task synchronization problems.