Hello everyone,
Today I'm asking for hints on a tricky problem. We have firmware that uses the RTX kernel, running on an NXP LPC2368. The device this firmware was written for is now getting a new LCD, and my mission is to change the firmware to use the new display.
I've spent some weeks on this so far, and at times I've had the problem that the controller resets shortly after start, again and again...
Every time this behaviour occurred, I deleted one or more obsolete variables (mostly globals) or functions. In most cases I solved the problem by searching for other obsolete variables and deleting them from the source code - trial and error. That is really time-consuming.
While testing the firmware on Wednesday, I tried to make the adapted routine for writing data to display RAM a little faster. I moved a global unsigned int into the function and changed it to a static unsigned char, because the value it has to carry is 0x0D at most.
After flashing the firmware, the controller hung after a random, short time.
Yesterday I tried to solve the problem of the firmware hanging at random times and found it occurs when no task is running: the OS calls os_idle_demon() and is not able to return from it. I found a solution on the web: create an empty low-priority task that does not use any os_wait functions, which prevents the OS from calling the idle task. (It has something to do with incorrect interrupt states on returning from the idle task.)
Today I tried again to make the display-writing function faster and changed two unsigned chars inside the function from static to non-static. After flashing this firmware, the controller resets again and again. I will now try to find out why the controller behaves this way.
What I have found out is that no watchdog is enabled by the user code (is it part of the OS?). os_stk_overflow and os_idle_demon are not called by the OS. I debug the firmware using a ULINK2.
Any ideas where to search for the problem?
Best regards
It would be up to you to enable any watchdog.
The RTOS can't do it, because the RTOS would not know when to kick the watchdog. A program that makes use of a watchdog should make a lot of checks to verify that it is really, really behaving well before deciding to kick. The RTOS can only figure out whether it itself is working ok - not whether running threads are doing what they are expected to, or whether sleeping threads are really supposed to be sleeping.
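As a rough sketch of that idea: each task could report that it is alive, and a supervisor kicks the hardware watchdog only when every expected task has checked in since the last kick. All names below are hypothetical illustration, not RTX calls:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical sketch: each task sets its "alive" bit from its main
 * loop; a supervisor feeds the hardware watchdog only when every
 * expected task has checked in since the last feed. */
#define TASK_COUNT 3
static volatile unsigned alive_flags;

static void task_checkin(unsigned task_id)   /* called from each task's loop */
{
    alive_flags |= 1u << task_id;
}

static bool supervisor_may_kick(void)        /* called periodically */
{
    if (alive_flags == (1u << TASK_COUNT) - 1u) {
        alive_flags = 0;                     /* re-arm for the next period */
        return true;                         /* safe to feed the watchdog */
    }
    return false;                            /* some task missed its check-in */
}
```

If a single task hangs, its bit never gets set, the supervisor stops feeding, and the hardware watchdog forces a real reset.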
It sounds like you have uninitialized variables, a stack overflow or a buffer overflow (memory overwrites) somewhere in the program. Adding or removing global variables, or changing the contents of the stack, changes the behaviour you see because your code changes also move the locations of lots of variables - and change the total amount of stack space needed.
Have you started by making sure you compile your code at maximum warning level?
Have you tried to fill the stacks with a pattern and check how much of each stack is getting used?
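The watermark technique suggested above can be sketched in plain C. The fill byte and function names are my own illustration, not an RTX API; this assumes a full-descending stack, so the unused space sits at the low addresses:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define STACK_FILL 0xCDu   /* arbitrary, recognizable fill byte */

/* Fill the whole stack area with the pattern before the task starts. */
static void stack_fill(unsigned char *stk, size_t size)
{
    memset(stk, STACK_FILL, size);
}

/* Count how many pattern bytes survive at the low (unused) end:
 * that is the remaining stack margin after the test run. */
static size_t stack_unused(const unsigned char *stk, size_t size)
{
    size_t n = 0;
    while (n < size && stk[n] == STACK_FILL)
        ++n;
    return n;
}
```

Run the firmware under heavy load, then inspect the area in the debugger (or call something like stack_unused) to see how deep the task ever got.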
Today's work is coming to an end now, and here are the actual results.
First I checked the compiler settings in the project and found, in the 'C/C++' tab, the option 'Warnings' set to 'All warnings'. That should meet my needs. The next step was to look at the compiler control string. Amongst other entries that define include directories, global macros, generation of listing files and so on, I found the optimization level set to 0 (-O0). That should also be fine for debug purposes. Then I added --strict to the string and got hundreds and hundreds of warnings / errors because of using // to comment out code. Nice experiment - I removed --strict.
The next step was to check the stack usage. After reading the µV4 help for a while, I found the --callgraph output generated by the linker. Opening the callgraph, I found the entry 'Maximum stack usage = 592 bytes + Unknown (Functions without stacksize, Cycles, Untraceable Function Pointers)'. Since there is a stack size of only 274 bytes defined for each task, this is definitely one source of data corruption.
So tomorrow I will have a closer look at the functions that use a huge amount of stack and try to optimize them. Further, I will try to implement user-defined stack sizes for each task.
Do you think I am on the right track? Any comments or hints?
An endless-loop normally looks like:
while(1) { }
for(;;) { }
Hello John.
Thank you for your answer again.
Yes, OS_STKCHECK is enabled all the time. We use an older version of RTX, so we have os_stk_overflow() instead of os_error() to recognize stack overflows. But this error function is never called when the reset situations occur.
Regarding the endless-loop: Yes, I know what an endless-loop looks like, but I wanted my idle task to have some job to do ;)
Since the controller does not reset with the changed variables at the moment, I will now try to force the reset situation again. I want to remind all readers of this thread that there are many unanswered questions I have asked here. I summarize them:
1. What static code analyzer would you suggest to debug / analyze an RTX project? (I am able to use an analysis function (--callgraph) provided by the linker inside µVision4.)
2. Do you think that I should use the water-level method for stack checking if I can force the reset error to occur again? (Why should I do so, since os_stk_overflow was never called in the past?)
3. If question 2 is answered with 'yes', how can I locate the 1096-byte stacks of the tasks and fill them with 0xDEADFADE? (I know how to write values to a memory area, but I do not know where exactly the stack is placed in RAM by the RTOS.)
4. May a wrongly aligned stack pointer be the reason for the occurrence of reset errors?
5. May the MAM timing setting (4 fetch cycles) be another reason for errors?
6. Is there any idea why the implementation of RT Agent has led to a working version of the firmware? (I am thinking of the hint by Per regarding the possible rearrangement of the whole firmware if one little thing is changed.)
Best regards, and thank you for any answer to my questions
Note that the watermark method indicates how much stack you use. The OS code just tells you if you get an overflow.
But if you only make use of the OS code, then the question is - how do you properly allocate optimal stack sizes for all your tasks without either being very close to the limit (so that a single extra auto variable [potentially from changed code optimization] takes you over the limit) or wasting excess stack space that you could have used for larger communication buffers?
You always want to quantify your stack need, so you can produce a document saying how much safety margin you have added and why you think that should be enough.
Ok, I see. But how can I fill the stack of every task with a pattern? That's the point I am stuck at.
My idea is to declare a char at the very beginning of a task and then fill 1096 bytes with my pattern, starting from the char's address. The char is an auto variable and should be placed on the stack. In debug mode I can check the char's address with a breakpoint when the task starts. Then I let the firmware run and try to heavily load the task. At the end I check how much of the pattern still remains in the 1096 bytes following the char's address.
That's it?
I use individually sized stacks for every task.
So I have a number of global arrays that I pass as parameters for the stacks when I create the tasks. It's quite easy to fill these arrays before the tasks are created, as I already know their addresses and sizes. And if I verify that the linker doesn't split them into two memory regions (for processors that have multiple RAM regions), I can use a single loop to fill all the stack memory.
If you configure the OS to supply the stacks, then you should still have access to a symbol for the memory area the OS will make use of, so you don't need to find the individual start address of each task stack.
I see. I'll try to create an example with a user-defined stack for my idle task 'task3', which needs no stack space for variables. So the only thing I have to do is reserve a stack space of at least 68 bytes and fill it with a pattern:
static U64 Idle_Stk[88/8];
OS_TID id3;

main() {
    unsigned char pattern[8] = {0xDE, 0xAD, 0xFA, 0xDE, 0xDE, 0xAD, 0xFA, 0xDE};
    int i;
    for (i = 0; i < sizeof(Idle_Stk); ++i)
        memcpy(&Idle_Stk[i], pattern, 8);
    // ...
    os_sys_init(task1);
}
//-------------------------------------------
__task void task1(void) {
    //...
    id3 = os_tsk_create_user(task3, 1, &Idle_Stk, sizeof(Idle_Stk));
    //...
}
//-------------------------------------------
Is that code right?
How can I verify that the stack is not split by the linker?
Best regards and thank you very much so far!
I've made a little mistake... sizeof(Idle_Stk) returns 88, and in the for loop I need the result 11. So the for loop should look like this:
for(i = 0; i < (sizeof(Idle_Stk) / 8); ++i)
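Putting that correction together as a standalone, compilable sketch (the U64 typedef and the 88-byte idle stack size are taken from this thread; the RTX calls are left out):

```c
#include <assert.h>
#include <string.h>

typedef unsigned long long U64;   /* stand-in for the RTX U64 typedef */

static U64 Idle_Stk[88 / 8];

static void fill_idle_stack(void)
{
    static const unsigned char pattern[8] =
        { 0xDE, 0xAD, 0xFA, 0xDE, 0xDE, 0xAD, 0xFA, 0xDE };
    size_t i;

    /* One iteration per U64 element, not per byte: sizeof(Idle_Stk)
     * is 88, but the array has only 88/8 = 11 elements. */
    for (i = 0; i < sizeof(Idle_Stk) / sizeof(Idle_Stk[0]); ++i)
        memcpy(&Idle_Stk[i], pattern, 8);
}
```

The original loop bound of sizeof(Idle_Stk) would have written 88 * 8 = 704 bytes, overrunning the array by a wide margin - exactly the kind of overwrite this thread is hunting.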
Not sure where you got your value of 68 from. But it is quite an "odd" value - do note the alignment requirements for the stack. You would normally also size the stacks as a multiple of your alignment requirement.
So it's quite common to have something like:
U64 render_stack[1280/8];
U64 display_stack[1024/8];
...
I successfully tested my first task creation with a user-defined stack (including initializing the stack with a pattern). I saw the pattern in the debugger and how much of it gets overwritten. I'm very proud!
The 68 bytes come from here: http://www.keil.com/support/man/docs/rlarm/rlarm_ar_cfgstack.htm , where it is written: 'On the full context task switch, the RTX kernel stores all ARM registers on the stack. Full task context storing requires 64 bytes of stack.'
Additionally, I remember reading something these days saying that in some cases 4 more bytes are needed for a successful task switch, but I can't find it at the moment.
That's why I "guessed" that I need at least 68 bytes for the stack of my idle task.
I verified the stack usage of my idle task with the debugger and found 4 bytes used at the very beginning of the stack and 64 bytes used at the end, so I believe that 68 bytes are quite fine.
I want to thank you very much again - I now have a wide set of tools if I need to find any error in the future!
If I come into a situation again where the controller resets while starting, I will investigate the reasons more deeply and report in this thread.
So let's go on to estimating how much stack space a task needs. Let's say I have another simple task. Looking in the file generated by the --callgraph linker option, I find a Max Depth of 128 bytes, and the task itself needs 0 bytes of extra stack. So I would simply estimate that the task needs a 196-byte stack (68 bytes basic stack for the task switch and 128 bytes for the longest call chain). Is that right? Another question regarding this task: the task has a local unsigned short. Why does this variable need no stack space?
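The estimate described above (context-save bytes plus callgraph Max Depth, rounded up to the 8-byte stack alignment) can be written down as a small helper. The 68-byte context figure is the one worked out in this thread (64 bytes per the Keil docs plus the 4 extra bytes observed), not an official constant:

```c
#include <assert.h>

#define STACK_ALIGN    8u    /* U64 alignment used for the stack arrays */
#define TASK_CTX_BYTES 68u   /* full ARM context save, per this thread */

/* Estimate a task's stack size from the --callgraph Max Depth value,
 * rounded up to the next multiple of the stack alignment. */
static unsigned task_stack_size(unsigned max_call_depth)
{
    unsigned need = TASK_CTX_BYTES + max_call_depth;
    return (need + STACK_ALIGN - 1u) / STACK_ALIGN * STACK_ALIGN;
}
```

For the example task: 68 + 128 = 196 bytes, rounded up to 200 to keep the alignment.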
Note that the compiler can decide to use a register instead of allocating a variable on the stack - then the stack space for that variable will be included in the stack space used for a state save during a task switch.
The four bytes you saw at one end of the stack were probably the OS overflow marker, which it uses to detect a stack overflow.
Ok, I see. Your explanations sound logical to me, thank you Per.
To go on with user-defined stack space for most of the tasks in our firmware, I checked the 'Max Depth' value output by --callgraph. Then I added 68 bytes to estimate the maximum stack space needed and increased the value to a multiple of eight. That works fine so far.
But now, there is a task in the callgraph output file that looks like this:
task4 (ARM, 848 bytes, Stack size 0 bytes, ma96.o(.text), UNUSED)
If I create the task with a user-defined stack of 68 (72) bytes, os_stk_overflow() is called right after the task has been started. I wonder why the word 'UNUSED' appears in the callgraph output. The task is called often, and there are several functions that will be called by the task at runtime.
Why can callgraph not calculate any call chain?
Why is the task marked as 'UNUSED' in callgraph output file?
Should I manually estimate the worst case call chain for the task?
Hello,
here we go again! I've spent the last few days optimizing the tasks' stack sizes, among other things.
Today I tried to user-define the stack for a task that waits for an event which is set when a USB connection is made to the device.
I checked the callgraph output, which estimates a Max Depth of 232 bytes for the task, and created my own stack area with a size of 512 bytes.
Then I changed the task creation instruction to use the user-defined stack and incremented the number of user-defined tasks by 1 in the config file. My plan was then to check with a pattern whether the stack is big enough.
Compiling the code and writing it to the device resulted in a permanent reset. The RTOS stack overflow handler is not called. The reset occurs before the changed task is created in the initial task. I experimentally changed the stack size to 1096 bytes (which is the task's default stack size in the config file), but nothing changed - the device resets permanently. If I change the task back to an RTOS-defined stack, my program runs correctly.
So now I am able to check whether a stack overflow occurs, have implemented the RTA in my program, and have disabled caching while debugging. But I have no idea where to search for the reason for this reset failure.
Any hints?
Could the reason be that my own stack is located in a different memory area than the stack provided by the RTOS?
Writing a pattern into user-defined stacks can maybe tell you which task's stack is overwritten, and maybe even by how much. On the other hand, RTX does have such a mechanism, and it is not triggered. Maybe one of _your_ buffers is being overwritten instead, which causes entry into an abort handler - and thus it is not a stack overrun at all?
Note that it is possible for a stack overflow to jump past the marker the OS may use for overflow detection. When a program declares lots of auto variables, the stack may overflow but with unused holes - for example a 100-char write buffer that isn't completely filled.
A simulator that explicitly keeps track of the stack pointer can detect such a stack overflow. But an OS that is limited to a single marker word can not.
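That limitation can be demonstrated with a toy model - the array indices and magic value below are purely illustrative, not how RTX lays out its stacks:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: memory[GUARD_IDX] is the single overflow-marker word at
 * the bottom of a descending stack (indices GUARD_IDX+1 .. 63). A frame
 * that reserves space past the guard, but only writes the far end of
 * its buffer, corrupts memory below the stack without ever touching
 * the marker word. */
#define GUARD_IDX 8
#define GUARD_VAL 0xE25A2EA5u

static unsigned memory[64];

static void init(void)         { memory[GUARD_IDX] = GUARD_VAL; }
static bool guard_intact(void) { return memory[GUARD_IDX] == GUARD_VAL; }

/* "Allocates" a frame reaching below index 0..GUARD_IDX but only fills
 * the first few words of its buffer - the guard word is jumped over. */
static void oversized_frame(void)
{
    for (int i = 0; i < 4; ++i)
        memory[i] = 0xBADu;    /* corruption below the stack */
}
```

The marker survives, so the OS reports no overflow, yet memory outside the stack has been clobbered - which is why a simulator tracking the actual stack pointer catches cases the marker word cannot.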
And as noted - stack overflows are bad, but it is quite easy to get similar problems from buffer overruns, uninitialized pointers or uninitialized array indices.
Maybe ...
It's one of those Dealy linker bugs???
Dealy? Deadly! Since I'm a native German, I don't know how to take the last post. Is there any known linker bug that I can check my linker for?
Some more background information for you: 'Use Microlib' is enabled in the target options. I do not know if it has any relevance.
I really have no idea where to search. I spent the last hours running debugger sessions and tried to find a fixed point to catch the error, but every time the program behaves differently. Now my program code is at the point it was this morning, but it is not resetting any more... really no difference to the code from some hours ago, but no reset occurs.
I wish I had a pro right here by my side! I am tired of this sick program.
Hi Robert,
Please don't take that last post seriously.
The general consensus is that there are no serious bugs in the linker.
(Unless someone knows something different that they don't want to share.)
S(tunned) Steve is just bored and wants to throw gravel in the machinery by h(a)unting Tamir about a previous thread. Nothing you need to worry about.
I don't think anyone has asked yet but what is the source of the reset?
In other words what is the value of the RSID after the reset?
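On the LPC2368 the reset source is reported in the RSIR (Reset Source Identification Register). A decode helper might look like the sketch below - the bit layout (bit 0 POR, bit 1 EXTR, bit 2 WDTR, bit 3 BODR) is my reading of the NXP user manual, so verify it against your datasheet before relying on it:

```c
#include <assert.h>
#include <string.h>

/* Decode an LPC23xx RSIR value into a human-readable reset source.
 * Assumed bit layout (check the NXP user manual):
 *   bit 0: POR  (power-on reset)
 *   bit 1: EXTR (external reset pin)
 *   bit 2: WDTR (watchdog reset)
 *   bit 3: BODR (brown-out reset) */
static const char *reset_source(unsigned rsir)
{
    if (rsir & (1u << 2)) return "watchdog";
    if (rsir & (1u << 3)) return "brown-out";
    if (rsir & (1u << 1)) return "external pin";
    if (rsir & (1u << 0)) return "power-on";
    return "unknown";
}
```

Reading RSIR in early startup (and clearing it by writing the bits back) would show whether these "resets" come from the watchdog, the pin, or something else entirely.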
M
I think Marc Crandall is right. And maybe it is not a reset - it just looks like one.
Hi Robert Suess,
Did you implement any of the
Undef_Handler SWI_Handler PAbt_Handler DAbt_Handler
or some kind of software reset functionality?
I see, thank you for all the answers.
@S Steve: Ok, thank you for sharing your refreshing ideas. :) I was not quite sure whether you were trolling around.
Should I go with the flow or should I make my own experiences, what do you think?
@Per: Thank you for enlightening me about S(tunned) Steve. ;)
@Marc: Welcome to my thread and thank you very much for your input! I will try to enforce the reset again and if successful, I will check the value of the RSID after the reset like you suggested.
To answer John's question: it is very difficult for me to find all the pieces of the puzzle, since I did not write the firmware myself. It is a crack-brained mix of a very old firmware written for an 8-bit controller, an outdated RTX USBCDC example project written for the 'Keil MCB2300', and a patchwork of code snippets to make the firmware behave as it should. And there are no comments in the code. Do not get me wrong, it is a great achievement that the firmware runs as expected. But for me as a programmer it is hard to find errors when they occur now.
The information I can give at the moment, if it helps to clarify your question, John:
; Exception Vectors
;  Mapped to Address 0.
;  Absolute addressing mode must be used.
;  Dummy Handlers are implemented as infinite loops which can be modified.

CDCVectors      LDR     PC, Reset_Addr
                LDR     PC, Undef_Addr
                LDR     PC, SWI_Addr
                LDR     PC, PAbt_Addr
                LDR     PC, DAbt_Addr
                NOP                             ; Reserved Vector
;               LDR     PC, IRQ_Addr
                LDR     PC, [PC, #-0x0120]      ; Vector from VicVectAddr
                LDR     PC, FIQ_Addr

Reset_Addr      DCD     Reset_Handler
Undef_Addr      DCD     Reset_Handler           ;Undef_Handler
SWI_Addr        DCD     SWI_Handler
PAbt_Addr       DCD     Reset_Handler           ;PAbt_Handler
DAbt_Addr       DCD     Reset_Handler           ;DAbt_Handler
                DCD     0                       ; Reserved Address
IRQ_Addr        DCD     IRQ_Handler
FIQ_Addr        DCD     FIQ_Handler

                IMPORT  SWI_Handler
                EXTERN  DAbt_Handler            ; RoS| 29.11.11: for RT-Agent (http://www.keil.com/support/man/docs/ulink2/ulink2_ra_modifying_startup.htm)

Undef_Handler   B       Undef_Handler
;SWI_Handler    B       SWI_Handler
PAbt_Handler    B       PAbt_Handler
;DAbt_Handler   B       DAbt_Handler            ; RoS| 29.11.11: for RT-Agent
IRQ_Handler     B       IRQ_Handler
FIQ_Handler     B       FIQ_Handler

; Reset Handler
                EXPORT  Reset_Handler
Reset_Handler
This is part of the modified startup file lpc2300.s. Hope this helps.
It's a pretty safe bet to say that there are no serious bugs (and probably very few minor ones).
There is one person who claimed there was a dealy bug in the linker recently. However the evidence given was stunningly flawed.
I see.
I cannot make the firmware reset permanently at startup today. I will try again on Monday.
Thanks again for all hints and comments!
I only reply to technical statements made by the stunned among us. Now that he finally made one (!) he will get an answer:
First off, the linker - at least the one provided with MDK 4.14 - is not perfect. Do you work with CM0 devices (LPC1114)? If you do, you might have encountered a failure to link with less than 0x2000 bytes of RAM available in the scatter file - fixed _ONLY_ by changing 0x2000 to 0x5000, linking, and then going back to the original setting (0x2000), which then links as well (!).
Now that he finally made one (!) he will get an answer:
Are you trying to take the proverbial? I replied to your mis-interpretation with an appropriate and valid question in:
http://www.keil.com/forum/19955/
Look carefully at how you answered it.
Professional ... I think not.
@Robert
Your fault handlers are simple while(1)s. Maybe to proceed you could implement more informative handlers, to see whether you can gather more information about where this fault (if any) is coming from.
Actually I didn't examine your source properly. It looks like all of your handlers are pointing to the Reset handler.
For starters, put the while(1)s back and see if you get stuck in one of these handlers.
Like so:
; Exception Vectors
;  Mapped to Address 0.
;  Absolute addressing mode must be used.
;  Dummy Handlers are implemented as infinite loops which can be modified.

CDCVectors      LDR     PC, Reset_Addr
                LDR     PC, Undef_Addr
                LDR     PC, SWI_Addr
                LDR     PC, PAbt_Addr
                LDR     PC, DAbt_Addr
                NOP                             ; Reserved Vector
;               LDR     PC, IRQ_Addr
                LDR     PC, [PC, #-0x0120]      ; Vector from VicVectAddr
                LDR     PC, FIQ_Addr

Reset_Addr      DCD     Reset_Handler
Undef_Addr      DCD     Undef_Handler
SWI_Addr        DCD     SWI_Handler
PAbt_Addr       DCD     PAbt_Handler
DAbt_Addr       DCD     DAbt_Handler
                DCD     0                       ; Reserved Address
IRQ_Addr        DCD     IRQ_Handler
FIQ_Addr        DCD     FIQ_Handler

                IMPORT  SWI_Handler
                EXTERN  DAbt_Handler            ; RoS| 29.11.11: for RT-Agent (http://www.keil.com/support/man/docs/ulink2/ulink2_ra_modifying_startup.htm)

Undef_Handler   B       Undef_Handler
;SWI_Handler    B       SWI_Handler
PAbt_Handler    B       PAbt_Handler
DAbt_Handler    B       DAbt_Handler            ; RoS| 29.11.11: for RT-Agent
IRQ_Handler     B       IRQ_Handler
FIQ_Handler     B       FIQ_Handler

; Reset Handler
                EXPORT  Reset_Handler
Reset_Handler
Never ever jump to the reset vector/start address.
The processor has not been properly reset, and there is a big likelihood that the startup code and application code make assumptions about the current processor state that aren't true.
Most processors can make use of an internal watchdog to force a real reset.
Some processors have an internal register bit that can be written to in order to force a reset.
For most other processors, the hardware designer should add external circuitry giving the processor the ability to force-reset itself.
Note that a real reset doesn't just put known content into all the different registers. It also resets state machines inside the processor - often resetting state information that is not reachable from software. So a jump to the reset vector could result in a processor with a hung UART, interrupt controller or similar, and no way to get the chip back into working order short of a power cycle by the user.
I agree. (@Per - for many chips, technically, I think you could write a Reset Handler that properly resets / re-initializes all your peripherals and the CPU, but that is certainly not common, or in my opinion good practice. But as you point out, I may be wrong in that thinking.)
I once found a website that had excellent examples of handlers (for debug purposes) for ARM7 and Cortex-M3. Does anyone have good links to good examples?
I have seen one too, and later saw links to it on this forum.
A bit of Google skill should be able to pick it up - it was showing code to walk the exception stack.