Hello everyone,
today I'm asking for hints on a tricky problem. We have firmware based on the RTX kernel running on an NXP LPC2368. The device this firmware was written for is now supposed to get a new LC display, and my task is to change the firmware so that it drives the new display.
I've spent several weeks on this so far, and every now and then I've had the problem that the controller resets shortly after start-up, again and again...
Every time this behaviour occurred, I deleted one or more obsolete variables (mostly globals) or functions. In most cases I "solved" the problem by finding further obsolete variables and removing them from the source code - pure trial and error, and really time-consuming.
While testing the firmware on Wednesday, I tried to make the adopted and modified routine for writing data to the display RAM a little faster. I moved a global unsigned int into the function and changed it to a static unsigned char, because the value it has to carry is 0x0D at most.
After flashing the firmware to the controller, the controller hung after a short, random time.
Yesterday I tried to solve the problem of the firmware hanging at random times and found that it happens when no task is running: the OS calls os_idle_demon() and is not able to return from it. I found a workaround on the web: create an empty low-priority task that never uses any os_wait function, which prevents the OS from ever calling the idle task. (It apparently has something to do with incorrect interrupt states when returning from the idle task.)
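For illustration, such a dummy task can be as simple as the following sketch (the task names and the priority value are my own choice; in RTX, priority 1 is the lowest usable task priority):

#include <RTL.h>

__task void dummy_lowprio (void) {
  for (;;) {
    /* busy loop: deliberately no os_wait()/os_dly_wait() calls here,
       so this task is always ready and os_idle_demon() never runs */
  }
}

__task void init_task (void) {
  os_tsk_create (dummy_lowprio, 1);   /* priority 1 = lowest */
  /* ... create the real application tasks here ... */
  os_tsk_delete_self ();
}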
Today I tried further to make the display writing function faster and changed two unsigned char variables inside the function from static to non-static. After flashing this firmware the controller resets again and again. I will now try to find out why the controller behaves this way.
What I have found out is that no watchdog is enabled by the user code (is that part of the OS?). Neither os_stk_overflow nor os_idle_demon is called by the OS. I debug the firmware using a ULINK2.
Any ideas where to look for the problem?
Best regards
It would be up to you to enable any watchdog.
The RTOS can't do it, because the RTOS would not know when to kick the watchdog. A program that uses a watchdog should make a lot of checks to verify that it really is behaving well before deciding to kick. The RTOS can only figure out whether it itself is working ok - not whether the running threads are doing what they are expected to do, or whether the sleeping threads really are supposed to be sleeping.
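Something along these lines would do it on the LPC2368 (only a sketch, assuming the register names from Keil's LPC23xx.h; the alive_flags bookkeeping is hypothetical and would have to be fed by your real tasks):

#include <RTL.h>
#include <LPC23xx.h>                    /* WDMOD, WDTC, WDFEED register definitions */

static volatile unsigned int alive_flags;   /* each monitored task sets "its" bit */
#define ALL_TASKS_ALIVE  0x07u              /* hypothetical: three monitored tasks */

static void watchdog_feed (void) {
  WDFEED = 0xAA;                        /* the two feed writes must follow each other */
  WDFEED = 0x55;                        /* directly - consider disabling IRQs around them */
}

__task void watchdog_task (void) {
  WDTC  = 0x00FFFFFF;                   /* time-out count, clocked from the WDT clock */
  WDMOD = 0x03;                         /* WDEN | WDRESET: enable WDT, reset chip on time-out */
  watchdog_feed ();                     /* the first feed actually starts the watchdog */
  for (;;) {
    if ((alive_flags & ALL_TASKS_ALIVE) == ALL_TASKS_ALIVE) {
      alive_flags = 0;                  /* only feed when every task has checked in */
      watchdog_feed ();
    }
    os_dly_wait (50);
  }
}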
It sounds like you have uninitialized variables, a stack overflow or a buffer overflow (memory overwrites) somewhere in the program. Adding or removing global variables, or changing what ends up on the stack, changes the behaviour you see because such code changes also move the location of lots of other variables - and change the total amount of stack space needed.
Have you started by making sure you compile your code at maximum warning level?
Have you tried filling the stacks with a pattern and checking how much of each stack actually gets used?
If Per is right, you probably want to scan your software with a static code analyzer.
Neither os_stk_overflow nor os_idle_demon is called by the OS
Maybe the overflow occurs during the execution of an interrupt (IRQ mode has a separate stack)? Note that RTX cannot warn you about that.
Good morning Per, good morning Tamir,
first of all I want to thank you for your fast responses. I wrote my opening post just before I started my weekend; that's why I'm answering so late.
I will now check whether the compiler really warns me at the maximum level when compiling, thanks for this hint.
After that I will try the water-level method to watch the stack usage, as Per suggested. If the stack of one or more tasks is at its upper limit I will have to raise the stack size. With the water-level method I can additionally see whether some tasks use only a small part of their reserved stack - in which case user-defined stack sizes would probably be wise.
Finally a question about static code analyzers: Is there a tool that you can suggest?
Today's work is coming to an end now, so here are the current results.
First I checked the compiler settings inside the project and found that in the 'C/C++' tab the 'Warnings' option is set to 'All warnings'. That should meet my needs. The next step was to look at the compiler control string. Among other entries that define include directories, global macros, generation of listing files and so on, I found the optimization level set to 0 (-O0). That should also be fine for debug purposes. Then I added --strict to the string and got hundreds and hundreds of warnings / errors because // is used to comment code out. Nice experiment - I removed --strict again.
The next step was to check the stack usage. After reading the µVision4 help for a while, I found the --callgraph output generated by the linker. Opening the call graph I found the entry 'Maximum stack usage = 592 bytes + Unknown (Functions without stacksize, Cycles, Untraceable Function Pointers)'. Since a stack size of 274 bytes is defined for each task, this would definitely be one source of data corruption.
So tomorrow I will have a closer look at the functions that use a large amount of stack and try to optimize them. Furthermore I will try to implement user-defined stack sizes for each task.
Do you think I am on the right track? Any comments or hints?
Oops, I made a little mistake: the stack size (OS_STKSIZE) in our firmware is defined as 274, but that value is in words, so the stack of each task is 1096 bytes (274 * 4 bytes) at a maximum. That puts me back at square one, because a stack overflow now seems impossible. To make sure there really is no overflow, I should implement the water-level method tomorrow, am I right? And how do I find out which memory area is used for the 12 * 1096 bytes of stack?
Maybe take a look at the below link first:
http://www.keil.com/forum/16324/
There are different types of stacks. See what Franc Urbanc said.
Ok, thank you very much John.
I read the thread and the linked threads and paid attention to Franc's explanations:
Franc Urbanc wrote: 4. the kernel main stack (defined in startup file) is not checked in stack checking.
If I understand it right, this refers to an overflow of the kernel main stack during main() execution at startup, which would not be detected by the RTOS. That does not surprise me, because the RTOS is only initialised at the end of main() and os_stk_overflow() is a function of the RTOS.
But I can't imagine a stack overflow occurring during main() execution in my case. Here is a more precise description of the reset on startup (my problem case):
The last call in main() is os_sys_init(task1). It starts task1, which creates all the other tasks we need (including the low-priority idle task that prevents os_idle_demon from being called). One of the tasks started by task1 (let's call it 'displaytask') initializes a mutex for a display-RAM writing function and shows a welcome screen. After writing the welcome screen to the display RAM using the mutex-locked function, the displaytask waits 3 seconds via os_dly_wait(). This is when the controller resets - the next statement of the displaytask is never reached. When I debug the firmware I can see my own idle task incrementing a static int up to 999999 and then setting it back to 0 in an endless loop while the displaytask is waiting. While the idle task is incrementing that static int, the controller resets. This is the idle task (with a stack usage of 0 bytes, I guess):
__task void task3(void) {
  static int iVar;
  for(iVar = 0; iVar < 1000000; ++iVar) {
    if(iVar == 999999)
      iVar = 0;
  }
}
//------------------------------------------------------------------------------
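For reference, the displaytask described above is structured roughly like this (simplified sketch with made-up names for the display data and the write routine):

#include <RTL.h>

static OS_MUT display_mutex;                      /* protects writes to the display RAM */
static const unsigned char welcome_screen[64];    /* placeholder for the real welcome-screen data */

static void display_write (const unsigned char *data, unsigned int len) {
  os_mut_wait (&display_mutex, 0xFFFF);           /* lock */
  /* ... copy 'data' into the display RAM ... */
  os_mut_release (&display_mutex);                /* unlock */
}

__task void displaytask (void) {
  os_mut_init (&display_mutex);
  display_write (welcome_screen, sizeof (welcome_screen));
  os_dly_wait (300);                              /* ~3 s at a 10 ms tick - the reset happens in here */
  for (;;) {
    /* normal display handling - never reached when the problem occurs */
  }
}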
The described behaviour occurs e.g. when I change two static unsigned char variables (RW data) to non-static unsigned char (i.e. on a task's stack) inside a function written by me. I agree it smells like a stack overflow, but I can't imagine why on earth any stack should overflow in the described situation.
Any more ideas on what to check to find the reason for the reset?
Besides reading in the forum and checking the firmware, I realized that debug information was being cached on my local PC and that the RT Agent was not implemented. I fixed both so that I'm really able to find errors. Now I'm armed to eliminate the bug!
Two other questions crossed my mind while investigating the bug:
1. Could a wrongly aligned stack pointer be the reason?
2. Could the MAM timing (set to 4 fetch cycles) be the reason? I changed it to 5 cycles and the controller still reset, but maybe it has something to do with my problem?
Best regards and many thanks for every hint
EDIT: OK, NOW I AM GOING TO GO NUTS, THE FIRMWARE NOW IS NOT RESETTING ANY MORE WITH CHANGED VARIABLES... I KEEP TESTING IT. COULD IMPLEMENTATION OF RTA AND DISABLED DEBUG CACHE HAVE SOLVED THE PROBLEM???
Regarding my problem: where can I find the information that is provided by adding '--info=summarystack' to the linker control string?
THX
Did you enable this feature of RTX?
http://www.keil.com/support/man/docs/rlarm/rlarm_ar_cfgstchk.htm http://www.keil.com/support/man/docs/rlarm/rlarm_ar_cfgerrfunc.htm
Just in case you did not use this feature:
1. Enable Stack Checking in RTX.
2. Set a breakpoint at the beginning of [void os_error (U32 err_code)].
3. Run your program and wait to see what happens.
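If stack checking catches something, an error function along these lines lets you see it right away in the debugger (a sketch; the error codes are the ones listed in the RL-ARM documentation linked above):

#include <RTL.h>

void os_error (U32 err_code) {
  /* set a breakpoint on the endless loop below and inspect 'err_code' */
  switch (err_code) {
    case OS_ERR_STK_OVF:    /* a task has overflowed its stack (OS_STKCHECK) */
      break;
    case OS_ERR_FIFO_OVF:   /* ISR FIFO queue overflow */
      break;
    case OS_ERR_MBX_OVF:    /* mailbox overflow */
      break;
  }
  for (;;);
}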
An endless-loop normally looks like:
while(1) { }
for(;;) { }
Hello John.
Thank you for your answer again.
Yes, OS_STKCHECK is enabled all the time. We use an older version of RTX, so we have os_stk_overflow() instead of os_error() to catch stack overflows. But this error function is never called when the reset situations occur.
Regarding the endless-loop: Yes, I know what an endless-loop looks like, but I wanted my idle task to have some job to do ;)
Since the controller does not reset with the changed variables at the moment, I will now try to force the reset situation again. I want to remind all readers of this thread that there are still several unanswered questions that I have asked. I summarize them here:
1. What static code analyzer would you suggest for analyzing an RTX project? (I can already use the --callgraph analysis provided by the linker inside µVision4.)
2. Do you think I should use the water-level method for stack checking if I can force the reset error to occur again? (Why should I do so, given that os_stk_overflow was never called in the past?)
3. If question 2 is answered with 'yes', how can I locate the 1096-byte stacks of the tasks and fill them with 0xDEADFADE? (I know how to write values to a memory area, but I don't know where exactly in RAM the RTOS places the stacks.)
4. Could a wrongly aligned stack pointer be the reason for the reset errors?
5. Could the MAM timing setting (4 fetch cycles) be another reason for errors?
6. Does anyone have an idea why implementing the RT Agent has led to a working version of the firmware? (I am thinking of Per's hint that the whole firmware can get rearranged when one little thing is changed.)

Best regards and thank you for any answer to my questions
Note that the watermark method indicates how much stack you use. The OS code just tells you if you get an overflow.
But if you only make use of the OS check, then the question is - how do you properly allocate optimal stack sizes for all your tasks without either being very close to the limit (so that a single extra auto variable [potentially from changed code optimization] takes you over it) or wasting excess stack space that you could have used for larger communication buffers?
You always want to quantify your stack need, so you can produce a document saying how much safety margin you have added and why you think that should be enough.
Ok, I see. But how can I fill the stack of every task with a pattern? That's the point I'm stuck at.
My idea is to declare a char at the very beginning of a task and then fill 1096 bytes with my pattern, starting from the char's address. The char is an auto variable and should be placed on the stack. In debug mode I can check the char's address with a breakpoint when the task starts. Then I let the firmware run and try to put heavy load on the task. At the end I check how much of the pattern still exists in the 1096 bytes following the char's address.
That's it?
I use individually sized stacks for every task.
So I have a number of global arrays that I pass as parameters for the stacks when I create the tasks. It's quite easy to fill these arrays before the tasks are created, as I already know their addresses and sizes. And if I verify that the linker doesn't split them into two memory regions (for processors that have multiple RAM regions), I can use a single loop to fill all of the stack memory space.
If you configure the OS to supply the stacks, then you should still have access to a symbol for the memory area the OS will make use of, so you don't need to find the individual start address of each task stack.
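Once the stacks are filled, checking the usage afterwards is just a matter of scanning each stack array from its low end for the first byte that no longer carries the fill pattern - on ARM the stack grows downwards, so the untouched pattern survives at the low addresses. A sketch, with a hypothetical stack array name:

#include <RTL.h>

static const unsigned char fill_pattern[4] = { 0xDE, 0xAD, 0xFA, 0xDE };

/* returns how many bytes of a pattern-filled stack were actually used */
static unsigned int stack_used (const unsigned char *stk, unsigned int size) {
  unsigned int untouched = 0;
  while (untouched < size && stk[untouched] == fill_pattern[untouched % 4])
    ++untouched;                       /* count untouched bytes from the low end */
  return size - untouched;
}

/* usage, e.g. from the debugger or a monitor task:
     used = stack_used ((const unsigned char *) my_task_stk, sizeof (my_task_stk)); */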
I see. I'll try to put together an example with a user-defined stack for my idle task 'task3', which needs no stack space for variables. So the only thing I have to do is reserve a stack of at least 68 bytes and fill it with a pattern:
#include <RTL.h>
#include <string.h>

__task void task1(void);
__task void task3(void);

static U64 Idle_Stk[88/8];
OS_TID id3;

int main(void) {
  unsigned char pattern[8] = {0xDE, 0xAD, 0xFA, 0xDE, 0xDE, 0xAD, 0xFA, 0xDE};
  unsigned int i;
  for(i = 0; i < sizeof(Idle_Stk)/8; ++i)   /* one 8-byte pattern per U64 element */
    memcpy(&Idle_Stk[i], pattern, 8);
  // ...
  os_sys_init(task1);
}
//-------------------------------------------
__task void task1(void) {
  //...
  id3 = os_tsk_create_user(task3, 1, &Idle_Stk, sizeof(Idle_Stk));
  //...
}
//-------------------------------------------
Is that code right?
How can I verify that the stack array is not split across memory regions by the linker?
Best regards and thank you very much so far!