Intermittent System Failure

Hello,

I'm building an embedded application on a Dallas DS89C450 microcontroller. RTX-51 Full Version is used as the RTOS.

Everything seems to work fine, except for two abnormal intermittent system failures, which I will try to describe below:

1. The system hangs for a while and the watchdog timer resets it. However, there is no infinite loop in any critical section of the code that could stop the highest-priority task from running. This task is responsible for kicking the watchdog circuitry.

2. Execution somehow jumps to a section of code that is not supposed to run at that moment. This section is responsible for clearing the EEPROM content on user request. The processor accesses the EEPROM through its data/address bus.

The same simple test procedure is applied to the system repeatedly, so the state of the system should not change between runs. However, 2 or 3 out of 30 trials end in one of these catastrophic results.

Here are my questions:

1. Can it be caused by a stack overflow?
2. Would you recommend increasing the task stack sizes? If so, by how much?

Any ideas?

Thanks in advance.
Hakan

Per Westermark over 15 years ago in reply to Hakan YAMANYAR

There are a large number of errors that can bring down a program. Some produce similar results even if you change large parts of the code, while others produce random results even if your code isn't changed.

Do you have any part of the code that can produce a buffer overflow, possibly destroying some unrelated variable? An array overwrite may result in a pointer variable somewhere always getting overwritten with the same value. If that pointer is regularly repaired, you may get away with the error 99 times out of 100, but if you get a specific interrupt while the pointer is damaged, your program fails.
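One generic way to catch an array overwrite of this kind is to bracket a suspect buffer with guard words and check them periodically. The following is only a minimal sketch in portable C, not code from the original project: the names `guarded_buf`, `guarded_init`, `guarded_ok` and `guarded_put`, the buffer size and the guard pattern are all hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define GUARD_PATTERN 0xA5C3u  /* arbitrary value, unlikely to appear by chance */
#define RX_BUF_SIZE   16u      /* hypothetical buffer size */

/* A buffer bracketed by two guard words (illustrative layout only). */
struct guarded_buf {
    uint16_t guard_lo;
    uint8_t  buf[RX_BUF_SIZE];
    uint16_t guard_hi;
};

/* Set both guards and clear the payload. */
static void guarded_init(struct guarded_buf *g)
{
    assert(g != NULL);
    g->guard_lo = GUARD_PATTERN;
    memset(g->buf, 0, sizeof g->buf);
    g->guard_hi = GUARD_PATTERN;
}

/* Returns 1 while both guards are intact, 0 once either was overwritten. */
static int guarded_ok(const struct guarded_buf *g)
{
    return g->guard_lo == GUARD_PATTERN && g->guard_hi == GUARD_PATTERN;
}

/* Bounds-checked store: refuses to write outside buf. Returns 1 on success. */
static int guarded_put(struct guarded_buf *g, uint16_t idx, uint8_t value)
{
    if (idx >= RX_BUF_SIZE) {
        return 0;  /* would have been an overflow */
    }
    g->buf[idx] = value;
    return 1;
}
```

Calling `guarded_ok()` from a low-priority task or the idle loop narrows down when the damage happens, while `guarded_put()` shows the cheaper alternative of rejecting a bad index before it does any harm.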

Another thing: do you have a variable larger than one byte that gets read or written from two interrupts that may nest, or from both main code and an interrupt handler? The main code may then sometimes see the variable in the middle of an update, for example a 16-bit pointer with only one byte updated while the other byte still has the old value. Take the pointer or index of a ring buffer: if you sample the index during an update, while its value isn't correct, and the incorrect value happens to be a specific magic value, your program may place the next received UART character at a location that overwrites other memory.
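To make the half-updated 16-bit value concrete, here is a small host-runnable sketch in portable C. Everything in it is an assumption for illustration: `rx_head` stands in for a shared ring-buffer index, and `irq_lock()`/`irq_unlock()` are hypothetical hooks that only count calls here; on the real target they would have to disable and re-enable interrupts or use whatever locking the RTOS provides.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical hooks: stand-ins for real interrupt masking on the target. */
static unsigned irq_lock_depth;
static void irq_lock(void)   { irq_lock_depth++; }
static void irq_unlock(void) { assert(irq_lock_depth > 0); irq_lock_depth--; }

/* Shared between an interrupt handler and main code (illustrative name). */
static volatile uint16_t rx_head;

/* The value observed when one byte comes from the new value and the other
 * from the old one. low_is_new selects which byte is already updated. */
static uint16_t torn_value(uint16_t old_v, uint16_t new_v, int low_is_new)
{
    if (low_is_new) {
        return (uint16_t)((old_v & 0xFF00u) | (new_v & 0x00FFu));
    }
    return (uint16_t)((new_v & 0xFF00u) | (old_v & 0x00FFu));
}

/* Option 1: take the snapshot with interrupts locked out. */
static uint16_t rx_head_read_locked(void)
{
    uint16_t snapshot;
    irq_lock();
    snapshot = rx_head;
    irq_unlock();
    return snapshot;
}

/* Option 2: read twice and retry until both reads agree. */
static uint16_t rx_head_read_twice(void)
{
    uint16_t a, b;
    do {
        a = rx_head;
        b = rx_head;
    } while (a != b);
    return a;
}
```

For an index stepping from 0x00FF to 0x0100, `torn_value()` yields 0x0000 or 0x01FF depending on which byte is stale, and neither value was ever a valid index. The locked read costs a short window of interrupt latency; the double read avoids that but can spin while updates keep arriving.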

Maybe your program has one specific call sequence that requires more stack space than all other call chains. If you get an interrupt just when inside this function, you may get a stack overflow, and the same variable assignment, register save or return address may corrupt the stack chain in such a way that the failure looks the same every time.
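A common, generic way to find out whether a stack is too small, and by roughly how much, is to fill it with a known byte before the task starts and later count how many bytes were never touched. The sketch below assumes a stack that grows upward, as the 8051 hardware stack does. The function names are hypothetical, and how to locate each task's stack region under RTX-51 is not shown here; that has to come from the RTOS documentation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define STACK_FILL 0xA5u  /* arbitrary fill byte */

/* Fill a stack region with a known pattern before the task starts. */
static void stack_paint(uint8_t *region, size_t size)
{
    assert(region != NULL);
    memset(region, STACK_FILL, size);
}

/* Count bytes at the high end of the region that still hold the pattern.
 * With an upward-growing stack, the high end is used last. */
static size_t stack_unused(const uint8_t *region, size_t size)
{
    size_t unused = 0;
    while (unused < size && region[size - 1 - unused] == STACK_FILL) {
        unused++;
    }
    return unused;
}

/* Peak usage seen so far. A result equal to size means the region was
 * used completely and may well have overflowed. */
static size_t stack_high_water(const uint8_t *region, size_t size)
{
    return size - stack_unused(region, size);
}
```

After a long test run, the high-water mark of each stack gives a measured basis for resizing instead of a guess. The count can be slightly low if the task happened to push the fill value itself at the top of its usage.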

Since you are using an RTOS, you have to remember that your processor doesn't have any memory protection. Any memory overwrite that destroys a critical data structure in the RTOS can result in the RTOS performing an incorrect task switch. If your watchdog kick thread is always started in a given order, it may be the one that always gets affected by the memory overwrite.

Is there a possibility that you can deactivate different threads, and see if a specific thread seems to be involved in producing the error? Maybe the error only happens if you get ISR x activated while task y is running and visiting function z?

An RTOS on a small and "dumb" processor can be a real pain, because it is so darn hard to debug when you get a pointer/index error somewhere. With "just" interrupts, you can normally modify the code to manage without calling some parts of the code or with one or more ISRs deactivated. RTOS-based applications often have so many interlocks that it is hard to test-run subsets of the code. And it is even harder to simulate specific task-switch combinations based on task priorities, pending events, ...

This is a reason why an RTOS should be avoided if not really, really needed. Even if you stress-test a program for hundreds of hours, you can still only talk about probabilities/confidence that your stress tests have managed to perform task switches at the critical times. In the same way, this is a big reason why people try to design superloop applications with non-nested interrupts. Allowing nested interrupts quickly increases the total number of ways events can nest and quickly makes it impossible to force all combinations during testing. It's so much simpler to pay for a faster processor that can guarantee all critical timing with everything running in sequence - then you can manually pause a program, use your JTAG interface to trigger all possible events, and unpause the program. And you can select any point in the call chains of the main loop to force interrupt events, all the way down to single-stepping individual assembler instructions while triggering one or more ISRs.

Hakan YAMANYAR over 15 years ago in reply to Per Westermark

Hello Per,

First of all, thank you so much for your very informative post. I think it will help a lot. If you don't mind, may I have your e-mail address so I can reach you after I have reviewed the points you mentioned against my source code?

Hakan YAMANYAR over 15 years ago in reply to Hakan YAMANYAR

By the way, I totally agree with your comments on not using an RTOS on a small processor. However, I took over the source from somebody who was previously responsible for the project, and it now seems quite hard to remove the RTOS from it.

erik malund over 15 years ago in reply to Per Westermark

This is a reason why an RTOS should be avoided if not really, really needed. Even if you stress-test a program for hundreds of hours, you can still only talk about probabilities/confidence that your stress tests have managed to perform task switches at the critical times. In the same way, this is a big reason why people try to design superloop applications with non-nested interrupts. Allowing nested interrupts quickly increases the total number of ways events can nest and quickly makes it impossible to force all combinations during testing. It's so much simpler to pay for a faster processor that can guarantee all critical timing with everything running in sequence - then you can manually pause a program, use your JTAG interface to trigger all possible events, and unpause the program. And you can select any point in the call chains of the main loop to force interrupt events, all the way down to single-stepping individual assembler instructions while triggering one or more ISRs.

My experience is that with an RTOS it is 10% easier to write the code and 1000% more difficult to debug the 'nasties'.
This comment, of course, is totally irrelevant since we all write bug-free code :)

Just one comment: Per may sound as if he is saying "do not use interrupts". I am sure he does not mean that; by interrupts he is, I presume, referring to 'task switching' interrupting a task.

Erik