Intermittent System Failure

Hello,

I'm building an embedded application on a Dallas DS89C450 microcontroller. RTX-51 Full Version is used as the RTOS.

Everything seems to work fine, except that I'm facing two intermittent system failures, which I'll try to describe:

1. The system hangs for a while and the watchdog timer resets the system. However, there is no infinite loop in any critical section of the code that could stop the highest-priority task in the system from running. This task is responsible for kicking the watchdog circuitry.

2. Execution somehow jumps to a section of code that is not supposed to run at that moment. That section is responsible for clearing the EEPROM content on user request. The processor accesses the EEPROM through its data/address bus.

The same simple test procedure is applied to the system repeatedly, so the state of the system doesn't seem to change. However, 2 or 3 out of 30 trials end in such a catastrophic result.

Here are my questions:

1. Can it be caused by a stack overflow?
2. Would you recommend increasing the task stack sizes? If yes, by how much?

Any ideas?

Thanks in advance.
Hakan

Tamir Michael over 15 years ago

1. The system hangs for a while and the watchdog timer resets the system. However, there is no infinite loop in any critical section of the code that could stop the highest-priority task in the system from running. This task is responsible for kicking the watchdog circuitry.

But the point is that you have a failure; that's why the watchdog is not kicked. If your PC is taking a hike, you are, err, very, very dead indeed.

Can it be caused by a stack overflow?

Of course. The required size is very application-specific.

Hakan YAMANYAR over 15 years ago in reply to Tamir Michael

I assume that you mean "Program Counter" by PC. I know that program flow may jump to an undesired location if the PC is not popped from the stack properly after a function call. However, what confuses me is that I continue developing my project, which means I'm adding source code every single day, but the behavior does not change! OK, it used to jump to the "EraseEEPROM" function, but how can it still jump to the same location although I'm changing the code? I'm quite dazed and confused.

erik malund over 15 years ago in reply to Hakan YAMANYAR

What is the stack address in the .map file????
Coding in C or assembler or both?
Any ISRs 'using 0' (default)?
All interrupts same priority?

I know that program flow may jump to an undesired location if the PC is not popped from the stack properly after a function call.
Difficult to achieve except by stack under/overflow and, if you are coding in C, only too small a stack can be the cause.

However, what confuses me is that I continue developing my project, which means I'm adding source code every single day, but the behavior does not change!
La-la land is a big place; if it jumps to la-la land, it does not matter where.

Erik

Hakan YAMANYAR over 15 years ago in reply to Hakan YAMANYAR

Maybe a bottleneck located somewhere deep down in my call graph is lurking to catch me. Even if I keep growing my source code elsewhere, this part remains the same, and when it is triggered intermittently, God knows when, it hunts me down. Can this be the reason?

Hakan YAMANYAR over 15 years ago in reply to Hakan YAMANYAR

Thank you, Erik, for your post. I am developing it under Keil in C. There are a couple of assembler source files in my project, but not many; it is 90% C.

Hakan YAMANYAR over 15 years ago in reply to Hakan YAMANYAR

And not all ISRs have the same priority level. But you asked me if there are any 0's. Well, no.

erik malund over 15 years ago in reply to Hakan YAMANYAR

'using' has nothing to do with priority level
WHAT does the .map file show as THE STACK address

Erik

Hakan YAMANYAR over 15 years ago in reply to erik malund

What I could find in my *.map file goes as follows:

* * * * * * * * * * *   D A T A   M E M O R Y   * * * * * * * * * * * * *
START      STOP       LENGTH     ALIGN  RELOC    MEMORY CLASS  SEGMENT NAME
000000H    000007H    000008H    ---    AT..     DATA          "REG BANK 0"
000008H    00000DH    000006H    BYTE   UNIT     DATA          ?RTX?INT_MASK?RTXCONF
00000EH    00000FH    000002H    BYTE   UNIT     DATA          ?C?LIB_DATA
000010H    000017H    000008H    ---    AT..     DATA          "REG BANK 2"
000018H    00001FH    000008H    ---    AT..     DATA          "REG BANK 3"
000020H    000021H    000002H    BYTE   BITADDR  DATA          ?RTX?RTX_BIT_RELBYTE_SEG
000022H.0  000023H.5  000001H.6  BIT    UNIT     BIT           ?RTX?RTX_BIT_SEG
000023H.6  000023H.7  000000H.2  BIT    UNIT     BIT           ?RTX?FLT_BITSEG
000024H.0  000024H.0  000000H.1  BIT    UNIT     BIT           _BIT_GROUP_
000024H.1  000024H    000000H.7  ---    ---      **GAP**
000025H    000047H    000023H    BYTE   UNIT     DATA          ?RTX?RTX_RELBYTE_SEG
000048H    00005FH    000018H    BYTE   UNIT     IDATA         ?RTX?FTASKDATA?2
000060H    000077H    000018H    BYTE   UNIT     IDATA         ?RTX?FTASKDATA?3
000078H    00007DH    000006H    BYTE   UNIT     IDATA         _IDATA_GROUP_
00007EH    000091H    000014H    BYTE   UNIT     IDATA         ?STACK

Are you asking about the last line that I made bold?

Hakan YAMANYAR over 15 years ago in reply to Hakan YAMANYAR

By the way, what does "any ISRs using 0 (default)?" mean?

Per Westermark over 15 years ago in reply to Hakan YAMANYAR

There are a large number of errors that can bring down a program. Some produce similar results even if you change large parts of the code, while others produce random results even if your code isn't changed.

Do you have any part of the code that can produce a buffer overflow, possibly destroying some unrelated variable? An array overwrite may result in a pointer variable somewhere always getting overwritten with the same value. If that pointer is regularly repaired, you may get away with the error 99 times out of 100, but if you get a specific interrupt during the time when the pointer is damaged, your program fails.
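
One common way to catch this kind of array overwrite early is to frame a suspect buffer with guard bytes and check them periodically (for example, from the watchdog task). The following is only a minimal sketch in portable C; the names `guarded_buf`, `GUARD_VALUE` and `RX_BUF_SIZE` are hypothetical and do not come from the project discussed here.

```c
#include <stdint.h>
#include <string.h>

#define GUARD_VALUE 0xA5u  /* arbitrary pattern, hypothetical choice */
#define RX_BUF_SIZE 8u     /* hypothetical buffer size */

/* Buffer framed by guard bytes: an overrun in either direction
   damages a guard before it reaches unrelated variables. */
struct guarded_buf {
    uint8_t guard_lo;
    uint8_t rx[RX_BUF_SIZE];
    uint8_t guard_hi;
};

/* Clear the buffer and arm both guards. */
static void guarded_buf_init(struct guarded_buf *b)
{
    memset(b, 0, sizeof *b);
    b->guard_lo = GUARD_VALUE;
    b->guard_hi = GUARD_VALUE;
}

/* Returns 1 while both guards are intact, 0 once something has
   written just before or just past the array. */
static int guarded_buf_ok(const struct guarded_buf *b)
{
    return b->guard_lo == GUARD_VALUE && b->guard_hi == GUARD_VALUE;
}
```

A failed check does not say who did the overwrite, but it narrows down when it happened. Note that an overrun that happens to write the guard pattern itself would go unnoticed.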

Another thing: do you have a variable larger than one byte that gets read or written from two interrupts that may nest, or from main code and an interrupt handler? That can result in the main code sometimes seeing the variable in the middle of an update, for example a 16-bit pointer having only one byte updated while the other byte still has the old value. Take the pointer or index of a ring buffer: if you sample this index during an update, when the value isn't correct, and the incorrect value happens to be a specific magic value, your program may place the next received UART character at a memory-overwriting location.
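
A minimal sketch of one way to read such a two-byte value safely from main code, assuming the only writer is an ISR that finishes its whole update before the main code resumes. The names `rx_head_hi`, `rx_head_lo`, `rx_head_write` and `rx_head_read` are hypothetical. The index is stored as two separate bytes here to mirror what an 8-bit CPU actually does with a 16-bit variable.

```c
#include <stdint.h>

/* Index shared between an ISR (writer) and main code (reader). */
static volatile uint8_t rx_head_hi;
static volatile uint8_t rx_head_lo;

/* Writer side, as the ISR would do it: two separate byte stores. */
static void rx_head_write(uint16_t v)
{
    rx_head_hi = (uint8_t)(v >> 8);
    rx_head_lo = (uint8_t)(v & 0xFFu);
}

/* Reader side: read high, then low, then high again.
   If the high byte changed, an update slipped in between the
   reads, so the pair may be torn and the read is repeated. */
static uint16_t rx_head_read(void)
{
    uint8_t hi, lo;
    do {
        hi = rx_head_hi;
        lo = rx_head_lo;
    } while (hi != rx_head_hi);
    return (uint16_t)(((uint16_t)hi << 8) | lo);
}
```

The other usual option on an 8051 is to clear the global interrupt enable bit (EA) around the access, which is simpler but adds interrupt latency.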

Maybe your program has one specific call sequence that requires more stack space than all other call chains. If you get an interrupt just when inside this function, you may get a stack overflow, and the same variable assignment, register save or return address may corrupt the stack in such a way that the failure looks the same each time.
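
A common way to find out how much stack a call chain really needs is to fill the stack area with a known pattern at start-up and later count how much of the pattern survives. The sketch below shows the idea in portable C on a plain byte array; `stack_paint`, `stack_unused` and `STACK_FILL` are hypothetical names, and how to locate the actual task stack regions under RTX-51 is not covered here.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define STACK_FILL 0xCDu  /* arbitrary fill pattern, hypothetical choice */

/* Fill the whole stack region with a known pattern before it is used. */
static void stack_paint(uint8_t *stack, size_t size)
{
    memset(stack, STACK_FILL, size);
}

/* Count the bytes at the top end that still hold the pattern.
   Assumes the stack grows upward from stack[0], as the 8051
   hardware stack does. */
static size_t stack_unused(const uint8_t *stack, size_t size)
{
    size_t unused = 0;
    while (unused < size && stack[size - 1u - unused] == STACK_FILL) {
        unused++;
    }
    return unused;
}
```

If the unused count reaches zero after a long test run, the stack was filled completely and has quite possibly overflowed. The count can be slightly optimistic if the deepest pushed byte happens to equal the fill pattern.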

Since you are using an RTOS, you have to remember that your processor doesn't have any memory protection. Any memory overwrite that destroys a critical data structure in the RTOS can result in the RTOS performing an incorrect task switch. If your watchdog kick thread is always started in a given order, it may be the one that always gets affected by the memory overwrite.

Is there a possibility that you can deactivate different threads, and see if a specific thread seems to be involved in producing the error? Maybe the error only happens if you get ISR x activated while task y is running and visiting function z?

An RTOS on a small and "dumb" processor can be a real pain, because it is so darn hard to debug when you get a pointer/index error somewhere. With "just" interrupts, you can normally modify the code to manage without calling some parts of the code or with one or more ISRs deactivated. RTOS-based applications often have so many interlocks that it is hard to test-run subsets of the code. And it is even harder to try to simulate specific task-switch combinations based on task priorities, pending events, ...

This is a reason why an RTOS should be avoided if not really, really needed. Even if you stress-test a program for hundreds of hours, you can still only talk about probabilities/confidence that your stress tests have managed to perform task switches at the critical times. In the same way, this is a big reason why people try to design superloop applications with non-nested interrupts. Allowing nested interrupts quickly increases the total number of ways events can nest and quickly makes it impossible to force all combinations during testing. It's so much simpler to pay for a faster processor that can guarantee all critical timing with everything running in sequence - then you can manually pause a program, use your JTAG interface to trigger all possible events, and unpause the program. And you can select any part of the call chains of the main loop to force interrupt events, all the way to single-stepping individual assembler instructions while triggering one or more ISRs.

Hakan YAMANYAR over 15 years ago in reply to Per Westermark

Hello Per,

First of all, thank you so much for your very informative post. I think it will help a lot. If you don't mind, may I have your e-mail address so I can reach you after I have reviewed the points you mentioned in my source code?

Hakan YAMANYAR over 15 years ago in reply to Hakan YAMANYAR

By the way, I totally agree with your comments on not using an RTOS on a small processor. However, I took over the source from somebody who was responsible for the project, and it now seems quite hard to remove the RTOS from it.

Al Bradford over 15 years ago in reply to Hakan YAMANYAR

Erik means: NEVER use 'using 0' on an ISR. This gives the tools permission to trash any registers the ISR uses, so they are not intact after returning from the interrupt. Since Register Bank 0 is your default or "home" register bank, you never want to trash these registers by default.
Bradford

erik malund over 15 years ago in reply to Per Westermark

This is a reason why an RTOS should be avoided if not really, really needed. Even if you stress-test a program for hundreds of hours, you can still only talk about probabilities/confidence that your stress tests have managed to perform task switches at the critical times. In the same way, this is a big reason why people try to design superloop applications with non-nested interrupts. Allowing nested interrupts quickly increases the total number of ways events can nest and quickly makes it impossible to force all combinations during testing. It's so much simpler to pay for a faster processor that can guarantee all critical timing with everything running in sequence - then you can manually pause a program, use your JTAG interface to trigger all possible events, and unpause the program. And you can select any part of the call chains of the main loop to force interrupt events, all the way to single-stepping individual assembler instructions while triggering one or more ISRs.

My experience is that with an RTOS it is 10% easier to write the code and 1000% more difficult to debug the 'nasties'.
This comment, of course, is totally irrelevant since we all write bug-free code :)

Just one comment: Per may sound as if he is saying "do not use interrupts". I am sure he does not mean that; by interrupts he is, I presume, referring to 'task switching' interrupting a task.

Erik