Hello,
I have a major problem with RTX, and Keil don't seem to be able to help (they want a simple scenario that reproduces the problem, but I cannot give them the hardware, of course; maybe I can make it go wrong on an evaluation board). I'm using RTX as the backbone of a product that needs to run for extended periods of time without a reboot (weeks...). The problem is that RTX stops executing arbitrary tasks at arbitrary moments: they remain 'ready' but do not get serviced. Today I discovered a task entering 'WAIT_MUT' while not using ANY mutex. My question: are there any tips for using RTX correctly? I am growing totally frustrated and tired of this. What am I supposed to tell the client?! I'm using the latest, and very expensive, RL-ARM without any results whatsoever. Can you share your experience with me?
Thank you for your attention,
Tamir
Just as Stuart Wright notes, I was wondering if your thread could have got stuck on a mutex inside the ARM library even if your own code doesn't make use of any mutex.
How critical is a reboot?
Do you have any inter-process supervision, that could watchdog-reset your board if any thread gets stuck?
Do you have long boot times, or do you lose important synchronization information that takes time (or is impossible) to recollect?
I would recommend that you have a quite aggressive watchdog timeout, and design all threads so that they get woken up regularly even if you have no work for them. Whenever they wake up, they should sign off that they are alive. At the same time, they should compute how long it has been since they last got useful work to do, and decide whether they are unhappy either with a thread they expect data from, or with a thread that was expected to consume the data they produced.
In the end, the watchdog shouldn't be kicked unless all threads and (when applicable) interrupt handlers are alive and working.
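A minimal sketch of that sign-off scheme in plain C, for illustration only. The names `NUM_TASKS`, `task_sign_alive` and `supervisor_all_alive` are hypothetical and not part of RTX. On a real target the read-modify-write of the mask would need to be guarded by whatever critical-section mechanism the kernel provides, and the actual watchdog kick is hardware-specific and not shown.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_TASKS 3u   /* hypothetical number of supervised tasks */

/* One bit per supervised task: set by the task, cleared by the supervisor. */
static volatile uint32_t alive_mask;

/* Called by each task every time it wakes up, even when it has no work.
 * Assumption: on a real system this update is protected against preemption. */
void task_sign_alive(unsigned task_index)
{
    if (task_index < NUM_TASKS) {
        alive_mask |= (1u << task_index);
    }
}

/* Called periodically by the supervisor. Returns true only if every task
 * has signed in since the previous call; only then should the caller
 * kick the hardware watchdog. */
bool supervisor_all_alive(void)
{
    const uint32_t expected = (1u << NUM_TASKS) - 1u;
    bool ok = (alive_mask & expected) == expected;

    alive_mask = 0u;   /* start a new supervision period */
    return ok;
}
```

The supervisor would call `supervisor_all_alive()` at a period comfortably shorter than the watchdog timeout and skip the kick whenever it returns false, so a single stuck task is enough to force a reset.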
Having extra dummy events in the system may increase the processor load by perhaps 1%, but it is often critical to notice and react when the application is only partially working.
Per,
I have implemented most of your recommendations already. I have been able to induce a failure much faster by bombarding the controller with UART trash, and it seems the culprit is a chunk of code I added yesterday (at least this failure scenario is solved, I hope, but it does not explain the failures from before I wrote it!). Franc Urbank (the RTX guru) is trying to help, too. I will keep you informed; we'll know more after the weekend.
Tamir,
your problem sounds really serious. Crossing all available fingers that you'll catch it.
I am quite new to ARM & Keil, but have been in the embedded business for quite a while. My impression is that RTX in general lacks all the things that give "comfort" as a debug aid. Other OSes offer a lot more than Keil, e.g. Segger's embOS.
One of the first things I did was to write a sort of schedule monitor to see which task was consuming how much time. I found that the bare basics were there (the rt_agent_xx stuff), but it was basic and overall nonfunctional, so there was quite a bit left to code. Then I got myself some help to capture all exception-relevant data... and so on.
All of that is, in my opinion, stuff that should have been in the box, at least at that price.
What I'd do: a) Is there a chance that the mutexes might get modified accidentally by sick pointers? What are your application (and your coding style) like? On "good" days I manage really weird things ;))) Do you have the chance to compile on a different toolchain (e.g. Visual Studio at the highest warning level) to rule out such a problem?
b) Simply modify the OS. If you think it is a proper call that changes the mutex, why not simply track all mutex accesses (for a limited time, of course)? If you don't have that many accesses, maybe you can put them all into RAM, let the machine run a while and dump them after the next reset (but be sure to put them in a non-cleared area).
Hope these ideas are of some help. G O O D L U C K !! Uli
just to go in detail with b)
if you can simply get the mutex to change to a wrong value (as you mentioned above), try to track all intended mutex changes with their corresponding task ids. Maybe you can then detect, that there is no "set_wrong" at all or you can detect the failing task.
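A rough sketch of such a mutex trace in C, under stated assumptions: `mut_trace`, `mut_trace_get` and the entry layout are made up for illustration and are not RTX functions, and the calls would have to be added by hand around each mutex wait/release. Placing the buffer in a RAM section that is not cleared at startup is toolchain-specific and is not shown here.

```c
#include <stddef.h>
#include <stdint.h>

#define TRACE_DEPTH 8u   /* hypothetical depth; the oldest entries get overwritten */

typedef enum { MUT_WAIT = 1, MUT_RELEASE = 2 } mut_op_t;

typedef struct {
    uint8_t  task_id;    /* id of the task performing the access */
    uint8_t  op;         /* one of mut_op_t */
    uint16_t mutex_no;   /* application-defined mutex number */
} mut_trace_entry_t;

/* Assumption: on the target these two objects live in a no-init RAM
 * section so that they survive a reset and can be dumped afterwards. */
static mut_trace_entry_t trace_buf[TRACE_DEPTH];
static uint32_t trace_count;

/* Record one intended mutex access. */
void mut_trace(uint8_t task_id, mut_op_t op, uint16_t mutex_no)
{
    mut_trace_entry_t *e = &trace_buf[trace_count % TRACE_DEPTH];

    e->task_id  = task_id;
    e->op       = (uint8_t)op;
    e->mutex_no = mutex_no;
    trace_count++;
}

/* Fetch an entry by age (0 = most recent). Returns NULL if that entry
 * was never written or has already been overwritten. */
const mut_trace_entry_t *mut_trace_get(uint32_t age)
{
    if (age >= trace_count || age >= TRACE_DEPTH) {
        return NULL;
    }
    return &trace_buf[(trace_count - 1u - age) % TRACE_DEPTH];
}
```

If the mutex ends up in a wrong state and the trace shows no matching entry, the change did not come from an intended call, which points towards memory corruption rather than a logic error.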
even more luck ;) ULI
Uli,
Thanks for your reply. I made it home in a collapsed state :-)
Here is the deal: the system kept on crashing even after removing all mutexes. I did find 2 problems, though:
1. There was a path in the program that attempted to lock a mutex while in interrupt context. A little assembly magic solved that.
2. There was a problem in the communication task that could have locked the direction of the RS-485. I don't fully understand why, but the code is now more solid.
The controllers are now running on my desk. If they don't hang by Monday, I think I can consider the situation as an improvement, even though the system used to crash before I started using mutexes etc and ran for over a week before without a problem (only lately it became so unstable). I will post what happened.
what I really miss in uv3/4/RTX combination:
* LR per task
* locked system resources per task
* more aid to debug
but the greatest benefit would probably be something my processor (LPC2478) does not have: a MMU...
Here is an idea to improve RTX:
Offer a debug-mode binary / compile-time macro that causes the system to guard critical data with a checksum that is re-calculated every time the kernel is active. When a program overwrites something, thus altering the checksum, the processor calls a callback and hangs, providing the history of the last 100 milliseconds of operation in terms of tasks that ran and interrupts that occurred.
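To make the idea concrete, here is a small sketch in C of what such a guard could look like. This is not how RTX works and none of these names exist in RTX; `guard_sum`, `guard_seal` and `guard_check` are hypothetical, and a simple rotate-and-add checksum stands in for whatever a real kernel would choose. The kernel would seal its data on exit and check it on entry.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simple rotate-and-add checksum over a block of memory (not a CRC). */
static uint32_t guard_sum(const void *data, size_t len)
{
    const uint8_t *p = (const uint8_t *)data;
    uint32_t sum = 0u;

    for (size_t i = 0; i < len; i++) {
        sum = (sum << 1) | (sum >> 31);   /* rotate left by one bit */
        sum += p[i];
    }
    return sum;
}

/* Called when the kernel has finished modifying its data:
 * returns the checksum that is expected on the next entry. */
uint32_t guard_seal(const void *data, size_t len)
{
    return guard_sum(data, len);
}

/* Called on kernel entry: false means somebody else wrote to the data. */
bool guard_check(const void *data, size_t len, uint32_t sealed)
{
    return guard_sum(data, len) == sealed;
}
```

On a failed check, a debug build would call the user callback and halt. The cost is one pass over the guarded data on every kernel entry and exit, which adds directly to interrupt latency and task switch time.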
Something else that can help a lot is the kernel causing the processor to immediately go to abort mode, if an attempt is made to use anything that must not be used in any exception mode from anything but user mode.
Just out of interest, have you found the problem?
I see in:
http://www.keil.com/forum/docs/thread15089.asp
you talk about a wild pointer.
Was it that?
hell no, there was an erroneous path in the program that:
1. tried to lock a mutex during an exception. RTX hates that, for obvious reasons (fixed by interrogating the processor mode; no guard against IRQs is needed while in an IRQ). I added that mutex to guard the file system, as I knew that the function can be accessed in "parallel". I only forgot that one of the options is access while in IRQ mode...!
2. tried to access the SD card during an exception. RTX hates that, too (it is now moved into the RTX application itself, with a circular buffer. It was in the pipeline for months, no time to implement...!).
but my questions remain: why does RTX not die IMMEDIATELY when such a violation occurs? why does RTX not have a checksum to guard against wild pointers corrupting the kernel data? and why is it allowed to run for up to 3 days when case 1 happens?!?!?! this was a close call, but at least we learned something...
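For readers hitting the same trap, the circular-buffer fix from point 2 might look roughly like the sketch below: the interrupt handler only queues bytes, and an ordinary task drains them and does the SD card access. This is an illustration, not the code from the product; `log_put_from_isr`, `log_get_in_task` and `LOG_BUF_SIZE` are hypothetical names, and it assumes exactly one producer (the ISR) and one consumer (the task).

```c
#include <stdbool.h>
#include <stdint.h>

#define LOG_BUF_SIZE 16u   /* hypothetical size; must be a power of two */

static uint8_t log_buf[LOG_BUF_SIZE];
static volatile uint32_t head;   /* advanced by the ISR only */
static volatile uint32_t tail;   /* advanced by the task only */

/* Called from the interrupt handler: never blocks and takes no mutex.
 * Returns false if the buffer is full and the byte was dropped. */
bool log_put_from_isr(uint8_t byte)
{
    if ((head - tail) >= LOG_BUF_SIZE) {
        return false;
    }
    log_buf[head % LOG_BUF_SIZE] = byte;
    head++;
    return true;
}

/* Called from a normal task, which is then free to lock the file-system
 * mutex and write to the SD card. Returns false if nothing is queued. */
bool log_get_in_task(uint8_t *out)
{
    if (head == tail) {
        return false;
    }
    *out = log_buf[tail % LOG_BUF_SIZE];
    tail++;
    return true;
}
```

Because each index is written from one side only, no lock is needed between the ISR and the task on a single core; the price is that data is dropped when the task falls too far behind.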
why does RTX not die IMMEDIATELY when such a violation occurs?
Because customers (including you) hate it when OS kernels "waste" time on such "superfluous" checking. Do you have any idea what it would cost you, in terms of interrupt latency or task switch time, if the kernel checked all its data every time before performing the job you asked of it?
Hans, the answer to all of your questions is "yes". I forgot to mention here (but not in my official mail to Keil) that I would like to see such a debug mode for development purposes only.
I would like to see such a debug mode for development purposes only
That would disrupt the integrity of the program under debugging, and thereby invalidate the result. It's entirely possible that introducing such checks in a debug version of the program not only makes the bug go away (e.g. because it depended on timing details of the production version), but also causes the program to develop even worse ones (violated timing requirements, stack overflow, ...).
As people in aerospace put it: debug what you fly, and fly what you debugged.
"why does RTX not die IMMEDIATELY when such a violation occurs?"
<BeginAngryRant>
OMG! Is that supposed to be a serious question?
How could that occur unless you have some hardware to protect you from such things?
Have you considered using a part with MMU?
If Keil did include such a library, and it were to be used, then the rogue pointer might just access some other part of the system, maybe not the RTX data. What would you ask for then? For Keil to provide checksums over user application space?
<EndAngryRant>
That's the end of my 2 cents :)
Stephen,
I was dead serious, and still am.
too late for that now.
come on. I was specifically talking about a debug mode intended to verify that a mission-critical element is not harmed by application software in the absence of hardware protection. I am fully aware of its impact. Had you spent the past weeks desperately looking for a failure that causes RTX to fail at absolutely arbitrary moments without a processor exception of any kind, you would not have quoted me using "<xxxAngryRant>" tags...
"I was dead serious, and still am."
Hmmm.
You didn't answer the one about how you would expect Keil to carry out this 'immediate' magic.
As I implied, putting the extra code into RTX would possibly just end up hiding the problem. So its value would be pretty limited, wouldn't it?
It could also give a nasty false sense of security. Like "The RTX isn't throwing an error, therefore my code must be right".
Have you never been hit by the situation where an application would fail, but trying a build with the debug libraries (with the aim of narrowing down the problem) would cause the application to work again?