Hello,
I have a major problem with RTX, and Keil don't seem to be able to help (they want a simple scenario that causes the problem, but I cannot give them the hardware, of course. Maybe I can make it go wrong using an evaluation board). I'm using RTX as the backbone of a product that needs to run for extended periods of time without a reboot (weeks...). The problem is that RTX stops executing arbitrary tasks at arbitrary moments - they remain 'ready' but do not get serviced. Today I discovered a task entering 'WAIT_MUT' while not using ANY mutex. My question: are there any tips for using RTX correctly? I am growing totally frustrated and tired of this; what am I supposed to tell the client?! I'm using the latest (and very expensive) RL-ARM without any results whatsoever. Can you share your experience with me?
Thank you for your attention,
Tamir
Tamir,
your problem sounds really serious. I'm crossing all available fingers that you'll catch it.
I am quite new to ARM & Keil, but have been in the embedded business for quite a while. My impression is that RTX in general lacks all the things that give "comfort" as a debug aid. Other OSes offer a lot more compared to Keil, e.g. Segger's embOS.
One of the first things I did was to write a sort of schedule monitor to see which task was consuming how much time. I found that the bare basics were there (the rt_agent_xx stuff), but it was basic and overall nonfunctional, so quite a bit was left to code. Then I got myself a helper to capture all exception-relevant data ... and so on.
All of that is - in my opinion - stuff that should have been in the box, at least at that price.
What I'd do: a) Is there a chance that the mutexes might get modified accidentally by sick pointers? What are your application (and your coding style) like? On "good" days I manage really weird things ;))) Do you have the chance to compile on a different machine (e.g. Visual Studio at the highest warning level) to rule out such a problem?
b) Simply modify the OS. If you think it is a proper call that changes the mutex, why not simply track all mutex accesses (for a limited time, of course)? If you don't have too many accesses, maybe you can put them all into RAM, let the machine run for a while and dump them after the next reset (but be sure to put them in an area that is not cleared at startup).
Hope these ideas were of some help. G O O D L U C K !! Uli
just to go into detail on b)
if you can easily get the mutex to change to a wrong value (as you mentioned above), try to track all intended mutex changes with their corresponding task IDs. Maybe you can then detect that there is no "set_wrong" at all, or you can identify the failing task.
even more luck ;) ULI
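A minimal sketch of the kind of mutex trace suggested in b), in plain C. All names here (mut_trace_log, MUT_TRACE_LEN, the entry layout) are made up for illustration and are not part of RTX. It assumes the log call is placed next to each mutex wait/release in the application, that the caller supplies the task ID and a tick value, and that on the target the buffer is placed in a RAM section the startup code does not clear (how to do that is toolchain-specific and not shown).

```c
#include <stdint.h>

#define MUT_TRACE_LEN   64u          /* must be a power of two (index is masked) */
#define MUT_TRACE_MAGIC 0x4D555458u  /* marks the buffer as already initialised */

typedef enum { MUT_OP_WAIT = 1, MUT_OP_RELEASE = 2 } mut_op_t;

typedef struct {
    uint32_t tick;      /* caller-supplied timestamp */
    uint8_t  task_id;   /* task that made the call */
    uint8_t  op;        /* one of mut_op_t */
    uint16_t mutex_no;  /* application-defined number of the mutex */
} mut_trace_entry_t;

typedef struct {
    uint32_t magic;     /* equals MUT_TRACE_MAGIC once initialised */
    uint32_t head;      /* index of the next slot to write */
    uint32_t count;     /* number of valid entries, at most MUT_TRACE_LEN */
    mut_trace_entry_t entry[MUT_TRACE_LEN];
} mut_trace_t;

/* Assumption: on the target this object lives in a non-cleared RAM area. */
static mut_trace_t mut_trace;

/* Reset the log only if it does not already hold data from a previous run. */
void mut_trace_init(void)
{
    if (mut_trace.magic != MUT_TRACE_MAGIC) {
        mut_trace.magic = MUT_TRACE_MAGIC;
        mut_trace.head  = 0;
        mut_trace.count = 0;
    }
}

/* Record one intended mutex access, overwriting the oldest entry when full.
 * Not reentrant: call it where it cannot be preempted by another logger. */
void mut_trace_log(uint32_t tick, uint8_t task_id, mut_op_t op, uint16_t mutex_no)
{
    mut_trace_entry_t *e = &mut_trace.entry[mut_trace.head];
    e->tick     = tick;
    e->task_id  = task_id;
    e->op       = (uint8_t)op;
    e->mutex_no = mutex_no;
    mut_trace.head = (mut_trace.head + 1u) & (MUT_TRACE_LEN - 1u);
    if (mut_trace.count < MUT_TRACE_LEN) {
        mut_trace.count++;
    }
}

/* Number of entries currently available for dumping. */
uint32_t mut_trace_count(void)
{
    return mut_trace.count;
}

/* Entry i in age order (0 = oldest); returns a null pointer if i is out of range. */
const mut_trace_entry_t *mut_trace_get(uint32_t i)
{
    uint32_t start;
    if (i >= mut_trace.count) {
        return 0;
    }
    start = (mut_trace.head + MUT_TRACE_LEN - mut_trace.count) & (MUT_TRACE_LEN - 1u);
    return &mut_trace.entry[(start + i) & (MUT_TRACE_LEN - 1u)];
}
```

After a hang and reset, calling mut_trace_init() leaves the old entries in place, so they can be dumped with mut_trace_get() before normal operation resumes. A mutex that ends up in a wrong state with no matching entry in the log would point towards a stray write rather than a proper call.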
Uli,
Thanks for your reply. I made it home in a collapsed state :-)
Here is the deal: the system kept on crashing even after removing all mutexes. I did find 2 problems, though: 1. There was a path in the program that attempted to lock a mutex while in interrupt context. A little assembly magic solved that. 2. There was a problem in the communication task that could have locked the direction of the RS-485. I don't fully understand why, but the code is now more solid.
The controllers are now running on my desk. If they don't hang by Monday, I think I can consider the situation as an improvement, even though the system used to crash before I started using mutexes etc and ran for over a week before without a problem (only lately it became so unstable). I will post what happened.
what I really miss in the uv3/4/RTX combination:
* LR per task
* locked system resources per task
* more aid to debug
but the greatest benefit would probably be something my processor (LPC2478) does not have: a MMU...
Here is an idea to improve RTX:
Offer a debug-mode binary / compile-time macro that causes the system to guard critical data with a checksum that is re-calculated every time the kernel is active. When a program overwrites something, thus altering the checksum, the processor calls a callback and hangs, providing the history of the last 100 milliseconds of operation in terms of tasks that ran and interrupts that occurred.
Something else that would help a lot is the kernel forcing the processor into abort mode immediately if an attempt is made to use anything that must not be used in an exception mode from anything but user mode.
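To make the checksum idea concrete, here is a rough sketch in plain C. It is not RTX code and nothing like it is known to exist in RL-ARM; guard_t, guard_arm, guard_update and guard_check are hypothetical names. It assumes the code that legitimately changes the guarded data calls guard_update() afterwards, and that guard_check() is called whenever the kernel becomes active. The checksum used is an Adler-32 style sum, chosen only because it is short.

```c
#include <stddef.h>
#include <stdint.h>

/* Called when the guarded region no longer matches its stored checksum. */
typedef void (*guard_fail_cb_t)(const void *region, size_t len);

typedef struct {
    const volatile uint8_t *base;  /* start of the guarded data */
    size_t          len;           /* size of the guarded data in bytes */
    uint32_t        sum;           /* checksum at the last legitimate update */
    guard_fail_cb_t on_fail;       /* may be a null pointer */
} guard_t;

/* Adler-32 style checksum over the region. */
static uint32_t guard_sum(const volatile uint8_t *p, size_t len)
{
    uint32_t a = 1u, b = 0u;
    size_t i;
    for (i = 0; i < len; i++) {
        a = (a + p[i]) % 65521u;
        b = (b + a) % 65521u;
    }
    return (b << 16) | a;
}

/* Start guarding a region; its current contents are taken as valid. */
void guard_arm(guard_t *g, const volatile void *base, size_t len, guard_fail_cb_t cb)
{
    g->base    = (const volatile uint8_t *)base;
    g->len     = len;
    g->on_fail = cb;
    g->sum     = guard_sum(g->base, g->len);
}

/* Call after every legitimate modification of the guarded data. */
void guard_update(guard_t *g)
{
    g->sum = guard_sum(g->base, g->len);
}

/* Returns 1 if the data is intact; otherwise invokes the callback and returns 0. */
int guard_check(const guard_t *g)
{
    if (guard_sum(g->base, g->len) == g->sum) {
        return 1;
    }
    if (g->on_fail != 0) {
        g->on_fail((const void *)g->base, g->len);
    }
    return 0;
}
```

Note that a check like this only fires the next time it runs, so a corruption is detected with a delay, and its cost grows with the size of the guarded region.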
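A small sketch of the mode test such a guard would rest on, assuming an ARM7/ARM9 core where the low five bits of the CPSR hold the processor mode. The function name cpsr_is_task_mode is hypothetical, and treating User and System mode as "task context" is an assumption about how tasks are run, not something taken from the RTX documentation. Reading the CPSR on the target needs an MRS instruction through compiler-specific means, which is left out here; the function just takes the value.

```c
#include <stdint.h>

/* Processor mode field of the CPSR on ARM7/ARM9 class cores. */
#define CPSR_MODE_MASK 0x1Fu
#define CPSR_MODE_USR  0x10u  /* User */
#define CPSR_MODE_FIQ  0x11u  /* Fast interrupt */
#define CPSR_MODE_IRQ  0x12u  /* Interrupt */
#define CPSR_MODE_SVC  0x13u  /* Supervisor */
#define CPSR_MODE_ABT  0x17u  /* Abort */
#define CPSR_MODE_UND  0x1Bu  /* Undefined instruction */
#define CPSR_MODE_SYS  0x1Fu  /* System */

/* Returns 1 if the given CPSR value belongs to User or System mode,
 * 0 for any exception mode (FIQ, IRQ, SVC, Abort, Undefined). */
int cpsr_is_task_mode(uint32_t cpsr)
{
    uint32_t mode = cpsr & CPSR_MODE_MASK;
    return (mode == CPSR_MODE_USR) || (mode == CPSR_MODE_SYS);
}
```

A wrapper around a blocking call could test this first and stop (or log) when it is entered from an interrupt handler, instead of letting the problem surface days later.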
Just out of interest, have you found the problem?
I see in:
http://www.keil.com/forum/docs/thread15089.asp
you talk about a wild pointer.
Was it that?
hell no, there was an erroneous path in the program that: 1. tried to lock a mutex during an exception. RTX hates that, for obvious reasons (fixed by interrogating the processor mode; no guard against IRQs is needed during an IRQ). I added that mutex to guard the file system, as I knew that the function can be accessed in "parallel". I only forgot that one of the options is access while in IRQ mode...! 2. tried to access the SD card during an exception. RTX hates that, too (it has now been moved into the RTX application itself, with a circular buffer. It was in the pipeline for months, no time to implement...!).
but my questions remain: why does RTX not die IMMEDIATELY when such a violation occurs? why does RTX not have a checksum to guard against wild pointers corrupting the kernel data? and why is it able to run for up to 3 days when case 1 happens?!?!?! this was a close call, but at least we learned something...
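For readers wondering what the circular buffer in point 2 might look like: a minimal single-producer, single-consumer byte queue in plain C, where the interrupt handler only stores data and a task later drains it and does the slow SD card access. This is a sketch under assumptions, not the code from the product; ringbuf_t, rb_put and rb_get are hypothetical names, and it assumes a single core with exactly one writer (the ISR) and one reader (the task).

```c
#include <stdint.h>

#define RB_SIZE 256u  /* must be a power of two (index is masked) */

typedef struct {
    volatile uint32_t head;  /* total bytes written, only changed by the ISR */
    volatile uint32_t tail;  /* total bytes read, only changed by the task */
    uint8_t data[RB_SIZE];
} ringbuf_t;

/* Producer side (interrupt handler): returns 1 on success, 0 if the buffer is full. */
int rb_put(ringbuf_t *rb, uint8_t byte)
{
    uint32_t head = rb->head;
    if ((uint32_t)(head - rb->tail) >= RB_SIZE) {
        return 0;  /* full: the caller decides whether to drop or count the loss */
    }
    rb->data[head & (RB_SIZE - 1u)] = byte;
    rb->head = head + 1u;  /* publish only after the data byte is stored */
    return 1;
}

/* Consumer side (task): returns 1 and stores a byte in *out, or 0 if the buffer is empty. */
int rb_get(ringbuf_t *rb, uint8_t *out)
{
    uint32_t tail = rb->tail;
    if (tail == rb->head) {
        return 0;  /* empty */
    }
    *out = rb->data[tail & (RB_SIZE - 1u)];
    rb->tail = tail + 1u;
    return 1;
}
```

Because each index is written by only one side, no mutex is needed, which is the point: nothing in the interrupt path can block.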
why does RTX not die IMMEDIATELY when such a violation occurs?
Because customers (including you) hate it when OS kernels "waste" time on such "superfluous" checking. Do you have any idea what it would cost you, in terms of interrupt latency or task switch time, if the kernel checked all its data every time before performing the job you asked of it?
Hans, the answer to all of your questions is "yes". I forgot to mention here (but not in my official mail to Keil) that I would like to see such a debug mode for development purposes only.
I would like to see such a debug mode for development purposes only
That would disrupt the integrity of the program under debugging, and thereby invalidate the result. It's entirely possible that introducing such checks in a debug version of the program not only makes the bug go away (e.g. because it depended on timing details of the production version), but also causes the program to develop even worse ones (violated timing requirements, stack overflow, ...).
As people in aerospace put it: debug what you fly, and fly what you debugged.
"why does RTX not die IMMEDIATELY when such a violation occurs?"
<BeginAngryRant>
OMG! Is that supposed to be a serious question?
How could that occur unless you have some hardware to protect you from such things?
Have you considered using a part with MMU?
If Keil did include such a library, and it were to be used, then the rogue pointer might just access some other part of the system, maybe not the RTX data. What would you ask for then? For Keil to provide checksums over user application space?
<EndAngryRant>
That's the end of my 2 cents :)
Stephen,
I was dead serious, and still am.
too late for that now.
come on. I was specifically talking about a debug mode intended to verify that a mission-critical element is not harmed by application software in the absence of hardware protection. I am fully aware of the impact of it. Had you spent the past weeks desperately looking for a failure that causes RTX to fail at absolutely arbitrary moments without a processor exception of any kind, you would not have quoted me using "<xxxAngryRant>" tags...
"I was dead serious, and still am."
Hmmm.
You didn't answer the one about how you would expect Keil to carry out this 'immediate' magic.
As I implied, putting the extra code into RTX would possibly just end up hiding the problem. So its value would be pretty limited. Wouldn't it?
It could also give a nasty false sense of security. Like "The RTX isn't throwing an error, therefore my code must be right".
Have you never been hit by the situation where an application would fail, but trying a build with the debug libraries (with the aim of narrowing down the problem) would cause the application to work again?
I truly don't see the problem, given knowledge of the underlying chip. one of the issues above is independent of any hardware consideration, as surely you can see. having done a similar thing in an OS I have written for an STR9, I know with complete certainty that it is possible.
but you can apply this logic to so many other factors that influence a system. the point is this: a silent RTX means nothing at all. But a failing RTX means that you most definitely do something wrong! another handy feature could be a recording of the last, say, 100 ms in terms of which tasks ran and what interrupts occurred (with a timestamp). how much RAM does that cost? how much time does it save if the system fails and you have a post-mortem log?
"the point is this: a silent RTX means nothing at all. But a failing RTX means that you most definitely do something wrong!"
Absolutely - It's like the maxim, you can prove something doesn't work, but not that it works 100%. Unless, of course, you believe the work of statisticians.
Unfortunately - Corrupt pointers are not normally considerate enough to scribble over the things you want them to scribble over.
I doubt very much that you can detect this corruption IMMEDIATELY as you suggest. On something like an STR9 there would surely always be a delay. There would probably have to be a check at the next timeslice or OS call. A lot can happen during those delays.
Oh ... And if you have this extra code and data for the purposes of checking execution history, you'd better protect that region as well from invalid pointer corruption. Consider an invalid pointer being used as an argument for a call to memset - Whoosh, lots of trashed data!