Hello,
I have a major problem with RTX, and Keil don't seem to be able to help (they want a simple scenario that reproduces the problem, but of course I cannot give them the hardware. Maybe I can make it go wrong on an evaluation board). I'm using RTX as the backbone of a product that needs to run for extended periods of time without a reboot (weeks...). The problem is that RTX stops executing arbitrary tasks at arbitrary moments - they remain 'ready' but never get serviced. Today I discovered a task entering 'WAIT_MUT' while not using ANY mutex.

My question: are there any tips for using RTX correctly? I am growing totally frustrated and tired of this. What am I supposed to tell the client?! I'm using the latest (and expensive) RL-ARM without any results whatsoever. Can you share your experience with me?
Thank you for your attention,
Tamir
why does RTX not die IMMEDIATELY when such a violation occurs?
Because customers (including you) hate it when OS kernels "waste" time on such "superfluous" checking. Do you have any idea what it would cost you, in terms of interrupt latency or task switch time, if the kernel checked all its data every time before performing the job you asked of it?
Hans, the answer to all of your questions is "yes". I forgot to mention here (but not in my official mail to Keil) that I would like to see such a debug mode for development purposes only.
I would like to see such a debug mode for development purposes only
That would disrupt the integrity of the program under debugging, and thereby invalidate the result. It's entirely possible that introducing such checks in a debug version of the program not only makes the bug go away (e.g. because it depended on timing details of the production version), but also causes the program to develop even worse ones (violated timing requirements, stack overflow, ...).
As people in aerospace put it: debug what you fly, and fly what you debugged.
"why does RTX not die IMMEDIATELY when such a violation occurs?"
<BeginAngryRant>
OMG! Is that supposed to be a serious question?
How could that occur unless you have some hardware to protect you from such things?
Have you considered using a part with MMU?
If Keil did include such a library, and it were to be used, then the rogue pointer might just access some other part of the system, maybe not the RTX data. What would you ask for then? For Keil to provide checksums over user application space?
<EndAngryRant>
That's the end of my 2 cents :)
Stephen,
I was dead serious, and still am.
Too late for that now.

Come on. I was specifically talking about a debug mode intended to verify that a mission-critical element is not harmed by application software in the absence of hardware protection. I am fully aware of its impact. Had you spent the past weeks desperately looking for a failure that causes RTX to fail at absolutely arbitrary moments, without a processor exception of any kind, you would not have quoted me using "<xxxAngryRant>" tags...
"I was dead serious, and still am."
Hmmm.
You didn't answer the one about how you would expect Keil to carry out this 'immediate' magic.
As I implied, putting the extra code into the RTX would possibly just end up hiding the problem. So its value would be pretty limited. Wouldn't it?
It could also give a nasty false sense of security. Like "The RTX isn't throwing an error, therefore my code must be right".
Have you never been hit by the situation where an application would fail, but trying a build with the debug libraries (with the aim of narrowing down the problem) would cause the application to work again?
I truly don't see the problem, given knowledge of the underlying chip. One of the issues above is independent of any hardware consideration, as surely you can see. Having done a similar thing in an OS I wrote for an STR9, I know with complete certainty that it is possible.
But you can apply this logic to so many other factors that influence a system. The point is this: a silent RTX means nothing at all, but a failing RTX means that you most definitely did something wrong! Another handy feature could be a recording of the last, say, 100 ms in terms of which tasks ran and which interrupts occurred (with a timestamp). How much RAM does that cost? How much time does it save if the system fails and you have a post-mortem log? A rough sketch of what I mean follows.
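Just to show what I mean - a minimal sketch in C, with everything hypothetical: TRACE_DEPTH is a guess at "about 100 ms of events", timer_now() stands in for whatever free-running hardware timer the target provides, and the hooks would have to be wired into the task switcher and every ISR by hand.

    #include <stdint.h>

    #define TRACE_DEPTH 256u           /* must be a power of two */

    extern uint32_t timer_now(void);   /* assumed: reads a free-running timer */

    typedef struct {
        uint32_t timestamp;            /* raw timer value                   */
        uint8_t  is_isr;               /* 1 = interrupt entry, 0 = task run */
        uint8_t  id;                   /* task number or IRQ number         */
    } trace_event_t;

    static volatile trace_event_t trace_log[TRACE_DEPTH];
    static volatile uint32_t      trace_head;

    /* Call from the task-switch hook and at the top of every ISR.
       Making trace_head++ atomic (e.g. by briefly disabling interrupts)
       is left out for brevity. */
    void trace_record(uint8_t is_isr, uint8_t id)
    {
        uint32_t i = trace_head++ & (TRACE_DEPTH - 1u);
        trace_log[i].timestamp = timer_now();
        trace_log[i].is_isr    = is_isr;
        trace_log[i].id        = id;
    }

At 8 bytes per entry that is 2 KB of RAM for the last 256 events - cheap compared to weeks of chasing a ghost.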
"the point this is: a silent RTX means nothing at all. But a failing RTX means that you most definitely do something wrong!
Absolutely - It's like the maxim, you can prove something doesn't work, but not that it works 100%. Unless, of course, you believe the work of statisticians.
Unfortunately - Corrupt pointers are not normally considerate enough to scribble over the things you want them to scribble over.
I doubt very much that you can detect this corruption IMMEDIATELY as you suggest. On something like an STR9 there would surely always be a delay. There would probably have to be a check at the next timeslice or OS call. A lot can happen during those delays.
Oh ... And if you have this extra code and data for the purposes of checking execution history, you'd better protect that region as well from invalid pointer corruption. Consider an invalid pointer being used as an argument for a call to memset - Whoosh, lots of trashed data!
"immediately" was certainly an inappropriate term to be used here.
I was specifically talking about a debug mode intended to verify that a mission-critical element is not harmed by application software in the absence of hardware protection.
As you've been told quite a number of times now, that is utterly impossible. No amount of testing can ever verify anything.
And a debug version that's not identical to the real program can prove even less than that. The fact that a suspected wild pointer doesn't hit the supervised, critical data of the debug version doesn't mean anything at all for the critical data of the release version. The critical data may be in a different place, or there may be more of it (to implement all that testing), or the wild pointer may point elsewhere.
You're chasing a unicorn.
Hans, thanks for your input. I respect and fully understand everyone's comments, but unicorn or no unicorn, I am going to try it because I still think it has a good chance of working, given the following restriction:
* The OS data must be positioned at a predefined location (easy to do with a scatter file), preferably at the beginning of (external) RAM if possible. This eliminates many possibilities by preventing critical data regions from mingling with other data; a sketch follows below.
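For illustration, a fragment of the kind of scatter file I mean. All addresses and sizes are invented, and which object files actually hold the kernel data depends on the RTX variant - adjust to the real memory map:

    ; Hypothetical: pin the kernel data to the start of external RAM
    ; so nothing else can sit next to it.
    LR_ROM 0x00000000 0x00080000 {
        ER_ROM 0x00000000 0x00080000 {
            * (+RO)                       ; all code and constants
        }
        OS_RAM 0x80000000 0x00002000 {    ; start of external RAM
            RTX_Config.o (+RW +ZI)        ; kernel data goes here first
        }
        ER_RAM 0x80002000 0x000FE000 {    ; everything else, above it
            * (+RW +ZI)
        }
    }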
eliminates many possibilities
The problem is that "many" just is not enough.
"Not good enough" here means that such a test can't prove correctness.
But a test that has an x% probability of pinpointing the location of an error can still be meaningful.
The big problem here is to estimate how large that percentage would be, i.e. the gain in relation to the cost.
The important thing to note is that checksummed data structures don't lead to correct programs. Checksumming is only a way to _maybe_ detect corruption.
In this case, checksumming could possibly tell what task was running during the corruption. And if all ISRs set a flag, then checksumming could possibly add a list of potential ISRs to look closer at. But checksumming would possibly point at the wrong task, in case the memory corruption is caused by a DMA transfer, started by another thread but creating the corruption after a task switch.
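As a sketch of how such a detector might look (all names are invented, the region bounds are assumed to come from a scatter-file placement like the one above, and current_task_id() stands in for the real RTX query): the kernel "seals" its data region with a checksum whenever it finishes updating it, and verifies the seal on the next entry. A mismatch then means something outside the kernel wrote the region in between.

    #include <stdint.h>

    extern uint32_t os_data_start[];   /* region bounds, e.g. exported */
    extern uint32_t os_data_end[];     /* by the linker/scatter file   */
    extern uint8_t  current_task_id(void);  /* assumed helper */

    static uint32_t seal;
    volatile uint8_t suspect_task;     /* inspect post mortem */

    static uint32_t region_sum(void)
    {
        uint32_t s = 0;
        const uint32_t *p;
        for (p = os_data_start; p < os_data_end; ++p)
            s += *p;                   /* a real CRC is better, but slower */
        return s;
    }

    void os_data_seal(void)            /* call on every kernel exit  */
    {
        seal = region_sum();
    }

    void os_data_verify(void)          /* call on every kernel entry */
    {
        if (region_sum() != seal) {
            suspect_task = current_task_id();
            for (;;) ;                 /* halt so a debugger can look */
        }
    }

Note that this can only name the task that ran between two kernel entries - which is exactly why the DMA case above fools it.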
But checksumming would possibly point at the wrong task, in case the memory corruption is caused by a DMA transfer, started by another thread but creating the corruption after a task switch.
Ouch, you are so right. I overlooked that one...!
Per,
The point you made about the DMA transfers is indeed an issue. I never meant this to be anything more than a possible little help in case things are that much out of control (believe me, they were until a couple of days ago - nervous clients, nervous boss, nervous keyboard...). I don't think Keil are going to do this with RTX (there are other, more pressing issues...) - let's leave it as an intellectual exercise.
I regularly look at checksumming as one of the available tools to detect problems, but prefer to use it in situations where it can be included in the release build. As previously mentioned, it is best to test the same build that is expected to ship. It is enough for a single byte in RAM or flash to differ for the debug build to pass all tests (even if buggy) while the release build fails - possibly in a routine the customer will only trigger once every three months.
The reason I posted was that Hans-Bernhard Broeker's post was aimed at pointing out that checksumming can't validate something as correct. But that is a separate issue from using it as a tool to detect something broken. A bigger issue with checksumming (at least when used in release builds) is deciding what action to perform in case of a checksum error: auto-repair, reboot, deadlock, warn, ... A sketch of that choice follows.
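Purely as a sketch of that decision (the helper names are invented; the right policy depends entirely on the product):

    /* Assumed helpers - every product has its own versions of these. */
    extern void log_event(const char *msg);
    extern void save_crash_record(void);
    extern void system_reset(void);

    typedef enum { CS_WARN, CS_REBOOT, CS_HALT } cs_policy_t;

    void on_checksum_error(cs_policy_t policy)
    {
        switch (policy) {
        case CS_WARN:                  /* log and limp on: risky, but  */
            log_event("OS data checksum mismatch");  /* stays running  */
            break;
        case CS_REBOOT:                /* controlled restart, but save */
            save_crash_record();       /* the evidence first           */
            system_reset();
            break;
        case CS_HALT:                  /* deadlock on purpose, so that a  */
        default:                       /* watchdog or technician steps in */
            for (;;) ;
        }
    }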