os_get_first data abort (caused by os_rdy)

Hello,

I am having an issue with RL-ARM RTX where I get a data abort in the os_get_first function.

The data abort occurs because the p_lnk pointer of my os_rdy list has been loaded with an invalid address. It appears that a non-existent task has somehow worked its way onto the os_rdy list; that is, at some point os_rdy.p_lnk was 0, and the kernel then did this: os_rdy.p_lnk = os_rdy.p_lnk->p_lnk.
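
To make the failure mode described above concrete, here is a minimal host-side sketch in C. It is not the RTX kernel source: the struct layout, the task_id field, and the function name get_first_checked are assumptions made for illustration, and only the p_lnk field name comes from the post. It shows why unlinking the first entry with head->p_lnk = head->p_lnk->p_lnk dereferences address 0 when the list is empty, and how a guarded version would catch that case instead.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a task control block.
   Only the p_lnk field name is taken from the post. */
struct tcb {
    struct tcb *p_lnk;    /* next task in the ready list, NULL at the end */
    int         task_id;  /* illustrative payload */
};

/* Remove and return the first entry of the list that hangs off 'head'.
   The unguarded form is simply:
       head->p_lnk = head->p_lnk->p_lnk;
   which reads through a NULL pointer when head->p_lnk is 0.
   This guarded variant returns NULL in that case instead. */
struct tcb *get_first_checked(struct tcb *head)
{
    struct tcb *first;

    assert(head != NULL);

    first = head->p_lnk;
    if (first == NULL) {
        return NULL;              /* empty list: the unguarded form faults here */
    }
    head->p_lnk  = first->p_lnk;  /* unlink the first entry */
    first->p_lnk = NULL;
    return first;
}
```

On a target, a check like this would not fix the underlying problem, but it could turn the data abort into a condition that can be trapped with a breakpoint at the moment the list is first seen empty.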

I have found forum posts of other people having this problem -- unfortunately no solutions were offered and I am unable to reply to their threads:
http://www.keil.com/forum/docs/thread12032.asp
http://www.keil.com/forum/docs/thread12671.asp
http://www.keil.com/forum/docs/thread7618.asp

This condition occurs very rarely -- on the order of once every 24 hours. It always seems to occur shortly after an interrupt that makes use of the isr_mbx_send and isr_evt_set functions -- but this might be coincidence.

I am running RV MDK V3.70 and RL-ARM V3.70. My MCU is the LPC2468.

Any advice would be greatly appreciated!

Thanks,
Eric

Eric Severson over 16 years ago in reply to ryan williams

Well, yes and no. I don't know what the direct cause of the problem is, but I did manage to make it go away by doing either of two things:

1: Removing the isr_ OS calls from my interrupts (namely isr_evt_set and isr_sem_send). Are you using either of these functions?

2: Consolidating some of our tasks so that we only had eight running. How many tasks do you have running?

Either of these fixes completely removed the problem for us. Strangely enough, setting all of our vectored interrupts to the same priority made the problem happen much more frequently.

I should note that it is possible our code was corrupting the OS. However, all of the OS structures looked uncorrupted at the instant of the problem. Keil has not had any luck tracking this down either, but I recommend calling them and opening a support case. You can have them reference my case: 438460.

Please keep me posted as you debug this. I am very interested in finding the cause of this problem.

-Eric

ryan williams over 16 years ago in reply to Eric Severson

Yes, I am using isr_evt_set and other isr_xxx functions often. Completely removing them will take some time, but I will try it.

I have something like 10-15 tasks, depending on the user's actions.

I will post if I can verify that something is a cause of this.

Eric Severson over 16 years ago in reply to ryan williams

You might try the latest RL-ARM and RVMDK (3.80). I see in the release notes that they fixed some problems with isr_xxx functions.

Unfortunately, version 3.80 did not fix our problems.

What microcontroller are you using?

Tamir Michael over 16 years ago in reply to Eric Severson

Eric,

I have a situation in which the ready list contains entries that point to themselves! This is very frustrating and problematic for the product and the client. I am running out of patience and, what is infinitely worse, time...

see here:

http://www.keil.com/forum/docs/thread15337.asp

I really don't know if this is the result of data corruption. The failure is just too slick: it is always the same. The problem is easy to reproduce on the product when the tick rate is 50 microseconds, but I have written a separate test program that does not show it! So it is either timing related, or data corruption, etc.
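
A ready list whose entries point to themselves can be detected with a bounded walk. The sketch below is a host-side illustration in C, not RTX code: the rdy_node struct, the list_state enum, and check_ready_list are hypothetical names, and only the p_lnk field name is borrowed from the thread. It assumes a NULL-terminated singly linked list and a known upper bound on the number of tasks.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical list node; only the p_lnk field name comes from the thread. */
struct rdy_node {
    struct rdy_node *p_lnk;   /* next entry, NULL at the end of the list */
};

enum list_state {
    LIST_OK = 0,       /* reached NULL within the allowed length */
    LIST_SELF_LINK,    /* an entry points to itself */
    LIST_TOO_LONG      /* more entries than max_nodes: likely a longer cycle */
};

/* Walk the list that hangs off 'head' and report the first inconsistency.
   'max_nodes' is the largest number of entries the list may legally hold,
   for example the configured number of tasks. */
enum list_state check_ready_list(const struct rdy_node *head, size_t max_nodes)
{
    const struct rdy_node *n;
    size_t count = 0;

    assert(head != NULL);

    for (n = head->p_lnk; n != NULL; n = n->p_lnk) {
        if (n->p_lnk == n) {
            return LIST_SELF_LINK;
        }
        if (++count > max_nodes) {
            return LIST_TOO_LONG;
        }
    }
    return LIST_OK;
}
```

Called from a debug hook with interrupts disabled, a check along these lines might help narrow down when the list first becomes inconsistent, which would be earlier than the point where the abort is finally taken.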

ryan williams over 16 years ago in reply to Eric Severson

I have removed all isr_xxx functions, but the problem wasn't solved, so I put them back in my code.

I was using a 1 ms tick time, and it would normally crash within 1-5 days. I reduced the tick time to 50 us and it now crashes within 1-10 seconds, always at the same place in os_get_first.

I don't have a support case going yet because our support expired this month. I don't know yet whether we are going to renew it.

Eric Severson over 16 years ago in reply to ryan williams

This sounds a lot like what Tamir is finding.

Is there any way that you can get this to happen in the simulator and share a sample project? It would be really interesting to have Keil take a look at this.

Tamir Michael over 16 years ago in reply to Eric Severson

Eric, Ryan,

I found that a tick rate of 10 milliseconds yields a much more stable system. I have not seen any crashes so far with this tick rate. Nevertheless, this issue must be fixed, and it will be: as far as I know, Keil is working on it right now.

Tamir Michael over 16 years ago in reply to Eric Severson

Eric, Ryan,

What chips (and which revisions) are you using? Eric, I thought you were using an LPC2468, but which hardware revision?

Franc Urbanc over 16 years ago in reply to Tamir Michael

Please read:

http://www.keil.com/forum/docs/thread15346.asp

ryan williams over 16 years ago in reply to Franc Urbanc

Yes, I'm using an LPC2468, an older revision.

I had set my clock as the errata says, but didn't see the MAM setting. My startup file was modified from an example that came with my EA LPC2468 dev board.

So now I've set MAM to 1 instead of 2, and it's not crashing with the short tick. I'll have to let it run for days to know for sure.

ryan williams over 16 years ago in reply to ryan williams

I was wrong about this. It is still crashing, just less often. The effect is about the same as using a larger tick time.

Tamir Michael over 16 years ago in reply to ryan williams

This is a very worrying report, Ryan. Have you tried my test program, which can be found here: http://www.keil.com/forum/docs/thread15346.asp ?
Does it still crash on your hardware? I'm using an LPC2478, and after reconfiguring the PLL according to the LPC2468 errata I have not observed any crashes (though maybe I didn't run it long enough).

Eric Severson over 16 years ago in reply to Franc Urbanc

I am using an LPC2468 Revision B. My clock rates are well below the maximums listed in the errata and I am not using the MAM interface.

Franc --

I left you a response here regarding why I think this may not be a hardware issue:
http://www.keil.com/forum/docs/thread15346.asp