Inducing RTX failure

Hello all,

I do not wish to repeat myself, as I raised this issue in a recent thread, but it is too important to turn a blind eye to: lives could even be at stake, which certainly makes it worth a separate thread. I believe I have managed to write a program that causes RTX to fail on an LPC2468/2478 (the simulator does not reproduce the failure). Quite a few people have reported problems with RTX on this forum, so hopefully they can download my stripped-down test program from dl.getdropbox.com/.../LPC2468_RTX_Demo_min.zip and try their own variants. I have of course informed Keil support about this issue and am currently waiting for feedback. I would very much appreciate any feedback you might have.

Tamir

0 Tamir Michael over 16 years ago in reply to Tamir Michael

Hello,

I have received this reply from NXP:

"The PLL.1 erratum has been resolved since version "A" of the silicon. It doesn't apply to the LPC2478. It is therefore unlikely that the PLL frequency is the cause of your system instability.

LPC2478 started with rev C silicon, current LPC2468 are rev B silicon."

Why is it, then, that choosing the most stable clock settings that yield 72[MHz] on an LPC2478 stops RTX from failing? I used NXP's own spreadsheet to calculate the clock settings, and nowhere does it say that some of the results are illegal or that you must choose the lowest possible M,N.

0 Tamir Michael over 16 years ago in reply to Tamir Michael

Hello,

More information: I have conducted additional tests together with Franc Urbanc and found that M=12, N=1 (72[MHz] with a 12[MHz] crystal) and a tick interval of 50 microseconds cause RTX to fail when the startup file is padded with NOPs to align the binary layout with that of the M=24, N=2 version. We are seeing large shifts in the generated code and an increase in binary size (4 bytes) compared to the settings of M=24, N=1. The errata sheet of the LPC2468 does not apply to the LPC2478.
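
As a rough check of the clock arithmetic in this post, here is a minimal sketch. It assumes the LPC23xx/24xx PLL relation Fcco = 2 * M * Fin / N, with M and N taken as the actual multiplier and divider values rather than the register fields, and it assumes a CPU clock divider of 4, which the post does not state. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* PLL output frequency, assuming Fcco = 2 * M * Fin / N.
 * m and n are the real multiplier/divider values, not the
 * register encodings. */
static uint32_t pll_fcco_hz(uint32_t fin_hz, uint32_t m, uint32_t n)
{
    assert(n != 0u);
    return (uint32_t)((2ull * m * fin_hz) / n);
}

/* CPU clock derived from Fcco; cclk_div is an assumed divider value. */
static uint32_t cclk_hz(uint32_t fcco_hz, uint32_t cclk_div)
{
    assert(cclk_div != 0u);
    return fcco_hz / cclk_div;
}
```

Under these assumptions, both M=12, N=1 and M=24, N=2 with a 12 MHz crystal give Fcco = 288 MHz, and a divider of 4 gives the 72 MHz mentioned above. The two settings differ only in the PLL configuration, not in the resulting frequency.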

0 Tamir Michael over 16 years ago in reply to Tamir Michael

I have conducted more tests for Franc. It seems to be a problem with the LPC2478 revision C MAM (not certain yet), as my test program does not crash when executed from RAM.

0 ryan williams over 16 years ago in reply to Tamir Michael

I've changed mine to run from RAM. It still crashes, maybe less often, and still always in that same function.

I haven't run your test program, but I will as soon as I get a chance.

0 Tamir Michael over 16 years ago in reply to ryan williams

Hello Ryan,

Have you tried shutting down the MAM altogether? Note that I have determined, together with Franc Urbanc, that the actual layout of the internal flash image determines whether there is a crash or not, as long as the program runs with the MAM enabled.
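
For anyone who wants to repeat this experiment, a minimal sketch of switching the MAM off follows. It assumes the LPC2xxx register layout, with MAMCR at 0xE01FC000 and MAMTIM directly after it, and that writing 0 to MAMCR disables the MAM. `mam_regs_t`, `LPC2XXX_MAM` and `mam_disable` are hypothetical names, so check the addresses against your own device header.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout of the two MAM registers (hypothetical type name). */
typedef struct {
    volatile uint32_t MAMCR;   /* mode control: 0 = MAM disabled */
    volatile uint32_t MAMTIM;  /* flash fetch timing, left untouched here */
} mam_regs_t;

/* Assumed base address; verify against the device header. */
#define LPC2XXX_MAM ((mam_regs_t *)0xE01FC000u)

/* Turn the MAM off so that every fetch goes straight to flash. */
static void mam_disable(mam_regs_t *mam)
{
    assert(mam != 0);
    mam->MAMCR = 0u;
}
```

On the target this would be called as `mam_disable(LPC2XXX_MAM);` early in startup. Passing the register block as a parameter keeps the function testable off-target.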

0 ryan williams over 16 years ago in reply to Tamir Michael

Tamir,

I ran your test program for a while. I have not seen it crash, but three times so far the watchdog function in each task has shown that some of those tasks are no longer running. This happened after 5-30 minutes of running. I was running this on the LPC2478 board.

I have seen this behavior in my own project lately while doing tests. It seems that sometimes a task is lost, sometimes a task is in the list more than once (its next pointer points to itself and the kernel gets stuck in a loop), and most often a data abort (DAbt) occurs on a null pointer.

I've run it with the MAM off and these things still happen, perhaps less often. My tests now include an MCB2470 as well as 2 different EA LPC2468 OEM boards on my own base board design.

0 Tamir Michael over 16 years ago in reply to ryan williams

Ryan,

Franc Urbanc confirmed that this is indeed an RTX problem. Tests are now being conducted to check whether Franc's fix is valid. Be patient, help is on the way...

0 Eric Severson over 16 years ago in reply to Tamir Michael

It sounds like you guys are making great progress. Thank you for continuing to post information as you go. I appreciate it and I am sure others on the forum do as well.

-Eric

0 Robert over 16 years ago in reply to Tamir Michael

The actual reason for all the sporadic RTX failures you have been seeing is most likely an undocumented "feature" of the NXP LPC2xxx VIC (described below), of which RTX was not aware.

VIC behavior: After an interrupt is disabled (writing to VICIntEnClr) the interrupt is not immediately blocked but can still happen for a few cycles (time needed for VIC to process the request). Special tests were performed which confirm this behavior.

This "feature" was not taken into account by the RTX kernel. Therefore, in some rare and very timing-specific situations, a blocked interrupt could still be executed, which eventually led to an RTX failure. Such situations are very rare (they occur sooner when the system tick interrupt fires more often) and even less likely when the MAM is disabled, because an instruction fetch then takes longer than the few cycles the VIC requires. This also explains why the problem was not detected sooner and why it was almost gone when the MAM was disabled.

The updated RTX kernel now takes the described VIC behavior into account which should eliminate the reported problems (at the cost of a few additional CPU cycles).

BTW: Interrupt controller behavior similar to that described for the NXP VIC also applies to ST's STR7 EIC. In reality the EIC is even worse in this respect, since it takes even longer to process interrupts. This behavior had therefore already been seen, and the RTX kernel already handled it. For the NXP VIC, on the other hand, this was considered unnecessary.

In general, ARM7/9 cores do not have interrupt controllers, so silicon vendors added their own external implementations, and this leads to the kind of behavior described above. The new ARM Cortex-M cores are much better in this respect: they have an advanced Nested Vectored Interrupt Controller (NVIC) tightly integrated with the core. This has many benefits (faster interrupt response, late-arriving interrupts, tail chaining ...) and also eliminates problems such as those seen with the VIC and EIC.

0 Tamir Michael over 16 years ago in reply to Robert

Hello Robert,

Franc provided me with a patch that seems to work fine. I guess we need to thank you all for putting so much effort into this. When can we expect a new official release of RL-ARM containing this fix?

Tamir

0 Per Westermark over 16 years ago in reply to Robert

Is it really undocumented? Isn't that behaviour common to all ARM chips that have an external interrupt controller, and one of the reasons why code either has to wait a fixed number of cycles or deactivate interrupts in the core instead of in the interrupt controller?

0 Robert over 16 years ago in reply to Per Westermark

Tamir,

The new RL-ARM which includes this fix will be released soon (in a few weeks).

Per,

Yes, this behavior seems to be common to ARM7/9 with external interrupt controllers. However the number of cycles varies between interrupt controller implementations and I haven't seen any documentation about this.

0 Per Westermark over 16 years ago in reply to Robert

"However the number of cycles varies between interrupt controller implementations and I haven't seen any documentation about this."
Neither have I. And it isn't easy to guesstimate the required number either. Something that seems to work after extensive testing can still be one clock off, just waiting for that other interrupt to come and catch you with your pants down :(

0 bruce yu over 16 years ago in reply to Robert

Hi Robert, thank you very much for your explanation. My ARM7 uses the EIC and I have encountered a similar issue; see here: http://www.keil.com/forum/docs/thread15796.asp

To fix the problem, our solution is to execute a short for loop (as a delay) after disabling the EIC interrupt. Does this method work?

0 Per Westermark over 16 years ago in reply to bruce yu

A number of chip vendors have recommended using a couple of NOPs after disabling interrupts. Any combination of instructions that takes at least the required number of cycles should do fine.

The only issue is that the exact number of cycles isn't always known, because the manufacturer hasn't published it in any datasheet.
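
The "disable, then pad with a few instructions" approach discussed above could be sketched roughly as follows. This is not the RTX fix itself, which is not shown in this thread. `vic_disable_and_settle` and `VIC_SETTLE_NOPS` are hypothetical names, the NOP count is a pure guess because the real latency is unpublished, and the inline assembly assumes a GCC-compatible compiler.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed padding; the real settle time of the controller is not documented. */
#define VIC_SETTLE_NOPS 4u

/* Write the disable mask, then burn a few cycles before returning, so the
 * caller does not enter its critical section while the interrupt
 * controller may still deliver the just-disabled interrupt. */
static void vic_disable_and_settle(volatile uint32_t *int_en_clr, uint32_t mask)
{
    assert(int_en_clr != 0);
    *int_en_clr = mask;                      /* e.g. the VICIntEnClr register */
    for (uint32_t i = 0u; i < VIC_SETTLE_NOPS; ++i) {
        __asm__ volatile ("nop");            /* volatile: not optimized away */
    }
}
```

On the target the first argument would be the address of the controller's interrupt-enable-clear register, taken from the device header. Because the required cycle count is unknown, masking interrupts in the core around the critical section, as mentioned earlier in the thread, avoids having to guess a delay.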