Hello all,
I do not wish to repeat myself, as I addressed this issue in a recent thread, but it is too important to ignore: lives could even be at stake, which certainly makes it worth a separate thread. I believe I have managed to write a program that causes RTX to fail on an LPC2468/2478 (the simulator does not reproduce the failure). Quite a few people have reported problems with RTX on this forum, so hopefully they can download my stripped-down test program here dl.getdropbox.com/.../LPC2468_RTX_Demo_min.zip to try their own variants. I have of course informed Keil support about this issue and am currently waiting for feedback. I would very much appreciate any feedback you might have.
Tamir
Oh, I apologize. I read your post wrong.
Just to be clear: you have not yet had this occur in your test application on LPC2478 hardware after making the changes that the LPC2468 errata suggested, correct?
Eric,
That is correct. I had 2 boards running my test application with the lowest values of M,N (for PLL determination) that yield 72 MHz for about an hour without a failure. I forgot to leave such a system running all night long, though... will do so tomorrow! It is weird that some errata stuff that was written for the LPC2468 applies to the LPC2478 as well, while the errata sheet of the LPC2478 is "clean"...!
I forgot to mention that I have addressed the technical support of NXP regarding the errata sheet (in)compatibility issue.
Hello,
I have received this reply from NXP:
"The PLL.1 erratum has been resolved since version "A" of the silicon. It doesn't apply to the LPC2478. It is therefore unlikely that the PLL frequency is the cause of your system instability.
LPC2478 started with rev C silicon, current LPC2468 are rev B silicon."
Why is it, then, that choosing the most stable clock settings that yield 72[MHz] on an LPC2478 stops RTX from failing? It is NXP's spreadsheet that I used to calculate the clock settings, and nowhere is it specified that some of the results are illegal or that you must choose the lowest possible M,N.
More information: I have conducted additional tests along with Franc Urbanc and found out that M=12, N=1 (=72[MHz] with a crystal of 12[MHz]) and a tick rate of 50 microseconds make RTX fail when the startup file is padded with NOPs to align the binary structure with that of the M=24, N=2 version. There are large shifts in the generated code and an increase in binary size (4 bytes) compared to settings of M=24, N=1. The errata sheet of LPC2468 does not apply to LPC2478.
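For anyone who wants to check the clock settings discussed above without the spreadsheet, here is a minimal sketch of the arithmetic in C. It assumes the PLL relation Fcco = 2 * M * Fin / N and a CCO range of 275 to 550 MHz, both recalled from the LPC24xx user manual rather than taken from this thread, so please verify them against your copy. The function names are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed CCO operating range (recalled from the LPC24xx user manual). */
#define FCCO_MIN_HZ 275000000u
#define FCCO_MAX_HZ 550000000u

/* Assumed relation: Fcco = 2 * M * Fin / N.
   m and n are the actual multiplier and divider values, not the
   register fields (which hold the value minus one). */
uint32_t pll_fcco_hz(uint32_t fin_hz, uint32_t m, uint32_t n)
{
    assert(n > 0);
    return (uint32_t)((2ull * m * fin_hz) / n);
}

/* True if the CCO frequency lies inside the assumed legal range. */
bool pll_fcco_in_range(uint32_t fcco_hz)
{
    return fcco_hz >= FCCO_MIN_HZ && fcco_hz <= FCCO_MAX_HZ;
}

/* CPU clock after the CCLK divider (cclk_div is the actual divide value). */
uint32_t cpu_clock_hz(uint32_t fcco_hz, uint32_t cclk_div)
{
    assert(cclk_div > 0);
    return fcco_hz / cclk_div;
}
```

With a 12 MHz crystal, both M=12, N=1 and M=24, N=2 give Fcco = 288 MHz, and a CCLK divider of 4 then gives 72 MHz. M=24, N=1 would give 576 MHz, which falls outside the range assumed here.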
I have conducted more tests for Franc. It seems to be a problem with the LPC2478 revision C MAM (not certain yet), as my test program does not crash when executed from RAM.
i've changed mine to run from ram; it still crashes, maybe less often, and still always in that same function.
i haven't run your test program but i will as soon as i get a chance.
Hello Ryan,
Have you tried to shut down the MAM altogether? Do notice that I have determined, together with Franc Urbanc, that the actual structure of the internal flash image determines whether there is a crash or not, as long as the program runs with the MAM enabled.
tamir,
i ran your test program for a while. I have not seen it crash, but three times so far, the watchdog function in each task shows that some of those tasks are no longer running. this happened after 5-30 minutes of running. i was running this on the lpc2478 board.
I have seen this behavior on my own project lately while doing tests. It seems that sometimes a task is lost, sometimes a task is in the list more than once (next pointer points to itself and gets stuck in a loop), and most often the dabt error occurs on the null pointer.
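A self-referencing next pointer like the one described above can be caught with a cheap consistency check. The sketch below uses Floyd's cycle detection on a hypothetical node type; it does not reflect the actual RTX task control block layout, and the names are invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical task node; NOT the real RTX control block. */
typedef struct task_node {
    struct task_node *next;
    int id;
} task_node_t;

/* Floyd's tortoise-and-hare: returns true if following `next` from
   head ever revisits a node, including a node that points to itself. */
bool task_list_has_cycle(const task_node_t *head)
{
    const task_node_t *slow = head;
    const task_node_t *fast = head;

    while (fast != NULL && fast->next != NULL) {
        slow = slow->next;        /* one step  */
        fast = fast->next->next;  /* two steps */
        if (slow == fast) {
            return true;
        }
    }
    return false;                 /* reached the end of the list */
}
```

Calling such a check from a debug hook before the list is walked would turn a silent endless loop into something that can be trapped and logged.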
i've run it with MAM off and these things still happen, perhaps less often. my tests now include a mcb2470 as well as 2 different EA lpc2468 OEM boards on my own base board design.
Ryan,
Franc Urbanc confirmed that this is indeed an RTX problem. Tests are being conducted now to check whether Franc's fix is valid. Be patient, help is under way...
It sounds like you guys are making great progress. Thank you for continuing to post information as you go. I appreciate it and I am sure others on the forum do as well.
-Eric
The actual reason for all the sporadic RTX failures you have been seeing is most likely the NXP LPC2xxx VIC undocumented "feature" (described below) and the fact that RTX was not aware of it.
VIC behavior: After an interrupt is disabled (writing to VICIntEnClr) the interrupt is not immediately blocked but can still happen for a few cycles (time needed for VIC to process the request). Special tests were performed which confirm this behavior.
This "feature" was not taken into account by the RTX kernel. Therefore in some rare situations (very timing specific) it could happen that a blocked interrupt was still executed, which eventually led to RTX failure. Such situations are very rare (they can happen sooner when the system time tick interrupt occurs more often) and even less likely when the MAM is disabled, because then an instruction fetch takes longer than the few cycles that the VIC requires. This also explains why the problem was not detected sooner and why it was almost gone when the MAM was disabled.
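The window described above can be illustrated with a toy model in C. This is only a host-side simulation, not RTX code and not a register-accurate VIC: the `latency` value and all names are hypothetical, since the real number of cycles is not documented.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model: a disable request only takes effect after some cycles. */
typedef struct {
    bool     enabled;          /* what the controller currently acts on   */
    bool     disable_pending;  /* a disable was written but has not landed */
    uint32_t cycles_left;      /* cycles until the pending disable lands   */
    uint32_t delivered;        /* interrupts that reached the "core"       */
} vic_model_t;

void vic_init(vic_model_t *v)
{
    v->enabled = true;
    v->disable_pending = false;
    v->cycles_left = 0;
    v->delivered = 0;
}

/* Models a write to an enable-clear register; latency is hypothetical. */
void vic_request_disable(vic_model_t *v, uint32_t latency)
{
    if (latency == 0) {
        v->enabled = false;
        v->disable_pending = false;
        return;
    }
    v->disable_pending = true;
    v->cycles_left = latency;
}

/* Advance one cycle. irq_asserted says whether the peripheral raises
   its request in this cycle. Returns true if it was delivered. */
bool vic_tick(vic_model_t *v, bool irq_asserted)
{
    bool taken = irq_asserted && v->enabled;
    if (taken) {
        v->delivered++;
    }
    if (v->disable_pending && --v->cycles_left == 0) {
        v->enabled = false;          /* the disable finally lands */
        v->disable_pending = false;
    }
    return taken;
}
```

In this model, a request raised during any of the `latency` cycles after `vic_request_disable()` is still delivered, while a request raised afterwards is blocked. Code that assumes the interrupt is off immediately after the write is exposed for exactly that window.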
The updated RTX kernel now takes the described VIC behavior into account which should eliminate the reported problems (at the cost of a few additional CPU cycles).
BTW: Interrupt controller behavior similar to that described for the NXP VIC also applies to ST's STR7 EIC. In reality the EIC is even worse in this respect, since the time to process the interrupts is even longer. This behavior had therefore already been seen there, and the RTX kernel already handled it. For the NXP VIC, on the other hand, it was assumed that this was not necessary.
In general ARM7/9 cores do not have interrupt controllers, so silicon vendors added their own external implementations, and this leads to behavior such as that described above. Much better in this respect are the new ARM Cortex-M cores, which have an advanced Nested Vectored Interrupt Controller (NVIC) already tightly integrated with the core. This has many benefits (faster interrupt response, late arriving interrupts, tail chaining ...) and also eliminates such problems as seen with the VIC and EIC.
Hello Robert,
Franc provided me with a patch that seems to work fine. I guess we need to thank you all for putting so much effort into this. When can we expect a new official release of RL-ARM containing this fix?
Is it really undocumented? Isn't that behaviour common to all ARM chips that have an external interrupt controller, and one of the reasons why code either has to wait a fixed number of cycles or deactivate interrupts in the core instead of in the interrupt controller?
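The second option mentioned here, masking interrupts in the core rather than relying on the controller, can be sketched on a host like this. Everything below is a stand-in: `core_irq_masked` plays the role of the CPSR I bit, the hook names are hypothetical, and this is not claimed to be how RTX was fixed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

bool     core_irq_masked = false; /* stands in for the CPSR I bit       */
bool     ctrl_enabled    = true;  /* controller's possibly stale enable */
uint32_t handler_runs    = 0;     /* how often the handler got in       */

/* Hypothetical port hook: mask IRQs at the core, return previous state.
   On a real ARM7 this would read and modify CPSR. */
bool core_irq_save_and_mask(void)
{
    bool was_masked = core_irq_masked;
    core_irq_masked = true;
    return was_masked;
}

/* Hypothetical port hook: restore the previous core mask state. */
void core_irq_restore(bool was_masked)
{
    core_irq_masked = was_masked;
}

/* Models the moment the controller-level disable takes effect. */
void ctrl_disable_lands(void)
{
    ctrl_enabled = false;
}

/* Models the peripheral raising its request: the handler runs only if
   the controller still forwards it AND the core is not masking IRQs. */
bool raise_irq(void)
{
    if (ctrl_enabled && !core_irq_masked) {
        handler_runs++;
        return true;
    }
    return false;
}

uint32_t handler_run_count(void)
{
    return handler_runs;
}
```

The intended sequence is: save and mask at the core, write the controller disable, let it take effect, then restore the core state. In the model, a request raised while the controller enable is still stale is held off by the core mask, and after the restore it is blocked by the controller itself.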
Tamir,
The new RL-ARM which includes this fix will be released soon (in a few weeks).
Per,
Yes, this behavior seems to be common to ARM7/9 with external interrupt controllers. However the number of cycles varies between interrupt controller implementations and I haven't seen any documentation about this.
"However the number of cycles varies between interrupt controller implementations and I haven't seen any documentation about this." Neither have I. And it isn't easy to guesstimate the required number either. Something that seems to work after extensive testing can still be one clock off, just waiting for that other interrupt to come and catch you with your pants down :(