Hello,
I was hoping to hear your opinion about a serious problem I have: either I solve it, or I reduce my LPC2478 CPU clock from 72 MHz to 64 MHz (an 11% loss; the problem does not seem to occur at lower clock settings). I posted about this in the past, but it was a long time ago.
When I place a controller in an environmental chamber and raise the temperature to 80+ degrees Celsius, I often see data abort exceptions, and sometimes I get the impression that the PC takes a hike (even the firmware LED that blinks once per second becomes irregular for a while before it stops). The program is launched by a boot loader and has a lower-level supporting firmware layer that handles some (not all) interrupts. I also see that if RTX is not started at all and the application instead hangs in a "for (;;)" loop (so the boot loader and firmware layer are still involved, but the application is idle), the system never crashes! As far as I can tell, I have excluded the role of external memory or RTX in this situation. However, I still suspect RTX a little (even though my test programs never crashed).
My questions: did you ever encounter such a situation? Where is the best place to look? Can this be the result of a misbehaving peripheral? NXP have confirmed that the LPC2478 is not the cause.
Another thing: Did you test different devices, and are they all showing the same symptoms? Nothing is more annoying than a wild goose chase that, after many dead ends, resolves to merely being a manufacturing defect.
Also, consider using cooling spray to find out which part of the circuit is sensitive to heat.
Does it matter if the processor caches memory accesses?
Does it matter if the processor runs at 100% CPU load, or if it sleeps between interrupts?
Have the hw guys verified/probed all external signals (oscillator, power, interrupt inputs, voltage references, ...)?
Christoph, Per,
I _think_ this is something in the application itself, even though LR indicates the data abort originated from RTX code (that is always the case). I tried changing the tick rate, to no avail. I have tried cooling spray already; I think the processor is the factor, but I am not sure. My test applications never crash under the same conditions. Other boards suffer from the same problem, so it is not a localized defect... the hardware seems verified.
Per,
Could you please explain why you asked about memory caching and CPU load?
The CPU load affects the power consumed by the processor, which in turn affects how much it heats up internally. It adds extra stress to both the processor and the power supply.
Caching of memory affects not only CPU load but also memory timing. If the supply voltage isn't stable or is slightly changing, or the oscillator is jittering or slightly changing frequency, the safety margins can be reduced.
If the chip is running with DRAM, the refresh requirement is higher at high temperatures, but the RAM access pattern can hide a refresh problem. Caching of RAM changes the access pattern and hence the amount of refresh that comes from normal access cycles. At room temperature, the DRAM refresh may be several times too slow without any visible problems, because self-discharge is much slower at room temperature.
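To make the refresh margin concrete, here is a minimal sketch of the arithmetic behind a refresh timer setting. It assumes the timer counts in units of 16 memory-clock cycles (which is how I read the LPC24xx EMC dynamic refresh register; please check the user manual). The function name and the example figures (64 ms or 32 ms retention, 8192 rows) are illustrative assumptions and are not taken from this thread or from any particular data sheet.

```c
#include <stdint.h>

/* Assumption: one refresh-timer count equals 16 memory-clock cycles. */
#define REFRESH_UNIT_CLOCKS 16u

/*
 * Hypothetical helper: returns the largest timer value whose refresh
 * interval does not exceed the per-row interval the DRAM requires.
 *   clk_hz       memory clock in Hz
 *   retention_ms time within which every row must be refreshed
 *   rows         number of rows that must be refreshed in that time
 * Returns 0 if rows is 0 (invalid input).
 */
uint32_t refresh_reg_value(uint32_t clk_hz, uint32_t retention_ms, uint32_t rows)
{
    uint64_t clocks_per_row;

    if (rows == 0u) {
        return 0u;
    }

    /* Clock cycles available between two consecutive row refreshes. */
    clocks_per_row = ((uint64_t)clk_hz * retention_ms) / (1000u * (uint64_t)rows);

    /* Round down so the programmed interval is never too long. */
    return (uint32_t)(clocks_per_row / REFRESH_UNIT_CLOCKS);
}
```

With these example numbers, refresh_reg_value(72000000u, 64u, 8192u) gives 35, while halving the retention time to 32 ms gives 17. A value that is comfortable for the room-temperature figure can therefore be about twice too slow if the part needs the shorter interval when hot.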
Thanks. I tried refining the DRAM parameters according to the data sheet (the part is specified to work up to 85 degrees), to no avail. But the DRAM is a suspect, as it seems to make a difference if I cool only that component...
even though LR indicates the data abort originated from RTX code (that is always the case).
Does the RTX code do anything that the regular code does not (or only infrequently)?
For example:
- Run from internal flash
- Use internal RAM
Guess what, Christoph: now the failure happens somewhere else (LR indicates a non-existent, internal-RAM-like address), so I cannot answer your question (I did not write down the data yesterday, darn!). I have asked Samsung to advise us on how the DRAM timing changes at high temperature.
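Since the abort data got lost, one option is to have the data abort handler save it somewhere that survives a reset. Below is a minimal sketch of such a record. All names here (abort_record_t, abort_record_store, ABORT_MAGIC) are made up for illustration. How the handler obtains LR_abt and SPSR (normally a small assembly stub) and where the record lives (a RAM area the startup code does not clear) depend on the toolchain and are not shown. The only hard fact used is that, for a data abort in ARM state, the aborted instruction is at LR_abt - 8.

```c
#include <stdint.h>

/* Hypothetical marker that tells a valid record from random RAM contents. */
#define ABORT_MAGIC 0xDEADAB07u

typedef struct {
    uint32_t magic;     /* ABORT_MAGIC when the record is valid       */
    uint32_t lr_abt;    /* raw link register seen in abort mode       */
    uint32_t fault_pc;  /* address of the instruction that aborted    */
    uint32_t spsr;      /* saved status register of the aborted code  */
    uint32_t count;     /* number of aborts recorded so far           */
} abort_record_t;

/* For a data abort in ARM state, LR_abt = aborted instruction + 8. */
uint32_t abort_fault_pc(uint32_t lr_abt)
{
    return lr_abt - 8u;
}

/* Fill the record; the abort counter survives as long as the magic does. */
void abort_record_store(volatile abort_record_t *rec, uint32_t lr_abt, uint32_t spsr)
{
    uint32_t previous = (rec->magic == ABORT_MAGIC) ? rec->count : 0u;

    rec->lr_abt   = lr_abt;
    rec->fault_pc = abort_fault_pc(lr_abt);
    rec->spsr     = spsr;
    rec->count    = previous + 1u;
    rec->magic    = ABORT_MAGIC;  /* written last, so a partial record is not trusted */
}
```

After the next boot, the application can check the magic value and print the record, so that fault addresses from several chamber runs can be compared.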
I am working with the production revision of the hardware. This one seems more stable, but still fails. If I learn something new, I will post it.
Maybe it is a solder problem.
I doubt it. It occurs on many systems, and the hardware has been verified.
"if I learn something new, I will post it."
Like when something is not related to Keil tools.
Egads.
I understand that you have serious problems in finding something that invokes your intellect. do yourself (and us all, unless you have something meaningful to say) a favor, then: be gone.
If you think that the DRAM is the problem, I would try:
- Write a small app that writes to the whole memory (at the highest bandwidth) and then verifies it, in an infinite loop, waiting for a failure or exception.
- Write down info such as time and temperature.
- Run the app on DRAM and then on IRAM or, if you have many boards, run it on 3-4 of them at the same time.
Try increasing the temperature to see if the problem gets any worse.
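The write-then-verify test described above could look roughly like this. It is only a sketch: the function names and the address-dependent pattern are my own choices, and the base address and size of the region under test (external DRAM or internal RAM) are board-specific and must be supplied by the caller.

```c
#include <stdint.h>
#include <stddef.h>

/* Address-dependent pattern, so a stuck or aliased location does not
   accidentally read back the expected value. */
static uint32_t pattern_for(size_t index, uint32_t seed)
{
    return ((uint32_t)index * 2654435761u) ^ seed;
}

/* Write the pattern to every word of the region under test. */
void mem_fill(volatile uint32_t *base, size_t words, uint32_t seed)
{
    size_t i;

    for (i = 0; i < words; i++) {
        base[i] = pattern_for(i, seed);
    }
}

/* Read the region back; returns the number of mismatching words and
   stores the index of the first mismatch in *first_bad (if any). */
size_t mem_verify(const volatile uint32_t *base, size_t words,
                  uint32_t seed, size_t *first_bad)
{
    size_t i;
    size_t errors = 0;

    for (i = 0; i < words; i++) {
        if (base[i] != pattern_for(i, seed)) {
            if (errors == 0 && first_bad != NULL) {
                *first_bad = i;
            }
            errors++;
        }
    }
    return errors;
}
```

On the target, call mem_fill and then mem_verify in an endless loop with a different seed on each pass, and log the pass number, time and chamber temperature whenever mem_verify returns non-zero. Adding a delay between fill and verify makes the test more sensitive to refresh problems, because the data then has to survive without being touched by normal accesses.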
Leandro,
Despite "stunned Steve"'s doubt about me, I have already tried this (but thanks for the tip anyway). All my test programs, including the one you described, work well under the above conditions and worse. Still searching for the smoking gun/component!
Despite "stunned Steve"'s doubt about me
Should of course read:
Despite "stunned Steve"'s doubts about me