Program failure at 80+ degrees

Hello,

I was hoping to hear your opinion about a serious problem I have: either I solve it, or I have to reduce my LPC2478 CPU clock from 72 MHz to 64 MHz (an 11% loss; the problem does not seem to occur at lower clock settings). I posted about this in the past, but that was a long time ago.
When I place a controller in an environmental chamber and raise the temperature to 80+ degrees Celsius, I often see data abort exceptions, and sometimes I get the impression that the PC takes a hike (even the firmware LED that blinks once per second becomes irregular for a while before it stops). The program is launched by a boot loader and has a lower-level supporting firmware layer that handles some (not all) interrupts. I also see that if RTX is not started at all (the application hangs in a "for (;;)" loop instead, so the boot loader and firmware layer are still involved but the application is idle), the system never crashes! I have excluded, as far as I can tell, the role of external memory or RTX in this situation. However, I still suspect RTX a little (even though my test programs never crashed).
My question: did you ever encounter such a situation? Where should I look first? Can this be the result of a misbehaving peripheral? NXP have confirmed the LPC2478 is not the cause.

0 Christoph Franck over 16 years ago in reply to Tamir Michael

Another thing: Did you test different devices, and are they all showing the same symptoms? Nothing is more annoying than a wild goose chase that, after many dead ends, resolves to merely being a manufacturing defect.

Also, consider using cooling spray to find out which part of the circuit is sensitive to heat.

0 ImPer Westermark over 16 years ago in reply to Christoph Franck

Does it matter if the processor caches memory accesses?

Does it matter if the processor runs at 100% CPU load, or if it sleeps between interrupts?

Have the hw guys verified/probed all external signals: oscillator, power, interrupt inputs, voltage references, ...?

0 Tamir Michael over 16 years ago in reply to ImPer Westermark

Christoph, Per,

I _think_ this is something in the application itself, even though LR indicates the data abort originated from RTX code (that is always the case). I tried changing the tick rate, to no avail. I already tried cooling spray; I think the processor is the factor, but I am not sure. My test applications never crash under the same conditions. Other boards suffer from the same problem, so it is not a localized defect... The hardware seems verified.

Per,

Could you please explain why you asked these two questions?

"Does it matter if the processor caches memory accesses?"

"Does it matter if the processor runs at 100% CPU load, or if it sleeps between interrupts?"

0 ImPer Westermark over 16 years ago in reply to Tamir Michael

The CPU load affects the power consumed by the processor, which in turn affects how much the chip heats up internally. A higher load adds extra stress to both the processor and the power supply.

Caching of memory affects not only CPU load but also memory timing. If the supply voltage isn't stable or is slightly changing, or the oscillator is jittering or slightly changing frequency, the safety margins can be reduced.

If the chip is running with DRAM, the refresh requirements are higher at high temperatures, but the RAM access patterns can hide a refresh problem. Caching of RAM changes the access pattern and hence changes how much refresh the normal access cycles provide. At room temperature, the DRAM refresh may be several times too slow without any visible problems, because self-discharge is much slower at room temperature.
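
The refresh point can be checked with a little arithmetic. The following is only a sketch: `refresh_reg_value` is a hypothetical helper, and it assumes a memory controller whose refresh period is programmed in units of 16 clock cycles, which should be verified against the EMC chapter of the user manual. The 64 ms / 8192 rows figures are typical SDRAM data sheet values used purely as an illustration, not the values of the part in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: value for a refresh-timer register that counts in
 * units of 16 clock cycles (ASSUMPTION, check your controller's manual).
 *   clk_hz  : memory controller clock in Hz
 *   tref_ms : time in which all rows must be refreshed (data sheet)
 *   rows    : number of rows that must be refreshed within tref_ms   */
static uint32_t refresh_reg_value(uint32_t clk_hz, uint32_t tref_ms, uint32_t rows)
{
    assert(rows != 0u);

    /* Clock cycles available between two consecutive row refreshes. */
    uint64_t cycles_per_row = ((uint64_t)clk_hz * tref_ms) / (1000ull * rows);

    /* Round down, so the programmed period is never longer than required. */
    return (uint32_t)(cycles_per_row / 16u);
}
```

With these example figures, 72 MHz gives a value of 35 and 64 MHz gives 31. If the data sheet demanded a shorter interval at elevated temperature (say 32 ms, again only an illustration), the 72 MHz value would drop to 17, so a setting that is comfortable at room temperature could be about twice too slow in the chamber.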

0 Tamir Michael over 16 years ago in reply to ImPer Westermark

Per,

Thanks. I tried refining the DRAM parameters according to the data sheet (it is specified to work up to 85 degrees), to no avail. But the DRAM is a suspect, as it seems to make a difference if I cool only that component...

0 Christoph Franck over 16 years ago in reply to Tamir Michael

"even though LR indicates the data abort originated from RTX code (that is always the case)."

Does the RTX code do anything that the regular code does not (or only infrequently)?

For example:

- Run from internal flash
- Use internal RAM

0 Tamir Michael over 16 years ago in reply to Christoph Franck

Guess what, Christoph: now the failure happens somewhere else (LR indicates a nonexistent, "internal RAM"-like address), so I cannot answer your question (I did not write down the data yesterday, darn!). I have asked Samsung to advise us on how the DRAM timing changes at high temperature.
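
For recording such aborts systematically, here is a minimal sketch of how a logged LR value could be decoded. It assumes the value is the abort-mode link register, which on ARM7 cores holds the address of the aborting instruction plus 8 when the data abort handler is entered. The region table is an assumption written from memory of an LPC24xx-style memory map and must be replaced with the ranges from the user manual; `abort_instr_addr` and `find_region` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* On data abort entry, LR_abt = address of the aborting instruction + 8. */
static uint32_t abort_instr_addr(uint32_t lr_abt)
{
    assert(lr_abt >= 8u);
    return lr_abt - 8u;
}

typedef struct {
    uint32_t start;   /* first address of the region (inclusive) */
    uint32_t end;     /* last address of the region (inclusive)  */
    const char *name;
} region_t;

/* ASSUMPTION: illustrative ranges only, verify against the user manual. */
static const region_t regions[] = {
    { 0x00000000u, 0x0007FFFFu, "on-chip flash" },
    { 0x40000000u, 0x4000FFFFu, "on-chip SRAM" },
    { 0x80000000u, 0x83FFFFFFu, "external static memory" },
    { 0xA0000000u, 0xDFFFFFFFu, "external dynamic memory" },
};

/* Returns the region containing addr, or NULL if no region matches. */
static const region_t *find_region(uint32_t addr)
{
    for (size_t i = 0; i < sizeof regions / sizeof regions[0]; ++i) {
        if (addr >= regions[i].start && addr <= regions[i].end) {
            return &regions[i];
        }
    }
    return NULL;
}
```

With this table, an address such as 0x40010000 (a made-up example) looks like internal RAM but lies just past the end of the assumed on-chip SRAM range, so `find_region` returns NULL. A result like that would suggest a corrupted pointer or return address rather than a genuine code location.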

0 Tamir Michael over 16 years ago in reply to Tamir Michael

I am working with the production revision of the hardware. This one seems more stable, but it still fails. If I learn something new, I will post it.

0 VIKTOR BUCHRT over 16 years ago in reply to Tamir Michael

Maybe it is a solder problem.

0 Tamir Michael over 16 years ago in reply to VIKTOR BUCHRT

I doubt it. It occurs on many systems, and the hardware has been verified.

0 S. Steve over 16 years ago in reply to Tamir Michael

"if I learn something new, I will post it."

Like when something is not related to Keil tools.

Egads.

0 Tamir Michael over 16 years ago in reply to S. Steve

I understand that you have serious problems finding something that invokes your intellect. Do yourself (and us all, unless you have something meaningful to say) a favor, then: be gone.

0 DrOctavius Octavius over 16 years ago in reply to Tamir Michael

If you think that the DRAM is the problem, I would try:

- Write a small app that writes to the whole memory (at the highest bandwidth) and then verifies it, running in an infinite loop until a failure or exception occurs.
- Log information such as time and temperature.
- Run the app for DRAM and then for IRAM or, if you have many boards, run it on 3-4 of them at the same time.
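
The write-then-verify step suggested above could look roughly like this. It is a sketch, not a complete test: `mem_test_pass` and `test_pattern` are hypothetical names, the multiplier is just an arbitrary odd constant used to spread bits, and the base address, word count, and logging are left to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Address-dependent pattern, so a stuck or aliased address line shows up
 * as a mismatch; changing the seed changes the data on every pass. */
static uint32_t test_pattern(size_t index, uint32_t seed)
{
    return ((uint32_t)index * 2654435761u) ^ seed;
}

/* Fill the whole region first, then read everything back.
 * Returns the number of words that did not read back correctly. */
static size_t mem_test_pass(volatile uint32_t *base, size_t words, uint32_t seed)
{
    size_t errors = 0;

    assert(base != NULL);

    for (size_t i = 0; i < words; ++i) {
        base[i] = test_pattern(i, seed);
    }
    for (size_t i = 0; i < words; ++i) {
        if (base[i] != test_pattern(i, seed)) {
            ++errors;
        }
    }
    return errors;
}
```

On the target, the caller would point `base` at the start of the DRAM (and, in a second run, at the IRAM), call the function in an endless loop with a new seed each time, and record the pass count, error count, and chamber temperature after every pass. Filling the whole region before verifying keeps each word untouched for as long as possible, which gives a refresh problem more time to show up than a write-and-immediately-read loop would.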

0 Nevill Dayley over 16 years ago in reply to DrOctavius Octavius

Try increasing the temperature to see if the problem gets any worse.

0 Tamir Michael over 16 years ago in reply to DrOctavius Octavius

Leandro,

Despite "stunned Steve"'s doubt about me, I have already tried this (but thanks for the tip anyway). All my test programs, including the one you described, work well under the above conditions and worse. Still searching for the smoking gun/component!

0 S. Steve over 16 years ago in reply to Tamir Michael

Despite "stunned Steve"'s doubt about me

Should of course read:

Despite "stunned Steve"'s doubts about me