Program failure at 80+ degrees

Hello,

I was hoping to hear your opinion about a serious problem I have: either I solve it, or I reduce my LPC2478 CPU speed from 72 MHz to 64 MHz (an 11% loss; the problem does not seem to occur at lower clock settings). I posted about this in the past, but it was a long time ago.
When I place a controller in an environmental chamber and increase the temperature to 80+ degrees Celsius, I often see data abort exceptions, and sometimes I get the impression that the PC takes a hike (even the firmware LED that blinks every 1 second becomes irregular for a while before it stops). The program is launched by a boot loader and has a lower-level supporting firmware layer that handles some interrupts (not all). I also see that if RTX is not started at all and the application hangs in a "for (;;)" loop instead (so the boot loader and firmware layer are still involved, but the application is idle), the system never crashes! I have excluded, as far as I could tell, the role of external memory or RTX in this situation. However, I still suspect RTX a little (even though my test programs never crashed).
My question: did you ever encounter such a situation? Where is the best place to look? Can this be the result of a misbehaving peripheral? NXP have confirmed the LPC2478 is not the reason.

  • NXP have confirmed the LPC2478 is not the reason.

    Odd ... the maximum allowed ambient temperature for the commercial version of the chip is 85 degrees Celsius. "80+ degrees" sounds awfully close to that.

    Besides the CPU, there are a few other possible suspects, like the power supply. I wouldn't exclude the external memory until after a thorough examination - at 80 °C the external memory interface of the chip might be operating close to its worst-case specs. Another candidate for investigation is the flash memory of the chip (if it is used).
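A thorough examination of the external memory could start with a plain pattern test that is left running in the chamber. What follows is only a minimal sketch under stated assumptions: the function name `ram_test_pass` and the seed values are hypothetical, and on the real board `base` would have to point at the external memory region while the test itself runs from internal memory with RTX not started.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from the thread): one pass of an
 * "address-in-address" test. Every word is written with its own index
 * XORed with a seed, then the whole range is read back.
 * Returns the index of the first mismatching word, or `words` if the
 * pass was clean.
 */
size_t ram_test_pass(volatile uint32_t *base, size_t words, uint32_t seed)
{
    size_t i;

    /* Write phase: fill the whole range before reading anything back. */
    for (i = 0; i < words; i++) {
        base[i] = (uint32_t)i ^ seed;
    }

    /* Read phase: compare every word against the value written above. */
    for (i = 0; i < words; i++) {
        if (base[i] != ((uint32_t)i ^ seed)) {
            return i;
        }
    }
    return words;
}
```

Calling this in a loop with alternating seeds (for example 0x00000000 and 0xFFFFFFFF) and logging the first failing index at each temperature step would show whether the memory interface alone fails at 80 °C, independent of the application.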

  • more data: if I empty all the RTX tasks - I experience no crash. How can the temperature determine the behavior, assuming all the components support these temperatures (they do) ?

  • Another thing: I assume those 80 degrees are the temperature inside the environmental test chamber. Is the device in some kind of case or enclosure, so that the components might experience higher temperatures?

    If so, it might be worth sticking a thermometer in there to check.

  • No experience of this problem but just a random thought: how susceptible to high temperatures is the quartz crystal that you are using? A quick Google search came up with this statement from a manufacturer:

    "Operating Temperature: Standard Operating Temperature ranges are generally considered as -20-+70 degrees Celsius (considered "commercial" Operating Temperature), and -40-+85 degrees Celsius (considered "Industrial" Operating Temperature)"

  • thanks for your attention. there is no casing involved (yet): the product is required to operate at temperatures of up to 70 degrees, but it might exceed that if the casing is installed, thus the vigorous testing. the crystal does not seem to be the problem - at least our hardware people said that...
    NXP claim that the LPC2478 was tested at their labs at up to 105 degrees, and some applications allow for up to 120 degrees...!

  • the crystal does not seem to be the problem - at least our hardware people said that...

    This is quite likely a hardware problem (I'd wager a few Euros ;) ), so the hardware people should be investigating it or at least be tightly in-the-loop. :)

    (This means they should be poking the device with scope probes when it is misbehaving.)

    At least around here, all things related to environmental testing are done by our wonderful hardware people.

  • Another thing: Did you test different devices, and are they all showing the same symptoms? Nothing is more annoying than a wild goose chase that, after many dead ends, resolves to merely being a manufacturing defect.

    Also, consider using cooling spray to find out which part of the circuit is sensitive to heat.

  • Does it matter if the processor caches memory accesses?

    Does it matter if the processor runs at 100% CPU load, or if it sleeps between interrupts?

    Have the hw guys verified/probed all external signals - oscillator, power, interrupt inputs, voltage references, ...

  • Christoph, Per,

    I _think_ this is something in the application itself, even though LR indicates the data abort originated from RTX code (that is always the case). I tried to change the tick rate to no avail. Tried cooling spray already - I think the processor is the factor, not sure though. My test applications in the same conditions never crash. Other boards suffer from the same problem, so it is not a localized defect...
    the hardware seems verified.

    Per,

    Could you please explain why you asked

    Does it matter if the processor caches memory accesses?

    Does it matter if the processor runs at 100% CPU load, or if it sleeps between interrupts?

    ?

  • The CPU load will affect the power consumed by the processor which will affect the additional heating internally. It will add extra stress to both processor and power supply.

    Caching of memory will affect not only CPU load but also memory timing. If the supply voltage isn't stable or is slightly changing, or the oscillator is jittering or slightly changing frequency, the safety margins can be reduced.

    If the chip is running with DRAM, then the refresh needs will be higher at high temperatures, but the RAM access patterns can hide the refresh problem. Caching of RAM will change the access pattern and hence change the amount of refresh that comes from normal access cycles. At room temperature, the DRAM refresh may be several times too slow without any visible problems, because self-discharge is much slower at room temperature.
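To put numbers on the refresh point, here is a minimal sketch of the arithmetic. Everything specific in it is an assumption rather than something stated in the thread: the 64 ms / 8192-row refresh requirement is a typical SDRAM figure, the refresh register is assumed to count in units of 16 memory-clock cycles, and both function names are hypothetical. The real values have to come from the DRAM data sheet and the LPC24xx user manual.

```c
#include <stdint.h>

/*
 * Clock cycles available between two row refreshes:
 *   clk_hz * tREF / rows
 * tref_ms is the time in which ALL rows must be refreshed (assumed
 * 64 ms in the example), rows is the number of rows (assumed 8192).
 * The result is rounded down.
 */
uint32_t refresh_cycles_per_row(uint32_t clk_hz, uint32_t tref_ms, uint32_t rows)
{
    return (uint32_t)(((uint64_t)clk_hz * tref_ms) / 1000u / rows);
}

/*
 * Refresh timer value, ASSUMING the register counts in units of 16
 * clock cycles. Rounding down makes the refresh slightly faster than
 * required, never slower.
 */
uint32_t refresh_reg_value(uint32_t clk_hz, uint32_t tref_ms, uint32_t rows)
{
    return refresh_cycles_per_row(clk_hz, tref_ms, rows) / 16u;
}
```

With these assumptions, 72 MHz gives 562 cycles per row and a register value of 35, while 64 MHz gives 500 cycles and a value of 31. Halving the assumed refresh window to 32 ms at 72 MHz gives 17. Programming such a deliberately faster refresh would be one experiment in the chamber: if the crashes disappear, the refresh margin is a strong suspect.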

  • Per,

    thanks. I tried refining the DRAM parameters according to the data sheet (it is specced to work up to 85 degrees), to no avail. But the DRAM is a suspect, as it seems to make a difference if I cool only that component...

  • if I empty all the RTX tasks - I experience no crash.

    Hmmm... if there's nothing being done any more by any task, and the system crashes: how exactly did you expect to experience that fact?

    How can the temperature determine the behavior, assuming all the components support these temperatures (they do) ?

    Maybe because that assumption is wrong, or some components are actually hotter than you think they are, or, this close to their thermal limits, enough components have begun to change in behaviour that your electronic design has been driven beyond at least one of its design margins.

    This is pretty much guaranteed to be a hardware problem. The only way heat can affect software behaviour is by affecting hardware first.

  • even though LR indicates the data abort originated from RTX code (that is always the case).

    Does the RTX code do anything that the regular code does not (or only infrequently)?

    For example:

    - Run from internal flash
    - Use internal RAM

  • "The only way heat can affect software behaviour is by affecting hardware first."

    Unless the OP is [once again] using undocumented RTX calls ;)

  • guess what Christoph, now the failure happens somewhere else (LR indicates a non-existent, internal-RAM-like address), so I cannot answer your question (I did not write down the data yesterday, darn!). I have asked Samsung to advise us about how the DRAM timing changes at high temperature.
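Since the LR value keeps coming up, a small helper that decodes and logs it on every data abort may save having to write the data down by hand. This is only a sketch: on an ARM7TDMI the abort-mode LR holds the address of the aborted instruction plus 8, which is the one firm fact used here. The memory map constants (512 KB flash at 0x00000000, 64 KB SRAM at 0x40000000, dynamic external memory from 0xA0000000 to 0xDFFFFFFF) are assumptions to be checked against the LPC24xx user manual, and all names are hypothetical.

```c
#include <stdint.h>

/* Coarse memory regions; the boundaries below are ASSUMED values. */
typedef enum {
    REGION_FLASH,        /* on-chip flash, assumed 512 KB at 0x00000000 */
    REGION_SRAM,         /* on-chip SRAM, assumed 64 KB at 0x40000000   */
    REGION_EXT_DYNAMIC,  /* external DRAM, assumed 0xA0000000-0xDFFFFFFF */
    REGION_OTHER         /* anything else, including unmapped holes     */
} region_t;

/* On a data abort, LR_abt = address of the aborted instruction + 8. */
uint32_t data_abort_pc(uint32_t lr_abt)
{
    return lr_abt - 8u;
}

/* Classify an address against the assumed memory map above. */
region_t classify_address(uint32_t addr)
{
    if (addr < 0x00080000u) {
        return REGION_FLASH;
    }
    if (addr - 0x40000000u < 0x00010000u) {
        return REGION_SRAM;
    }
    if (addr >= 0xA0000000u && addr <= 0xDFFFFFFFu) {
        return REGION_EXT_DYNAMIC;
    }
    return REGION_OTHER;
}
```

An LR that decodes to `REGION_OTHER` just past the end of the assumed SRAM range would fit the "internal RAM like" address described above, and would point more towards a corrupted return address or stack than towards the code that happened to be executing.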