Program failure at 80+ degrees

Hello,

I was hoping to hear your opinion about a serious problem I have - either I solve it, or I reduce my LPC2478 CPU speed from 72[MHz] to 64[MHz] (an 11% loss; the problem does not seem to occur at lower MHz settings). I posted about this in the past but it was a long time ago.
When I place a controller in an environmental chamber and increase the temperature to 80+ degrees Celsius, I often see data abort exceptions, and sometimes I get the impression that the PC takes a hike (even the firmware LED that blinks every 1 second becomes irregular for a while before it stops). The program is launched by a boot loader and has a lower level supporting firmware layer that handles some interrupts (not all). I also see that if RTX is not started at all (the application hangs in a "for (;;)" loop instead, so the bootloader and firmware layer are still involved but the application is idle), the system never crashes! I have excluded, as far as I can tell, the role of external memory or RTX in this situation. However, I still suspect RTX a little (even though my test programs never crashed).
My question: did you ever encounter such a situation? Where is the best place to look? Can this be the result of a misbehaving peripheral? NXP have confirmed the LPC2478 is not the reason.

  • I don't know about you, but for me, these problems are the cream of the crop of this line of work. Almost every problem is a mystery, and every problem can be solved (?) in different ways. Did I enjoy sitting 3 days in front of an environmental chamber (I have a few burn marks!)? No way, and the problem is not solved yet (but the probable cause is known). But in the end, it is/was a lot of fun!

  • I don't know about you, but for me, these problems are the cream of the crop of this line of work. Almost every problem is a mystery, and every problem can be solved (?) in different ways. Did I enjoy sitting 3 days in front of an environmental chamber (I have a few burn marks!)? No way, and the problem is not solved yet (but the probable cause is known). But in the end, it is/was a lot of fun!

    I second this!! That's a real engineer's soul, we are in some way... masochist geeks :)

  • Remember that a loop continuously accessing the DRAM will look like a super-charged RAM refresh. For problems with RAM refresh, it is often better to fill the RAM with a known pattern and then make sure that the RAM is not touched for a long time, so that the only refresh comes from the DRAM controller performing background refreshes. Then revisit the chip once every hour and verify that the pattern is still correct.
    True.

    Then, for DRAM memories the test case should be: fast access and slow access.
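
    The retention test quoted above (fill, leave untouched, verify much later) could be sketched roughly as follows. This is only a minimal illustration, not anyone's actual test code: the function names, the seed and the address-dependent pattern are all assumptions made for the example, and on a real target the region would have to be memory that nothing else (application, LCD DMA, stack) touches during the wait.

    ```c
    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical helper: the value expected at a given word index.
       An address-dependent pattern is used so that a decayed cell cannot
       hide behind a neighbour that happens to hold the same value. */
    static uint32_t expected_word(size_t index, uint32_t seed)
    {
        return seed ^ ((uint32_t)index * 0x9E3779B9u);
    }

    /* Fill the region once, then leave it alone so that only the DRAM
       controller's background refresh keeps the cells alive. */
    void retention_fill(volatile uint32_t *base, size_t words, uint32_t seed)
    {
        for (size_t i = 0; i < words; i++) {
            base[i] = expected_word(i, seed);
        }
    }

    /* Check the region after the waiting period.
       Returns the number of words that no longer match. */
    size_t retention_verify(const volatile uint32_t *base, size_t words,
                            uint32_t seed)
    {
        size_t errors = 0;
        for (size_t i = 0; i < words; i++) {
            if (base[i] != expected_word(i, seed)) {
                errors++;
            }
        }
        return errors;
    }
    ```

    Note that the verify pass itself reads every row and so acts as a refresh, which is why the checks should be spaced far apart (once an hour, as suggested) rather than run in a tight loop.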

  • related to the external RAM being addressed by both the application and the LCD controller.

    A conclusion you made impossible for anyone else to arrive at, by not mentioning anything about an LCD before, much less that it shared external RAM with the CPU. Is that dual-ported RAM, or how else do you organize shared access?

  • The data transfers LCDController <-> Memory are done by DMA, and there is an automatic mechanism for arbitration. This should not be a problem.

    But it should be taken into account if the Video Buffer is located in DRAM and you want to test the DRAM as Per suggested.

  • "I second this!! That's a real engineer's soul, we are in some way... masochist geeks :)"

    I would totally agree with that, I've been involved in plenty of projects where I've been totally engrossed for weeks/months on end, keeping a note pad by my bed for when I wake up with 'the ultimate answer' (much to the annoyance of my beloved wife).

    But ... the difference is, most don't keep trying to share this random blabber out.

    Have you ever been to a party and sat next to Mr. Boring?

  • Note that fast/slow accesses for a DRAM would most often refer to the actual timing of the signals: how long the signals are given to settle or to hold, and the number of wait states.

    There are quite a lot of tests needed for memory. Some for prototypes. Some for factory production. Some for every boot or maybe even regularly when run.
    - correct supply voltages at all temperatures and loads
    - correct timing of signals
    - clean edges and valid high/low logic levels for signals
    - all unused chip-selects etc having pull-up/pull-down
    - correctly wired (no shorts/breaks)
    - all memory cells working
    - stability at maximum load at low/high temperature
    - stability at zero load at low/high temperature (refresh working)
    - low-power retention (mainly SRAM with super-cap or battery)
    - ...
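
    Two of the items in this list (wiring and working memory cells) are the ones that are easy to check from software. A minimal sketch, under the assumption that the region under test is not used by anything else while it runs; the function names are made up for this example, and such a test says nothing about supply voltages, signal timing or refresh:

    ```c
    #include <stddef.h>
    #include <stdint.h>

    /* Walking-ones test on a single location: each data line is driven
       high on its own, so a shorted or open data line shows up as a
       mismatch. Returns 0 on success, else the first failing pattern. */
    uint32_t data_bus_test(volatile uint32_t *addr)
    {
        for (uint32_t pattern = 1u; pattern != 0u; pattern <<= 1) {
            *addr = pattern;
            if (*addr != pattern) {
                return pattern;
            }
        }
        return 0u;
    }

    /* Simple cell test: write an index-derived value to every word,
       check it, then repeat with the inverted value so that every bit
       has held both 0 and 1. Returns the index of the first failing
       word, or 'words' if everything passed. */
    size_t cell_test(volatile uint32_t *base, size_t words)
    {
        for (size_t i = 0; i < words; i++) {
            base[i] = (uint32_t)i;
        }
        for (size_t i = 0; i < words; i++) {
            if (base[i] != (uint32_t)i) {
                return i;
            }
            base[i] = ~(uint32_t)i;
        }
        for (size_t i = 0; i < words; i++) {
            if (base[i] != ~(uint32_t)i) {
                return i;
            }
        }
        return words;
    }
    ```

    On real hardware an immediate read-back can sometimes return the value still held on the bus rather than the value stored in the cell, so a pass here is a necessary check rather than proof that the memory is good.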

  • "...the previous one did use some components whose thermal limits were below 80 degree."

    And you found it was failing at 80+ degrees. Well, there's a surprise (not).

    This thread might well have been started with:

    "If you violate the thermal specifications ..."

  • Hello,

    I have learned a little more about this problem in the meantime and was wondering if you can enlighten me further. I am currently running a weekend test of a controller that utilizes the LCD controller of the LPC2478 vs. a controller that does not. The first one is reduced to 64[MHz] while the second one still runs at 72[MHz], and they communicate via an RS485 bus. Hopefully this remains stable, but either way, I have just reduced the display's processing capacity by about 11%...
    'Samsung' have promised me that their DRAM (K4S561632J) does not suffer from any issues and that the EMC timing settings used now should apply to the entire range of temperatures (maybe the controller was not warmed up entirely or long enough when I concluded otherwise). I am not sure about the refresh rate, but either way I did try to play with it without any positive results. I am aware that the signals to the DRAM should be measured, but that is not so simple at 80+ degrees.
    The latest LPC24xx data sheet elaborates on the AHBCFGx registers, which determine the arbitration of the AHB buses (my LCD, DRAM and peripheral (MCI interface, uses GPDMA) hang on AHB1). This is a very fundamental setting that I have no experience changing. Do you think this could help me out? I did a few tests with a negative result, but I feel that I have not exhausted it. Either way, can you think of another system setting that might influence this particular problem? I have, for now, ruled out bad traces and noise, as another controller (without an LCD) uses the same hardware design and its accesses to external RAM (MCI DMA) do not cause a crash.

  • I have found this reference myself, but unfortunately NXP do not explain the impact of modifying these registers. It is of course exceedingly hard to solve a problem that you do not fully understand with tools you do not fully understand...
    I believe this has something to do with how DMA/LCD DMA and the processor interact with the AHB bus, which changes slightly when temperature rises. I asked NXP to confirm that they have tested the LCD controller of the LPC2478 at these extreme temperatures but they have not replied yet.

  • If only you hadn't upset Master Zeusti.

  • Right now I am willing to use just about any help - Zeusti, that Steve figure from above, anything. It is either I solve this, or (assuming the system survives the weekend test!) CPU speed for the display has to go down to 64[MHz] !

  • It is either I solve this, or (assuming the system survives the weekend test!) CPU speed for the display has to go down to 64[MHz] !

    I quickread this thread and did not see it mentioned that the internal heat generated by the chip is proportional to the clock speed.

    NXP claim that the LPC2478 was tested at their labs at up to 105 degrees, and some applications allow for up to 120 degrees...!
    Under which operating conditions??

    Erik

  • Erik,

    Thanks for your comments. The answer to your questions is that I do not know: NXP did not elaborate, as far as I can tell, on the exact environmental conditions used to test the chip in any report I could get my hands on. I just don't have enough data to handle this properly...! And you are right: Going down to 64[MHz] might just mask a still existing problem. But at the moment, I don't have any other choice - product beta (thus, installation at the client site) phase is approaching.