Logging catastrophic software error on LPC1768

We are developing on the LPC1768 and using a number of subsystems: TCP, SD card, SPI.

For some reason that we don’t understand, we occasionally get a failure in the software that causes the system to reset itself.

My question is: what mechanism could we use to log the reason for the failure? We need to know at which precise moment the software failed so we can examine it afterwards and correct the issue.

Thank you.

  • Processors seldom reset themselves after a failure. They tend to enter an exception handler instead.

    Do you happen to use a watchdog? Then you would get a watchdog reset because your program isn't able to kick the watchdog anymore.
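
    On the LPC1768 the Reset Source Identification Register (RSID) tells you whether the last reset came from the watchdog. A rough sketch of checking it early in main() - verify the bit layout against the NXP user manual:

      #include "LPC17xx.h"               /* CMSIS device header */

      #define RSID_POR   (1u << 0)       /* power-on reset     */
      #define RSID_EXTR  (1u << 1)       /* external reset pin */
      #define RSID_WDTR  (1u << 2)       /* watchdog reset     */
      #define RSID_BODR  (1u << 3)       /* brown-out reset    */

      void check_reset_cause(void)       /* call early in main() */
      {
          uint32_t cause = LPC_SC->RSID;

          if (cause & RSID_WDTR) {
              /* last reset was a watchdog reset - record or report it here */
          }

          LPC_SC->RSID = cause;          /* writing 1s clears the sticky flags */
      }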

  • Output enough telemetry/diagnostic information about the state of the system via a serial port, connect it to a terminal, and have that record the information to a file.

    Increase the level of information as you focus in on the cause.
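
    A minimal leveled log macro is usually enough to start with; uart0_write() below is just a placeholder for whatever routine you already use to push bytes out of the serial port:

      #include <stdio.h>
      #include <string.h>

      /* placeholder: your existing routine that pushes bytes out of UART0 */
      extern void uart0_write(const char *buf, unsigned len);

      /* raise the level as you home in on the problem area */
      #define LOG_LEVEL  2          /* 0 = off, 1 = errors, 2 = info, 3 = trace */

      #define LOG(level, ...)                                            \
          do {                                                           \
              if ((level) <= LOG_LEVEL) {                                \
                  char _msg[96];                                         \
                  if (snprintf(_msg, sizeof _msg, __VA_ARGS__) > 0)      \
                      uart0_write(_msg, (unsigned)strlen(_msg));         \
              }                                                          \
          } while (0)

      /* usage:  LOG(2, "sd: write sector %lu\r\n", (unsigned long)sector); */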

    As Per suggests, it could be a watchdog, presumably you can see that as a cause at reset.

    Understand the flow of your code, and how it might have arrived at the failure condition.

  • Thanks everyone. The issue is that the failure is very random and difficult to reproduce, and we don’t have a serial connection plugged in all the time, so it is difficult to get a continuous and detailed information flow about the progress of the program.

    We could try to keep some kind of log in flash.

    The watchdog is an option but we also want to get to the bottom of the issue, so we don’t want to mask it.

    I know this is wishful thinking, but is there any black magic, i.e. a register that could contain information about what caused the failure?

    Best regards

  • Here is something you can try: let the program fail, then configure uv4 to attach to the running target while JTAG is enabled: don't reset on startup, don't update the binary on debugger startup, and something else (I don't remember...). Then load your .axf file using the console. If the binary is built with debug information, you will have a stack trace.
    Other than that, you can try using a trace buffer that is not cleared on startup and insert markers as the program runs, or send out the value of the link register after a crash via UART (assuming it still works...). There are many, many other ways.
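
    For the link-register idea: on the Cortex-M3 the faulting PC and LR are on the stack when the fault handler runs, so a handler along these lines can pull them out (GCC syntax; the armcc equivalent uses an embedded assembler shim, and log_fault() stands in for whatever output or storage you still have):

      #include <stdint.h>

      extern void log_fault(uint32_t pc, uint32_t lr);   /* placeholder: UART dump, no-init RAM, ... */

      /* the exception stack frame is r0, r1, r2, r3, r12, lr, pc, xpsr */
      void hard_fault_c(uint32_t *frame)
      {
          uint32_t stacked_lr = frame[5];
          uint32_t stacked_pc = frame[6];

          log_fault(stacked_pc, stacked_lr);
          for (;;) { /* stop here so a debugger can still attach */ }
      }

      /* shim: work out whether MSP or PSP was in use, then hand the frame to the C code */
      __attribute__((naked)) void HardFault_Handler(void)
      {
          __asm volatile(
              "tst lr, #4       \n"
              "ite eq           \n"
              "mrseq r0, msp    \n"
              "mrsne r0, psp    \n"
              "b hard_fault_c   \n");
      }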

  • I used this technique in the past with a ULINK, and even nowadays debugging a Raspberry Pi remotely (no JTAG) with gdb allows similar operations with remarkable success.

  • Don't load the application on startup! That's what I forgot.

  • If he can't attach something as simple as a serial cable and a laptop, hanging a JTAG pod on it isn't going to be an option. Presumably he also doesn't have a TRACE pod, or a means to log in situ.

    The watchdog wasn't suggested as a solution, but as a cause of a reset.

    Isn't this LPC part using a Cortex-M3? What kind of fault handler and instrumentation have you added here? The processor has a whole bunch of magic registers (M3 TRM) describing the cause and location of faults, but that information is transient and likely lost if you continue ploughing through the debris field.

    Flash is next to useless here: the amount of data you'd want to save is too high, and the slowness would impact the system. You could, however, put a small ring buffer in SRAM and log checkpoints into that, then in the ResetHandler record the reset cause (again a magic register documented by NXP) and the last block of trace data from the ring buffer.
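
    The checkpoint logging itself can be very light, something like this (the .noinit section name is an assumption - use whatever your scatter/linker file calls a region the startup code leaves alone):

      #include <stdint.h>

      #define TRACE_DEPTH 64u

      /* small ring buffer of 16-bit checkpoint ids, kept out of the zero-initialised data */
      static volatile uint16_t trace_buf[TRACE_DEPTH] __attribute__((section(".noinit")));
      static volatile uint32_t trace_idx              __attribute__((section(".noinit")));

      /* sprinkle TRACE(id) through the code; give each subsystem its own id range */
      #define TRACE(id)                                               \
          do {                                                        \
              trace_buf[trace_idx % TRACE_DEPTH] = (uint16_t)(id);    \
              trace_idx++;                                            \
          } while (0)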

    Mark your stacks, and have a clear idea of maximal utilization on the bench.

  • Another alternative is to configure a block of RAM as no-init.

    Keep a rotating log there.

    If you get an unexpected reset (your startup code should be able to see whether it was a power-on reset, an external reset, or something else), then set a flag telling your logging code not to add any more log entries, because the RAM region now contains important trace information for you to extract and investigate.

    So you might log "enter" and "leave" in critical chains, and you might log the stack pointer at critical positions.
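
    A sketch of the startup side of that, assuming a .noinit region in your linker/scatter file and the LPC17xx RSID register for the reset cause (check the exact bit meanings in the user manual):

      #include <stdint.h>
      #include "LPC17xx.h"

      #define LOG_MAGIC  0xB007106Du   /* arbitrary marker: "this region holds valid data" */
      #define LOG_SIZE   256u

      typedef struct {
          uint32_t magic;
          uint32_t head;                /* next slot to write                              */
          uint32_t frozen;              /* set after an unexpected reset                   */
          uint32_t entry[LOG_SIZE];     /* packed "enter"/"leave" ids, stack pointers, ... */
      } crash_log_t;

      static crash_log_t crash_log __attribute__((section(".noinit")));

      void crash_log_init(void)         /* call very early in main() */
      {
          uint32_t rsid = LPC_SC->RSID;

          if (crash_log.magic != LOG_MAGIC) {
              /* first power-up, or the region is corrupted: start clean */
              crash_log.magic  = LOG_MAGIC;
              crash_log.head   = 0;
              crash_log.frozen = 0;
          } else if ((rsid & 0x3u) == 0) {
              /* neither power-on nor external reset: treat it as unexpected,
                 keep the old contents as evidence and stop logging */
              crash_log.frozen = 1;
          }
          crash_log.head %= LOG_SIZE;
          LPC_SC->RSID = rsid;          /* clear the sticky cause flags */
      }

      void crash_log_put(uint32_t word)
      {
          if (!crash_log.frozen) {
              crash_log.entry[crash_log.head] = word;
              crash_log.head = (crash_log.head + 1u) % LOG_SIZE;
          }
      }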

    The problem with flash logging is that it just can't log with enough time resolution without you burning the flash to cinders very quickly. You are likely to need thousands of log entries every second, unless your program spends most of its time sleeping and waiting for some infrequent interrupt to wake it up. And when you don't know what fails, you can't just write something meaningful to flash every 10 minutes - what exactly is meaningful when the processor can go from "everything is well" to "let's crash and burn" in microseconds?

    The next thing you can do is try some defensive programming. If you still have code space and free MHz, you could add asserts, and you could recompute the expected state of variables and compare the current state with the recomputed state. If you do see a problem, then you know what to log.
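
    The asserts can feed the same log; for example, something like this, using the crash_log_put() from the sketch above and recording only the failing line number:

      #include "LPC17xx.h"             /* for __disable_irq() */

      #define ASSERT(cond)                                                   \
          do {                                                               \
              if (!(cond)) {                                                 \
                  crash_log_put(0xA55E0000u | (uint32_t)__LINE__);           \
                  __disable_irq();                                           \
                  for (;;) { /* hold the state for the watchdog/debugger */ }\
              }                                                              \
          } while (0)

      /* usage:  ASSERT(rx_len <= sizeof rx_buf); */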

    You should obviously make sure you have proper exception handlers that can turn on an SOS LED or similar and then just busy-loop, waiting for someone to get there and dump out as much information as possible - in this case all register and RAM contents.
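
    For those exception handlers, the Cortex-M3 fault status registers mentioned earlier in the thread can be copied somewhere safe before the busy-loop; a rough sketch (the LED code is left out, and the no-init placement is again up to your linker setup):

      #include "LPC17xx.h"

      /* survives a warm reset if placed in the same no-init region as the log */
      static volatile uint32_t fault_regs[4] __attribute__((section(".noinit")));

      void HardFault_Handler(void)
      {
          fault_regs[0] = SCB->CFSR;    /* configurable fault status              */
          fault_regs[1] = SCB->HFSR;    /* hard fault status                      */
          fault_regs[2] = SCB->BFAR;    /* bus fault address (when flagged valid) */
          fault_regs[3] = SCB->MMFAR;   /* memory management fault address        */

          for (;;) {
              /* toggle the SOS LED here and wait to be picked up by a debugger */
          }
      }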