
How do I Validate the code memory at run time?

Hai all,
I am working on saftey critical application,so my requirement is to validate the code memory at run time.. Is there any method to validate the code during runtime...calculating check sum at runtime and verifying is possible and effective ??...

Parents
  • Computing the checksum of the code area is no different when the program is running than when the application boots.

    The only limitation is that you must limit the amount of processor time you spend, so that you still do all the critical things within the required times.

    Checksumming (you normally don't use a simple checksum, but a CRC, or possibly MD5, SHA-1, or similar if the processor is up to it) may detect deteriorating flash data. But the big problem is how you will know. If there is a bit error in the code area, the checksum will produce the wrong result. But if the bit error sits at the wrong location - in the checksum routine itself or in the final comparison - your check may report that everything was OK because the final if test is broken. Or, more probably, your program will fail somewhere else - hopefully ending up with a watchdog reset.

    So in the end, regularly checksumming the code space may help a bit. But it is only a minor part of what you have to do. Besides, the risk is probably higher that the device fails because of a bug or because it gets hit by a big enough electrical spike.
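    A runtime CRC pass of this kind can be spread over many calls so it never eats into your deadlines. The sketch below is a minimal illustration of that idea, assuming a CRC-16/CCITT check; the region bounds and expected value are placeholders - in a real build the expected CRC would be stored in flash by the build tools.

    ```c
    /* Sketch: incremental CRC-16 over the code area, a bounded number of
     * bytes per call, so each pass steals only a small slice of CPU time. */
    #include <stdint.h>
    #include <stddef.h>

    #define CHUNK_BYTES 64u   /* work per call - tune to your deadline budget */

    static uint16_t crc16_step(uint16_t crc, uint8_t byte)
    {
        crc ^= (uint16_t)byte << 8;
        for (int i = 0; i < 8; i++)
            crc = (crc & 0x8000u) ? (uint16_t)((crc << 1) ^ 0x1021u)
                                  : (uint16_t)(crc << 1);
        return crc;
    }

    /* Returns 1 when a full pass completed and matched, -1 on mismatch,
     * 0 while a pass is still in progress. */
    int code_crc_poll(const uint8_t *start, size_t len, uint16_t expected)
    {
        static size_t   offset = 0;
        static uint16_t crc    = 0xFFFFu;

        size_t n = len - offset;
        if (n > CHUNK_BYTES) n = CHUNK_BYTES;

        for (size_t i = 0; i < n; i++)
            crc = crc16_step(crc, start[offset + i]);
        offset += n;

        if (offset < len)
            return 0;                 /* pass not finished yet */

        int ok = (crc == expected) ? 1 : -1;
        offset = 0;                   /* restart for the next pass */
        crc    = 0xFFFFu;
        return ok;
    }
    ```

    You would call `code_crc_poll()` from the main loop or a low-priority task, and only treat a `-1` result as a detected flash failure.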

    So you have to figure out how you can not only compute results, but also validate the computed results. If you compute C = A + B, it may be good to verify somewhere else in the code that C - A matches B, to make sure that you can trust that the value C hasn't been damaged. And you need to think about complementing the hardware watchdog with a number of software watchdogs, where the hardware watchdog doesn't get fed/kicked without proof that all your critical tasks are still being run and still producing reasonable results.
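    Both ideas can be sketched in a few lines. This is only an illustration, assuming three critical tasks; the task names and the stand-in `hw_watchdog_kick()` are made up for the example, and the real kick would poke the watchdog hardware.

    ```c
    /* Sketch: software watchdogs layered under the hardware watchdog,
     * plus a redundant-computation check in the style of C = A + B. */
    #include <stdint.h>

    #define TASK_SENSOR  (1u << 0)
    #define TASK_CONTROL (1u << 1)
    #define TASK_COMMS   (1u << 2)
    #define ALL_TASKS    (TASK_SENSOR | TASK_CONTROL | TASK_COMMS)

    static volatile uint32_t alive_bits;
    static uint32_t hw_kick_count;        /* stand-in for the real register poke */

    static void hw_watchdog_kick(void) { hw_kick_count++; }

    /* Each critical task calls this only after its own sanity checks pass. */
    void task_report_alive(uint32_t task_bit) { alive_bits |= task_bit; }

    /* Verify the inverse before trusting the result.
     * Returns 1 and stores the sum on success, 0 if C looks damaged. */
    int checked_add(int32_t a, int32_t b, int32_t *out)
    {
        int32_t c = a + b;
        if (c - a != b)
            return 0;
        *out = c;
        return 1;
    }

    /* Called from a timer tick: the hardware watchdog is fed only with
     * proof that every critical task checked in during the last period. */
    int watchdog_service(void)
    {
        if (alive_bits != ALL_TASKS)
            return 0;                     /* missed check-in: let hardware reset us */
        alive_bits = 0;                   /* demand fresh proof next period */
        hw_watchdog_kick();
        return 1;
    }
    ```

    The point of clearing `alive_bits` after every kick is that a task that has silently died can delay the hardware reset by at most one watchdog period.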

    No one has yet managed to show a working recipe for producing really safe code. All you can do is spend time trying to identify risks and then figure out how hardware or software can be used to reduce those identified risks.

    The big problem is that to keep bugs down, you have to think KISS (Keep It Simple, Stupid), but at the same time all the extra validation code will greatly increase the complexity. You end up trying to balance two incompatible design rules.

    And then you still have the problem of what to do if you detect an error. Is it an error in the hardware, or the result of an anomalous event? Or is it a false error caused by bugs in the validation code? If the accelerator gets stuck in a car, you might want to kill the engine. But on the other hand, if "kill the engine" is the default error response, then you may kill a driver by cutting the engine in the middle of an overtaking maneuver. So it may not be enough for a safety-critical application to just detect a problem. It may have to try to figure out the safest response to the failure. But how do you do that when you don't trust the hardware or the software?


Children
  • There is power-on testing that validates the firmware, hardware, and other requirements of the safety criteria. There is also continuous testing of the system, and commanded testing of the system.

    A power-on Built-In Test (BIT) (or Basic Internal Test in some circles) is performed to ensure that the system is ready and safe to operate.

    A continuous BIT (CBIT) is performed as part of the regular duties of the software. This can monitor supply voltages, validate I/O states, do checksums, etc.

    In addition, some systems can be designed to do a more thorough commanded BIT that performs either a specific type of testing or a whole battery of tests.

    By designing in these types of testing, you can increase your reliability (or at least track it), and/or avoid a larger failure through early detection.

    The "what to do if it fails" question is highly application-dependent. But I'm sure that this was covered in your system design plan anyway. (A tinge of sarcasm there.)

    Since the boot-time BIT takes time, you'll have to balance it against the "power-on to first action" time allowed in your system. The CBIT, on the other hand, can be done more leisurely and split up between tasks or during "idle" states.
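    One common way to split the CBIT up like that is to run exactly one check per idle-loop pass. A minimal sketch, assuming three placeholder checks (the real ones would read an ADC, loop back an I/O pin, and advance a code-CRC pass):

    ```c
    /* Sketch: rotating one CBIT check per call so continuous built-in
     * test never hogs the processor. Failures are latched in a mask. */
    #include <stdint.h>

    typedef int (*bit_check_fn)(void);   /* returns 1 = pass, 0 = fail */

    static int check_supply(void) { return 1; }  /* placeholder: ADC vs. limits   */
    static int check_io(void)     { return 1; }  /* placeholder: output loopback  */
    static int check_crc(void)    { return 1; }  /* placeholder: one CRC chunk    */

    static bit_check_fn cbit_table[] = { check_supply, check_io, check_crc };
    #define CBIT_COUNT (sizeof cbit_table / sizeof cbit_table[0])

    static uint32_t cbit_fail_mask;      /* one latched bit per failed check */

    /* Run exactly one check per call; returns the latched failure mask. */
    uint32_t cbit_poll(void)
    {
        static unsigned next;
        if (!cbit_table[next]())
            cbit_fail_mask |= 1u << next;
        next = (next + 1u) % CBIT_COUNT;
        return cbit_fail_mask;
    }
    ```

    Latching failures rather than clearing them means an intermittent fault seen once is still reported, which is usually what you want for tracking reliability.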

    --Cpt. Vince Foster
    2nd Cannon Place
    Fort Marcy Park, VA