Here is a link to a number of suggestions I have compiled for hardening firmware.
I'm pretty sure that a lot can be said about the list, so please post coding tips or links to pages with good information on software hardening.
iapetus.neab.net/.../hardening.html
"yes yes yes yes and once more - yes."
LOL. This is a non-threaded forum with threaded behaviour, which sometimes makes it very confusing to read.
My post is shown directly after yours, Tamir, but it was a reply not to your post but to an earlier one by John Linq.
I agreed with the content, whether it was in response to my post or not. My swift response was because I have recent experience with that approach: a trace buffer that I had built into the system helped localize a major bug that crippled RTX. A command to insert a log entry into the trace buffer itself was issued so often from interrupt context (due to bad logic) that the scheduler got blocked. It would have been absolutely impossible to find without such a facility.
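Since the topic is hardening, here is a minimal sketch of that kind of trace facility, in case it helps anyone. All names (trace_put, timer_now) are invented for illustration, and the index update is left unguarded; on a real target you would briefly disable interrupts or use an atomic increment, precisely because this gets called from interrupt context too:

#include <stdint.h>

extern uint32_t timer_now(void);        /* hypothetical free-running timer */

#define TRACE_ENTRIES 256u              /* power of two for cheap wrapping */

typedef struct {
    uint32_t timestamp;
    uint16_t event_id;                  /* what happened */
    uint16_t arg;                       /* optional detail */
} trace_entry_t;

static volatile trace_entry_t trace_buf[TRACE_ENTRIES];
static volatile uint16_t trace_head;    /* next slot to overwrite */

/* Must stay cheap and non-blocking - the bug described above was a
 * trace call issued so often from an interrupt that the scheduler
 * got blocked. */
void trace_put(uint16_t event_id, uint16_t arg)
{
    uint16_t i = trace_head++ & (TRACE_ENTRIES - 1u);   /* not atomic: guard on a real system */
    trace_buf[i].timestamp = timer_now();
    trace_buf[i].event_id  = event_id;
    trace_buf[i].arg       = arg;
}

After a crash or hang, the buffer can be read out over a debugger or serial port, and the newest-to-oldest entries show what the system was doing right before things went wrong.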
The lack of bounds checking will not make a program fail. It will only reduce the probability that an error gets caught.
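A contrived C fragment (all names invented) to illustrate the point: the out-of-bounds write below does not necessarily fault. It may just silently clobber whatever the linker happened to place next to the array, so the program keeps running with corrupted state.

static unsigned char blink_pattern[4] = { 0x01, 0x02, 0x04, 0x08 };
static unsigned char link_state;    /* may happen to live right after the array */

void set_pattern(unsigned char idx, unsigned char value)
{
    /* No bounds check: idx == 4 may overwrite link_state instead of
     * crashing, so the error goes uncaught until something downstream
     * misbehaves. */
    blink_pattern[idx] = value;
}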
Another thing is that not all parts of a firmware are equally important. A single LED flash being longer or shorter isn't critical. But having the state machine handling the LEDs accidentally start to use a different blink pattern would be problematic. Instead of a single-iteration error, you may get a device that blinks "link ok" but doesn't really have any connection to a server. Or maybe it flashes "internal error - send me to service" when there is no hardware error.
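One common hardening measure for exactly that case is a defensive default branch in the state machine, so a corrupted state variable fails loudly instead of silently blinking the wrong pattern. A minimal sketch (names invented):

typedef enum { LED_OFF, LED_LINK_OK, LED_ERROR } led_state_t;

static led_state_t led_state = LED_OFF;

void led_step(void)
{
    switch (led_state) {
    case LED_OFF:      /* ...drive the "off" pattern... */      break;
    case LED_LINK_OK:  /* ...drive the "link ok" pattern... */  break;
    case LED_ERROR:    /* ...drive the error pattern... */      break;
    default:
        /* The state variable holds a value it can never legally have:
         * assume corruption and fail visibly rather than silently. */
        led_state = LED_ERROR;
        break;
    }
}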
When you design the software, you must complement the requirements specification with estimated costs. What is the cost (in goodwill, repair costs, human danger, ...) if a specific requirement gets broken? If an LED stops lighting and the hardware can't detect this, the cost may be insignificant. Not getting a green "ok" might in the worst case result in someone replacing the broken unit. Having the LED lit when it shouldn't be might, on the other hand, signal that no human being is within reach of an industrial robot. Not being able to handle the reading from a mechanical limit sensor may destroy a $1,000,000 piece of hardware. When the failure cost is high (either from a single unit failure, or possibly multiplied by a huge number of produced units), you will have to spend more time looking at the hardware and software to guard yourself from a dangerous situation or a mass recall.
Most embedded equipment is not used for critical applications, but almost any failure of an embedded device has some form of cost associated with it. That cost - even if small - should be considered throughout the full development process. That is one of the reasons why Cpt. Vince noted that the actual coding is a limited part of the development cycle.
But you are correct - a very significant part of an embedded program may relate to error conditions. Error conditions are also very often forgotten in requirements specifications. But it is often the error handling that differentiates a good product from a bad one. It doesn't matter whether you need to press one or two buttons to activate a feature on your VCR, but you will remember the VCR that requires you to regularly unplug it, or the VCR that accepts your programming but forgets to complain that there is no tape loaded. Or the VCR that ignores the end-of-tape sensor and shreds your tape.
Ling-hua Tseng is tinlans's real name. I remember that he is good at C/C++ programming and compiler technology, but he should be younger than Per Westermark.
The greenhouse effect, climate change, and extreme weather all have an impact on human society. In the old days, the weather changed mainly with the seasons; now it changes very quickly and very often. Sometimes the temperature difference within a single day can be 10 degrees Celsius. I believe this may make past reliability test conditions inadequate. I am curious: has any amendment been made to the existing reliability test standards, for example MIL-STD?
MIL-STD-883 does contain thermal shock testing. I guess that is what you are getting at (?). [I don't remember what the actual specs are, and don't have it handy here, but this is from my memory of it].
Think about it for a second: the aircraft/missile/thong/whatever leaves the earth at a balmy 74 degrees F, heats up during the flight, and/or cools off to the high-altitude temperatures. Then it suddenly descends back to earth again. Lands, and all is good.
It does this within minutes, or seconds.
So a "sudden" 10 deg C per-day change due to the global thermal cycles, isn't going to stress out these widgets that are MIL/DoD rated.
Net effect: I don't think there is a specific revision being made due to any day-to-day temperature change; it's already in there, with +120 to -40 to +120 to -40 deg cycle changes within minutes, not days.
--Cpt. Vince Foster 2nd Cannon Place Fort Marcy Park, VA
"Normal" automotive environment tests are also way harder than what the nature can manage (if you don't try to temperature-cycle your equipment near a vulcano or having your equipment hit by lightning).
For automotive use, you may go from room temperature in a garage to either extreme cold or extreme heat within minutes. And while a rocket may only have to survive a single heat cycle, and high-end jets get picked to pieces quite regularly, the electronics of your car will be virtually untouched until you scrap the car or the electronics fails and has to be replaced.
You don't have to limit yourself to just the engine electronics. Think about the standard car stereo - maybe a pleasant +22°C when you drive the car, and then down to -30°C or up to +90°C when you leave the car out in the winter or in direct sunlight. Most companies that work with this kind of equipment have quite impressive climate chambers for cycling temperature, moisture, ... and the tests are done with quite big and quite fast cycles, unless running long-time tests at one of the extremes.
Hi Cpt. Vince and Per,
Many thanks for your explanations. They are very logical.
Due to my limited English ability and professional knowledge, I only mentioned the temperature factor, but in fact something else should be considered. I did some Google searching and found that what I had in mind is more like an aging test. (It appears that the American Society for Testing and Materials owns a lot of data about aging tests.) It is hard to say whether global climate change has an impact on materials aging, but it does have an impact on human health.
I am not sure what you mean exactly, but of course there is an issue of data retention in non-volatile memory such as EEPROMs. The effect of aging on the data stored in them is normally very well documented in the respective data sheets.
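A common firmware-side guard against exactly that is to store a checksum next to the data and fall back to safe defaults when it no longer matches. A hedged sketch, where eeprom_read() and the parameter layout are invented:

#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

extern void eeprom_read(uint16_t addr, void *dst, size_t n);  /* hypothetical driver call */

typedef struct {
    uint16_t blink_rate_ms;
    uint16_t sensor_offset;
    uint16_t crc;                 /* checksum over the fields above;
                                     assumes no padding before crc */
} params_t;

/* CRC-16/CCITT-style checksum, bitwise, for illustration. */
static uint16_t crc16(const uint8_t *p, size_t n)
{
    uint16_t crc = 0xFFFFu;
    while (n--) {
        crc ^= (uint16_t)*p++ << 8;
        for (int i = 0; i < 8; i++)
            crc = (crc & 0x8000u) ? (uint16_t)((crc << 1) ^ 0x1021u)
                                  : (uint16_t)(crc << 1);
    }
    return crc;
}

/* Returns true if the stored copy still checks out; otherwise the
 * caller should fall back to safe defaults. */
bool params_load(params_t *out)
{
    eeprom_read(0u, out, sizeof *out);
    return crc16((const uint8_t *)out,
                 sizeof *out - sizeof out->crc) == out->crc;
}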
Regarding: Materials Aging
Another thought ...
I have not seen any mention of firmware ageing.
Do loops get slower as they get older?
No, the loops don't get slower. But it is a well known fact that the quality of the compiled binaries or untouched source code degrades with time.
Code that has worked for years will suddenly start to misbehave. Code that has passed validation tests will suddenly stop passing them, so the age factor is important.
That is why so many developers are so very scared of inheriting old and trusted code. The manager says: You don't have to worry - we haven't needed to release an update in five years. The next week you get two customers with problems. Within a couple of months, a significant percentage has a problem. And despite not having touched the code, the new developer gets the blame since the problems started after he took over the responsibility. Now is the time for the poor *** to find out that the original compiler was not stored in the source code repository, or has a license registration method making it impossible to reinstall, and that the last code changes five years ago somehow weren't committed...
An aging plastic IC package might crack and lead to open/short problems, I guess.
There are a huge number of problems you can get with hardware.
- Oxidation on socketed components or connectors - in some cases melt-downs because the contact resistance gets too high.
- Wet capacitors drying out (normally from high temperature).
- Tantalum capacitors exploding because they have been run out-of-spec.
- Metal fatigue in the bonding wires inside the chips.
- Electromigration in chips, power transistors or switch regulators that have run at high currents and high temperature for a long time.
- Whiskers on solder joints.
- Metal fatigue in solder joints.
- Factories that haven't baked components, letting moisture crack the chip.
- ESD damage (damage done in the factory can take months or years to show up as a failure).
- Damaged conformal coating, resulting in leakage currents or possibly PCB traces corroding until they break.
...
The problem is deciding which hardware failures should be detectable and what work-arounds there should be in the firmware. Is it enough to warn about a problem, or is the failure critical, requiring the unit to "brick" itself? Should there be redundant hardware? What is the probability of producing incorrect results? What will happen if incorrect results are produced? What will happen if no results at all are produced? What is required for certification?
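To make the trade-off concrete, here is one hedged sketch of such a decision in code: a plausibility check on a limit-sensor reading, with an explicit safe state instead of a wrong result. All names and limits are invented:

#include <stdint.h>
#include <stdbool.h>

extern void enter_safe_state(void);   /* hypothetical: stop the actuator */

#define SENSOR_RAW_MIN  100u    /* below this: broken wire or short */
#define SENSOR_RAW_MAX 3900u    /* above this: open input or short  */

static bool sensor_fault;

/* If the raw value is outside the physically possible range, assume a
 * hardware fault and latch a safe state rather than produce a wrong
 * result: no result is better than an incorrect one. */
bool read_limiter(uint16_t raw, uint16_t *position)
{
    if (raw < SENSOR_RAW_MIN || raw > SENSOR_RAW_MAX) {
        sensor_fault = true;     /* warn, or "brick" into safe mode */
        enter_safe_state();
        return false;
    }
    *position = raw;
    return true;
}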