Here is a link to a number of suggestions I have compiled for hardening of firmware.
I'm pretty sure that a lot can be said about the list, so please post coding tips or links to pages with good information on software hardening.
iapetus.neab.net/.../hardening.html
Please enlighten us, Mr. Sprat: How do you know that? Do you indeed carry the gift of telepathy (as Erik once suggested...) or did you use your amazing deduction skills to infer the above?
The latter, Mr. Michael, the latter.
I'm afraid that telepathy does not exist.
Hi Per,
May I be allowed to translate a few small parts of your "Some concepts for hardening embedded software" into Chinese, and to post the translated parts, with a very short Chinese introduction, to a BBS forum in Taiwan (with a link to the original source and your name)? This is to introduce your document to my region.
LOL - the first actual comment about the page will be in a language I can't read :)
Yes, you may translate the text. Name + link to the English text will be fine.
I should update the page with some form of usage/license information to make it easier to make use of the text.
A Traditional Chinese Introduction to Per Westermark's "Some concepts for hardening embedded software"
www.ptt.cc/.../M.1239466995.A.F65.html
Per,
Thanks for an interesting article.
John,
Thanks for the translation. I hope to persuade my (Chinese) wife to read it. Then, maybe, she'll start to understand what my job is about.
But at the moment, it's throwing up a 404 error :(
Which link is giving 404? Both my links are up and working - tested from remote proxy.
And the link John Linq posted is also working ok.
It was John Linq's link that was giving the 404; but it's OK now.
Thanks.
Just a link about problems with bad firmware: www.theinquirer.net/.../seagate-barracudas-7200-11-failing
At least two, but probably three, TB+ disks failed so fast I didn't even have time to transfer the information to the empty 1.5TB WD disks I already had lying around.
These bright guys seem to have intentionally bricked the units to protect the hardware, but at the same time made it impossible to update to fixed firmware, and Seagate will charge full recovery fees for restoring the data from fully functioning hardware.
I think it's time to update my backup program to not only count number of copies and geographic separation but also media brand/model.
The ability to accept new firmware should be kept at almost any cost. Bricked units don't exactly help with the goodwill.
Some comments from Taiwan. (in Traditional Chinese)
www.ptt.cc/.../M.1239467345.A.3A8.html
Will try to persuade them to join the discussion here in KEIL's forum.
1. sunneo says he implements this kind of hardening with the operating system and multi-layer ISRs. (I don't understand.)
2. tinlans says the quoted code is "unreachable code" to the compiler, and will in most cases be removed by the compiler when optimization is enabled. He suggests implementing this kind of hardening in hardware.
for (idx = 0; idx < BUF_SIZE; idx++) {
    ...
    if (idx >= BUF_SIZE) {
        // loop variable has for some reason been corrupted. Take proper
        // action.
        perform_corrective_action();
    } else {
        buf[idx] = new_data;
    }
}
Mmmmm.....
My English ability and Technical skills are not good; hope my translation is not very incorrect/improper.
If you have a processor with MMU, then you can set up guard pages on either side of arrays, and have the processor generate an exception if your code tries to access any of these pages.
This is similar to how most full-size operating systems (not RTOS) automatically grow stacks.
But a very significant percentage of embedded equipment doesn't have the luxury of an MMU.
The ability of the compiler to do dead-code elimination very much depends on the data declarations and on the full contents of a loop. A test for a negative value on an unsigned loop variable can be trivially deduced to be meaningless by a compiler. Tests for upper bounds can be eliminated if the compiler can see that a write to the loop variable is followed by multiple identical tests, where one or more of the tests come after the break condition of the loop. The later tests for the same value would then be expected to evaluate to the same result, which in this case means the guarded code is not reachable.
This is a reason why a sw design should avoid aliased accesses to variables, where two different pointers, or a pointer and a direct access may modify the same variable - the compiler may decide that it knows the contents of a variable even when modified. The program gets tested in a non-optimized debug build and then fails in a release build with full optimization, and then the compiler gets blamed.
This is also a reason why a lot of thought should be put into the use of the volatile keyword. It affects the compiler's ability to decide what is dead code, but will also make an aliased access take effect. A program with an aliasing bug may run perfectly because the compiler caches the relevant data in registers, but a trivial change to the code may exhaust the number of registers. A change of compiler version or compilation options may give the same result even without any code changes. The biggest disadvantage with volatile is of course the slowdown of the code and the increased load on the memory subsystem.
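As a minimal sketch of the volatile point above, applied to the quoted loop: with a volatile index, every read of idx is a real memory access, so the compiler cannot conclude that the second bounds test repeats the loop condition and remove it. The names fill_buffer and corruption_count and the value of BUF_SIZE are hypothetical, chosen only for illustration.

```c
#include <stddef.h>

#define BUF_SIZE 16u                 /* hypothetical buffer size */

static unsigned char buf[BUF_SIZE];
static unsigned corruption_count;    /* hypothetical: number of detected index faults */

/* Fill buf with new_data.
 * Returns 0 on success, -1 if the loop index was found out of range. */
int fill_buffer(unsigned char new_data)
{
    /* volatile: the compiler must re-read idx from memory at every use
     * and may not assume that the value it tested earlier is unchanged. */
    volatile size_t idx;

    for (idx = 0; idx < BUF_SIZE; idx++) {
        size_t i = idx;              /* one fresh read; test and use the same copy */

        if (i >= BUF_SIZE) {
            /* Index corrupted between the loop test and this read. */
            corruption_count++;
            return -1;
        }
        buf[i] = new_data;
    }
    return 0;
}
```

The cost is the one already mentioned: every access to idx goes to memory, so this is best reserved for loops where the extra check is worth the slowdown.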
Not sure about your multi-layer interrupts, but an ISR is expected to be short and fast, so it should not contain any delays or big code constructs. If an interrupt requires a lot of work to be done, then you normally let the ISR trigger an event and have either an RTOS task or possibly a lower-priority interrupt that allows nesting perform the actual work. On Linux, for example, you have the concept of tasklets that you may use to perform the real work after having been triggered by the ISR.
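A minimal sketch of the "short ISR, deferred work" pattern described above, assuming a bare-metal main loop rather than an RTOS. All names (adc_isr, process_pending, rx_pending, rx_sample) are hypothetical, and the ISR takes the sample as a parameter only so the sketch can run on a PC; a real ISR would read it from a hardware register.

```c
#include <stdint.h>

static volatile uint8_t  rx_pending;     /* set by the ISR, cleared by the main loop */
static volatile uint16_t rx_sample;      /* data captured by the ISR */
static uint32_t          processed_total;/* result of the slow work */

/* Hypothetical ISR: only captures the data and signals the event. */
void adc_isr(uint16_t raw)
{
    rx_sample  = raw;
    rx_pending = 1;
}

/* Called from the main loop (or an RTOS task).
 * Returns 1 if an event was processed, 0 if nothing was pending. */
int process_pending(void)
{
    uint16_t sample;

    if (!rx_pending) {
        return 0;
    }
    /* On real hardware, disable the interrupt around these two lines so
     * the ISR cannot update rx_sample between the copy and the clear. */
    sample     = rx_sample;
    rx_pending = 0;

    processed_total += sample;       /* stands in for the lengthy processing */
    return 1;
}
```

The main loop would simply call process_pending() on every pass, so the time-consuming work never runs in interrupt context.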
Another thing is that you may have an ISR separated into a top-half and a bottom-half, where the top-half runs with interrupts disabled and the bottom-half enables interrupts. The first part of the ISR is then a form of critical section, guarding from interference from new interrupts.
But this is a separate issue from having a stuck interrupt, where you either get no interrupts at all, or you instantly get a new interrupt as soon as the ISR ends. If the interrupt state machine in the processor gets into an invalid state, it may be enough to reinitialize the interrupt source, but you might just as well need to reset the processor. A level-triggered interrupt from a broken sensor would require the interrupt source to be deactivated until the sensor is fixed, possibly polling or inverting the logic of the interrupt input until the stuck condition goes away. A processor that can't invert the polarity of a level-triggered interrupt can make use of an XOR gate between the external hardware and the interrupt input, if polling isn't acceptable.
But edge-triggered interrupts can also get into trouble because of external hardware. An external failure such as the loss of a pull-up resistor can result in huge numbers of potentially very high-priority interrupts that may starve the main application or lower-priority interrupts.
If you have a timer tick that clears an event counter, and an event interrupt that increments the counter, then the event interrupt can detect a counter that gets incremented too much. Either the timer interrupt has stopped working (or is starved because it has lower priority), or there are too many events within a time window. This is an example of using watchdogs for individual interrupt sources. It is also an example of why it is problematic to kick the watchdog from an ISR.
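The per-interrupt watchdog described above could be sketched like this. The limit MAX_EVENTS_PER_TICK and all function and variable names are hypothetical; what to do once the fault is flagged (disable the source, fall back to polling, reset) is left to the application.

```c
#include <stdint.h>

#define MAX_EVENTS_PER_TICK 10u      /* hypothetical limit per timer period */

static volatile uint8_t event_count; /* incremented by event ISR, cleared by tick */
static volatile uint8_t event_fault; /* set when the limit is exceeded */

/* Periodic timer interrupt: opens a new time window. */
void timer_tick_isr(void)
{
    event_count = 0;
}

/* Event interrupt: counts itself and checks the window limit. */
void event_isr(void)
{
    if (event_count >= MAX_EVENTS_PER_TICK) {
        /* Either the timer tick has stopped (or is starved), or the
         * source is producing too many events. Real code would disable
         * this interrupt source here. */
        event_fault = 1;
        return;
    }
    event_count++;
    /* ... normal, short event handling ... */
}
```

Because the check lives in the event ISR itself, it still fires when the timer interrupt is the part that has failed.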
This is a reason why a sw design should avoid aliased accesses to variables, where two different pointers, or a pointer and a direct access may modify the same variable - the compiler may decide that it knows the contents of a variable even when modified.
void timers_v1(int *timer1, int *timer2, int *step)
{
    *timer1 += *step;
    *timer2 += *step;
}
This compiles inefficiently (the value pointed to by 'step' is loaded twice, once before each increment), as the compiler cannot know in advance that 'step' and 'timer1' do not point to the same spot in memory.
But this works better:
void timers_v2(int *timer1, int *timer2, int *step)
{
    int temp = *step;   /* read the step value once */
    *timer1 += temp;
    *timer2 += temp;
}
Just an addendum about the use of an MMU.
Running separate tasks protected from each other is a great way of getting separation - one task can't overwrite the data of another task.
But this is not the same as catching a buffer overflow.
Guard pages can catch an out-of-bounds access. But the MMU normally works with pages that may be 4 or 8kB large.
An invalid access that is more than one page outside the array may skip the guard page and access other variables owned by the same task, without this access being caught. Only an explicit range test can catch this.
And an array that does not completely fill a number of memory pages will have an unprotected zone between the end of the array and the guard page. Testing of the code may conclude that no guard-page access happens, and fail to notice that the array had an off-by-one access (possibly a read of random data, possibly a write of data to a location that will not at a later time be copied to EEPROM for non-volatile storage).
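A minimal sketch of the explicit range test mentioned above, which catches both the off-by-one access and the far-out-of-range access regardless of page size. The buffer log_buf, its size, and the function log_write are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define LOG_SIZE 100u                /* hypothetical: far smaller than an MMU page */

static uint8_t log_buf[LOG_SIZE];

/* Store value at log_buf[idx].
 * Returns 0 on success, -1 if idx is outside the array. */
int log_write(size_t idx, uint8_t value)
{
    if (idx >= LOG_SIZE) {
        /* Caught here, whether the index is one past the end or
         * several pages away. */
        return -1;
    }
    log_buf[idx] = value;
    return 0;
}
```

Since idx is unsigned, the single comparison also rejects a "negative" index that has wrapped around to a large value.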
In the end, an MMU is very valuable but should not be seen as a magic solution to catching problems. And on the Keil forum, most users don't even have an MMU to activate.
The first and most important line of defense is the developer - making sure every line of code is well designed, and running on a sound hardware design.
The second line of defense is defensive programming, where the code contains guard clauses to catch invalid states, out-of-range values, ...
An MMU would only form a third line of defense. When the MMU catches an error, the problematic task has probably already done a lot of mischief.
In your case, it may even be required to use the second construct to make sure that both timers are always incremented by the same amount.
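To illustrate why the second construct may be required, here is a small sketch based on the two functions posted earlier in the thread (the second is called timers_v2 here to keep the names distinct). It assumes the caller passes the same address as both 'timer1' and 'step', which is an illustrative case and not something stated in the original post.

```c
/* First version: *step is read again after *timer1 has been written. */
void timers_v1(int *timer1, int *timer2, int *step)
{
    *timer1 += *step;
    *timer2 += *step;
}

/* Second version: the step value is read once, before any write. */
void timers_v2(int *timer1, int *timer2, int *step)
{
    int temp = *step;
    *timer1 += temp;
    *timer2 += temp;
}
```

With timer1 = 10 and timer2 = 20, and 'step' pointing at timer1, timers_v1 leaves the timers at 20 and 40, because the second increment uses the already updated value. timers_v2 leaves them at 20 and 30, so both timers advance by the same amount.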
Aliasing is not a concept that should be forbidden, but something the developer must at all times design for and document.
The trivial case of a required aliasing is inserts/removes to a sorted array, and a reason for the existence of the memmove() function that handles overlapping memory regions.
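A minimal sketch of such an insert, using memmove() because the source and destination ranges overlap within the same array. The function name sorted_insert and its parameters are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Insert value into the ascending array arr, which holds *count
 * elements and has room for cap elements.
 * Returns 0 on success, -1 if the array is already full. */
int sorted_insert(int *arr, size_t *count, size_t cap, int value)
{
    size_t pos = 0;

    if (*count >= cap) {
        return -1;
    }
    /* Find the first element that is not smaller than value. */
    while (pos < *count && arr[pos] < value) {
        pos++;
    }
    /* Shift the tail up by one slot. Source and destination overlap,
     * which memmove() handles and memcpy() does not. */
    memmove(&arr[pos + 1], &arr[pos], (*count - pos) * sizeof arr[0]);
    arr[pos] = value;
    (*count)++;
    return 0;
}
```

Starting from {1, 3, 5} in an array with room for five elements, inserting 4 and then 0 gives {0, 1, 3, 4, 5}, and a further insert is rejected with -1.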
But the above two situations are not problematic for the compiler. If your pointer was to a variable that could change during the function call, then you should have used the volatile keyword to force multiple pointer accesses. Without volatile, it will be up to the compiler to decide if it should read 'step' once or twice.
The problematic and hurtful form of aliasing is when you have a global variable that you sometimes access directly, and sometimes through a pointer. A function that gets this pointer as a parameter will not know that writes through the pointer will change the value of the global variable, so it may decide to cache the contents of the global variable. And it will not know that a write to the global variable will change the value accessible through the pointer.
Note that your example isn't really aliasing.
You are making a copy of a value.
But the concept of variable aliasing is when you have more than one access path to a variable. And the problem comes when the compiler isn't aware of these multiple paths.