This blog post was authored by Emre Ozer, Principal Research Engineer at Arm Research
Lockstepping is an error detection technique that executes copies of the same program on redundant hardware and checks the outputs of the redundant hardware for divergence. CPU-level lockstepping replicates only the CPUs: all levels of cache are shared and protected by ECC, and a lockstep checker detects any divergence by comparing the CPU output ports at every cycle. CPU-level lockstepping has been adopted by the safety-critical industry, particularly in automotive microcontrollers such as TI Hercules TMS570, Infineon AURIX, NXP Qorivva MPC5643 and STM SPC56EL70L3, for its favorable properties, including high error coverage, software transparency and relatively low overheads (only the CPUs are replicated). At the recent MICRO-51 conference, we presented a novel technique: error correlation prediction for lockstep processors in safety-critical systems.
Functional safety is an important concept in safety-critical systems, and is defined as the absence of unacceptable risk due to hazards caused by malfunctioning electronic systems. The end goal of functional safety is to prevent death or injury due to a failure in electronic systems (e.g., in automobiles, trains and aircraft). Safety-critical systems must provide functional safety to deal with both systematic faults (i.e., design and manufacturing faults) and random faults (i.e., transient and permanent faults). A safety integrity level is assigned to an electronic system to measure the level of risk for a safety measure used by the system. ISO 26262, the functional safety standard for automotive, defines four automotive safety integrity levels (ASILs). Of these, ASIL-A is the least stringent and ASIL-D is the most stringent. For example, chassis, powertrain, power steering and anti-lock braking systems (ABS) need to provide ASIL-D capability, while body systems often provide ASIL-A/B. CPU-level lockstepping is mainly used in ASIL-D-capable electronic control units (ECUs) because it is capable of achieving the most stringent safety level.
A safety-critical system must guarantee that the system reaches a safe state to prevent hazards (e.g. fatal crashes) upon detecting a divergence. When a fault occurs and manifests as an error, the lockstep error checker must detect and handle it within a strict deadline. Functional safety standards provide guidelines in system design to handle random errors (both soft and hard) in order to meet certain safety integrity levels, regardless of how rare these errors are. The time interval from fault occurrence to error detection is called the error detection time. The interval from the time at which the lockstep error is detected to the safe state is called the lockstep error reaction time, and missing this deadline can be fatal.
Lockstep error reaction time is statically provisioned for the worst-case error handling scenario in safety-critical systems targeting the most stringent safety level, and must not be violated. Thus, a detected lockstep error is always assumed to be 'hard', even though soft errors are more common than hard errors. The safety-critical system runs online diagnostics using logic or software built-in self-test (BIST) to find any hard error. This is an invasive procedure that renders the system unavailable, and it can take from microseconds to milliseconds. For example, software BIST (SBIST) is a software-based online error detection technique that uses special software test libraries (STLs), written in the instruction set of the CPU, to generate test patterns and run them on the CPUs in order to detect stuck-at faults. Normally, an STL is created for each unit within the CPU to ease development. If the system uses SBIST as its online diagnostics, SBIST runs the STL of each CPU unit in a static or random order to find any hard error. If a hard error is found in a particular CPU unit, the diagnostics stop. Otherwise, SBIST continues with the next unit. If the online diagnostics do not find an error, the system decides that the error may be a soft one and recovers from it (e.g., resets the lockstep CPUs and restarts the real-time task).
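The baseline flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the unit names, the `StlTest` callables and both function names are hypothetical, and a real SBIST runs on the target CPU rather than in Python.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical per-unit software test libraries (STLs): each callable
# returns True when it finds a hard (stuck-at) fault in its CPU unit.
StlTest = Callable[[], bool]


def run_sbist(stls: Dict[str, StlTest], order: List[str]) -> Optional[str]:
    """Run the STLs in the given order and stop at the first hard fault."""
    for unit in order:
        if stls[unit]():
            return unit  # hard error located, diagnostics stop here
    return None  # every STL passed: no hard error found


def handle_lockstep_error(stls: Dict[str, StlTest], order: List[str]) -> str:
    """Baseline flow: assume the error is hard, diagnose, then fall back."""
    faulty_unit = run_sbist(stls, order)
    if faulty_unit is not None:
        return f"hard error in {faulty_unit}"
    # Diagnostics found nothing, so the error is treated as soft.
    return "soft error: reset CPUs and restart task"
```

Note that in this baseline a soft error still pays for a full pass over every STL before recovery starts, which is the cost the worst-case provisioning has to cover.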
Any reduction in the provisioned lockstep error reaction time at run time is safe, and increases the availability of the system. However, the lockstep error checker is not aware of the error type, and where in the CPU the fault occurred to cause this error. If the error's likely location(s) within the CPUs from which the fault may have originated are known, the online diagnostics process can be performed more efficiently, e.g., by starting the diagnostics from these likely locations to speed up the process. If the error type (i.e., soft or hard) is known at the time of detection, an appropriate action can be taken. For example, online diagnostics can be avoided, and the lockstep processor can recover from the error if the error is likely to be soft.
We discovered that both the error type and the error's likely location(s) within the CPUs can be predicted in a lockstep processor by capturing and analyzing the outputs of the CPUs at the time the error is detected. We call this phenomenon lockstep error correlation prediction. The main motivation of lockstep error correlation prediction is to reduce lockstep error reaction time at run time in order to increase the availability of the safety-critical system.
To demonstrate the phenomenon, we performed a rigorous fault injection study on a dual-CPU Arm Cortex-R5 lockstep processor netlist running the EEMBC AutoBench suite. We injected a total of 10 million soft and hard faults into all flip-flops in the Cortex-R5 CPU across all benchmarks, ensuring that every flip-flop experienced many soft and hard faults. Each fault was injected into one CPU during the simulation, and the lockstep error checker logic detected the divergence. We observed that faults injected into certain CPU units have distinct CPU output port signal signatures captured at the time of a divergence. These distinct error signal signatures can be used to predict the potential locations of faults. Similarly, the type of a detected error (i.e., soft or hard) can also be predicted from the output port divergence signature, because a hard fault can spread to more CPU output ports than a soft fault.
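To make the idea of turning signatures into predictions concrete, here is a minimal sketch of one way such an offline analysis could be done. It is an assumption for illustration, not the procedure used in the study: the record format, the majority vote on error type and the name `build_prediction_table` are all hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple

# One record per detected divergence in a hypothetical fault-injection log:
# (output-port divergence signature, CPU unit that was injected, is_hard).
Record = Tuple[int, str, bool]


def build_prediction_table(
    records: Iterable[Record],
) -> Dict[int, Tuple[List[str], bool]]:
    """Map each signature to (units ranked by frequency, predicted-hard bit)."""
    units_by_sig: Dict[int, Counter] = defaultdict(Counter)
    type_votes: Dict[int, Counter] = defaultdict(Counter)
    for signature, unit, is_hard in records:
        units_by_sig[signature][unit] += 1
        type_votes[signature][is_hard] += 1

    table = {}
    for signature, unit_counts in units_by_sig.items():
        # Most frequently observed fault location comes first.
        ranked_units = [unit for unit, _ in unit_counts.most_common()]
        votes = type_votes[signature]
        # Assumed policy: on a tie, predict hard (the conservative choice).
        predict_hard = votes[True] >= votes[False]
        table[signature] = (ranked_units, predict_hard)
    return table
```

Ranking units by how often they produced a given signature gives the "most likely first" ordering, and the vote reduces the error type to a single bit per signature.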
Figure 1 Lockstep error correlation predictor
We built a simple static predictor to exploit this phenomenon, as shown in Figure 1. The predictor is tightly coupled to the lockstep error checker, which compares each signal coming from the lockstepped CPUs. It consists of address mapping hardware that generates an address from the output port divergence information captured when the lockstep error is detected, and a static prediction table in memory that is accessed with this address. Each entry in the table keeps a predicted order of CPU units for error location prediction, and a single bit predicting whether the error is soft or hard. The prediction table is populated with the prediction information collected in the offline analysis stage, and therefore the error correlation prediction is static in nature (i.e., the prediction table contents do not change at run time). Soft and hard errors are rare events, and predicting them statically makes the prediction hardware simpler. Thus, the prediction table can be kept in an off-chip memory (e.g., DRAM) and accessed by software (e.g., an exception handler).
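As a rough software model of this lookup path, the sketch below folds a divergence vector into a table index and reads the matching entry. The XOR-fold mapping, the 4-bit index and the `Entry` layout are assumptions chosen for illustration; the actual address mapping function and table format are not described here.

```python
from dataclasses import dataclass
from typing import List, Tuple

INDEX_BITS = 4  # assumed: a table of 2**4 entries, for illustration only


@dataclass(frozen=True)
class Entry:
    unit_order: Tuple[str, ...]  # CPU units, most likely fault location first
    is_hard: bool                # the single error-type prediction bit


def divergence_vector(outputs_a: int, outputs_b: int) -> int:
    """Bit i is set when output port signal i differs between the two CPUs."""
    return outputs_a ^ outputs_b


def map_address(divergence: int, index_bits: int = INDEX_BITS) -> int:
    """Hypothetical address mapping: XOR-fold the vector into index_bits bits."""
    mask = (1 << index_bits) - 1
    address = 0
    while divergence:
        address ^= divergence & mask
        divergence >>= index_bits
    return address


def predict(table: List[Entry], outputs_a: int, outputs_b: int) -> Entry:
    """Read the static table entry selected by the captured divergence."""
    return table[map_address(divergence_vector(outputs_a, outputs_b))]
```

Because the table is only read when an error is detected, nothing in this path is performance critical, which is consistent with keeping the table off-chip and reading it from a handler.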
When the lockstep error checker detects the error, all CPUs are stalled, and lockstep error handler software is invoked to read the associated predictor table entry. The handler reads the predicted CPU location(s) and error type, and takes the appropriate action based on this prediction. For example, the SBIST process is started if the error is predicted to be hard. SBIST then tests the CPU units in the predicted order, sorted from most likely to least likely, rather than in a predetermined order, which speeds up the SBIST process. If SBIST does not find a hard error, the error must have been soft. In this case, a soft error handling process is started by resetting the CPUs and restarting the application. If, on the other hand, the error is predicted to be soft, the time-consuming SBIST process is skipped, and the soft error handling process is initiated immediately.
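The handler's decision flow can be sketched as follows. This is a simplified model under stated assumptions: `handle_with_prediction`, the unit names and the per-unit test callables are hypothetical, and the count of tests run is included only to show where the time saving comes from.

```python
from typing import Callable, Dict, Sequence, Tuple


def handle_with_prediction(
    predicted_order: Sequence[str],
    predicted_hard: bool,
    stls: Dict[str, Callable[[], bool]],
) -> Tuple[str, int]:
    """Return (outcome, number of per-unit tests run) for one lockstep error."""
    if not predicted_hard:
        # Predicted soft: skip SBIST and recover immediately.
        return ("soft recovery", 0)
    tests_run = 0
    for unit in predicted_order:  # most likely fault location first
        tests_run += 1
        if stls[unit]():  # True means this unit's test found a hard fault
            return (f"hard error in {unit}", tests_run)
    # Predicted hard, but no test failed: treat the error as soft.
    return ("soft recovery", tests_run)
```

For example, with a hard fault in a hypothetical `lsu` unit, a predicted order that starts with `lsu` finds it after one test, while a fixed order that tests `lsu` last has to run every test first.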
Both error type and location prediction can improve the lockstep error reaction time, and thereby availability: correctly predicting a soft error avoids unnecessary online diagnostics, and, in the case of a hard error, running the diagnostics in the predicted order of the units finds the faulty CPU unit faster. The lockstep error correlation predictor increases system availability by up to 65%. The off-chip static prediction table size is 2-5 KB, depending on the number of predicted CPU units and the CPU unit granularity. The overhead of the predictor is less than 2% in terms of area and total power with respect to the dual-CPU Arm Cortex-R5 lockstep processor.
The lockstep error correlation predictor hardware, though implemented in a dual-CPU lockstep processor, can scale to multi-CPU lockstep processors with no additional hardware cost, because increasing the number of lockstepped CPUs increases the hardware complexity of the lockstep error checker, but not the predictor.
Although we evaluated lockstep error correlation prediction in a specific CPU architecture (Arm) and CPU (Cortex-R5), the concept does not rely on the specifics of the Arm ISA, or its microarchitectural implementation, and therefore is applicable to lockstep processors based on different architectures and their implementations.
This work only scratches the surface of this as-yet unexplored territory. We have proposed a simple static predictor correlating an observed state (e.g. a lockstep error) to a hidden state (e.g. a fault in an instruction decoder). We have not yet explored the use of advanced machine learning techniques to improve the prediction accuracy. We believe that these techniques will pave the way to build more efficient software-hardware co-design predictors.
Lockstep error correlation prediction can also be made dynamic, with prediction table entries updated from error prediction history, similar to branch prediction. This implies that the prediction table must be kept in hardware and must collect error history. However, errors are not frequent events like branches, so accumulating error history will take much longer than accumulating branch history, and predicting lockstep error correlation dynamically may not be any more beneficial than static prediction. Dynamic prediction may nevertheless be required if the rate of errors in semiconductors increases to a level approaching the frequency of other events in processors (e.g., branches).
The importance of functional safety is increasing in semi-autonomous vehicles (e.g., ADAS) in the near term, and it will be even more prominent in driverless vehicles in the longer term. ECUs in future autonomous vehicles will accommodate more redundant hardware, in particular lockstep processors, to increase the safety of passengers. Autonomous vehicles will be required to have more accountable ECUs than today's: all errors will have to be logged, and how, why and where they occur will matter more. Logging the potential causes and locations of all errors will allow the offline inspection of autonomous vehicles that may be required by legal entities and insurance companies.
Our discovery of this error correlation prediction phenomenon in lockstep processors will have another important impact on the ECUs of autonomous vehicles in addition to increasing their availability. Today, online diagnostics processes can identify the specific location of a hard error in lockstep processors, but this is not possible for soft errors because a soft error will disappear during the online diagnostics process. The error correlation prediction phenomenon allows prediction of a soft error and its probable locations within a CPU. This property of the phenomenon will make ECUs in autonomous vehicles accountable by logging the likely origins of both hard and soft errors.
Read the full paper