Functional safety for silicon IP used to be a niche activity, limited to an elite circle of chip and system developers in automotive, industrial, aerospace and similar markets. Over the last few years, however, that has changed significantly. There is now a more tangible vision of self-driving cars, with increasingly adventurous Advanced Driver Assistance Systems (ADAS) capturing people’s interest alongside media-rich in-vehicle infotainment. Moreover, the emergence of drones in all shapes and sizes and the growing ubiquity of the industrial Internet of Things are spreading the requirement for functional safety, and all of these markets are relevant to ARM®.
Much like any technology market surrounded by ‘buzz’, these burgeoning applications require semiconductors to make them happen, and the fast pace of product innovation has attracted huge interest from ARM’s partners. In the IP community ARM leads the way with a broad portfolio, from the ARM Cortex®-M0+ to the mighty Cortex-A72 and beyond. With a heritage in secure compute platforms and functional safety, ARM is well placed to enable the success of its silicon partners.
In a nutshell, functional safety is what the name says: it is about ensuring that products operate safely, and continue to do so even when they go wrong. ISO 26262, the standard for automotive electronics, defines functional safety as:
“the absence of unreasonable risk due to hazards caused by malfunctioning behaviour of electrical/electronic systems”.
Standards for other markets, such as IEC 61508 for electrical and electronic systems and DO-254 for airborne electronic hardware, have their own definitions; more importantly, they also set their own expectations for engineering development. Hence it’s important to identify the target markets before starting development and to ensure suitable processes are followed – attempts to ‘retrofit’ development processes can be costly and ineffective, so are best avoided. Figure 1 illustrates a variety of standards applicable to silicon IP.
Figure 1. Standards for functional safety of silicon IP
In practice, functionally safe means a system that is demonstrably safe to a skilled third-party assessor and that behaves predictably in the event of a fault. It must fail safe, which could mean continuing with full functionality, degrading gracefully to reduced functionality, or performing a clean shutdown followed by a reset and restart. It's important to realize that not all faults lead to hazardous events immediately. For example, a fault in a car's power steering might lead to a sudden, incorrect steering action; however, since the electronic and mechanical designs have natural timing delays, faults can often be tolerated for a specific amount of time. In ISO 26262 this time is known as the fault tolerant time interval (FTTI), and it depends on the potential hazardous event and the system design.
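As a rough illustration of how the fault tolerant time interval constrains a design: in the worst case a fault arrives just after a diagnostic test has run, so one full test interval plus the fault reaction time must still fit within the FTTI. The function name and the millisecond figures below are hypothetical, a minimal sketch rather than anything defined by ISO 26262:

```python
def reaction_fits_ftti(test_interval_ms, reaction_time_ms, ftti_ms):
    """Worst case: a fault occurs just after a diagnostic test has run,
    so it can go undetected for one full test interval before the
    fault reaction (e.g. a safe shutdown) even begins."""
    return test_interval_ms + reaction_time_ms <= ftti_ms

# Hypothetical numbers: a 100 ms FTTI for the hazardous event
print(reaction_fits_ftti(50, 20, 100))   # True  – 70 ms worst case fits
print(reaction_fits_ftti(90, 20, 100))   # False – 110 ms worst case is too slow
```

The same inequality is why shortening the diagnostic test interval, not just speeding up the reaction, matters when the FTTI is tight.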
Failures can be systematic, for example due to human error in specifications and design, or due to the tools used. One way to reduce these errors is to have rigorous quality processes that include a range of plans, reviews and measured assessments. Being able to manage and track requirements is also important, as is good planning and qualification of the tools to be used. ARM provides ARM Compiler 5 certified by TÜV SÜD to enable safety-related development without further compiler qualification.
Another class of failure is random hardware faults; these could be permanent faults, such as a short or broken via as illustrated by Figure 2, or soft errors caused by exposure to natural radiation. Such faults can be detected by countermeasures designed into the hardware and software; system-level approaches are also important. For example, Logic Built-In Self-Test (LBIST) can be applied at startup or shutdown in order to distinguish between soft and permanent faults. Error logging and reporting is also an essential part of any functionally safe system, although it’s important to remember that faults can occur in the safety infrastructure too.
Figure 2. Classes of fault
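The idea of re-running a self-test to separate soft errors from permanent faults can be sketched in a few lines. This is an illustrative sketch, not ARM's LBIST implementation; `run_selftest` and `classify_fault` are hypothetical names standing in for whatever startup test the hardware provides:

```python
def classify_fault(run_selftest, max_retries=1):
    """Run a self-test (e.g. LBIST at startup) and, on failure, retest:
    a failure that does not reproduce is treated as a soft error."""
    if run_selftest():
        return "no fault"
    for _ in range(max_retries):
        if run_selftest():
            return "soft fault"        # transient, e.g. a radiation-induced upset
    return "permanent fault"           # reproducible, e.g. a short or broken via

results = iter([False, True])                  # fails once, then passes on retest
print(classify_fault(lambda: next(results)))   # soft fault
print(classify_fault(lambda: False))           # permanent fault
```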
Selection of countermeasures is the part of the process I enjoy the most; it relates strongly to my background as a platform and system architect, and often starts with a concept-level Failure Modes and Effects Analysis (FMEA). Available countermeasures include diverse checkers, selective hardware and software redundancy, full lock-step replication as available for the Cortex-R5, and the ‘old chestnut’ of error correcting codes, which we use to protect the memories of many ARM products.
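To make the ‘old chestnut’ concrete, here is a minimal single-error-correcting Hamming(7,4) code, the textbook ancestor of memory ECC. Real memory protection schemes (typically SECDED codes over 32- or 64-bit words) are wider and also detect double-bit errors, but the syndrome mechanism is the same; this sketch is for illustration only:

```python
def hamming74_encode(d):
    """Encode 4 data bits (list of 0/1) into a 7-bit Hamming codeword."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    # Codeword positions 1..7: p1 p2 d1 p3 d2 d3 d4
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(c):
    """Correct a single flipped bit, then return the 4 data bits."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]   # parity check over positions 1,3,5,7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]   # parity check over positions 2,3,6,7
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]   # parity check over positions 4,5,6,7
    syndrome = s1 + 2 * s2 + 4 * s3  # 1-based position of the flipped bit, 0 if none
    if syndrome:
        c[syndrome - 1] ^= 1         # correct the single-bit error in place
    return [c[2], c[4], c[5], c[6]]

codeword = hamming74_encode([1, 0, 1, 1])
corrupted = list(codeword)
corrupted[4] ^= 1                            # flip one bit, e.g. a soft error
print(hamming74_decode(corrupted))           # [1, 0, 1, 1] – data recovered
```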
Faults that build up over time without effect are called latent faults, and ISO 26262 proposes that a system designated ASIL D, its highest Automotive Safety Integrity Level, should be able to detect at least 90% of all latent faults. As identified by Table 1, it also proposes a target of 99% diagnostic coverage of all single-point faults and a probabilistic metric for random hardware failures of ≤10⁻⁸ per hour.
Table 1. ISO 26262 proposed metrics
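These coverage targets correspond to the single-point fault metric (SPFM) and latent fault metric (LFM) of ISO 26262. The formulas below follow the standard's definitions of those metrics; the FIT (failures per 10⁹ hours) budget figures are hypothetical numbers chosen only to illustrate the arithmetic:

```python
def spfm(total_fit, spf_fit):
    """Single-Point Fault Metric: fraction of the safety-related failure
    rate that is not single-point or residual faults."""
    return 1.0 - spf_fit / total_fit

def lfm(total_fit, spf_fit, latent_fit):
    """Latent Fault Metric: fraction of the remaining failure rate
    that is not latent multiple-point faults."""
    return 1.0 - latent_fit / (total_fit - spf_fit)

# Hypothetical failure-rate budget in FIT (failures per 10^9 hours)
total_fit, spf_fit, latent_fit = 1000.0, 5.0, 80.0
print(f"SPFM = {spfm(total_fit, spf_fit):.1%}")             # ASIL D target: >= 99%
print(f"LFM  = {lfm(total_fit, spf_fit, latent_fit):.1%}")  # ASIL D target: >= 90%
```

With this budget both metrics clear the ASIL D proposals, which shows why even a small undetected single-point failure rate dominates the SPFM result.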
These metrics are often seen as normative requirements, although in practice they are proposals, and developers can justify their own target metrics because the objective is to enable safe products, not to add bullet points to a product datasheet.
A question I often ask myself with respect to semi-autonomous driving is whether it is safer to meet the standard’s proposed metrics for ASIL D with 10,000 DMIPS of processing, or to have 100,000 DMIPS with reduced diagnostic coverage and enable ‘smarter’ algorithms with better judgement. The answer is application specific, although in many cases a more capable, higher-performance system could save more lives than a more resilient system with basic functionality, so long as its failure modes are not wildly non-deterministic.
Irrespective of the diagnostic coverage achieved, it’s essential to follow suitable processes when targeting functionally safe applications – and this is where the standards really help. Even if you’re not targeting safety, more rigorous processes can improve overall quality.
When developing for functional safety, an essential part of the product is the supporting documentation, which needs to include a safety manual outlining the product’s safety case and covering aspects such as the assumptions of use, an explanation of its fault detection and control capabilities, and the development process followed.
Safety cases are hierarchical in use: the case for an IP is needed by chip developers to form part of their safety case, which in turn enables their customers, and so forth. Most licensable silicon IP will be developed as a Safety Element out of Context (SEooC), where its designers have little or no idea how it will subsequently be utilised. Hence the safety manual must also capture insight from the IP developers about their expectations, in order to avoid inappropriate use.
At ARM we support users of targeted IP with safety documentation packages, which always include a safety manual.
So in summary when planning for functional safety think PDS: