Silicon autopsy: Understanding when chips fail

July 14, 2015

In a lot of ways, debug is similar to being a medical doctor. A patient comes in with some complaints and lists their symptoms, but you need to run tests to diagnose the issue properly before turning your mind to how to fix it. A lot has been written and discussed about debugging hardware, but most of the attention goes to the pre-silicon stage, when issues can be identified close to the source and rectified before it is too late. Bugs that are found in the silicon itself are typically much more difficult to identify, and can drain an enormous amount of time and resources to fix properly. Tracking them down is similar to performing an autopsy on a body: sifting through all of the potential clues to narrow down what has gone wrong, and how it can be rectified. Today I will talk about silicon debug, the challenges associated with it and what can be done to improve it.

Silicon autopsies require care and preparation to find out the true bug diagnosis

Silicon debug challenges

In the past, it was possible to use logic analyzers to gain visibility to interfaces when CPUs, buses, GPUs, memory controllers, etc. were separate components.

Tracing of interface signaling could be used to determine which component was not being responsive.

Data corruption could be traced and isolated to a component. With a logic analyzer you could, for example, capture a cycle-by-cycle trace of the bus to see why the circuit was hanging, and so isolate the problem to a single bus. Narrowing things down is obviously critical to finding the real cause of the problem and how to fix it.

Contrast that with today’s silicon products, which are highly integrated SoCs that offer very little visibility. Two of the typical debug techniques are unsuitable here, for the following reasons:

Using IO pins for additional visibility is cost prohibitive in terms of both die area and package costs.
External buses such as DDR and PCIe run at very high frequencies and require very expensive bus analyzers, with probes that may be intrusive or have to be soldered to the board.

That doesn’t leave you with many options. Indeed, the only debug visibility that some users are able to provide is single chain scan. Single chain scan is accessed through the IEEE 1149.1 JTAG interface and gives full visibility of the signals in the chain as a snapshot of flip-flop state at a single clock, making it extremely useful for debugging lock ups. However, there are issues with single chain scan. It is not always implemented, and when it is, it does not always function. There can also be problems with the mapping from signal names to flip-flops, or with signal polarity.
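To make the mapping and polarity problem concrete, here is a minimal Python sketch of turning a shifted-out scan dump into named signals. Everything in it is an assumption made for illustration: the `ScanCell` record, the four signal names and the bit string are hypothetical, and a real flow would take the chain order and inversions from the DFT tool's output rather than a hand-written table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScanCell:
    """One flip-flop in the chain (all fields are illustrative)."""
    position: int    # index of the bit in the shifted-out stream
    signal: str      # RTL signal name the flop is believed to hold
    inverted: bool   # True if the scan path captures the inverted value


def decode_scan_dump(bits: str, cells: list[ScanCell]) -> dict[str, int]:
    """Map a shifted-out bit string onto named signals, undoing inversions."""
    decoded = {}
    for cell in cells:
        if cell.position >= len(bits):
            raise ValueError(f"{cell.signal}: position {cell.position} is outside the dump")
        raw = int(bits[cell.position])
        decoded[cell.signal] = raw ^ int(cell.inverted)
    return decoded


# Hypothetical four-flop chain and a dump captured after a lock up.
CHAIN = [
    ScanCell(0, "bus_req", inverted=False),
    ScanCell(1, "bus_ack", inverted=True),
    ScanCell(2, "fifo_full", inverted=False),
    ScanCell(3, "cpu_wfi", inverted=True),
]

state = decode_scan_dump("1110", CHAIN)
print(state)
```

If the `inverted` flag for `bus_ack` were wrong in the table, the decoded state would show an acknowledge that never happened, which is exactly the kind of misleading clue that a polarity issue produces.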

Another useful method is to use the ARM DS-5™ Development Studio. DS-5/DSTREAM is a powerful tool for debugging silicon failures, but some cases require more debug observability. In short, there is a growing need from SoC designers for more visibility into what’s happening on-chip than code trace, breakpoints, watchpoints, single step and so on can provide.

Two areas that are especially challenging to debug are lock ups and data corruption.

Lock ups are cases where the CPU(s) will not halt, which makes it impossible to read the PC or registers, and no code trace is available. Lock ups can be caused by software (for example, an access to a powered-down device) or by hardware problems.

Data corruption simply means incorrect or invalid data (leaving aside ECC failures). Normally you can spot the corruption with a print statement or debugger, but the source of the corruption usually occurs much earlier than its detection. Breakpoints, single step and watchpoints are often intrusive enough to disturb the failure. Some examples are a FIFO overrun, or a data path problem where we just don’t get the right data. The fact that we pick up on the problem much later than the source of corruption makes things quite challenging.
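The gap between cause and detection is easy to show with a toy model. The Python sketch below is purely illustrative and assumes a hypothetical four-entry FIFO that silently drops incoming words when full, a producer that writes every cycle and a consumer that reads every second cycle. None of these numbers come from a real design.

```python
from collections import deque

FIFO_DEPTH = 4  # illustrative depth, not taken from any real design


def run(cycles: int = 40):
    """Return (cycle the first word was lost, cycle the consumer noticed)."""
    fifo = deque()
    next_to_write = 0    # producer sends 0, 1, 2, ...
    next_expected = 0    # consumer expects the same sequence
    overrun_cycle = None
    for cycle in range(cycles):
        # Producer writes every cycle; a full FIFO silently drops the new word.
        if len(fifo) == FIFO_DEPTH:
            if overrun_cycle is None:
                overrun_cycle = cycle
        else:
            fifo.append(next_to_write)
        next_to_write += 1
        # Consumer reads only on odd cycles, so the FIFO eventually fills.
        if cycle % 2 == 1:
            word = fifo.popleft()
            if word != next_expected:
                return overrun_cycle, cycle
            next_expected = word + 1
    return overrun_cycle, None


lost_at, noticed_at = run()
print(f"first word lost at cycle {lost_at}, gap noticed at cycle {noticed_at}")
```

With these settings the first word is dropped at cycle 7 but the consumer only sees a gap in the sequence at cycle 15. In real silicon the distance can be far larger, and a breakpoint placed at the point of detection says little about what happened at the point of loss.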

To summarize, far more visibility is needed to address the problems associated with silicon debug.

What can be done to improve debug visibility?

One current approach to this issue is to place the logic analyzer capability on the silicon itself. For this method to work, it needs to fulfil a number of prerequisites:

Needs to operate with other ARM CoreSight debug components
Must be small enough to not cause IP area growth
Cannot measurably affect battery life
Must be able to operate over a large frequency range
Needs to be supported by DS-5 software
Needs a defined and supported way of connecting debug signals

ARM’s solution to improve debug visibility is the CoreSight™ ELA-500 Embedded Logic Analyzer, which is a CoreSight component that can be connected to ARM IP and other IP blocks. It is programmable over debug APB via debugger or CPU for trigger condition setup and can:

Generate a trigger from one of up to 12 signal groups of 128 bits each, using assertion-style conditions.

Trigger conditions are built from trigger state transitions, event counting, comparators for criteria evaluation, and signal masking.

Capture a trace of the selected signal group in embedded SRAM (configurable size) for later analysis and/or waveform capture over time. Trace filtering is also supported.

Cross-trigger with other ELAs and SoC components over the CoreSight cross trigger interface matrix.
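As a rough software analogy for how these pieces fit together, the Python sketch below models trigger states built from a mask, a comparator and an event count, with samples captured into a small ring buffer that stands in for the embedded SRAM. It is not the ELA-500 programming model: the `TriggerState` and `ToyLogicAnalyzer` names, their fields and the two-bit request/acknowledge signal group are all assumptions made up for this illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class TriggerState:
    mask: int         # which bits of the signal group to look at
    match: int        # value the masked bits must equal
    count: int        # consecutive matching samples needed before moving on
    next_state: int   # state to enter afterwards; -1 means "fire the trigger"


class ToyLogicAnalyzer:
    def __init__(self, states, trace_depth=8):
        self.states = states
        self.current = 0
        self.hits = 0
        self.triggered_at = None
        # Ring buffer: once full, the oldest sample drops out on each append.
        self.trace = deque(maxlen=trace_depth)

    def sample(self, cycle, value):
        if self.triggered_at is not None:
            return  # stop capturing so the history before the trigger survives
        self.trace.append((cycle, value))
        st = self.states[self.current]
        if (value & st.mask) == st.match:
            self.hits += 1
            if self.hits >= st.count:
                self.hits = 0
                if st.next_state < 0:
                    self.triggered_at = cycle
                else:
                    self.current = st.next_state
        else:
            self.hits = 0  # the run of matching samples was broken


# Hypothetical signal group: bit 0 = request, bit 1 = acknowledge.
states = [
    TriggerState(mask=0b01, match=0b01, count=1, next_state=1),   # see a request
    TriggerState(mask=0b11, match=0b01, count=3, next_state=-1),  # req without ack, 3 samples
]
samples = [0b00, 0b01, 0b11, 0b00, 0b01, 0b01, 0b01, 0b01, 0b01]

ela = ToyLogicAnalyzer(states, trace_depth=4)
for cycle, value in enumerate(samples):
    ela.sample(cycle, value)

print(ela.triggered_at, list(ela.trace))
```

In this run the first request is acknowledged, the second is not, and the trigger fires on the third consecutive unacknowledged sample. The ring buffer then holds the last four samples leading up to the trigger, which is the history you would want to inspect after a hang.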

You can find out more about the CoreSight ELA-500 in my colleague williamorme's blog Taking the fear out of silicon debug. In it he explains how the ELA-500 connects to the Cortex®-A72 processor to increase the amount of visibility on-chip.

Integration improvements

Integrating the CoreSight ELA-500 in the IP has several benefits:

Adding the debug signal ports to the IP RTL is fast
- Port extractor and LEC scripts are more labor intensive
- The logical and physical locations of the ELA in the IP are the same

When it comes to implementation, the IP team reaps the benefit of answering placement questions early, such as ‘Can we connect more debug signals?’ and ‘Is the ELA too large?’, which helps in understanding timing paths and routing congestion. Integration also prompts valuable discussions about specification and physical IP, including which tools and compilers best suit the requirements.

Recommendations for SoC Design Teams

SoC designers around the world and across many segments all want improved hardware debug capability. It’s an area where the money and time spent on minimizing the risk of bugs in an SoC keep growing.

It is a great benefit to be able to debug silicon issues. In rare cases silicon failures can be caused by IP bugs, so understanding the cause is important. Anything that can move you closer to the root cause is a huge benefit. A better understanding of debug, and new innovations in it, comes one step at a time, so any new information can be most valuable.

Prior Planning and Preparation Prevents Poor Performance

While ARM products such as DS-5 and the CoreSight ELA-500 make it a lot easier to identify and remove bugs in silicon, it is becoming necessary to include sufficient hardware debug capability as part of the product plan/requirement. Adding debug support requires more effort if the project has already started. To use one example, the effort to create and document a port puncher script internally can be much greater than adding debug ports to the IP RTL (two months versus one week).

Finally, as well as setting aside resources for debug, designers should also plan a debug strategy for visibility that takes into account:

Isolation of failures
Partner usage
Visibility for complex, problematic logic that may have had several bugs found by verification
Past errata

The understanding of the complexities of the human body increased dramatically when pioneers such as Leonardo da Vinci started to perform autopsies. Thankfully in this day and age, silicon autopsies are legal and indeed encouraged by the chip design community. In the rare case that silicon failure does happen, having the capability to take a deep look at the root cause of the issue is invaluable to preventing that type of problem from happening again.

Further Information

CoreSight Debug and Trace - ARM

CoreSight ELA-500 - ARM

CoreSight on-chip debug and trace (Infocenter)

How to debug: CoreSight basics (Part 1)
