In a lot of ways, debugging is similar to being a medical doctor. A patient comes in with some complaints and lists their symptoms, but you need to run tests to diagnose the issue properly before turning your mind to how to fix it. A lot has been written and discussed about debugging hardware, but most of the attention is dedicated to the pre-silicon stage, when issues can be identified close to the source and rectified before it is too late. Bugs that are found in the silicon itself are typically much more difficult to identify, and can drain an enormous amount of time and resources to fix properly. Debugging them is similar to performing an autopsy on a body: sifting through all of the potential clues to narrow down what has gone wrong and how it can be rectified. Today I will speak about silicon debug, the challenges associated with it and what can be done to improve it.
Silicon autopsies require care and preparation to find out the true bug diagnosis
In the past, when CPUs, buses, GPUs, memory controllers, etc. were separate components, it was possible to use logic analyzers to gain visibility into the interfaces between them.
Tracing of interface signaling could be used to determine which component was not responding.
Data corruption could be traced and isolated to a component. With a logic analyzer you could, for example, look at a cycle-by-cycle trace of the bus to see why the circuit was hanging, and so isolate the problem to a single bus. Narrowing things down is obviously critical to figuring out the real cause of the problem and how to fix it.
Contrast that with today’s silicon products, which are highly integrated SoCs that offer very little visibility. Two of the typical debug techniques are unsuitable here for the following reasons:
That doesn’t leave you with many options. Indeed, the only debug visibility that some users are able to provide is single chain scan. Single chain scan is accessed through the JTAG (IEEE 1149.1) interface and provides full visibility of the signals in the chain as a snapshot of flip-flop state at a single clock, making it extremely useful for debugging lockups. However, there are issues with single chain scan. It is not always implemented, or does not function when it is, and there can be signal name to flip-flop mapping issues or signal name polarity issues.
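To make the mapping and polarity concerns concrete, here is a minimal sketch of how a raw scan dump might be turned into named signal values. It is an illustration only: the `ScanCell` record, the `decode_scan_dump` helper, the signal names and the four-bit dump are all hypothetical and do not come from any real tool or chip.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScanCell:
    """One flip-flop in the chain (hypothetical mapping entry)."""
    position: int   # index of the bit in the shifted-out stream
    signal: str     # signal name the flop is believed to hold
    inverted: bool  # True if the scan path inverts the stored value


def decode_scan_dump(bits: str, mapping: list[ScanCell]) -> dict[str, int]:
    """Turn a raw shifted-out bit string into named signal values."""
    if any(ch not in "01" for ch in bits):
        raise ValueError("scan dump must contain only '0' and '1'")
    values = {}
    for cell in mapping:
        if cell.position >= len(bits):
            raise ValueError(f"{cell.signal}: position {cell.position} is beyond the dump")
        raw = int(bits[cell.position])
        # A wrong polarity flag here silently flips the reported value,
        # which is why mapping errors are so misleading during debug.
        values[cell.signal] = raw ^ 1 if cell.inverted else raw
    return values


# Hypothetical four-flop chain and a snapshot taken at one clock.
mapping = [
    ScanCell(0, "bus_req", False),
    ScanCell(1, "bus_ack", False),
    ScanCell(2, "fifo_full_n", True),
    ScanCell(3, "cpu_stall", False),
]
state = decode_scan_dump("1001", mapping)
print(state)  # {'bus_req': 1, 'bus_ack': 0, 'fifo_full_n': 1, 'cpu_stall': 1}
```

In this made-up snapshot a request is outstanding with no acknowledge while the CPU is stalled, which is the kind of picture a scan dump can give for a lockup, provided the position and polarity of every cell in the mapping are correct.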
Another useful tool is the ARM DS-5™ Development Studio. DS-5/DSTREAM is a powerful tool for debugging silicon failures, but there are some cases that require more debug observability. In short, there is a growing need from SoC designers for more visibility into what’s happening on-chip than code trace, breakpoints, watchpoints, single step, etc. can provide.
Two areas that are especially challenging to debug are lockups and data corruption.
Lockups are cases where the CPU(s) will not halt, which makes it impossible to determine the PC or the register contents, and no code trace is available. Lockups can be caused by software (for example, an access to a powered-down device) or by hardware problems.
Data corruption simply means incorrect or invalid data (although not related to ECC failures). Normally you can spot the corruption with a print statement or debugger, but the source of the corruption usually occurs much earlier in time than its detection. Breakpoints, single step and watchpoints are often intrusive enough to disturb the failure. Some examples are a FIFO overrun, or a data path problem where we just don’t get the right data. The fact that we pick up on the problem much later than the source of corruption makes things quite challenging.
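The gap between cause and detection can be shown with a toy software model of a FIFO overrun. This is only a sketch under assumed behaviour: the `OverrunFifo` class, its depth and the cycle numbers are invented for illustration and do not describe any particular hardware.

```python
class OverrunFifo:
    """Fixed-depth FIFO that silently drops the oldest word when full."""

    def __init__(self, depth):
        self.depth = depth
        self.items = []
        self.overrun_cycle = None  # first cycle at which data was lost

    def push(self, cycle, value):
        if len(self.items) == self.depth:
            self.items.pop(0)  # oldest word is lost and no error is raised
            if self.overrun_cycle is None:
                self.overrun_cycle = cycle
        self.items.append(value)

    def pop(self):
        return self.items.pop(0)


fifo = OverrunFifo(depth=4)
for cycle in range(6):  # producer writes six words before any are read
    fifo.push(cycle, value=cycle)

detected_cycle = None
expected = 0
for cycle in range(100, 104):  # consumer only starts reading much later
    got = fifo.pop()
    if got != expected and detected_cycle is None:
        detected_cycle = cycle  # first point at which software sees bad data
    expected += 1

print(fifo.overrun_cycle, detected_cycle)  # 4 100
```

In the model the data is lost at cycle 4 but the mismatch is only seen at cycle 100, and nothing the consumer can observe at that point says where the loss happened.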
To summarize, far more visibility is needed to address the problems associated with silicon debug.
A current way of thinking to address this issue is to place the logic analyzer capability on the silicon. For this method to work it needs to fulfil a number of pre-requisites:
ARM’s solution to improve debug visibility is the CoreSight™ ELA-500 Embedded Logic Analyzer, which is a CoreSight component that can be connected to ARM IP and other IP blocks. It is programmable over debug APB via debugger or CPU for trigger condition setup and can:
Generate a trigger from one of up to 12 signal groups of 128 bits each, using assertion-style conditions.
Trigger conditions are built from trigger state transitions, event counting, comparators for criteria evaluation, and signal masking.
Capture a trace of the selected signal group in embedded SRAM (configurable size) for later analysis and/or waveform capture over time. Trace filtering is also supported.
Exchange triggers with other ELAs and SoC components over the CoreSight cross trigger interface matrix.
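The way these building blocks fit together can be sketched as a small behavioural model. This is not the ELA-500 programming interface: the `TriggerState` and `ElaModel` names, the fields and the sample values are assumptions made for illustration. The model only shows the general idea of masked comparison, event counting, state transitions and a bounded trace buffer, and it leaves out trace filtering and cross triggering.

```python
from collections import deque
from dataclasses import dataclass

GROUP_WIDTH = 128  # width of one signal group, as described above
GROUP_MASK = (1 << GROUP_WIDTH) - 1


@dataclass
class TriggerState:
    """One trigger state (hypothetical fields, not real registers)."""
    mask: int        # bits of the signal group that take part in the compare
    match: int       # expected value of the masked bits
    count: int       # matching cycles needed before leaving this state
    next_state: int  # state entered once the count is reached


class ElaModel:
    def __init__(self, states, final_state, trace_depth):
        self.states = states
        self.final_state = final_state
        self.trace = deque(maxlen=trace_depth)  # stands in for embedded SRAM
        self.current = 0
        self.hits = 0
        self.triggered = False

    def clock(self, group_value):
        """Sample the selected signal group for one cycle."""
        if self.triggered:
            return  # capture stops at the trigger in this simple model
        group_value &= GROUP_MASK
        self.trace.append(group_value)
        st = self.states[self.current]
        if (group_value & st.mask) == (st.match & st.mask):
            self.hits += 1  # matches are cumulative, not necessarily consecutive
            if self.hits >= st.count:
                self.current = st.next_state
                self.hits = 0
                self.triggered = self.current == self.final_state


# Bit 0 is a request, bit 1 an acknowledge; the upper nibble tags the cycle.
states = [
    TriggerState(mask=0b01, match=0b01, count=1, next_state=1),  # request seen
    TriggerState(mask=0b11, match=0b01, count=3, next_state=2),  # 3 cycles unacknowledged
]
ela = ElaModel(states, final_state=2, trace_depth=4)
for sample in [0x10, 0x21, 0x31, 0x41, 0x51, 0x63, 0x70]:
    ela.clock(sample)

print(ela.triggered, [hex(v) for v in ela.trace])
# True ['0x21', '0x31', '0x41', '0x51']
```

Once the trigger fires, the buffer holds the last few cycles leading up to it, which is the history that breakpoints and print statements cannot provide. A real design would size the buffer and choose the signal groups when the IP is integrated.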
You can find out more about the CoreSight ELA-500 in my colleague williamorme's blog ‘Taking the fear out of silicon debug’. In it, he explains how the ELA-500 connects to the Cortex®-A72 processor to increase the amount of visibility on-chip.
Integrating the CoreSight ELA-500 in the IP has several benefits:
When it comes to implementation, the IP team reaps the benefit of answering placement questions earlier, such as ‘Can we connect more debug signals?’ and ‘Is the ELA too large?’, which helps in understanding timing paths and routing congestion. Integration also prompts valuable discussions about specification and physical IP, and about which tools and compilers best suit the requirements.
SoC designers around the world and across many segments all want improved hardware debug capability. The amount of money and time spent on minimizing the risk of bugs in an SoC keeps growing.
It is a great benefit to be able to debug silicon issues. In rare cases silicon failures can be caused by IP bugs, so understanding the cause is important. Anything that can move you closer to the root cause is a huge benefit. A better understanding of debug, and new innovations, come one step at a time, so any new information can be most valuable.
While ARM products such as DS-5 and the CoreSight ELA-500 make it a lot easier to identify and remove bugs in silicon, it is becoming necessary to include sufficient hardware debug capability as part of the product plan/requirement. Adding debug support requires more effort if the project has already started. To use one example, the effort to create and document a port puncher script internally can be much greater than adding debug ports to the IP RTL (two months versus one week).
Finally, as well as setting aside resources for debug, designers should also plan a debug strategy for visibility that takes into account:
The understanding of the complexities of the human body increased dramatically when pioneers such as Leonardo da Vinci started to perform autopsies. Thankfully in this day and age, silicon autopsies are legal and indeed encouraged by the chip design community. In the rare case that silicon failure does happen, having the capability to take a deep look at the root cause of the issue is invaluable to preventing that type of problem from happening again.
Further Information
CoreSight Debug and Trace - ARM
CoreSight ELA-500 - ARM
CoreSight on-chip debug and trace (Infocenter)
How to debug: CoreSight basics (Part 1)