Chinese Version 中文版:了却芯片设计的恐慌
The modern SoC is a feat of engineering that continually squeezes greater performance from defined power and area constraints. However the arch nemesis of reliability is complexity.
“Debugging is twice as hard as writingthe code in the first place. Therefore,
if you write the code as cleverly as possible, you are, by definition, not smart
enough to debug it.”
— Brian W.Kernighan and P. J. Plauger in The Elements of Programming Style
As SoC complexity continues to grow exponentially, it is only wise to build in some advanced debug capability in to the SoC. We’re all familiar with the concept of “a stitch in time saves nine” and this is particularly relevant for debugging; the later you find a bug, the more tedious, time-consuming and expensive it becomes to resolve. Visibility is a precious resource to system designers, as it gives them an opportunity to spot bugs early, and make subtle changes that can alter and optimize an SoC’s performance. On-chip visibility acts as a screening process to identify any snags.
There are certain SoC bugs that tend to manifest themselves through either a data corruption or a system lock up which occurs only when a series of contributing factors align to cause the fault. Factors may be as diverse as manufacturing tolerances being exceeded, bit errors being introduced, complex real-world software exercising new unvalidated spaces, or race conditions between multiple out-of-order transactions.
So if your design does get hit by a rare, extremely difficult to reproduce and tricky to diagnose issue it’s critical you have some tools to deploy to help you get to the bottom of the problem as fast as possible. Almost by definition any bug found in silicon is not going to be found by a simple test case you can run on a simulator or emulator of parts of your design.
The complexity of multi-core processors and cache coherent interconnects mean much of what was previously visible through CoreSight Embedded Trace Macrocells (ETM), essentially the programmer’s view, is now hidden inside the IP blocks.
With this in mind, ARM has developed a new weapon to add to the CoreSight on-chip debug and trace armoury called the CoreSight™ ELA-500 Embedded Logic Analyzer in order to provide a more accurate diagnosis of system bugs. As the name suggests this is a logic analyzer-like IP block for embedding in to your SoC to monitor up to 12 groups of 128 signals, generate triggers from assertion-like conditions and with a small embedded SRAM to collect a recent trace history of selected signals.
So step one is to find out the state your system has got itself in to and what illegal or suspicious condition has occurred. The trace aspect of debug is similar to a detective using CCTV cameras when solving a crime. To help with this the ELA-500 contains a way to set up complex multi-state conditional triggers such as:
The ELA-500 provides a number of tools to discover any malicious conditions:
Figure1: Trigger set up in the ELA-500
Step 2 is to start looking at what happened around this suspect state or condition; which can be done by storing selected signal states to the ELA-500 dedicated SRAM, configurable between four and over one billion trace data entries, or by triggering another action outside the ELA-500.
Up to 8 programmable output actions that can be triggered for each trigger state, such as: Stop clocks, enter debug state, start/stop signal trace, trigger another logic analyser or ETM, or assert a CPU interrupt.
It is likely that from the information gleaned new trigger conditions will be set to see what other unexpected conditions or states are occurring, so repeating steps 1 and 2 to establish the chain of events leading to the error condition.
For really extreme cases even further visibility may be required around the trigger condition, not visible except through a scan chain dump. For this step 3 is to program a stop clock action on the ELA-500 and then use scan chain dump and information on the SoC’s scan chains to provide exact state or any and all registers within the SoC on a scan chain. The ELA-500 here provides the precision on which scan chain dumps to analyse, so less of this time-consuming exercise needs to be done.
The ELA-500 can monitor any signal you connect to its inputs. SoC designers will benefit from connecting up signals from ARM IP and proprietary or third party IP. A typical design might contain multiple ELA-500’s deployed to monitor signals in different domains of the SoC, as shown in figure 2, with one per main processor cluster, one for the Cache Coherent Interconnect and one for other signals selected by the SoC designer.
Figure 2: Example deployment of the ELA-500 in a system
Figure 2 shows the clock stop requests (in red) running the Clock Controller from each ELA and the connectivity (in black) of trigger in/out to the CoreSight Cross Trigger Interfaces (CTI) and the Cross Trigger Matrix (CTM). The debug APB bus is used to both set up trigger conditions and to read back the contents of the ELA’s SRAM, as controlled by the debugging tool, such as the ARM® DS-5™ debug tool.
For connection to ARM IP a Logic Analyzer IP Kit (LAK-500A) is provided with a pre-selected set of signals for that IP. The first of these is available for the recently released Cortex®-A72 processor to ensure the ELA-500 can sample signals at the maximum operating frequency of the Cortex-A72 without any impact on the operation of the processor.
The LAK-500A Logic Analyzer IP Kit includes the following:
The observation interface signals provide debug visibility of: each core-to-L2 interface, power-management interfaces, and the L2 memory system power-management interface. The core-to-L2 interface provides visibility of the physical addresses of L1 misses to the L2, and the following transaction details:
Future support is planned for new ARM Cortex-A and Mali™ processors as well as the CoreLink™ CCI Cache Coherent Interconnects, where transactions in flight and snoop traffic can be observed.
The CoreSight ELA-500 provides visibility into the states leading to lock-ups and data corruption. It provides visibility of CPU load, stores, speculative fetches, cache activity and transaction lifecycle; properties that are not visible with existing ETM trace of instructions. This offers a greater scope for finding corner-case bugs that could potentially spell disaster if discovered too late.
The ELA-500 can monitor error states and hazard conditions across the SoC, giving visibility to debug lock ups in designs without resorting to complex scan chain dump analysis, and cases with invalid accesses to device memory. The ELA can spot data corruptions early, whereas conventional timeouts occur too late and causation events are often lost/overwritten. I go into even more detail on some of the use cases for the CoreSight ELA-500 in a Video interview with silicon debug expert Mark LaVine
All this ensures you have the fastest debug route available should your SoC suffer a catastrophic failure found only when the silicon comes back and full software is running on the device.
A full specification of the CoreSight ELA-500 can be found on the ARM Infocenter
You can find more information on the CoreSight ELA-500 webpage