Chinese version (中文版): System Validation at ARM: Enabling our Partners to Build Better Systems
(Huge thanks to the system validation team in Bangalore for providing me with all of the technical information here. Much appreciated!)
Functional validation is widely acknowledged as one of the primary bottlenecks in System-on-Chip (SoC) design. A significant portion of the engineering effort spent on productizing the SoC goes into validation. According to the Wilson Research Group, verification consumed more than 57% of a typical SoC project in 2014.
Source: Wilson Research Group
In spite of these efforts, functional failures are still a prevalent risk for first-time designs. Since the advent of multi-processor chips, including heterogeneous designs, the complexity of SoCs has increased considerably. As you can see in the diagram below, the number of IP components in an SoC is growing at a strong rate.
SoCs have evolved into complex entities that integrate several diverse units of intellectual property (IP). A modern SoC may include several components such as CPUs, GPU, interconnect, memory controller, System MMU, interrupt controller etc. The IPs themselves are complex units of design that are verified individually. Yet, despite rigorous IP-level verification, it is not possible to detect all bugs – especially those that are sensitized only when the IPs interact within a system. This article intends to give you some behind-the-scenes insight into the system validation work done at ARM to enable a wide range of applications for our IP.
Many SoC design teams attempt to solve the verification problem individually using a mix of homegrown and commercially available tools and methods. The goal of system validation at ARM is to provide partners with high-quality IP that has been verified to interoperate correctly. This provides a standardized foundation upon which partners are able to build their own SoC system validation solutions. Starting from a strong position, their design and verification efforts can be directed more at the design differentiation they add to the SoC and its interactions with the rest of the system.
The verification flow at ARM is similar to what is widely practiced in the industry.
The ARM verification flow pyramid
Verification of designs starts early and at the granularity of units, which combine to form a stand-alone IP. Unit level is the point in the verification cycle at which engineers have the greatest visibility into the design. Individual signals that would otherwise be deep within the design may be probed or set to desired values to aid validation. Once unit-level verification has reached a degree of maturity, the units are combined to form a complete IP (e.g. a CPU). Only then can IP-level verification commence. For CPUs this is very often the first time that testing with assembly programs can begin; most of the testing up to this point is done by toggling individual wires and signals. At IP level the tests are written in assembly language: the processor fetches instructions from (simulated) memory, decodes them, executes them, and so on. Once this top-level verification of the IP reaches some stability, multiple IPs are combined into a system and the system validation effort begins.
IPs go through multiple milestones during their design-verification cycle that reflect their functional completeness and correctness. Of these, Alpha and Beta milestones are internal quality milestones. LAC (Limited Access) represents the milestone after which lead partners get access to the IP. This is followed by EAC (Early Access), which represents the point after which the IP is ready to be fabricated for obtaining engineering samples and testing. By the REL (Release) milestone the IP has gone through rigorous testing and is ready for mass production.
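The milestone sequence can be pictured as an ordered enumeration. The following is only a sketch: the names Milestone, INTERNAL_MILESTONES and partner_visible are hypothetical, and the assumption that Alpha and Beta precede LAC is inferred from the description above rather than stated outright.

```python
from enum import IntEnum


class Milestone(IntEnum):
    """Quality milestones, ordered as listed in the text (assumed ordering)."""
    ALPHA = 1  # internal quality milestone
    BETA = 2   # internal quality milestone
    LAC = 3    # lead partners get access after this point
    EAC = 4    # ready for engineering samples
    REL = 5    # ready for mass production


# Milestones that are internal to the IP team.
INTERNAL_MILESTONES = {Milestone.ALPHA, Milestone.BETA}


def partner_visible(milestone: Milestone) -> bool:
    """True once the IP has reached a milestone at which partners have access."""
    return milestone >= Milestone.LAC


for m in Milestone:
    print(m.name, "partner-visible" if partner_visible(m) else "internal")
```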
IPs are usually between Alpha and Beta quality before going through the system validation flow. By this phase of the design cycle the IPs have already been subjected to a significant amount of testing and most low-level bugs have already been found. Stimulus has to be carefully crafted so that the internal state of the micro-architecture of each IP is stressed to the utmost. The stimulus is provided by either assembly code or by using specially designed verification IPs integrated into the system. ARM uses a combination of both methods.
Many of the bugs found at this stage could result in severe malfunctions in the end product if they were left undetected. Based on past experience, ARM estimates that these types of bugs take 1-2 peta cycles of verification to discover and 4-7 man-months of debug effort. In many cases, a delay that large would prove fatal to a chip’s opportunity to hit its target window in the market. Catching them early enough in the design cycle is critical to ensure the foundations in the IP are stable, before they go on to being integrated as part of an SoC.
The nature of ARM’s IP means it is used in a diverse range of SoCs, from IoT devices to high-end smartphones to enterprise-class products. Ensuring that the technology does exactly what it is designed to do in a consistent and reproducible manner is the key goal of system validation, and the IP is robustly verified with that in mind. In other words, the focus of verification is the IP, but in a realistic system context. Towards this end, ARM tests IPs in a wide variety of realistic system configurations that are called Kits.
A kit is defined as a “group of IPs” integrated together in the form of a subsystem for a specific target application segment (e.g. Mobile, IoT, Networking etc.). It typically includes the complete range of IPs developed within ARM – CPUs, interconnect, memory controller, system controller, interrupt controller, debug logic, GPU and media processing components.
A kit is further broken down into smaller components called Elements, which can be considered the building blocks of kits. Each element contains at least one major IP and the white space logic around it, though some elements have several IPs integrated together.
Kits are designed to be representative of typical SoCs for different applications. One result is that they give ARM a more complete picture of the challenges the ecosystem faces in integrating various IP components to achieve a target system performance.
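The kit and element hierarchy described above can be pictured as a small data model. This is a minimal sketch, not ARM's actual tooling: the class names, fields and the example IP names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Element:
    """A building block: at least one major IP plus the logic around it."""
    name: str
    ips: List[str]

    def __post_init__(self) -> None:
        # The text says every element contains at least one major IP.
        if not self.ips:
            raise ValueError(f"element {self.name!r} needs at least one IP")


@dataclass
class Kit:
    """A group of elements integrated as a subsystem for one target segment."""
    name: str
    segment: str
    elements: List[Element] = field(default_factory=list)

    def all_ips(self) -> List[str]:
        # Flatten the IPs of every element, preserving order.
        return [ip for element in self.elements for ip in element.ips]


# Hypothetical example kit; the IP names are placeholders.
mobile_kit = Kit(
    name="example-mobile-kit",
    segment="Mobile",
    elements=[
        Element("compute", ["cpu_cluster", "interrupt_controller"]),
        Element("memory", ["interconnect", "memory_controller"]),
        Element("graphics", ["gpu"]),
    ],
)
print(mobile_kit.all_ips())
```

Modelling elements separately from kits mirrors the reuse the text describes: the same element could, in principle, appear in kits for several segments.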
The system validation team uses a combination of stimulus and test methodology to stress test kits. Stimulus is primarily software tests that are run on the CPUs in the system. The tests may be hand-written, in either assembly or a high-level language, or generated using Random Instruction Sequence (RIS) tools, which are explained in the sections below. In addition to code running on CPUs, a set of Verification IPs (VIPs) is used to inject traffic into the system and to act as observers.
In preparation for validation, a test plan is created for every IP in the kit. Test planning captures the IP configurations, features to be verified, scenarios to be covered, stimulus, interoperability considerations with other IPs, verification metrics, tracking mechanisms, and the various flows that will be part of verification. Testing of kits starts with simple stimulus that is gradually ramped up to more complex stress cases and scenarios.
The testing performs various subsystem-level assessments such as performance verification, functional verification, and power estimation. Reports documenting reference data, namely the performance, power, and functional quality of selected kits, are published internally. This article focuses on functional aspects only; performance and power topics will be covered in future blogs.
The system validation team at ARM has established a repeatable and automated kit development flow, which allows us to build multiple kits for different segments. ARM currently builds and validates about 25 kits annually.
The mix of IPs, their internal configuration, and the topology of the system are chosen to reflect the wide range of end uses. The kits are tested on two primary platforms: emulation and FPGA. Typically testing starts on the emulator and soak testing is subsequently done on FPGA. On average every IP is subjected to 5-6 trillion emulator cycles and 2-3 peta FPGA cycles of system validation. In order to run this level of testing, ARM has developed some internal tools.
System Validation Tools
There are three primary tools used in system validation. They focus on areas such as the instruction pipeline, the IP-level and system-level memory system, system coherency, and interface-level interoperability. Two of these tools are Random Instruction Sequence (RIS) generators. RIS tools explore the architecture and micro-architecture design space in an automated fashion, attempting to trigger failures in the design, and they cover that space more effectively than hand-written directed tests. The generated tests are multi-threaded assembly code, composed of random ARM and Thumb instructions, designed to thoroughly exercise different portions of the implementation.
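To make the idea of a random instruction sequence concrete, here is a toy generator. It is a sketch only and bears no relation to ARM's internal RIS tools: the template table, register pool and function names are hypothetical, and the output is illustrative text rather than a meaningful, assemblable test.

```python
import random
from typing import Dict, List

# Tiny, hypothetical instruction templates; a real RIS tool models the
# full architecture and far more constraints than this.
TEMPLATES = [
    "ADD {rd}, {rn}, {rm}",
    "SUB {rd}, {rn}, {rm}",
    "EOR {rd}, {rn}, {rm}",
    "LDR {rd}, [{rn}]",
    "STR {rd}, [{rn}]",
]
REGISTERS = [f"r{i}" for i in range(8)]


def generate_sequence(length: int, seed: int) -> List[str]:
    """Return `length` randomly chosen instructions, reproducible per seed."""
    rng = random.Random(seed)
    sequence = []
    for _ in range(length):
        template = rng.choice(TEMPLATES)
        # Unused fields (e.g. rm for LDR/STR) are simply ignored by format().
        sequence.append(
            template.format(
                rd=rng.choice(REGISTERS),
                rn=rng.choice(REGISTERS),
                rm=rng.choice(REGISTERS),
            )
        )
    return sequence


def generate_test(threads: int, length: int, seed: int) -> Dict[int, List[str]]:
    """One sequence per thread, each derived from the same base seed."""
    return {t: generate_sequence(length, seed * 1000 + t) for t in range(threads)}


for thread_id, instructions in generate_test(threads=2, length=4, seed=42).items():
    print(f"thread {thread_id}:")
    for instruction in instructions:
        print("   ", instruction)
```

Seeding the generator matters in practice: a failing random test is only useful if the same sequence can be regenerated for debug.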
The third tool is a lightweight kernel that can be used as a platform to develop directed tests. It supports basic memory management, thread scheduling, and a subset of the pthreads API, which allows users to develop parameterized directed tests. The validation methodology as a whole uses a combination of directed testing and random-instruction-based automated testing.
In order to stress test IP at the system level, a more random approach is used rather than a directed approach. This enables ARM to cover a range of scenarios, stimulate multiple timing conditions and create complex events. To this end, kits support various verification-friendly features such as changing the clock ratios at different interfaces, enabling error injectors, and stubbing out components that are not required for verifying a given feature. Bus clock ratios at various interfaces in the system, such as the CPU, interconnect and dynamic memory controller, can be changed to simulate realistic system clocking conditions.
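One way to picture these verification-friendly knobs is as a per-run configuration that is randomized from a seed. The sketch below is an assumption-laden illustration: RunConfig, random_config, the list of clock ratios and the stub candidates are all hypothetical, since the article does not list the real options.

```python
import random
from dataclasses import dataclass
from typing import Sequence, Tuple

# Hypothetical knob values; the article does not list the real ones.
CLOCK_RATIOS = ["1:1", "2:1", "3:1", "4:1"]
INTERFACES = ("cpu", "interconnect", "memory_controller")


@dataclass(frozen=True)
class RunConfig:
    clock_ratios: Tuple[Tuple[str, str], ...]  # (interface, ratio) pairs
    error_injection: bool
    stubbed_components: Tuple[str, ...]


def random_config(seed: int, stub_candidates: Sequence[str] = ()) -> RunConfig:
    """Pick a clock ratio per interface and toggle the other knobs at random."""
    rng = random.Random(seed)
    ratios = tuple((iface, rng.choice(CLOCK_RATIOS)) for iface in INTERFACES)
    # Each optional component is stubbed out with probability 0.5.
    stubbed = tuple(c for c in stub_candidates if rng.random() < 0.5)
    error_injection = rng.random() < 0.5
    return RunConfig(ratios, error_injection, stubbed)


print(random_config(seed=7, stub_candidates=("gpu", "display")))
```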
The diagram above shows how the system is initially brought up and how test complexity is gradually scaled up.
Integration Tests & KVS
Initial testing starts with a set of simple integration tests that are run to confirm basic stability of the kit and flush out minor integration issues. Following this, a suite of tests called the Kit Validation Suite (KVS) is used to thoroughly test the integration of the kit. These tests are run early in the verification cycle to validate that the kit is good enough to run more stressful payloads. KVS can be configured to run on a wide variety of kits. It includes sub-suites to test integration, power, CoreSight debug and trace, and media IPs. There are specific tests in KVS to test integration of the GPU and display as well as GPU coherence. Initial boot is usually done in simulation, and testing gradually transitions to emulators (hardware accelerators) for the integration testing.
RIS Boot and Bring up
After that we boot all the RIS tools with basic bring up tests on the kit to work through any hardware/software configuration issues.
RIS: Default and Focused Configurations
Once the kit is stable the complexity of tests and therefore the stress that they place on the system is increased. Random stimulus can cover the design space faster than directed stimulus and requires less effort towards stimulus creation. Therefore, for stress testing there is more reliance on random stimulus than directed tests. Initially default configurations of the RIS tools are run and after a suitable number of verification cycles, the tools are re-configured to stress the specific IPs in the kit.
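The difference between a default and a focused configuration can be illustrated with weighted random selection. In this sketch the instruction classes and both weight tables are hypothetical; they only show how re-weighting shifts the stimulus towards one area, here the memory system.

```python
import random
from collections import Counter
from typing import Dict, List

INSTRUCTION_CLASSES = ["alu", "load_store", "branch", "barrier"]

# Hypothetical weightings, not taken from ARM's tools.
DEFAULT_WEIGHTS = {"alu": 1, "load_store": 1, "branch": 1, "barrier": 1}
MEMORY_FOCUSED_WEIGHTS = {"alu": 1, "load_store": 6, "branch": 1, "barrier": 2}


def sample_classes(weights: Dict[str, int], count: int, seed: int) -> List[str]:
    """Draw `count` instruction classes according to the given weights."""
    rng = random.Random(seed)
    return rng.choices(
        INSTRUCTION_CLASSES,
        weights=[weights[c] for c in INSTRUCTION_CLASSES],
        k=count,
    )


default_mix = Counter(sample_classes(DEFAULT_WEIGHTS, 10_000, seed=1))
focused_mix = Counter(sample_classes(MEMORY_FOCUSED_WEIGHTS, 10_000, seed=1))
print("default load/store share:", default_mix["load_store"] / 10_000)
print("focused load/store share:", focused_mix["load_store"] / 10_000)
```

With these weights the focused run draws load/store instructions roughly 60% of the time instead of roughly 25%, which is the kind of bias a focused configuration would apply to stress a specific IP.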
In the final phase of system validation the kit is soak tested on FPGAs. Though emulators are more debug friendly, FPGAs are faster and can provide a lot more validation cycles. Therefore, once the IPs are stable and mature, ARM soak tests them on FPGAs to find complex corner cases.
Metrics, Tracking, Coverage and Milestone Closure
The number of validation cycles run for every kit is one of the metrics tracked to ensure that the target number of validation cycles has been met. This is especially useful for confirming that the soak-testing cycle target has been met, increasing confidence in the quality of the IP in various applications. In addition, coverage is quantified and tracked using a statistical coverage method to ensure that the full design, including potential corner cases, has been exercised sufficiently.
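Cycle tracking of this kind amounts to accumulating cycles per platform and comparing them with a target. The sketch below assumes a hypothetical CycleTracker class and hypothetical targets, loosely sized after the average emulator and FPGA figures quoted earlier in this article; it does not model the statistical coverage method, which the article does not describe.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class CycleTracker:
    """Accumulates validation cycles per platform against fixed targets."""
    targets: Dict[str, int]
    completed: Dict[str, int] = field(default_factory=dict)

    def record(self, platform: str, cycles: int) -> None:
        if platform not in self.targets:
            raise KeyError(f"no cycle target defined for {platform!r}")
        self.completed[platform] = self.completed.get(platform, 0) + cycles

    def progress(self, platform: str) -> float:
        """Fraction of the target reached, capped at 1.0."""
        done = self.completed.get(platform, 0)
        return min(1.0, done / self.targets[platform])

    def targets_met(self) -> bool:
        return all(
            self.completed.get(platform, 0) >= target
            for platform, target in self.targets.items()
        )


# Hypothetical targets: 5 trillion emulator cycles, 2 peta FPGA cycles.
tracker = CycleTracker(targets={"emulator": 5 * 10**12, "fpga": 2 * 10**15})
tracker.record("emulator", 5 * 10**12)
tracker.record("fpga", 5 * 10**14)
print(tracker.progress("fpga"), tracker.targets_met())
```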
The latest version of the ARM Juno test chip was subjected to a total validation run time of 6,130 hours, the equivalent of 8 and a half months of testing. This gives a unique perspective into corner cases within the system that makes ARM better able to support partners who are attempting to debug issues within their own design. Furthermore, the bugs that are found during the validation process are then fed back into the IP design teams who use the information to improve the quality of the IP at each release milestone, as well as guide next-generation products.
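The conversion of the Juno run time quoted above from hours to months can be checked with a few lines of arithmetic. This assumes continuous, round-the-clock testing and an average month of 365.25 / 12 days; both are assumptions made here, not statements from the article.

```python
TOTAL_HOURS = 6130  # validation run time quoted in the text

days = TOTAL_HOURS / 24
average_month_days = 365.25 / 12  # assumed average month length
months = days / average_month_days

print(f"{days:.1f} days, about {months:.1f} months")
```

Under these assumptions the figure comes out at a little under 8.5 months, consistent with the rounded value given in the text.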
System complexity has increased in line with SoC performance capabilities, causing a significant growth in the amount of time and money spent on validation. ARM verifies its IP for interoperability before it is released to partners to make sure it is suitable for a wide range of applications. ARM’s IP teams are continuously designing at the leading edge, and are helped by the system validation team to ensure their IPs work together in the systems our partners are building.
Frank Schirrmeister of Cadence Design Systems cites the validation of their tool interoperability as one benefit. “As an ARM ecosystem partner, Cadence relies on pre-verified ARM cores and subsystems that can be easily integrated into the designs that we use to validate our tool interoperability. ARM’s software-driven verification approach reflects the industry’s shift toward the portable stimulus specification and allows us to validate the integration and interoperability of ARM cores and subsystems on all Cadence System Development Suite engines, including simulation, emulation and FPGA-based prototyping engines.”
Due to the wide variety of applications that the ARM partnership designs for, it is necessary to ensure our IP is functional in many different systems. The multi-stage approach to system validation at ARM gives our partners the peace of mind that they can rely on our IP. Over time the validation methodology has evolved into one that tests several system components and stresses most IPs in the system. In the future we have plans to extend and further improve our test methods to ensure an even higher standard of excellence across ARM IP.
There is now a full white paper on this subject which goes into a lot more detail, available here - System Validation at ARM: Enabling our Partners to Build Better Systems