Central Processing Unit (CPU) microarchitects invest significant engineering effort in delivering improvements in Instructions Per Cycle (IPC) and maximum operating frequency (FMAX) with each successive CPU product generation. However, extracting additional IPC is proving increasingly difficult.
High-performance techniques that have served us well in the past, such as more aggressive speculation, wider parallelism, and specialized execution units, all add to the on-chip transistor count. Historically, this was not a problem: shrinking transistors allowed more of them to be packed into the same die area (Moore's law scaling), while reducing the supply voltage (Dennard scaling) made transistor operation more efficient. Today, process-technology innovation through 3D integration, wafer-scale computing, and novel transistor technologies still delivers ever-greater transistor integration. Unfortunately, voltage scaling has effectively stalled, so scaling alone yields much smaller efficiency gains.
Packing in more transistors to unlock more architectural features unfortunately leads to greater switching activity and therefore greater current demand. This impacts CPU power consumption, which has trended upwards in recent years. To make matters worse, power-delivery resources have not kept pace. For example, interconnects are particularly resistive at 5nm and below. Similarly, package technology has progressed at a slower rate and is unable to sustain the di/dt demands of modern CPUs. Together, these factors increase power sensitivity across product roadmaps, driving the need for greater CPU power introspection, both during design and at runtime.
Existing design-time power-introspection techniques rely on EDA tools that are extremely accurate, making them the approach of choice where signoff-grade power benchmarking is required. However, this accuracy is expensive: analyzing a complex, high-performance CPU core typically consumes significant computational resources over days, if not weeks, and runtimes grow even longer when analyzing the power of complete systems. Alternative approaches use activity-based power models that count specific micro-architectural events, often referred to as power proxies in the context of power modelling. Examples include cache activity, the rate of instruction retirement, and the rate of activation of specific functional units. These approaches can be highly accurate at coarse-grained temporal resolutions (on the order of milliseconds), but typically have poor fidelity when power is measured over finer windows. This loss of accuracy makes them unsuitable for runtime di/dt management, where mitigation against aggressive voltage droops must occur within 10 or fewer CPU cycles. Existing power-introspection techniques therefore force a stark trade-off between accuracy, computational speed, and the temporal resolution of power tracing.
At Arm Research, we are developing APOLLO, a technique rooted in machine learning and data science. APOLLO achieves fast yet accurate power modelling for both design-time and runtime power introspection within a single unified framework (figure 1). Algorithmically, APOLLO uses a new power-proxy selection technique based on minimax concave penalty (MCP) regression. It selects a small subset (<0.05%) of RTL signals to estimate CPU power consumption, achieving high accuracy (~90%) at per-cycle temporal granularity. The APOLLO model can also be synthesized into a low-cost on-chip power meter (OPM) with sub-1% area overhead, thanks to the small number of RTL signals monitored as power proxies. APOLLO has been prototyped on the Neoverse N1 CPU and will be evaluated further to fully explore its future potential.
Figure 1: APOLLO unifies a design-time power-model with a run-time OPM within the same framework. After initial prototyping on the Neoverse N1, APOLLO is being evaluated further to explore its future potential.
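To give a flavor of how MCP-based proxy selection works, the sketch below implements MCP-penalized least squares via coordinate descent in NumPy. This is an illustrative toy, not APOLLO's implementation: the signal-activity matrix `X`, the penalty parameters `lam` and `gamma`, and the function names are all assumptions. The key property shown is that the nonconvex MCP penalty drives most coefficients exactly to zero, and the few signals left with nonzero weights play the role of the selected power proxies.

```python
import numpy as np

def mcp_threshold(z, lam, gamma):
    # Firm (MCP) thresholding for one coordinate, valid for gamma > 1
    # and unit-variance feature columns: soft-threshold small values,
    # leave large values unshrunk (unlike the lasso).
    if abs(z) <= gamma * lam:
        return np.sign(z) * max(abs(z) - lam, 0.0) / (1.0 - 1.0 / gamma)
    return z

def mcp_select_proxies(X, y, lam=0.1, gamma=3.0, n_sweeps=200):
    """Coordinate descent for MCP-penalized least squares.
    X: per-cycle signal activity (cycles x signals), y: measured power.
    Returns the sparse weight vector; nonzero entries are chosen proxies."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    std = X.std(axis=0)
    std[std == 0] = 1.0                      # guard constant signals
    X = (X - X.mean(axis=0)) / std           # standardize columns
    y = y - y.mean()
    n, p = X.shape
    beta = np.zeros(p)
    r = y.copy()                             # residual y - X @ beta
    for _ in range(n_sweeps):
        for j in range(p):
            zj = X[:, j] @ r / n + beta[j]   # univariate OLS update
            bj = mcp_threshold(zj, lam, gamma)
            r += X[:, j] * (beta[j] - bj)    # incremental residual update
            beta[j] = bj
    return beta
```

On synthetic data where power depends on only a handful of signals, the procedure recovers that small support while zeroing out the rest, which is the behavior that keeps the proxy count below 0.05% of RTL signals.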
Techniques that rely on data-science approaches are sensitive to how training data is generated. They require a diverse set of training data with adequate coverage of functional units, which is typically hard to achieve given the underlying complexity of the CPU micro-architecture. With APOLLO, we circumvent these practical engineering challenges by auto-generating the training set of micro-benchmarks using a genetic algorithm (GA)-based framework that is micro-architecture agnostic. The GA-based optimization loop is primed to generate the worst-case power-consumption benchmark, or power virus, as indicated by the envelope of the plot shown in figure 2. The mix of low- and high-power benchmarks across generations naturally creates a rich diversity of benchmarks spanning a large power range.
Figure 2: Distribution of the power consumption of micro-benchmarks automatically generated using the GA-based framework. Early generations show a preponderance of lower-power benchmarks, while later generations trend towards higher power.
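The GA loop described above can be sketched in a few lines of Python. Everything here is a hypothetical stand-in: the instruction pool `ISA`, the additive per-instruction cost in `fitness`, and the population parameters are illustrative only. In practice, the fitness of each candidate benchmark would come from actually measuring or simulating its power, as in the GeST framework cited below; the structure of the loop (selection, crossover, mutation) is what the sketch conveys.

```python
import random

# Hypothetical instruction pool with a surrogate power cost per instruction.
# A real flow would run each benchmark through power simulation instead.
ISA = {"mul": 5.0, "fma": 6.0, "ld": 4.0, "add": 2.0, "nop": 0.5}

def fitness(bench):
    # Stand-in for measured power of the generated micro-benchmark.
    return sum(ISA[op] for op in bench)

def crossover(a, b):
    # Single-point crossover of two instruction sequences.
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

def mutate(bench, rate=0.05):
    # Randomly replace a small fraction of instructions.
    return [random.choice(list(ISA)) if random.random() < rate else op
            for op in bench]

def evolve(pop_size=40, length=64, generations=30):
    pop = [[random.choice(list(ISA)) for _ in range(length)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        elite = pop[: pop_size // 4]          # keep the hottest benchmarks
        children = [mutate(crossover(random.choice(elite),
                                     random.choice(elite)))
                    for _ in range(pop_size - len(elite))]
        pop = elite + children
    return max(pop, key=fitness)              # best candidate power virus
```

Because every generation's population, low- and high-power alike, is retained as training data, the loop naturally produces the wide power spread visible in figure 2 while its elite converges towards a power virus.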
APOLLO’s power-estimation accuracy, tested on Arm power-indicative workloads, is shown in figure 3. The figure shows the per-cycle prediction from the APOLLO model with 159 power proxies, which agrees closely with the ground-truth envelope (shown in pink).
Figure 3: The APOLLO power model with 159 proxies shows very good agreement with power-indicative workloads. Labels indicate the ground-truth power measured for the same workloads using EDA tools. The prediction is overlaid on top, showing an MAE of ~7%.
We measure prediction accuracy using mean absolute error (MAE), root-mean-squared error (RMSE), and R-squared correlation. In figure 3, the x-axis is the cycle index and the y-axis is the scaled power value. The MAE is less than 10 percent for all workloads, with the error largely due to small variations in the cycle-by-cycle estimates.
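These three metrics are standard and easy to compute over a per-cycle power trace. The NumPy sketch below is a generic implementation, not code from APOLLO; in particular, normalizing the MAE by the mean ground-truth power (so it reads as a percentage) is our assumption about how the reported figure is defined.

```python
import numpy as np

def accuracy_metrics(y_true, y_pred):
    """Per-cycle prediction-accuracy metrics for a power trace.
    Returns (relative MAE, RMSE, R-squared)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    # MAE expressed relative to mean ground-truth power (assumed convention).
    mae = np.mean(np.abs(err)) / np.mean(y_true)
    rmse = np.sqrt(np.mean(err ** 2))
    ss_res = np.sum(err ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return mae, rmse, r2
```

A perfect prediction yields an MAE of 0 and an R-squared of 1; a constant offset shows up in the MAE and RMSE but barely affects R-squared, which is why all three are reported together.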
The APOLLO approach and its key results have recently been published at MICRO, where the paper received the Best Paper Award.
We are continuing to evaluate the future potential of APOLLO and envision several possible applications. A key area of interest is di/dt mitigation: we believe APOLLO's fast yet accurate modelling capability could enable detailed di/dt evaluation across multiple workloads, while its current estimation could also aid runtime di/dt mitigation, leading to significant system-efficiency improvements. Because APOLLO is micro-architecture agnostic, it can extend beyond the CPU to other components of the SoC. We believe that future techniques for SoC power and thermal management will be underpinned by APOLLO-based power estimators. Please watch this space as we conduct further research into these and related areas.
Zacharias Hadjilambrou, Shidhartha Das, Paul N. Whatmough, David Bull, and Yiannakis Sazeides, “GeST: An automatic framework for generating CPU stress-tests,” in ISPASS 2019.