Today the ARM Cortex-A7 processor was announced...the power of big.LITTLE processing is finally realized!
I drive a Honda Fit, mainly for the fuel efficiency, on a 20-mile city street commute. Sometimes I wish my car had a faster engine, but most of the time I'm happy to drive for high gas mileage. I have to say, though, that I was a reluctant convert to economy cars; I often find myself longing for the performance of a Porsche or BMW, but I only really want that performance a small percentage of the time I'm driving. Wouldn't it be great if it were possible to drive a car with the average efficiency of a 4-cylinder engine, a car that could switch to the high performance of a turbocharged V8 engine for the small percentage of time you actually wanted peak performance? What if the average fuel economy was closer to that of the 4-cylinder and the peak performance was closer to that of the turbo V8?
In today's mobile world we are developing similarly split-personality use cases, especially in the ultra-portable segment, smartphones and tablets in particular. The applications these devices run, or envision running (augmented reality, content creation, driving much larger screens), demand ever more high-performance processing capability, all within the all-important mobile thermal and battery limits. These devices are also always on and constantly connected, with Twitter feeds, Facebook and push email updates, so they require constant low-intensity performance as well. Finally, as these devices become our mainstay communication, consumption and computation platforms, we want battery life to increase to support our ever-active workdays.

In automotive design, the weight of a second engine makes combining a big and a little engine impractical, even though the urge to drive a performance monster with the cost-effectiveness of an economy car certainly remains. In the mobile world we already use this concept: a phone chip has a CPU, a graphics processor, a video engine, an audio engine and more on the same die, each tuned to provide the maximum performance and functionality per unit of energy consumed. So adding a second "economy engine" CPU alongside the "V8 engine" CPU is a question of return on silicon real estate. Adding ARM's latest Cortex-A7 CPU, our smallest and most energy-efficient application processor to date, to the high-end Cortex-A15 CPU in order to achieve the performance and economy of a dream machine makes perfect sense. We call this concept big.LITTLE processing: using a small, extremely energy-efficient ARM CPU in tightly synchronized combination with a fully compatible high-performance ARM CPU.
The Cortex-A7 processor has been designed to be a natural fit as the little CPU in a big.LITTLE pair with the high-end Cortex-A15 CPU, and in this brief I'd like to share with you just how we did it.

The first thing we wanted to guarantee is that we don't see "fits and starts" when we switch engines. In big.LITTLE processing, this means ensuring 100% software compatibility between the small and large CPUs. From the standpoint of user and OS software, the big and little cores had to look identical; specifically, they are identical from an architectural standpoint. Every instruction, data type and addressing mode that exists on the Cortex-A15 processor exists on the Cortex-A7 processor, and produces identical results. Several other aspects of the design have been aligned as well, such as the cache line size, the 40-bit physical address space, hardware virtualization, and the 128-bit AMBA 4 native bus interface.

The second critical element is to ensure the optimal operating points for both engines: the maximum miles per gallon on the LITTLE and the peak performance of a turbo V8, but with no missing gears. Our approach in big.LITTLE processing was to identify the critical performance points and distinct power profiles for next-generation mobile platforms. The big CPU (the Cortex-A15) was designed to provide significantly higher performance than today's high-end while staying within the mobile power envelope. This called for a more complex, parallel, out-of-order pipeline of 15 stages or more, depending on the instruction stream. We developed a dramatically different core micro-architecture for the little CPU (the Cortex-A7): an in-order, 8-stage pipeline with the ability to dual-issue the most common instruction pairs. The NEON SIMD and floating-point units, which provide greater media and floating-point performance, are also scaled down in comparison to those of the high-performance core.
They support only in-order instruction completion, but still support all of the same operations, including 64-bit double-precision floating-point calculations and dual- and quad-word SIMD operations on integer and single-precision floating-point data types. If the power and area of the smaller CPU were too close to those of the larger CPU, the energy savings from the switch would not be sufficient to justify the addition of the second CPU cluster. Similarly, if the performance of the smaller CPU were not high enough relative to the high-performance core, the gap between their capabilities would result in spotty performance at the point of discontinuity. It was critical, therefore, for us to walk a fine line between performance and power efficiency to deliver a CPU appropriately sized to make big.LITTLE feasible.

The third thing we needed to guarantee is constant, linear acceleration and deceleration: no handover problems between engines. Central to ARM's implementation of big.LITTLE processing is extremely rapid task migration between the large and small CPUs. An obstacle to rapid context migration is the time required to clean and invalidate the cache memories on the outbound CPU cluster, the one the active context is being switched out of. Both the (little) Cortex-A7 and the (big) Cortex-A15 processors feature AMBA Coherency Extensions (ACE) interfaces which allow them to snoop across the ARM Cache Coherent Interconnect fabric (CCI-400) to perform lookups in the L1 and L2 caches of the other CPU cluster. The benefit of this capability is that the outbound CPU cluster only needs to save a small context consisting of the register files, CP15 register values, and security state. This small context can then be restored on the inbound cluster, with a total save/restore time of less than 20 microseconds in typical implementations (e.g. at 1GHz or more on the big CPU). This rapid context switch has several benefits.
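The migration sequence above can be sketched as a toy model. This is purely illustrative (the class and function names are my own, not ARM's; a real implementation lives in firmware and OS power-management code), but it shows the key point: because the CCI-400 keeps both clusters' caches snoopable, only the small register context moves.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CpuContext:
    """The small context that must be explicitly saved on a switch:
    register files, CP15 system-control register values, security state."""
    registers: list
    cp15: dict
    secure: bool

@dataclass
class Cluster:
    name: str
    context: Optional[CpuContext] = None

def migrate(outbound: Cluster, inbound: Cluster) -> None:
    """Toy model of a big.LITTLE task migration.

    Note what is *absent*: no cache clean/invalidate on the outbound
    cluster. Its L1/L2 contents stay visible to the inbound cluster
    via snoops across the coherent interconnect, which is why the
    whole handover fits in ~20 microseconds in typical implementations.
    """
    ctx = outbound.context       # save the small architectural context
    outbound.context = None      # outbound cluster can now be powered down
    inbound.context = ctx        # restore it on the inbound cluster

# Example: migrate a running context from the little core to the big core.
little = Cluster("Cortex-A7", CpuContext(registers=[1, 2, 3],
                                         cp15={"SCTLR": 0x1}, secure=False))
big = Cluster("Cortex-A15")
migrate(little, big)
```

The context object is deliberately tiny; that, not raw bus speed, is what makes the switch cheap enough to use opportunistically.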
Because the overhead cost of switching is so low, the power management framework can decide to switch to the little cluster even for a very brief period and still save energy, or switch to the big CPU cluster for a very short burst to deliver instantaneous peak performance. The switching decision is simple, so the software making the decision can be simple as well. The context can also be switched in the middle of an application: for example, when the CPU starts rendering a web page the big CPU can be switched on, and once the page is rendered the context can switch to the smaller CPU until a new page needs to be loaded. There is no need to segment apps between the CPUs; the SoC's power management facilities can switch instantaneously to the right-sized CPU element.

The fourth and final thing is to ensure these engines work with a regular transmission. We needed a simple software approach to controlling the big.LITTLE switch, consistent with the power management mechanisms already in place. Current smartphones and tablets make use of Dynamic Voltage and Frequency Scaling (DVFS) and multiple idle modes for the individual CPU cores and IP blocks in the application processor SoC. Our implementation of big.LITTLE modifies the back end of the driver which controls the processor's DVFS operating point (for example, cpufreq in Linux/Android). Instead of three or four DVFS operating points, the driver is now aware of two CPU clusters, each potentially with three or four independent voltage and frequency operating points, extending the range of performance tuning available to existing smartphone power management solutions.
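A minimal sketch of that modified driver back end may make the idea concrete. The operating-point table below is invented for illustration (the frequencies, voltages and function name are assumptions, not ARM's or any SoC vendor's actual figures); the point is that the governor still asks for a single operating-point index, while the back end maps the lower points onto the little cluster and the upper ones onto the big cluster.

```python
# Illustrative operating-point tables: (cluster, frequency_MHz, voltage_mV).
# Real tables are SoC-specific; these numbers are made up for the sketch.
LITTLE_OPPS = [("A7", 350, 900), ("A7", 700, 950), ("A7", 1000, 1050)]
BIG_OPPS    = [("A15", 1200, 1000), ("A15", 1600, 1100), ("A15", 2000, 1200)]

# One logical DVFS range spanning both clusters, lowest point first.
OPP_TABLE = LITTLE_OPPS + BIG_OPPS

current_cluster = "A7"  # the cluster currently holding the context

def set_operating_point(index: int):
    """Back end of a cpufreq-style driver: select the requested operating
    point and, if it lives on the other cluster, a big.LITTLE switch
    (context migration) would be triggered at this step."""
    global current_cluster
    cluster, freq_mhz, volt_mv = OPP_TABLE[index]
    if cluster != current_cluster:
        # ... trigger the rapid context migration described above ...
        current_cluster = cluster
    # ... program the PLL and voltage regulator for (freq_mhz, volt_mv) ...
    return cluster, freq_mhz, volt_mv

# Low demand stays on the little cluster; peak demand lands on the big one.
low = set_operating_point(0)
peak = set_operating_point(5)
```

From the governor's point of view nothing changed: it still walks a single ladder of operating points, which is why existing DVFS policy code carries over largely untouched.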
A big.LITTLE CPU cluster can be operated in a pure switching mode, where only one CPU cluster is active at a time under control of the DVFS driver, or in a big.LITTLE heterogeneous multiprocessing mode, where the OS explicitly controls the allocation of threads to the big or little CPU clusters and is thus aware of the presence of the different types of cores.

Taken together, these attributes of ARM's big.LITTLE processing enable a best-of-both-worlds solution for modern mobile devices: energy savings of up to 70% over today's high-end smartphone application processors, with peak performance significantly higher than the highest-end 2011 smartphone. Note that this is not an either/or proposition; it really is both higher peak performance and energy savings, on the same workload. This is possible because smartphone and tablet workloads are highly dynamic. For key workloads like web browsing, video streaming, casual gaming, and MP3 playback, the apps CPU spends 70-90% of run-time at the lowest DVFS operating point and less than 5% or so at the highest. Even for high-end gaming workloads or heavily interactive websites, the peak operating point typically accounts for 20-30% of runtime, with opportunities to switch to (or allocate threads to) the little CPUs for 70-80% or more of CPU runtime. This maps very well to big.LITTLE processing, where the Cortex-A7 (little) can typically handle all but the highest two operating points of currently shipping high-end application processor CPUs. This enables the Cortex-A7 to deliver the same level of required performance at a significant power and energy saving for over 80% of CPU runtime, then switch instantaneously to the high-performance Cortex-A15 CPU on demand to deliver peak performance.
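As a back-of-the-envelope illustration of why those residency numbers translate into large savings, here is the arithmetic with invented power figures (the 1.0W and 0.35W values are placeholders for the sketch, not measured data; only the 80% residency figure comes from the text):

```python
# Hypothetical average active power draw, in watts (illustrative only).
P_BIG = 1.00     # big cluster running the workload
P_LITTLE = 0.35  # little cluster covering the same low/mid operating points

# Per the text: the little CPU can cover over 80% of CPU runtime,
# with the big CPU needed for the remainder.
residency_little = 0.80
residency_big = 1.0 - residency_little

# Baseline: the big cluster runs the entire workload.
energy_big_only = 1.0 * P_BIG

# big.LITTLE: each phase runs on the appropriately sized cluster.
energy_biglittle = residency_little * P_LITTLE + residency_big * P_BIG

savings = 1.0 - energy_biglittle / energy_big_only
print(f"energy saved: {savings:.0%}")  # prints "energy saved: 52%"
```

Even with these deliberately conservative placeholder numbers, a little-cluster residency of 80% cuts energy roughly in half; the actual savings depend on the real power ratio between the clusters and the workload mix.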
Coming back to the car analogy, this is like having a turbo V8 engine on standby for the times when you want to pass going uphill or accelerate from a dead stop, then switching to the fuel-efficient engine once you let off the gas a bit, in less than the blink of an eye.

Now, in engineering I've learned that there is no free lunch. You can't get both high performance and high efficiency without trading off some other variable. In this case it is area: the extra CPU cluster takes up a little extra area relative to the high-performance CPU on its own. However, in modern process geometries such as 28nm, the Cortex-A7 CPU takes up less than half a square millimetre per core, so the on-die cost is quite small and the combined system fits into the silicon real estate earmarked for the CPU cluster. This is the kind of tradeoff that makes a lot of sense to me, even if I have to include a little extra area on an SoC to get better average efficiency than today's mainstream smartphone with significantly higher peak performance than today's highest-performance smartphones. I only wish such a tradeoff were available in a car: the cost-efficient, high-mileage, extreme-performance dream machine!