One of the canards that’s regularly trotted out in discussions of ARM vs. x86 processors is the idea that ARM chips are intrinsically more power efficient thanks to fundamental differences in the ISA (instruction set architecture). A new research paper examines these claims using a variety of ARM cores as well as a Loongson MIPS microprocessor, Intel’s Atom and Sandy Bridge microarchitectures, and AMD’s Bobcat.
This paper is an updated version of one I’ve referenced in previous stories, but its methods and claims are worth investigating in more detail. ISA investigations are intrinsically difficult given that it’s effectively impossible to separate the theoretical efficiency of an architecture from the proficiency of its design team or the technical expertise of its manufacturer. Even products that seem identical can have important differences — ARM revised the Cortex-A9 core four different times and has released three updates to the Cortex-A15. Then you have the particulars of manufacturing — Intel, TSMC, Samsung, and GlobalFoundries aren’t carbon copies of each other and the CPU inside a Tegra K1 isn’t 100% identical to the Cortex-A15 inside a Samsung Exynos SoC.
That’s just the hardware side of the equation. Toss in compiler optimizations and library support and it’s even harder to write a definitive apples-to-apples comparison of any two architectures.
With that said, the team from the University of Wisconsin has taken a pretty good whack at an incredibly complex problem, comparing the ARM, MIPS, and x86 architectures described above.
The chips in question were tested in desktop, mobile, and server workloads with a mixture of programs including CoreMark, WebKit, SPEC tests, and a variety of other benchmarks. Power consumption data was gathered at the SoC level, while performance information was gathered using a variety of profiling techniques.
All of the systems save the Cortex-A15 were tested using Linux 2.6 LTS with minor patches; the A15 had to be tested with Linux 3.8 due to compatibility issues. All tests were compiled with GCC 4.4, with all target-independent optimizations enabled (-O3) and machine-specific tuning disabled. None of the tests included hand-written SIMD code, and while auto-vectorization was enabled, very few SIMD instructions were generated for ARM or x86. All of the tests were compiled in 32-bit mode on all of the architectures.
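A build along the lines the paper describes might look like the following. This is a sketch, not the team's actual build script: the file names are placeholders, and the paper doesn't list its exact command lines.

```shell
# Sketch of the compile setup described above (GCC 4.4-era flags;
# file names are placeholders, not from the paper).
# -O3 enables the target-independent optimizations (and, on GCC 4.4,
# auto-vectorization); -mtune/-march are deliberately omitted so no
# machine-specific tuning occurs; -m32 forces 32-bit code on x86.
gcc -O3 -m32 -o bench bench.c
```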
We’ll start with performance data, since it’s tied directly to the question of device power consumption. Here’s performance in each workload category for each core. The slight gap between the A8/Atom (Clover Trail) results and the other chips reflects the fact that Atom and the A8 are in-order architectures while all of the other chips are out-of-order. In this chart, results are normalized against the Core i7, which scores 1x. Lower bars mean faster performance.
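The normalization the chart uses can be sketched in a few lines. The runtimes below are made-up placeholder numbers for illustration, not figures from the paper.

```python
# Normalize per-benchmark runtimes against a baseline chip (the Core i7
# here), as the chart does: the baseline scores 1x, and since the values
# are runtimes, lower means faster.
def normalize(runtimes, baseline):
    base = runtimes[baseline]
    return {chip: t / base for chip, t in runtimes.items()}

# Hypothetical runtimes in seconds -- illustrative only.
times = {"Core i7": 10.0, "Cortex-A15": 25.0, "Atom": 40.0}
print(normalize(times, "Core i7"))
# {'Core i7': 1.0, 'Cortex-A15': 2.5, 'Atom': 4.0}
```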
Nothing particularly unusual here. The Cortex-A15 is faster than any of the other low-power architectures in every workload but server tests, where Bobcat holds that distinction. The Cortex-A8 is the slowest core, especially in SPEC FP — the Cortex-A8’s floating-point unit (VFPLite) isn’t pipelined, and it shows in these benchmarks. Loongson also struggles, possibly due to poor compiler optimization.
The Cortex-A9 is a middle-of-the-road core, often not quite as fast as Atom, but vastly superior to the Cortex-A8.
Now that we’ve established our baseline performance, let’s look at power consumption. Here’s the overall comparison between A8, A9, Loongson, Atom, A15, Bobcat, and the i7.
These results have been normalized against the Cortex-A8, which scores a 1x in every category. Raw average power consumption shows the Cortex-A9 consuming significantly less power than Atom (which generally ties Loongson), while Bobcat and the Cortex-A15 are in an entirely different category. The i7, of course, soars above every other chip.
Next up, we have raw average energy consumption. Energy is distinct from power: average power simply states that the chip drew X watts at any given moment, while energy multiplies that draw by the time it took to complete a given workload. Again, results are normalized against the Cortex-A8.
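The power/energy distinction reduces to simple arithmetic, and it's why a "hungrier" chip can still win on efficiency. The numbers below are hypothetical, purely to illustrate the relationship.

```python
# Energy (joules) = average power (watts) * time to finish (seconds).
# A chip that draws more power can still consume less energy overall
# if it finishes the workload quickly enough ("race to idle").
def energy_joules(avg_power_w, runtime_s):
    return avg_power_w * runtime_s

# Hypothetical numbers, illustrative only -- not figures from the paper.
slow_low_power = energy_joules(0.5, 100.0)   # 0.5 W for 100 s = 50 J
fast_high_power = energy_joules(4.0, 10.0)   # 4.0 W for 10 s  = 40 J
```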
This is the data that shoots down the “ARM is intrinsically more power efficient than x86” argument. In mobile, where ARM is strongest, the Clover Trail Atom is more efficient than the Cortex-A8, but doesn’t quite match the Cortex-A9. Bobcat is actually less efficient than the Core i7, and the Cortex-A15 draws far more power than any previous ARM design.
The SPEC tests further emphasize the Cortex-A9’s high efficiency — SPEC INT has the i7, Bobcat, and A15 nearly tied with each other while SPEC FP illustrates how beefing up certain aspects of a core can improve energy efficiency. Bobcat acquits itself well in each area, while Loongson continues to fall behind in energy efficiency.
Put the data together and you can plot the performance/power trade-offs that characterize each core.
To be clear, the ISA can sometimes matter. The report notes that in certain extremely specific cases where die sizes must be 1-2 mm² or power consumption is specced to sub-milliwatt levels, RISC microcontrollers can still have an advantage over their CISC brethren.
When every transistor counts, then every instruction, clock cycle, memory access, and cache level must be carefully budgeted, and the simple design tenets of RISC become advantageous once again. This mainly plays out at the microcontroller level — if you have a Cortex-A8 or above, the differences are entirely microarchitectural.
(Image: A modern RISC chip, the recently announced RISC-V.)
Companies that try to claim RISC still has enormous benefits over x86 at higher performance levels are explicitly ignoring the fact that RISC and CISC are terms that describe design strategies and that those strategies were formed in response to technological limitations of the day.
The entire reason CISC architectures emphasized complex multi-cycle instruction execution is because memory accesses were orders of magnitude slower than the processor and data storage was extremely limited. RAM costs could dwarf the cost of other system components and compilers were primitive. Programmer-friendly architectures were a response to these constraints.
Meanwhile, RISC chips could run at significantly higher clocks than their CISC counterparts thanks to reduced complexity — but that’s no longer true today. In the modern era, process technology controls clock speed, not one’s choice of RISC vs. CISC, and we’re bumping against the fundamental limits of silicon for any architecture. Research into pushing past those limits is now the major focus, not overhauling the ISA.
Factor in the myriad advances that both design philosophies have incorporated, and the old terms simply aren’t accurate any longer. The first RISC chips looked nothing like their CISC counterparts, whereas today a Core i7 and Cortex-A57 have far more in common. Decades of experience have led designers to adopt strategies and structures that work, even if the underlying ISA is different.
Another reason this myth persists is that compiler choices and optimizations introduce enormous confounding variables. The University of Wisconsin research team found multiple instances where one architecture executed far more instructions for a given workload than another, or was severely impacted by branch mispredictions or cache misses.
Check the Tonto and bwaves benchmark core cycle counts for an example of this trend. The Cortex processors execute dramatically more instructions for the same results. If the research team had included the impact of multiple compilers and SIMD optimizations, the results would be even more convoluted — and could easily be used to tilt the comparison in a direction that favored ARM or Intel.
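The confound described above falls out of the classic performance equation: runtime is the product of dynamic instruction count (a compiler/ISA effect) and cycles per instruction (a microarchitecture effect), divided by clock frequency, so an end-to-end measurement can't cleanly attribute a result to the ISA alone. The numbers below are hypothetical, chosen only to show how the two factors can pull in opposite directions.

```python
# runtime = (instructions * CPI) / frequency
# Instruction count reflects the ISA and compiler; CPI reflects the
# microarchitecture; any end-to-end benchmark mixes the two together.
def runtime_seconds(instructions, cpi, freq_hz):
    return instructions * cpi / freq_hz

# Hypothetical numbers, illustrative only -- not figures from the paper.
chip_a = runtime_seconds(1.0e9, 1.2, 2.0e9)  # fewer instructions, worse CPI
chip_b = runtime_seconds(1.6e9, 1.0, 2.0e9)  # more instructions, better CPI
```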
The RISC vs. CISC argument should’ve passed into history a long time ago. It may still have some relevance in the microcontroller realm, but has nothing useful to contribute to the modern era. An x86 chip can be more power efficient than an ARM processor, or vice versa, but it’ll be the result of other factors — not whether it’s x86 or ARM.