ARM big.LITTLE™ technology is becoming increasingly recognized within the industry as the way forward to meet the demands of higher performance with low power consumption in mobile devices. Since its launch, more than twelve ARM partners have begun actively designing with big.LITTLE technology.
As the adoption of big.LITTLE processing spreads, a number of common questions have been raised. I'd like to use this blog to answer some of them:
Number 1: Can you switch on all the cores at once?
In the earlier big.LITTLE software models (cluster migration and CPU migration), the software switched between cores and could not switch all cores on simultaneously. In the more recent software model, Global Task Scheduling, software can enable all cores to be active at once because the OS is aware of the big and LITTLE cores in the system and is in direct control of thread allocation among the available cores. With Global Task Scheduling the OS power management mechanisms will continue to idle unused cores in the same way it does in standard multi-core systems today.
One architecture, three main software usage models
Number 2: Is the software similar to existing mechanisms like DVFS and SMP scheduling?
In current smartphones and tablets, dynamic voltage and frequency scaling (DVFS) is used to adapt to instantaneous changes in required performance. The migration modes of big.LITTLE extend this concept by enabling a transition to "big" CPU cores above the highest DVFS operating point of the LITTLE cores. The migration takes about 30 microseconds. By contrast, the DVFS driver typically evaluates the performance demand of the OS and the individual cores every 50 milliseconds, although some implementations sample slightly more frequently, and it takes about 100 microseconds to change voltage and frequency. Because the time taken to migrate a CPU or a cluster is shorter than a DVFS change and orders of magnitude shorter than the OS evaluation period for DVFS changes, big.LITTLE transitions let the processors run at lower operating points more frequently while remaining completely invisible to the user.
In the Global Task Scheduling model, the DVFS mechanisms are still in operation, but the operating system kernel scheduler is aware of the big and LITTLE cores in the system and seeks to load balance high performance threads to high performance cores, and low performance or memory bound threads to the high efficiency cores. This is similar to SMP load balancers today, that automatically balance threads across the cores available in the system, and idle unused cores. In big.LITTLE Global Task Scheduling, the same mechanism is in operation, but the OS keeps track of the load history of each thread and uses that history plus real-time performance sampling to balance threads appropriately among big and LITTLE cores.
Number 3: Can it run with Android today, with no changes to Android?
big.LITTLE software is available as a patch set to the Linux kernel. It effectively operates underneath Android in the kernel. The Global Task Scheduling software (ARM's implementation of Global Task Scheduling is called big.LITTLE MP in the open source tree) is hosted on a Linaro git tree that is freely accessible to all, and it is in the process of upstream submission. The patch set can be applied to the standard Linux kernel operating underneath Android. ARM has demonstrated Global Task Scheduling on several development boards and with lead partners on production silicon at private events and at Mobile World Congress and CES. The first production implementations of big.LITTLE use the Cluster and CPU migration modes, as the software freeze date for those systems happened in 2012. Global Task Scheduling is expected on production systems starting in the second half of 2013.
Number 4: How does big.LITTLE enable higher performance?
Because of the presence of highly efficient Cortex®-A7 cores, SoC designers can tune the Cortex-A15 processor for high performance knowing that average power can remain well within the existing mobile power envelope. This allows the use of the higher throughput Cortex-A15 CPU at full capacity for bursts of performance, throttling back on voltage and frequency, then migrating work to LITTLE cores for sustained and background performance.
Additionally, in the Global Task Scheduling software model, the OS can allocate additional work to the Cortex-A7 CPUs when the Cortex-A15 CPUs are all fully loaded. Today this is most beneficial in benchmarks like Antutu, Geekbench, ANDeBench, and other multi-core workloads, but as software matures to take better advantage of additional cores, the presence of additional cores in the big.LITTLE system will allow higher aggregate performance.
Finally, we observe that many key workloads today, such as web browsing, feature one or two very demanding threads (WebViewCoreThread and SurfaceFlinger in Android). This kind of workload is very well suited to big.LITTLE - the high performance threads can each be assigned to a high performance Cortex-A15 CPU, while the background threads can be scheduled to one or more LITTLE CPUs. By allocating the lower performance threads to the LITTLE CPUs, the entire capacity of the high performance cores can be devoted to the most demanding threads, enabling higher performance overall.
Number 5: How does big.LITTLE enable greater energy savings than just lowering the voltage?
The Cortex-A15 cluster and the Cortex-A7 cluster in current generation big.LITTLE SoCs can run at independent frequencies. Alternative approaches have advocated the use of identical cores with asynchronous voltage scaling to reduce energy. With big.LITTLE, the big and LITTLE cores can scale voltage and reduce energy further by migrating less intense work to a simpler pipeline that is 3x more efficient. Across the whole performance range of the LITTLE CPU cores, they enable energy savings significantly higher than voltage scaling alone.
Only big.LITTLE has the benefit of a tuned micro-architecture that is 3 or more times the efficiency of the high performance CPUs.
- LITTLE cores are built using a completely different microarchitecture than the big cores, so the LITTLE cores save power by the nature of their simpler design, in addition to the voltage and frequency scaling benefit.
- The LITTLE cores can be implemented to target lower leakage and a more moderate performance point, independently from the physical implementation of the big cores that are often tuned for higher frequency.
Ultimately this means that big.LITTLE offers a greater opportunity to save power than a single CPU microarchitecture implementation. There are some solid benefits to asynchronous DVFS within a CPU cluster. We view asynchronous DVFS as an endorsement of the concept of scalable performance and a good solution in its own right. However, big.LITTLE technology has advantages over and above this. The parallel development of these technologies shows a strength of the ARM Ecosystem - they will compete for adoption in the market, and each approach will likely evolve over time based on that competition at a faster pace than if a single architecture and implementation were all that existed.
Number 6: Are the power savings available from big.LITTLE significant at the system level?
Saving fifty percent or more of the power of the CPU subsystem is a significant saving at the system level. When combined with DVFS, power gating, clock gating, and retention modes, big.LITTLE plays an important role in the overall power management of a mobile device, and it brings opportunities for future power reduction as software power management policies evolve and work more closely together to manage shut-down, core migration, voltage, and frequency in a coordinated policy. Bottom line, the power reductions are very good now, and they will get even better.
Number 7: Can big.LITTLE save power on high performance tasks too?
High performance applications have periods of lower intensity, for example when waiting for user input or while the GPU is active. During these periods, existing smartphone SoCs downshift to lower DVFS points and/or idle the cores. From the diagram below, we can see that during play, an HD racing game causes the DVFS mechanisms to idle the dual-core Cortex-A9 CPUs almost half the time, while operating below 1GHz over ninety percent of the time. All of these idle periods and low frequency states map well to LITTLE cores and present the opportunity to save energy, even for a high performance workload like the GT Racer HD game.
Other examples of high performance workloads with low intensity periods abound: web browsing immediately after a page is rendered, or high performance tasks that are waiting on memory. Because of the extremely fast migration of work from big to LITTLE cores, even very short periods of lower intensity can be mapped to LITTLE CPUs to save energy.
Number 8: How much user level code needs to be changed to support big.LITTLE?
None. Decisions about whether to use big or LITTLE cores are the job of the OS. big.LITTLE is a power management technique that is completely invisible to user level software, much like dynamic voltage and frequency scaling (DVFS) or CPU shutdown in a multi-core SoC.
There are opportunities to be exploited when using big.LITTLE that can be driven by user space. User space can know whether a thread is important to user experience, and for example allow user interface threads to use big CPUs, whilst preventing background threads/apps from doing so. Other examples include preventing the use of big cores when the screen is off, or pinning threads in use cases where you know you can do the compute with just LITTLE cores, say during a call. User space has the opportunity to take you that little bit further, but none of these techniques are required and big.LITTLE does not require any user space awareness for it to save energy and deliver high performance.
Number 9: Can there be a different number of big and LITTLE cores?
With Global Task Scheduling, it is possible for the software to automatically support different numbers of big and LITTLE cores. There are no extra requirements on the system; the software automatically load balances among the different number and type of cores. We expect this type of asymmetric system topology, with different numbers of big and LITTLE cores, to become more common as big.LITTLE Global Task Scheduling deploys more broadly beginning in the second half of 2013.
Number 10: I hear it's hard to use... how complicated is it?
The software is actually quite straightforward. There are no user level or middleware level code changes. The big.LITTLE software lies entirely in kernel space and is delivered as a relatively small patch set that is applied by the silicon vendor in board and chip support libraries. There is some tuning by the silicon vendor and OEM, similarly to the way DVFS operating points and core shutdown policies are tuned in standard multi-core systems today. The patch is in the kernel, and so transparent to the user - once you build it, and tune it, it just works.
The application developer can get all the benefits of big.LITTLE (high speed task migration, optimum efficiency, higher performance) simply, as all the integration work has been done by ARM and the ARM partners with exactly this in mind. ARM has an example hardware implementation of big.LITTLE using the Versatile Express V2P-CA15_A7 CoreTile, which is a great starting point for evaluation.