Elba Processor Power Management

John Goodacre
September 12, 2013

Blog originally posted on 11 July 2011 on blogs.arm.com

Having chosen the optimal implementation, as described in the previous blog (Elba - How do we know it works?), we then turned our attention to power management.

Simulations of Elba at this point in the program were starting to report some rather noticeable power levels for the processor, especially at the design corner we were most familiar with. The worst-case design corner is a statistical point across the variations you could potentially see from a silicon process, at a temperature that is assumed never to occur. Remember, ARM's primary market was mobile devices, and for these devices a manufacturer wanted to know that every chip delivered from the fab would achieve the defined performance. So speed would always be defined by the statistically slowest piece of silicon, and power by the statistically fastest and hottest piece of silicon. Neither would ever exist in reality, but together they allowed the manufacturer to maximise device yield without testing each part for its performance. You may know that around 1 billion phones were sold last year, and ‘speed-binning’ parts across that market would add a lot of cost.
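As a rough illustration of the worst-case reasoning above, here is a minimal Python sketch. The corner names and all the numbers are made up for illustration; they are not figures from the Elba program.

```python
# Hypothetical corner characterisation: (corner name, max clock in MHz, power in mW).
corners = [
    ("slow silicon, cold", 520.0, 310.0),
    ("typical silicon, room temperature", 700.0, 380.0),
    ("fast silicon, hot", 900.0, 560.0),
]

# The guaranteed clock is set by the slowest corner ...
guaranteed_mhz = min(fmax for _, fmax, _ in corners)
# ... while the quoted power is set by the hungriest corner.
budget_mw = max(power for _, _, power in corners)
```

Note that the resulting pair (the slowest clock with the highest power) does not belong to any single corner in the list, which mirrors the point that the worst-case part never exists in reality.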

There are various ways to speed-bin an SoC, but the two main ones are to split parts by their maximum clock speed, or by the power they consume while achieving a given clock speed. The general microprocessor market is very familiar with the first: for years, folk have paid more for the fewer parts that go faster than the rest. So, rather than selling all parts at, say, $25 and stating that all of them will achieve, say, 500MHz, as was typical in the ARM ecosystem, speed-binning would allow the exact same silicon to be sold at, say, $20 for the few parts that can only reach 500MHz, and maybe double that for the fast ones that would typically be able to achieve 1GHz. As vendors expert in binning also know, parts that would typically have been sold as fast parts can instead be sold as low-power parts, since they reach the target speed at a lower voltage. That is a good reason to block such a device from being overclocked, I think.
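The two binning strategies described above can be sketched in a few lines of Python. The part measurements, thresholds, and bin names here are hypothetical and chosen only to illustrate the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Part:
    """One tested die. All values used below are hypothetical."""
    part_id: str
    fmax_mhz: float        # highest clock that passes test at nominal voltage
    vmin_at_target: float  # lowest voltage (V) that still reaches the target clock


def bin_by_speed(part: Part, fast_mhz: float = 1000.0, base_mhz: float = 500.0) -> str:
    """Speed-binning: sort parts by the maximum clock they reach."""
    if part.fmax_mhz >= fast_mhz:
        return "fast"
    if part.fmax_mhz >= base_mhz:
        return "base"
    return "reject"


def bin_by_power(part: Part, low_power_v: float = 0.95, nominal_v: float = 1.1) -> str:
    """Power-binning: sort parts by the voltage needed to hit a fixed clock."""
    if part.vmin_at_target <= low_power_v:
        return "low-power"
    if part.vmin_at_target <= nominal_v:
        return "standard"
    return "reject"


parts = [
    Part("A", fmax_mhz=1050.0, vmin_at_target=0.90),
    Part("B", fmax_mhz=620.0, vmin_at_target=1.05),
    Part("C", fmax_mhz=480.0, vmin_at_target=1.15),
]

speed_bins = {p.part_id: bin_by_speed(p) for p in parts}
power_bins = {p.part_id: bin_by_power(p) for p in parts}
```

In this toy data set, part A lands in both the "fast" bin and the "low-power" bin, which is the observation made above: silicon that could be sold as a fast part can also be sold as a low-power one.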

Anyway, back to the power management of the processor macro. The power numbers we were seeing kicked off two new aspects of the program. The first was the creation of various independent power regions across the macro. The other was at the physical IP layer, the actual transistor level of the design, where we started to look at different transistor designs that could be used in the “G” process without causing as much leakage. The design of the actual gates was then defined in collaboration with the processor designers, so that specific logic paths through the RTL design could maximize performance while power was reduced on other, non-time-critical paths. Both of these developments are now available as physical IP products: the multi-channel library and the Artisan Processor Optimization Pack (PoP).

Within the macro there were eight independent power regions, each allowing power to be removed from that part of the macro. These included each CPU, each NEON unit, each debug trace unit, debug itself, the MBIST controller, and finally the L2 controller and processor snoop unit. With so many power domains, a lot of effort was needed to ensure that the in-rush current when these blocks were brought back online didn’t surge higher than the design envelope. The complexity of the problem was further increased by a design goal of restoring power within 100ns. This was achieved with a hierarchy of power switches throughout the design and integral logic to restore synchronization.
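To give a feel for the trade-off between in-rush current and wake-up time, here is a toy Python model. It is not the scheme used on Elba: the switch groups, current figures, in-rush ceiling, and per-stage settling time are all assumptions made for illustration. Only the 100ns target comes from the text.

```python
def schedule_switch_stages(group_inrush_ma, limit_ma):
    """Greedily pack switch groups into sequential stages.

    Each stage turns on a set of switch groups together, and the sum of their
    in-rush currents must stay within limit_ma. Groups are taken in order.
    """
    stages = []
    current_stage, current_ma = [], 0.0
    for index, inrush in enumerate(group_inrush_ma):
        if inrush > limit_ma:
            raise ValueError(f"group {index} alone exceeds the in-rush limit")
        if current_ma + inrush > limit_ma:
            # This group would break the ceiling, so close the stage first.
            stages.append(current_stage)
            current_stage, current_ma = [], 0.0
        current_stage.append(index)
        current_ma += inrush
    if current_stage:
        stages.append(current_stage)
    return stages


# Hypothetical numbers: six switch groups in one power region.
GROUP_INRUSH_MA = [40.0, 40.0, 30.0, 30.0, 20.0, 20.0]
LIMIT_MA = 80.0         # assumed in-rush ceiling from the design envelope
STAGE_NS = 20.0         # assumed settling time per stage
WAKE_BUDGET_NS = 100.0  # restore target mentioned in the text

stages = schedule_switch_stages(GROUP_INRUSH_MA, LIMIT_MA)
wake_time_ns = len(stages) * STAGE_NS
meets_budget = wake_time_ns <= WAKE_BUDGET_NS
```

With these assumed values the six groups fall into three stages, so the region wakes in 60ns. Lowering the in-rush ceiling adds stages and pushes the wake time towards the budget, which is the tension the design had to resolve.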

Power Optimized Design

The key component to address in the power optimized design was the gate leakage, especially at higher temperatures. We already had all the typical transistor types available, including the HVT transistors typically used to reduce leakage, but these were not enough for the power optimized macro to have any commercial interest. So we set ourselves the goal that it must be able to clock faster than the equivalent “LP” process while consuming less power at each temperature/voltage point, a goal that needed something very different. The ‘magic bullet’ was to design cells with exactly the same dimensions as the standard cells for the process, but with an increased channel length. In our case, this meant having 50nm cells available for the 40nm process. These cells could be used interchangeably with the native 40nm cells, and could also be combined with HVT and other transistor speeds. Together, the result is that the power optimized 40G macro has an active power characteristic that is higher speed and lower power than 40LP, and actually more closely matches the more costly 32LP process, something that has proven to be commercially very interesting.
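Because the longer-channel cells share the footprint of the native cells, they can be swapped in wherever a path has timing slack to spare. The Python sketch below illustrates that idea only. The delay and leakage figures and the simple per-path swap rule are assumptions, not the actual optimization flow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CellVariant:
    """A drop-in variant of the same cell footprint. Values are hypothetical."""
    name: str
    delay_ps: float    # propagation delay of the cell
    leakage_nw: float  # static leakage of the cell


# Assumed figures for illustration only: the longer channel is slower but leaks less.
NATIVE_40NM = CellVariant("40nm-channel", delay_ps=20.0, leakage_nw=10.0)
LONG_50NM = CellVariant("50nm-channel", delay_ps=24.0, leakage_nw=4.0)


def choose_variants(path_cells, path_slack_ps):
    """Swap cells on one path to the long-channel variant while slack allows.

    path_cells is how many cells sit on the path; path_slack_ps is its timing
    slack with every cell at the native channel length. Returns the number of
    long-channel cells, the number of native cells, and the leakage saved (nW).
    """
    penalty_ps = LONG_50NM.delay_ps - NATIVE_40NM.delay_ps
    if path_slack_ps <= 0:
        swaps = 0  # critical path: keep every cell fast
    else:
        swaps = min(path_cells, int(path_slack_ps // penalty_ps))
    saved_nw = swaps * (NATIVE_40NM.leakage_nw - LONG_50NM.leakage_nw)
    return swaps, path_cells - swaps, saved_nw


critical = choose_variants(path_cells=12, path_slack_ps=0.0)
relaxed = choose_variants(path_cells=12, path_slack_ps=30.0)
idle = choose_variants(path_cells=12, path_slack_ps=200.0)
```

In this model the critical path keeps all of its native cells, the path with a little slack converts some of them, and the path with plenty of slack converts every cell. Speed is preserved where it matters and leakage is cut everywhere else.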

In part four of this blog I’ll outline how we brought the complete design together and the conclusions we drew.

  • Part 1: Wouldn't it be interesting if we... - Giving Birth to "Elba"
  • Part 2: Elba - How do we know it works?
  • Part 4: Elba - Bringing it all together