Blog originally posted on 11 July 2011 on blogs.arm.com
Having chosen the optimal implementation, as described in the previous blog (Elba - How do we know it works?), we now turned our attention to power management.
Simulations of Elba at this point in the program were starting to report some rather noticeable power figures for the processor, especially at the design corner we were most familiar with: the worst case. The worst-case design corner is a statistical extreme of the variations you could potentially see from a silicon process, evaluated at a temperature that in practice is never reached. Remember, ARM's primary market was mobile devices, so for these devices a manufacturer wanted to know that every chip delivered from the fab would achieve the defined performance. Speed would therefore always be defined by the statistically slowest piece of silicon, and power by the statistically fastest and hottest piece of silicon – neither of which would ever exist in reality, but signing off this way allowed the manufacturer to maximise device yield without testing each part for its performance. You may know that around 1 billion phones were sold last year – and that would be a lot of cost to ‘speed-bin’ parts across that market.
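To make the corner idea concrete, here is a minimal Python sketch of that sign-off logic; the corner names and the speed/power numbers are invented for illustration. The guaranteed speed comes from the slowest corner, while the power budget comes from the fastest, hottest one.

```python
# Hypothetical worst-case sign-off: speed is taken from the statistically
# slowest corner, power from the fastest and hottest one.
# Corner names and numbers are invented for this example.

corners = {
    # (process, voltage_V, temperature_C): (max_clock_MHz, power_mW)
    ("slow",    0.99, 125): (480, 310),
    ("typical", 1.10,  25): (610, 390),
    ("fast",    1.21, 125): (720, 540),
}

# The speed every shipped part is guaranteed to reach: the slowest corner.
guaranteed_mhz = min(speed for speed, _ in corners.values())

# The power budget the system must tolerate: the fastest, hottest corner.
worst_case_mw = max(power for _, power in corners.values())

print(f"Sign-off speed: {guaranteed_mhz} MHz, worst-case power: {worst_case_mw} mW")
```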
There are various ways to speed-bin an SoC, but the two main ones are to grade parts by their maximum clock speed, or by the power they consume while achieving a given clock speed. The general microprocessor market is very familiar with the first: for years folk have paid more for the fewer parts that go faster than the others. So, rather than sell all parts at say $25 with the promise that every one will achieve 500MHz, as was typical in the ARM ecosystem, speed-binning allows the exact same silicon to be sold at say $20 for the few parts that can only reach 500MHz, and maybe double that for the fast ones that can typically achieve 1GHz. As vendors expert in binning parts also know, parts that would typically have been sold as fast parts can instead be sold as low-power parts, since these can reach the target speed using a lower voltage – a good reason to block such a device from being overclocked, I think.
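As a rough sketch of the two binning schemes, here is a toy Python example; the thresholds, prices and part data are all invented for illustration, not real figures.

```python
# Toy illustration of the two binning approaches: grading identical silicon
# either by the maximum clock it reaches, or by the power it needs at a fixed
# target clock. Thresholds, prices and part data are invented.

def bin_by_speed(max_mhz: float) -> str:
    """Grade by the maximum clock the part can reach."""
    return "premium ($40)" if max_mhz >= 1000 else "standard ($20)"

def bin_by_power(power_mw_at_500mhz: float) -> str:
    """Grade by the power consumed while hitting a fixed 500MHz target."""
    return "low-power" if power_mw_at_500mhz <= 250 else "standard"

parts = [
    {"id": 1, "max_mhz": 1050, "power_mw_at_500mhz": 220},
    {"id": 2, "max_mhz": 520,  "power_mw_at_500mhz": 300},
]

for p in parts:
    print(p["id"], bin_by_speed(p["max_mhz"]), bin_by_power(p["power_mw_at_500mhz"]))
```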
Anyway, back to the power management of the processor macro. The power numbers we were seeing kicked off two new aspects to the program: the first was the creation of various independent power regions across the macro, and the other was at the physical IP layer, the actual transistor level of the design, where we started to look at different transistor designs that could be used in the “G” process without causing as much leakage. The design of the actual gates was then defined in collaboration with the processor designers, so that specific logic paths through the RTL design could maximize performance while power was reduced on other, non-time-critical paths. Both of these developments are now available as physical IP products: the multi-channel library and the Artisan Processor Optimization Pack (PoP).
Within the macro there were eight independent power regions, each allowing power to be removed from that part of the macro: each CPU, each NEON unit, each debug trace unit, debug itself, the MBIST controller, and finally the L2 controller and processor snoop unit. With so many power domains, a lot of effort was clearly needed to ensure that the in-rush current when these blocks were brought back online didn't surge beyond the design envelope. The complexity of the problem was further increased by a design goal of ensuring power could be restored within 100ns. This was achieved with a hierarchy of power switches throughout the design and integral logic to restore synchronization.
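The sequencing idea can be illustrated with a small sketch. This is not the actual Elba power-switch design; the switch counts, currents and delays below are made up, but it shows why closing a few weak switches first keeps the in-rush current inside an envelope while still restoring power within a fixed budget.

```python
# Minimal sketch of staged power-up through a hierarchy of power switches.
# A few weak "trickle" switches charge the local rail slowly, then stronger
# ranks close once the rail is near VDD and per-switch in-rush is small.
# All values are invented for illustration.

STAGES = [
    # (number_of_switches, per_switch_peak_mA, stage_delay_ns)
    (4,    5, 20),   # weak trickle switches charge the local supply slowly
    (32,  10, 30),   # second rank closes once the rail is mostly charged
    (256,  2, 40),   # main switches close; rail is near VDD, so little in-rush each
]

IN_RUSH_LIMIT_MA = 600   # assumed design envelope for peak current
RESTORE_BUDGET_NS = 100  # power-restore goal from the blog

elapsed_ns = 0
for count, peak_ma, delay_ns in STAGES:
    stage_peak = count * peak_ma
    elapsed_ns += delay_ns
    assert stage_peak <= IN_RUSH_LIMIT_MA, f"stage exceeds in-rush envelope: {stage_peak} mA"

assert elapsed_ns <= RESTORE_BUDGET_NS, f"power restore took {elapsed_ns} ns"
print(f"Domain powered up in {elapsed_ns} ns, within the {RESTORE_BUDGET_NS} ns budget")
```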
The key component to address in the power optimized design was gate leakage, especially at higher temperatures. We already had all the typical transistor types available; HVT transistors are typically used to reduce leakage, but these were not enough for the power optimized macro to have any commercial interest. So we set ourselves the goal that it must be able to clock faster than the equivalent “LP” process while consuming less power at each temperature/voltage point, a goal that needed something very different. The ‘magic bullet’ was to design cells that had exactly the same dimensions as the standard cells for the process, but with an increased channel length. In our case, this meant having 50nm channel-length cells available for the 40nm process. These cells could be used interchangeably with the native 40nm cells, and could also be used in combination with HVT and other transistor variants. Together, the result is that the power optimized 40G macro has an active power characteristic that is higher speed and lower power than 40LP, and actually more closely matches the more costly 32LP process – something that has proven to be commercially very interesting.
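To illustrate the mix-and-match use of cell variants, here is a hedged sketch of the kind of per-path choice a flow might make: paths with timing slack get longer-channel (or HVT) cells that leak less, while critical paths keep the fast native cells. The cell names and the delay/leakage numbers are illustrative, not real library characterisation data.

```python
# Sketch of leakage optimisation by cell selection: pick the least-leaky cell
# variant whose extra delay still fits the path's timing slack.
# Variant names and numbers are invented for illustration.

CELL_VARIANTS = [
    # (name, relative_delay, relative_leakage), ordered fast/leaky -> slow/quiet
    ("40nm_SVT",     1.00, 1.00),  # fast native cell, highest leakage
    ("40nm_HVT",     1.15, 0.40),  # high-Vt variant
    ("50nm_channel", 1.25, 0.15),  # longer-channel variant, same footprint
]

def pick_cell(path_slack_ratio: float) -> str:
    """Choose the slowest (least leaky) variant whose extra delay fits the slack."""
    best = CELL_VARIANTS[0][0]
    for name, delay, _leak in CELL_VARIANTS:
        if delay - 1.0 <= path_slack_ratio:
            best = name
    return best

print(pick_cell(0.05))   # critical path   -> 40nm_SVT
print(pick_cell(0.18))   # some slack      -> 40nm_HVT
print(pick_cell(0.40))   # relaxed path    -> 50nm_channel
```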
In part four of this blog I’ll outline how we brought the complete design together and the conclusions we drew.