I will be speaking at the ARM® Technology Conference (Techcon) this week about our latest Cortex-A9 implementation project, which I am pretty excited about. Cortex-A9 is a scalable, high-performance, low-power multiprocessor from ARM. It is ideally suited to applications that require high levels of performance where the power budget or battery life is critical. ARM implementation teams have been pushing the frequency achievable on the Cortex-A9 on various process nodes, such as TSMC CLN40G and Common Platform (CP) CMOS32LP. Recently we implemented a dual-core configuration of the Cortex-A9 processor in-house, and it achieved a frequency of 1GHz+ in the CMOS32LP process at low power. The chip was fabricated using the Samsung 32nm HKMG process, and the silicon is fully functional and works as expected.
Exceeding 1GHz on 32nm LP
Achieving 1GHz+ on CMOS32LP is a significant result, since an LP process node traditionally trades off performance for low leakage and is generally expected to be much slower than a G node. The baseline implementation of the dual-core Cortex-A9 with the CMOS32LP Foundation IP reached ~800MHz. To improve on this, we worked closely with our Physical IP division to identify physical IP components that could raise the overall performance of the CPU implementation. Approximately half of the frequency uplift to the 1GHz+ implementation point came from physical IP components built and optimized specifically for the Cortex-A9 processor. The rest came from methodology and floorplan improvements. The physical IP components built for the processor are available as a package called the ARM Artisan® Processor Optimization Package (POP).
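To put the uplift in terms of cycle time, here is a rough back-of-the-envelope sketch in Python. It treats "1GHz+" as exactly 1000MHz and the baseline as exactly 800MHz, which are simplifying assumptions. The even split of the uplift follows the "approximately half" statement above and is not a measured breakdown.

```python
def period_ps(freq_mhz: float) -> float:
    """Return the clock period in picoseconds for a frequency in MHz."""
    return 1e6 / freq_mhz


BASELINE_MHZ = 800.0  # "~800MHz" baseline, taken as exact for illustration
TARGET_MHZ = 1000.0   # assumption: "1GHz+" treated as exactly 1GHz

# Frequency uplift in absolute and relative terms.
uplift_mhz = TARGET_MHZ - BASELINE_MHZ
uplift_pct = 100.0 * uplift_mhz / BASELINE_MHZ

# Cycle time that had to be removed from the critical paths.
cycle_saved_ps = period_ps(BASELINE_MHZ) - period_ps(TARGET_MHZ)

# "Approximately half" of the uplift is attributed to physical IP.
physical_ip_share_mhz = uplift_mhz / 2

print(f"baseline period: {period_ps(BASELINE_MHZ):.0f} ps")
print(f"target period:   {period_ps(TARGET_MHZ):.0f} ps")
print(f"cycle time removed: {cycle_saved_ps:.0f} ps ({uplift_pct:.0f}% uplift)")
print(f"approx. share from physical IP: {physical_ip_share_mhz:.0f} MHz")
```

Under these assumptions the critical paths had to shrink by roughly a quarter of a nanosecond, which helps explain why savings on the order of tens to a hundred picoseconds per path matter.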
Optimized High Speed Standard Cell Libraries and Memories
The POP consists of a standard cell library called High Performance Kit (HPK) and optimized RAM instances called Fast Cache Instances (FCI). The POP is designed to be used in a fully synthesizable flow and can be incorporated into any EDA tool implementation flow easily.
The POP technology uses ARM's knowledge of the critical paths in the Cortex-A9 processor to design standard cells and RAMs that improve those paths. The HPK consists of new flip-flops and other cell architectures. For example, the HPK flops can reduce the flop insertion delay in critical paths by ~100ps. The FCI RAM instances were optimized to improve Cortex-A9 RAM paths. The Cortex-A9 RAM inputs are half-cycle paths, so any improvement to the setup time into them is critical to the overall performance of the implementation. Similarly, for Tag RAM paths the output from the RAMs is critical, and improvements to the access time help these paths. The HPK and FCI show an immediate uplift in frequency once they are inserted in the design. The POP technology is silicon-proven and ready for Partner adoption.
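The sensitivity of half-cycle paths to RAM setup time can be illustrated with a simple slack calculation. This is only a sketch: the function name `setup_slack_ps` and the delay values of 380ps, 150ps and 100ps are hypothetical numbers chosen for illustration, not figures from the Cortex-A9 implementation, and clock skew and uncertainty are ignored.

```python
def setup_slack_ps(freq_mhz: float,
                   launch_to_pin_ps: float,
                   ram_setup_ps: float,
                   cycle_fraction: float = 0.5) -> float:
    """Setup slack for a path that gets only a fraction of the clock cycle.

    cycle_fraction=0.5 models a half-cycle path into a RAM input.
    Positive slack means the path meets timing at freq_mhz.
    """
    budget_ps = cycle_fraction * 1e6 / freq_mhz
    return budget_ps - (launch_to_pin_ps + ram_setup_ps)


FREQ_MHZ = 1000.0          # assumption: 1GHz target
LAUNCH_TO_PIN_PS = 380.0   # hypothetical logic + wire delay to the RAM pin

# Hypothetical RAM setup times before and after optimization.
slack_before = setup_slack_ps(FREQ_MHZ, LAUNCH_TO_PIN_PS, ram_setup_ps=150.0)
slack_after = setup_slack_ps(FREQ_MHZ, LAUNCH_TO_PIN_PS, ram_setup_ps=100.0)

print(f"half-cycle budget: {0.5 * 1e6 / FREQ_MHZ:.0f} ps")
print(f"slack with slower RAM setup: {slack_before:+.0f} ps")
print(f"slack with faster RAM setup: {slack_after:+.0f} ps")
```

Because the budget for such a path is only half the clock period, a fixed saving in setup time is twice as large relative to the budget as it would be on a full-cycle path.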
In addition to the POP, improvements to the design flow and floorplan also contributed to the performance uplift. The flow used industry-standard tools for implementation and analysis, and is easily deployable in Partner implementations. Several techniques were applied to improve placement, clock-tree synthesis, routing and sign-off optimization. Although the primary focus was on performance, the design has several low-power features: aggressive clock gating and Dynamic Voltage and Frequency Scaling (DVFS) to reduce dynamic power, and power gating of large sections of the design to reduce leakage.
Stay tuned for details on the processor implementation techniques and results at the ARM Technology Conference (Techcon) session titled "Proven Methodologies for Cortex-A9 Implementation >1 GHz" on Tuesday, November 9, 2010, 1:45pm to 2:35pm.
Jinson Koppanalil, Staff Engineer & Technical Lead, Processor Division, ARM, is based in Austin. He has been working at ARM since 2002, implementing high-performance cores and chips based on a variety of application processors including the Cortex-A8, Cortex-A9 and the recent Cortex-A15. Jinson holds a master's degree in Computer Engineering.