We have just announced our 2020 Arm Mobile IP, including the Arm Cortex-A78 CPU as the next step up in sustained performance on smartphones. But this year we do not stop there. We are mindful that Arm’s ever-expanding ecosystem is demanding more solutions and products built around its partners’ own specific needs.
Therefore, we are delighted to announce the Cortex-X Custom (CXC) program. In close collaboration with Arm engineering teams, program partners can shape a final CPU product to meet their specific market demands. This allows program partners to define their own performance points outside of the usual Cortex-A design envelope of performance, power, and area (PPA). This final custom CPU, designed and built by Arm, will then be delivered under the Arm Cortex-X brand. The very first CPU as part of the CXC program is the Arm Cortex-X1 CPU.
Cortex-X1: the most powerful Cortex CPU
Cortex-X1 is the most powerful Cortex CPU to date, bringing a 30 percent peak performance improvement over the current-generation Arm Cortex-A77 CPU. It is designed to bring ultimate performance for next-generation custom solutions. This is in response to partners who wanted to maximize performance in line with their own specific use-cases.
Cortex-X1 also provides performance uplifts when compared to the Cortex-A78, offering 22 percent integer (single-thread) performance improvements¹. This short high-performance burst is best for reactivity and responsiveness when using devices, enabling the highest performance ever for smartphones and large screen devices. Furthermore, Cortex-X1 offers 2x machine learning (ML) performance improvements over Cortex-A77¹. The big improvement has been made despite the previous generation bringing a significant step-up for on-device intelligence. This is part of our wider push for more local compute performance.
Cortex-X1: designed for ultimate performance
As described in the Cortex-A78 blog, the DynamIQ cluster of 4x Cortex-A78 and 4x Cortex-A55 provides 20 percent sustained performance improvements over the 4x Cortex-A77 and 4x Cortex-A55 cluster². However, introducing Cortex-X1 enables even greater scalability by bringing a boost in peak performance. With 1x Cortex-X1 added to the DynamIQ cluster alongside 3x Cortex-A78 and 4x Cortex-A55, peak performance is 30 percent higher than the previous generation². When combined with the premium efficiency of Cortex-A78, it delivers the best sustained and peak performance. Therefore, it perfectly fits the ever-expanding need for performance on mobile devices.
The Cortex-A78 and Cortex-X1 DynamIQ clusters compared to the previous generation
The key markets for solutions with Cortex-X1 are smartphones and new form factors. The performance uplift supports the move towards new foldable designs and bigger, multiple screens. Cortex-X1 provides quicker, more seamless user experiences, with faster app loading times and improved webpage scrolling responsiveness. The big ML uplift enables more advanced AI and ML-based experiences.
Similar to Cortex-A78, Cortex-X1 enables improvements to multiple digital immersion use-cases and experiences on mobile. These range from common productivity, communication, security, and camera-based use-cases right through to advanced gaming and XR (augmented reality and virtual reality) experiences.
The Cortex-X1 microarchitecture upgrades for maximum performance
As you can see from the image above, Cortex-X1 has various microarchitecture upgrades that enable ultimate performance. Compared to Cortex-A78, the decode bandwidth has been increased by 25 percent to 5 instructions decoded per cycle. Moreover, the MOP cache throughput has been increased by 33 percent to 8 MOPs per cycle. On Cortex-X1, the Neon engine gets two additional pipes, doubling its compute capacity over Cortex-A78. Finally, on cache sizes, Cortex-X1 supports 64kB L1 and up to 1MB L2 cache. The DynamIQ cluster has also been upgraded to now support 8MB of L3 for ultimate performance. This larger L3 can also be used by Cortex-A78 when used in conjunction with Cortex-X1.
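As a quick arithmetic check on the percentages quoted above, the short sketch below recomputes the uplifts from per-cycle figures. The Cortex-A78 baselines (4 instructions decoded and 6 MOPs per cycle) are assumptions back-calculated from the stated percentages, and the function and constant names are purely illustrative.

```python
def uplift_percent(baseline, new):
    """Return the percentage increase from baseline to new."""
    return (new - baseline) / baseline * 100


# Assumed Cortex-A78 baselines, back-calculated from the quoted uplifts.
A78_DECODE_PER_CYCLE = 4
A78_MOPS_PER_CYCLE = 6

# Cortex-X1 figures as quoted in the text.
X1_DECODE_PER_CYCLE = 5
X1_MOPS_PER_CYCLE = 8

decode_uplift = uplift_percent(A78_DECODE_PER_CYCLE, X1_DECODE_PER_CYCLE)
mop_uplift = uplift_percent(A78_MOPS_PER_CYCLE, X1_MOPS_PER_CYCLE)

print(f"Decode bandwidth uplift: {decode_uplift:.0f}%")
print(f"MOP cache throughput uplift: {mop_uplift:.0f}%")
```

Under these assumed baselines, the results round to the 25 percent and 33 percent figures given in the text.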
Cortex-X1 is the very first example of a Cortex CPU that the CXC program can produce. It extends the digital immersion capabilities of smartphones through new levels of performance, making Cortex-X1 Arm’s most powerful CPU to date.
As part of the CXC program, subscribed partners collaborate with Arm to define custom CPUs that push performance beyond the usual Cortex-A PPA envelope. As a result, partners will have a CPU that is specific to their market needs and is differentiated from roadmap Cortex-A CPUs. Through the CXC program, we are meeting the needs of the ever-expanding ecosystem, taking the best of Arm to the next level.
Learn more about the Cortex-X Custom program: https://www.arm.com/products/cortex-x
Visit the Arm newsroom blog: https://www.arm.com/company/news/2020/05/new-arm-ip-delivers-true-digital-immersion-for-the-5g-era
¹ Comparing Arm single-core peak performance at 3.0GHz. Cortex-X1: 1MB priv-L2, 8MB L3 cache vs Cortex-A78 (32kB) / Cortex-A77 512KB priv-L2, 4MB L3 cache. Machine learning performance based on matrix multiplication theoretical throughput. Measured estimates on SPECint*_base2006 (SPECspeed* Integer component of SPEC CPU* 2006); Arm single-core performance estimated for mobile platform. Results are measured estimates using specific computer systems, software, components, operations, and functions, and changes to any of these factors will cause the results to vary.
² Comparing Arm single-core performance at 1 watt on Cortex-A78 and Cortex-A77, comparing Arm single-core peak performance on Cortex-X1 to Cortex-A78, and comparing cluster area on Cortex-X1/Cortex-A78/Cortex-A55 1+3+4 topology and Cortex-A78/Cortex-A55 4+4 topology to Cortex-A77/Cortex-A55 4+4 topology, including architectural and process improvements (compared to 2019 devices).
Does Cortex-X1 act as a coprocessor? And does it apply only to cell phones?
Frequency entitlement of the Cortex-X1 is not limited to EUV process, and has been measured at 3GHz on non-EUV process (7nm) as well.
Any approximations of the Cortex-X1 speed on a non-EUV process node?
Frequency entitlement of the Cortex-X1 is similar to Cortex-A78, measured around 3GHz on 5nm process nodes. As for instruction throughput, Cortex-A78 is able to process 4 instructions / 6 macro-ops, and Cortex-X1 5 instructions / 8 macro-ops for maximum performance.
Curious about frequency, and instructions or operations per clock.