Ampere was founded with a vision to disrupt the cloud-native ecosystem and introduce innovations that improve computing from hyperscale data centers to processors out at the edge. But the company also values consistency and predictability.
So when the California-based startup introduced Ampere Altra — what they call the industry’s first cloud-native CPU, built on the Arm Neoverse N1 platform — the goal was not only to be first in class. Ampere also aims to be the last one standing.
“It is important that performance is predictable when you have multiple users running on a single machine, or even a single user running many services on a machine,” says Jeff Wittich, SVP of products at Ampere. “So we have architected our products to be very predictable as well as high performing.”
Founded in 2018, Ampere remains an emerging star in its space. “One of the things I hear consistently is that customers want to know that Ampere products are going to perform for a very long time,” says Travis Lazar, a senior staff engineer focused on software strategic initiatives and continuous performance. “Software changes a lot, and the software ecosystem around AArch64 hardware is going through a very exciting growth and maturity period.”
As the availability of hardware, compiler optimizations, and expertise grows in the Arm ecosystem, so does the reliance on the stability of that performance.
“Five to seven years ago you would be hard pressed to find a stable, complete, natively compiled, and widely delivered software stack for AArch64 — all of which are criteria for deploying enterprise workloads into the modern datacenter,” says Lazar. “Our customers want to have the confidence that the software ecosystem is not going to break one or two years down the line. So how do we, as a company, provide them with that confidence? It is a very difficult thing to do.”
But not impossible.
To ensure the quality of Ampere’s Arm-based processors and validate their performance, Lazar and his team developed a new testing infrastructure for the entire development cycle called Continuous Integration, Deployment, and Regression (CIDR).
“When you’re trying to bring disruptive new products to market, you need to instill confidence in your customers. Much like the automotive industry, which is famous for rigorously testing each component of a car or truck, we battle-test our hardware against dozens of critical software components each and every day,” says Lazar. “Instead of asking our customers to ‘trust us’ we bring proof and data.”
Last year, GitHub had more than 40 million developers who merged 87 million pull requests and closed 20 million issues — any one of which could inadvertently impact the performance that Ampere promises for its processors.
To help fix that problem, Ampere teamed up with Packet to deploy a software testing lab featuring Ampere’s previous-generation silicon, eMAG. With Packet’s automation and focus on bare metal, Lazar and his team are able to quickly expand CIDR’s resources, while ensuring a high-quality testing environment.
“It takes a lot of time to write, deploy and run meaningful performance tests, especially against a variety of platforms — time that developers could devote to community bugs or feature requests,” says Lazar. And then there is the issue of cost. Servers are expensive — especially physical machines with higher core-counts or significant memory footprints. These are ideal for complex tests, but can further limit the performance testing capabilities of many projects.
“As we developed the CIDR roadmap, it struck us that this is where community engagement really shines,” Lazar noted. “We bring open source projects into our infrastructure and run the workloads on a range of bare metal Ampere hardware, something that the community may not otherwise be able to do at scale.”
With this community spirit as their driving force, Lazar’s team chose to take regression testing even further. While performance regression tests target individual workloads on a system, such as disk I/O, memory performance, and compute performance, even a single issue can cause a cascade of failures across multiple categories. Using statistical modeling and purpose-built scoring algorithms, CIDR condenses the full workload test data into a performance number that developers can apply within the context of their hardware and software stacks to determine the health of their ecosystems.
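To make the idea concrete, here is a minimal sketch of how per-workload results could be folded into a single score. The workload names, weights, and weighted geometric mean are illustrative assumptions, not Ampere’s actual scoring algorithm.

```python
# Illustrative only: combine per-workload results into one composite score
# relative to a baseline run. Workload names, weights, and the geometric-mean
# approach are assumptions, not Ampere's actual CIDR scoring algorithm.
from math import prod

def composite_score(results, baseline, weights):
    """Weighted geometric mean of per-workload ratios against a baseline.

    results / baseline: workload name -> measured throughput
    weights: workload name -> relative importance (sums to 1.0)
    A score of 1.0 means parity with the baseline; below 1.0 means regression.
    """
    ratios = {w: results[w] / baseline[w] for w in weights}
    return prod(ratios[w] ** weights[w] for w in weights)

baseline = {"disk_io": 520.0, "memory_bw": 180.0, "compute": 95.0}
latest   = {"disk_io": 505.0, "memory_bw": 181.0, "compute": 96.5}
weights  = {"disk_io": 0.3, "memory_bw": 0.3, "compute": 0.4}

print(f"composite score: {composite_score(latest, baseline, weights):.3f}")
```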
“With functional regression testing, you often run a series of pass/fail functional tests on a fully integrated platform. You’re asking questions like ‘does feature ABC work when deployed on XYZ,’” says Lazar. Performance regression testing, on the other hand, offers the entire performance measurement history, the configuration of the system it was run on, and the list of code changes with every commit. This allows users to monitor whether the performance is as stable and reliable as expected.
Here is a chart detailing the number of performance regressions we’re seeing by day.
We break down performance regressions by day, plot them by severity, and then color them by cluster. The X-axis is the date, the Y-axis is the level of regression severity (higher is worse), and a trained ML model groups the regressions by common cause.
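A chart like the one described could be drawn in a few lines of matplotlib; the field names and data points below are placeholders, not Ampere’s pipeline.

```python
# Sketch of the regressions-by-day chart described above: date on the X-axis,
# severity on the Y-axis, points colored by the cluster an ML model assigned.
# All values and field names here are placeholders for illustration.
import matplotlib.pyplot as plt
from datetime import date

regressions = [
    # (date, severity, cluster id from the trained model)
    (date(2020, 3, 2), 0.8, 0),
    (date(2020, 3, 3), 2.1, 1),
    (date(2020, 3, 3), 1.9, 1),
    (date(2020, 3, 5), 0.4, 0),
    (date(2020, 3, 6), 3.2, 2),
]

dates = [r[0] for r in regressions]
severity = [r[1] for r in regressions]
clusters = [r[2] for r in regressions]

plt.scatter(dates, severity, c=clusters, cmap="tab10")
plt.xlabel("Date")
plt.ylabel("Regression severity (higher is worse)")
plt.title("Performance regressions by day, colored by cluster")
plt.show()
```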
“Since software changes every day, and applications involve so many different layers of software, we created a Machine Learning (ML) model to assist in root cause analysis and noise reduction within our mountain of performance data,” says Lazar. “We feed all of our data into our model, and it groups failures into clusters, which are then tied to a date, time, commit, and hardware configuration. That allows us to better understand issues over time across a large swath of data, which would be nearly impossible with a manual or traditional process.”
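As a rough sketch of that kind of clustering, one could encode each regression event as a feature vector and let an off-the-shelf algorithm group similar events. The features and the choice of DBSCAN below are assumptions for illustration; Ampere’s actual model is not described in that detail.

```python
# Rough sketch of grouping regression events by likely common cause.
# The feature encoding and the use of DBSCAN are illustrative assumptions,
# not Ampere's actual model or feature set.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Each row is one regression event: [severity, workload id, hardware config id,
# hours since the suspect commit landed]. Real features would be far richer.
events = np.array([
    [2.1, 0, 1, 3.0],
    [1.9, 0, 2, 4.5],
    [0.4, 3, 1, 30.0],
    [3.2, 1, 1, 2.0],
    [2.8, 1, 2, 2.5],
])

X = StandardScaler().fit_transform(events)
labels = DBSCAN(eps=1.0, min_samples=2).fit_predict(X)
print(labels)  # events sharing a label are candidates for a common root cause
```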
As a test run, Ampere wanted to use an active code base to improve its ability to identify performance regressions. They zeroed in on one of the largest: Python.
Since Python is widely leveraged by applications running in the cloud and at scale, there is a lot of activity in the code base, making it the perfect test case for CIDR. “This is a code base with many contributors on the master branch where you have a number of commits that may not go through full performance regression test cycles,” says Lazar.
The lesson learned? When there are a large number of commits, there is a proportionately high number of data variations. One clear risk is that one code change improves performance by N percent while another change reduces performance by N percent. In aggregate, everything seems consistent, but under the hood that’s not the case.
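A toy example makes the risk plain; all of the numbers here are invented.

```python
# Toy illustration of offsetting changes hiding in an aggregate number: two
# commits whose per-workload effects cancel out in the average, leaving the
# headline metric flat while one workload has actually regressed.
baseline = {"json_parse": 100.0, "regex": 100.0}

# Commit A speeds up json_parse by 10%; commit B slows regex down by 10%.
after_both = {"json_parse": 110.0, "regex": 90.0}

aggregate_before = sum(baseline.values()) / len(baseline)      # 100.0
aggregate_after = sum(after_both.values()) / len(after_both)   # 100.0
print(aggregate_before, aggregate_after)  # the average looks unchanged

# The per-workload view exposes the 10% regex regression under the hood.
print({k: after_both[k] / baseline[k] for k in baseline})
```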
“Within the in-development code base, we looked at the performance regressions at a very detailed level, which led to a number of systematic improvements to CIDR,” explains Lazar. “Instead of simply producing one metric to gauge performance, Ampere produces unique heuristics and machine learning analysis that can be applied across the software ecosystem. This sets us apart.”
And in the case of the Python code base, Ampere showed up with the open-source goods. “We found a few performance regressions and quickly filed bug reports,” says Lazar. “The community was very responsive and appreciative, especially since they rarely see silicon makers with this kind of involvement in the software ecosystem.”
“Most individual software projects do CI/CD, but from a full system integration standpoint, very few focus on regression loops,” says Lazar.
Ampere is an obvious exception. In the past year alone, they’ve collected over one million test results from 6,000 bare metal instances. “With Packet, we’re churning out 50 provisions a day for end-to-end testing with very few failures — if any,” says Lazar.
“We are not doing anything fancy. We’re running standard operating systems and software just like any developer would if they had hardware sitting under their desk or in a datacenter,” says Lazar. “But as the data we collect grows, we increase confidence that Ampere’s Arm-based platforms are ready for prime time performance across a wide range of use cases.”
According to Wittich, customers have, in fact, taken notice. “We’ve been working with some of the biggest cloud service providers in the world, and we have made it very easy for them to use our products right out of the box,” he says.
And though Arm is the relative new kid on the datacenter block, any initial reluctance fades in the face of its consistent, powerful performance.
“Anytime you move off of something like x86 that’s been industry standard for the last 20 years, there can be a bit of hesitation,” says Wittich. “But it is actually very easy to move to Arm, and we’re able to provide proof points to demonstrate why that is. It gives people the confidence that when they move to Arm — and especially when they move to Ampere — that they will be supported over time.”