Software runs best when it fully utilizes the underlying hardware, squeezing out every last drop of performance for the applications that sit on top.
Generally, the cycle of performance improvement is to get the code working first and then to implement a rigorous testing process. Only once an extensive body of conformance and performance tests is in place can projects turn their attention to taking advantage of the superpowers of specific hardware.
Optimization work is especially impactful when software is ported to new architectures. With the massive amount of software (and new users) that has been introduced to the Arm architecture over the last few years, we are starting to see significant performance improvements due to optimization work.
Here are some optimizations I have seen make an impact on real systems, using techniques that are well within the reach of compiler and software library developers. What is nice about each of them is that once the code in a compiler or a library is improved, the 99% of programmers who write application code (as opposed to system or library code) can see performance benefits.
Single Instruction Multiple Data (SIMD) refers to a set of processor instructions that operate on more than one piece of data at a time. Rather than crunch numbers one at a time, you use SIMD instructions to apply the same operation to 4, 8, or even 16 data elements at once. (On Arm, the Advanced SIMD extension is known as Neon.)
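To make the "same operation on many elements at once" idea concrete, here is a minimal sketch in plain Python. It does not issue real SIMD instructions; it emulates the idea with a technique sometimes called SWAR (SIMD within a register), packing eight bytes into one 64-bit integer and testing all eight for a target byte with a handful of arithmetic operations. The function names are hypothetical and chosen for illustration only.

```python
LANES = 8                    # bytes handled per step
ONES = 0x0101010101010101    # 0x01 in each of the eight byte lanes
HIGHS = 0x8080808080808080   # 0x80 in each of the eight byte lanes


def word_contains(word: int, target: int) -> bool:
    """Return True if any of the 8 bytes packed in `word` equals `target`."""
    # XOR with the target broadcast to every lane: matching bytes become 0x00.
    v = word ^ (ONES * target)
    # Classic zero-byte test: the result is nonzero exactly when some lane of
    # v is 0x00. The final AND with HIGHS also trims Python's unbounded
    # integers back to 64 bits.
    return ((v - ONES) & ~v & HIGHS) != 0


def contains_byte(data: bytes, target: int) -> bool:
    """Scan `data` eight bytes at a time looking for the byte `target`."""
    for i in range(0, len(data), LANES):
        chunk = data[i:i + LANES]
        if len(chunk) < LANES:
            # Fewer than eight bytes left: fall back to a scalar check.
            return target in chunk
        if word_contains(int.from_bytes(chunk, "little"), target):
            return True
    return False


if __name__ == "__main__":
    print(contains_byte(b'{"name": "Arm"}', ord('"')))
    print(contains_byte(b"no quotes in here", ord('"')))
```

Hardware SIMD applies the same principle with wider registers and dedicated instructions, so a comparison like this can cover 16 or more bytes in a single step.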
While you might think of mathematics as the first place to look for SIMD operations, it turns out that some of the most interesting recent work has been in libraries for parsing text, specifically the family of libraries that do very fast parsing of JSON structured data files. When you have terabytes of log data in JSON format that you want to read and extract data from, it is worth optimizing to save time. JSON is everywhere on the internet, and computers collectively spend a lot of time turning it back into data structures, so optimizations here have far-reaching implications.
A team led by Daniel Lemire of the Université du Québec (TÉLUQ) in Montreal has been working on simdjson, which parses JSON data about 2.5x faster than other production-grade JSON parsers. The simdjson work starts with fast libraries in C++, and then provides ports of that code to Rust, Go, Python, and other popular languages. See the paper, Parsing Gigabytes of JSON per Second (Langdale and Lemire, 2019), for a detailed explanation of the algorithms involved in this effort. Each of these implementations uses building blocks from the SIMD instructions available on the target chip, including those on Arm processors.
Since launching Graviton in 2018, Amazon Web Services has been at the forefront of some notable optimization efforts. A team led by Sebastian Pop at AWS has been tackling improvements for the PHP language. PHP is one of the first languages designed specifically for serving up web pages, and it powers popular web services such as WordPress. As an older web-native language, it has some quirks from the original language design that are now being sorted out to preserve ease of use while improving readability and performance.
In “Improving performance of PHP for Arm64 and impact on AWS Graviton2 based EC2 instances,” Sebastian describes the methodical approach taken for his compiler optimization work. A whole series of low-level functions (addition and subtraction, computing hashes and polynomials, encoding and decoding) get modest incremental improvements by using arm64 instructions that compute the result directly in hardware.
The big win for PHP, resulting in a 20 percent speedup overall, ends up involving removing code rather than adding code. A PHP 4 feature, the old-style “constructors” (methods named after their class), turned out to be confusing to programmers and was not widely used. The language designers deprecated it in PHP 7 and removed it in PHP 8, and it turns out that a performance rewrite of this part of the code path had a big impact on throughput.
It turns out that if code is confusing for programmers, chances are it is also slow for the computer to make sense of.
Compression is another very important task that is used all over the internet. We’ll look specifically at the compression of files to disk as one of those tasks where optimization leads to outsized improvements on some of the newest Arm systems, especially those with lots of processing cores.
Ordinary compression tools often start at the beginning of a file and read sequentially to the end. This is great if your file is small and you have a single CPU, but woefully inefficient if you have many CPU cores that are otherwise idle.
Pigz (pronounced “pig-zee”) is a modern multi-core implementation of gzip. By breaking a big file into smaller parts and then compressing each part at the same time using separate cores on a multicore computer, it can achieve remarkable parallelism. Chunks of a file are compressed in separate threads and then reassembled into a single gzip-compatible file. It is best suited to compression, since decompression of the gzip format relies on sequential operations.
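The chunk-and-compress idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how pigz itself is written: pigz emits a single gzip stream, whereas this sketch takes a simpler route and compresses each chunk as its own gzip member, relying on the fact that concatenated gzip members still form a valid gzip file. The function name, chunk size, and worker count are hypothetical choices for the example.

```python
import gzip
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 128 * 1024  # illustrative chunk size, not a tuned value


def parallel_gzip(data: bytes, workers: int = 4,
                  chunk_size: int = CHUNK_SIZE) -> bytes:
    """Compress `data` chunk by chunk in a thread pool.

    Each chunk becomes an independent gzip member; the members are joined
    in their original order so that decompression restores the input.
    """
    # Split the input into fixed-size chunks (one empty chunk for empty input).
    chunks = [data[i:i + chunk_size]
              for i in range(0, len(data), chunk_size)] or [b""]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so the members line up correctly.
        members = list(pool.map(gzip.compress, chunks))
    return b"".join(members)


if __name__ == "__main__":
    payload = b"log line: request served in 12 ms\n" * 200_000
    packed = parallel_gzip(payload)
    # Standard gzip tooling reads the multi-member result back sequentially.
    assert gzip.decompress(packed) == payload
    print(len(payload), "bytes in,", len(packed), "bytes out")
```

Threads are used here on the assumption that CPython's zlib bindings release the global interpreter lock while compressing, which lets chunks run on separate cores. Compressing chunks independently usually costs a little compression ratio compared with a single pass, which is the trade made for parallelism in this sketch.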
Amdahl’s Law (from Gene Amdahl, a computer scientist and mainframe designer in the 1960s and 1970s) states that the speedup from parallelism in a multiprocessor system is limited by the portion of the work that must run serially. In other words, it doesn’t matter how many cores you have if a single serial step takes too long.
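A short calculation shows how quickly the serial portion comes to dominate. The sketch below uses the usual statement of Amdahl's Law, speedup = 1 / ((1 - p) + p / n), where p is the fraction of the work that can run in parallel and n is the number of cores. The function name and the example figures are illustrative assumptions, not measurements.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Upper bound on speedup when `parallel_fraction` of the work is
    spread across `cores` and the remainder runs serially."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Hypothetical workload in which 95% of the work parallelizes.
    for cores in (1, 8, 64, 1024):
        print(f"{cores:5d} cores -> {amdahl_speedup(0.95, cores):.2f}x")
```

With 95% of the work parallelizable, 64 cores yield a speedup of only about 15x, and no number of cores can push it past 20x, because the remaining 5% always runs on a single core.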
Optimizing big computing problems for the many cores of a modern processor involves both breaking problems down so that they can run effectively across the whole processor and making sure that algorithms are tuned to crunch tasks as fast as they can.
When I first started working with datacenter-grade Arm processors at Packet in 2015 and 2016, Amdahl’s Law was a common Achilles heel. Systems like the ThunderX had an amazing number of cores, but software was not able to take advantage of those cores, and often had slow points that gummed up the works. Early efforts, including those by Cloudflare on Golang, led to amazing improvements in performance and efficiency.
As more developers rely upon and have access to powerful Arm-based servers, and those servers utilize Arm processors with even more (and more powerful!) cores, the flywheel effect is starting to deliver real results. Innovators like Marvell, Ampere, AWS, and Fujitsu are able to deliver more cores on a single System-on-Chip (SoC) while also having the individual cores run efficiently as they work together.