What’s up doc? Optimizations for Arm

Ed Vielmetti
July 13, 2021
5 minute read time.

Software runs best when it fully utilizes the underlying hardware, squeezing every last drop of performance out to be used by the applications that sit on top. 

Generally, the cycle of performance improvement is first to get the code working, and then to put a rigorous testing process in place. Only once an extensive body of conformance and performance tests exists can projects turn their attention to taking advantage of the superpowers of specific hardware.

Optimization work is especially impactful when software is ported to new architectures. With the massive amount of software (and new users) that has been introduced to the Arm architecture over the last few years, we are starting to see significant performance improvements due to optimization work. 

Here are some optimizations I have seen make a real impact on production systems, using techniques that are well within the reach of compiler and software library developers. What is nice about each of them is that once the code in a compiler or a library is improved, the 99% of programmers who write application code (as opposed to system or library code) can see the performance benefits.

SIMD for Parsing JSON

Single Instruction, Multiple Data (SIMD) refers to instructions that operate on more than one piece of data at a time. Rather than crunch numbers one at a time, you use SIMD instructions to apply the same operation to 4, 8, or even 16 words of data at once. (On Arm processors, this capability is provided by the Neon extension, also known as Advanced SIMD.)
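As a hedged illustration (not from the original post), here is a minimal sketch in C++ of the idea, using the Neon intrinsics from arm_neon.h to add two integer arrays four 32-bit lanes at a time, with a scalar loop to finish any leftover elements:

#include <arm_neon.h>
#include <cstddef>
#include <cstdint>

// Add two int32 arrays element-wise: four lanes per iteration with Neon,
// then a scalar loop for the remainder.
void add_arrays(const int32_t* a, const int32_t* b, int32_t* out, size_t n) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        int32x4_t va = vld1q_s32(a + i);        // load 4 lanes from a
        int32x4_t vb = vld1q_s32(b + i);        // load 4 lanes from b
        vst1q_s32(out + i, vaddq_s32(va, vb));  // add and store 4 results
    }
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];                   // scalar tail
    }
}

A good compiler can often auto-vectorize a loop this simple on its own, but explicit intrinsics are how libraries guarantee the SIMD path is taken.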

While you might think of mathematics as the first place to look for SIMD operations, some of the most interesting recent work has been in libraries for parsing text, specifically the family of libraries that do very fast parsing of JSON structured data files. When you have terabytes of log data in JSON format to read and extract data from, it is worth optimizing to save time. JSON is everywhere on the internet, and computers collectively spend a lot of time turning it back into data structures, so optimizations here have far-reaching implications.

A team led by Daniel Lemire of the University of Quebec (TELUQ) has been working on simdjson, which parses JSON data about 2.5x faster than other production-grade JSON parsers. The simdjson work starts with fast libraries in C++, and then provides ports of that code to Rust, Go, Python, and other popular languages. See the paper, Parsing Gigabytes of JSON per Second (Langdale and Lemire, 2019), for a detailed explanation of the algorithms involved. Each of these ports uses building blocks from the SIMD implementation of the target chip, including the Neon instructions on Arm processors.
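As a taste of what using such a library looks like (a sketch modeled on simdjson's documented On Demand API; the file name and field are purely illustrative), parsing and reading a field takes only a few lines, and the library selects its fastest SIMD kernel, Neon on Arm, at runtime:

#include <iostream>
#include "simdjson.h"
using namespace simdjson;

int main() {
    ondemand::parser parser;
    // "logs.json" and the "count" field are illustrative placeholders.
    padded_string json = padded_string::load("logs.json");
    ondemand::document doc = parser.iterate(json);
    std::cout << uint64_t(doc["count"]) << " records" << std::endl;
}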

Speeding Up PHP at AWS

Since launching Graviton in 2018, Amazon Web Services has been at the forefront of some notable optimization efforts. A team led by Sebastian Pop at AWS has been tackling improvements for the PHP language. PHP is one of the first languages designed specifically for serving up web pages, and it powers popular web services such as WordPress. As an older web-native language, it has some quirks from the original design that are now being sorted out to preserve ease of use while improving readability and performance.

In “Improving performance of PHP for Arm64 and impact on AWS Graviton2 based EC2 instances”, Sebastian describes the methodical approach taken for this optimization work. A whole series of low-level functions (addition and subtraction, computing hashes and polynomials, encoding and decoding) get modest incremental improvements by using arm64 instructions to compute results directly in hardware.
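To give a flavor of this kind of change (a hedged sketch of the general technique, not code from the PHP patches themselves): the Armv8 CRC32 instructions, exposed through the arm_acle.h intrinsics, let a checksum-style hash be computed a word at a time in hardware instead of byte by byte through lookup tables.

#include <arm_acle.h>   // __crc32b / __crc32w intrinsics
#include <cstddef>
#include <cstdint>
#include <cstring>

// Sketch: CRC-32 over a buffer using the Armv8 CRC32 instructions.
// Build for a target with the CRC extension, for example -march=armv8-a+crc.
uint32_t crc32_hw(const uint8_t* data, size_t len) {
    uint32_t crc = 0xFFFFFFFFu;
    size_t i = 0;
    for (; i + 4 <= len; i += 4) {
        uint32_t word;
        std::memcpy(&word, data + i, 4);   // alignment-safe load
        crc = __crc32w(crc, word);         // 4 bytes per instruction
    }
    for (; i < len; ++i) {
        crc = __crc32b(crc, data[i]);      // byte-at-a-time tail
    }
    return crc ^ 0xFFFFFFFFu;
}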

The big win for PHP, resulting in a 20 percent speedup overall, ends up involving removing code rather than adding it. PHP 4-style “constructors” (methods sharing the name of their class) turned out to be confusing to programmers and were not widely used. The language designers deprecated them in PHP 7 and removed them in PHP 8, and a performance rewrite of this part of the code path turned out to have a big impact on throughput.

If code is confusing for programmers, chances are it is also slow for the computer to make sense of.

Divide and Conquer: Fast Compression

Compression is another very important task that is used all over the internet. We’ll look specifically at the compression of files to disk as one of those tasks where optimization leads to outsized improvements on some of the newest Arm systems, especially those with lots of processing cores.

Ordinary compression tools often start at the beginning of a file and read sequentially to the end. This is great if your file is small and you have a single CPU, but woefully inefficient if you have many CPU cores that are otherwise idle.

Pigz (pronounced “pig-zee”) is a modern multi-core implementation of gzip. By breaking a big file into smaller parts and compressing each part at the same time on separate cores of a multicore computer, it can achieve remarkable parallelism. Chunks of a file are compressed in separate threads and then reassembled into a single gzip-compatible file. The biggest gains are on compression, since decompressing this file format relies on sequential operations.
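The divide-and-conquer structure can be sketched in a few lines (a simplified illustration, not pigz's actual implementation: pigz stitches the pieces back into one gzip-compatible stream, while this sketch just compresses each chunk independently with zlib on its own thread):

#include <zlib.h>
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Compress each fixed-size chunk of `input` on its own thread.
std::vector<std::vector<unsigned char>> compress_chunks(
        const std::vector<unsigned char>& input, size_t chunk_size) {
    size_t nchunks = (input.size() + chunk_size - 1) / chunk_size;
    std::vector<std::vector<unsigned char>> out(nchunks);
    std::vector<std::thread> workers;
    for (size_t c = 0; c < nchunks; ++c) {
        workers.emplace_back([&, c] {
            size_t begin = c * chunk_size;
            size_t len = std::min(chunk_size, input.size() - begin);
            uLongf dest_len = compressBound(len);
            out[c].resize(dest_len);
            compress2(out[c].data(), &dest_len,
                      input.data() + begin, len, Z_BEST_SPEED);
            out[c].resize(dest_len);           // shrink to the compressed size
        });
    }
    for (auto& t : workers) t.join();
    return out;
}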

Parallel Computing and Amdahl’s Law

Amdahl’s Law (from Gene Amdahl, a computer scientist and mainframe designer of the 1960s and 1970s) states that the overall speedup from parallelism in a multiprocessor system is limited by the portion of the work that must still run serially. In other words, it does not matter how many cores you have if one serial step takes too long.
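Stated as a formula (a standard form of the law, with a worked example added here): if a fraction p of the work can be parallelized across N cores, the best possible speedup is

S(N) = \frac{1}{(1 - p) + p/N}

For example, if 95% of a workload parallelizes (p = 0.95), 64 cores give at most about a 15x speedup, and even infinitely many cores cannot exceed 1/0.05 = 20x. The serial 5% dominates.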

Optimizing big computing problems for the many cores of modern processors involves both breaking problems down so that they can run effectively across the whole processor and making sure that the algorithms are tuned to crunch through each task as fast as they can.

When I first started working with datacenter-grade Arm processors at Packet in 2015 and 2016, Amdahl’s Law was a common Achilles heel. Systems like the ThunderX had an amazing number of cores, but software was not able to take advantage of those cores, and often had slow points that gummed up the works. Early efforts, including those by Cloudflare on Golang, led to amazing improvements in performance and efficiency.

As more developers rely upon and have access to powerful Arm-based servers, and those servers use Arm processors with even more (and more powerful!) cores, the flywheel effect is starting to deliver real results. Innovators like Marvell, Ampere, AWS, and Fujitsu are able not only to put more cores on a single System-on-Chip (SoC), but also to have the individual cores run efficiently as they work together.
