The capabilities of Arm Neoverse platforms and Arm architectures are continually evolving. This creates opportunities to improve the system efficiency of fundamental algorithms through close collaboration between hardware and software architects. Arm Neoverse platforms can now take advantage of the Cryptographic Extensions in Armv8-A to accelerate important cryptographic operations. This blog introduces the Galois/Counter Mode (GCM) of the popular Advanced Encryption Standard (AES) algorithm and shows how to make the best use of these instructions to expose the full potential of Arm Neoverse platforms.
Galois/Counter Mode (GCM) [1] is a mode of operation of the Advanced Encryption Standard (AES) block cipher, and the combination is known as AES-GCM. It enables high levels of parallelism when compared with the older Cipher Block Chaining mode (AES-CBC). This blog compares the performance of AES-GCM on three Amazon EC2 instance types: c6g and c7g (Arm-based) and c6i (3rd Gen Intel Xeon Scalable). It also introduces two optimizations, loop unrolling and the EOR3 instruction, that further improve the performance of AES-GCM on Arm cores. The optimizations mentioned in this blog are expected to be effective for other modes in the AES family.
AES-GCM is an authenticated encryption algorithm that turns an unencrypted plaintext into an encrypted and authenticated ciphertext and can be thought of in two distinct phases. First, blocks in the plaintext are encrypted using the AES-CTR algorithm to create ciphertext. Second, the stream of encrypted blocks is authenticated by building a Galois Hash (GHASH) across the ciphertext.
AES-CTR encrypts the plaintext into the ciphertext. It is computed across each 16-byte block in the plaintext according to the following algorithm:
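As a minimal sketch of the counter-mode structure, the Python below encrypts each 16-byte block by XORing it with an encrypted counter block. It assumes the 96-bit IV layout from NIST SP 800-38D: the IV followed by a 32-bit big-endian counter, with data encryption starting at counter value 2. The function toy_block_encrypt is a hypothetical stand-in built on SHA-256 so that the example runs with only the standard library. It is not AES and must not be used for real encryption.

```python
import hashlib

BLOCK = 16  # AES block size in bytes


def toy_block_encrypt(key: bytes, block: bytes) -> bytes:
    """Hypothetical stand-in for encrypting one 16-byte block (NOT AES)."""
    return hashlib.sha256(key + block).digest()[:BLOCK]


def ctr_crypt(key: bytes, iv: bytes, data: bytes) -> bytes:
    """Counter-mode encryption or decryption (the operation is its own inverse)."""
    assert len(iv) == 12, "this sketch assumes a 96-bit IV"
    out = bytearray()
    counter = 2  # GCM reserves counter value 1 for the authentication tag
    for offset in range(0, len(data), BLOCK):
        # Counter block = IV followed by a 32-bit big-endian counter.
        counter_block = iv + (counter & 0xFFFFFFFF).to_bytes(4, "big")
        keystream = toy_block_encrypt(key, counter_block)
        chunk = data[offset:offset + BLOCK]  # the last chunk may be short
        out += bytes(p ^ k for p, k in zip(chunk, keystream))
        counter += 1
    return bytes(out)
```

Each keystream block depends only on the key, the IV, and the counter value, so blocks can be computed independently of one another. This is the property that makes counter mode a good fit for parallel execution.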
The hash used for integrity in AES-GCM is closely related to a cyclic redundancy check (CRC). As with CRCs, each block to be authenticated contributes a linearly independent part of the resultant hash. Unlike CRCs, the method of generating the contribution requires knowledge of the secret hash key. In GHASH, the block size is 16 bytes, and the contribution of each block to the resultant hash is calculated by the following two steps:
This block hash is then added to the running hash using a 16-byte exclusive-OR operation.
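The per-block contribution can be sketched in Python as follows. This is an illustrative model rather than the optimized implementation: gf_mul follows the bit-serial multiplication described in NIST SP 800-38D, blocks are represented as 128-bit integers, and the function names are our own. It shows that the usual running form, where each block is XORed into the hash and the result is multiplied by the hash key H, gives the same result as XORing together independent contributions, where each block is multiplied by its own power of H.

```python
R = 0xE1 << 120  # reduction constant for x^128 + x^7 + x^2 + x + 1 (GCM bit order)


def gf_mul(x: int, y: int) -> int:
    """Multiply two 128-bit field elements, bit-serially, in GCM bit order."""
    z, v = 0, y
    for i in range(128):
        if (x >> (127 - i)) & 1:  # bit i of x, counting from the left
            z ^= v
        v = (v >> 1) ^ R if v & 1 else v >> 1  # multiply v by x and reduce
    return z


def ghash_serial(h: int, blocks: list) -> int:
    """Running form: fold each block into the hash, then multiply by H."""
    y = 0
    for b in blocks:
        y = gf_mul(y ^ b, h)
    return y


def ghash_by_contribution(h: int, blocks: list) -> int:
    """Equivalent form: XOR together independent per-block contributions."""
    n = len(blocks)
    powers = [h]  # powers[k] holds H^(k+1)
    for _ in range(n - 1):
        powers.append(gf_mul(powers[-1], h))
    y = 0
    for i, b in enumerate(blocks):
        # Block i (0-based) of n is multiplied by H^(n - i).
        y ^= gf_mul(b, powers[n - 1 - i])
    return y
```

Because the contributions are independent, several of them can be computed at the same time and combined with exclusive-OR operations afterwards.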
Arm recommends unrolling the fused loop to expose more opportunities for parallel execution to the microarchitecture. The degree of unrolling that is beneficial, expressed as the unroll factor, depends on the available execution resources of the microarchitecture and the execution latency of paired AESE/AESMC operations. A higher unroll factor increases the complexity of managing architectural resources and requires extra code to handle partial executions of the unrolled loop, where the required number of iterations is not a multiple of the unroll factor.
The recommendations for out-of-order microarchitectures consider the much greater capabilities of the processor to find hidden instruction-level parallelism through register renaming and dynamic execution scheduling. For the c7g platform, Arm recommends an unroll factor of eight to make the best use of the considerable execution resources of the microarchitecture.
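The structure of an unrolled loop with its tail handling can be sketched as below. This is a structural model only: Python gains no speed from unrolling, and process_unrolled and process_one are hypothetical names standing in for the assembly loop and its per-block work. The point is to show where the extra code for partial iterations comes from.

```python
from typing import Callable, List, Tuple


def process_unrolled(
    blocks: List[bytes],
    process_one: Callable[[bytes], bytes],
    unroll: int = 8,
) -> Tuple[List[bytes], int, int]:
    """Apply process_one to every block with an unrolled main loop plus a tail.

    Returns the outputs, the number of main-loop iterations, and the number
    of blocks left over for the tail.
    """
    n_main = len(blocks) - len(blocks) % unroll
    out: List[bytes] = []
    # Main loop: each iteration handles `unroll` independent blocks. In an
    # assembly implementation these would be interleaved instruction streams.
    for start in range(0, n_main, unroll):
        group = blocks[start:start + unroll]
        out.extend(process_one(b) for b in group)
    # Tail: the block count is not always a multiple of the unroll factor,
    # so the remaining blocks need their own code path.
    for b in blocks[n_main:]:
        out.append(process_one(b))
    return out, n_main // unroll, len(blocks) - n_main
```

With 19 blocks and an unroll factor of eight, for example, the main loop runs twice and the tail handles the remaining three blocks.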
Arm recommends that implementations of AES-GCM targeting microarchitectures which include the Cryptographic Hashing extensions introduced in Armv8.2-A use the EOR3 instruction.
The EOR3 instruction implements a three way exclusive-or operation. This allows a programmer to replace this chain of instructions:
EOR v0.16b, v1.16b, v2.16b
EOR v0.16b, v0.16b, v3.16b
with the single instruction:
EOR3 v0.16b, v1.16b, v2.16b, v3.16b
This is beneficial both in reducing the code size of the library routine and in reducing the number of operations that must be completed by the processor.
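The equivalence of the two forms can be checked with a small Python model of the 16-byte vector registers. The helper names eor and eor3 are our own and simply mirror the instruction mnemonics. This models the result of the instructions, not their timing.

```python
def eor(a: bytes, b: bytes) -> bytes:
    """Model of EOR on a 16-byte vector: bytewise exclusive-OR of two inputs."""
    return bytes(x ^ y for x, y in zip(a, b))


def eor3(a: bytes, b: bytes, c: bytes) -> bytes:
    """Model of EOR3 on a 16-byte vector: exclusive-OR of three inputs."""
    return bytes(x ^ y ^ z for x, y, z in zip(a, b, c))


v1 = bytes(range(16))
v2 = bytes([0xA5] * 16)
v3 = bytes(range(240, 256))

two_step = eor(eor(v1, v2), v3)  # EOR followed by a dependent EOR
one_step = eor3(v1, v2, v3)      # single EOR3
```

Here two_step and one_step hold the same 16 bytes. In the two-instruction form the second EOR has to wait for the result of the first, a dependency that the single EOR3 removes.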
The optimizations described above have been merged into upstream OpenSSL [2].
We tested AES-GCM included in OpenSSL on AWS instances. The basic information about the instances and OS kernel version is as follows:
gcc-10
-mtune=native
To run the tests, we used the following command:
openssl speed -multi num_of_threads -evp aes-bits-gcm
We tested AES-GCM with different numbers of threads (1, 2, 4, 8, 16, 32, 64), key sizes (128 and 192 bits), and buffer sizes (1024 and 8192 bytes). Each test is named aesXXgcm-YY_evp, meaning that it uses an XX-bit key and a YY-byte buffer.
The comparison of AES-GCM performance on c6i, c6g, and c7g is shown in the figure below. Note that the results on the Arm instances do not include any of the optimizations described in this blog. Performance is reported as throughput in MB/s.
Overall, the performance of AES-GCM on c7g is around 100% higher than on c6i. c6i achieves higher performance with a single thread, but it does not scale well with multiple threads. With hyperthreading on c6i, if one hyperthread executes AVX-512 code, the core has to reduce its frequency, which slows down execution on the sibling hyperthread of the same core. The performance on c6g is slightly lower than on c6i, but it catches up with c6i when running multiple threads.
We evaluated the performance gains on c7g from the two optimizations described above, in two configurations: loop unrolling alone, and EOR3 plus loop unrolling. Results are presented below. As the buffer size increases from 1024 to 8192 bytes, the improvement from the optimizations also grows. For buffer_1024, loop unrolling improves performance by ~22% and EOR3 achieves another 8% improvement. For buffer_8192, loop unrolling improves performance by ~34% and EOR3 offers another 11% improvement. Note that the performance gains drop starting at thread_number=16, because performance is limited by memory bandwidth above 16 cores.
AWS Graviton3 (c7g) shows a strong performance uplift on the AES-GCM encryption workload compared to AWS Graviton2 (c6g) and 3rd Gen Intel Xeon Scalable (c6i). With multiple threads, the performance of AES-GCM on c7g is almost double that on c6i, and the performance on c6g catches up with c6i. Moreover, the new instruction extension (EOR3) and microarchitecture-dependent loop unrolling enable further optimization of AES-GCM on c7g. Based on the evaluation results, an unroll factor of eight makes the best use of the microarchitecture and improves performance by up to 33% on c7g. Applying the EOR3 instruction brings another 10% improvement. Though not tested, the optimizations mentioned in this blog are expected to be effective for other modes in the AES family.
[1] https://en.wikipedia.org/wiki/Galois/Counter_Mode
[2] https://github.com/openssl/openssl/commit/954f45ba4c504570206ff5bed811e512cf92dc8e
[3] https://aws.amazon.com/ec2/spot/pricing/