The methodology was the result of a team collaboration between myself, Miguel Tairum-Cruz and Roxana Rusitoru.
The computational capability of High Performance Computing (HPC) systems is measured by running a set of well-defined benchmarks that are widely accepted by the scientific community. Traditionally, High-Performance Linpack (HPL), a compute-bound, floating-point-heavy workload, was used to rank systems. As a result, many Top500 machines were built to achieve high HPL performance rather than to perform well on more realistic workloads, especially by today’s standards. To meet the challenge of creating a more representative micro-benchmark, Mike Heroux (along with Jack Dongarra and Piotr Luszczek) created the High Performance Conjugate Gradient (HPCG) benchmark, which, as of ISC 2017, is published alongside the standard Top500 as a companion ranking, officially known as the HPCG Performance List.
The HPCG benchmark solves a linear system of equations using a preconditioned conjugate gradient method. The most interesting aspect of this benchmark is the characteristics of the computations performed within its kernels, which are representative of real-world scientific applications run on HPC systems, such as computational fluid dynamics and computational photography. The benchmark exercises all aspects of the compute system and emphasizes the significance of both the compute and the data-delivery subsystems (memory, storage, interconnect) to the overall performance.
Due to the importance of this benchmark in the HPC community, Arm has been working on optimizing HPCG. These optimizations targeted the lack of parallelism present in the Gauss-Seidel kernel, at the cost of losing single-core performance. Detailed information about the parallelization techniques applied can be found in this HPCG blog post. The next step was to recover the single-core performance lost along the way. For this, we decided to explore vectorization of the main HPCG kernels whilst porting them to the Arm Scalable Vector Extension (SVE). But how do you optimize code for a vector extension for which no hardware has been publicly released yet? Through emulation or simulation.
We chose the former, and picked the Arm Instruction Emulator (ArmIE) for the following reasons:
The Arm Instruction Emulator (ArmIE) enables users to execute unsupported instructions on Armv8-A platforms, such as those from the SVE instruction set, by dynamically converting those instructions into native ones. However, due to this conversion, any kind of timing information is lost.
In addition to emulation, ArmIE can be expanded via dynamic binary instrumentation clients. These clients can be used to extract different metrics, such as dynamic instruction counts or memory and instruction traces. ArmIE supports an emulation API that enables users to write their own clients, thus further extending ArmIE's instrumentation capabilities.
You can find further information on how ArmIE works and how to use it here.
SVE optimization methodology steps
When optimizing HPCG, we wanted to infer the potential relative performance benefits from the metrics offered by ArmIE via its clients. To achieve this, we created a flexible methodology whose steps can be applied in any order. These are pictured above. The ratio of SVE instructions in your code directly tells you whether your vector units will be used at all. Unless you use SVE intrinsics, vectorization mainly relies on the compiler. Therefore, checking what the compiler was able to vectorize is really important, since it can point the user to problematic areas of the code (e.g., a loop that is not vectorizing because it is reversed).
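As an illustration, LLVM-based compilers (such as the Arm compilers for HPC) can report which loops they vectorized and why others were left scalar; the flags in the comment below assume such a toolchain:

```cpp
// Sketch: requesting per-loop vectorization remarks from an LLVM-based compiler.
// Assumed compile line (adjust to your toolchain):
//   armclang -O3 -march=armv8-a+sve -Rpass=loop-vectorize \
//            -Rpass-missed=loop-vectorize -Rpass-analysis=loop-vectorize file.cpp

void axpy(double *y, const double *x, double a, int n) {
  // Independent iterations: typically reported as vectorized.
  for (int i = 0; i < n; ++i)
    y[i] += a * x[i];
}

void recurrence(double *y, int n) {
  // Loop-carried dependence: typically reported as not vectorized,
  // with a remark pointing at the offending statement.
  for (int i = 1; i < n; ++i)
    y[i] = y[i - 1] + 1.0;
}
```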
To aid the optimization process, metrics should be obtained not only for the whole application but also for specific parts of the code (i.e., Regions of Interest, or RoI). The ArmIE memory trace client already supports RoI instrumentation, and more clients will do so in future versions. For this work, we added the RoI functionality to all the clients.
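Conceptually, an RoI is delimited directly in the application source around the code we actually want to measure. The markers below are hypothetical no-op placeholders used purely for illustration, not the actual mechanism shipped with ArmIE:

```cpp
// Illustration only: delimiting a Region of Interest around the code to be
// instrumented. ROI_START()/ROI_STOP() are hypothetical stand-ins for the
// client-provided markers.
#include <cstdio>

#define ROI_START() do { /* real marker provided by the instrumentation client */ } while (0)
#define ROI_STOP()  do { /* real marker provided by the instrumentation client */ } while (0)

static void one_cg_iteration() { std::puts("one CG iteration"); } // placeholder work

int main() {
  // Setup code here is excluded from the collected metrics.
  ROI_START();          // metric collection begins
  one_cg_iteration();   // the region we want to measure, e.g. one CG iteration
  ROI_STOP();           // metric collection ends
  // Teardown code here is excluded from the collected metrics.
  return 0;
}
```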
Another metric that could affect performance is the average lane utilization of the vectors. SVE uses predicate registers to specify which lanes of a vector are enabled. Disabled lanes do not update their destination register values. Therefore, even if SVE instructions are issued, if the average number of enabled lanes per vector instruction is low, the vectors will not be fully utilized, thus potentially missing out on higher performance. This metric can already be derived from the memory traces generated by the ArmIE memory trace client.
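As a concrete illustration (SVE ACLE intrinsics; compile with an SVE-capable compiler, e.g. -march=armv8-a+sve), the loop below runs with all lanes enabled for every full vector and with only the leftover lanes enabled in the final partial iteration, which is exactly what the lane-utilization metric captures:

```cpp
#include <arm_sve.h>

// Scale n doubles by a, one whole SVE vector per iteration.
void scale(double *y, const double *x, double a, int64_t n) {
  for (int64_t i = 0; i < n; i += svcntd()) {        // svcntd() = doubles per vector
    svbool_t pg = svwhilelt_b64(i, n);               // lanes with i + lane < n are enabled
    svfloat64_t vx = svld1_f64(pg, &x[i]);           // contiguous predicated load
    svst1_f64(pg, &y[i], svmul_n_f64_x(pg, vx, a));  // predicated multiply and store
  }
}
```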
The memory instruction mix (i.e., how many times each kind of memory access occurred) can also be derived by post-processing the memory traces generated with ArmIE. This metric provides the ratio of SVE memory instructions against non-SVE ones and, for each SVE memory access, reports which kind of access it was (i.e., contiguous or gather/scatter). For all the memory accesses, the number of bytes loaded or stored is also available.
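At the intrinsics level, the two classes of SVE accesses look as follows (a sketch of an indirect sum, the pattern that typically produces gathers):

```cpp
#include <arm_sve.h>

// Sum values[idx[0..n-1]]: the index array is read with contiguous loads,
// while the indexed values are read with gather loads (each lane may touch
// a different cache line).
double sum_indirect(const double *values, const uint64_t *idx, int64_t n) {
  svfloat64_t acc = svdup_f64(0.0);
  for (int64_t i = 0; i < n; i += svcntd()) {
    svbool_t pg = svwhilelt_b64(i, n);
    svuint64_t vidx = svld1_u64(pg, &idx[i]);                     // contiguous load
    svfloat64_t v = svld1_gather_u64index_f64(pg, values, vidx);  // gather load
    acc = svadd_f64_m(pg, acc, v);   // merging add keeps inactive lanes unchanged
  }
  return svaddv_f64(svptrue_b64(), acc);                          // horizontal reduction
}
```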
Lastly, cache statistics can tell the user whether the code could potentially perform better. Since cache statistics require a cache model, which ArmIE, as an emulator, does not have, we wrote a cache simulator which supports prefetchers implemented as plugins. We used a stride prefetcher in our experiments.
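To make the idea concrete, here is a heavily simplified, trace-driven sketch of such a model (a single-level, direct-mapped cache with an inline stride prefetcher); the simulator we actually used models set-associativity, multiple levels, and prefetchers loaded as plugins:

```cpp
#include <cstdint>
#include <vector>

// Toy trace-driven cache model: feed it the addresses from a memory trace
// and read back the hit ratio.
class DirectMappedCache {
public:
  DirectMappedCache(uint64_t size_bytes, uint64_t line_bytes)
      : line_bytes_(line_bytes), tags_(size_bytes / line_bytes, ~0ull) {}

  void access(uint64_t addr) {
    lookup(addr) ? ++hits_ : ++misses_;
    // Stride prefetcher: if two consecutive accesses are separated by the
    // same stride, fill the line the next access is predicted to touch.
    int64_t stride = static_cast<int64_t>(addr - last_addr_);
    if (stride != 0 && stride == last_stride_) fill(addr + stride);
    last_stride_ = stride;
    last_addr_ = addr;
  }

  double hit_ratio() const {
    uint64_t total = hits_ + misses_;
    return total ? static_cast<double>(hits_) / total : 0.0;
  }

private:
  bool lookup(uint64_t addr) {
    uint64_t line = addr / line_bytes_, set = line % tags_.size();
    if (tags_[set] == line) return true;
    tags_[set] = line;                   // allocate on miss
    return false;
  }
  void fill(uint64_t addr) {
    uint64_t line = addr / line_bytes_;
    tags_[line % tags_.size()] = line;   // prefetched line, no hit/miss counted
  }

  uint64_t line_bytes_;
  std::vector<uint64_t> tags_;           // one tag per set (direct-mapped)
  uint64_t hits_ = 0, misses_ = 0;
  uint64_t last_addr_ = 0;
  int64_t last_stride_ = 0;
};
```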
We combine all this data analysis in our methodology. To reiterate, we obtain all the metrics after running the applications on ArmIE, analyse the results, and then infer relative performance variations.
To improve the single-core performance of our optimized HPCG code, we developed a version with SVE intrinsics on top of the optimized code. Hence, we will be focusing on these three versions of HPCG, all compiled with an SVE-capable compiler.
For the purpose of readability, we use the following naming scheme for our HPCG versions:

- Reference: the unmodified HPCG reference implementation;
- No-intrinsics: the parallelization-optimized code without SVE intrinsics;
- Intrinsics: the parallelization-optimized code with hand-written SVE intrinsics.
We started our HPCG optimization journey by checking how different compilers would vectorize the code. We compiled HPCG with different compilers and checked how many loops were automatically vectorized. To understand the differences with other SIMD technologies, we added an AVX2-enabled compiler for comparison.
We observed that the compiler left some loops unvectorized in the no-intrinsics code. After further inspection, we discovered those loops were contained in the most executed kernel, the symmetric Gauss-Seidel (i.e., SymGS). This helped prioritize which loops we hand-optimized first.
We know the compiler is able to vectorize most of the loops found in the main computational kernels. But how many instructions do those represent compared to the total number of instructions executed? To gather this information, we used the instruction count client shipped with ArmIE. We delimited the RoI to one conjugate gradient iteration.
The following chart shows the breakdown of the dynamically executed instructions for each version, differentiating between SVE and non-SVE instructions.
Instruction count reduction when increasing vector length
We noticed that the optimized versions of HPCG executed more instructions than the baseline code. This was expected because of the overhead caused by the parallelization techniques applied. With hand-crafted SVE intrinsics, we were able to reduce the total number of dynamically executed instructions when compared to the HPCG version without intrinsics, thus reducing the gap against the reference code.
Looking at the percentage of SVE instructions versus non-SVE ones, the intrinsics code presents a lower ratio than the other two versions. This was not expected, so we went a bit deeper and gathered instruction counts with ArmIE at kernel level. The chart below shows the data obtained.
Percentage of SVE instructions per computational kernel
We noticed that vectorization was more evenly spread across the kernels in the intrinsics version. Also, the multi-grid kernel presented a lower percentage of executed SVE instructions, and the dot-product kernel's vector instruction ratio was also lower, while SPMV and WAXPBY presented ratios similar to the other two versions of HPCG.
Looking at the multi-grid kernel, the lower ratio can be explained by a more efficient use of SVE instructions: in fact, the total number of instructions is actually reduced compared to the no-intrinsics code. The DotProduct kernel presents a similar behavior to the multi-grid: the intrinsics code features a lower percentage of SVE instructions, but at the same time the total number of dynamically executed instructions is also lower.
As for the WAXPBY, we realized that the compiler was generating both SVE and non-SVE versions of the code. Which version of the kernel is executed is decided at runtime, and in all the executions we performed, the non-vectorized version of the kernel was always chosen.
After understanding how much vectorization was present in the code, we focused on finding the vector utilization. To get this information, we ran the three versions of HPCG through the memory trace client in ArmIE. The generated memory traces can then be post-processed to obtain the number of lanes enabled for each SVE memory access. The chart below shows this information.
Average lane utilization
All three versions of the benchmark presented the same characteristics. Around 10% of the SVE memory accesses have 0 to 33% of their lanes enabled, while around 15% of the SVE memory accesses have 34 to 99% active lanes. Around 75% of the SVE memory accesses were instructions where all the lanes were enabled.
Assuming a similar vector utilization for non-memory SVE operations, we can infer that the vectors were fully utilized most of the time, averaging a vector lane utilization of ~82% for all HPCG versions.
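For reference, the post-processing behind these numbers is conceptually simple. The sketch below assumes each SVE record in the trace exposes the number of enabled lanes and the vector width in lanes (the real trace records carry more fields, such as address, access size, and PC):

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

struct SveAccess { uint32_t enabled_lanes, total_lanes; };

// Bucket SVE memory accesses by the fraction of enabled lanes and report
// the average lane utilization, mirroring the chart above.
void lane_utilization(const std::vector<SveAccess>& accesses) {
  if (accesses.empty()) return;
  uint64_t low = 0, mid = 0, full = 0;      // 0-33%, 34-99%, 100% buckets
  double sum_util = 0.0;
  for (const auto& a : accesses) {
    double util = static_cast<double>(a.enabled_lanes) / a.total_lanes;
    sum_util += util;
    if (a.enabled_lanes == a.total_lanes) ++full;
    else if (util <= 0.33) ++low;
    else ++mid;
  }
  double n = static_cast<double>(accesses.size());
  std::printf("0-33%%: %.1f%%  34-99%%: %.1f%%  100%%: %.1f%%  average: %.1f%%\n",
              100.0 * low / n, 100.0 * mid / n, 100.0 * full / n, 100.0 * sum_util / n);
}
```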
Although SVE memory accesses presented a good average vector lane utilization, we cannot expect the same latency for all kinds of SVE memory accesses, i.e., contiguous and gather/scatter accesses. In general, a good approach to increase performance is to avoid gather loads and scatter stores, since they potentially access a larger number of different cache lines and are thus more resource demanding. To gather this information, we performed further post-processing of the memory traces, this time counting the occurrences of each kind of memory access. The chart below shows the information obtained.
Memory accesses breakdown
The memory instruction breakdown was similar for all three versions of the code, with the intrinsics code presenting a higher ratio of SVE memory accesses and a lower percentage of non-SVE memory accesses when compared to the other two versions of the benchmark.
As for the different memory accesses present in the code, we split them between:

- non-SVE memory accesses;
- contiguous SVE memory accesses;
- gather/scatter SVE memory accesses.
The chart also distinguishes between SVE memory accesses with all lanes active and those with some lanes disabled. We observed that around 60% of the memory accesses were generated by SVE memory instructions, with half of those being contiguous accesses with all lanes enabled. SVE gather/scatter accesses represented around 20% of all memory accesses.
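These gather/scatter accesses largely come from the indirect indexing inherent to the sparse kernels. Schematically (a simplified CRS-style row product, not the exact HPCG source), the matrix values and column indices are streamed contiguously, while the input vector is accessed through the index array, which is what a vectorizing compiler or the intrinsics turn into gather loads:

```cpp
// One row of a sparse matrix-vector product. vals[] and cols[] are read
// contiguously; x[cols[j]] jumps around memory, so its vectorized form is a
// gather that may touch a different cache line per lane.
void spmv_row(double &y, const double *vals, const int *cols, int nnz, const double *x) {
  double sum = 0.0;
  for (int j = 0; j < nnz; ++j)
    sum += vals[j] * x[cols[j]];   // indirect access -> gather when vectorized
  y = sum;
}
```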
To complement the memory tracing analysis, we ran the traces through the cache simulator. In our experiments, we configured the simulator with the parameters presented in the table below.
When implementing the parallelization techniques to optimize the reference HPCG code, we were aware of the potential cache hit ratio degradation. Indeed, we observed this behavior with the cache simulator, as reflected in the chart below.
Interestingly, the hit ratio was improved in the intrinsics version, when compared to the optimized HPCG code without intrinsics.
Simulated cache hit ratio
This increase in L1 hit ratio also translated into a lower average number of cycles per memory access when comparing against the optimized HPCG code without SVE intrinsics, as can be seen in the chart below.
Average number of cycles per memory access
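This metric follows the usual average-memory-access-time reasoning: every access pays the L1 latency, accesses that miss in L1 additionally pay the L2 latency, and accesses that also miss in L2 pay the memory latency on top. As a sketch (the latencies below are illustrative placeholders, not the parameters used in our experiments):

```cpp
// Illustrative only: deriving an average-cycles-per-memory-access figure from
// simulated hit ratios. Latencies are placeholder values.
double avg_cycles_per_access(double l1_hit_ratio, double l2_hit_ratio,
                             double l1_lat = 4.0, double l2_lat = 12.0,
                             double mem_lat = 200.0) {
  return l1_lat + (1.0 - l1_hit_ratio) * (l2_lat + (1.0 - l2_hit_ratio) * mem_lat);
}
```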
Up to this moment, we have presented different metrics, all of them obtained with different ArmIE clients and different post-processing procedures. Although all these metrics present value on their own, they should not be used independently to infer potential performance variations. Instead, one should look at all the metrics combined.
In our specific scenario, three different versions of HPCG were compared. Looking back at the metrics obtained, we summarize the observed changes in both optimized codes (with and without intrinsics) compared to the reference HPCG implementation.
- No-intrinsics code: metrics similar to the reference code; some kernels are not vectorized.
- Intrinsics code: SVE instructions are present in all computational kernels; the total instruction count decreases against the no-intrinsics code.
From the results obtained, we expect a loss of single-core performance in the optimized HPCG code without intrinsics versus the reference code. This is corroborated by its vector lane utilization and memory instruction mix being very similar to the reference code, combined with the higher number of executed instructions and the higher cache miss ratio.
Looking at the intrinsics code, we can infer a potential performance gain compared to the optimized HPCG without SVE intrinsics. With a similar average vector lane utilization, a lower number of total instructions, a higher number of SVE memory accesses, a lower number of non-SVE memory accesses, and an improved cache hit ratio, we expect the performance to be higher than the no-intrinsics version. When comparing against the reference code, it is hard to say which will present higher performance. The reference code executes fewer instructions and its cache hit ratio is higher, while the intrinsics code presents a better memory instruction mix and a higher ratio of SVE instructions per computational kernel. As HPCG is known to be heavily memory bound, we would expect better performance from the intrinsics version, since its memory instruction breakdown presents more favorable characteristics.
In an effort to further optimize HPCG, we needed a way of optimizing applications for SVE in the absence of tuned performance models or real hardware. The main output of this work is the methodology for optimizing applications for SVE.
The methodology presented here can thus help developers optimize for SVE. It relies heavily on ArmIE and its clients, which were extended to provide the metrics necessary for our evaluation. ArmIE development is still ongoing, and users can expect new and more refined clients, as well as more features and stability, in future versions.
Optimize your apps for Arm SVE