# Parallelizing HPCG's main kernels

In November 2017, a new Top500 ranking was added to highlight the diversity of HPC workloads. The newly added benchmark is called High Performance Conjugate Gradient (HPCG), and its results are updated on the same six-month cadence as the other Top500 rankings.

The main purpose of HPCG is to stress different system capabilities than in HPL. As a general example, whereas HPL works on dense matrices, HPCG performs almost the entirety of its operations on sparse ones, making it representative of a different set of applications.

The HPCG code aims to solve the following linear system:

`Ax = r`

where `A` is a sparse matrix of size `N x N` and `x` and `r` are vectors of size `N`. `A` and `r` are known, while `x` is unknown and its values are approximated during the execution of the benchmark. A conjugate gradient method is used to solve this linear system.

In the specific case of HPCG, the matrix `A` represents the connectivity graph of a three-dimensional grid of size `NX x NY x NZ`, where each element is connected to its 26 surrounding elements. Therefore, the resulting sparse matrix representing this grid has a total of `NX x NY x NZ` rows, each with a maximum of 27 non-zero column values (i.e., a 27-point stencil).

To solve the sparse linear system, the benchmark implements preconditioned conjugate gradient iterations, using a multigrid method as the preconditioner. Specifically, the multigrid method applies a symmetric Gauss-Seidel smoother and then coarsens the grid through a restriction operation. This is done three times, so the smoother is applied to a total of four grids. Once the coarsest level is reached and smoothed, the grid is refined through the prolongation operation and smoothed again at each level, as many times as the restriction operation was performed. The restriction operation in HPCG halves each dimension of the grid (i.e., reduces the number of elements by a factor of 8), while the prolongation operation performs the symmetric computation. The smoother is implemented in the function `ComputeSYMGS`, while restriction and prolongation are implemented in `ComputeRestriction` and `ComputeProlongation`, respectively.
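The recursive structure described above can be sketched as follows. This is an illustrative simplification, not HPCG's actual code: the kernel names match HPCG's functions, but the signature and the trace vector are hypothetical, used here only to make the call order visible.

```cpp
#include <string>
#include <vector>

// Sketch of the multigrid V-cycle: smooth, restrict, recurse on the
// coarser grid, prolongate, then smooth again. Levels run from 0
// (finest) to maxLevel (coarsest); the trace records the call order.
void MGCycle(int level, int maxLevel, std::vector<std::string>& trace) {
    trace.push_back("ComputeSYMGS(" + std::to_string(level) + ")");
    if (level < maxLevel) {
        trace.push_back("ComputeRestriction(" + std::to_string(level) + ")");
        MGCycle(level + 1, maxLevel, trace);  // recurse on the coarser grid
        trace.push_back("ComputeProlongation(" + std::to_string(level) + ")");
        trace.push_back("ComputeSYMGS(" + std::to_string(level) + ")");
    }
}
```

With `maxLevel = 3` (three restrictions, four grids) the smoother runs once on the coarsest grid and twice on every other level, matching the description above.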

### HPCG features

The current HPCG implementation is written in C++ and uses MPI and/or OpenMP for parallelization. The benchmark is split into five main phases:

1. Problem setup phase and validation.
• Data structures are generated in this phase and checked to ensure the benchmark rules are fulfilled. Setup time is recorded and used afterwards to compute the final performance.
2. Compute reference Sparse Matrix-Vector multiplication (SpMV) and Symmetric Gauss-Seidel (SymGS).
• The reference SpMV and SymGS are timed. This information is used afterwards to check the improvements obtained by user optimizations.
3. Compute reference Conjugate Gradient (CG).
• The reference CG is executed for 50 iterations and the attained residual is stored. This residual must be reached by the optimized version.
4. Setup optimized CG run and validation.
• The user's optimization code is executed and one run of the optimized CG is timed. This time is used to decide how many CG sets are run during the next phase. Setup time is recorded and used afterwards to compute the final performance.
5. Optimized CG run.
• The actual benchmarking phase. Performance is recorded and reported.

As one may notice, the time spent generating the problem and applying optimization techniques to it is recorded and used to compute the final performance. In other words, you may change how the problem is generated and optimized while adhering to certain rules, but the less time spent on these steps, the smaller their penalty on the final reported performance.

HPCG features a few main kernels, each one stressing different system characteristics. Each of these kernels is properly isolated from the others in terms of source code and procedures, making the code easier to handle for potential source modifications. The benchmark phase (i.e., the optimized CG run) is represented with the following call structure:

Each function performs the following operation:

• `ComputeDotProduct`
• Given two vectors, computes their dot product.
• `ComputeWAXPBY`
• Computes the scaled vector sum `W ← αX + βY`, where `W`, `X` and `Y` are vectors and `α` and `β` are scalars.
• `ComputeSYMGS`
• Computes a single symmetric Gauss-Seidel iteration, consisting of a forward and a backward sweep.
• `ComputeSPMV`
• Given a sparse matrix and a vector, computes their sparse matrix-vector multiplication.
• `ComputeRestriction`
• Coarsens the current grid by applying the restriction operation.
• `ComputeProlongation`
• Refines the current grid by applying the prolongation operation.
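To make the data layout used in the snippets below concrete, a minimal `ComputeSPMV`-style loop might look as follows. This is a sketch with illustrative names (`SpMVSketch`, `std::vector`-based storage), not HPCG's actual implementation, which uses raw arrays:

```cpp
#include <vector>

// Sparse matrix-vector product y = A * x, with one array of column
// indices and one array of non-zero values per row, as in HPCG's layout.
void SpMVSketch(const std::vector<std::vector<int>>& A_indices,
                const std::vector<std::vector<double>>& A_values,
                const std::vector<double>& x,
                std::vector<double>& y) {
    for (std::size_t i = 0; i < A_indices.size(); ++i) {
        double sum = 0.0;
        for (std::size_t j = 0; j < A_indices[i].size(); ++j)
            sum += A_values[i][j] * x[A_indices[i][j]];  // gather from x
        y[i] = sum;
    }
}
```

Note the indirect access `x[A_indices[i][j]]`: this gather is what distinguishes sparse from dense matrix-vector products and is central to the locality discussion later in this article.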

### Kernels of interest

`ComputeSYMGS` is the kernel where most of the time is spent. The Gauss-Seidel pseudo-code can be found below. Notice that the snippet only shows the forward sweep; the actual implementation is composed of two sweeps, one forward and one backward. The only difference between them is the direction in which the sparse matrix is traversed: rows `0..N` in the forward sweep and `N..0` in the backward one.

```
double *x;                   // solution vector, updated in place
double *rhs;                 // right-hand side vector
double *A_diagonal_values;   // diagonal entry of each row
int *numberOfNeighbours;     // number of non-zeros per row
int **A_indices;             // column indices per row
double **A_values;           // non-zero values per row

for ( int i = 0; i < numberOfRows; i++ ) {
    double s = rhs[i];
    // Subtract contributions of the neighbours, including row i itself
    for ( int j = 0; j < numberOfNeighbours[i]; j++ ) {
        s = s - A_values[i][j] * x[A_indices[i][j]];
    }
    s = s + A_diagonal_values[i] * x[i]; // Add back row i's own contribution
    x[i] = s / A_diagonal_values[i];     // Update the X vector
}
```

The main issue with the symmetric Gauss-Seidel algorithm is that it is sequential by definition. To update the `i-th` position of the array `X`, we need the values in `X` of the neighbors of the `i-th` row. On top of that, updates to `X` from previous iterations are used within the same loop, thus carrying dependencies. HPCG's reference implementation of the Gauss-Seidel algorithm is, in fact, not parallelized.

There are several techniques to parallelize the symmetric Gauss-Seidel algorithm though. Each of them has its pros and cons. We will cover some of these next.

## Multi-coloring

The multi-coloring technique is probably the easiest to implement. It consists of assigning colors to every node so that any two elements with a direct dependency do not share the same color. Rows with the same color can then be processed in parallel during the Gauss-Seidel algorithm, since they do not depend on each other. Of course, this technique breaks the carried dependencies, thus relaxing the Gauss-Seidel algorithm. This leads to an increase in the number of times the multigrid method must be applied in a given CG iteration to reach the same residual as HPCG's reference Gauss-Seidel implementation.
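A greedy first-fit assignment is one simple way to implement such a coloring. The sketch below is illustrative (not HPCG's code): it gives each row the smallest color not already used by any of its colored neighbours.

```cpp
#include <vector>

// Greedy first-fit coloring: rows sharing a color have no edge between
// them in the dependency graph, so they can be swept in parallel.
std::vector<int> GreedyColoring(const std::vector<std::vector<int>>& neighbours) {
    const int n = static_cast<int>(neighbours.size());
    std::vector<int> color(n, -1);                 // -1 means "not yet colored"
    for (int i = 0; i < n; ++i) {
        std::vector<bool> used(n, false);
        for (int nb : neighbours[i])
            if (color[nb] >= 0) used[color[nb]] = true;  // colors taken by neighbours
        int c = 0;
        while (used[c]) ++c;                       // smallest free color
        color[i] = c;
    }
    return color;
}
```

For the 27-point stencil of HPCG, a scheme of this kind typically yields a handful of colors, each one defining a set of rows that can be updated concurrently.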

A negative effect of this technique, apart from the relaxation, is how the sparse matrix rows are accessed. In HPCG's reference implementation, consecutive rows were accessed one after the other (i.e., `0, 1, 2, ..., N`). When using the multi-coloring technique, rows are accessed by color. This means that two consecutive accesses will not access two consecutive rows anymore, thus reducing the spatial locality present in the reference code. On top of that, since consecutive rows also share neighbors, temporal locality is negatively affected when not accessing consecutive rows on consecutive accesses.

These effects can be minimized by reordering HPCG's data structures so that consecutive accesses end up touching consecutive memory locations. This improves spatial locality, but it does not help temporal locality unless the reordering considers both the rows' and the neighbours' new access order, which can have a huge impact on the optimization phase and therefore on the final performance.

## Block multi-coloring

This technique follows the same idea as in the multi-coloring one, but now grouping elements (usually consecutive) into blocks and applying colors by block instead of by element. Colors are assigned in a way that blocks sharing the same color do not have any direct dependency between elements within each block. Therefore, blocks with the same color can be processed in parallel, while elements within a block must be processed sequentially.

Compared to the multi-coloring technique, both spatial and temporal locality are improved. The bigger the blocks, the more similar to a sequential implementation the resulting code will be, thus increasing spatial and temporal locality. The trade-off is that bigger blocks mean less parallelism (i.e., fewer blocks, and hence fewer blocks per color). Similarly, changing the number of colors also impacts parallelism and the number of iterations needed to reach the same residual (e.g., increasing the number of colors decreases the number of blocks per color, thus decreasing parallelism). The general rule is: the greater the distance between the closest rows of two blocks with the same color, the less parallelism and the less relaxation, and therefore fewer iterations to converge. Decreasing this distance, by using smaller blocks or more colors, increases parallelism but at the cost of additional Gauss-Seidel relaxation, therefore needing more iterations to converge. Data structures can again be reordered to further improve data locality.
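The block-size/color trade-off can be illustrated with a toy scheme that assumes a simple 1D ordering of rows (a deliberate simplification; the real 27-point stencil requires checking dependencies in all three dimensions):

```cpp
#include <vector>

// Group consecutive rows into blocks of size B and assign colors 0..C-1
// cyclically. Larger B or larger C increases the distance between the
// closest rows of two same-colored blocks: more locality and less
// relaxation, but fewer blocks per color to process in parallel.
std::vector<int> BlockColors(int numRows, int B, int C) {
    std::vector<int> color(numRows);
    for (int i = 0; i < numRows; ++i)
        color[i] = (i / B) % C;  // block index modulo number of colors
    return color;
}
```

For example, with 8 rows, blocks of 2 and 2 colors, rows are colored `0,0,1,1,0,0,1,1`: the two color-0 blocks (`{0,1}` and `{4,5}`) are separated by one block and can be processed in parallel.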

## Multi-level task dependency graph

Both techniques described so far rely on coloring. This changes with the multi-level task dependency graph, which consists of building a dependency graph where nodes represent the different elements of the grid. Dependencies between elements of the grid are translated into graph edges, and the graph traversal order matches the order of computation.

In addition, the graph is split into a set of levels which are processed sequentially from `0..N`. Elements of a given level have their dependencies fulfilled when the level is processed. All elements on the same level do not depend on each other (i.e., they can be processed in parallel).
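The level construction can be sketched as follows. This is illustrative code, not HPCG's implementation; it assumes each row's dependencies have lower row indices, as is the case for the forward sweep:

```cpp
#include <algorithm>
#include <vector>

// A row's level is one more than the deepest level among the rows it
// depends on, so all rows within a level are mutually independent and
// their dependencies are fulfilled once the previous levels are done.
std::vector<int> AssignLevels(const std::vector<std::vector<int>>& deps) {
    std::vector<int> level(deps.size(), 0);  // rows with no deps go to level 0
    for (std::size_t i = 0; i < deps.size(); ++i)
        for (int d : deps[i])                // d < i by assumption
            level[i] = std::max(level[i], level[d] + 1);
    return level;
}
```

For instance, if rows 1 and 2 both depend on row 0, and row 3 depends on rows 1 and 2, the levels are `{0, 1, 1, 2}`: rows 1 and 2 land in the same level and can be processed in parallel.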

The main difference compared to the coloring techniques is that the multi-level task dependency graph method respects the data dependencies of the reference order of computation. This translates into no penalty on the number of iterations needed to reach the same residual. However, not everything is advantageous with this technique. The order of computation, even though respected in terms of data dependencies, is in fact modified, as rows whose dependencies have been resolved are now computed in parallel. In the sequential implementation of the Gauss-Seidel method, the `i-th` row is processed on the `i-th` iteration, which guarantees that its dependencies are fulfilled by the time it is processed. With the multi-level task dependency graph technique, the `i-th` row is processed as soon as its dependencies are fulfilled, without considering the initial iteration space, so more than one row can be processed at a time. As we saw previously, changing the order of computation negatively affects spatial and temporal locality. Another drawback is that different graph levels expose varying amounts of parallelism: there is no parallelism at all in the first two levels, parallelism then steadily increases until reaching its maximum, and then decreases again, with no parallelism in the last two levels.

### Merging all together

As already explained, during the multigrid method a restriction of the sparse matrix is performed three times. Each time, the grid size is reduced by a factor of 8: the first coarse level of the grid is 8 times smaller than the finest grid, the next coarse level has 64 times fewer elements, and the coarsest level features 512 times fewer elements than the finest level.
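As a worked example, assuming a local grid of `104 x 104 x 104` (a commonly used HPCG problem size; any dimensions divisible by 8 behave the same way), the level sizes work out as:

```cpp
// Each restriction halves every dimension, dividing the row count by 8.
long nx = 104, ny = 104, nz = 104;
long finest = nx * ny * nz;   // 1,124,864 rows on the finest grid
long level1 = finest / 8;     //   140,608 rows
long level2 = finest / 64;    //    17,576 rows
long level3 = finest / 512;   //     2,197 rows on the coarsest grid
```

The coarsest grid has three orders of magnitude fewer rows than the finest one, which is why a technique with enough parallelism on the finest level can starve on the coarser ones.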

| Technique | Pros | Cons |
| --- | --- | --- |
| Multi-coloring | Color assignment is easy to implement, and its impact on the final performance is considerably low. Enables parallelism in SymGS. | Temporal locality is lost. Spatial locality is lost unless reordering is applied during the optimization phase, which negatively impacts the final performance due to the increase in the execution time of this phase. Hugely increases the number of iterations needed to reach the same residual as the reference version. |
| Block multi-coloring | Enables parallelism in SymGS. Keeps some degree of the original spatial and temporal locality; the bigger the block size, the lower the impact on locality. | Increases the number of iterations needed to reach the same residual as the reference version. Lack of parallelism when using bigger block sizes. |
| Multi-level task dependency graph | Same number of iterations to reach the same residual as the reference version. | Parallelism is not consistent across the levels of the dependency graph. Lack of parallelism on grids with fewer elements, so it is not recommended for the coarser levels of the grid. |

Our analysis shows that choosing one technique to parallelize the Gauss-Seidel kernel is not always the best way to go. For example, if choosing the multi-level task dependency graph, sufficient parallelism will be exposed while processing the finest level of the grid, but it will not be enough when processing the coarsest level, therefore leading to idle resources. Similarly, when using the block multi-coloring technique, keeping the same block size and the same number of colors across the different levels of the multigrid will harm parallelism.

Our analysis also shows poor automatic compiler vectorization of the Gauss-Seidel kernel. Given the memory access patterns and the number of inter-element dependencies, this is expected; the Gauss-Seidel algorithm is known to be difficult to vectorize due to its inherent serialization. In the case of block multi-coloring, elements within a block are processed sequentially, so dependencies between consecutive rows are still present. To break these dependencies, one can merge different blocks of the same color by interleaving their elements. This way, consecutive rows within a merged block no longer depend on each other, since two elements from different blocks of the same color cannot depend on each other. As for the multi-level task dependency graph, rows within the same level already do not depend on each other.
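This interleaving idea can be sketched as follows (illustrative code, assuming two same-colored blocks of equal size; function and variable names are hypothetical):

```cpp
#include <vector>

// Merge two same-colored blocks by alternating their rows: consecutive
// positions in the resulting traversal order belong to different blocks,
// so they carry no dependency on each other and can sit in SIMD lanes.
std::vector<int> InterleaveBlocks(const std::vector<int>& blockA,
                                  const std::vector<int>& blockB) {
    std::vector<int> order;
    order.reserve(blockA.size() + blockB.size());
    for (std::size_t k = 0; k < blockA.size(); ++k) {
        order.push_back(blockA[k]);
        order.push_back(blockB[k]);
    }
    return order;
}
```

For example, interleaving rows `{0, 1, 2}` and `{9, 10, 11}` yields the traversal order `{0, 9, 1, 10, 2, 11}`, where each pair of adjacent rows comes from different blocks.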

After several experiments, we found that the best solution was to mix different techniques depending on the level of the multigrid being processed. In our current HPCG implementation, we use the multi-level task dependency graph technique for the finest level, since the number of elements available provides enough parallelism. For the coarser levels, we use a modified block multi-coloring, where the block size and the number of colors depend on the actual grid size and the number of threads available. Although we do not provide manual vectorization via intrinsics or hand-written assembly code, vectorization should be easier to implement with the techniques we applied. In fact, the code changes we performed enabled manual unrolling, closing the gap to a potential manually-vectorized version.

### Summary

In this article we introduced the HPCG benchmark and the different techniques that we used to parallelize its main kernel. Different techniques improve some metrics while harming others; therefore, we use a different approach depending on the level of the multigrid being processed. In this sense, our current HPCG implementation uses:

• The multi-level task dependency graph technique for the finest level of the multigrid.
• The number of elements on this level provides enough parallelism across the Gauss-Seidel sweeps.
• A modified block multi-coloring for the rest of the levels of the multigrid.
• Since the number of elements on each level differs, the block size and the number of colors are decided at execution time, maximizing the use of the available resources.
• Blocks are actually created by applying a restriction operation. Colors are then assigned as in the multi-coloring technique.
• Blocks with the same color are interleaved to break dependencies between consecutive elements, enabling vectorization.

Our goal is to offer an open-source HPCG code that provides a solid starting point for further community-driven optimizations, and we believe the code showcased in this article fulfills this objective.
