In part one of our blog series, we looked at the challenges raised by the lack of interoperability between the Message Passing Interface (MPI) and OpenMP, and how it hinders MPI-level tasking. Bridging the gap between the two worlds presents opportunities, and Arm Research’s MPI+OpenMP approach offers a solution. In part two, we delve into improving HPC performance and how we applied tasking to the miniAMR code across MPI and OpenMP, leveraging TAMPI.
The Exascale Computing Project develops a catalog of proxy applications: simplified codes that represent the features of large, complex applications without incorporating their large and complex code bases. MiniAMR is the application selected for this work given its complex parallelization, communication and synchronization properties, which stem from the varying levels of precision of the physics model at any given point in time. Figure 1 shows a miniAMR visualization of a 3-dimensional space, represented at different refinement levels. See our previous post for more details on miniAMR.
Figure 1: Representation of a physical 3D space with different levels of mesh refinement.
The OpenMP-level taskification in our previous work focused on load balancing and on overlapping the computation, communication (at the shared-memory level) and checksum phases, allowing parallelism to be exploited across stages within a time step. However, communication at the MPI level was disabled because the experiments focused on single MPI rank executions. The refinement phase was not taskified because it was not parallelized with OpenMP in the original fork-join MPI+OpenMP version. In this work, the taskification covers all application phases, including MPI-level communication and refinement. This allows computation and communication to overlap at both the OpenMP and MPI levels, not only across stages but also across time steps.
MiniAMR works with 3D blocks that are distributed across MPI ranks. These blocks hold the values of variables involved in the simulation, such as speed and weight. Each block has neighboring blocks that are stored either in the same rank or in a remote rank. The high-level miniAMR pseudo-code after applying our taskification strategy is shown below. The main loop is almost the same as in the original parallelization that uses pure MPI [4].
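The sketch below condenses that pseudo-code, using OmpSs-2 directives and placeholder function and variable names rather than the exact miniAMR symbols; the numbering in the comments matches the line references used in the walkthrough that follows:

/*  1 */ for (int ts = 0; ts < num_tsteps; ts++) {
/*  2 */    for (int stage = 0; stage < stages_per_ts; stage++) {
/*  3 */       for (int group = 0; group < num_groups; group++) {
/*  4 */          communicate(group);          /* spawns intra-rank copy and TAMPI send/recv tasks */
/*  5 */          stencil_calc(group);         /* spawns one stencil task per block and variable   */
/*  6 */          if (checksum_due(stage)) checksum_local(group);  /* local reduction tasks per block */
/*  7 */       }
/*  8 */       if (checksum_due(stage)) {
/*  9 */          #pragma oss taskwait         /* close parallelism before the collective          */
/* 10 */          checksum_global_validate();  /* MPI collective reduction + validation (serial)   */
/* 11 */       }
/* 12 */    }
/* 13 */    if (refinement_due(ts)) {
/* 14 */       #pragma oss taskwait            /* close parallelism                                */
/* 15 */       refine_and_load_balance();      /* split/coarsen blocks and redistribute across ranks */
/* 16 */    }
/* 17 */ }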
The loop iterates through different time steps (line 1), with each time step divided into different stages (line 2). For simplicity, we can assume a single group of variables, so the inner loop (line 3) consists of a single iteration that processes all the variables the user has indicated. Each stage first communicates (line 4) the neighboring faces for each block in the rank. The exchange can be a local copy between blocks in the same rank (shared memory) or across different ranks (MPI); these exchange types are called intra-rank and inter-rank communication, respectively. After communication, the code applies a stencil algorithm (line 5) to all blocks and variables, and those computations are independent across blocks and variables.
Every few stages, there is a checksum phase that locally reduces the values of all the variables (line 6), which are then globally reduced with an MPI collective, and validated (line 10). The local reductions are parallelizable, but the global reduction and validation are serialized after closing the parallelism with a taskwait global synchronization (line 9). The communicate, stencil and local checksum functions internally create tasks to work on the blocks. We interconnect these tasks using data dependencies on the corresponding block, allowing the overlap of these three different phases (no global synchronization between them). Then, every few time steps, the loop closes parallelism (line 14) to perform a refinement and load balance operation (line 15). In that phase, the blocks that need more precision are split into smaller blocks, while the refined blocks that can be coarsened are consolidated into bigger ones. The load balancing consists of redistributing the blocks to maintain a well-balanced workload among MPI ranks.
Next, we explain the code modifications in the communication phase, which uses MPI, as an example of this taskification strategy.
The communicate function performs both intra-rank and inter-rank communication. The exchanges of faces between neighboring blocks at different MPI ranks require the packing and unpacking of faces in communication buffers so that the face data can be sent or received through MPI as contiguous data. The original implementation, using pure MPI, leverages non-blocking operations (MPI_Isend and MPI_Irecv) to exchange the data, manually managing the resulting MPI requests and waiting for their completion using MPI_Wait and MPI_Waitany. This original implementation aims to overlap the packing and unpacking of faces, the MPI communication and the intra-rank copies.
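As a reminder, that manual pattern looks roughly like the sketch below, with illustrative names and buffer bookkeeping rather than the actual miniAMR code:

MPI_Request recv_reqs[MAX_FACES], send_reqs[MAX_FACES];

/* Post all receives up front */
for (int f = 0; f < num_recv_faces; f++)
   MPI_Irecv(recv_buf[f], face_count, MPI_DOUBLE, recv_rank[f], tag[f],
             MPI_COMM_WORLD, &recv_reqs[f]);

/* Pack and send each outgoing face */
for (int f = 0; f < num_send_faces; f++) {
   pack_face(send_buf[f], send_block[f], face_count);
   MPI_Isend(send_buf[f], face_count, MPI_DOUBLE, send_rank[f], tag[f],
             MPI_COMM_WORLD, &send_reqs[f]);
}

exchange_intra_rank_faces();   /* local copies between blocks in the same rank */

/* Unpack each face as soon as its receive completes */
for (int done = 0; done < num_recv_faces; done++) {
   int idx;
   MPI_Waitany(num_recv_faces, recv_reqs, &idx, MPI_STATUS_IGNORE);
   unpack_face(recv_block[idx], recv_buf[idx], face_count);
}
MPI_Waitall(num_send_faces, send_reqs, MPI_STATUSES_IGNORE);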
In our approach, we remove the complexity of managing MPI requests and the manual overlap by taskifying all these subphases and letting TAMPI control all the communications issued by tasks. We show our taskified communicate function in the code below. The code iterates through all three directions (line 1), exchanging the faces from each direction. In the pure MPI version, the exchanges of faces from different directions cannot overlap (they run one after the other), but in our hybrid approach we allow them to overlap naturally using tasks. The first step is to start receiving the faces from each of the neighboring ranks, using a task (line 4) for each face that calls the asynchronous TAMPI_Irecv function. The task declares an output dependency on the receiving buffer, since the remote data will be stored there. The task may finish its execution quickly, but it will only complete once the face data arrives.
Next, we iterate through the faces that have to be sent to remote ranks and pack them using a task. The packing task (line 10) copies the data from the block face (input) and writes it to a section of the sending buffer (output). Then, we instantiate a task (line 12) to send the face using TAMPI. The next step is the intra-rank communication (line 16), which also creates tasks to perform the local copies among faces (not shown in the following code). Lastly, we instantiate a task (line 19) for unpacking each received face. An unpacker task is the successor of the corresponding receiver task, such that when the face data arrives, TAMPI transparently satisfies the dependencies of the unpacker.
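The sketch below condenses our taskified communicate function. The numbering in the comments matches the line references above, while the names, the dependency expressions and the TAMPI_Irecv/TAMPI_Isend argument lists (assumed here to mirror their MPI counterparts, minus the request handle) are simplifications rather than the exact code:

/*  1 */ for (int dir = 0; dir < 3; dir++) {
/*  2 */    /* Post one receiver task per face expected from a remote rank */
/*  3 */    for (int f = 0; f < num_recv_faces[dir]; f++) {
/*  4 */       #pragma oss task out(recv_buf[dir][f][0;face_count])
/*  5 */       TAMPI_Irecv(recv_buf[dir][f], face_count, MPI_DOUBLE,
/*  6 */                   recv_rank[dir][f], tag(dir, f), comm, MPI_STATUS_IGNORE);
/*  7 */    }
/*  8 */    /* Pack and send each face that goes to a remote rank */
/*  9 */    for (int f = 0; f < num_send_faces[dir]; f++) {
/* 10 */       #pragma oss task in(send_block[dir][f][0;face_count]) out(send_buf[dir][f][0;face_count])
/* 11 */       pack_face(send_buf[dir][f], send_block[dir][f], face_count);
/* 12 */       #pragma oss task in(send_buf[dir][f][0;face_count])
/* 13 */       TAMPI_Isend(send_buf[dir][f], face_count, MPI_DOUBLE,
/* 14 */                   send_rank[dir][f], tag(dir, f), comm);
/* 15 */    }
/* 16 */    exchange_intra_rank_faces(dir);   /* local copies, taskified internally (not shown) */
/* 17 */    /* Unpack each received face; TAMPI releases the dependency when the data arrives */
/* 18 */    for (int f = 0; f < num_recv_faces[dir]; f++) {
/* 19 */       #pragma oss task in(recv_buf[dir][f][0;face_count]) out(recv_block[dir][f][0;face_count])
/* 20 */       unpack_face(recv_block[dir][f], recv_buf[dir][f], face_count);
/* 21 */    }
/* 22 */ }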
Our taskification strategy focuses on maintaining the code structure of the original miniAMR as much as possible, while removing the complexity of manually managing MPI communications. We also allow the safe and efficient issuing of MPI communications in parallel from multiple tasks using TAMPI features. The previous code is simplified pseudocode, where each communication task sends or receives a single face. However, we added several program options to control the parallelism exposed in this phase, such as the number of communication tasks per neighbor and the number of communications in each direction.
We analyzed the performance of our miniAMR taskification using TAMPI and OmpSs-2, and compared it against the pure MPI and hybrid fork-join MPI+OpenMP baselines [4]. The testbed is the MareNostrum4 supercomputer, hosted at the Barcelona Supercomputing Center, which features dual-socket nodes with 24 cores per socket. The experiments scale from one node up to 256 nodes, totaling 12,288 cores.
Figure 2: Application figure of merit (top) and parallel efficiency (bottom).
Figure 2 shows the weak scaling analysis, where we double the input problem size as we double the number of nodes. The original pure MPI approach is called MPI-only, the fork-join MPI+OpenMP is called MPI+OMP, and our taskified version is called TAMPI+OSS. The MPI-only variant uses 48 ranks per node (filling up each node), whereas the hybrid approaches use 4 ranks per node and 12 cores per rank.
The upper plot shows the throughput in GFLOPS from 4 to 256 nodes for all three variants. We compute the throughput by dividing the total number of floating-point operations in the stencil phases by the total execution time. Our TAMPI+OSS approach achieves the best performance in all configurations, obtaining 1.50x and 1.49x speedups with respect to the MPI-only version on 128 and 256 nodes, respectively. We can also see that the fork-join MPI+OMP variant is far behind our approach and performs almost the same as the MPI-only version.
The bottom plot shows parallel efficiency, which measures how efficiently a code leverages the available resources, that is, how well it scales. Values range from 1 (perfect scaling) to 0 (lowest efficiency), and we compute them with respect to each variant’s throughput on one node. The results are given for the total execution time, and for the execution time excluding the refinement phase (marked as NR). The NR efficiency is useful to see the impact of the refinement phase on each variant’s scaling. Our TAMPI+OSS approach is the one that scales best across all configurations: its overall efficiency reaches 0.86 on 256 nodes, while the MPI-only and fork-join MPI+OMP variants reach 0.72 and 0.75, respectively. The NR efficiency of all variants is significantly higher than their overall efficiency, which means that the refinement phase has an important impact on the overall execution. Our NR efficiency reaches 0.94 on 256 nodes.
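For reference, and assuming the standard weak-scaling definition (where throughput should grow linearly with the node count), the two metrics in Figure 2 can be written as:

GFLOPS(N) = stencil floating-point operations / total execution time
Efficiency(N) = GFLOPS(N) / (N × GFLOPS(1))

where N is the number of nodes, so perfect weak scaling keeps Efficiency(N) at 1.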
We investigated the differences in performance by analyzing the graphical execution traces of all variants, and identified several reasons behind our improvements:
This last improvement results from executing the immediate successor task when a core finishes the execution of a task, thereby reusing its cached data and dramatically reducing cache misses.
This collaboration with the Barcelona Supercomputing Center revealed the benefits of adopting an asynchronous parallelism model across the full stack of programming models, and it highlights the importance of incorporating interoperability capabilities between MPI and OpenMP. The miniAMR example makes this evident through the large performance gains provided by the asynchronous transfers and task-aware message-passing capabilities of TAMPI and OmpSs-2.
Software advances like this one, alongside hardware improvements, are paramount to improving the performance and efficiency of future supercomputers. They can be applied to a broad set of HPC applications that address important societal problems, most of them aligned with the UN Global Goals for Sustainable Development. Supercomputing simulations are key to making progress towards those goals: climate change modeling drives zero-carbon policies, while disease understanding and drug design simulations improve the good health and well-being of humanity. Great examples of this are the recent COVID-19 simulations run on supercomputers around the world. AMR codes in particular are widely used in astrophysics and hydrodynamics problems, which are well aligned with providing improvements in industry, innovation and infrastructure.
This work is part of the program of the IEEE Cluster Conference 2020. More details on the development approach and experimental insights from our experience are available in our paper, “Towards Data-Flow Parallelization for Adaptive Mesh Refinement Applications”. The online presentation is also available to view on YouTube.
Read the Full Paper
Contact Alex Rico
This blog is part of a two-part series. Use the link below to catch up on part one:
Accelerating HPC with Advanced Programming Techniques