In a prior blog post, I talked about the opportunity of using a task-parallel programming model to improve performance and system utilization on a multi-core compute node. The main challenges addressed by tasking are the load imbalance suffered across different threads of computation, and the limitations of the fork-join parallel model in efficiently exploiting concurrent execution. The results showed better scaling and performance using tasking for shared memory code on multiple processors from different vendors. Performance improvements were between 10 and 20 percent, with up to 35 percent lower execution time. These optimizations lead to faster simulation turnaround times, accelerating scientific progress for High-Performance Computing (HPC) users around the globe.
That research focused on shared memory parallel execution in a single node of computation using OpenMP. OpenMP is the number one parallel programming model for shared memory parallelism and accelerator offload in HPC. However, large supercomputers are not single shared-memory systems, but rather a collection of compute nodes, each with its own memory, connected through a high-bandwidth low-latency network. Message passing is the preferred programming paradigm for such distributed memory systems. Message Passing Interface (MPI) is the major parallel programming model for HPC distributed memory systems.
Most scientific and engineering applications are parallelized using pure MPI strategies, where each compute thread from the participating compute nodes works on a subdomain of the total problem domain. In the context of MPI, each compute thread is referred to as an MPI process or rank. The technique of partitioning a large domain into smaller subdomains that are distributed among different ranks is called domain decomposition. Generally, this partitioning requires the exchange of subdomain boundary data among ranks that are neighbors in the domain space. In this case, the neighboring ranks exchange messages by sending and receiving the boundary data using the MPI interface. However, pure MPI strategies are not the best option to exploit intra-node parallelism. The application becomes more sensitive to load imbalance, overlapping different application phases across ranks becomes complex, and ranks on the same node communicate through unnecessary explicit messages instead of using the shared memory space.
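To make this baseline concrete, the sketch below shows what the boundary exchange of a one-dimensional domain decomposition might look like in a pure MPI code. It is only an illustration: the buffer layout, the neighbor ranks, and the function name are hypothetical, not taken from the paper.

#include <mpi.h>

// Minimal 1D halo exchange sketch (hypothetical buffers and neighbors).
// Each rank owns n interior cells, u[1..n], plus two ghost cells, u[0] and u[n+1].
void exchange_boundaries(double *u, int n, int rank, int nranks, MPI_Comm comm)
{
    int left  = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < nranks - 1) ? rank + 1 : MPI_PROC_NULL;

    // Send my first interior cell to the left, receive the right neighbor's into u[n+1].
    MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left, 0,
                 &u[n + 1], 1, MPI_DOUBLE, right, 0, comm, MPI_STATUS_IGNORE);
    // Send my last interior cell to the right, receive the left neighbor's into u[0].
    MPI_Sendrecv(&u[n], 1, MPI_DOUBLE, right, 1,
                 &u[0], 1, MPI_DOUBLE, left, 1, comm, MPI_STATUS_IGNORE);
}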
Recent collaborative work between Barcelona Supercomputing Center and Arm Research reports on the experience of taskifying an Adaptive Mesh Refinement code from the US Exascale Computing Project, at both the OpenMP and MPI levels. Performance results and takeaways from enabling the interoperability of both shared-memory and distributed-memory libraries are published in the paper “Towards Data-Flow Parallelization for Adaptive Mesh Refinement Applications” presented at the IEEE Cluster 2020 conference. The paper includes details on the tasking methodology, which exploits an automatic full-stack load-balancing and communication-computation overlap to achieve better scaling, higher system utilization, efficiency and performance.
Before diving into advanced programming techniques, we need to introduce the basic programming strategies used in scientific codes, apart from the pure MPI approach. The most common alternative is hybrid parallel programming. The combination of MPI and OpenMP to parallelize large-scale scientific codes brings an opportunity to exploit the best of both worlds while mitigating their weaknesses. Hybrid MPI+OpenMP applications create a set of MPI ranks, and each of those ranks can then execute a set of OpenMP threads.
Typically, scientific applications feature iterative algorithms. An iteration usually performs a time step in a simulation, featuring a computation section that operates on the data, and a communication section where ranks exchange the updated data for the next iteration. Commonly, in a hybrid MPI+OpenMP code, the computation section has all OpenMP threads in all MPI ranks working, and the communication section has MPI ranks passing messages. The communication section is usually executed serially by the master thread (shown in blue in the figure below). This simple approach is typically named fork-join in the context of hybrid programming. The following figure depicts a timeline of a hybrid MPI+OpenMP application.
Figure 1: Typical hybrid MPI+OpenMP execution model
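As a rough sketch of this execution model, one time step of a fork-join hybrid code could look as follows; compute_cell is a hypothetical per-cell update, exchange_boundaries is the routine sketched earlier, and the loop bounds are placeholders.

// Fork-join hybrid sketch: parallel computation, then serial communication.
for (int step = 0; step < nsteps; ++step) {
    // Computation section: all OpenMP threads of this rank update the subdomain.
    #pragma omp parallel for
    for (int i = 1; i <= local_cells; ++i)
        compute_cell(u, i);

    // Communication section: executed serially by the master thread,
    // while the other OpenMP threads sit idle (the blue region in Figure 1).
    exchange_boundaries(u, local_cells, rank, nranks, MPI_COMM_WORLD);
}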
MPI has the advantage of being fully parallel and of having inherent data locality. All ranks in an MPI application run independently from initialization to the end of the execution. They work on a private partition or copy of the data, which prevents unwanted shared-data issues. OpenMP, on the other hand, is inherently serial and opens parallelism only in parallel sections, which work on shared data. It may also suffer from remote caching effects and coherence artifacts, such as false sharing. Working on shared data, however, has the advantage of avoiding data copying for message passing, given that all threads can access a single copy of the data. Combining both approaches leverages a consolidated message-passing model to communicate across the distributed-memory system, while each rank runs lightweight OpenMP threads that exploit shared data, reducing overall data copying.
A programmer could definitely program with OpenMP following a similar scheme. They could have full concurrent execution from beginning to end, and distribute work across threads while still accessing shared data. However, it is common practice, unfortunately, to parallelize with OpenMP following a bottom-up approach: parallelizing individual loops and keeping serial sections in between. This imposes the scaling limits stated by Amdahl’s Law.
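For illustration, the bottom-up pattern typically looks like the sketch below, where only the loops run in parallel and everything in between remains serial; f, g, and update_global_state are hypothetical placeholders.

// Bottom-up OpenMP sketch: individual loops are parallelized, the code
// in between stays serial and caps the achievable speedup (Amdahl's Law).
#pragma omp parallel for
for (int i = 0; i < n; ++i)
    a[i] = f(b[i]);                  // first parallel loop

update_global_state(a, n);           // serial section between the loops

#pragma omp parallel for
for (int i = 0; i < n; ++i)
    c[i] = g(a[i]);                  // second parallel loop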
The common configuration of ranks in hybrid applications is one rank per compute node, or one rank per Non-Uniform Memory Access (NUMA) node. OpenMP threads in an MPI rank implicitly communicate through shared data structures in the shared memory space, instead of exchanging MPI messages. Leveraging one rank per NUMA node usually improves data locality since threads in a given rank access the same NUMA node’s memory. Accessing memory from different NUMA nodes introduces significant memory latency differences across threads, leading to imbalanced scenarios.
This hybrid model provides the advantages of both approaches, but it still leaves opportunities on the table. Features such as asynchronous transfers (e.g., MPI_Isend/MPI_Irecv) provide some of the benefits of the hybrid model by allowing some communication and computation overlap. However, a fork-join model with global synchronizations (shown in green in Figure 1) limits the amount of computation-communication overlap, as well as the overlap of execution across different iterations on different ranks. To break away from the fork-join model and exploit higher levels of parallelism, along with the asynchronous computation and communication that tasking provides, the MPI and OpenMP libraries need to work together.
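For reference, the asynchronous-transfer pattern mentioned above usually takes a form similar to the sketch below, with hypothetical buffers, neighbor rank, and compute routines; the overlap ends at the MPI_Waitall call, and the global fork-join synchronizations of Figure 1 remain.

// Overlapping a boundary exchange with interior computation using
// non-blocking MPI (hypothetical buffers, neighbor, and compute routines).
MPI_Request reqs[2];

MPI_Irecv(ghost_in,  count, MPI_DOUBLE, neighbor, 0, MPI_COMM_WORLD, &reqs[0]);
MPI_Isend(ghost_out, count, MPI_DOUBLE, neighbor, 0, MPI_COMM_WORLD, &reqs[1]);

compute_interior(u);                        // work that does not need the ghost cells

MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);  // overlap ends here
compute_boundary(u, ghost_in);              // work that needs the received data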
Such interoperability does not exist today: both libraries work independently of each other, and orchestration between the two is left to the programmer. In the current MPI and OpenMP standards, executing MPI communication operations in concurrent tasks (for example, tasks exchanging the boundaries of the subdomain in parallel) is both unsafe and inefficient.
On the one hand, it is unsafe to call blocking MPI functions from concurrent tasks. Notice that blocking MPI operations block the current thread inside the MPI library until the operation completes. Figure 2 illustrates the problem. We assume a hybrid application with two MPI ranks: one instantiating multiple concurrent tasks to send different chunks of data, and the other instantiating the same number of concurrent tasks to receive the data. We also assume they call the common blocking MPI_Send and MPI_Recv functions to send and receive each chunk, and that each chunk's message is tagged with its chunk identifier.
The program could hang if the number of communication tasks is greater than the number of OpenMP threads that can run them, which in this case is two per rank (one per core). Because the communication tasks are concurrent, the OpenMP scheduler can freely decide their execution order based on the scheduling policy and the execution circumstances. Since the execution order is not guaranteed to be the same in both ranks, the running tasks could be trying to exchange distinct sets of chunks. This blocks the two OpenMP threads of both ranks inside the MPI library, provoking a deadlock. Notice that when an OpenMP thread blocks inside the MPI library, the OpenMP thread scheduler has no way of knowing that the thread is blocked, so it cannot schedule another OpenMP thread on that core. Thus, the core cannot execute other ‘ready’ communication tasks in the meantime.
Figure 2: Lack of MPI-OpenMP operability may lead to deadlock for tasks with MPI calls
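A minimal sketch of this hazardous pattern is shown below for the sender rank (the receiver mirrors it with MPI_Recv); NCHUNKS, chunk, CHUNK_SIZE, and peer are hypothetical names used only for illustration.

// Hazardous pattern: concurrent tasks issuing blocking MPI calls.
// With more tasks than OpenMP threads, and the two ranks scheduling
// the tasks in different orders, the program can deadlock.
for (int c = 0; c < NCHUNKS; ++c) {
    #pragma omp task firstprivate(c)
    {
        // Blocks the whole OpenMP thread inside the MPI library until the
        // matching MPI_Recv with tag 'c' is posted on the peer rank.
        MPI_Send(chunk[c], CHUNK_SIZE, MPI_DOUBLE, peer, /*tag=*/c, MPI_COMM_WORLD);
    }
}
#pragma omp taskwait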
On the other hand, issuing MPI operations from tasks is usually inefficient. The communication tasks require artificial data dependencies to enforce the same execution order in all ranks and prevent the deadlock situation described above. Non-blocking MPI operations (for example, MPI_Irecv), which initiate the operation and return an MPI request to check its completion later, are also difficult to manage inside tasks. The user would be responsible for manually checking the MPI requests, resulting in inefficient algorithms in most cases.
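One common workaround, as mentioned above, is to impose the same order on all ranks through an artificial dependency. The sketch below serializes the communication tasks of the previous example with a hypothetical sentinel variable, which avoids the deadlock but also removes any concurrency between them.

// Workaround without TAMPI: an artificial inout dependency on a sentinel
// forces the communication tasks to execute one at a time, in program order,
// on every rank.
int comm_sentinel;   // hypothetical dummy variable, used only for ordering

for (int c = 0; c < NCHUNKS; ++c) {
    #pragma omp task firstprivate(c) depend(inout: comm_sentinel)
    {
        MPI_Send(chunk[c], CHUNK_SIZE, MPI_DOUBLE, peer, /*tag=*/c, MPI_COMM_WORLD);
    }
}
#pragma omp taskwait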
The Task-Aware MPI (TAMPI) library aims to overcome all these limitations by allowing the safe and efficient execution of blocking and non-blocking MPI operations from within tasks, in both the OpenMP and OmpSs-2 tasking models. When a task calls a blocking MPI function (for example, MPI_Recv), the library pauses the task until the operation completes, allowing other ‘ready’ tasks to execute on that core in the meantime. The library also defines a TAMPI variant for each non-blocking MPI function (for example, TAMPI_Irecv). These functions are non-blocking and asynchronous, binding the completion of the calling task to the finalization of the corresponding non-blocking operation that they represent (for example, MPI_Irecv). The function returns immediately, so the task can finish its execution even if the MPI operation has not finished yet. A task is considered complete only when its execution has finished and all its pending MPI operations have completed.
Figure 3: HPC software stack with MPI and OpenMP interoperability through TAMPI.
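Before the non-blocking example further below, here is a minimal sketch of the blocking case under the same hypothetical setup as before: the tasks call plain MPI_Recv, and TAMPI pauses a blocked task instead of the whole thread. We assume TAMPI's blocking support is requested at initialization through the extended MPI_TASK_MULTIPLE threading level described in the library's documentation.

// At program start: request TAMPI's blocking support (extended threading level).
int provided;
MPI_Init_thread(&argc, &argv, MPI_TASK_MULTIPLE, &provided);

// Receiver side: concurrent tasks calling a blocking MPI function.
for (int c = 0; c < NCHUNKS; ++c) {
    #pragma omp task firstprivate(c) depend(out: chunk[c])
    {
        // Safe with TAMPI: if this call blocks, only the task is paused,
        // and the core can run other 'ready' tasks in the meantime.
        MPI_Recv(chunk[c], CHUNK_SIZE, MPI_DOUBLE, peer, /*tag=*/c,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
}
#pragma omp taskwait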
We show an example of how to use TAMPI support for non-blocking operations in the code below. The program concurrently receives and processes multiple integers using tasks. The first task is the receiver, which calls the TAMPI_Irecv function to start the receive operation, making the completion of the task dependent on the finalization of that operation. Note that it declares an output dependency on the buffer used to receive the data (that is, data will be written into the buffer). The TAMPI function may return immediately with the operation still ongoing, so the buffer cannot be consumed inside that task. Instead, we consume it in the successor task below, which takes the buffer as an input dependency. In this way, when the MPI operation finalizes, the TAMPI library transparently completes the receiver task and satisfies the consumer task’s input dependencies, so the consumer eventually runs and processes the received data. This is how the TAMPI library allows developers to perform efficient and safe communications from multiple tasks in parallel.
int recvdata[N];
MPI_Status status[N];

for (int i = 0; i < N; ++i) {
    #pragma omp task depend(out: recvdata[i], status[i])
    {
        int tag = i;
        // Non-blocking and asynchronous: binds the completion of this task
        // to the finalization of the receive operation.
        TAMPI_Irecv(&recvdata[i], 1, MPI_INT, 0, tag, MPI_COMM_WORLD, &status[i]);
        // recvdata[i] cannot be accessed yet
    }

    #pragma omp task depend(in: recvdata[i], status[i])
    {
        check_status(&status[i]);
        consume_data(&recvdata[i]);
    }
}
#pragma omp taskwait
By leveraging a tasking model such as OpenMP or OmpSs-2 together with the TAMPI library, we can perform an effective taskification of most applications with computation and communication sections. This leads to an effective overlap of computation and communication, inherently provided by the tasking model. Developers can then focus on exposing their application’s parallelism instead of worrying about low-level aspects, such as the handling of MPI operations issued by tasks, which is hidden inside TAMPI. This approach also enables a top-down parallelization strategy that taskifies high-level functions, instead of the inefficient bottom-up strategy seen in fork-join approaches.
So far, we have explored the problems imposed by the lack of interoperability between MPI and OpenMP, and how it hinders MPI-level tasking. We also discussed a proposal to provide such interoperability, implemented in TAMPI. In the second part of this blog, we look at how we applied the proposed approach to an adaptive mesh refinement application. The resulting code employs tasking across MPI and OpenMP and achieves significant speedups on up to 12,288 cores. In the meantime, if you have any questions, please do reach out.
Contact Alex Rico
This blog is part of a two-part series. Use the link below to read part two:
Accelerating HPC with Advanced Programming Technique (2/2)