One of the main issues in High-Performance Computing (HPC) systems is the underutilization of resources. Parallel applications partition and distribute compute and data across processors in the system, which work together to solve a given problem. In this operation, processors synchronize and communicate, which may lead to some of them sitting idle, waiting for other processors to complete their part. Idle processors mean wasted time and power. This can happen at a number of points in a program, most notably at global synchronizations, where load imbalance leaves some processors waiting, and in serial sections between parallel regions.
These issues are common in bulk synchronous parallel applications, especially those that statically assign work to processors. Tasking, on the other hand, promises to improve load balancing and increase system utilization, and has the potential for better locality. However, the adoption of tasking to program parallel and distributed systems has been slow. The lack of codes implemented in or ported to a tasking model is, in part, due to a limited number of success stories. Recent collaborative work between Arm Research and the Barcelona Supercomputing Center reports on the experience of porting an adaptive mesh refinement code from the US Exascale Computing Project Proxy App Suite. The lessons learned from this porting exercise are reported in the paper “On the Benefits of Tasking with OpenMP”, presented at the International Workshop on OpenMP 2019, which aims to serve as a reference for how to taskify applications and to encourage the adoption of tasking to improve application performance at scale.
Many codes rely on bulk synchronous parallelization constructs to distribute and synchronize work across multiple threads in a system. In this model, multiple threads operate in parallel on different parts of a problem and perform a global synchronization when the parallel work is completed. Fork-join is a similar model in which a single thread, sometimes called the master thread, is the application entry point. This thread forks into multiple threads that concurrently work on different parts of a problem, and then synchronize to join back into a single thread when the work in the parallel section is complete. An example of this model is OpenMP's worksharing-parallel constructs, which distribute the iterations of a loop across multiple threads.
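As an illustrative sketch of this model (not code from the paper; the problem size and the per-element work are placeholders), a worksharing-parallel loop in OpenMP splits the iterations among the threads of the team and joins them at an implicit barrier:

#define N 1000 //illustrative problem size
float array[N];
//...
//fork: a team of threads is created; loop iterations are divided among them
#pragma omp parallel for
for (int i=0; i<N; i++) {
  array[i] = array[i]*2.0f; //placeholder per-element work
}
//join: implicit barrier; execution continues on a single (master) thread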
Load imbalance appears when different threads receive an uneven amount of work, or perform the work at different speeds, leading to different amounts of compute time. In this scenario, the faster threads need to wait for the lagging threads at global synchronizations, sitting idle and wasting resources during that time. In the fork-join model, serial sections in between parallel regions become an increasing bottleneck, as parallel regions are shortened with increasing numbers of threads.
A task is a piece of compute that operates on a piece of data and that may be executed concurrently with other tasks. This parallel programming abstraction is intuitive and allows data-flow dependencies to be specified between tasks instead of costly global synchronizations. Task dependencies typically specify a piece of data produced by one task and consumed by another, so the consumer cannot execute until the producer completes, creating a producer-consumer relationship. This enables one-to-one synchronizations that may happen across computation phases, i.e., some compute in a second phase may execute as soon as the data it needs from the first phase is ready, without requiring a global synchronization between compute phases. This mitigates the idle time created by load imbalance, since threads pick up new work as they finish, and less time is spent in serial sections thanks to the reduced number of global synchronizations.
There are a number of programming models that support tasking. In the HPC space, OpenMP, Cilk, Threading Building Blocks, Chapel, OmpSs, X10, Charm++ and HPX are examples of programming models with tasking support. OpenMP, one of the most popular programming models in HPC, supports tasking through pre-processor directives on code blocks that specify individual tasks, and through taskloops, loops whose iteration chunks are executed as tasks.
float array[N];
int M=10; //task data size
//...
for (int i=0; i<N; i+=M) {
  #pragma omp task
  for (int j=i; j<i+M; j++) {
    //task computation
  }
}
The code above will execute each instance of the inner loop as a task that operates on M elements of the array. This pragma annotation specifies independent tasks that may be executed concurrently.
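The taskloop construct mentioned earlier can express a similar pattern more compactly by letting the runtime chunk the iterations into tasks. A minimal sketch, assuming the same array and chunk size M as above (the enclosing parallel and single regions are shown so that a single thread generates the tasks):

#pragma omp parallel
#pragma omp single
{
  //each chunk of M consecutive iterations becomes one task
  #pragma omp taskloop grainsize(M)
  for (int i=0; i<N; i++) {
    //task computation on array[i]
  }
}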
float array[N];
int M=10; //task data size
//...
//Phase 1
for (int i=0; i<N; i+=M) {
  #pragma omp task depend(out: array[i:M])
  for (int j=i; j<i+M; j++) {
    //task computation - production
  }
}
//...
//Phase 2
for (int i=0; i<N; i+=M) {
  #pragma omp task depend(in: array[i:M])
  for (int j=i; j<i+M; j++) {
    //task computation - consumption
  }
}
In the code above, the tasks generated in the first phase may execute concurrently with the tasks generated in the second phase, as long as dependencies are satisfied. Each task in the first phase writes (the out annotation) to a portion of the array (the M elements starting at index i), and each task in the second phase reads (the in annotation) from a portion of the array. A task reading from a given portion of the array must wait until the task writing to that same portion completes, but it may run concurrently with tasks writing to other portions of the array. In this way, inter-task dependencies avoid a global synchronization between the Phase 1 and Phase 2 loops.
The Exascale Computing Project develops a catalog of proxy applications: simplified codes that represent the key features of large, complex applications without incorporating their large and complex code bases (https://proxyapps.exascaleproject.org). MiniAMR is an adaptive mesh refinement proxy application that focuses on representing the behavior of applications simulating physical phenomena at multiple refinement levels. As an example, the figure below shows a miniAMR visualization of a 3-dimensional space represented at different refinement levels.
MiniAMR simulates objects moving through the 3D space. The edges of those objects are simulated at a finer granularity, while empty space is simulated at a coarser granularity. Adjacent cells in the space may differ by at most one granularity level, i.e., a cell may only be one level coarser or finer than its neighbors. The 3-dimensional space is partitioned into blocks that are distributed among the threads in the system, and each block keeps track of the halo values at the boundaries with other blocks. In each time step, the simulated objects move with a given speed and direction, and each thread exchanges the halo values of its blocks with the neighboring blocks. Since the granularity levels might differ, those values may be replicated (going from a coarser to a finer granularity) or averaged (going from a finer to a coarser granularity). With the updated values, the proxy app performs a stencil operation on each block's data to mimic a computation step of the real refinement application. An optional checksum validates that there were no errors in the time step. After a number of time steps, the 3D space is refined again using the new positions of the objects.
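As a rough illustration of those two halo-transfer cases (this is not miniAMR code; the 2x refinement ratio, the face size C and the function names are assumptions for the sketch), coarse-to-fine replicates each value into the corresponding fine cells, while fine-to-coarse averages groups of fine values into one coarse cell:

#define C 8 //illustrative coarse face size

//coarse-to-fine: replicate each coarse halo value into a 2x2 patch of fine cells
void replicate_face(const float coarse[C][C], float fine[2*C][2*C]) {
  for (int i=0; i<C; i++)
    for (int j=0; j<C; j++)
      fine[2*i][2*j] = fine[2*i][2*j+1] =
      fine[2*i+1][2*j] = fine[2*i+1][2*j+1] = coarse[i][j];
}

//fine-to-coarse: average each 2x2 patch of fine halo values into one coarse cell
void average_face(const float fine[2*C][2*C], float coarse[C][C]) {
  for (int i=0; i<C; i++)
    for (int j=0; j<C; j++)
      coarse[i][j] = 0.25f * (fine[2*i][2*j] + fine[2*i][2*j+1] +
                              fine[2*i+1][2*j] + fine[2*i+1][2*j+1]);
}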
The stencil computation and communication phases take most of the execution time and are therefore parallelized. In the OpenMP version that uses worksharing-parallel constructs in a fork-join model, the blocks of the 3D space are distributed among threads. Each thread performs a halo exchange for each of its blocks with the neighboring blocks, followed by a global synchronization until all threads finish. The global synchronization is necessary so that all communications complete before the stencil computation may start. This inherently generates load imbalance: each block may have a different refinement level, so different threads may have different amounts of work to do, e.g., a finer granularity leads to more blocks and therefore more communications. As a result, the threads working on blocks with a coarser granularity finish before the others, remain idle until the rest finish, and lead to underutilization of the system.
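A simplified sketch of that fork-join structure might look like the following; num_blocks, blocks, exchange_halos() and stencil() are hypothetical names standing in for miniAMR's actual data structures and routines:

//communication phase: blocks are statically distributed among threads
#pragma omp parallel for schedule(static)
for (int b=0; b<num_blocks; b++) {
  exchange_halos(&blocks[b]); //hypothetical per-block halo exchange
}
//implicit barrier: every halo exchange must finish before any stencil starts

//computation phase: again statically distributed, again followed by a barrier
#pragma omp parallel for schedule(static)
for (int b=0; b<num_blocks; b++) {
  stencil(&blocks[b]); //hypothetical per-block stencil computation
}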
In the taskified version, the communication code carries data-flow dependencies (a sketch is given below). The arrays barray and barray1 hold the current block's data and the neighboring block's data, respectively. In this case they are at the same granularity level, so the code simply copies the corresponding face of barray into the halo face of barray1, and the corresponding face of barray1 into the halo face of barray. Given that a task both reads and writes barray and barray1, it uses inout dependencies. This serializes tasks working on the same blocks while letting tasks that work on different blocks execute concurrently. It also enables the removal of the global synchronization between the communication and computation phases, allowing communication and stencil tasks to execute concurrently as long as they operate on different blocks.
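The original listing is not reproduced here, so the following is only a rough sketch of how such a same-level exchange could be expressed; barray and barray1 come from the description above, while block_size, face, copy_face() and opposite() are hypothetical placeholders:

//one task per pair of neighboring blocks at the same granularity level
#pragma omp task depend(inout: barray[0:block_size], barray1[0:block_size])
{
  //copy the touching face of barray into the halo face of barray1 ...
  copy_face(barray1, barray, face);           //hypothetical helper
  //... and the touching face of barray1 into the halo face of barray
  copy_face(barray, barray1, opposite(face)); //hypothetical helper
}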
The timelines above show the activity of multiple threads running miniAMR over time, with different colors for different operations. The timeline labeled ‘Loop’ shows execution using worksharing-parallel constructs. The light green activity is communication, and the white space is idle time, where threads with less work finish earlier and must wait for the global synchronization before the stencil computation may start. The timeline labeled ‘Task-2’ uses the taskified version shown before. Here there are no global synchronizations, and communication, stencil and checksum tasks may execute concurrently as long as they operate on different blocks. This leads to full utilization of the system by overlapping operations across application phases.
We analyzed the performance of multiple miniAMR versions using the fork-join and tasking models on multiple platforms, compilers and runtime systems. The Orig and Loop versions use fork-join, and the Task-1 and Task-2 versions use tasking. Orig-dyn and Loop-dyn use fork-join with dynamic scheduling of iteration chunks within each parallel loop. This means that load imbalance may be mitigated within a loop, but there is still a global synchronization at the end of each parallel loop, and therefore no overlap between application phases.
All versions were run on Marvell ThunderX2, IBM POWER9, Intel Skylake-SP and AMD EPYC systems, using multiple thread counts on each system. The results are normalized to the Loop version. Overall, tasking shows a benefit, especially with increasing numbers of threads, due to higher utilization. When crossing Non-Uniform Memory Access (NUMA) domains, however, such as socket or die boundaries, the dynamically scheduled versions, including those using tasking, underperform due to worse locality. This shows that the work-time inflation resulting from worse data locality may outweigh the improvements in system utilization and, therefore, that more intelligent, locality-aware scheduling policies are needed when running across NUMA domains.
This collaboration with the Barcelona Supercomputing Center revealed that tasking lives up to its promises by better balancing load, improving system utilization and enabling overlapped execution across application phases. There is potential for improved locality by co-locating tasks that work on the same data back-to-back on the same threads, but this optimization is not present in existing runtime systems. As a result, locality degradation becomes visible when using dynamic scheduling across NUMA domains, which calls for work on locality-aware scheduling policies.
The taskification process and lessons learned in this work serve as a reference on how to achieve higher system utilization and reduce time to solution of parallel applications running on HPC facilities.
This work is part of the program of the International Workshop on OpenMP 2019. More details on our developments, experimental platforms and insights into our results are available in our paper “On the Benefits of Tasking with OpenMP”.
Read the Paper