In this article, I discuss the use of containers with the Message Passing Interface (MPI). Whatever your opinion of it, MPI applications still account for most High Performance Computing (HPC) workloads. Even Machine Learning does not really change that fact, since a significant fraction of machine learning frameworks rely on MPI for multi-node execution.
So, what does it mean to run MPI containers? What do I need to know, as a developer, before trying to create and run MPI containers? As a user, how can I manage my MPI containers if I am not an MPI expert? Using MPI can already be a daunting task, and adding a container runtime may seem to only make everything more complicated. And to some extent, that is true.
First, containers bring the same advantages as for non-MPI workloads, including help with portability, application packaging, data packaging, reproducibility, and software sharing between users.
But MPI does not understand what a container is. When I get on an HPC system, I usually get an mpirun command and most certainly a job manager. And this is where you, as a developer, need to make a choice based on your goals and the configuration of the target execution platforms. To help with these choices, I describe in this article the three execution models for MPI containers that are commonly accepted by the community: hybrid (the most popular model), embedded, and host-only. For each model, I give an overview of its positive and negative implications.
The hybrid model is by far the most popular for the execution of MPI applications in containers. With this model, the mpirun command on the host, or the job manager, is used to start the containers, which ultimately execute the MPI ranks. The model is called hybrid because it requires both an MPI implementation on the host and another implementation in the container image. The host MPI provides the mpirun command and, with most MPI implementations, a runtime capability on the compute nodes to start ranks or containers. The MPI in the container image is used to actually run the application. Of course, this means that both MPI implementations need to be “compatible”, since they tightly interact with each other.
For example, assuming Singularity containers and 2 MPI ranks, the command to start an MPI application looks like:
$ mpirun -np 2 singularity run ./my_hybrid_container.sif /path/to/app/in/container
This example explicitly shows that the mpirun on the host is used to start the two ranks. Also, for each rank, the command executed is in fact a Singularity command that first starts the container and then the rank within it.
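To make the hybrid model more concrete, here is a minimal sketch of what the definition file for such an image could look like. This is only an illustration: the base image, the Open MPI version, the installation prefix, and the application path are all assumptions, and the MPI version you pick must be compatible with the one on your target hosts.

```
Bootstrap: docker
From: ubuntu:20.04

%post
    # Install build tools, then build an MPI implementation inside the image.
    # Open MPI 4.0.5 is only an example version.
    apt-get update && apt-get install -y wget gcc g++ make bzip2 file
    wget https://download.open-mpi.org/release/open-mpi/v4.0/openmpi-4.0.5.tar.bz2
    tar xjf openmpi-4.0.5.tar.bz2
    cd openmpi-4.0.5 && ./configure --prefix=/opt/ompi && make -j4 install
    # Compile the application against the in-image MPI, for example:
    # /opt/ompi/bin/mpicc -o /path/to/app/in/container app.c

%environment
    export PATH=/opt/ompi/bin:$PATH
    export LD_LIBRARY_PATH=/opt/ompi/lib:$LD_LIBRARY_PATH
```

The key point is simply that the image carries its own complete MPI installation, which the host mpirun then interacts with at run time.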
A good question is what “compatibility” exactly means here. Ultimately, what users want to know is two-fold: if my target execution platform has MPI implementation X version Y, can I run my container that is based on MPI implementation X version Z? And if not, which version do I need to install on the host, and how am I supposed to install that specific version of MPI (including the configuration details)?
Given a specific implementation of MPI, for example Open MPI, there is therefore a need for compatibility matrices. For a given container runtime, such a matrix shows which versions of Open MPI in a container work with a specific version of Open MPI on the host. And this is where you will quickly notice whether a container runtime is HPC-friendly or not. For instance, to the best of my knowledge, there is no such official compatibility matrix for Docker. The Singularity ecosystem, on the other hand, includes a specific tool to automatically create such compatibility matrices. More details about this in a later blog article.
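In the absence of a published matrix, a first-order check you can script yourself is whether the host and container MPI versions belong to the same release series. This is only a heuristic sketch, not an official compatibility rule: same-series versions are *likely* to interoperate for many implementations, but the real matrix remains the authority.

```shell
#!/bin/sh
# Heuristic only: compare the major.minor release series of two MPI
# versions (e.g. host "4.0.5" vs container "4.0.2"). Same series is a
# reasonable first check before consulting an actual compatibility matrix.
same_series() {
    # $1: host MPI version, $2: container MPI version
    [ "$(echo "$1" | cut -d. -f1,2)" = "$(echo "$2" | cut -d. -f1,2)" ]
}

if same_series "4.0.5" "4.0.2"; then
    echo "same release series: likely compatible, but verify with the matrix"
else
    echo "different series: check the compatibility matrix carefully"
fi
```

In practice you would feed this function the output of `mpirun --version` from the host and from inside the image instead of hard-coded strings.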
Lastly, note that very few container ecosystems provide the required tools to assist developers and users. In theory, an MPI-friendly ecosystem would provide tools that, for instance, automatically generate such compatibility matrices and help users run their containers on a variety of HPC platforms.
Without such tools, it is the responsibility of the developer to track all relevant information and assist users when they try to run their containers on various HPC platforms. To the best of my knowledge, only Singularity provides such a tool: https://sylabs.io/articles/2019/11/create-run-and-manage-your-mpi-containers-in-a-few-steps.
The second model is called embedded. With this model, only the MPI implementation in the container is used; no MPI implementation is required on the host. This approach has the benefit of being extremely portable: the containers are self-contained and can be executed pretty much anywhere (at least from an MPI point of view). Unfortunately, this model requires a more advanced understanding of the MPI implementation to make sure that, when the first container starts, mpirun can be executed from within it. From there, the MPI implementation starts all other containers on all target compute nodes. In other words, it is the responsibility of the developer to ensure that the MPI implementation is correctly set up for all target execution platforms. This is usually a non-trivial task, especially when problems arise and require debugging.
Assuming Singularity containers and 2 MPI ranks, the command to start an MPI application looks like:
$ singularity exec ./my_embedded_container.sif mpirun -np 2 /path/to/app/in/container
This example illustrates that a Singularity container is started first, and mpirun within the container is then executed to start the MPI ranks. This assumes that the MPI implementation is set up to guarantee that, when an MPI rank is started on a compute node, a container is started first and then the rank within it. The user is responsible for ensuring this; I do not go into the technical details since they are implementation specific. Overall, this solution is very portable but technically challenging, because it requires a precise and detailed understanding of the MPI and container runtime configurations.
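To give a flavor of what such a setup can involve, here is one possible sketch for Open MPI, which lets you substitute the remote-launch command via its `plm_rsh_agent` MCA parameter. Everything here is an assumption for illustration: the image path, the hostnames, and even the choice of an ssh-style launcher depend on your MPI implementation and platform.

```shell
#!/bin/sh
# Sketch of an embedded-model launch-agent wrapper. The idea: every MPI
# daemon started on a remote node is itself wrapped in a container, so
# ranks always run inside the image. Image path is hypothetical.
IMAGE=/path/to/my_embedded_container.sif

# The rsh-style agent is invoked as "<agent> <host> <daemon args...>";
# this function builds the corresponding containerized ssh command.
build_remote_cmd() {
    host="$1"; shift
    echo "ssh $host singularity exec $IMAGE $*"
}

# A real wrapper would `exec` the built command; with Open MPI it could
# then be plugged in via: mpirun --mca plm_rsh_agent ./wrapper.sh ...
build_remote_cmd node01 orted
# prints: ssh node01 singularity exec /path/to/my_embedded_container.sif orted
```

Again, this is only one way to wire it up; other MPI implementations and launchers need different mechanisms, which is precisely why this model demands expertise.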
The last model is the host-only model. With this model, only the MPI implementation from the host is used to start and execute the MPI application in containers. This means that the application in the container image has been compiled with an MPI implementation that is “compatible” with the MPI available on the host. The term “compatible” has the same meaning as in the hybrid model. As a result, the container image is not as portable as with the other models. The advantage is the small size of the container, which does not need to include any MPI implementation. Instead, the MPI implementation from the host is mounted into the container and used by the application.
The following example illustrates how a host-only MPI container can be executed, assuming Singularity and 2 MPI ranks are used:
$ mpirun -np 2 singularity exec \
    -B /host/directory/where/mpi/is:/container/directory/where/mpi/is/assumed/to/be \
    ./my_host_only_container.sif /path/to/app/in/container
This example shows that the user of the container is responsible for figuring out in which directory on the host the MPI implementation is installed, and for mounting that directory into the container. As a result, this solution is potentially less portable: the container must be “prepared” for the MPI implementation available on the host. On the other hand, the image does not have to include any MPI and is therefore smaller.
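Since a wrong bind path is a common failure here, a small sanity check before launching can save debugging time. This is a sketch under an assumed directory layout (an MPI prefix containing `bin/mpirun`); adjust it to how MPI is actually installed on your host.

```shell
#!/bin/sh
# Verify that a candidate host directory really looks like an MPI
# installation (i.e. contains an executable bin/mpirun) before
# bind-mounting it into the container. Layout is an assumption.
check_mpi_dir() {
    [ -x "$1/bin/mpirun" ]
}

if check_mpi_dir /host/directory/where/mpi/is; then
    echo "host MPI found, safe to bind-mount"
else
    echo "no mpirun under that prefix, fix the bind path first"
fi
```

Running such a check in a job script, before the mpirun line, turns a confusing in-container failure into an immediate, readable error.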