Parallel MPI jobs

There are three different mechanisms by which MPI jobs may be dispatched using Slurm. The recommended mechanism uses srun to directly launch tasks and initialize inter-process communication. More information on other mechanisms is available in the Slurm MPI and UPC Users Guide.

MPI dispatch with srun

These are example parallel job scripts using current best practices. They each run a 24-task MPI test job on two Janus nodes with a ten-minute time limit.
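
These scripts assume that mpi_test is an MPI executable you have already built; it is not provided by any module. A minimal, hypothetical compile step (the source file name mpi_test.c is only illustrative) uses an MPI compiler wrapper such as mpicc from whichever MPI module you load, for example:

module load intel/impi-13.0.0
mpicc -o mpi_test mpi_test.c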

Intel MPI

#!/bin/bash

#SBATCH --job-name mpi_test
#SBATCH --qos janus
#SBATCH --nodes 2
#SBATCH --ntasks-per-node 12
#SBATCH --time 00:10:00
#SBATCH --output mpi_test.out

# the slurm module provides the srun command
module load slurm

module load intel/impi-13.0.0

srun ./mpi_test

OpenMPI

#!/bin/bash

#SBATCH --job-name mpi_test
#SBATCH --qos janus
#SBATCH --nodes 2
#SBATCH --ntasks-per-node 12
#SBATCH --time 00:10:00
#SBATCH --output mpi_test.out

# the slurm module provides the srun command
module load slurm

module load openmpi/1.8.3_intel-13.0.0

srun ./mpi_test

MPICH

#!/bin/bash

#SBATCH --job-name mpi_test
#SBATCH --qos janus
#SBATCH --nodes 2
#SBATCH --ntasks-per-node 12
#SBATCH --time 00:10:00
#SBATCH --output mpi_test.out

# the slurm module provides the srun command
module load slurm

module load mpich/mpich-3.1.2_intel-13.0.0

# PMI-2 must be specified explicitly when using MPICH
srun --mpi=pmi2 ./mpi_test
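
Any of these scripts can be submitted with sbatch; the script file name below is only illustrative. Output is written to mpi_test.out, as set by the --output directive.

sbatch mpi_test.sh
squeue -u $USER    # monitor the job in the queue
cat mpi_test.out   # inspect the output once the job completes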

Multiple program, multiple data (MPMD)

Slurm supports multi-program MPI through the --multi-prog argument to srun. In this case, srun is given a configuration file in place of an executable; the file maps ranges of MPI ranks to executables and their arguments.

# mpmd-example.conf

0-9 ./a.out
10-23 ./b.out

Compared to the SPMD examples above, only the srun command changes: pass the MPMD configuration file with --multi-prog instead of naming a single executable.

srun --multi-prog mpmd-example.conf

More information is available in the srun manpage, under the heading "MULTIPLE PROGRAM CONFIGURATION."
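
The configuration file can also pass arguments to each program. According to that section of the manpage, the token %t expands to a task's global task number and %o to its offset within the listed range. A hypothetical variant of the configuration above (the arguments shown are arbitrary placeholders):

# mpmd-args-example.conf

0-9 ./a.out %t
10-23 ./b.out %o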

Custom task geometry

If your MPI program requires a custom task geometry, you will need to redefine the environment variable SLURM_TASKS_PER_NODE and use mpiexec to launch your program.

In this case we request 4 nodes with 12 tasks per node, giving a total of 48 task slots. However, we want the root rank (rank 0) to run alone on the first node, the next two nodes to run 12 tasks each (ranks 1-24), and the last node to run only 6 tasks (ranks 25-30), for 31 tasks in total.


#!/bin/bash

#SBATCH --job-name mpi_test
#SBATCH --qos janus
#SBATCH --nodes 4
#SBATCH --ntasks-per-node 12
#SBATCH --time 00:10:00
#SBATCH --output mpi_test.out

# the slurm module provides the srun command
module load slurm

module load intel/impi-13.0.0

# 1 task on the first node, 12 on each of the next two nodes, 6 on the last
export SLURM_TASKS_PER_NODE='1,12(x2),6'
mpiexec ./mpi_test
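
One quick way to sanity-check a custom geometry (a sketch, not part of the original example) is to launch hostname through mpiexec with the same settings and count the tasks per node; with the layout above you would expect counts of 1, 12, 12, and 6:

mpiexec hostname | sort | uniq -c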