slurm, mpi, guix and plafrim


Supercomputers rely on job schedulers to handle resources (compute nodes). In the case of plafrim, as for about 60% of today's Top 500 supercomputers, slurm is in charge of that, as detailed in the plafrim reference.

We assume that you are all set up with your plafrim access and guix environment and that you have logged into the platform:

ssh plafrim-hpcs

You should now be on the frontal node (mistral01).


We will now use slurm to run on the compute nodes of the platform (and discover which they are).

1. slurm basics

slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. slurm requires no kernel modifications for its operation and is relatively self-contained. You may want to read an overview of its architecture in the slurm reference.

You can first check out with sinfo whether there are partitions with available (idle) compute nodes:

sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
hpc up 8:00:00 1 drain* miriel007
hpc up 8:00:00 12 down* miriel[008-016,019,022,024]
hpc up 8:00:00 19 idle miriel[001-006,017-018,020-021,023,025-032]
sirocco up 10:00:00 1 drain sirocco06
mistral up 3-00:00:00 2 down* mistral[04,18]
mistral up 3-00:00:00 15 idle mistral[02-03,05-17]

In this case, there are 19 miriel compute nodes available on the hpc partition.

We can monitor our jobs (and the nodes they got allocated) with squeue, possibly restricting the view to our own jobs with the -u option:

squeue -u $USER

At this point, you have no jobs running; it's time to launch a few! As explained in the quick start slurm reference, on which this tutorial is based, there are three main ways of using slurm: direct usage of srun, interactive allocation with salloc, and batch execution with sbatch. We'll see later on that srun may optionally be substituted with mpirun for mpi applications (when the support is turned on).
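As a quick preview (a sketch using the hpc partition as above; each way is detailed in the following sections, and job.sh is a hypothetical file name):

```shell
# (1) srun: allocate resources and launch tasks in a single command
srun -p hpc -N 2 hostname

# (2) salloc: interactive allocation, then launch job steps with srun
salloc -p hpc -N 2     # spawns a shell holding the allocation
srun hostname          # executes within the allocation
exit                   # releases the allocation

# (3) sbatch: submit a batch script for later execution
sbatch job.sh          # job.sh contains #SBATCH directives and srun calls
```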

1.1. Direct usage of srun

It is possible to create a resource allocation and launch the tasks for a job step in a single command line with the srun command. We'll see later on that, depending upon the mpi implementation used, mpi jobs may also be launched in this manner, but for now we set mpi aside and illustrate srun with a simple example displaying the hostname.

1.1.1. Default usage (without guix)

We saw above that the partition we want to run on is called hpc, so we instruct srun to use that partition with the -p option:

srun -p hpc /bin/hostname

We observe that the execution is performed on a miriel compute node. Only one message is printed because you ran only one process, or task in slurm terminology. You may want to use the -l option, which prepends the task number to each line of stdout/stderr:

srun -p hpc -l /bin/hostname
0: miriel023.formation.cluster

You will observe that the message is now preceded by task id 0, the unique task allocated for this execution. You can run more tasks as follows. With the -N (or --nodes=) option, you can request that a (minimum) number N of nodes be allocated to the job:

srun -l -p hpc -N 2 hostname
0: miriel017.formation.cluster
1: miriel018.formation.cluster

You may observe that two processes (with ids 0 and 1) are running on two different miriel compute nodes. By default, only one task per node is allocated. You may however observe that 24 CPU cores are available on a miriel node (here miriel017) by monitoring the value of CPUTot reported by the following scontrol command (see the slurm refcard):

scontrol show node miriel017
NodeName=miriel017 Arch=x86_64 CoresPerSocket=12
CPUAlloc=0 CPUErr=0 CPUTot=24 CPULoad=0.01
NodeAddr=miriel017 NodeHostName=miriel017 Version=17.11
1 SMP Thu Nov 8 23:39:32 UTC 2018
RealMemory=128000 AllocMem=0 FreeMem=127134 Sockets=2 Boards=1
State=IDLE ThreadsPerCore=1 TmpDisk=227328 Weight=1 Owner=N/A MCS_label=N/A
BootTime=2019-10-31T09:23:47 SlurmdStartTime=2019-10-31T09:54:58
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

You may thus want to use all available 24 CPU cores per node by setting the --ntasks-per-node=24 option:

srun -p hpc -N 2 --ntasks-per-node=24 -l /bin/hostname | sort -n
0: miriel017.formation.cluster
1: miriel017.formation.cluster
2: miriel017.formation.cluster
3: miriel017.formation.cluster
4: miriel017.formation.cluster
5: miriel017.formation.cluster
6: miriel017.formation.cluster
7: miriel017.formation.cluster
8: miriel017.formation.cluster
9: miriel017.formation.cluster
10: miriel017.formation.cluster
11: miriel017.formation.cluster
12: miriel017.formation.cluster
13: miriel017.formation.cluster
14: miriel017.formation.cluster
15: miriel017.formation.cluster
16: miriel017.formation.cluster
17: miriel017.formation.cluster
18: miriel017.formation.cluster
19: miriel017.formation.cluster
20: miriel017.formation.cluster
21: miriel017.formation.cluster
22: miriel017.formation.cluster
23: miriel017.formation.cluster
24: miriel018.formation.cluster
25: miriel018.formation.cluster
26: miriel018.formation.cluster
27: miriel018.formation.cluster
28: miriel018.formation.cluster
29: miriel018.formation.cluster
30: miriel018.formation.cluster
31: miriel018.formation.cluster
32: miriel018.formation.cluster
33: miriel018.formation.cluster
34: miriel018.formation.cluster
35: miriel018.formation.cluster
36: miriel018.formation.cluster
37: miriel018.formation.cluster
38: miriel018.formation.cluster
39: miriel018.formation.cluster
40: miriel018.formation.cluster
41: miriel018.formation.cluster
42: miriel018.formation.cluster
43: miriel018.formation.cluster
44: miriel018.formation.cluster
45: miriel018.formation.cluster
46: miriel018.formation.cluster
47: miriel018.formation.cluster

You will observe that processes 0-23 are executed on one node (miriel017 in our case) whereas processes 24-47 are executed on a second node (miriel018 here). You may want to do further reading on slurm resource binding policies here and there, or directly check out the output of srun --cpu-bind=help.

Alternatively, one may specify with the -n option the total number of tasks to run over all the nodes (note that the --cpus-per-task option would change how many CPUs each task gets). In this case you do not need to specify the number of nodes; hence, the above command is equivalent to both of the following commands:

srun -p hpc -N 2 -n 48 -l /bin/hostname | sort -n

and (where -N 2 is implicit):

srun -p hpc -n 48 -l /bin/hostname | sort -n

srun can also be used to enter an interactive shell session:

srun -p hpc -N 2 --ntasks-per-node=24 --pty bash

However, salloc may well be more convenient for that purpose, as discussed below.

1.1.2. guix-based usage

As of today, here is the default slurm version deployed on plafrim:

srun --version
slurm 17.11.3-2
  1. Using srun provided by guix environment

    Instead of the one provided by default on plafrim (slurm 17.11.3-2), we can let guix provide it. There is a risk though: the slurm client provided by guix must be able to speak with the slurm daemon (see the architecture overview). In our case, this should be fine, since guix provides the matching 17.11.3 version:

    guix environment --pure --ad-hoc slurm -- srun --version
    slurm 17.11.3

    We can therefore check out the hostname with the srun command provided by guix:

    guix environment --pure --ad-hoc slurm -- srun -l -p hpc -N 2 /bin/hostname
    0: miriel017.formation.cluster
    1: miriel018.formation.cluster

    Or, even better, check out the hostname with both the srun and hostname (through the inetutils package for the latter) commands provided by guix:

    guix environment --pure --ad-hoc slurm inetutils -- srun -l -p hpc -N 2 hostname
    0: miriel017.formation.cluster
    1: miriel018.formation.cluster

    In this workflow, as the --pure option cleans up the environment before srun is invoked, slurm environment variables are naturally kept:

    guix environment --pure --ad-hoc slurm coreutils -- srun -l -p hpc -N 2 env | grep SLURM_JOB_NODELIST
    0: SLURM_JOB_NODELIST=miriel[017-018]
    1: SLURM_JOB_NODELIST=miriel[017-018]
  2. Using default srun provided by plafrim for launching a guix environment

    If slurm was deployed by the system administrators independently from guix, as is likely the case on a production cluster or supercomputer as of today (and as is the case on plafrim), one may want to rely on their srun command to make sure the client and the daemon are well interfaced. It is still possible to deploy a guix environment in this case, as follows. A first option is to provide the full path /usr/bin/srun:

    guix environment --pure --ad-hoc inetutils -- /usr/bin/srun -l -p hpc -N 2 hostname
    0: miriel006.formation.cluster
    1: miriel009.formation.cluster

    This might be fully satisfactory when using the --pure option and one may decide to rely on it during the hands-on sessions of the school.

    However, while this option does clean up the shell environment in a manner equivalent to env -i (see the env manpage with man env), it does not fully isolate the process. Sometimes it is desirable to isolate the environment as much as possible, for maximal purity and reproducibility. In particular, it may be desirable to prevent access to /usr/bin and other system-wide resources from the development environment. The --container (or -C) option provided by guix environment allows one to do so by relying on Linux container functionality (namespaces). You may read this for further information. This feature requires linux-libre 3.19 or newer and can be activated as discussed here (or there) if you deploy guix yourself. Unfortunately, plafrim currently provides an older version (3.10) of the linux kernel. For information, here is the report of file system disk space usage on a machine providing the mechanism:

    guix environment --ad-hoc --container coreutils -- df
    Filesystem 1K-blocks Used Available Use% Mounted on

    One can see that /usr/ and /etc/ cannot interfere in such a configuration.

    Because the --container option is currently not available on plafrim, we pursue our discussion using the --pure option (but the --container option should be preferred on systems where it is available if one wants to ensure reproducibility at an even higher level).

    srun -l -p hpc -N 2 guix environment --pure --ad-hoc inetutils -- hostname
    1: miriel018.formation.cluster
    0: miriel017.formation.cluster

    In this workflow, as the --pure (or --container) option cleans up the environment after srun is invoked, slurm environment variables are lost:

    srun -l -p hpc -N 2 guix environment --pure --ad-hoc slurm coreutils -- env | grep SLURM_JOB_NODELIST

    While this is not a problem when the goal is simply to display the hostname, it may be critical in more realistic situations. We can then preserve those variables with the --preserve guix environment option, which takes a regexp as argument, ^SLURM in the present case, to preserve the slurm environment variables starting with "SLURM":

    srun -l -p hpc -N 2 guix environment --pure --preserve=^SLURM --ad-hoc slurm coreutils -- env | grep SLURM_JOB_NODELIST
    0: SLURM_JOB_NODELIST=miriel[017-018]
    1: SLURM_JOB_NODELIST=miriel[017-018]

    As above, it is possible to enter an interactive bash session, but this time within a guix environment, with bash itself provided by guix:

    srun -N 2 -p hpc --pty guix environment --pure --ad-hoc slurm bash -- bash --norc

    Yet, we'll discuss below how to use salloc with guix, which may be more convenient to conduct such interactive sessions.

    To illustrate once again the symmetry of both approaches, let us dig further into the hardware topology using hwloc, either with srun within a guix environment:

    guix environment --pure --ad-hoc slurm hwloc -- srun -p hpc hwloc-ls

    (result omitted) or, equivalently, guix environment within srun:

    srun -p hpc guix environment --pure --ad-hoc hwloc -- hwloc-ls
    Machine (128GB total)
      L3 L#0 (15MB)
        NUMANode L#0 (P#0 32GB)
        L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
        L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 (P#2)
        L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 + PU L#2 (P#4)
        L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 + PU L#3 (P#6)
        L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4 + PU L#4 (P#8)
        L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5 + PU L#5 (P#10)
        PCI 03:00.0 (InfiniBand)
          Net "ib0"
          OpenFabrics "qib0"
        PCI 01:00.0 (Ethernet)
          Net "em1"
        PCI 01:00.1 (Ethernet)
          Net "em2"
        PCI 00:11.4 (SATA)
          Block(Disk) "sda"
        PCI 06:00.0 (Ethernet)
          Net "em3"
        PCI 06:00.1 (Ethernet)
          Net "em4"
        PCI 0a:00.0 (VGA)
        PCI 00:1f.2 (SATA)
      L3 L#1 (15MB)
        NUMANode L#1 (P#2 32GB)
        L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6 + PU L#6 (P#12)
        L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7 + PU L#7 (P#14)
        L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8 + PU L#8 (P#16)
        L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9 + PU L#9 (P#18)
        L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10 + PU L#10 (P#20)
        L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11 + PU L#11 (P#22)
      L3 L#2 (15MB)
        NUMANode L#2 (P#1 32GB)
        L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12 + PU L#12 (P#1)
        L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13 + PU L#13 (P#3)
        L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14 + PU L#14 (P#5)
        L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15 + PU L#15 (P#7)
        L2 L#16 (256KB) + L1d L#16 (32KB) + L1i L#16 (32KB) + Core L#16 + PU L#16 (P#9)
        L2 L#17 (256KB) + L1d L#17 (32KB) + L1i L#17 (32KB) + Core L#17 + PU L#17 (P#11)
        PCI 82:00.0 (Fabric)
          Net "ib1"
          OpenFabrics "hfi1_0"
      L3 L#3 (15MB)
        NUMANode L#3 (P#3 32GB)
        L2 L#18 (256KB) + L1d L#18 (32KB) + L1i L#18 (32KB) + Core L#18 + PU L#18 (P#13)
        L2 L#19 (256KB) + L1d L#19 (32KB) + L1i L#19 (32KB) + Core L#19 + PU L#19 (P#15)
        L2 L#20 (256KB) + L1d L#20 (32KB) + L1i L#20 (32KB) + Core L#20 + PU L#20 (P#17)
        L2 L#21 (256KB) + L1d L#21 (32KB) + L1i L#21 (32KB) + Core L#21 + PU L#21 (P#19)
        L2 L#22 (256KB) + L1d L#22 (32KB) + L1i L#22 (32KB) + Core L#22 + PU L#22 (P#21)
        L2 L#23 (256KB) + L1d L#23 (32KB) + L1i L#23 (32KB) + Core L#23 + PU L#23 (P#23)

1.2. Interactive allocation through salloc (before running with srun)

1.2.1. Default usage (without guix)

The direct usage of srun is very versatile and allows one to go straight to the point, which is very convenient in many cases. However, in some cases, one may want to factorize the resource allocation before performing multiple executions in that context. The salloc command allows one to do so by allocating resources and spawning a shell. The shell is then used to execute srun commands to launch parallel tasks.

salloc -p hpc -N 2 -n 4
srun -l /bin/hostname | sort -n
salloc: Granted job allocation 426762
0: miriel017.formation.cluster      
1: miriel017.formation.cluster      
2: miriel017.formation.cluster      
3: miriel018.formation.cluster      

Note that the second command was performed within a spawned shell. This spawned shell is still being executed on the mistral01 frontal node (as running hostname there would show).

But, if you re-execute an srun command, you'll still have access to the allocated resources:

srun -l /bin/hostname | sort -n
0: miriel017.formation.cluster
1: miriel017.formation.cluster
2: miriel017.formation.cluster
3: miriel018.formation.cluster

You may monitor your job allocation (whether or not you are within the spawned process):

squeue -u $USER
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
426762 hpc bash hpcs-agu R 0:06 2 miriel[017-018]

The job allocation gets automatically terminated when it reaches the walltime, or when you explicitly exit the spawned shell:

exit
salloc: Relinquishing job allocation 426762

or if you cancel it (possibly from another session):

scancel 426762
salloc: Job allocation 426762 has been revoked.

1.2.2. guix-based usage

  1. Using srun provided by guix environment
  2. Using default srun provided by plafrim for launching a guix environment
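These two workflows mirror the srun cases above; hedged sketches (not verified on plafrim here) could look like:

```shell
# 1. salloc (and the spawned bash) provided by guix -- as with srun, the
#    guix slurm client must match the slurm daemon version (17.11 here):
guix environment --pure --ad-hoc slurm bash -- salloc -p hpc -N 2 -n 4

# 2. default salloc provided by plafrim, spawning a shell within a guix
#    environment; --preserve=^SLURM keeps the variables set by salloc:
salloc -p hpc -N 2 -n 4 guix environment --pure --preserve=^SLURM --ad-hoc slurm -- /bin/bash --norc
```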

1.3. Batch execution with sbatch (within which srun is run)

sbatch is used to submit a job script for later execution. The script will typically contain one or more srun commands to launch parallel tasks.
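For instance, a minimal batch script (job.sh is a hypothetical name) can be written and submitted as follows; note the mandatory shebang on the first line:

```shell
# create a minimal batch script requesting 2 full miriel nodes for 10 minutes
cat > job.sh <<'EOF'
#!/bin/bash
#SBATCH -p hpc
#SBATCH -N 2
#SBATCH --ntasks-per-node=24
#SBATCH --time=00:10:00
srun -l hostname
EOF
# submit with: sbatch job.sh
# (stdout/stderr go to slurm-<jobid>.out in the submission directory by default)
```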

smap reports state information for jobs, partitions, and nodes managed by Slurm, but graphically displays the information to reflect network topology.

1.3.1. Two lines

#!/bin/bash
#SBATCH --time=03:59:00
#SBATCH -n 4
#SBATCH -p hpc
guix environment --pure --ad-hoc hello -- /bin/bash --norc
srun -l --mpi=pmi2 hello
srun: error: miriel003: task 3: Exited with exit code 2
0: slurmstepd: error: execve(): hello: No such file or directory
3: slurmstepd: error: execve(): hello: No such file or directory
1: slurmstepd: error: execve(): hello: No such file or directory
2: slurmstepd: error: execve(): hello: No such file or directory
srun: error: miriel002: tasks 0-2: Exited with exit code 2

The hello binary is not found: the guix environment on the first line spawns a shell and terminates, so the subsequent srun command runs outside of the environment, where hello is not in the search path.

1.3.2. One line

#!/bin/bash
#SBATCH --time=03:59:00
#SBATCH -n 4
#SBATCH -p hpc
srun -l --mpi=pmi2  guix environment --pure --ad-hoc hello -- hello
0: Hello, world!
2: Hello, world!
1: Hello, world!
3: Hello, world!

1.3.3. Two files (to be completed)

#!/bin/bash
#SBATCH --time=03:59:00
#SBATCH -n 4
#SBATCH -p hpc
guix environment --pure --ad-hoc hello --

mpirun --mpi=pmi2 hello
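A plausible completion (hedged; inner.sh is a hypothetical file name, and we substitute srun --mpi=pmi2 for mpirun as in the one-line example above) is to have the batch script enter the guix environment and execute a second script within it:

```shell
# job.sh: the batch script submitted with sbatch
cat > job.sh <<'EOF'
#!/bin/bash
#SBATCH --time=03:59:00
#SBATCH -n 4
#SBATCH -p hpc
# enter the guix environment (preserving slurm variables) and run the inner script
guix environment --pure --preserve=^SLURM --ad-hoc hello -- ./inner.sh
EOF

# inner.sh: executed within the guix environment
cat > inner.sh <<'EOF'
#!/bin/bash
srun -l --mpi=pmi2 hello
EOF
chmod +x inner.sh
```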

2. mpi

2.1. openmpi

openmpi is integrated with slurm, as discussed in the openmpi reference. In particular, openmpi automatically obtains both the list of hosts and how many processes to start on each host directly from slurm, provided that it has been configured to do so. Hence, it is unnecessary to specify the --hostfile, --host, or -np options. This is true whether you rely on srun ((1) directly, (2) within an interactive salloc session, or (3) in an sbatch batch script) or on mpirun ((1) in an interactive salloc session, or (2) in an sbatch batch script).

Furthermore, you can launch openmpi's mpirun in an interactive slurm allocation (via the salloc command), or submit a script to slurm (via the sbatch command), or "directly" launch mpi executables via srun (with --mpi=pmi2 or --mpi=pmix).

On plafrim, the default openmpi installation is provided as a module:

module load mpi/openmpi/4.0.1

With openmpi in our case, the option to use would be --mpi=pmix_v3 (starting from openmpi 4.0).

guix environment --pure --ad-hoc openmpi -- mpirun --version
mpirun (Open MPI) 4.0.2

[hpcs-agullo@mistral01 ~]$ srun --mpi=list
srun: MPI types are...
srun: none
srun: openmpi
srun: pmi2

[hpcs-agullo@mistral01 ~]$ guix environment --pure --ad-hoc slurm -- srun --mpi=list
srun: MPI types are...
srun: none
srun: openmpi
srun: pmi2
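Putting the pieces together, an mpirun-based launch within an allocation may be sketched as follows (a sketch, not a verified plafrim invocation; hello stands in for any binary available in the environment):

```shell
# allocate 2 nodes / 4 tasks, then let mpirun obtain the host list and
# process count directly from slurm (no --hostfile/--host/-np needed)
salloc -p hpc -N 2 -n 4 \
  guix environment --pure --preserve=^SLURM --ad-hoc openmpi hello -- \
  mpirun hello
```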

2.2. mpich

2.3. nmad

3. slurm and mpi

4. mpi behind the scene

4.1. Introduction

4.2. openmpi

4.2.1. Many ways to use the network

OpenMPI has many different ways to use high-performance networks. There are at least three reasons:

  1. Some networks actually support multiple driver APIs, for instance native InfiniBand Verbs, PSM, and TCP.
  2. Drivers can plug into OpenMPI at different levels (BTL, MTL, PML) depending on what features they support.
  3. Vendors are pushing towards the use of new network-agnostic communication libraries: libfabric/OFI on the Intel side, UCX on the Mellanox (now NVIDIA) side. OpenMPI can use networks directly or through these libraries.

All miriel nodes are connected through Intel TrueScale 40G InfiniBand. Although it is marketed as InfiniBand-compatible, it should actually be used through its own software interface, called PSM. Besides, miriel nodes 001-043 also have an Intel 100G OmniPath interface, which should be used through the PSM2 driver. In OpenMPI, both may be used through the "MTL" layer.

Also, mistral nodes (and some sirocco) have Mellanox 40G InfiniBand interfaces. Those may be used through the native "standard" Verbs InfiniBand stack. However Mellanox rather pushes for their new "UCX" stack to be used.

4.2.2. How drivers are selected

All these ways to use high-performance networks look complicated, but they actually work well once you understand how drivers are selected. The runtime looks at available hardware and drivers and decides which one(s) to use based on priorities. Once you are sure that all high-performance drivers are compiled in, the right one will be used. This even works when mixing nodes with different networks, because the decision is made for each connection between a pair of nodes.
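If in doubt, one can ask the runtime which component it selects by raising the verbosity of the corresponding framework (pml here; my_mpi_app is a placeholder, and the exact output format varies across versions):

```shell
# print PML selection details at startup (shows e.g. whether ucx or ob1 is picked)
mpiexec -np 2 --map-by node --mca pml_base_verbose 10 ./my_mpi_app
```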

The OpenMPI configure script enables everything it finds. That's good if all drivers are installed. However, it will silently disable important ones if development headers are missing (for instance UCX devel packages). To avoid such issues, one should pass --with-foo to force the configure script to fail if foo is not available:

"configure: error: PSM support requested but not found. Aborting"

On PlaFRIM, it means you should pass --with-psm --with-psm2 --with-ucx.

You may also pass --without-ofi since using libfabric/OFI isn't really useful here (it shouldn't be faster than PSM/PSM2/UCX and PlaFRIM doesn't have any additional hardware it supports).

Finally, you may also pass --disable-verbs since Verbs support is deprecated in OpenMPI 4. If it is not disabled, you will have to pass --mca btl_openib_allow_ib 1 to mpiexec to use that API. UCX is basically required for OpenMPI 4 over Mellanox hardware nowadays. That's actually strange, because MVAPICH, another famous MPI implementation for InfiniBand, doesn't plan to switch from Verbs to UCX anytime soon.

In the end, you need a node with development headers for all networks you want to support for this OpenMPI installation. Once it's compiled, it will still run fine on nodes that miss some network libraries because OpenMPI uses dynamically loaded plugins to ignore those missing libraries.
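On plafrim, the flags discussed above combine into a configure invocation along these lines (the installation prefix is a placeholder; compare with the Configure command line reported by ompi_info in the next section):

```shell
./configure --with-psm --with-psm2 --with-ucx \
            --without-ofi --disable-verbs \
            --enable-mpirun-prefix-by-default \
            --prefix=$HOME/opt/openmpi   # placeholder installation prefix
make -j
make install
```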

4.2.3. Verifying what was enabled/compiled

A summary at the end of configure confirms whether things are correctly enabled/disabled:

Cisco usNIC: no
Cray uGNI (Gemini/Aries): no
Intel Omnipath (PSM2): yes     <- miriel001-043
Intel TrueScale (PSM): yes     <- miriel001-088
Mellanox MXM: no
Open UCX: yes                  <- mistral+sirocco01-06
OpenFabrics OFI Libfabric: no  <- explicitly disabled
OpenFabrics Verbs: no          <- explicitly disabled
Portals4: no
Shared memory/copy in+copy out: yes
Shared memory/Linux CMA: yes
Shared memory/Linux KNEM: no
Shared memory/XPMEM: no
TCP: yes

The ompi_info tool also lets you check which components are available at runtime:

$ ompi_info | grep ucx
MCA osc: ucx (MCA v2.1.0, API v3.0.0, Component v4.0.2)
MCA pml: ucx (MCA v2.1.0, API v2.0.0, Component v4.0.2)
$ ompi_info | grep psm
MCA mtl: psm (MCA v2.1.0, API v2.0.0, Component v4.0.2)
MCA mtl: psm2 (MCA v2.1.0, API v2.0.0, Component v4.0.2)
$ ompi_info | grep Configure
Configure command line: '--with-ucx' '--with-psm' '--with-psm2' '--disable-verbs' '--without-ofi' '--enable-mpirun-prefix-by-default' '--prefix=...'

4.2.4. Performance benchmarks to check whether the right network is used

We use the Intel MPI Benchmark (IMB). It was compiled with:

make IMB-MPI1 CC=/my/ompi/bin/mpicc CXX=/my/ompi/bin/mpicxx

We'll just run the IMB Pingpong benchmark between 2 processes, one process per node.

mpiexec -np 2 -H mistral02,mistral03 --map-by node --bind-to core IMB-MPI1 Pingpong

The option --map-by node makes sure that our processes are distributed on nodes first (instead of both going to the same node). The option --bind-to core makes sure processes do not move between cores so that performance is more reproducible between runs.

  1. PSM2

    On miriel001-043, we want to get 100Gbit/s bandwidth. That's supposed to be 12.5GB/s but we rather get 10GB/s in practice because some bandwidth is lost in packet headers, signalling bits, etc.

    $ mpiexec -np 2 --map-by node --bind-to core ... IMB-MPI1 Pingpong
             size         iter      latency    bandwidth
                0         1000         1.45         0.00
                1         1000         1.62         0.62
          2097152           20       215.75      9720.19
          4194304           10       439.37      9546.20

    If we add --mca mtl ^psm2 to disable PSM2, or --mca mtl psm to force PSM1 only, the performance disappears (see below). Hence we were really using PSM2 here, and getting (almost) the expected 10GB/s.

  2. PSM1

    On miriel001-088, we want to get 40Gbit/s (when we cannot use the 100G PSM2 network).

    $ mpiexec -np 2 --map-by node --bind-to core --mca mtl psm ... IMB-MPI1 Pingpong
             size         iter      latency    bandwidth
                0         1000         1.56         0.00
                1         1000         1.68         0.59
          2097152           20       689.57      3041.23
          4194304           10      1325.47      3164.39

    If we add --mca mtl ^psm2,psm to disable PSM2 and PSM1, the performance disappears. Hence we were really using PSM1 here, and getting (almost) the expected 4GB/s.

  3. TCP

    Still on miriel, if we force the use of TCP over the 10G Ethernet network by disabling PSM1 and PSM2, we get about 1GB/s as expected:

    $ mpiexec -np 2 --map-by node --bind-to core --mca mtl ^psm2,psm ... IMB-MPI1 Pingpong
             size         iter      latency    bandwidth
                0         1000        16.39         0.00
                1         1000        18.08         0.06
          2097152           20      2314.69       906.02
          4194304           10      5323.38       787.90

    In this case, OpenMPI is using the TCP BTL. The equivalent configuration is --mca btl tcp,vader,self. The vader BTL is here for communication between processes on the same node (there are some StarWars-related jokes in OpenMPI component names). The self is for communication between a process and itself.

  4. UCX

    Now we switch to mistral nodes with 40G Mellanox InfiniBand. The default configuration gives almost 4GB/s as expected.

    $ mpiexec -np 2 --map-by node --bind-to core ... IMB-MPI1 Pingpong
             size         iter      latency    bandwidth
                0         1000         1.64         0.00
                1         1000         1.66         0.60
          2097152           20       567.42      3695.97
          4194304           10      1126.84      3722.18

    If we disable UCX with --mca pml ^ucx, performance disappears as expected. It switches to TCP.

  5. OpenIB (deprecated Verbs driver)

    If we didn't disable Verbs support with --disable-verbs, it doesn't get used unless forced. The way to force it is to disable UCX and pass --mca btl_openib_allow_ib 1 (to allow InfiniBand cards to be used in the openib BTL). This is equivalent to passing --mca btl openib,vader,self to OpenMPI before version 4.

    $ mpiexec -np 2 --map-by node --bind-to core --mca pml ^ucx --mca btl_openib_allow_ib 1 ... IMB-MPI1 Pingpong
             size         iter      latency    bandwidth
                0         1000         1.48         0.00
                1         1000         1.54         0.65
          2097152           20       576.53      3637.52
          4194304           10      1128.36      3717.16

    The performance is actually very good, basically identical to UCX, so it is not obvious why Mellanox wants to deprecate openib support in OpenMPI. However, this hardware is old; maybe the difference would be larger on recent Mellanox hardware.

4.3. mpich

4.4. nmad

5. Refileme

5.1. guix + slurm

From mistral01.

srun -p hpc -N 2 -n 4 guix environment --pure --preserve=^SLURM  --ad-hoc slurm -- /bin/hostname

miriel018.formation.cluster
miriel017.formation.cluster
miriel017.formation.cluster
miriel017.formation.cluster

srun -p hpc -N 2 -n 4 guix environment --pure --ad-hoc slurm -- /bin/hostname

miriel017.formation.cluster
miriel017.formation.cluster
miriel017.formation.cluster
miriel018.formation.cluster

salloc -p hpc -N 2 -n 48 guix environment --pure --ad-hoc slurm -- /bin/bash --norc

salloc: Granted job allocation 426601

srun /bin/hostname
srun: error: Unable to allocate resources: No partition specified or system default partition

salloc: Relinquishing job allocation 426601

salloc -p hpc -N 2 -n 48 guix environment --pure --preserve=^SLURM  --ad-hoc slurm -- /bin/bash --norc

Author: Inria Bordeaux Sud Ouest

Created: 2021-06-10 Thu 18:01