slurm, mpi, guix and plafrim
Supercomputers rely on job schedulers to handle resources (compute nodes). In the case of plafrim, as for 60% of today's Top 500 supercomputers, slurm is in charge of that, as detailed in the plafrim reference.
We assume that you are all set up with your plafrim access and guix environment and that you have logged into the platform:
ssh plafrim-hpcs
You should now be on the front-end node (mistral01):
hostname
mistral01.formation.cluster
We will now use slurm to run on the compute nodes of the platform (and discover which they are).
1. slurm basics
slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. slurm requires no kernel modifications for its operation and is relatively self-contained. You may want to read an overview of its architecture in the slurm reference.
You can first check with sinfo whether there are partitions with available (idle) compute nodes:
sinfo
PARTITION  AVAIL  TIMELIMIT   NODES  STATE   NODELIST
hpc        up     8:00:00     1      drain*  miriel007
hpc        up     8:00:00     12     down*   miriel[008-016,019,022,024]
hpc        up     8:00:00     19     idle    miriel[001-006,017-018,020-021,023,025-032]
sirocco    up     10:00:00    1      drain   sirocco06
mistral    up     3-00:00:00  2      down*   mistral[04,18]
mistral    up     3-00:00:00  15     idle    mistral[02-03,05-17]
In this case, there are 19 miriel compute nodes available on the hpc partition.
We can monitor the nodes that we get with squeue, possibly restricting the view to our own jobs with the -u option:
squeue -u $USER
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
At this point, you have no jobs running; it's time to launch a few! As explained in the slurm quick start reference, on which this tutorial is based, there are three main ways of using slurm. We'll see later on that srun may optionally be substituted with mpirun for mpi applications (when the support is turned on).
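As a preview, here is a minimal sketch of the three modes detailed below (the partition name hpc and the job script name my_job.sh are placeholders to adapt to your case):

srun -p hpc -N 2 -l hostname     # (1) direct execution: allocate and run in one command
salloc -p hpc -N 2               # (2) interactive allocation: then type srun commands in the spawned shell
sbatch my_job.sh                 # (3) batch execution: submit a script containing srun commands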
1.1. Direct usage of srun
It is possible to create a resource allocation and launch the tasks for a job step in a single command line using the srun command. We'll see later on that, depending upon the mpi implementation used, mpi jobs may also be launched in this manner, but for now we set mpi aside and illustrate it with a simple example displaying the hostname.
1.1.1. Default usage (without guix)
We saw above that the partition we want to run on is called hpc, so we instruct srun to run on that partition with the -p option:
srun -p hpc /bin/hostname
miriel023.formation.cluster
We observe that the execution is performed on a miriel compute node. Only one message is printed because you ran only one process, or task in slurm terminology. You may want to use the -l option, which prepends the task number to lines of stdout/err:
srun -p hpc -l /bin/hostname
0: miriel023.formation.cluster
You will observe that the message is now preceded with task id 0, the unique task allocated for this execution. You can run more tasks as follows. With the -N (or --nodes=) option, you can request that a (minimum) number N of nodes be allocated to the job:
srun -l -p hpc -N 2 hostname
0: miriel017.formation.cluster
1: miriel018.formation.cluster
You may observe that two processes (with ids 0 and 1) are running on two different miriel compute nodes. By default, only one task per node is allocated. You may however observe that 24 CPU cores are available on a miriel node (here miriel017) by monitoring the value of CPUTot retrieved by the following scontrol command (see the slurm refcard):
scontrol show node miriel017
NodeName=miriel017 Arch=x86_64 CoresPerSocket=12
   CPUAlloc=0 CPUErr=0 CPUTot=24 CPULoad=0.01
   AvailableFeatures=Miriel,MirielOPA
   ActiveFeatures=Miriel,MirielOPA
   Gres=(null)
   NodeAddr=miriel017 NodeHostName=miriel017 Version=17.11
   OS=Linux ... #1 SMP Thu Nov 8 23:39:32 UTC 2018
   RealMemory=128000 AllocMem=0 FreeMem=127134 Sockets=2 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=227328 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=hpc
   BootTime=2019-10-31T09:23:47 SlurmdStartTime=2019-10-31T09:54:58
   CfgTRES=cpu=24,mem=125G,billing=24
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
You may thus want to use all 24 available CPU cores per node by setting the --ntasks-per-node=24 option:
srun -p hpc -N 2 --ntasks-per-node=24 -l /bin/hostname | sort -n
0: miriel017.formation.cluster
1: miriel017.formation.cluster
2: miriel017.formation.cluster
3: miriel017.formation.cluster
4: miriel017.formation.cluster
5: miriel017.formation.cluster
6: miriel017.formation.cluster
7: miriel017.formation.cluster
8: miriel017.formation.cluster
9: miriel017.formation.cluster
10: miriel017.formation.cluster
11: miriel017.formation.cluster
12: miriel017.formation.cluster
13: miriel017.formation.cluster
14: miriel017.formation.cluster
15: miriel017.formation.cluster
16: miriel017.formation.cluster
17: miriel017.formation.cluster
18: miriel017.formation.cluster
19: miriel017.formation.cluster
20: miriel017.formation.cluster
21: miriel017.formation.cluster
22: miriel017.formation.cluster
23: miriel017.formation.cluster
24: miriel018.formation.cluster
25: miriel018.formation.cluster
26: miriel018.formation.cluster
27: miriel018.formation.cluster
28: miriel018.formation.cluster
29: miriel018.formation.cluster
30: miriel018.formation.cluster
31: miriel018.formation.cluster
32: miriel018.formation.cluster
33: miriel018.formation.cluster
34: miriel018.formation.cluster
35: miriel018.formation.cluster
36: miriel018.formation.cluster
37: miriel018.formation.cluster
38: miriel018.formation.cluster
39: miriel018.formation.cluster
40: miriel018.formation.cluster
41: miriel018.formation.cluster
42: miriel018.formation.cluster
43: miriel018.formation.cluster
44: miriel018.formation.cluster
45: miriel018.formation.cluster
46: miriel018.formation.cluster
47: miriel018.formation.cluster
You will observe that processes 0-23 are executed on one node (miriel017 in our case) whereas processes 24-47 are executed on a second node (miriel018 here). You may want to read further on slurm resource binding policies here and there, or directly check out the output of srun --cpu-bind=help.
Alternatively, one may specify the total number of tasks to run over all the nodes with the -n option, the default being one task per node (but note that the --cpus-per-task option would change this default). In this case you do not need to specify the number of nodes; hence, the above command is equivalent to both srun -p hpc -N 2 -n 48 -l /bin/hostname | sort -n and srun -p hpc -n 48 -l /bin/hostname | sort -n (where -N 2 is implicit).
srun can also be used to enter an interactive shell session:
srun -p hpc -N 2 --ntasks-per-node=24 --pty bash
However, salloc may certainly be more convenient to do so, as discussed below.
1.1.2. guix-based usage
As of today, here is the default slurm version deployed on plafrim:
srun --version
slurm 17.11.3-2
- Using srun provided by the guix environment

Instead of the one provided by default on plafrim (slurm 17.11.3-2), we can let guix provide it. There is a risk though: the slurm client provided by guix must be able to speak with the slurm daemon (see the architecture overview). In our case, this should be fine, guix providing the 17.11.3 version:

guix environment --pure --ad-hoc slurm -- srun --version
slurm 17.11.3
We can therefore check out the hostname with the srun command provided by guix:

guix environment --pure --ad-hoc slurm -- srun -l -p hpc -N 2 /bin/hostname
0: miriel017.formation.cluster
1: miriel018.formation.cluster

Or, even better, check out the hostname with both the srun and hostname (through the inetutils package for the latter) commands provided by guix:

guix environment --pure --ad-hoc slurm inetutils -- srun -l -p hpc -N 2 hostname
0: miriel017.formation.cluster
1: miriel018.formation.cluster

In this workflow, as the --pure option cleans up the environment before srun is invoked, slurm environment variables are naturally kept:

guix environment --pure --ad-hoc slurm coreutils -- srun -l -p hpc -N 2 env | grep SLURM_JOB_NODELIST
0: SLURM_JOB_NODELIST=miriel[017-018]
1: SLURM_JOB_NODELIST=miriel[017-018]

- Using default srun provided by plafrim for launching a guix environment
If slurm was deployed by system administrators independently from guix, as is likely the case as of today on a production cluster or supercomputer, and as is the case on plafrim, one may want to rely on the system srun command to make sure the client and the daemon are well interfaced. It is still possible to deploy a guix environment in this case. A first option is to provide the full path /usr/bin/srun:

guix environment --pure --ad-hoc inetutils -- /usr/bin/srun -l -p hpc -N 2 hostname
0: miriel006.formation.cluster
1: miriel009.formation.cluster

This might be fully satisfactory when using the --pure option, and one may decide to rely on it during the hands-on sessions of the school. However, while this option does clean up the shell environment in a manner equivalent to env -i (see the env manpage with man env), it does not fully isolate the process. Sometimes it is desirable to isolate the environment as much as possible, for maximal purity and reproducibility. In particular, it may be desirable to prevent access to /usr/bin and other system-wide resources from the development environment. The --container (or -C) option provided by guix environment allows one to do so by relying on Linux container features (kernel namespaces). You may read this for further information. This feature requires linux-libre 3.19 or newer and can be activated as discussed here (or there) if you deploy guix yourself. Unfortunately, plafrim currently provides an older version (3.10) of the linux kernel. For information, here is the report of file system disk space usage on a machine providing the mechanism:

guix environment --ad-hoc --container coreutils -- df
Filesystem     1K-blocks  Used  Available  Use%  Mounted on
/dev
/dev/tty
/dev/shm
/gnu/store/26sq3rhvxydqpmgfn5kicd11drib0xpd-profile

One may view that /usr/ and /etc/ cannot interfere in such a configuration.

Because the --container option is currently not available on plafrim, we pursue our discussion using the --pure option (but the --container one should be used instead on systems where it is available, if one wants to ensure reproducibility at an even higher level).

srun -l -p hpc -N 2 guix environment --pure --ad-hoc inetutils -- hostname
1: miriel018.formation.cluster
0: miriel017.formation.cluster

In this workflow, as the --pure (or --container) option cleans up the environment after srun is invoked, slurm environment variables are lost:

srun -l -p hpc -N 2 guix environment --pure --ad-hoc slurm coreutils -- env | grep SLURM_JOB_NODELIST
While this is not a problem when the goal is simply to display the hostname, it may be critical in more realistic situations. We can then preserve those variables with the --preserve guix environment option, which takes a regexp as argument, ^SLURM in the present case, to preserve slurm environment variables starting with "SLURM":

srun -l -p hpc -N 2 guix environment --pure --preserve=^SLURM --ad-hoc slurm coreutils -- env | grep SLURM_JOB_NODELIST
0: SLURM_JOB_NODELIST=miriel[017-018]
1: SLURM_JOB_NODELIST=miriel[017-018]

As above, it is possible to enter an interactive bash session, but this time within a guix environment, with bash provided by guix:

srun -N 2 -p hpc --pty guix environment --pure --ad-hoc slurm bash -- bash --norc
Yet, we'll discuss below how to use salloc with guix, which may be more convenient for conducting such interactive sessions.

To illustrate once again the symmetry of both behaviours, let us dig further into the hardware topology using hwloc, using either srun within a guix environment:

guix environment --pure --ad-hoc slurm hwloc -- srun -p hpc hwloc-ls

(result omitted) or, equivalently, a guix environment within srun:
srun -p hpc guix environment --pure --ad-hoc hwloc -- hwloc-ls
Machine (128GB total)
  Package L#0
    L3 L#0 (15MB) + NUMANode L#0 (P#0 32GB)
      L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
      L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 (P#2)
      [... L2/L1/Core/PU L#2 to L#5 (PU P#4 to P#10) ...]
    HostBridge
      PCIBridge
        PCI 03:00.0 (InfiniBand)
          Net "ib0"
          OpenFabrics "qib0"
      PCIBridge
        PCI 01:00.0 (Ethernet)  Net "em1"
        PCI 01:00.1 (Ethernet)  Net "em2"
      PCI 00:11.4 (SATA)  Block(Disk) "sda"
      PCIBridge
        PCI 06:00.0 (Ethernet)  Net "em3"
        PCI 06:00.1 (Ethernet)  Net "em4"
      PCIBridge
        PCI 0a:00.0 (VGA)
      PCI 00:1f.2 (SATA)
    L3 L#1 (15MB) + NUMANode L#1 (P#2 32GB)
      [... L2/L1/Core/PU L#6 to L#11 (PU P#12 to P#22) ...]
  Package L#1
    L3 L#2 (15MB) + NUMANode L#2 (P#1 32GB)
      [... L2/L1/Core/PU L#12 to L#17 (PU P#1 to P#11) ...]
    HostBridge
      PCIBridge
        PCI 82:00.0 (Fabric)
          Net "ib1"
          OpenFabrics "hfi1_0"
    L3 L#3 (15MB) + NUMANode L#3 (P#3 32GB)
      [... L2/L1/Core/PU L#18 to L#23 (PU P#13 to P#23) ...]
1.2. Interactive allocation through salloc (before running with srun)
1.2.1. Default usage (without guix)
The direct usage of srun is very versatile and allows one to go straight to the point, which is very convenient in many cases. However, in some cases, one may want to factorize the resource allocation before performing multiple executions in that context. The salloc command allows one to do so by allocating resources and spawning a shell. The shell is then used to execute srun commands to launch parallel tasks.
salloc -p hpc -N 2 -n 4 srun -l /bin/hostname | sort -n
salloc: Granted job allocation 426762
0: miriel017.formation.cluster
1: miriel017.formation.cluster
2: miriel017.formation.cluster
3: miriel018.formation.cluster
Note that the second command was performed within a spawned shell. This spawned shell is still being executed on the mistral01 front-end node:
hostname
mistral01.formation.cluster
But, if you re-execute an srun command, you'll still have access to the allocated resources:
srun -l /bin/hostname | sort -n
0: miriel017.formation.cluster
1: miriel017.formation.cluster
2: miriel017.formation.cluster
3: miriel018.formation.cluster
You may monitor your job allocation (whether or not you are within the spawned process):
squeue -u $USER
JOBID   PARTITION  NAME  USER      ST  TIME  NODES  NODELIST(REASON)
426762  hpc        bash  hpcs-agu  R   0:06  2      miriel[017-018]
The job allocation gets automatically terminated if it reaches the walltime, if you explicitly exit the spawned shell:
exit
salloc: Relinquishing job allocation 426762
or if you cancel it (possibly from another session):
scancel 426762
salloc: Job allocation 426762 has been revoked.
1.3. Batch execution with sbatch (within which to run srun)
sbatch is used to submit a job script for later execution. The script will typically contain one or more srun commands to launch parallel tasks. smap reports state information for jobs, partitions, and nodes managed by Slurm, but graphically displays the information to reflect network topology.
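Before detailing the scripts, here is a hedged sketch of the submission workflow itself (the script name hello.sbatch is a placeholder of ours; by default the job output lands in a slurm-<jobid>.out file in the submission directory):

sbatch hello.sbatch     # returns "Submitted batch job <jobid>"
squeue -u $USER         # monitor the job while it is pending or running
cat slurm-<jobid>.out   # inspect the output once the job has completed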
1.3.1. Two lines
#!/bin/sh
#SBATCH --time=03:59:00
#SBATCH -N 2
#SBATCH -n 4
#SBATCH -p hpc
guix environment --pure --ad-hoc hello -- /bin/bash --norc
srun -l --mpi=pmi2 hello
srun: error: miriel003: task 3: Exited with exit code 2
0: slurmstepd: error: execve(): hello: No such file or directory
3: slurmstepd: error: execve(): hello: No such file or directory
1: slurmstepd: error: execve(): hello: No such file or directory
2: slurmstepd: error: execve(): hello: No such file or directory
srun: error: miriel002: tasks 0-2: Exited with exit code 2

The execution fails: the guix environment command on the first line spawns a shell which exits immediately in this non-interactive context, and the subsequent srun is then executed outside of that environment, where hello is not in the PATH.
1.3.2. One line
#!/bin/sh
#SBATCH --time=03:59:00
#SBATCH -N 2
#SBATCH -n 4
#SBATCH -p hpc
srun -l --mpi=pmi2 guix environment --pure --ad-hoc hello -- hello
0: Hello, world!
2: Hello, world!
1: Hello, world!
3: Hello, world!
1.3.3. Two files (to be completed)
#!/bin/sh
#SBATCH --time=03:59:00
#SBATCH -N 2
#SBATCH -n 4
#SBATCH -p hpc
guix environment --pure --ad-hoc hello -- hello.sh
hello.sh:
#!/bin/bash
mpirun --mpi=pmi2 hello
2. mpi
https://en.wikipedia.org/wiki/Message_Passing_Interface#Example_program
The main mpi implementations considered here are openmpi, mpich and nmad (part of the pm2 suite).
You may want to read how the terms job, task and step relate to each other.
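As a hedged sketch (not taken from the plafrim sessions), the example program linked above could be saved as mpi_hello.c (a file name we choose here), compiled inside a guix environment providing openmpi and gcc-toolchain, and launched with srun; preserving the SLURM and PMI related variables through --pure is assumed to be needed for the PMI2 wire-up:

# mpi_hello.c is a hypothetical file holding the MPI example program linked above
guix environment --pure --ad-hoc openmpi gcc-toolchain -- mpicc mpi_hello.c -o mpi_hello
srun -p hpc -N 2 -n 4 -l --mpi=pmi2 guix environment --pure --preserve=^SLURM --preserve=^PMI --ad-hoc openmpi -- ./mpi_hello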
2.1. openmpi
openmpi is integrated with slurm as discussed in the openmpi reference. In particular, openmpi automatically obtains both the list of hosts and how many processes to start on each host from slurm directly, provided that it has been configured to do so. Hence, it is unnecessary to specify the --hostfile, --host, or -np options. This is true whether you rely on srun ((1) directly, (2) within an interactive salloc session, or (3) in an sbatch batch script) or on mpirun ((1) in an interactive salloc session, or (2) in an sbatch batch script).
In other words, you can launch openmpi's mpirun in an interactive Slurm allocation (via the salloc command), submit a script to Slurm (via the sbatch command), or "directly" launch MPI executables via srun.
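For instance, reusing the hypothetical mpi_hello binary from the sketch above, one could run within an salloc allocation without passing any --hostfile, --host or -np option. Whether the guix-provided mpirun can actually launch through slurm depends on how the openmpi package was built, so this is only a sketch:

salloc -p hpc -N 2 -n 4
guix environment --pure --preserve=^SLURM --ad-hoc openmpi slurm -- mpirun ./mpi_hello   # host list and process count come from slurm
exit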
Some notes and references:
- https://www.open-mpi.org/faq/?category=slurm
- https://slurm.schedmd.com/mpiplugins.html (replacing --mpi=pmi2 by --mpi=pmix)
- OpenMPI in our case: https://slurm.schedmd.com/mpi_guide.html#open_mpi
- --mpi=pmix_v3 (starting from OpenMPI 4.0)
- module load mpi/openmpi/4.0.1 (the module-based way of loading openmpi on plafrim)

The openmpi version provided by guix, and the MPI plugin types known to srun:

guix environment --pure --ad-hoc openmpi -- mpirun --version
mpirun (Open MPI) 4.0.2

[hpcs-agullo@mistral01 ~]$ srun --mpi=list
srun: MPI types are...
srun: none
srun: openmpi
srun: pmi2

[hpcs-agullo@mistral01 ~]$ guix environment --pure --ad-hoc slurm -- srun --mpi=list
srun: MPI types are...
srun: none
srun: openmpi
srun: pmi2
2.2. mpich
2.3. nmad
3. slurm and mpi
4. mpi behind the scenes
4.1. Introduction
4.2. openmpi
4.2.1. Many ways to use the network
OpenMPI has many different ways to use high-performance networks. There are at least three reasons:
- Some networks actually support multiple driver APIs, for instance native InfiniBand Verbs, PSM, and TCP.
- Drivers can plug into OpenMPI at different levels (BTL, MTL, PML) depending on what features they support.
- Vendors are pushing towards the use of new network-agnostic communication libraries: libfabric/OFI on the Intel side, UCX on the Mellanox (now NVIDIA) side. OpenMPI can use networks directly or through these libraries.
All miriel nodes are connected through Intel TrueScale 40G InfiniBand. Although it is marketed as InfiniBand-compatible, it should actually be used through its own software interface called PSM. Besides, miriel nodes 001-043 also have an Intel 100G OmniPath interface, which should be used through the PSM2 driver. In OpenMPI, these may be used through the "MTL" layer.
Also, mistral nodes (and some sirocco) have Mellanox 40G InfiniBand interfaces. Those may be used through the native "standard" Verbs InfiniBand stack. However Mellanox rather pushes for their new "UCX" stack to be used.
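If you are unsure which of these layers a given OpenMPI installation provides, the ompi_info tool (used again in section 4.2.3 below) lists the compiled-in BTL, MTL and PML components; a possible check:

ompi_info | grep -E "MCA (btl|mtl|pml)"   # list transport components available at runtime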
4.2.2. How drivers are selected
All these ways to use high-performance networks look complicated, but they actually work well once you understand how drivers are selected. The runtime looks at the available hardware and drivers and decides which one(s) to use based on priorities. Once you are sure that all high-performance drivers are compiled in, the right one will be used. This even works when mixing nodes with different networks, because this decision is made for each connection between a pair of nodes.
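If needed, the automatic selection can be overridden at run time with --mca parameters, as done in the benchmarks of section 4.2.4 below (my_app is a placeholder binary):

mpiexec -np 2 --map-by node --bind-to core --mca mtl psm2 ./my_app   # force the PSM2 MTL
mpiexec -np 2 --map-by node --bind-to core --mca pml ^ucx ./my_app   # exclude the UCX PML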
The OpenMPI configure script enables everything it finds. That's good if all drivers are installed. However, it will silently disable important ones if development headers are missing (for instance UCX devel packages). To avoid such issues, one should pass --with-foo to force the configure script to fail if foo is not available:
"configure: error: PSM support requested but not found. Aborting"
On PlaFRIM, it means you should pass --with-psm --with-psm2 --with-ucx.
You may also pass --without-ofi since using libfabric/OFI isn't really useful here (it shouldn't be faster than PSM/PSM2/UCX and PlaFRIM doesn't have any additional hardware it supports). Finally, you may also pass --disable-verbs since Verbs is deprecated in OpenMPI 4. If not disabled, you will have to pass --mca btl_openib_allow_ib 1 to mpiexec to use that API.
UCX is basically required for OpenMPI 4 over Mellanox hardware nowadays. That's actually strange because MVAPICH, another famous MPI implementation for InfiniBand, doesn't plan to switch from Verbs to UCX anytime soon.
In the end, you need a node with development headers for all networks you want to support for this OpenMPI installation. Once it's compiled, it will still run fine on nodes that miss some network libraries because OpenMPI uses dynamically loaded plugins to ignore those missing libraries.
4.2.3. Verifying what was enabled/compiled
A summary at the end of configure confirms whether things are correctly enabled/disabled:
Transports
-----------------------
Cisco usNIC: no
Cray uGNI (Gemini/Aries): no
Intel Omnipath (PSM2): yes <- miriel001-043
Intel TrueScale (PSM): yes <- miriel001-088
Mellanox MXM: no
Open UCX: yes <- mistral+sirocco01-06
OpenFabrics OFI Libfabric: no <- explicitly disabled
OpenFabrics Verbs: no <- explicitly disabled
Portals4: no
Shared memory/copy in+copy out: yes
Shared memory/Linux CMA: yes
Shared memory/Linux KNEM: no
Shared memory/XPMEM: no
TCP: yes
The ompi_info tool also lets you check which components are available at runtime:
$ ompi_info | grep ucx
    MCA osc: ucx (MCA v2.1.0, API v3.0.0, Component v4.0.2)
    MCA pml: ucx (MCA v2.1.0, API v2.0.0, Component v4.0.2)
$ ompi_info | grep psm
    MCA mtl: psm (MCA v2.1.0, API v2.0.0, Component v4.0.2)
    MCA mtl: psm2 (MCA v2.1.0, API v2.0.0, Component v4.0.2)
$ ompi_info | grep Configure
  Configure command line: '--with-ucx' '--with-psm' '--with-psm2' '--disable-verbs' '--without-ofi' '--enable-mpirun-prefix-by-default' '--prefix=...'
4.2.4. Performance benchmarks to check whether the right network is used
We use the Intel MPI Benchmarks from https://github.com/intel/mpi-benchmarks/. They were compiled with:
make IMB-MPI1 CC=/my/ompi/bin/mpicc CXX=/my/ompi/bin/mpicxx
We'll just run the IMB Pingpong benchmark between 2 processes, one process per node.
mpiexec -np 2 -H mistral02,mistral03 --map-by node --bind-to core IMB-MPI1 Pingpong
The option --map-by node makes sure that our processes are distributed on nodes first (instead of both going to the same node). The option --bind-to core makes sure processes do not move between cores, so that performance is more reproducible between runs.
- PSM2
On miriel001-043, we want to get 100Gbit/s bandwidth. That's supposed to be 12.5GB/s but we rather get 10GB/s in practice because some bandwidth is lost in packet headers, signalling bits, etc.
$ mpiexec -np 2 --map-by node --bind-to core ... IMB-MPI1 Pingpong
   size  iter  latency  bandwidth
      0  1000     1.45       0.00
      1  1000     1.62       0.62
[...]
2097152    20   215.75    9720.19
4194304    10   439.37    9546.20
If we add --mca mtl ^psm2 to disable PSM2, or --mca mtl psm to force PSM1 only, the performance disappears (see below). Hence we were really using PSM2 here, and getting (almost) the expected 10GB/s.
- PSM1
On miriel001-088, we want to get 40Gbit/s (when we cannot use the 100G PSM2 network).
$ mpiexec -np 2 --map-by node --bind-to core --mca mtl psm ... IMB-MPI1 Pingpong
   size  iter  latency  bandwidth
      0  1000     1.56       0.00
      1  1000     1.68       0.59
[...]
2097152    20   689.57    3041.23
4194304    10  1325.47    3164.39
If we add --mca mtl ^psm2,psm to disable both PSM2 and PSM1, the performance disappears. Hence we were really using PSM1 here, and getting (almost) the expected 4GB/s.
- TCP
Still on miriel, if we force the use of TCP on the 10G Ethernet network by disabling PSM1 and PSM2, we get about 1GB/s as expected:
$ mpiexec -np 2 --map-by node --bind-to core --mca mtl ^psm2,psm ... IMB-MPI1 Pingpong
   size  iter  latency  bandwidth
      0  1000    16.39       0.00
      1  1000    18.08       0.06
[...]
2097152    20  2314.69     906.02
4194304    10  5323.38     787.90
In this case, OpenMPI is using the TCP BTL. The equivalent configuration is --mca btl tcp,vader,self. The vader BTL is here for communication between processes on the same node (there are some Star Wars-related jokes in OpenMPI component names). The self BTL is for communication between a process and itself.
- UCX
Now we switch to mistral nodes with 40G Mellanox InfiniBand. The default configuration gives almost 4GB/s as expected.
$ mpiexec -np 2 --map-by node --bind-to core ... IMB-MPI1 Pingpong
   size  iter  latency  bandwidth
      0  1000     1.64       0.00
      1  1000     1.66       0.60
2097152    20   567.42    3695.97
4194304    10  1126.84    3722.18
If we disable UCX with --mca pml ^ucx, performance disappears as expected; it switches to TCP.
- OpenIB (deprecated Verbs driver)
If Verbs support had not been disabled with --disable-verbs, it would still not get used unless forced. The way to force it is to disable UCX and pass --mca btl_openib_allow_ib 1 (to allow InfiniBand cards to be used in the openib BTL). This is equivalent to passing --mca btl openib,vader,self to OpenMPI before version 4.

$ mpiexec -np 2 --map-by node --bind-to core --mca pml ^ucx --mca btl_openib_allow_ib 1 ... IMB-MPI1 Pingpong
   size  iter  latency  bandwidth
      0  1000     1.48       0.00
      1  1000     1.54       0.65
[...]
2097152    20   576.53    3637.52
4194304    10  1128.36    3717.16
The performance is actually very good, basically identical to UCX, so it is not obvious why Mellanox wants to deprecate openib support in OpenMPI. However, this hardware is old; maybe the difference would be larger on recent Mellanox hardware.
4.3. mpich
4.4. nmad
5. Refileme
5.1. guix + slurm
From mistral01.
srun -p hpc -N 2 -n 4 guix environment --pure --preserve=^SLURM --ad-hoc slurm -- /bin/hostname
miriel018.formation.cluster
miriel017.formation.cluster
miriel017.formation.cluster
miriel017.formation.cluster
srun -p hpc -N 2 -n 4 guix environment --pure --ad-hoc slurm -- /bin/hostname
miriel017.formation.cluster
miriel017.formation.cluster
miriel017.formation.cluster
miriel018.formation.cluster
salloc -p hpc -N 2 -n 48 guix environment --pure --ad-hoc slurm -- /bin/bash --norc
salloc: Granted job allocation 426601
srun /bin/hostname
srun: error: Unable to allocate resources: No partition specified or system default partition
exit exit
salloc: Relinquishing job allocation 426601
The srun above fails because --pure dropped the slurm environment variables in the spawned shell, so srun no longer sees the allocation; preserving them with --preserve=^SLURM fixes this:
salloc -p hpc -N 2 -n 48 guix environment --pure --preserve=^SLURM --ad-hoc slurm -- /bin/bash --norc
5.2. Best practices
5.2.1. Long jobs
Some more to read.
- https://slurm.schedmd.com/job_array.html
- https://support.ceci-hpc.be/doc/_contents/QuickStart/SubmittingJobs/SlurmTutorial.html
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK (see the sketch after this list)
- http://en.wikipedia.org/wiki/Openmp#C
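For instance, here is a hedged sketch of a hybrid mpi + OpenMP batch script combining --cpus-per-task with the OMP_NUM_THREADS export above (my_hybrid_app is a placeholder binary):

#!/bin/sh
#SBATCH --time=03:59:00
#SBATCH -p hpc
#SBATCH -N 2
#SBATCH --ntasks-per-node=2      # 2 mpi tasks per node
#SBATCH --cpus-per-task=12       # 12 cores per task on a 24-core miriel node
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun -l --mpi=pmi2 ./my_hybrid_app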