Hands-on sparse linear solvers
Table of Contents
- 1. Introduction
- 2. Getting set up
- 3. A first exploration of maphys features
  - 3.1. maphys driver
  - 3.2. Set up for doing testing with salloc and guix environment
  - 3.3. A first execution
  - 3.4. An actual hybrid method
  - 3.5. Exploit SPD properties
  - 3.6. Try running on more subdomains
  - 3.7. Use a robust algebraic domain decomposition
  - 3.8. Backward and forward error analysis
  - 3.9. Sparse direct solver
  - 3.10. Display more info
  - 3.11. Sparsification
  - 3.12. Force GMRES
  - 3.13. Multiple right-hand sides (fabulous)
  - 3.14. Matrix vector product
  - 3.15. Exit the salloc and guix environments
- 4. Larger test cases
- 5. Build maphys
This hands-on session is a follow-up of yesterday's 2D-3D hands-on, but it can also be conducted in a standalone fashion: all the required material (matrices) from the previous session is provided as an input to this session.
The pedagogical objectives of this hands-on session are the following.
- Having a more advanced usage of slurm/mpi/guix, through the same guix set up as for the rest of the school.
- Understanding and analyzing the behaviour of a sparse solver, whether it is direct, iterative or hybrid.
- Setting up key numerical (stopping criterion, …) and computational (MPI+thread, …) parameters.
- Knowing how to recompile a solver to change a key dependency such as the blas it depends on.
The instructions below are related to the maphys hybrid direct/iterative solver. The ones for the pastix direct solver are provided here.
1. Introduction
maphys is a software package for the solution of the sparse linear system \[ {\cal A} x = b \] where \({\cal A}\) is a square real or complex non-singular general matrix, \(b\) is a given right-hand side vector, and \(x\) is the solution vector to be computed. It follows a non-overlapping algebraic domain decomposition method that first reorders the linear system into the \(2 \times 2\) block system
\begin{equation} \left( \begin{array}{cc} {\cal A}_{II} & {\cal A}_{I\Gamma} \\ {\cal A}_{\Gamma I} & {\cal A}_{\Gamma\Gamma} \\ \end{array} \right) \left( \begin{array}{c} x_{I} \\ x_{\Gamma} \\ \end{array} \right) = \left( \begin{array}{c} b_{I} \\ b_{\Gamma} \\ \end{array} \right), \end{equation}
where \({\cal A}_{II}\) and \({\cal A}_{\Gamma\Gamma}\) respectively represent the interior subdomains and the separators, and \({\cal A}_{I\Gamma}\) and \({\cal A}_{\Gamma I}\) are the coupling between interiors and separators. By eliminating the unknowns associated with the interior subdomains \({\cal A}_{II}\) we get
\begin{equation} \left( \begin{array}{cc} {\cal A}_{II} & {\cal A}_{I\Gamma} \\ 0 & {\cal S} \\ \end{array} \right) \left( \begin{array}{c} x_{I} \\ x_{\Gamma} \\ \end{array} \right) = \left( \begin{array}{c} b_{I} \\ f \\ \end{array} \right), \end{equation}
with
\begin{equation} {\cal S}={\cal A}_{\Gamma\Gamma}-{\cal A}_{\Gamma I} {\cal A}_{II}^{-1} {\cal A}_{I\Gamma} \; \textrm{ and} \; f = b_\Gamma -{\cal A}_{\Gamma I} {\cal A}_{II}^{-1} b_{I}. \end{equation}
The matrix \({\cal S}\) is referred to as the Schur complement matrix. Because most of the fill-in appears in the Schur complement, the Schur complement system is solved using a preconditioned Krylov subspace method while the interior subdomain systems are solved using a sparse direct solver. Although the Schur complement system is significantly better conditioned than the original matrix \({\cal A}\), it is important to consider further preconditioning when employing a Krylov method.
In maphys, several overlapping block diagonal preconditioning techniques are implemented, where each diagonal block is associated with the interface of a subdomain:
- dense block diagonal factorization: each diagonal block is factorized using the appropriate lapack subroutine;
- sparsified block diagonal factorization: the dense diagonal block is first sparsified by dropping entry \(s_{i,j}\) if it is lower than \(\xi ( \vert s_{i,i} \vert + \vert s_{j,j}\vert )\). The sparse factorization is performed by a sparse direct solver (mumps or pastix);
- sparse approximation of the diagonal block: a sparse approximation of the diagonal block is computed by replacing \({\cal A}_{II}^{-1}\) by an incomplete \(ILU(t,p)\) factorization. The computed approximation of the Schur complement is further sparsified.
Because of its diagonal (and consequently local) nature, the preconditioner tends to be numerically less efficient when the number of subdomains is increased. The efficient solution provided by maphys results from a trade-off between two contradictory ideas: increasing the number of domains to reduce the cost of the sparse factorization of the interior domains on the one hand, and reducing the number of domains to make the iterative solution of the interface problem easier on the other hand.
2. Getting set up
We use the same guix set up as for the rest of the school.
We now set up maphys.
guix environment --pure --ad-hoc maphys coreutils sed bash -- bash --norc
source $GUIX_ENVIRONMENT/bin/maphys_env.sh
We are now in a guix shell where $MAPHYS_DIR is set up. We go to our home directory (cd), make a hands-on-maphys directory for the purpose of this hands-on session, move there (cd hands-on-maphys) and put there a selection of material we will use for the session. Rather than copying the material (you can do that if you prefer, though), we proceed with symbolic links and cherry-pick three drivers (dmph_examplekv, dmph_examplerestart and zmph_examplerestart) and one matrix (bcsstk17.mtx) with either 1 or 4 right-hand sides (rhs).
cd
mkdir hands-on-maphys
cd hands-on-maphys
ln -s $MAPHYS_DIR/lib/maphys/examples/dmph_examplekv .
ln -s $MAPHYS_DIR/lib/maphys/examples/dmph_examplerestart .
ln -s $MAPHYS_DIR/lib/maphys/examples/zmph_examplerestart .
ln -s $MAPHYS_DIR/lib/maphys/examples/bcsstk17.mtx .
ln -s $MAPHYS_DIR/lib/maphys/examples/bcsstk17_rhs.ijv .
ln -s $MAPHYS_DIR/lib/maphys/examples/bcsstk17_4rhs.ijv .
We exit the shell spawned by the guix environment command we have used.
exit
2.1. Optional: Understanding the guix environment further
Can we execute the maphys drivers we have just produced now that we are outside of the guix environment we have used to produce them? Check out the shared libraries this driver requires to get a hint:
ldd ./dmph_examplekv
In this particular case, we observe that all the dependencies are satisfied. We thus do not need to explicitly load maphys in the environment for the next steps now that the binary is built (we will still need to load openmpi though, to execute mpirun). Is this satisfying from a reproducibility point of view? What does the --pure guix environment option do underneath? Do you expect all the dependencies would still be matched by ldd in a --pure environment? Check it out:
guix environment --pure --ad-hoc coreutils sed bash gcc-toolchain -- ldd ~/hands-on-maphys/dmph_examplekv
You may now better understand the limits of a --pure environment. Check out the --container option of guix environment to achieve an effective isolation and obtain a bit-to-bit reproducible environment. Unfortunately you cannot assess it on plafrim yet.
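For reference, a minimal sketch of what such an isolated check could look like on a machine where containers are supported (illustration only, since it cannot be run on plafrim; the --share option used below to expose the hands-on directory inside the container is an assumption about your setup):
guix environment --container --share=$HOME/hands-on-maphys --ad-hoc coreutils gcc-toolchain -- ldd $HOME/hands-on-maphys/dmph_examplekv
Inside the container, ldd can only resolve libraries from the declared packages, which is precisely the isolation that --pure alone does not provide.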
3. A first exploration of maphys features
3.1. maphys driver
We cherry-picked three drivers above. In this section, we will use one of them: dmph_examplekv. It is the main driver provided by maphys to do standalone testing (as opposed to maphys being used as a third-party library of an application, as seen in yesterday's class) in double precision, real arithmetic. maphys supports four precisions and there is a specific driver <arith>mph_examplekv for each of them, where <arith> may be:
- s for single precision, real arithmetic;
- d for double precision, real arithmetic;
- c for single precision, complex arithmetic;
- z for double precision, complex arithmetic.
The driver takes some input information via free structured files (we will construct a simple.in input file throughout this section). In order to launch these tests, you need to specify the matrix file (bcsstk17.mtx in this section) to process and the maphys solver options you want to set up in an input file whose name is given as an argument. These options are described in the input files and the documentation. We are using maphys version 0.9.8.3; here are the associated refcard and users' guide. The name of the input file (simple.in) has to be given as an argument.
We will play with the bcsstk17.mtx matrix from the SuiteSparse Matrix Collection. Read its properties on the website. Note that it is a real, symmetric positive definite (SPD) \(10,974\) by \(10,974\) matrix with \(428,650\) non-zero elements, arising from a structural problem. This one is shipped with maphys, but you may browse and download other matrices from the collection, and play with those instead.
Note that, based on the extension of the matrix file, the driver can handle:
- '.ijv': coordinate format;
- '.mm' or '.mtx': Matrix Market;
- '.rsa', '.rua', '.rse', '.rue', '.csa', '.cua', '.cse', '.cue': Harwell-Boeing.
The '.psa', '.pua', '.pse', '.pue' Harwell-Boeing extensions are not supported.
3.2. Set up for doing testing with salloc and guix environment
salloc -n 24 -p hpc
squeue -u $USER
ssh miriel<node-id>
guix environment --pure --ad-hoc openmpi openssh vim bash coreutils sed grep -- bash --norc
Be careful to replace <node-id> with the number of the node assigned to your allocation.
3.3. A first execution
We want to instruct the maphys ./dmph_examplekv driver to run our sample matrix. Copy these lines into the file simple.in.
MATFILE = bcsstk17.mtx
SYM = 2 # the symmetry of the matrix (0=general, 1=spd, 2=symmetric)
mpirun -n 2 ./dmph_examplekv simple.in
In this run we can observe that we effectively run in real, double precision (Real (KIND=8): double precision). No right-hand side has been provided (no rhs: solution (1, 2, 3, ...) -> b), so the driver generates a solution vector (1, 2, 3, …) and computes the associated right-hand side through a matrix vector product.
You may observe that "thread multiple" is enabled (which is not supported on mistral01 and may lead to deadlocks right away after mpi_init).
-------------------------------------------------------------------------- WARNING: There was an error initializing an OpenFabrics device. Local host: miriel028 Local device: qib0 -------------------------------------------------------------------------- version is :0.9.8.3 arith is :Real (KIND=8) prefix is : matrix is :bcsstk17.mtx rhs is : initguess is : outrhsfile is : outsolfile is : nb of domains is :2 Have MPI_THREAD_MULTIPLE Warning : failed to open rhsfile Warning : Generates RHS for x s.t. x_i = i for all i Warning : no initial guess *************************************** * MaPHyS 0.9.8.3 [ Real (KIND=8) ] * *************************************** [00000] :Domain Decomposition -> local interface -> maximum size = 174 [00000] :Domain Decomposition -> local interface -> minimum size = 174 [00000] :Domain Decomposition -> local interface -> maximum weight = 2 [00000] :Domain Decomposition -> local interface [ 0] -> size = 174 [00000] :Domain Decomposition -> local interface [ 1] -> size = 174 [00000] :Domain Decomposition -> local interior [ 0] -> size = 6418 [00000] :Domain Decomposition -> local interior [ 1] -> size = 4382 [00000] :Domain Decomposition -> local interior [ 0] -> nnz = 132888 [00000] :Domain Decomposition -> local interior [ 1] -> nnz = 82259 CONVERGENCE HISTORY FOR CG Errors are displayed in unit: 6 Warnings are not displayed: Matrix size: 174 Left preconditioning Default initial guess x_0 = 0 Stopping criterion backawrd error Maximum number of iterations: 100 Tolerance for convergence: 0.10E-04 Backward error on the unpreconditioned system Ax = b: the residual is normalised by 0.00E+00 * ||x|| + 0.61E+14 Optimal size for the workspace: 1045 Convergence history: b.e. on the unpreconditioned system Iteration Approx. A-norm Approx. b.e. True b.e. 1 -- 0.30E-04 -- 2 0.60E+01 0.65E-05 0.65E-05 Convergence achieved B.E. on the unpreconditioned system: 0.65E-05 info(1) = 0 Number of iterations (info(2)): 2 * Solution Quality Description FIELD (INDEX) VALUE ------------------------------------------------------- |A.x-b|/|b| RINFOG( 4) 6.510E-06 Wanted precision RCNTL ( 21) 1.000E-05 Nb. of iters. IINFOG( 5) 2 Max Nb. of iters. ICNTL ( 24) 100 Schur. system backward error RINFOG( 3) 6.510E-06 Description FIELD (INDEX) MIN MAX AVERAGE STD.DEV. ---------------------------------------------------------------------------------------- Schur -> norm1 RINFO ( 2) -1.000E+00 -1.000E+00 -1.000E+00 0.000E+00 Schur -> rcond1 (estimation) RINFO ( 3) -1.000E+00 -1.000E+00 -1.000E+00 0.000E+00 * Sizes Description FIELD (INDEX) VALUE ------------------------------------------------------ input matrix -> order IINFOG( 3) 10974 input matrix -> nb. entries IINFOG( 4) 219812 Description FIELD (INDEX) MIN MAX AVERAGE STD.DEV. ------------------------------------------------------------------------------------------- lc. matrix -> order IINFO ( 4) 4556 6592 5.574E+03 1.018E+03 lc. matrix -> nb. entries IINFO ( 6) 82511 137301 1.099E+05 2.740E+04 lc. matrix -> mem. cost IINFO ( 5) 3 5 4.000E+00 1.000E+00 schur. -> order IINFO ( 11) 174 174 1.740E+02 0.000E+00 lc. matrix facto. -> mem. cost IINFO ( 9) 9 9 9.000E+00 0.000E+00 schur. -> mem. cost IINFO ( 12) 1 1 1.000E+00 0.000E+00 schur. -> nb. entries IINFO ( 13) 30276 30276 3.028E+04 0.000E+00 prcnd. -> order IINFO ( 16) 174 174 1.740E+02 0.000E+00 prcnd. -> % of kept entries IINFO ( 19) 69 69 6.900E+01 0.000E+00 prcnd. -> % of kept entries IINFO ( 19) 69 69 6.900E+01 0.000E+00 * Timings Description FIELD (INDEX) MIN MAX AVERAGE STD.DEV. 
---------------------------------------------------------------------------------------------------- Timing -> Total RINFO ( 21) 1.844E-01 1.857E-01 1.851E-01 6.544E-04 Timing -> Spent in all steps RINFO ( 20) 1.735E-01 1.833E-01 1.784E-01 4.853E-03 Timing -> Analyze RINFO ( 4) 1.085E-01 1.087E-01 1.086E-01 9.825E-05 Timing -> Factorise RINFO ( 5) 5.288E-02 6.249E-02 5.768E-02 4.809E-03 Timing -> Precond RINFO ( 6) 5.259E-03 5.390E-03 5.325E-03 6.504E-05 Timing -> Solve RINFO ( 7) 6.742E-03 6.763E-03 6.753E-03 1.044E-05 == ANALYSIS DATA == Number of domains : 2 IINFO(4), n local (min/max/avg) : 4556 6592 5574.0000000000000 IINFO(6), local matrix nnz (min/max/avg) : 82511 137301 109906.00000000000 IINFO(11), local schur n (min/max/avg) : 174 174 174.00000000000000 IINFO(13), local schur nnz (min/max/avg): 30276 30276 30276.000000000000 IINFO(19), percentage of kept entries in schur (min/max/avg) : 69 69 69.000000000000000 == ESTIMATION OF COMMITED ERROR == Backward error: |A.x-b|/|b| norm2 = 6.5100591311E-06 Backward error centralized: |A.x-b|/|b| norm2 = 6.5100591311E-06 Computed (x) vs analytical (x*) solution: |x-x*|/|x*| norm2 = 1.7624358784E-02 Computed (x) vs analytical (x*) solution: |x-x*|/|x*| norm_inf = 6.7049065776E-02 == FIRST VALUES OF ESTIMATED VS ANALYTICAL SOLUTION == | Computed (x) | Theoretical (x*) _____________________________________________________ |sol( 1)| | 1.0000000000E+00 | 1.0000000000E+00 |sol( 2)| | 7.3379644782E+02 | 2.0000000000E+00 |sol( 3)| | 5.5408891809E+00 | 3.0000000000E+00 |sol( 4)| | 2.6556211773E+00 | 4.0000000000E+00 |sol( 5)| | 5.0000000000E+00 | 5.0000000000E+00 |sol( 6)| | 6.0000000000E+00 | 6.0000000000E+00 |sol( 7)| | 7.0000000000E+00 | 7.0000000000E+00 |sol( 8)| | 6.8067590129E+00 | 8.0000000000E+00 |sol( 9)| | 1.4208711808E+01 | 9.0000000000E+00 |sol(10)| | 9.2202338540E-01 | 1.0000000000E+01 [miriel028.formation.cluster:95393] 1 more process has sent help message help-mpi-btl-openib.txt / error in device init [miriel028.formation.cluster:95393] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
How many iterations do you observe? Why? Note that in the current maphys version, one process is associated with one domain.
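To answer without scrolling through the whole log, you can filter the output; the pattern below matches the "Number of iterations" line visible in the output above:
mpirun -n 2 ./dmph_examplekv simple.in | grep "Number of iterations"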
3.4. An actual hybrid method
mpirun -n 4 ./dmph_examplekv simple.in
Which iterative solver is being used? You can play with grep "CONVERGENCE HISTORY".
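For instance, the following command keeps only the header of the convergence history, which names the Krylov method being used:
mpirun -n 4 ./dmph_examplekv simple.in | grep "CONVERGENCE HISTORY"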
3.5. Exploit SPD properties
MATFILE = bcsstk17.mtx # matrix file
SYM = 1 # the symmetry of the matrix (0=general, 1=spd, 2=symmetric)
Which iterative solver is used now? Why? What are the potential benefits (not representative for a test case as small as the one considered for now)?
3.6. Try running on more subdomains
As mentioned above, in the current maphys version, one process is associated with one domain. In the previous test case, we thus ran on four subdomains. We can try running on more subdomains.
In this set up, we are restricted to a power of 2 number of subdomains. If you run mpirun -n 5 ./dmph_examplekv simple.in, you'll get an error:
ERROR: MaPHyS partitioner only support *a power of 2* subdomains. nb subdomains= 5
Try a run on 16 domains:
mpirun -n 16 ./dmph_examplekv simple.in
You shall get the following error:
ERROR: PART_scotch_partgraph while asserting cblknbr == 2*nbdom -1, cblknbr = 30
3.7. Use a robust algebraic domain decomposition
We'll use the Parallel Algebraic Domain Decomposition for Linear systEms (paddle) to perform the decomposition instead of the dummy internal default maphys algorithm.
Check out the maphys refcard. You'll see that you can activate paddle with ICNTL(49) = 2.
MATFILE = bcsstk17.mtx # matrix file
SYM = 1 # the symmetry of the matrix (0=general, 1=spd, 2=symmetric)
ICNTL(49) = 2 # algebraic domain decomposition algorithm (1=internal dummy algorithm, 2=robust paddle)
Try and run on 16 domains once again in this configuration.
mpirun -n 16 ./dmph_examplekv simple.in
3.8. Backward and forward error analysis
Check out the value of the backward error of the previous execution (you may add | grep "Backward error" at the end of your mpirun command). Is the stopping criterion satisfied, knowing that RCNTL(21) controls this threshold and that (if you did not play with that parameter) you should observe:
Wanted precision RCNTL ( 21) 1.000E-05
Check out the direct (forward) error (you may add | grep "Computed (x) vs analytical (x*) solution" at the end of your mpirun command). How can you explain that? Check out the condition number of bcsstk17: you can view it online.
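As a concrete illustration, the two filters suggested above can be applied to the previous 16-domain run as follows:
mpirun -n 16 ./dmph_examplekv simple.in | grep "Backward error"
mpirun -n 16 ./dmph_examplekv simple.in | grep "Computed (x) vs analytical"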
3.9. Sparse direct solver
Set up ICNTL(13) to run either pastix or mumps as the internal direct solver.
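For instance, the input file could be extended along these lines; the value given to ICNTL(13) below is only an illustration, check the refcard for the value corresponding to each of mumps and pastix:
MATFILE = bcsstk17.mtx # matrix file
SYM = 1 # the symmetry of the matrix (0=general, 1=spd, 2=symmetric)
ICNTL(49) = 2 # algebraic domain decomposition algorithm (1=internal dummy algorithm, 2=robust paddle)
ICNTL(13) = 1 # internal sparse direct solver (see the refcard for the mumps/pastix values)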
3.10. Display more info
Set up ICNTL(5) and ICNTL(6) to view more information.
3.11. Sparsification
So far we were relying on the default preconditioning mechanism, resulting in locally dense block diagonal matrices. Investigate the sparsified block diagonal factorization: the dense diagonal block is first sparsified by dropping entry \(s_{i,j}\) if it is lower than \(\xi ( \vert s_{i,i} \vert + \vert s_{j,j}\vert )\). An additional sparse factorization is then performed in this situation instead of a lapack dense factorisation.
Enable sparsification (ICNTL(21)=2) and set ξ (RCNTL(11)) to 1.0e-5, 1.0e-4, or 1.0e-3.
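A possible input file for this experiment, reusing the parameters already introduced (the RCNTL(11) value is one of the three suggested thresholds):
MATFILE = bcsstk17.mtx # matrix file
SYM = 1 # the symmetry of the matrix (0=general, 1=spd, 2=symmetric)
ICNTL(49) = 2 # algebraic domain decomposition algorithm
ICNTL(21) = 2 # sparsified block diagonal preconditioner
RCNTL(11) = 1.0e-4 # dropping threshold xi (try 1.0e-5, 1.0e-4 and 1.0e-3)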
3.12. Force GMRES
3.12.1. Normal
Continue to let the solver know that the matrix is SPD (SYM=1) while forcing it to use GMRES (ICNTL(20)=1). Note that this is certainly not a good thing to do in practice, but it is interesting to study.
3.12.2. Orthogonalisation
Change the orthogonalisation scheme (ICNTL(22)). Does it affect convergence? How?
3.13. Multiple right-hand sides (fabulous)
Turn on (ICNTL(20)=4) the block Krylov fabulous iterative solver (see more here) and possibly play with ICNTL(29): we are curious.
If you are interested in the fabulous library, you can check this guide to install it and use it directly in your own code.
3.14. Matrix vector product
Play with the ICNTL(27) parameter to assess both implicit and explicit matrix vector products. Which one should be used for large test cases? In 2D? In 3D?
3.15. Exit the salloc and guix environments
exit # guix environment
exit # miriel028
exit # allocation
4. Larger test cases
We propose to study two classes of matrices.
The first class of matrices arises from the modeling of acoustic wave propagation problems in heterogeneous media through hou10ni (yesterday's hands-on). They are complex, non-symmetric matrices.
The second class of matrices was generated via genfem, a parallel Python finite element matrix generator (see details in Section 5.5.2 of the thesis of Louis Poirel, in the context of which the genfem toolbox was designed). They correspond to a stationary heterogeneous diffusion equation (or Darcy equation) in a 3D stratified medium. They are real, symmetric positive definite matrices. The ones discussed here have the form of a baton, hence their name.
If you have matrices arising from your own application that you have uploaded to plafrim, we will be happy to try and process them.
The maphys driver to use is <arith>mph_examplerestart. You should not use paddle with this driver (remove ICNTL(49) from the input file or set it back to 1).
4.1. Test cases
4.1.1. hou10ni test cases (dumped matrices from the previous hands-on)
In the 2D-3D hou10ni hands-on session, you were invited to produce matrices. If you did so, you may directly play with those matrices you produced, and refine the analysis of the solver behaviour in the present session.
As a backup alternative, or, even better, as additional test cases, you may use some matrices we dumped. Here are their descriptions: they differ by the number of processes used to process them and the dimension (2D or 3D).
2D-8-procs
Repository: /home/hpcs-readonly/matrices/maphys/hou10ni_2D_8
Command used for production (you don't need to run it): mpirun -np 8 hou10ni.out < param_marmousi_maphys.txt
param_marmousi_maphys.txt (extraction):
1 p-adaptivity (yes=1,no=0)
4 Ordre des elements
2D-24-procs
Repository: /home/hpcs-readonly/matrices/maphys/hou10ni_2D_24
Command used for production (you don't need to run it): mpirun -np 24 hou10ni.out < param_marmousi_maphys.txt
param_marmousi_maphys.txt (extraction):
1 p-adaptivity (yes=1,no=0)
4 Ordre des elements
3D-8procs
Repository: /home/hpcs-readonly/matrices/maphys/hou10ni_3D_8
Command used for production (you don't need to run it): mpirun -np 8 hou10ni_lite.out < param_simple_3D_maphys.txt
param_simple_3D_maphys.txt (extraction):
1 p-adaptivity (yes=1,no=0)
4 Ordre des elements
3D-24procs
Repository: /home/hpcs-readonly/matrices/maphys/hou10ni_3D_24
Command used for production (you don't need to run it): mpirun -np 24 hou10ni_lite.out < param_simple_3D_maphys.txt
param_simple_3D_maphys.txt (extraction):
1 p-adaptivity (yes=1,no=0)
4 Ordre des elements
4.1.2. baton test cases (generated with genfem)
You can run:
mpirun -np ${SLURM_NTASKS} ./dmph_examplerestart simple.in /home/hpcs-readonly/matrices/maphys/baton_30_${SLURM_NTASKS}procs
You can set SLURM_NTASKS=24, 48 and 96. Note that it is weak scaling along one dimension (the baton becomes longer and longer while the other two dimensions do not change). You should use batch scheduling if you want to run on more than one node (24 processes).
4.2. Batch and run
From now on, we will use the batch system to create jobs. This is the recommended way to use clusters in general, and it will allow us to make runs on a larger number of nodes. Indeed, the jobs will be queued and the job scheduler will automatically run them in order to use the cluster resources as efficiently as possible, which is significantly better than reserving nodes for a long time and using them in interactive mode.
To do so, we create a batch file (we will use the .slurm extension to identify such files). If you are not familiar with those, it consists of a bash script with additional slurm-related information about the run.
Here is a template template.slurm:
#!/bin/bash
#SBATCH -N 1 # Number of nodes
#SBATCH -n 24 # Number of tasks (MPI processes)
#SBATCH -c 1 # Number of threads per task
#SBATCH -p hpc
#SBATCH --job-name=maphys
#SBATCH --exclusive # Do not share reserved nodes with other users
#SBATCH --output=out_maphys_%j
#SBATCH --time=10:00 # Time limit

exec guix environment --pure --preserve=^SLURM maphys --ad-hoc maphys slurm pastix starpu -- \
    mpirun -np ${SLURM_NTASKS} ./dmph_examplerestart simple.in \
    /home/hpcs-readonly/matrices/maphys/baton_30_${SLURM_NTASKS}procs
You can see in the job description the #SBATCH lines setting the slurm variables.
For the moment, leave #SBATCH -c at 1 (or remove the line), as thread parallelism will be investigated later in the hands-on.
The second command loads the guix environment and executes the command directly.
Using \ in bash allows one to write a single command on several lines, for more readability.
To submit the job, use the command sbatch template.slurm.
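A typical submission sequence then looks like this (the output file name depends on the job id reported by sbatch):
sbatch template.slurm
squeue -u $USER # check when the job starts and finishes
cat out_maphys_<jobid> # inspect the solver output once the job has completed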
4.3. Baton test case
The previous batch file allows you to test the baton test case in weak scaling; it is a 3D domain growing in one dimension with the number of processes.
Is it scaling well? Check the time spent in the different parts (RINFO(4) to RINFO(7)).
Does the time of factorization increase? What about the time of solve? Why?
What about the number of iterations (IINFOG(5))?
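One convenient way to compare runs is to filter the relevant lines out of the job outputs; the patterns below match the timing and iteration lines visible in the solver output shown earlier:
grep "Timing ->" out_maphys_*
grep "Nb. of iters" out_maphys_*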
Feel free to explore different parameters for the direct or iterative solver.
4.4. Adding a Coarse Grid Correction
Now rerun the test cases using the Coarse Grid Correction: ICNTL(21)=10.
You can set the number of eigenvectors to be computed for each subdomain with ICNTL(33), which will change the size of the coarse space.
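In the input file, this amounts to something like the following (the ICNTL(33) value below is an arbitrary illustration; experiment with it):
ICNTL(21) = 10 # Coarse Grid Correction preconditioner
ICNTL(33) = 10 # number of eigenvectors computed per subdomain (controls the size of the coarse space)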
Check again the time spent in each part of the computation. Does the Coarse Grid Correction manage to bound the number of iterations ?
4.5. Optimize the Coarse Grid Correction
You can play with ICNTL(51) and ICNTL(52) to try to optimize the time spent to compute the preconditioner and solve the iterative system.
What seems to be the best method? (You can still play with ICNTL(33) to make a bigger or smaller coarse space and see how the methods scale.)
4.6. MPI+threads
We suggest turning off the previous feature to focus on one problem at a time! Set ICNTL(21) back to 1 (or remove it from your input file).
4.6.1. Introduction
The current release of maphys supports two levels of parallelism (MPI+threads). The MPI level is inherent to maphys and cannot be disabled. By default, only the MPI level of parallelism is enabled and the user can optionally activate multithreading using the following control parameters:
- ICNTL(42) = 1: turn on multithreading in maphys;
- ICNTL(37) = number of nodes;
- ICNTL(38) = number of CPU cores on each node (24 on miriel nodes);
- ICNTL(39) = number of threads per MPI process;
- ICNTL(40) = number of processes (also equal to the number of subdomains).
You can also set ICNTL(36) in order to specify the binding you want to use within maphys as follows:
- 0 = do not bind;
- 1 = thread to core binding;
- 2 = group binding on a per-process basis (default).
You can read more in Stojce Nakov's thesis or here, as well as in the maphys users' guide or refcard.
4.6.2. Matrices
To see the effect of thread parallelism, we will work on larger (but fewer) subdomains. You can retrieve them from:
/home/hpcs-readonly/matrices/maphys/baton_50_${SLURM_NTASKS}procs/
with 4 and 8 subdomains (${SLURM_NTASKS}=4 or 8).
Notice that this number of subdomains has been chosen to allow you to use the 24 cores of a node with different numbers of processes and threads:
- 4 processes and 1 thread;
- 4 processes and 2 threads;
- 4 processes and 3 threads;
- 4 processes and 6 threads;
- 8 processes and 1 thread;
- 8 processes and 2 threads;
- 8 processes and 3 threads.
4.6.3. Exercise
- Set up and assess maphys in an MPI+thread set up
Choose a target configuration, say:
- given number of nodes: 1;
- given number of CPU cores per node: 24;
- given number of threads per MPI process: 2;
- given number of MPI processes: 4.
Set up a slurm batch file and a maphys input file (ICNTL(42), ICNTL(37-40)) to run in this configuration. Run and analyze the behaviour of the solver.
- Use slurm variables to make it automatic
Are you ready to make a scalability study? Maybe you want to make the process less error-prone first, by relating maphys variables to slurm ones automatically. Checking the slurm output variables, make the maphys MPI+thread set up automatic as soon as the resources are reserved: 1 full node (24 cores) total with 6 threads per process.
4.6.4. Correction
- Setting up thread parallelism with slurm
In the batch file, you will have to play with #SBATCH -c to set the number of threads, which will set the slurm variable SLURM_CPUS_PER_TASK. You should use it in combination with #SBATCH -n to set the number of MPI processes, which sets the variable SLURM_NTASKS.
- Setting up thread parallelism with maphys
Using thread parallelism in maphys requires playing with the variables described above (ICNTL(42), ICNTL(37-40)). To set those automatically, we will use the following template file threads.in:
SYM = 1
ICNTL(6) = 1
ICNTL(13) = 1
ICNTL(20) = 3 # You can set all the parameters you already know here
ICNTL(42) = 1
ICNTL(37) = @NNODE@
ICNTL(38) = 24
ICNTL(39) = @NTHREADS@
ICNTL(40) = @NDOMAINS@
Now we aim at creating an input file which is automatically generated given a number of threads and processes. Let's make a script to do that using the slurm variables, maphys_thread_test.sh:
#!/bin/bash
RUN_ID=${SLURM_NTASKS}_${SLURM_CPUS_PER_TASK}
INPUT_FILE=thread_${RUN_ID}.in
cp threads.in ${INPUT_FILE}
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
# Substitute the slurm resources into the maphys input file
sed -i "s/@NNODE@/${SLURM_NNODES}/g" ${INPUT_FILE}
sed -i "s/@NTHREADS@/${SLURM_CPUS_PER_TASK}/g" ${INPUT_FILE}
sed -i "s/@NDOMAINS@/${SLURM_NTASKS}/g" ${INPUT_FILE}
mpirun -np ${SLURM_NTASKS} ./dmph_examplerestart ${INPUT_FILE} /home/hpcs-readonly/matrices/maphys/baton_50_${SLURM_NTASKS}procs
Let's make it executable:
chmod +x maphys_thread_test.sh
Now we just have to call this input file from our batch script. For example with 4 processes and 2 threads per process:
#!/bin/bash
#SBATCH -N 1 # Number of nodes
#SBATCH -n 4 # Number of tasks (MPI processes)
#SBATCH -c 2 # Number of threads per task
#SBATCH -p hpc
#SBATCH --job-name=maphys
#SBATCH --exclusive # Do not share reserved nodes with other users
#SBATCH --output=out_threads_maphys_%j
#SBATCH --time=10:00 # Time limit

exec guix environment --pure --preserve=^SLURM maphys --ad-hoc maphys slurm pastix starpu -- ./maphys_thread_test.sh
4.6.5. Scalability study
You can assess the performance you obtain on all seven MPI+threads configurations we proposed. On one node you should already see if thread parallelism is scaling.
You can then try on multiple nodes. You can explore different combinations of number of threads and processes.
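If you want to submit all seven configurations in one go, you can override the #SBATCH directives from the command line (sbatch options take precedence over the ones in the file); the batch file name threads.slurm below is an assumption about how you saved the script above:
for cfg in "4 1" "4 2" "4 3" "4 6" "8 1" "8 2" "8 3"; do
  set -- $cfg
  sbatch -n $1 -c $2 threads.slurm # submit one job per (processes, threads) pair
done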
Does thread parallelism scale well in general?
Assess the performance with both the mumps and pastix internal sparse direct solvers (selected via ICNTL(13)): which one scales better and gives a better factorisation time (RINFO(5)) on the example we are considering?
4.7. hou10ni matrices
Now you have all the cards in your hands; try to optimize the solver time for the maphys solver on the hou10ni matrices!
hou10ni matrices are complex and symmetric (but not definite), so you should use the cmph_examplerestart (single precision) or zmph_examplerestart (double precision, recommended) driver; and don't forget to set the SYM maphys parameter to 2.
As mentioned before, the paths to the hou10ni matrices are:
/home/hpcs-readonly/matrices/maphys/hou10ni_2D_8
/home/hpcs-readonly/matrices/maphys/hou10ni_2D_24
/home/hpcs-readonly/matrices/maphys/hou10ni_3D_8
/home/hpcs-readonly/matrices/maphys/hou10ni_3D_24
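For instance, inside a batch script like the ones above, a run on the 3D matrices dumped on 8 processes could look like the line below; this assumes one MPI process per dumped subdomain, as for the baton cases, and an input file containing SYM = 2:
mpirun -np 8 ./zmph_examplerestart simple.in /home/hpcs-readonly/matrices/maphys/hou10ni_3D_8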
If you had the opportunity to dump some matrices yourself in the previous hands-on sessions, you can of course try to run on them!
5. Build maphys
5.1. salloc and log in to a miriel compute node
salloc -N 1 -p hpc
squeue -u $USER
# assuming you get miriel032
ssh miriel032
5.2. Default maphys build
We clone maphys (with the --recursive option) and we build it:
git clone --recursive https://gitlab.inria.fr/solverstack/maphys/maphys.git
cd maphys
mkdir build
cd build
guix environment --pure maphys --ad-hoc maphys vim zlib -- /bin/bash --norc
cmake ..
make -j 12
We can check that all the symbols are retrieved with:
cd examples
ldd ./dmph_examplekv
We can then run a basic example, in sequential or in parallel with 2 or 4 processes:
./dmph_examplekv real_bcsstk17.in
mpirun -n 2 ./dmph_examplekv real_bcsstk17.in
mpirun -n 4 ./dmph_examplekv real_bcsstk17.in
5.3. Build with pastix-only and mkl
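This step is left as an exercise. A minimal sketch of one possible approach, using a Guix package transformation to substitute mkl for openblas in the build environment, is given below; it assumes an mkl package is visible in your channels (e.g. through the guix-hpc-non-free channel) and that the maphys CMake build honours the standard BLA_VENDOR mechanism, so treat the exact flags as assumptions to check against the maphys documentation:
# Spawn a build environment where openblas is replaced by mkl (assumption: an mkl package is available)
guix environment --pure maphys --with-input=openblas=mkl --ad-hoc maphys vim zlib -- /bin/bash --norc
cd ~/maphys/build
cmake .. -DBLA_VENDOR=Intel10_64lp_seq # hypothetical flag: ask CMake's FindBLAS for sequential MKL
make -j 12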
5.4. Log out from the miriel compute node and quit the salloc environment
5.5. Exit the salloc and guix environments
exit # guix environment
exit # miriel028
exit # allocation
5.6. Performance comparison
On test cases of your choice, compare the performance of maphys with pastix and mkl to maphys with pastix and openblas.