Hands-on sparse linear solvers


This hands-on session is a follow-up of yesterday's 2D-3D hands-on, but it can also be conducted in a standalone fashion: all the required material (matrices) from the previous session is provided as an input to this session.

The pedagogical objectives of this hands-on session are the following.

The instructions below are related to the maphys hybrid direct/iterative solver. The ones for the pastix direct solver are provided here.

1. Introduction

maphys is a software package for the solution of sparse linear systems \[ {\cal A} x = b \] where \({\cal A}\) is a square, real or complex, non-singular general matrix, \(b\) is a given right-hand side vector, and \(x\) is the solution vector to be computed. It follows a non-overlapping algebraic domain decomposition method that first reorders the linear system into a \(2 \times 2\) block system

\begin{equation} \left( \begin{array}{cc} {\cal A}_{II} & {\cal A}_{I\Gamma} \\ {\cal A}_{\Gamma I} & {\cal A}_{\Gamma\Gamma} \\ \end{array} \right) \left( \begin{array}{c} x_{I} \\ x_{\Gamma} \\ \end{array} \right) = \left( \begin{array}{c} b_{I} \\ b_{\Gamma} \\ \end{array} \right), \end{equation}

where \({\cal A}_{II}\) and \({\cal A}_{\Gamma\Gamma}\) respectively represent the interior subdomains and the separators, and \({\cal A}_{I\Gamma}\) and \({\cal A}_{\Gamma I}\) represent the coupling between interiors and separators. By eliminating the unknowns associated with the interior subdomains \({\cal A}_{II}\), we get

\begin{equation} \left( \begin{array}{cc} {\cal A}_{II} & {\cal A}_{I\Gamma} \\ 0 & {\cal S} \\ \end{array} \right) \left( \begin{array}{c} x_{I} \\ x_{\Gamma} \\ \end{array} \right) = \left( \begin{array}{c} b_{I} \\ f \\ \end{array} \right), \end{equation}

with

\begin{equation} {\cal S}={\cal A}_{\Gamma\Gamma}-{\cal A}_{\Gamma I} {\cal A}_{II}^{-1} {\cal A}_{I\Gamma} \; \textrm{ and} \; f = b_\Gamma -{\cal A}_{\Gamma I} {\cal A}_{II}^{-1} b_{I}. \end{equation}

The matrix \({\cal S}\) is referred to as the Schur complement matrix. Because most of the fill-in appears in the Schur complement, the Schur complement system is solved using a preconditioned Krylov subspace method, while the interior subdomain systems are solved using a sparse direct solver. Although the Schur complement system is significantly better conditioned than the original matrix \({\cal A}\), it is important to consider further preconditioning when employing a Krylov method.

In maphys, several overlapping block diagonal preconditioning techniques are implemented, where each diagonal block is associated with the interface of a subdomain:

  • dense block diagonal factorization: each diagonal block is factorized using the appropriate lapack subroutine;
  • sparsified block diagonal factorization: the dense diagonal block is first sparsified by dropping entry \(s_{i,j}\) if it is lower than \(\xi ( \vert s_{i,i} \vert + \vert s_{j,j}\vert )\); the sparse factorization is then performed by a sparse direct solver (mumps or pastix);
  • sparse approximation of the diagonal block: a sparse approximation of the diagonal block is computed by replacing \({\cal A}_{II}^{-1}\) with an incomplete \(ILU(t,p)\) factorization. The computed approximation of the Schur complement is further sparsified.

Because of its diagonal (and consequently local) nature, the preconditioner tends to become numerically less efficient as the number of subdomains increases. The efficiency of maphys therefore results from a trade-off between two contradictory goals: increasing the number of domains to reduce the cost of the sparse factorization of the interior domains on one hand, and reducing the number of domains to ease the iterative solution of the interface system on the other hand.
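As a side note, the dense variant of this block diagonal preconditioner can be written compactly as an additive Schwarz preconditioner on the Schur complement. Denoting by \(R_{\Gamma_i}\) the restriction operator from the global interface \(\Gamma\) to the interface \(\Gamma_i\) of subdomain \(i\), and by \(\bar{\cal S}_i = R_{\Gamma_i} {\cal S} R_{\Gamma_i}^T\) the corresponding assembled diagonal block (notation introduced here only for illustration), it reads

\begin{equation} {\cal M} = \sum_{i} R_{\Gamma_i}^T \, \bar{\cal S}_i^{-1} \, R_{\Gamma_i}. \end{equation}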

2. Getting set up

We use the same guix set up as for the rest of the school.

We now set up maphys.

guix environment --pure --ad-hoc maphys coreutils sed bash -- bash --norc
source $GUIX_ENVIRONMENT/bin/maphys_env.sh

We are now in a guix shell where $MAPHYS_DIR is set up. We go to our home directory (cd), create a hands-on-maphys directory for the purpose of this hands-on session, move there (cd hands-on-maphys) and put there a selection of material we will use for the session. Rather than copying the material (you can do that if you prefer, though), we proceed with symbolic links and cherry-pick three drivers (dmph_examplekv, dmph_examplerestart and zmph_examplerestart) and one matrix (bcsstk17.mtx) with either 1 or 4 right-hand sides (rhs).

cd
mkdir hands-on-maphys
cd hands-on-maphys
ln -s $MAPHYS_DIR/lib/maphys/examples/dmph_examplekv .
ln -s $MAPHYS_DIR/lib/maphys/examples/dmph_examplerestart .
ln -s $MAPHYS_DIR/lib/maphys/examples/zmph_examplerestart .
ln -s $MAPHYS_DIR/lib/maphys/examples/bcsstk17.mtx .
ln -s $MAPHYS_DIR/lib/maphys/examples/bcsstk17_rhs.ijv .
ln -s $MAPHYS_DIR/lib/maphys/examples/bcsstk17_4rhs.ijv .

We exit the shell spawned by the guix environment command we have used.

exit

2.1. Optional: Understanding guix environment further

Can we execute the maphys drivers we have just linked, now that we are outside of the guix environment we used to set them up? Check out the shared libraries this driver requires to get a hint:

ldd ./dmph_examplekv

In this particular case, we observe that all the dependencies are satisfied. We thus do not need to explicitly load maphys in the environment for the next steps now that the binary is built (we will still need to load openmpi though to execute mpirun). Is this satisfying from a reproducibility point of view?

What does the --pure option of guix environment do underneath? Do you expect all the dependencies to still be matched by ldd in a --pure environment? Check it out:

guix environment --pure --ad-hoc coreutils sed bash gcc-toolchain -- ldd ~/hands-on-maphys/dmph_examplekv

You may now better understand the limits of a --pure environment. Check out the --container option of guix environment to achieve an effective isolation and obtain a bit-to-bit reproducible environment. Unfortunately you cannot assess it on plafrim yet.

3. A first exploration of maphys features

3.1. maphys driver

We cherry-picked three drivers above. In this section, we will use one of them: dmph_examplekv. It is the main driver provided by maphys for standalone testing (as opposed to maphys being used as a third-party library of an application, as seen in yesterday's class) in double precision, real arithmetic. maphys supports four precisions and there is a specific driver <arithm>mph_examplekv for each of them, where <arithm> may be:

  • s for single precision, real arithmetic;
  • d for double precision, real arithmetic;
  • c for single precision, complex arithmetic;
  • z for double precision, complex arithmetic.

The driver takes its input information via freely structured files (we will construct a simple.in input file throughout this section). In order to launch these tests, you need to specify the matrix file to process (bcsstk17.mtx in this section) and the maphys solver options you want to set up in an input file whose name is given as argument (simple.in here). These options are described in the input files and in the documentation. We are using maphys version 0.9.8.3; here are the associated refcard and users' guide.

We will play with the bcsstk17.mtx matrix from the SuiteSparse Matrix Collection. Read its properties on the website. Note that it is a real, symmetric positive definite (SPD) \(10,974\) by \(10,974\) matrix with \(428,650\) non zero elements, arising from a structural problem. This one is shipped with maphys but you may browse, download other matrices from the collection, and play with those instead.

Note that, based on the extension of the matrix file, the driver can handle:

  • '.ijv': coordinate format;
  • '.mm', '.mtx': Matrix Market;
  • '.rsa', '.rua', '.rse', '.rue', '.csa', '.cua', '.cse', '.cue': Harwell-Boeing.

The '.psa', '.pua', '.pse', '.pue' Harwell-Boeing extensions are not supported.

3.2. Set up for doing testing with salloc and guix environment

salloc -n 24 -p hpc
squeue -u $USER
ssh miriel<node-id>
guix environment --pure --ad-hoc openmpi openssh vim bash coreutils sed grep  -- bash --norc

Be careful to replace <node-id> by the number of the node assigned to your allocation.

3.3. A first execution

We want to instruct the maphys ./dmph_examplekv driver to run our sample matrix. Copy these lines into a file named simple.in:

MATFILE = bcsstk17.mtx
SYM = 2                # the symmetry of the matrix (0=general, 1=spd, 2=symmetric)

Then launch the driver on 2 MPI processes:

mpirun -n 2 ./dmph_examplekv simple.in

In this run we can observe that we effectively run in real, double precision (Real (KIND=8): double precision). No right-hand side has been provided (no rhs: solution (1, 2, 3, ...) -> b), so the driver generates a solution vector (1, 2, 3, …) and computes the associated right-hand side through a matrix-vector product.

You may observe that "thread multiple" is enabled (which is not supported on mistral01 and may lead to deadlocks right away after the mpi_init).

--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   miriel028
  Local device: qib0
--------------------------------------------------------------------------
 version  is :0.9.8.3
 arith    is :Real   (KIND=8)
 prefix   is :
 matrix   is :bcsstk17.mtx
 rhs      is :
 initguess is :
 outrhsfile is :
 outsolfile is :
 nb of domains is :2
 Have MPI_THREAD_MULTIPLE
 Warning : failed to open rhsfile
 Warning : Generates RHS for x s.t. x_i = i for all i
 Warning : no initial guess


                     ***************************************
                     * MaPHyS  0.9.8.3 [ Real   (KIND=8) ] *
                     ***************************************

[00000] :Domain Decomposition -> local interface -> maximum size =            174
[00000] :Domain Decomposition -> local interface -> minimum size =            174
[00000] :Domain Decomposition -> local interface -> maximum weight =              2
[00000] :Domain Decomposition -> local interface [    0] -> size =        174
[00000] :Domain Decomposition -> local interface [    1] -> size =        174
[00000] :Domain Decomposition -> local interior [    0] -> size =       6418
[00000] :Domain Decomposition -> local interior [    1] -> size =       4382
[00000] :Domain Decomposition -> local interior [    0] -> nnz =     132888
[00000] :Domain Decomposition -> local interior [    1] -> nnz =      82259
                       CONVERGENCE HISTORY FOR CG

Errors are displayed in unit:  6
Warnings are not displayed:
Matrix size:     174
Left preconditioning
Default initial guess x_0 = 0
Stopping criterion backawrd error
Maximum number of iterations:   100
Tolerance for convergence: 0.10E-04
Backward error on the unpreconditioned system Ax = b:
    the residual is normalised by
        0.00E+00 * ||x|| + 0.61E+14
Optimal size for the workspace:   1045

Convergence history: b.e. on the unpreconditioned system
 Iteration  Approx. A-norm  Approx. b.e.    True b.e.
    1            --           0.30E-04         --
    2         0.60E+01        0.65E-05       0.65E-05

Convergence achieved
B.E. on the unpreconditioned system: 0.65E-05
info(1) =  0
Number of iterations (info(2)):     2

      * Solution Quality


Description                    FIELD (INDEX)      VALUE
 -------------------------------------------------------
|A.x-b|/|b|                    RINFOG(    4)  6.510E-06
Wanted precision               RCNTL (   21)  1.000E-05
Nb. of iters.                  IINFOG(    5)          2
Max Nb. of iters.              ICNTL (   24)        100
Schur. system backward error   RINFOG(    3)  6.510E-06


Description                    FIELD (INDEX)        MIN        MAX    AVERAGE   STD.DEV.
 ----------------------------------------------------------------------------------------
Schur -> norm1                 RINFO (    2) -1.000E+00 -1.000E+00 -1.000E+00  0.000E+00
Schur -> rcond1 (estimation)   RINFO (    3) -1.000E+00 -1.000E+00 -1.000E+00  0.000E+00


      * Sizes


Description                   FIELD (INDEX)      VALUE
 ------------------------------------------------------
input matrix -> order         IINFOG(    3)      10974
input matrix -> nb. entries   IINFOG(    4)     219812


Description                       FIELD (INDEX)        MIN        MAX    AVERAGE   STD.DEV.
 -------------------------------------------------------------------------------------------
lc. matrix -> order               IINFO (    4)       4556       6592  5.574E+03  1.018E+03
lc. matrix -> nb. entries         IINFO (    6)      82511     137301  1.099E+05  2.740E+04
lc. matrix -> mem. cost           IINFO (    5)          3          5  4.000E+00  1.000E+00
schur.     -> order               IINFO (   11)        174        174  1.740E+02  0.000E+00
lc. matrix facto. -> mem. cost    IINFO (    9)          9          9  9.000E+00  0.000E+00
schur.     -> mem. cost           IINFO (   12)          1          1  1.000E+00  0.000E+00
schur.     -> nb. entries         IINFO (   13)      30276      30276  3.028E+04  0.000E+00
prcnd.     -> order               IINFO (   16)        174        174  1.740E+02  0.000E+00
prcnd.     -> % of kept entries   IINFO (   19)         69         69  6.900E+01  0.000E+00
prcnd.     -> % of kept entries   IINFO (   19)         69         69  6.900E+01  0.000E+00


      * Timings


Description                                FIELD (INDEX)        MIN        MAX    AVERAGE   STD.DEV.
 ----------------------------------------------------------------------------------------------------
Timing -> Total                            RINFO (   21)  1.844E-01  1.857E-01  1.851E-01  6.544E-04
Timing -> Spent in all steps               RINFO (   20)  1.735E-01  1.833E-01  1.784E-01  4.853E-03
Timing -> Analyze                          RINFO (    4)  1.085E-01  1.087E-01  1.086E-01  9.825E-05
Timing -> Factorise                        RINFO (    5)  5.288E-02  6.249E-02  5.768E-02  4.809E-03
Timing -> Precond                          RINFO (    6)  5.259E-03  5.390E-03  5.325E-03  6.504E-05
Timing -> Solve                            RINFO (    7)  6.742E-03  6.763E-03  6.753E-03  1.044E-05


 == ANALYSIS DATA ==
 Number of domains : 2
 IINFO(4), n local (min/max/avg) :          4556        6592   5574.0000000000000
 IINFO(6), local matrix nnz (min/max/avg) :        82511      137301   109906.00000000000
 IINFO(11), local schur n (min/max/avg) :          174         174   174.00000000000000
 IINFO(13), local schur nnz (min/max/avg):        30276       30276   30276.000000000000
 IINFO(19), percentage of kept entries in schur (min/max/avg) :           69          69   69.000000000000000

 == ESTIMATION OF COMMITED ERROR ==
Backward error:                           |A.x-b|/|b|    norm2 =     6.5100591311E-06
Backward error centralized:               |A.x-b|/|b|    norm2 =     6.5100591311E-06
Computed (x) vs analytical (x*) solution: |x-x*|/|x*|    norm2 =     1.7624358784E-02
Computed (x) vs analytical (x*) solution: |x-x*|/|x*| norm_inf =     6.7049065776E-02

 == FIRST VALUES OF ESTIMATED VS ANALYTICAL SOLUTION ==
          |         Computed (x) |     Theoretical (x*)
_____________________________________________________
|sol( 1)| |     1.0000000000E+00 |     1.0000000000E+00
|sol( 2)| |     7.3379644782E+02 |     2.0000000000E+00
|sol( 3)| |     5.5408891809E+00 |     3.0000000000E+00
|sol( 4)| |     2.6556211773E+00 |     4.0000000000E+00
|sol( 5)| |     5.0000000000E+00 |     5.0000000000E+00
|sol( 6)| |     6.0000000000E+00 |     6.0000000000E+00
|sol( 7)| |     7.0000000000E+00 |     7.0000000000E+00
|sol( 8)| |     6.8067590129E+00 |     8.0000000000E+00
|sol( 9)| |     1.4208711808E+01 |     9.0000000000E+00
|sol(10)| |     9.2202338540E-01 |     1.0000000000E+01
[miriel028.formation.cluster:95393] 1 more process has sent help message help-mpi-btl-openib.txt / error in device init
[miriel028.formation.cluster:95393] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

How many iterations do you observe? Why? Note that in the current maphys version, one process is associated with one domain.

3.4. An actual hybrid method

mpirun -n 4 ./dmph_examplekv simple.in

Which iterative solver is being used? You can play with grep "CONVERGENCE HISTORY" to find out.
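For instance:

mpirun -n 4 ./dmph_examplekv simple.in | grep "CONVERGENCE HISTORY"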

3.5. Exploit SPD properties

MATFILE = bcsstk17.mtx # matrix file
SYM = 1                # the symmetry of the matrix (0=general, 1=spd, 2=symmetric)
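With simple.in updated as above, rerun the driver, for instance on 4 processes, and check the convergence history header again:

mpirun -n 4 ./dmph_examplekv simple.in | grep "CONVERGENCE HISTORY"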

Which iterative solver is used now? Why? What are the potential benefits (not representative for a small test case such as the one considered for now)?

3.6. Try running on more subdomains

As mentioned above, in the current maphys version, one process is associated with one domain. In the previous test case, we thus ran on four subdomains. We can try running on more subdomains.

With the default internal partitioner, we are restricted to a power of 2 number of subdomains in this set up. If you run mpirun -n 5 ./dmph_examplekv simple.in, you'll get an error:

ERROR: MaPHyS partitioner only support *a power of 2* subdomains. nb subdomains=              5

Try a run on 16 domains:

mpirun -n 16 ./dmph_examplekv simple.in

You shall get the following error:

ERROR: PART_scotch_partgraph while asserting cblknbr == 2*nbdom -1, cblknbr =             30

3.7. Use a robust algebraic domain decomposition

We'll use the Parallel Algebraic Domain Decomposition for Linear systEms (paddle) to perform the decomposition instead of the default dummy internal maphys algorithm.

Check out the maphys refcard. You'll see that you can activate paddle with ICNTL(49) = 2.

MATFILE = bcsstk17.mtx # matrix file
SYM = 1                # the symmetry of the matrix (0=general, 1=spd, 2=symmetric)
ICNTL(49) = 2          # algebraic domain decomposition algorithm (1=internal dummy algorithm, 2=robust paddle)

Try and run on 16 domains once again in this configuration.

mpirun -n 16 ./dmph_examplekv simple.in

3.8. Backward and forward error analysis

Check out the value of the backward error of the previous execution (you may append | grep "Backward error" to your mpirun command). Is the stopping criterion satisfied, knowing that RCNTL(21) controls this threshold and that you should observe the following (if you did not play with that parameter)?

Wanted precision               RCNTL (   21)  1.000E-05

Check out the direct error (you may append | grep "Computed (x) vs analytical (x*) solution" to your mpirun command). How can you explain it? Check out the condition number of bcsstk17: you can view it online.
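For instance, to extract both quantities from a run:

mpirun -n 16 ./dmph_examplekv simple.in | grep "Backward error"
mpirun -n 16 ./dmph_examplekv simple.in | grep "Computed (x) vs analytical"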

3.9. Sparse direct solver

Set up ICNTL(13) to run either pastix or mumps as the internal sparse direct solver.
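For instance, you can add a line such as the following to simple.in (the value-to-solver mapping is given in the refcard; the value 1 below is only an example, check which solver it selects):

ICNTL(13) = 1          # internal sparse direct solver (check the refcard for which solver value 1 selects)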

3.10. Display more info

Set up ICNTL(5) and ICNTL(6) to view more information.
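For instance, add verbosity controls to simple.in (the exact levels and their meaning are described in the refcard; the values below are assumed example values):

ICNTL(5) = 1           # verbosity control (assumed example value, see the refcard)
ICNTL(6) = 1           # verbosity control (assumed example value, see the refcard)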

3.11. Sparsification

So far we have been relying on the default preconditioning mechanism, resulting in locally dense block diagonal matrices. Investigate the sparsified block diagonal factorization: the dense diagonal block is first sparsified by dropping entry \(s_{i,j}\) if it is lower than \(\xi ( \vert s_{i,i} \vert + \vert s_{j,j}\vert )\). An additional sparse factorization is then performed in this situation instead of a lapack dense factorisation.

Enable sparsification (ICNTL(21)=2) and set up ξ=RCNTL(11)=1.0e-5, 1.0e-4, or 1.0e-3.
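For instance, your simple.in may now look as follows (the RCNTL(11) value is one of the three suggested above):

MATFILE = bcsstk17.mtx # matrix file
SYM = 1                # the symmetry of the matrix (0=general, 1=spd, 2=symmetric)
ICNTL(49) = 2          # algebraic domain decomposition algorithm (1=internal dummy algorithm, 2=robust paddle)
ICNTL(21) = 2          # sparsified block diagonal preconditioner
RCNTL(11) = 1.0e-4     # dropping threshold xi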

3.12. Force GMRES

3.12.1. Normal

Continue to let the solver know that the matrix is SPD (SYM=1) while forcing it to use GMRES (ICNTL(20)=1). Note that this is certainly not a good thing to do in practice, but it is interesting to study.
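In other words, keep SYM=1 in your input file and add:

ICNTL(20) = 1          # force GMRES as the iterative solver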

3.12.2. Orthogonalisation

Change the orthogonalisation scheme (ICNTL(22)). Does it affect convergence? How?
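For instance (the value-to-scheme mapping, e.g. classical vs modified Gram-Schmidt variants, is given in the refcard; the value below is only a placeholder):

ICNTL(22) = 1          # orthogonalisation scheme (placeholder value, see the refcard)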

3.13. Multiple right-hand sides (fabulous)

Turn on the block Krylov fabulous iterative solver (ICNTL(20)=4, see more here) and possibly play with ICNTL(29): we are curious.
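A minimal sketch of an input file, assuming the right-hand-side file is passed through an RHSFILE keyword (this key name is a guess; check the driver examples or the refcard for the exact one); the bcsstk17_4rhs.ijv file linked earlier provides 4 right-hand sides:

MATFILE = bcsstk17.mtx
RHSFILE = bcsstk17_4rhs.ijv   # hypothetical key name, see the refcard
SYM = 1
ICNTL(20) = 4                 # block Krylov (fabulous) iterative solver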

If you are interested in the fabulous library, you can check this guide to install it and use it directly in your own code.

3.14. Matrix vector product

Play with the ICNTL(27) parameter to assess both implicit and explicit matrix-vector products. Which one should be used for large test cases? In 2D? In 3D?
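For instance (whether a given value selects the implicit or the explicit product is documented in the refcard; the value below is only a placeholder):

ICNTL(27) = 1          # matrix-vector product variant (placeholder value, see the refcard)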

3.15. Exit the guix environment and the salloc allocation

exit # guix environment
exit # miriel028
exit # allocation

4. Larger test cases

We propose to study two classes of matrices.

The first class of matrices arises from the modeling of acoustic wave propagation problems in heterogeneous media through hou10ni (yesterday's hands-on). They are complex, symmetric (but not definite) matrices.

The second class of matrices was generated with genfem, a parallel Python finite element matrix generator (see details in Section 5.5.2 of Louis Poirel's thesis, in the context of which the genfem toolbox has been designed). They correspond to a stationary heterogeneous diffusion equation (or Darcy equation) in a 3D stratified medium. They are real, symmetric positive definite matrices. The ones discussed here have the form of a baton, hence their name.

If you have matrices arising from your own application that you have uploaded to plafrim, we will be happy to try and process them.

The maphys driver to use is <arithm>mph_examplerestart. You should not use paddle with this driver (remove ICNTL(49) from the input file or set it back to 1).

4.1. Test cases

4.1.1. hou10ni test cases (dumped matrices from the previous hands-on)

In the 2D-3D hou10ni hands-on session, you were invited to produce matrices. If you did so you may directly play with those matrices you produced, and refine the analysis of the solver behaviour in the present session.

As a backup alternative, or, even better, as additional test cases, you may use some matrices we dumped. They differ by the number of processes used to produce them and by the dimension (2D or 3D). Here they are.

  1. 2D-8-procs

    Repository: /home/hpcs-readonly/matrices/maphys/hou10ni_2D_8

    Command used for production (you don't need to run it):

    mpirun -np 8 hou10ni.out < param_marmousi_maphys.txt
    

    param_marmousi_maphys.txt (extraction) :

    1 p-adaptivity (yes=1,no=0)
    4 Ordre des elements
    
  2. 2D-24-procs

    Repository: /home/hpcs-readonly/matrices/maphys/hou10ni_2D_24

    Command used for production (you don't need to run it):

    mpirun -np 24 hou10ni.out < param_marmousi_maphys.txt
    

    param_marmousi_maphys.txt:

    1 p-adaptivity (yes=1,no=0)
    4 Ordre des elements
    
  3. 3D-8procs

    Repository: /home/hpcs-readonly/matrices/maphys/hou10ni_3D_8

    Command used for production (you don't need to run it):

    mpirun -np 8 hou10ni_lite.out < param_simple_3D_maphys.txt
    

    param_simple_3D_maphys.txt (extraction):

    1 p-adaptivity (yes=1,no=0)
    4 Ordre des elements
    
  4. 3D-24procs

    Repository: /home/hpcs-readonly/matrices/maphys/hou10ni_3D_24

    Command used for production (you don't need to run it):

    mpirun -np 24 hou10ni_lite.out < param_simple_3D_maphys.txt
    

    param_simple_3D_maphys.txt (extraction):

    1 p-adaptivity (yes=1,no=0)
    4 Ordre des elements
    

4.1.2. baton test cases (generated with genfem)

You can run:

mpirun -np ${SLURM_NTASKS} ./dmph_examplerestart simple.in /home/hpcs-readonly/matrices/maphys/baton_30_${SLURM_NTASKS}procs

You can set SLURM_NTASKS to 24, 48 or 96. Note that this is weak scaling along one dimension (the baton becomes longer and longer while the other two dimensions do not change). You should use batch scheduling if you want to run on more than one node (i.e. more than 24 processes).

4.2. Batch and run

From now on, we will use the batch system to create jobs. This is the recommended way to use clusters in general, and it will allow us to make runs on a larger number of nodes. Indeed, the jobs will be queued and the job scheduler will automatically run them so as to use the cluster resources as efficiently as possible, which is significantly better than reserving nodes for a long time and using them in interactive mode.

To do so, we create a batch file (we will use the .slurm extension to identify such files). If you are not familiar with those, it consists of a bash script with additional slurm-related information about the run.

Here is a template template.slurm:

#!/bin/bash
#SBATCH -N 1 # Number of nodes
#SBATCH -n 24 # Number of tasks (MPI processes)
#SBATCH -c 1 # Number of threads per task
#SBATCH -p hpc
#SBATCH --job-name=maphys
#SBATCH --exclusive # Do not share reserved nodes with other users
#SBATCH --output=out_maphys_%j
#SBATCH --time=10:00 # Time limit

exec guix environment --pure --preserve=^SLURM maphys --ad-hoc maphys slurm pastix starpu -- \
     mpirun -np ${SLURM_NTASKS} ./dmph_examplerestart simple.in \
     /home/hpcs-readonly/matrices/maphys/baton_30_${SLURM_NTASKS}procs

You can see in the template the description of the slurm job, with the #SBATCH lines setting the slurm variables.

For the moment, leave #SBATCH -c set to 1 (or remove the line), as thread parallelism will be investigated later in the hands-on.

The final exec command loads the guix environment and directly executes the mpirun command within it. Using \ in bash allows one to write a single command over several lines, for more readability.

To submit the job, use the command sbatch template.slurm.
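You can then monitor the job and inspect its output once it has completed (the output file name follows the --output=out_maphys_%j pattern set in the template, where %j is the job id printed by sbatch):

squeue -u $USER               # check the state of your job
less out_maphys_<jobid>       # replace <jobid> by the job id printed by sbatch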

4.3. Baton test case

The previous batch allows you to test the baton test case in weak scaling: a 3D domain growing in one dimension with the number of processes.

Is it scaling well? Check the time spent in the different steps (RINFO(4) to RINFO(7)). Does the factorization time increase? What about the solve time? Why?

What about the number of iterations (IINFOG(5))?
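For instance, a quick way to compare timings and iteration counts across runs, relying on the out_maphys_%j output naming of the template above:

grep "Timing ->" out_maphys_*
grep "Nb. of iters." out_maphys_*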

Feel free to explore different parameters for the direct or iterative solver.

4.4. Adding a Coarse Grid Correction

Now rerun the test cases using the Coarse Grid Correction: ICNTL(21)=10. You can set the number of eigenvectors to be computed for each subdomain with ICNTL(33), which will change the size of the coarse space.

Check again the time spent in each part of the computation. Does the Coarse Grid Correction manage to bound the number of iterations?
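For instance, add to your input file (the ICNTL(33) value below is only an example; feel free to vary it):

ICNTL(21) = 10         # coarse grid correction
ICNTL(33) = 10         # number of eigenvectors computed per subdomain (example value)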

4.5. Optimize the Coarse Grid Correction

You can play with ICNTL(51) and ICNTL(52) to try to optimize the time spent to compute the preconditioner and solve the iterative system.

What seems to be the best method? (You can still play with ICNTL(33) to make a bigger or smaller coarse space and see how the methods scale.)

4.6. MPI+threads

We suggest turning off the previous feature to focus on one problem at a time!

Set ICNTL(21) back to 1 (or remove it from your input file).

4.6.1. Introduction

The current release of maphys supports two levels of parallelism (MPI+threads). The MPI level is inherent to maphys and cannot be disabled. By default, only the MPI level of parallelism is enabled, and the user can optionally activate multithreading using the following control parameters:

  • ICNTL(42) = 1: turn on multithreading in maphys;
  • ICNTL(37) = number of nodes;
  • ICNTL(38) = number of CPU cores on each node (24 on miriel nodes);
  • ICNTL(39) = number of threads per MPI process;
  • ICNTL(40) = number of MPI processes (also equal to the number of subdomains).

You can also set ICNTL(36) in order to specify the binding you want to use within maphys as follows:

  • 0 = do not bind;
  • 1 = thread to core binding;
  • 2 = group binding on a per-process basis (default)

You can read more in Stojce Nakov's thesis or here, as well as in the maphys users' guide or refcard.

4.6.2. Matrices

To see the effect of thread parallelism, we will work on larger (but fewer) subdomains. You can retrieve them from:

  • /home/hpcs-readonly/matrices/maphys/baton_50_4procs/
  • /home/hpcs-readonly/matrices/maphys/baton_50_8procs/

with 4 and 8 subdomains (${SLURM_NTASKS}=4 or 8).

Notice that these numbers of subdomains have been chosen to allow you to use the 24 cores of a node with different numbers of processes and threads:

  1. 4 processes and 1 thread;
  2. 4 processes and 2 threads;
  3. 4 processes and 3 threads;
  4. 4 processes and 6 threads;
  5. 8 processes and 1 thread;
  6. 8 processes and 2 threads;
  7. 8 processes and 3 threads.

4.6.3. Exercise

  1. Set up and assess maphys in a MPI+thread set up

    Choose a given target configuration, say:

    • given number of nodes: 1;
    • given number of CPU cores per node: 24;
    • given number of threads per MPI process: 2;
    • given number of MPI processes: 4.

    Set up a slurm batch file and a maphys input file (ICNTL(42), ICNTL(37-40)) to run in this configuration.

    Run and analyze the behaviour of the solver.

  2. Use slurm variables to make it automatic

    Are you ready to make a scalability study? Maybe you want to make the process less error-prone first, by relating the maphys variables to the slurm ones automatically.

    Using the slurm output variables, make the maphys MPI+thread set up automatic as soon as the resources are reserved: 1 full node (24 cores) in total, with 6 threads per process.

4.6.4. Correction

  1. Setting up thread parallelism with slurm

    In the batch file, you will have to play with #SBATCH -c to set the number of threads, which will set the slurm variable SLURM_CPUS_PER_TASK.

    You should use it in combination with #SBATCH -n to set the number of MPI processes, which sets the variable SLURM_NTASKS.

  2. Setting up thread parallelism with maphys

    Using thread parallelism in maphys requires playing with the variables described above (ICNTL(42), ICNTL(37-40)).

    To set those automatically, we will use the following template file threads.in

    SYM = 1
    ICNTL(6) = 1
    ICNTL(13) = 1
    ICNTL(20) = 3
    # You can set all the parameters you already know here
    
    ICNTL(42) = 1
    ICNTL(37) = @NNODE@
    ICNTL(38) = 24
    ICNTL(39) = @NTHREADS@
    ICNTL(40) = @NDOMAINS@
    

    Now we aim at generating the input file automatically, given a number of threads and processes. Let's make a script to do that using the slurm variables:

    maphys_thread_test.sh

    #!/bin/bash
    RUN_ID=${SLURM_NTASKS}_${SLURM_CPUS_PER_TASK}
    INPUT_FILE=thread_${RUN_ID}.in
    cp threads.in ${INPUT_FILE}
    
    export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
    
    sed -i "s/@NNODE@/${SLURM_NNODES}/g" ${INPUT_FILE}
    sed -i "s/@NTHREADS@/${SLURM_CPUS_PER_TASK}/g" ${INPUT_FILE}
    sed -i "s/@NDOMAINS@/${SLURM_NTASKS}/g" ${INPUT_FILE}
    mpirun -np ${SLURM_NTASKS} ./dmph_examplerestart ${INPUT_FILE} /home/hpcs-readonly/matrices/maphys/baton_50_${SLURM_NTASKS}procs
    

    Let's make it executable:

    chmod +x maphys_thread_test.sh
    

    Now we just have to call this input file from our batch script. For example with 4 processes and 2 threads per process:

    #!/bin/bash
    #SBATCH -N 1 # Number of nodes
    #SBATCH -n 4 # Number of tasks (MPI processes)
    #SBATCH -c 2 # Number of threads per task
    #SBATCH -p hpc
    #SBATCH --job-name=maphys
    #SBATCH --exclusive # Do not share reserved nodes with other users
    #SBATCH --output=out_threads_maphys_%j
    #SBATCH --time=10:00 # Time limit
    exec guix environment --pure --preserve=^SLURM maphys --ad-hoc maphys slurm pastix starpu -- ./maphys_thread_test.sh
    

4.6.5. Scalability study

You can assess the performance you obtain on all seven MPI+threads configurations we proposed. On one node, you should already see whether thread parallelism scales.

You can then try on multiple nodes. You can explore different combinations of numbers of threads and processes.

Does thread parallelism scale well in general?

Assess the performance with both the mumps and pastix internal sparse direct solvers (selected with ICNTL(13)): which one scales better and gives a better factorisation time (RINFO(5)) on the example we are considering?
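A quick way to compare the factorisation times across runs, relying on the out_threads_maphys_%j output naming of the batch file above:

grep "Factorise" out_threads_maphys_*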

4.7. hou10ni matrices

Now that you have all the cards in your hands, try to optimize the maphys solver time on the hou10ni matrices!

hou10ni matrices are complex and symmetric (but not definite), so you should use the cmph_examplerestart (single precision) or zmph_examplerestart (double precision, recommended) driver; and don't forget to set the SYM maphys parameter to 2.

As mentioned before, the paths to the hou10ni matrices are:

  • /home/hpcs-readonly/matrices/maphys/hou10ni_2D_8
  • /home/hpcs-readonly/matrices/maphys/hou10ni_2D_24
  • /home/hpcs-readonly/matrices/maphys/hou10ni_3D_8
  • /home/hpcs-readonly/matrices/maphys/hou10ni_3D_24

If you had the opportunity to dump some matrices yourself in the previous hands-on sessions, you can of course try to run on them!

5. Build maphys

5.1. salloc and log in to a miriel compute node

salloc -N 1 -p hpc
squeue -u $USER # assuming you get miriel032
ssh miriel032

5.2. Default maphys build

We clone (with the --recursive option) maphys and we build it:

git clone --recursive https://gitlab.inria.fr/solverstack/maphys/maphys.git
cd maphys
mkdir build
cd build
guix environment --pure maphys --ad-hoc maphys vim zlib -- /bin/bash --norc
cmake ..
make -j 12

We can check that all the symbols are retrieved with:

cd examples
ldd ./dmph_examplekv

We can then run a basic example, in sequential or in parallel with 2 or 4 processes:

./dmph_examplekv real_bcsstk17.in
mpirun -n 2 ./dmph_examplekv real_bcsstk17.in
mpirun -n 4 ./dmph_examplekv real_bcsstk17.in

5.3. Build with pastix-only and mkl

Build maphys with pastix only as an internal sparse direct solver (turning off mumps) and use the mkl dense linear algebra library instead of openblas.

First, make a maphys environment with the pastix 6.0 (default) version. You may want to have a look at the guix refcard for more on building variants.
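Here is a possible sketch, under two assumptions: the guix --with-input package transformation is used to substitute mkl for openblas (the exact package names available on the school's guix channels may differ), and the MAPHYS_SDS_MUMPS CMake option below is a hypothetical flag name for disabling mumps (check the maphys build documentation for the actual one):

guix environment --pure maphys --with-input=openblas=mkl --ad-hoc vim zlib -- /bin/bash --norc
cd ~/maphys/build
cmake .. -DMAPHYS_SDS_MUMPS=OFF   # hypothetical option name to turn off mumps
make -j 12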

5.4. Log out from the miriel compute node and quit salloc environment

5.5. Exit the guix environment and the salloc allocation

exit # guix environment
exit # miriel032
exit # allocation

5.6. Performance comparison

On test cases of your choice, compare the performance of

  • maphys with pastix and mkl
  • to maphys with pastix and openblas.

Author: Inria Bordeaux Sud Ouest

Created: 2021-06-10 Thu 18:01
