Multi-GPU usage issue

    Hello Developers,
    I tried running the code on multiple GPUs, using the config file intended for multi-GPU systems. However, after logging in to the node via ssh, I see it running on only one GPU device. How can I fix this?

    Thank you.

    Yours sincerely,

    Abhijeet C.


    Just to confirm: You used the config/ example config?

    How exactly did you launch the application and how did you assign each process a single GPU? (Following the comments from the example config)

    Usage on a multi GPU system: (recommended when using MPI, use non-MPI version on single GPU systems)
    - Run `mpirun -np 4 bash -c 'export CUDA_VISIBLE_DEVICES=${OMPI_COMM_WORLD_LOCAL_RANK}; ./cavity3d'`
    (for a 4 GPU system, further process mapping advisable, consult cluster documentation)
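    The per-process GPU assignment from the config comment can also be written as a small wrapper. This is only a sketch: `local_rank` and `bind_gpu_and_run` are made-up names, and the fallback to `SLURM_LOCALID` assumes the job is started through `srun`, where Open MPI's `OMPI_COMM_WORLD_LOCAL_RANK` may not be set before the program starts.

```shell
#!/usr/bin/env bash
# Hypothetical helper: pick the node-local rank from whichever launcher variable is set.
# OMPI_COMM_WORLD_LOCAL_RANK is exported by Open MPI's mpirun, SLURM_LOCALID by srun.
# Falls back to 0 when neither is present (e.g. a plain single-process run).
local_rank() {
  echo "${OMPI_COMM_WORLD_LOCAL_RANK:-${SLURM_LOCALID:-0}}"
}

# Restrict the calling process to one GPU, then replace the shell with the program.
bind_gpu_and_run() {
  export CUDA_VISIBLE_DEVICES="$(local_rank)"
  exec "$@"
}
```

    Saved as a script (say `gpu_wrapper.sh`, again a hypothetical name) that ends with `bind_gpu_and_run "$@"`, it would be launched as `mpirun -np 4 ./gpu_wrapper.sh ./cavity3d`, so that each of the four processes sees exactly one device.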

    Hello Adrian,
    I did use the config/ example config.

    I used this SLURM script:

    #SBATCH --job-name=run1
    #SBATCH --output=run1.out
    #SBATCH --mail-type=ALL
    #SBATCH --partition=gpu
    #SBATCH --nodes=1
    #SBATCH --gpus-per-node=a100:8
    #SBATCH --mem=50gb
    #SBATCH --time=5-00:00:00
    #SBATCH --get-user-env

    CUDA_VISIBLE_DEVICES_SETTING=("0" "0" "0,1" "0,1,2" "0,1,2,3" "0,1,2,3,4" "0,1,2,3,4,5" "0,1,2,3,4,5,6" "0,1,2,3,4,5,6,7" "0,1,2,3,4,5,6,7,8" "0" )

    srun --mpi=pmix_v3 bash -c export CUDA_VISIBLE_DEVICES=${OMPI_COMM_WORLD_LOCAL_RANK}; ./poiseuille3d


    The mpirun command doesn’t work on the cluster for me. I also tried setting the number of cuboids to 8, and I tried using the CUDA_VISIBLE_DEVICES settings array by replacing the srun line above with `srun --mpi=pmix_v3 export env CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES_SETTING[$gpus-per-node]}; ./poiseuille3d`. None of these attempts worked for me.

    I was advised to use cudaSetDevice with the rank. How should I implement it? Is it the right approach to fix this issue?

    I would really appreciate your help to resolve this issue.

    Thank you.

    Yours sincerely,

    Abhijeet C.


    Hello Adrian,

    How exactly did you launch the application and how did you assign each process a single GPU?
    1) I ran the Makefile.
    2) I ran the SLURM script.

    Did I miss anything?


    It looks to me as if only one MPI task is launched per node in your SLURM script. You can try to include a #SBATCH --tasks-per-node=8 setting.

    If your application case calculates the number of cuboids w.r.t. the number of MPI processes (this is the case for OpenLB’s example cases) you do not need to manually change this to 8.

    Edit: Handling more than one GPU from a single process would be a nice addition and would require setting the active device via CUDA, as you indicated. However, this is not included in OpenLB 1.5. There we assume that each process that holds a GPU-based block lattice has access to exactly one default GPU (as configured via the visible device environment variable).
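    A side note on the `srun` line as posted: if the quotes around the `bash -c` argument were really missing in the script (and not just lost when pasting into the forum), the batch shell itself would expand `${OMPI_COMM_WORLD_LOCAL_RANK}` and end the `srun` command at the `;`, so the program would start once, outside the launcher. The sketch below reproduces that shell behaviour with a stand-in launcher. `fake_launch` and `LOCAL_RANK` are hypothetical names for illustration, not part of SLURM or Open MPI.

```shell
#!/usr/bin/env bash
# Stand-in for srun/mpirun (hypothetical): runs the given command once per "rank"
# and sets LOCAL_RANK for it, the way a real launcher sets its rank variable.
fake_launch() {
  local r
  for r in 0 1; do
    LOCAL_RANK=$r "$@"
  done
}

# In the batch shell neither variable exists yet (cleared here to keep the demo deterministic).
unset LOCAL_RANK CUDA_VISIBLE_DEVICES

# Quoted: bash -c receives the whole string, so the echo (standing in for the program)
# runs once per rank and ${LOCAL_RANK} is expanded inside each launched process.
quoted=$(fake_launch bash -c 'export CUDA_VISIBLE_DEVICES=${LOCAL_RANK}; echo "program sees GPU ${CUDA_VISIBLE_DEVICES}"')

# Unquoted: the outer shell expands ${LOCAL_RANK} to nothing and stops the launcher
# command at ';'. The launched part is just "bash -c export" (its variable listing is
# discarded here), and the echo then runs a single time outside the launcher.
unquoted=$(fake_launch bash -c export CUDA_VISIBLE_DEVICES=${LOCAL_RANK} >/dev/null; echo "program sees GPU ${CUDA_VISIBLE_DEVICES:-none}")

printf 'quoted:\n%s\nunquoted:\n%s\n' "$quoted" "$unquoted"
```

    With the quotes in place the stand-in program runs once per rank with its own device index. Without them it runs a single time and no device is selected, which would match a job that only ever shows one GPU in use.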

    Hello Adrian,

    I reverted to the original code for the number of cuboids; however, the number of cuboids is still set to one when I run the code.

    #ifdef PARALLEL_MODE_MPI
    const int noOfCuboids = 2*singleton::mpi().getSize();
    #else // ifdef PARALLEL_MODE_MPI
    const int noOfCuboids = 1;
    #endif // ifdef PARALLEL_MODE_MPI

    I also added the line #SBATCH --tasks-per-node=8 to the SLURM script. It didn’t make any difference.

    It seems to me that the issue stems from MPI: the MPI processes or ranks do not appear to be in sync with the SLURM script. I would appreciate your feedback.

    Thank you.

    Yours sincerely,

    Abhijeet C.


    I am starting to suspect that the case was not actually compiled with MPI support. At the start of the output you should see the number of MPI processes printed by OpenLB. Is this number correct in your SLURM log?

    Replying in more detail to your previous question:

    How exactly did you launch the application and how did you assign each process a single GPU?

    The steps are the following:

    1. Copy the example config config/ into using e.g. cp config/

    2. Edit the to use the correct CUDA_ARCH for your target GPU

    3. Ensure that a CUDA-aware MPI module and CUDA 11.4 or later (for nvcc) are loaded in your build environment

    4. Edit the to use the mpic++ provided CXXFLAGS and LDFLAGS per the config hint

    # CXXFLAGS and LDFLAGS may need to be adjusted depending on the specific MPI installation.
    # Compare to mpicxx --showme:compile and mpicxx --showme:link when in doubt.

    5. Compile the example using make

    6. Update the SLURM script to launch one process per GPU and assign each process a GPU via the CUDA_VISIBLE_DEVICES environment variable. This is what the following command does:

    mpirun bash -c 'export CUDA_VISIBLE_DEVICES=${OMPI_COMM_WORLD_LOCAL_RANK}; ./program'


    To investigate further where the problem lies in your process, it would help if you could share your exact SLURM script and job output, along with more information about your system setup.

    Other approaches are possible depending on the exact environment.
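    One way to collect that kind of information is to let every launched task report what it sees before starting the real program. The helper below is a sketch under assumptions: `report_task_env` is a made-up name, and which of the variables are actually set depends on the launcher and cluster configuration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print what a single launched task sees.
# Variables that are not set are reported as "unset", which makes it visible
# when the launcher does not provide the rank variable a script relies on.
report_task_env() {
  echo "host=$(uname -n)" \
       "ntasks=${SLURM_NTASKS:-unset}" \
       "slurm_local=${SLURM_LOCALID:-unset}" \
       "ompi_local=${OMPI_COMM_WORLD_LOCAL_RANK:-unset}" \
       "visible_gpus=${CUDA_VISIBLE_DEVICES:-unset}"
}
```

    Called from a script that is started with the same `srun` or `mpirun` line as the application, this should print one line per task. On a node with eight tasks one would expect eight lines with distinct local ranks; a single line suggests that only one process was launched.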

