CUDA MPI usage in two GeForce RTX 2080 Ti GPUs
November 1, 2022 at 3:32 pm #6938 | jflorezgi (Participant)
Hi,
I have a program that runs properly on a single GPU with gpu_only.mk as config.mk. Now I want to run the same program on two GeForce RTX 2080 Ti graphics cards, so I switched config.mk to gpu_openmpi.mk and started the simulation with
mpirun -np 2 bash -c 'export CUDA_VISIBLE_DEVICES=${OMPI_COMM_WORLD_LOCAL_RANK}; ./GPU-ESVcase01'
but the error shown below appears. If I run the program with the same command but on only one GPU,
mpirun -np 1 bash -c 'export CUDA_VISIBLE_DEVICES=${OMPI_COMM_WORLD_LOCAL_RANK}; ./GPU-ESVcase01'
it runs without problems. I appreciate any help you can give me.

[prepareLattice] Prepare Lattice … OK
[medusa16:398642] Read -1, expected 4561900, errno = 14
[medusa16:398642] *** Process received signal ***
[medusa16:398642] Signal: Segmentation fault (11)
[medusa16:398642] Signal code: Invalid permissions (2)
[medusa16:398642] Failing at address: 0x7f89fa000000
[medusa16:398642] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7f8a8a744090]
[medusa16:398642] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x18b733)[0x7f8a8a88c733]
[medusa16:398642] [ 2] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_btl_vader.so(+0x31c4)[0x7f8a889d61c4]
[medusa16:398642] [ 3] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_pml_ob1.so(mca_pml_ob1_send_request_schedule_once+0x1c6)[0x7f8a889fc926]
[medusa16:398642] [ 4] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_pml_ob1.so(mca_pml_ob1_recv_frag_callback_ack+0x1a9)[0x7f8a889f5429]
[medusa16:398642] [ 5] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_btl_vader.so(mca_btl_vader_poll_handle_frag+0x95)[0x7f8a889d7ed5]
[medusa16:398642] [ 6] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_btl_vader.so(+0x53a3)[0x7f8a889d83a3]
[medusa16:398642] [ 7] /lib/x86_64-linux-gnu/libopen-pal.so.40(opal_progress+0x34)[0x7f8a8a5c1854]
[medusa16:398642] [ 8] /lib/x86_64-linux-gnu/libmpi.so.40(ompi_request_default_test+0x31)[0x7f8a8c7751b1]
[medusa16:398642] [ 9] /lib/x86_64-linux-gnu/libmpi.so.40(PMPI_Test+0x52)[0x7f8a8c7b38d2]
[medusa16:398642] [10] ./GPU-ESVcase01(+0x187ba9)[0x55a902bd4ba9]
[medusa16:398642] [11] ./GPU-ESVcase01(+0xdad5a)[0x55a902b27d5a]
[medusa16:398642] [12] ./GPU-ESVcase01(+0xe2c42)[0x55a902b2fc42]
[medusa16:398642] [13] ./GPU-ESVcase01(+0x69eba)[0x55a902ab6eba]
[medusa16:398642] [14] ./GPU-ESVcase01(+0x471e8)[0x55a902a941e8]
[medusa16:398642] [15] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7f8a8a725083]
[medusa16:398642] [16] ./GPU-ESVcase01(+0x4736e)[0x55a902a9436e]
[medusa16:398642] *** End of error message ***
[medusa16:398643] Read -1, expected 4561900, errno = 14
[medusa16:398643] *** Process received signal ***
[medusa16:398643] Signal: Segmentation fault (11)
[medusa16:398643] Signal code: Invalid permissions (2)
[medusa16:398643] Failing at address: 0x7fb2ac000000
[medusa16:398643] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7fb340836090]
[medusa16:398643] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x18b733)[0x7fb34097e733]
[medusa16:398643] [ 2] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_btl_vader.so(+0x31c4)[0x7fb33eac81c4]
[medusa16:398643] [ 3] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_pml_ob1.so(mca_pml_ob1_send_request_schedule_once+0x1c6)[0x7fb33eaee926]
[medusa16:398643] [ 4] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_pml_ob1.so(mca_pml_ob1_recv_frag_callback_ack+0x1a9)[0x7fb33eae7429]
[medusa16:398643] [ 5] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_btl_vader.so(mca_btl_vader_poll_handle_frag+0x95)[0x7fb33eac9ed5]
[medusa16:398643] [ 6] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_btl_vader.so(+0x53a3)[0x7fb33eaca3a3]
[medusa16:398643] [ 7] /lib/x86_64-linux-gnu/libopen-pal.so.40(opal_progress+0x34)[0x7fb3406b3854]
[medusa16:398643] [ 8] /lib/x86_64-linux-gnu/libmpi.so.40(ompi_request_default_test+0x31)[0x7fb3428671b1]
[medusa16:398643] [ 9] /lib/x86_64-linux-gnu/libmpi.so.40(PMPI_Test+0x52)[0x7fb3428a58d2]
[medusa16:398643] [10] ./GPU-ESVcase01(+0x187ba9)[0x56487ededba9]
[medusa16:398643] [11] ./GPU-ESVcase01(+0xdad5a)[0x56487ed40d5a]
[medusa16:398643] [12] ./GPU-ESVcase01(+0xe2c42)[0x56487ed48c42]
[medusa16:398643] [13] ./GPU-ESVcase01(+0x69eba)[0x56487eccfeba]
[medusa16:398643] [14] ./GPU-ESVcase01(+0x471e8)[0x56487ecad1e8]
[medusa16:398643] [15] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7fb340817083]
[medusa16:398643] [16] ./GPU-ESVcase01(+0x4736e)[0x56487ecad36e]
[medusa16:398643] *** End of error message ***
bash: line 1: 398642 Segmentation fault (core dumped) ./GPU-ESVcase01
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
bash: line 1: 398643 Segmentation fault (core dumped) ./GPU-ESVcase01
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[29179,1],0]
Exit code: 139
--------------------------------------------------------------------------

November 1, 2022 at 7:45 pm #6939 | Adrian (Keymaster)

Your OpenMPI build likely wasn't compiled with CUDA support. CUDA-aware MPI is required for multi-GPU simulations in release 1.5. You can check whether this is available using e.g.
ompi_info --parsable --all | grep mpi_built_with_cuda_support:value
which should return:
mca:mpi:base:param:mpi_built_with_cuda_support:value:true
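A quick way to make sure the check (and the later run) actually refers to the installation your executable links against is, as a minimal sketch using standard Linux tooling:

which mpirun ompi_info                # both should come from the same MPI installation
ldd ./GPU-ESVcase01 | grep libmpi     # the executable should link against that same libmpi
ompi_info --parsable --all | grep mpi_built_with_cuda_support:value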
If you run this on a cluster there is likely already a module available; otherwise you'll have to check how this can be installed on your particular distribution (I'll still be happy to help further). If no package or build option (as e.g. for the declarative Nix shell environment included in the release) is available on your system, you'll have to compile OpenMPI or some other CUDA-aware MPI library manually. One additional option is to use Nvidia's HPC SDK, which includes a CUDA-aware build of OpenMPI (this is the environment I commonly use on our cluster).
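As a rough sketch of the manual route (the version number, download URL, and install prefix below are placeholders to adapt; it assumes CUDA is installed under /usr/local/cuda):

wget https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.4.tar.gz
tar xzf openmpi-4.1.4.tar.gz && cd openmpi-4.1.4
./configure --prefix=$HOME/opt/openmpi-cuda --with-cuda=/usr/local/cuda
make -j$(nproc) && make install
export PATH=$HOME/opt/openmpi-cuda/bin:$PATH
export LD_LIBRARY_PATH=$HOME/opt/openmpi-cuda/lib:$LD_LIBRARY_PATH

After rebuilding the application against this installation, the ompi_info check above should report value:true.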
Sorry for the unhelpful error message; this will be improved in 1.6. The latest release was only the first step in OpenLB GPU support.
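Once a CUDA-aware MPI is in place, the launch command from the original post is already the right pattern (one rank per GPU via OMPI_COMM_WORLD_LOCAL_RANK). An equivalent wrapper-script form, with a hypothetical file name, is sketched below:

#!/bin/bash
# gpu_wrapper.sh: bind each local MPI rank to its own GPU, then start the solver
export CUDA_VISIBLE_DEVICES=${OMPI_COMM_WORLD_LOCAL_RANK}
exec ./GPU-ESVcase01

launched with:

mpirun -np 2 ./gpu_wrapper.sh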