
Multi-GPU Calculation

  • #7534
    Yuji
    Participant

    Hello team openLB,

I am trying GPU calculations.
1. With one GPU, it worked well when I used gpu_only.mk at examples/laminar/cavity3dBenchmark.

2. With 2 GPUs (my workstation has two RTX 3080 Ti cards), it does not work when I use gpu_openmpi_mixed.mk or gpu_openmpi.mk. Both produce the same error message:
`
    The call to cuIpcOpenMemHandle failed. This is an unrecoverable error
    and will cause the program to abort.
    Hostname: user
    cuIpcOpenMemHandle return value: 217
    address: 0x1310200000
    Check the cuda.h file for what the return value means. A possible cause
    for this is not enough free device memory. Try to reduce the device
    memory footprint of your application.
--------------------------------------------------------------------------
    [user:06042] Failed to register remote memory, rc=-1
    [user:06041] Failed to register remote memory, rc=-1
    corrupted size vs. prev_size while consolidating
    [user:06041] *** Process received signal ***
    [user:06041] Signal: Aborted (6)
    [user:06041] Signal code: (-6)
    [user:06037] 1 more process has sent help message help-mpi-common-cuda.txt / cuIpcOpenMemHandle failed
[user:06037] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
`

(The shell command is $ mpirun -np 2 bash -c 'export CUDA_VISIBLE_DEVICES=${OMPI_COMM_WORLD_LOCAL_RANK}; ./cavity3d'.)

Moreover, I tried the command $ mpirun -np 2 ./cavity3d. The error message was:
`
[GPU_CUDA:0] Found 2 CUDA devices but only one can be used per MPI process.
    [GPU_CUDA:1] Found 2 CUDA devices but only one can be used per MPI process.
    corrupted size vs. prev_size while consolidating
    [user:06021] *** Process received signal ***
    [user:06021] Signal: Aborted (6)
[user:06021] Signal code: (-6)
`

Do you know what is wrong?
I am wondering whether I need NVLink to use multiple GPUs.

    Best,

    #7535
    Adrian
    Keymaster

In general you do not need an NVLink interconnect to use multiple GPUs in OpenLB, as MPI will transparently fall back to PCIe device-CPU-device communication (although NVLink is recommended for optimal performance due to its better inter-GPU bandwidth).

I assume that OpenLB did not issue a warning on missing CUDA-awareness of MPI (e.g. "The used MPI Library is not CUDA-aware. Multi-GPU execution will fail.") and that you compiled / installed MPI with CUDA-awareness?

    Can you provide me with more details on your system and software environment? (CUDA versions, modified config.mk and so on)

If you use mpirun -np 2 ./cavity3d only the first of all visible GPUs will be used (as per the warning message). This is why the example configs contain the mpirun -np 2 bash -c 'export CUDA_VISIBLE_DEVICES=${OMPI_COMM_WORLD_LOCAL_RANK}; ./program' command, which assigns each rank its own GPU.
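
As a side note, the same per-rank device assignment can also be moved into a small launcher script (just a sketch; the script name bind-gpu.sh is arbitrary and not part of the shipped examples):

`
#!/bin/bash
# bind-gpu.sh (hypothetical helper): give each local MPI rank its own GPU.
# OMPI_COMM_WORLD_LOCAL_RANK is set by Open MPI's mpirun for every local rank.
export CUDA_VISIBLE_DEVICES=${OMPI_COMM_WORLD_LOCAL_RANK}
exec "$@"
`

which (after chmod +x) would then be launched as mpirun -np 2 ./bind-gpu.sh ./cavity3d.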

    #7601
    Yuji
    Participant

    Dear Adrian,

    Thank you for replying.
I compiled and installed Open MPI with CUDA-awareness, following the OpenLB 1.6 user guide, p. 157.

My CUDA information is:
`
$ nvidia-smi
Tue Jun 27 14:21:43 2023
NVIDIA-SMI 530.30.04    Driver Version: 531.29    CUDA Version: 12.1
GPU 0: NVIDIA GeForce RTX 3080 Ti | Bus-Id 00000000:0E:00.0 | 3154MiB / 12288MiB | 7% Default
GPU 1: NVIDIA GeForce RTX 3080 Ti | Bus-Id 00000000:0F:00.0 |  475MiB / 12288MiB | 0% Default
`

and my config.mk is:
`

    CXX := nvcc
    CC := nvcc

    CXXFLAGS := -O3
    CXXFLAGS += -std=c++17

    PARALLEL_MODE := MPI

    MPIFLAGS := -lmpi_cxx -lmpi

    PLATFORMS := CPU_SISD GPU_CUDA

    # for e.g. RTX 30* (Ampere), see table in rules.mk for other options
    CUDA_ARCH := 86

    FLOATING_POINT_TYPE := float

USE_EMBEDDED_DEPENDENCIES := ON
`

When the CUDA-aware build is used, must I remove the openmpi-bin and libopenmpi-dev packages that were installed via apt?
    Best regards

    #7602
    Adrian
    Keymaster

Ok, this looks as if the wrong MPI version is selected for compilation. Likely you will get a compiler / linker error if you remove openmpi-bin and libopenmpi-dev.

    Does ompi_info --parsable -l 9 --all | grep mpi_built_with_cuda_support:value return true (as per the user guide)? Did you modify the PATH accordingly?

I would suggest making sure that the correct MPI version is selected and using the flags as provided by mpic++ -showme. (Or you can follow the user guide / tech report exactly.)
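
For example, a quick sanity check (assuming the CUDA-aware Open MPI lives under ~/opt/openmpi as in the user guide; adjust the path to your installation) could be:

`
# Which mpic++ / mpirun are found first on the PATH?
which mpic++ mpirun
# Does the selected Open MPI report CUDA support?
ompi_info --parsable -l 9 --all | grep mpi_built_with_cuda_support:value
# Which include / library paths and libraries does it expect?
mpic++ -showme
`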

    #7603
    Yuji
    Participant

    Dear Adrian,

Thank you. OK, I will not remove openmpi-bin and libopenmpi-dev.
As for ompi_info --parsable -l 9 --all | grep mpi_built_with_cuda_support:value, this returns true:
$ ompi_info --parsable -l 9 --all | grep mpi_built_with_cuda_support:value
mca:mpi:base:param:mpi_built_with_cuda_support:value:true

I modified PATH a bit from the user guide.
    My ~/.bashrc has
    export MPI_HOME=$MPI_HOME:$HOME/opt/openmpi
    export PATH=$PATH:$MPI_HOME/bin
    export CPATH=$CPATH:$MPI_HOME/include
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$MPI_HOME/lib

And $ mpic++ -showme shows
g++ -I/home/ubuntu2204/opt/openmpi/include -L/home/ubuntu2204/opt/openmpi/lib -Wl,-rpath -Wl,/home/ubuntu2204/opt/openmpi/lib -Wl,--enable-new-dtags -lmpi
I am not familiar with mpic++ -showme. In this case, does the output show that I should use only -lmpi?

    Best regards

    #7606
    Adrian
    Keymaster

Ok, this seems to be a precedence issue then: if you add -I/home/ubuntu2204/opt/openmpi/include to your CXXFLAGS and -L/home/ubuntu2204/opt/openmpi/lib to the LDFLAGS it should work.

    #7607
    Yuji
    Participant

    Dear Adrian,

Thank you.
When compiling with this config.mk, it seems the build does not pick up -L/home/ubuntu2204/opt/openmpi/lib (LDFLAGS).
Which files should I modify?

    My config.mk is

`
    CXX := nvcc
    CC := nvcc

    CXXFLAGS := -O3
    CXXFLAGS += -std=c++17
    CXXFLAGS += -I/home/ubuntu2204/opt/openmpi/include

    LDFLAGS := -L/home/ubuntu2204/opt/openmpi/lib

    PARALLEL_MODE := MPI

    MPIFLAGS := -lmpi_cxx -lmpi

    PLATFORMS := CPU_SISD GPU_CUDA

    # for e.g. RTX 30* (Ampere), see table in rules.mk for other options
    CUDA_ARCH := 86

    FLOATING_POINT_TYPE := float

    USE_EMBEDDED_DEPENDENCIES := ON
`

And the compile output is:
`
    $ make clean;make
    make CXX=nvcc CC=nvcc -C external clean
    make[1]: Entering directory ‘/mnt/c/Users/shimo/Code_SMJM/olb-1.6r0/external’
    make -C zlib clean
    make[2]: Entering directory ‘/mnt/c/Users/shimo/Code_SMJM/olb-1.6r0/external/zlib’
    make[2]: Leaving directory ‘/mnt/c/Users/shimo/Code_SMJM/olb-1.6r0/external/zlib’
    make -C tinyxml clean
    make[2]: Entering directory ‘/mnt/c/Users/shimo/Code_SMJM/olb-1.6r0/external/tinyxml’
    make[2]: Leaving directory ‘/mnt/c/Users/shimo/Code_SMJM/olb-1.6r0/external/tinyxml’
    rm -f lib/libz.a lib/libtinyxml.a
    make[1]: Leaving directory ‘/mnt/c/Users/shimo/Code_SMJM/olb-1.6r0/external’
    rm -f src/communication/mpiManager.o src/communication/ompManager.o src/core/olbInit.o src/io/ostreamManager.o
    rm -f build/lib/libolbcore.a
    make CXX=nvcc CC=nvcc -C external
    make[1]: Entering directory ‘/mnt/c/Users/shimo/Code_SMJM/olb-1.6r0/external’
    make -C zlib
    make[2]: Entering directory ‘/mnt/c/Users/shimo/Code_SMJM/olb-1.6r0/external/zlib’
    nvcc -c -o build/adler32.o ./adler32.c
    nvcc -c -o build/crc32.o ./crc32.c
    nvcc -c -o build/deflate.o ./deflate.c
    nvcc -c -o build/infback.o ./infback.c
    nvcc -c -o build/inffast.o ./inffast.c
    nvcc -c -o build/inflate.o ./inflate.c
    nvcc -c -o build/inftrees.o ./inftrees.c
    nvcc -c -o build/trees.o ./trees.c
    nvcc -c -o build/zutil.o ./zutil.c
    nvcc -c -o build/compress.o ./compress.c
    nvcc -c -o build/uncompr.o ./uncompr.c
    nvcc -c -o build/gzclose.o ./gzclose.c
    nvcc -c -o build/gzlib.o ./gzlib.c
    nvcc -c -o build/gzread.o ./gzread.c
    nvcc -c -o build/gzwrite.o ./gzwrite.c
    ar rc build//libz.a ./build/adler32.o ./build/crc32.o ./build/deflate.o ./build/infback.o ./build/inffast.o ./build/inflate.o ./build/inftrees.o ./build/trees.o ./build/zutil.o ./build/compress.o ./build/uncompr.o ./build/gzclose.o ./build/gzlib.o ./build/gzread.o ./build/gzwrite.o
    make[2]: Leaving directory ‘/mnt/c/Users/shimo/Code_SMJM/olb-1.6r0/external/zlib’
    cp zlib/build/libz.a lib/
    make -C tinyxml
    make[2]: Entering directory ‘/mnt/c/Users/shimo/Code_SMJM/olb-1.6r0/external/tinyxml’
    nvcc -c tinystr.cpp -o build/tinystr.o
    nvcc -c tinyxml.cpp -o build/tinyxml.o
    nvcc -c tinyxmlerror.cpp -o build/tinyxmlerror.o
    nvcc -c tinyxmlparser.cpp -o build/tinyxmlparser.o
    ar rc build/libtinyxml.a ./build/tinystr.o ./build/tinyxml.o ./build/tinyxmlerror.o ./build/tinyxmlparser.o
    make[2]: Leaving directory ‘/mnt/c/Users/shimo/Code_SMJM/olb-1.6r0/external/tinyxml’
    cp tinyxml/build/libtinyxml.a lib/
    make[1]: Leaving directory ‘/mnt/c/Users/shimo/Code_SMJM/olb-1.6r0/external’
nvcc -O3 -std=c++17 -I/home/ubuntu2204/opt/openmpi/include -pthread -DPARALLEL_MODE_MPI -lmpi_cxx -lmpi --forward-unknown-to-host-compiler -x cu -O3 -std=c++17 --generate-code=arch=compute_86,code=[compute_86,sm_86] --extended-lambda --expt-relaxed-constexpr -rdc=true -Xcudafe "--diag_suppress=implicit_return_from_non_void_function --display_error_number --diag_suppress=20014 --diag_suppress=20011" -DPLATFORM_CPU_SISD -DPLATFORM_GPU_CUDA -DDEFAULT_FLOATING_POINT_TYPE=float -fPIC -Isrc/ -c src/communication/mpiManager.cpp -o src/communication/mpiManager.o
nvcc -O3 -std=c++17 -I/home/ubuntu2204/opt/openmpi/include -pthread -DPARALLEL_MODE_MPI -lmpi_cxx -lmpi --forward-unknown-to-host-compiler -x cu -O3 -std=c++17 --generate-code=arch=compute_86,code=[compute_86,sm_86] --extended-lambda --expt-relaxed-constexpr -rdc=true -Xcudafe "--diag_suppress=implicit_return_from_non_void_function --display_error_number --diag_suppress=20014 --diag_suppress=20011" -DPLATFORM_CPU_SISD -DPLATFORM_GPU_CUDA -DDEFAULT_FLOATING_POINT_TYPE=float -fPIC -Isrc/ -c src/communication/ompManager.cpp -o src/communication/ompManager.o
nvcc -O3 -std=c++17 -I/home/ubuntu2204/opt/openmpi/include -pthread -DPARALLEL_MODE_MPI -lmpi_cxx -lmpi --forward-unknown-to-host-compiler -x cu -O3 -std=c++17 --generate-code=arch=compute_86,code=[compute_86,sm_86] --extended-lambda --expt-relaxed-constexpr -rdc=true -Xcudafe "--diag_suppress=implicit_return_from_non_void_function --display_error_number --diag_suppress=20014 --diag_suppress=20011" -DPLATFORM_CPU_SISD -DPLATFORM_GPU_CUDA -DDEFAULT_FLOATING_POINT_TYPE=float -fPIC -Isrc/ -c src/core/olbInit.cpp -o src/core/olbInit.o
nvcc -O3 -std=c++17 -I/home/ubuntu2204/opt/openmpi/include -pthread -DPARALLEL_MODE_MPI -lmpi_cxx -lmpi --forward-unknown-to-host-compiler -x cu -O3 -std=c++17 --generate-code=arch=compute_86,code=[compute_86,sm_86] --extended-lambda --expt-relaxed-constexpr -rdc=true -Xcudafe "--diag_suppress=implicit_return_from_non_void_function --display_error_number --diag_suppress=20014 --diag_suppress=20011" -DPLATFORM_CPU_SISD -DPLATFORM_GPU_CUDA -DDEFAULT_FLOATING_POINT_TYPE=float -fPIC -Isrc/ -c src/io/ostreamManager.cpp -o src/io/ostreamManager.o
    ar rc build/lib/libolbcore.a src/communication/mpiManager.o src/communication/ompManager.o src/core/olbInit.o src/io/ostreamManager.o
`

    #7612
    Adrian
    Keymaster

    How do you know that it is not actually using the MPI library in /home/ubuntu2204/opt/openmpi/lib? The core library compilation was successful as per the log. Did you already try recompiling one of the examples to confirm whether it still doesn’t work?

    To make sure you can also try to put the CUDA-aware MPI paths first in:

    
    export PATH=$PATH:$MPI_HOME/bin
    export CPATH=$CPATH:$MPI_HOME/include
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$MPI_HOME/lib 
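
i.e. something along the lines of this prepended variant (a sketch, assuming MPI_HOME points at the CUDA-aware installation):

`
export MPI_HOME=$HOME/opt/openmpi
export PATH=$MPI_HOME/bin:$PATH
export CPATH=$MPI_HOME/include:$CPATH
export LD_LIBRARY_PATH=$MPI_HOME/lib:$LD_LIBRARY_PATH
`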
    
    #7613
    Yuji
    Participant

Thank you. Sorry, I noticed that linking against -L/home/ubuntu2204/opt/openmpi/lib was in fact successful when cavity3dBenchmark was built.

I tried running cavity3dBenchmark, however I got an error (the same as the previous one):
`
$ mpirun -np 2 bash -c 'export CUDA_VISIBLE_DEVICES=${OMPI_COMM_WORLD_LOCAL_RANK}; ./cavity3d'
--------------------------------------------------------------------------
    The call to cuIpcOpenMemHandle failed. This is an unrecoverable error
    and will cause the program to abort.
    Hostname: YujiShimojima
    cuIpcOpenMemHandle return value: 217
    address: 0x1310200000
    Check the cuda.h file for what the return value means. A possible cause
    for this is not enough free device memory. Try to reduce the device
    memory footprint of your application.
--------------------------------------------------------------------------
    [user:01607] Failed to register remote memory, rc=-1
    [user:01608] Failed to register remote memory, rc=-1
    corrupted size vs. prev_size while consolidating
    [user:01607] *** Process received signal ***
    [user:01607] Signal: Aborted (6)
    [user:01607] Signal code: (-6)
    [user:01603] 1 more process has sent help message help-mpi-common-cuda.txt / cuIpcOpenMemHandle failed
[user:01603] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
    `

    #7615
    Yuji
    Participant

I noticed that when I tried $ mpirun -np 1 bash -c 'export CUDA_VISIBLE_DEVICES=${OMPI_COMM_WORLD_LOCAL_RANK}; ./cavity3d' it was successful;
the output is
    100, 100, 1, 1, 1, 1515.15

However, I still cannot use 2 GPUs.
The error message is:
`
$ mpirun -np 2 bash -c 'export CUDA_VISIBLE_DEVICES=${OMPI_COMM_WORLD_LOCAL_RANK}; ./cavity3d'
--------------------------------------------------------------------------
    The call to cuIpcOpenMemHandle failed. This is an unrecoverable error
    and will cause the program to abort.
    Hostname: YujiShimojima
    cuIpcOpenMemHandle return value: 217
    address: 0x1310200000
    Check the cuda.h file for what the return value means. A possible cause
    for this is not enough free device memory. Try to reduce the device
    memory footprint of your application.
--------------------------------------------------------------------------
    [YujiShimojima:01335] Failed to register remote memory, rc=-1
    [YujiShimojima:01334] Failed to register remote memory, rc=-1
    corrupted size vs. prev_size while consolidating
    [YujiShimojima:01334] *** Process received signal ***
    [YujiShimojima:01334] Signal: Aborted (6)
    [YujiShimojima:01334] Signal code: (-6)
    [YujiShimojima:01330] 1 more process has sent help message help-mpi-common-cuda.txt / cuIpcOpenMemHandle failed
[YujiShimojima:01330] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
`

    #7619
    Yuji
    Participant

I still cannot solve it.
Could you give me any idea how to run OpenLB calculations on multiple GPUs without NVLink?

    #7620
    Adrian
    Keymaster

    Did you verify whether single-GPU execution works correctly on the other GPU by setting CUDA_VISIBLE_DEVICES to 1?

I will only be back from holiday tomorrow, so I cannot verify right now, but -mca btl_smcuda_use_cuda_ipc 0 should disable NVLink for local GPU-GPU communication in CUDA-aware MPI.
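
For reference, combined with the per-rank device assignment used above this would be something like:

`
mpirun -np 2 --mca btl_smcuda_use_cuda_ipc 0 \
  bash -c 'export CUDA_VISIBLE_DEVICES=${OMPI_COMM_WORLD_LOCAL_RANK}; ./cavity3d'
`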

    #7621
    Yuji
    Participant

    Dear Adrian,

Sorry for interrupting your holiday.

I could run mpirun with -mca btl_smcuda_use_cuda_ipc 0.
    Thank you so much.

By the way, some instructions for the multi-GPU calculation, for example gpu::cuda::device::synchronize();, are written in examples/laminar/cavity3dBenchmark/cavity3d.cpp.

Is it necessary to write these in my .cpp too when running mpirun with multiple GPUs?
I think it is not necessary for a CPU-only mpirun, but I am not sure for GPU.
Moreover, I would like to know where the OpenMPI and OpenMP instructions are located in OpenLB (for both CPU and GPU).
    Best

    #7623
    Adrian
    Keymaster

    No worries 🙂 So the multi-GPU execution now works with the additional flag?

    The gpu::cuda::device::synchronize calls are conditionally enabled only if GPU support is enabled. In general this is an artifact of the work-in-progress nature of heterogeneous computation support in OpenLB. Both this and the SuperLattice::setProcessingContext calls will be transparently hidden in the future.
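
To illustrate the pattern (a hedged sketch following the cavity3dBenchmark example mentioned above; the lattice variable name, the ProcessingContext::Simulation argument and the PLATFORM_GPU_CUDA guard are assumptions, not a definitive reference):

`
// Push host-side initial data to the device before the main loop
// (sLattice and ProcessingContext::Simulation are assumed here).
sLattice.setProcessingContext(ProcessingContext::Simulation);

#ifdef PLATFORM_GPU_CUDA
// Wait for all pending device work, e.g. before timing or VTK output.
gpu::cuda::device::synchronize();
#endif
`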

What do you mean exactly by your last question? Ignoring the mpirun / hardware setup issues, multi-GPU support in OpenLB is transparent in the sense that if A) CUDA-aware MPI support is enabled during compilation and B) the application works on a single GPU, then it will also work in multi-GPU mode.

    OpenMP is only used for CPU-side parallelization on shared memory systems. Most commonly we use it in HYBRID mode for CPU-only simulations (i.e. each socket of a cluster is assigned a single OpenMPI process using OpenMP parallelization internally).
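
As a rough illustration, a CPU-only HYBRID build could use a config.mk along these lines (a sketch only; consult the shipped example configs for the exact options, and note that the OMPFLAGS line in particular is an assumption here):

`
CXX      := mpic++
CXXFLAGS := -O3 -std=c++17 -march=native

# MPI between sockets/nodes, OpenMP threads within each rank
PARALLEL_MODE := HYBRID
OMPFLAGS      := -fopenmp

PLATFORMS := CPU_SISD

USE_EMBEDDED_DEPENDENCIES := ON
`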

    The performance-critical parts of OpenMPI usage are contained in the SuperCommunicator (and its support infrastructure). This is the class responsible for handling all overlap communication between the individual blocks of the decomposition.

    #7625
    Yuji
    Participant

    Dear Adrian,

Yes, the multi-GPU execution now works without NVLink when using the additional flag -mca btl_smcuda_use_cuda_ipc 0. Thank you.

Thank you for the detailed answers.

My last question was about where the code for parallelization lives. Is it in src/communication? I ask because I cannot yet understand why the example cases in OpenLB can run with mpirun without any OpenMPI API calls for a CPU-only simulation (I am looking at cavity3d.cpp in examples/laminar/cavity3d). In that case, only int noCuboids = singleton::mpi().getSize(); is written. In other words, I would like to understand why the other OpenMPI and OpenMP API calls are not written in the example case .cpp files for CPU-only simulation.

Let me confirm my understanding of your answer:
1) For a CPU-only simulation, it is not necessary to write OpenMPI API calls in my case .cpp file (for example MPI_Comm_rank(MPI_COMM_WORLD, &rank); and so on).
2) For a single-GPU simulation, it is also not necessary.
3) For a multi-GPU simulation, it is necessary to write OpenMPI API calls in my case .cpp file. If I do not write them and run with mpirun (for example mpirun -np 2), each GPU runs the same simulation (with 2 GPUs, both would do the same run and produce the same results).

Are my understandings right?

    Best
