Multi GPUs Calculation
- This topic has 26 replies, 2 voices, and was last updated 1 year, 1 month ago by Adrian.
June 23, 2023 at 11:11 am #7534 Yuji (Participant)
Hello team OpenLB,
I am trying GPU calculations.
1. With one GPU, it worked well when I used gpu_only.mk in examples/laminar/cavity3dBenchmark.
2. With 2 GPUs (my workstation has two RTX 3080 Ti cards), it does not work when I use gpu_openmpi_mixed.mk or gpu_openmpi.mk. The error message is the same in both cases:
The call to cuIpcOpenMemHandle failed. This is an unrecoverable error
and will cause the program to abort.
Hostname: user
cuIpcOpenMemHandle return value: 217
address: 0x1310200000
Check the cuda.h file for what the return value means. A possible cause
for this is not enough free device memory. Try to reduce the device
memory footprint of your application.
————————————————————————–
[user:06042] Failed to register remote memory, rc=-1
[user:06041] Failed to register remote memory, rc=-1
corrupted size vs. prev_size while consolidating
[user:06041] *** Process received signal ***
[user:06041] Signal: Aborted (6)
[user:06041] Signal code: (-6)
[user:06037] 1 more process has sent help message help-mpi-common-cuda.txt / cuIpcOpenMemHandle failed
[user:06037] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
(The shell command is $ mpirun -np 2 bash -c 'export CUDA_VISIBLE_DEVICES=${OMPI_COMM_WORLD_LOCAL_RANK}; ./cavity3d')
Moreover, I tried the command $ mpirun -np 2 ./cavity3d. The error message was:
[GPU_CUDA:0] Found 2 CUDA devices but only one can be used per MPI process.
[GPU_CUDA:1] Found 2 CUDA devices but only one can be used per MPI process.
corrupted size vs. prev_size while consolidating
[user:06021] *** Process received signal ***
[user:06021] Signal: Aborted (6)
[user:06021] Signal code: (-6)
Do you know what is wrong?
I am wondering whether I need NVLink in order to use multiple GPUs.
Best,
June 23, 2023 at 1:32 pm #7535 Adrian (Keymaster)
In general you do not need an NVLink interconnect to use multiple GPUs in OpenLB: MPI will transparently fall back to device-CPU-device communication over PCIe. NVLink is still recommended for optimal performance due to its better inter-GPU bandwidth.
I assume that OpenLB did not issue a warning on missing CUDA-awareness of MPI (e.g. “The used MPI Library is not CUDA-aware. Multi-GPU execution will fail.”) and that you compiled / installed MPI with CUDA-awareness?
Can you provide me with more details on your system and software environment? (CUDA versions, modified config.mk and so on)
If you use
mpirun -np 2 ./cavity3d
only the first of all visible GPUs will be used (as per the warning message). This is why the example configs contain the
mpirun -np 2 bash -c 'export CUDA_VISIBLE_DEVICES=${OMPI_COMM_WORLD_LOCAL_RANK}; ./program'
command, which assigns each rank its own GPU.
June 27, 2023 at 7:37 am #7601 Yuji (Participant)
Dear Adrian,
Thank you for replying.
I compiled and installed Open MPI with CUDA-awareness following the OpenLB user guide 1.6, p. 157.
My CUDA information is here:
$ nvidia-smi
Tue Jun 27 14:21:43 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.04              Driver Version: 531.29       CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------|
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3080 Ti      On | 00000000:0E:00.0  On |                  N/A |
|  0%   26C    P8               17W / 350W|   3154MiB / 12288MiB |      7%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 3080 Ti      On | 00000000:0F:00.0 Off |                  N/A |
|  0%   24C    P8                5W / 350W|    475MiB / 12288MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
and my config.mk is here:
CXX := nvcc
CC := nvcc

CXXFLAGS := -O3
CXXFLAGS += -std=c++17

PARALLEL_MODE := MPI
MPIFLAGS := -lmpi_cxx -lmpi

PLATFORMS := CPU_SISD GPU_CUDA

# for e.g. RTX 30* (Ampere), see table in rules.mk for other options
CUDA_ARCH := 86

FLOATING_POINT_TYPE := float
USE_EMBEDDED_DEPENDENCIES := ON
When CUDA-aware MPI is used, must I remove openmpi-bin and libopenmpi-dev, which were installed using apt?
Best regards
June 27, 2023 at 10:34 am #7602 Adrian (Keymaster)
Ok, this looks as if the wrong MPI version is selected for compilation. Likely you will get a compiler / linker error if you remove openmpi-bin and libopenmpi-dev.
Does
ompi_info --parsable -l 9 --all | grep mpi_built_with_cuda_support:value
return true (as per the user guide)? Did you modify the PATH accordingly?
I would suggest making sure that the correct MPI version is set and using the flags as provided by
mpic++ -showme
(or you can follow the user guide / tech report exactly).
June 27, 2023 at 10:48 am #7603 Yuji (Participant)
Dear Adrian,
Thank you. Ok, I will not remove openmpi-bin and libopenmpi-dev.
As for ompi_info --parsable -l 9 --all | grep mpi_built_with_cuda_support:value, it returns true:
$ ompi_info --parsable -l 9 --all | grep mpi_built_with_cuda_support:value
mca:mpi:base:param:mpi_built_with_cuda_support:value:true
I modified PATH a bit compared to the user guide.
My ~/.bashrc has
export MPI_HOME=$MPI_HOME:$HOME/opt/openmpi
export PATH=$PATH:$MPI_HOME/bin
export CPATH=$CPATH:$MPI_HOME/include
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$MPI_HOME/lib
And $ mpic++ -showme shows
g++ -I/home/ubuntu2204/opt/openmpi/include -L/home/ubuntu2204/opt/openmpi/lib -Wl,-rpath -Wl,/home/ubuntu2204/opt/openmpi/lib -Wl,--enable-new-dtags -lmpi
I am not familiar with mpic++ -showme. In this case, does the output show that I can use only -lmpi?
Best regards
June 27, 2023 at 10:57 am #7606 Adrian (Keymaster)
Ok, this seems to be a precedence issue then. If you add
-I/home/ubuntu2204/opt/openmpi/include
to your CXXFLAGS and
-L/home/ubuntu2204/opt/openmpi/lib
to the LDFLAGS it should work.
June 27, 2023 at 11:10 am #7607 Yuji (Participant)
Dear Adrian,
Thank you.
When I compile with this config.mk, -L/home/ubuntu2204/opt/openmpi/lib (LDFLAGS) does not seem to be read.
Which files should I modify?
My config.mk is
`
`
CXX := nvcc
CC := nvcc

CXXFLAGS := -O3
CXXFLAGS += -std=c++17
CXXFLAGS += -I/home/ubuntu2204/opt/openmpi/include

LDFLAGS := -L/home/ubuntu2204/opt/openmpi/lib

PARALLEL_MODE := MPI
MPIFLAGS := -lmpi_cxx -lmpi

PLATFORMS := CPU_SISD GPU_CUDA

# for e.g. RTX 30* (Ampere), see table in rules.mk for other options
CUDA_ARCH := 86

FLOATING_POINT_TYPE := float
USE_EMBEDDED_DEPENDENCIES := ON
`
`
And the build output with this config.mk is
`
`
$ make clean;make
make CXX=nvcc CC=nvcc -C external clean
make[1]: Entering directory ‘/mnt/c/Users/shimo/Code_SMJM/olb-1.6r0/external’
make -C zlib clean
make[2]: Entering directory ‘/mnt/c/Users/shimo/Code_SMJM/olb-1.6r0/external/zlib’
make[2]: Leaving directory ‘/mnt/c/Users/shimo/Code_SMJM/olb-1.6r0/external/zlib’
make -C tinyxml clean
make[2]: Entering directory ‘/mnt/c/Users/shimo/Code_SMJM/olb-1.6r0/external/tinyxml’
make[2]: Leaving directory ‘/mnt/c/Users/shimo/Code_SMJM/olb-1.6r0/external/tinyxml’
rm -f lib/libz.a lib/libtinyxml.a
make[1]: Leaving directory ‘/mnt/c/Users/shimo/Code_SMJM/olb-1.6r0/external’
rm -f src/communication/mpiManager.o src/communication/ompManager.o src/core/olbInit.o src/io/ostreamManager.o
rm -f build/lib/libolbcore.a
make CXX=nvcc CC=nvcc -C external
make[1]: Entering directory ‘/mnt/c/Users/shimo/Code_SMJM/olb-1.6r0/external’
make -C zlib
make[2]: Entering directory ‘/mnt/c/Users/shimo/Code_SMJM/olb-1.6r0/external/zlib’
nvcc -c -o build/adler32.o ./adler32.c
nvcc -c -o build/crc32.o ./crc32.c
nvcc -c -o build/deflate.o ./deflate.c
nvcc -c -o build/infback.o ./infback.c
nvcc -c -o build/inffast.o ./inffast.c
nvcc -c -o build/inflate.o ./inflate.c
nvcc -c -o build/inftrees.o ./inftrees.c
nvcc -c -o build/trees.o ./trees.c
nvcc -c -o build/zutil.o ./zutil.c
nvcc -c -o build/compress.o ./compress.c
nvcc -c -o build/uncompr.o ./uncompr.c
nvcc -c -o build/gzclose.o ./gzclose.c
nvcc -c -o build/gzlib.o ./gzlib.c
nvcc -c -o build/gzread.o ./gzread.c
nvcc -c -o build/gzwrite.o ./gzwrite.c
ar rc build//libz.a ./build/adler32.o ./build/crc32.o ./build/deflate.o ./build/infback.o ./build/inffast.o ./build/inflate.o ./build/inftrees.o ./build/trees.o ./build/zutil.o ./build/compress.o ./build/uncompr.o ./build/gzclose.o ./build/gzlib.o ./build/gzread.o ./build/gzwrite.o
make[2]: Leaving directory ‘/mnt/c/Users/shimo/Code_SMJM/olb-1.6r0/external/zlib’
cp zlib/build/libz.a lib/
make -C tinyxml
make[2]: Entering directory ‘/mnt/c/Users/shimo/Code_SMJM/olb-1.6r0/external/tinyxml’
nvcc -c tinystr.cpp -o build/tinystr.o
nvcc -c tinyxml.cpp -o build/tinyxml.o
nvcc -c tinyxmlerror.cpp -o build/tinyxmlerror.o
nvcc -c tinyxmlparser.cpp -o build/tinyxmlparser.o
ar rc build/libtinyxml.a ./build/tinystr.o ./build/tinyxml.o ./build/tinyxmlerror.o ./build/tinyxmlparser.o
make[2]: Leaving directory ‘/mnt/c/Users/shimo/Code_SMJM/olb-1.6r0/external/tinyxml’
cp tinyxml/build/libtinyxml.a lib/
make[1]: Leaving directory ‘/mnt/c/Users/shimo/Code_SMJM/olb-1.6r0/external’
nvcc -O3 -std=c++17 -I/home/ubuntu2204/opt/openmpi/include -pthread -DPARALLEL_MODE_MPI -lmpi_cxx -lmpi --forward-unknown-to-host-compiler -x cu -O3 -std=c++17 --generate-code=arch=compute_86,code=[compute_86,sm_86] --extended-lambda --expt-relaxed-constexpr -rdc=true -Xcudafe "--diag_suppress=implicit_return_from_non_void_function --display_error_number --diag_suppress=20014 --diag_suppress=20011" -DPLATFORM_CPU_SISD -DPLATFORM_GPU_CUDA -DDEFAULT_FLOATING_POINT_TYPE=float -fPIC -Isrc/ -c src/communication/mpiManager.cpp -o src/communication/mpiManager.o
nvcc -O3 -std=c++17 -I/home/ubuntu2204/opt/openmpi/include -pthread -DPARALLEL_MODE_MPI -lmpi_cxx -lmpi --forward-unknown-to-host-compiler -x cu -O3 -std=c++17 --generate-code=arch=compute_86,code=[compute_86,sm_86] --extended-lambda --expt-relaxed-constexpr -rdc=true -Xcudafe "--diag_suppress=implicit_return_from_non_void_function --display_error_number --diag_suppress=20014 --diag_suppress=20011" -DPLATFORM_CPU_SISD -DPLATFORM_GPU_CUDA -DDEFAULT_FLOATING_POINT_TYPE=float -fPIC -Isrc/ -c src/communication/ompManager.cpp -o src/communication/ompManager.o
nvcc -O3 -std=c++17 -I/home/ubuntu2204/opt/openmpi/include -pthread -DPARALLEL_MODE_MPI -lmpi_cxx -lmpi --forward-unknown-to-host-compiler -x cu -O3 -std=c++17 --generate-code=arch=compute_86,code=[compute_86,sm_86] --extended-lambda --expt-relaxed-constexpr -rdc=true -Xcudafe "--diag_suppress=implicit_return_from_non_void_function --display_error_number --diag_suppress=20014 --diag_suppress=20011" -DPLATFORM_CPU_SISD -DPLATFORM_GPU_CUDA -DDEFAULT_FLOATING_POINT_TYPE=float -fPIC -Isrc/ -c src/core/olbInit.cpp -o src/core/olbInit.o
nvcc -O3 -std=c++17 -I/home/ubuntu2204/opt/openmpi/include -pthread -DPARALLEL_MODE_MPI -lmpi_cxx -lmpi --forward-unknown-to-host-compiler -x cu -O3 -std=c++17 --generate-code=arch=compute_86,code=[compute_86,sm_86] --extended-lambda --expt-relaxed-constexpr -rdc=true -Xcudafe "--diag_suppress=implicit_return_from_non_void_function --display_error_number --diag_suppress=20014 --diag_suppress=20011" -DPLATFORM_CPU_SISD -DPLATFORM_GPU_CUDA -DDEFAULT_FLOATING_POINT_TYPE=float -fPIC -Isrc/ -c src/io/ostreamManager.cpp -o src/io/ostreamManager.o
ar rc build/lib/libolbcore.a src/communication/mpiManager.o src/communication/ompManager.o src/core/olbInit.o src/io/ostreamManager.o
`
`
June 27, 2023 at 11:16 am #7612 Adrian (Keymaster)
How do you know that it is not actually using the MPI library in
/home/ubuntu2204/opt/openmpi/lib
? The core library compilation was successful as per the log. Did you already try recompiling one of the examples to confirm whether it still doesn't work?
To make sure, you can also try to put the CUDA-aware MPI paths first in:
export PATH=$PATH:$MPI_HOME/bin
export CPATH=$CPATH:$MPI_HOME/include
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$MPI_HOME/lib
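A minimal sketch of that reordering, assuming the CUDA-aware build is installed under $HOME/opt/openmpi as in the ~/.bashrc quoted earlier in this thread. Each directory is prepended instead of appended, so it is searched before the system-wide Open MPI installed via apt. Setting MPI_HOME to a single directory (rather than extending a possibly empty value with a colon) is an assumption of this sketch, not something the user guide is quoted as requiring.

```shell
# Assumption: the CUDA-aware Open MPI build lives in $HOME/opt/openmpi.
export MPI_HOME="$HOME/opt/openmpi"

# Prepend, so these directories win over the apt-installed Open MPI.
# The ${VAR:+:$VAR} form avoids a trailing colon when VAR is empty or unset.
export PATH="$MPI_HOME/bin${PATH:+:$PATH}"
export CPATH="$MPI_HOME/include${CPATH:+:$CPATH}"
export LD_LIBRARY_PATH="$MPI_HOME/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"

# Show which mpirun is now found first (or report that none is installed).
command -v mpirun || echo "mpirun not found on PATH"
```

After opening a new shell, the path printed by the last line would be expected to point into $MPI_HOME/bin if the custom build is the one being picked up.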
June 27, 2023 at 11:58 am #7613 Yuji (Participant)
Thank you. Sorry, I noticed that linking with -L/home/ubuntu2204/opt/openmpi/lib was successful when I built cavity3dBenchmark.
I tried running cavity3dBenchmark; however, I got an error (the same as the previous one):
`
$ mpirun -np 2 bash -c 'export CUDA_VISIBLE_DEVICES=${OMPI_COMM_WORLD_LOCAL_RANK}; ./cavity3d'
————————————————————————–
The call to cuIpcOpenMemHandle failed. This is an unrecoverable error
and will cause the program to abort.
Hostname: YujiShimojima
cuIpcOpenMemHandle return value: 217
address: 0x1310200000
Check the cuda.h file for what the return value means. A possible cause
for this is not enough free device memory. Try to reduce the device
memory footprint of your application.
————————————————————————–
[user:01607] Failed to register remote memory, rc=-1
[user:01608] Failed to register remote memory, rc=-1
corrupted size vs. prev_size while consolidating
[user:01607] *** Process received signal ***
[user:01607] Signal: Aborted (6)
[user:01607] Signal code: (-6)
[user:01603] 1 more process has sent help message help-mpi-common-cuda.txt / cuIpcOpenMemHandle failed
[user:01603] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
`
June 29, 2023 at 8:01 am #7615 Yuji (Participant)
I noticed that when I tried $ mpirun -np 1 bash -c 'export CUDA_VISIBLE_DEVICES=${OMPI_COMM_WORLD_LOCAL_RANK}; ./cavity3d' it was successful.
The output is
100, 100, 1, 1, 1, 1515.15
However, I still cannot use 2 GPUs.
The error message is
$ mpirun -np 2 bash -c 'export CUDA_VISIBLE_DEVICES=${OMPI_COMM_WORLD_LOCAL_RANK}; ./cavity3d'
————————————————————————–
The call to cuIpcOpenMemHandle failed. This is an unrecoverable error
and will cause the program to abort.
Hostname: YujiShimojima
cuIpcOpenMemHandle return value: 217
address: 0x1310200000
Check the cuda.h file for what the return value means. A possible cause
for this is not enough free device memory. Try to reduce the device
memory footprint of your application.
————————————————————————–
[YujiShimojima:01335] Failed to register remote memory, rc=-1
[YujiShimojima:01334] Failed to register remote memory, rc=-1
corrupted size vs. prev_size while consolidating
[YujiShimojima:01334] *** Process received signal ***
[YujiShimojima:01334] Signal: Aborted (6)
[YujiShimojima:01334] Signal code: (-6)
[YujiShimojima:01330] 1 more process has sent help message help-mpi-common-cuda.txt / cuIpcOpenMemHandle failed
[YujiShimojima:01330] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
July 11, 2023 at 8:02 am #7619 Yuji (Participant)
I still cannot solve it...
Could you give me any ideas on how to run OpenLB on multiple GPUs without NVLink?
July 11, 2023 at 7:06 pm #7620 Adrian (Keymaster)
Did you verify whether single-GPU execution works correctly on the other GPU by setting CUDA_VISIBLE_DEVICES to 1?
I will only be back from holiday tomorrow so cannot verify right now, but "-mca btl_smcuda_use_cuda_ipc 0" should disable NVLink for local GPU-GPU communication in CUDA-aware MPI.
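For reference, the flag can be combined with the per-rank launch command used earlier in this thread. The sketch below only prints the resulting command line; print_launch_cmd is a hypothetical helper, not part of OpenLB or Open MPI. As a side note, return value 217 corresponds to CUDA_ERROR_PEER_ACCESS_UNSUPPORTED in cuda.h, which would be consistent with peer access between the two cards being unavailable rather than with a lack of device memory; treat that reading of the log as an assumption.

```shell
# Hypothetical helper (not part of OpenLB or Open MPI): prints an mpirun
# command line that gives each rank its own GPU and disables CUDA IPC
# in Open MPI's smcuda BTL.
print_launch_cmd() {
  np="$1"    # number of MPI ranks, one per GPU
  app="$2"   # application binary, e.g. ./cavity3d
  printf "mpirun -np %s -mca btl_smcuda_use_cuda_ipc 0 bash -c 'export CUDA_VISIBLE_DEVICES=\${OMPI_COMM_WORLD_LOCAL_RANK}; %s'\n" "$np" "$app"
}

# Single-GPU check on the second card only (run manually):
#   CUDA_VISIBLE_DEVICES=1 ./cavity3d

# Two ranks on the two local GPUs:
print_launch_cmd 2 ./cavity3d
```

The printed line can be pasted into the shell from the example directory once the single-GPU check on each card has passed.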
July 12, 2023 at 5:55 am #7621 Yuji (Participant)
Dear Adrian,
Sorry for interrupting your holiday.
mpirun works with -mca btl_smcuda_use_cuda_ipc 0.
Thank you so much.
By the way, some instructions for multi-GPU calculation, for example "gpu::cuda::device::synchronize();", are written in examples/laminar/cavity3dBenchmark/cavity3d.cpp.
Do I need to write them in my own .cpp file too in order to run on multiple GPUs with mpirun?
I think they are not necessary for CPU runs with mpirun, but I am not sure about GPU.
Moreover, I would like to know where the Open MPI and OpenMP instructions are in OpenLB (for both CPU and GPU).
Best
July 12, 2023 at 8:18 pm #7623 Adrian (Keymaster)
No worries 🙂 So the multi-GPU execution now works with the additional flag?
The
gpu::cuda::device::synchronize
calls are conditionally enabled only if GPU support is enabled. In general this is an artifact of the work-in-progress nature of heterogeneous computation support in OpenLB. Both this and the
SuperLattice::setProcessingContext
calls will be transparently hidden in the future.
What do you mean exactly by your last question? Ignoring the mpirun / hardware setup issues, multi-GPU support in OpenLB is transparent in the sense that if A) CUDA-aware MPI support is enabled during compilation and B) the application works on a single GPU, then it will also work in multi-GPU mode.
OpenMP is only used for CPU-side parallelization on shared memory systems. Most commonly we use it in HYBRID mode for CPU-only simulations (i.e. each socket of a cluster is assigned a single MPI process using OpenMP parallelization internally).
The performance-critical parts of MPI usage are contained in the
SuperCommunicator
(and its support infrastructure). This is the class responsible for handling all overlap communication between the individual blocks of the decomposition.
July 14, 2023 at 5:52 am #7625 Yuji (Participant)
Dear Adrian,
Yes, the multi-GPU execution now works without NVLink when I add the flag "-mca btl_smcuda_use_cuda_ipc 0". Thank you.
Thank you for the detailed answers.
My last question was about where the parallelization code is located. Is it in src/communication? I ask because I do not yet understand why the example cases in OpenLB can run under mpirun for CPU-only simulation without any MPI API calls (I am looking at cavity3d.cpp in examples/laminar/cavity3d). In that file, only "int noCuboids = singleton::mpi().getSize();" is written. In other words, I would like to understand why no other MPI or OpenMP API calls appear in the example .cpp files for CPU-only simulation.
Let me confirm that I understand your answer correctly:
1) For CPU-only simulation, it is not necessary to write MPI API calls (for example MPI_Comm_rank(MPI_COMM_WORLD, &rank);) in my case .cpp file.
2) For single-GPU simulation, it is also not necessary.
3) For multi-GPU simulation, it is necessary to write MPI API calls in my case .cpp file. If I do not and run with mpirun (for example mpirun -np 2), each GPU performs the same simulation (with 2 GPUs, both would do the same run and produce the same results).
Is my understanding right?
Best