
Multi-GPU Calculation

  • #7534
    Yuji
    Participant

    Hello team openLB,

I am trying GPU calculations.
1. With one GPU, it worked well when I used gpu_only.mk at examples/laminar/cavity3dBenchmark.

2. With 2 GPUs (my workstation has two RTX 3080 Ti cards), it does not work when I use gpu_openmpi_mixed.mk or gpu_openmpi.mk. Both produce the same error message:
`
    The call to cuIpcOpenMemHandle failed. This is an unrecoverable error
    and will cause the program to abort.
    Hostname: user
    cuIpcOpenMemHandle return value: 217
    address: 0x1310200000
    Check the cuda.h file for what the return value means. A possible cause
    for this is not enough free device memory. Try to reduce the device
    memory footprint of your application.
--------------------------------------------------------------------------
    [user:06042] Failed to register remote memory, rc=-1
    [user:06041] Failed to register remote memory, rc=-1
    corrupted size vs. prev_size while consolidating
    [user:06041] *** Process received signal ***
    [user:06041] Signal: Aborted (6)
    [user:06041] Signal code: (-6)
    [user:06037] 1 more process has sent help message help-mpi-common-cuda.txt / cuIpcOpenMemHandle failed
[user:06037] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
`

(The shell command is $ mpirun -np 2 bash -c 'export CUDA_VISIBLE_DEVICES=${OMPI_COMM_WORLD_LOCAL_RANK}; ./cavity3d'.)

Moreover, I tried the command $ mpirun -np 2 ./cavity3d. The error message was:
`
[GPU_CUDA:0] Found 2 CUDA devices but only one can be used per MPI process.
    [GPU_CUDA:1] Found 2 CUDA devices but only one can be used per MPI process.
    corrupted size vs. prev_size while consolidating
    [user:06021] *** Process received signal ***
    [user:06021] Signal: Aborted (6)
[user:06021] Signal code: (-6)
`

Do you know what is wrong?
I am wondering whether I need NVLink to use multiple GPUs.

    Best,

    #7535
    Adrian
    Keymaster

In general you do not need an NVLink interconnect to use multiple GPUs in OpenLB, as MPI will transparently fall back to PCIe device-CPU-device communication (although NVLink is recommended for optimal performance due to its better inter-GPU bandwidth).

I assume that OpenLB did not issue a warning on missing CUDA-awareness of MPI (e.g. "The used MPI Library is not CUDA-aware. Multi-GPU execution will fail.") and that you compiled / installed MPI with CUDA-awareness?

    Can you provide me with more details on your system and software environment? (CUDA versions, modified config.mk and so on)

If you use mpirun -np 2 ./cavity3d only the first of all visible GPUs will be used (as per the warning message). This is why the example configs contain the mpirun -np 2 bash -c 'export CUDA_VISIBLE_DEVICES=${OMPI_COMM_WORLD_LOCAL_RANK}; ./program' command, which assigns each rank its own GPU.
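
As a side note, the same per-rank device assignment can also be moved into a small launcher script (just a sketch; the script name bind-gpu.sh is arbitrary and not part of the shipped examples):

`
#!/bin/bash
# bind-gpu.sh (hypothetical helper): give each local MPI rank its own GPU.
# OMPI_COMM_WORLD_LOCAL_RANK is set by Open MPI's mpirun for every local rank.
export CUDA_VISIBLE_DEVICES=${OMPI_COMM_WORLD_LOCAL_RANK}
exec "$@"
`

which (after chmod +x) would then be launched as mpirun -np 2 ./bind-gpu.sh ./cavity3d.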

    #7601
    Yuji
    Participant

    Dear Adrian,

    Thank you for replying.
I compiled and installed Open MPI with CUDA-awareness, following the OpenLB 1.6 user guide, p. 157.

My CUDA information is:
`
$ nvidia-smi
Tue Jun 27 14:21:43 2023
NVIDIA-SMI 530.30.04    Driver Version: 531.29    CUDA Version: 12.1
GPU 0: NVIDIA GeForce RTX 3080 Ti | Bus-Id 00000000:0E:00.0 | 3154MiB / 12288MiB | 7% Default
GPU 1: NVIDIA GeForce RTX 3080 Ti | Bus-Id 00000000:0F:00.0 |  475MiB / 12288MiB | 0% Default
`

and my config.mk is:
`

    CXX := nvcc
    CC := nvcc

    CXXFLAGS := -O3
    CXXFLAGS += -std=c++17

    PARALLEL_MODE := MPI

    MPIFLAGS := -lmpi_cxx -lmpi

    PLATFORMS := CPU_SISD GPU_CUDA

    # for e.g. RTX 30* (Ampere), see table in rules.mk for other options
    CUDA_ARCH := 86

    FLOATING_POINT_TYPE := float

USE_EMBEDDED_DEPENDENCIES := ON
`

When the CUDA-aware build is used, must I remove the openmpi-bin and libopenmpi-dev packages that were installed via apt?
    Best regards

    #7602
    Adrian
    Keymaster

Ok, this looks as if the wrong MPI version is selected for compilation. Likely you will get a compiler / linker error if you remove openmpi-bin and libopenmpi-dev.

    Does ompi_info --parsable -l 9 --all | grep mpi_built_with_cuda_support:value return true (as per the user guide)? Did you modify the PATH accordingly?

I would suggest making sure that the correct MPI version is selected and using the flags as provided by mpic++ -showme. (Or you can follow the user guide / tech report exactly.)
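
For example, a quick sanity check (assuming the CUDA-aware Open MPI lives under ~/opt/openmpi as in the user guide; adjust the path to your installation) could be:

`
# Which mpic++ / mpirun are found first on the PATH?
which mpic++ mpirun
# Does the selected Open MPI report CUDA support?
ompi_info --parsable -l 9 --all | grep mpi_built_with_cuda_support:value
# Which include / library paths and libraries does it expect?
mpic++ -showme
`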

    #7603
    Yuji
    Participant

    Dear Adrian,

Thank you. OK, I will not remove openmpi-bin and libopenmpi-dev.
As for ompi_info --parsable -l 9 --all | grep mpi_built_with_cuda_support:value, this returns true:
$ ompi_info --parsable -l 9 --all | grep mpi_built_with_cuda_support:value
mca:mpi:base:param:mpi_built_with_cuda_support:value:true

I modified PATH a bit from the user guide.
    My ~/.bashrc has
    export MPI_HOME=$MPI_HOME:$HOME/opt/openmpi
    export PATH=$PATH:$MPI_HOME/bin
    export CPATH=$CPATH:$MPI_HOME/include
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$MPI_HOME/lib

And $ mpic++ -showme shows
g++ -I/home/ubuntu2204/opt/openmpi/include -L/home/ubuntu2204/opt/openmpi/lib -Wl,-rpath -Wl,/home/ubuntu2204/opt/openmpi/lib -Wl,--enable-new-dtags -lmpi
I am not familiar with mpic++ -showme. In this case, does the output show that I should use only -lmpi?

    Best regards

    #7606
    Adrian
    Keymaster

Ok, this seems to be a precedence issue then: if you add -I/home/ubuntu2204/opt/openmpi/include to your CXXFLAGS and -L/home/ubuntu2204/opt/openmpi/lib to the LDFLAGS it should work.

    #7607
    Yuji
    Participant

    Dear Adrian,

Thank you.
When compiling with this config.mk, it seems the build does not pick up -L/home/ubuntu2204/opt/openmpi/lib (LDFLAGS).
Which files should I modify?

    My config.mk is

`
    CXX := nvcc
    CC := nvcc

    CXXFLAGS := -O3
    CXXFLAGS += -std=c++17
    CXXFLAGS += -I/home/ubuntu2204/opt/openmpi/include

    LDFLAGS := -L/home/ubuntu2204/opt/openmpi/lib

    PARALLEL_MODE := MPI

    MPIFLAGS := -lmpi_cxx -lmpi

    PLATFORMS := CPU_SISD GPU_CUDA

    # for e.g. RTX 30* (Ampere), see table in rules.mk for other options
    CUDA_ARCH := 86

    FLOATING_POINT_TYPE := float

    USE_EMBEDDED_DEPENDENCIES := ON
`

And the compile output is:
`
    $ make clean;make
    make CXX=nvcc CC=nvcc -C external clean
    make[1]: Entering directory ‘/mnt/c/Users/shimo/Code_SMJM/olb-1.6r0/external’
    make -C zlib clean
    make[2]: Entering directory ‘/mnt/c/Users/shimo/Code_SMJM/olb-1.6r0/external/zlib’
    make[2]: Leaving directory ‘/mnt/c/Users/shimo/Code_SMJM/olb-1.6r0/external/zlib’
    make -C tinyxml clean
    make[2]: Entering directory ‘/mnt/c/Users/shimo/Code_SMJM/olb-1.6r0/external/tinyxml’
    make[2]: Leaving directory ‘/mnt/c/Users/shimo/Code_SMJM/olb-1.6r0/external/tinyxml’
    rm -f lib/libz.a lib/libtinyxml.a
    make[1]: Leaving directory ‘/mnt/c/Users/shimo/Code_SMJM/olb-1.6r0/external’
    rm -f src/communication/mpiManager.o src/communication/ompManager.o src/core/olbInit.o src/io/ostreamManager.o
    rm -f build/lib/libolbcore.a
    make CXX=nvcc CC=nvcc -C external
    make[1]: Entering directory ‘/mnt/c/Users/shimo/Code_SMJM/olb-1.6r0/external’
    make -C zlib
    make[2]: Entering directory ‘/mnt/c/Users/shimo/Code_SMJM/olb-1.6r0/external/zlib’
    nvcc -c -o build/adler32.o ./adler32.c
    nvcc -c -o build/crc32.o ./crc32.c
    nvcc -c -o build/deflate.o ./deflate.c
    nvcc -c -o build/infback.o ./infback.c
    nvcc -c -o build/inffast.o ./inffast.c
    nvcc -c -o build/inflate.o ./inflate.c
    nvcc -c -o build/inftrees.o ./inftrees.c
    nvcc -c -o build/trees.o ./trees.c
    nvcc -c -o build/zutil.o ./zutil.c
    nvcc -c -o build/compress.o ./compress.c
    nvcc -c -o build/uncompr.o ./uncompr.c
    nvcc -c -o build/gzclose.o ./gzclose.c
    nvcc -c -o build/gzlib.o ./gzlib.c
    nvcc -c -o build/gzread.o ./gzread.c
    nvcc -c -o build/gzwrite.o ./gzwrite.c
    ar rc build//libz.a ./build/adler32.o ./build/crc32.o ./build/deflate.o ./build/infback.o ./build/inffast.o ./build/inflate.o ./build/inftrees.o ./build/trees.o ./build/zutil.o ./build/compress.o ./build/uncompr.o ./build/gzclose.o ./build/gzlib.o ./build/gzread.o ./build/gzwrite.o
    make[2]: Leaving directory ‘/mnt/c/Users/shimo/Code_SMJM/olb-1.6r0/external/zlib’
    cp zlib/build/libz.a lib/
    make -C tinyxml
    make[2]: Entering directory ‘/mnt/c/Users/shimo/Code_SMJM/olb-1.6r0/external/tinyxml’
    nvcc -c tinystr.cpp -o build/tinystr.o
    nvcc -c tinyxml.cpp -o build/tinyxml.o
    nvcc -c tinyxmlerror.cpp -o build/tinyxmlerror.o
    nvcc -c tinyxmlparser.cpp -o build/tinyxmlparser.o
    ar rc build/libtinyxml.a ./build/tinystr.o ./build/tinyxml.o ./build/tinyxmlerror.o ./build/tinyxmlparser.o
    make[2]: Leaving directory ‘/mnt/c/Users/shimo/Code_SMJM/olb-1.6r0/external/tinyxml’
    cp tinyxml/build/libtinyxml.a lib/
    make[1]: Leaving directory ‘/mnt/c/Users/shimo/Code_SMJM/olb-1.6r0/external’
nvcc -O3 -std=c++17 -I/home/ubuntu2204/opt/openmpi/include -pthread -DPARALLEL_MODE_MPI -lmpi_cxx -lmpi --forward-unknown-to-host-compiler -x cu -O3 -std=c++17 --generate-code=arch=compute_86,code=[compute_86,sm_86] --extended-lambda --expt-relaxed-constexpr -rdc=true -Xcudafe "--diag_suppress=implicit_return_from_non_void_function --display_error_number --diag_suppress=20014 --diag_suppress=20011" -DPLATFORM_CPU_SISD -DPLATFORM_GPU_CUDA -DDEFAULT_FLOATING_POINT_TYPE=float -fPIC -Isrc/ -c src/communication/mpiManager.cpp -o src/communication/mpiManager.o
nvcc -O3 -std=c++17 -I/home/ubuntu2204/opt/openmpi/include -pthread -DPARALLEL_MODE_MPI -lmpi_cxx -lmpi --forward-unknown-to-host-compiler -x cu -O3 -std=c++17 --generate-code=arch=compute_86,code=[compute_86,sm_86] --extended-lambda --expt-relaxed-constexpr -rdc=true -Xcudafe "--diag_suppress=implicit_return_from_non_void_function --display_error_number --diag_suppress=20014 --diag_suppress=20011" -DPLATFORM_CPU_SISD -DPLATFORM_GPU_CUDA -DDEFAULT_FLOATING_POINT_TYPE=float -fPIC -Isrc/ -c src/communication/ompManager.cpp -o src/communication/ompManager.o
nvcc -O3 -std=c++17 -I/home/ubuntu2204/opt/openmpi/include -pthread -DPARALLEL_MODE_MPI -lmpi_cxx -lmpi --forward-unknown-to-host-compiler -x cu -O3 -std=c++17 --generate-code=arch=compute_86,code=[compute_86,sm_86] --extended-lambda --expt-relaxed-constexpr -rdc=true -Xcudafe "--diag_suppress=implicit_return_from_non_void_function --display_error_number --diag_suppress=20014 --diag_suppress=20011" -DPLATFORM_CPU_SISD -DPLATFORM_GPU_CUDA -DDEFAULT_FLOATING_POINT_TYPE=float -fPIC -Isrc/ -c src/core/olbInit.cpp -o src/core/olbInit.o
nvcc -O3 -std=c++17 -I/home/ubuntu2204/opt/openmpi/include -pthread -DPARALLEL_MODE_MPI -lmpi_cxx -lmpi --forward-unknown-to-host-compiler -x cu -O3 -std=c++17 --generate-code=arch=compute_86,code=[compute_86,sm_86] --extended-lambda --expt-relaxed-constexpr -rdc=true -Xcudafe "--diag_suppress=implicit_return_from_non_void_function --display_error_number --diag_suppress=20014 --diag_suppress=20011" -DPLATFORM_CPU_SISD -DPLATFORM_GPU_CUDA -DDEFAULT_FLOATING_POINT_TYPE=float -fPIC -Isrc/ -c src/io/ostreamManager.cpp -o src/io/ostreamManager.o
    ar rc build/lib/libolbcore.a src/communication/mpiManager.o src/communication/ompManager.o src/core/olbInit.o src/io/ostreamManager.o
`

    #7612
    Adrian
    Keymaster

    How do you know that it is not actually using the MPI library in /home/ubuntu2204/opt/openmpi/lib? The core library compilation was successful as per the log. Did you already try recompiling one of the examples to confirm whether it still doesn’t work?

    To make sure you can also try to put the CUDA-aware MPI paths first in:

    
    export PATH=$PATH:$MPI_HOME/bin
    export CPATH=$CPATH:$MPI_HOME/include
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$MPI_HOME/lib 
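
i.e. something along the lines of this prepended variant (a sketch, assuming MPI_HOME points at the CUDA-aware installation):

`
export MPI_HOME=$HOME/opt/openmpi
export PATH=$MPI_HOME/bin:$PATH
export CPATH=$MPI_HOME/include:$CPATH
export LD_LIBRARY_PATH=$MPI_HOME/lib:$LD_LIBRARY_PATH
`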
    
    #7613
    Yuji
    Participant

Thank you. Sorry, I noticed that linking against -L/home/ubuntu2204/opt/openmpi/lib was in fact successful when cavity3dBenchmark was built.

I tried running cavity3dBenchmark, however I got an error (the same as the previous one):
`
$ mpirun -np 2 bash -c 'export CUDA_VISIBLE_DEVICES=${OMPI_COMM_WORLD_LOCAL_RANK}; ./cavity3d'
--------------------------------------------------------------------------
    The call to cuIpcOpenMemHandle failed. This is an unrecoverable error
    and will cause the program to abort.
    Hostname: YujiShimojima
    cuIpcOpenMemHandle return value: 217
    address: 0x1310200000
    Check the cuda.h file for what the return value means. A possible cause
    for this is not enough free device memory. Try to reduce the device
    memory footprint of your application.
--------------------------------------------------------------------------
    [user:01607] Failed to register remote memory, rc=-1
    [user:01608] Failed to register remote memory, rc=-1
    corrupted size vs. prev_size while consolidating
    [user:01607] *** Process received signal ***
    [user:01607] Signal: Aborted (6)
    [user:01607] Signal code: (-6)
    [user:01603] 1 more process has sent help message help-mpi-common-cuda.txt / cuIpcOpenMemHandle failed
[user:01603] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
    `

    #7615
    Yuji
    Participant

I noticed that when I tried $ mpirun -np 1 bash -c 'export CUDA_VISIBLE_DEVICES=${OMPI_COMM_WORLD_LOCAL_RANK}; ./cavity3d' it was successful;
the output is
    100, 100, 1, 1, 1, 1515.15

However, I still cannot use 2 GPUs.
The error message is:
`
$ mpirun -np 2 bash -c 'export CUDA_VISIBLE_DEVICES=${OMPI_COMM_WORLD_LOCAL_RANK}; ./cavity3d'
--------------------------------------------------------------------------
    The call to cuIpcOpenMemHandle failed. This is an unrecoverable error
    and will cause the program to abort.
    Hostname: YujiShimojima
    cuIpcOpenMemHandle return value: 217
    address: 0x1310200000
    Check the cuda.h file for what the return value means. A possible cause
    for this is not enough free device memory. Try to reduce the device
    memory footprint of your application.
--------------------------------------------------------------------------
    [YujiShimojima:01335] Failed to register remote memory, rc=-1
    [YujiShimojima:01334] Failed to register remote memory, rc=-1
    corrupted size vs. prev_size while consolidating
    [YujiShimojima:01334] *** Process received signal ***
    [YujiShimojima:01334] Signal: Aborted (6)
    [YujiShimojima:01334] Signal code: (-6)
    [YujiShimojima:01330] 1 more process has sent help message help-mpi-common-cuda.txt / cuIpcOpenMemHandle failed
[YujiShimojima:01330] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
`

    #7619
    Yuji
    Participant

I still cannot solve it.
Could you give me any idea how to run OpenLB calculations on multiple GPUs without NVLink?

    #7620
    Adrian
    Keymaster

    Did you verify whether single-GPU execution works correctly on the other GPU by setting CUDA_VISIBLE_DEVICES to 1?

I will only be back from holiday tomorrow, so I cannot verify right now, but -mca btl_smcuda_use_cuda_ipc 0 should disable NVLink for local GPU-GPU communication in CUDA-aware MPI.
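
For reference, combined with the per-rank device assignment used above this would be something like:

`
mpirun -np 2 --mca btl_smcuda_use_cuda_ipc 0 \
  bash -c 'export CUDA_VISIBLE_DEVICES=${OMPI_COMM_WORLD_LOCAL_RANK}; ./cavity3d'
`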

    #7621
    Yuji
    Participant

    Dear Adrian,

Sorry for interrupting your holiday.

I could run mpirun with -mca btl_smcuda_use_cuda_ipc 0.
    Thank you so much.

By the way, some instructions for the multi-GPU calculation, for example gpu::cuda::device::synchronize();, are written in examples/laminar/cavity3dBenchmark/cavity3d.cpp.

Is it necessary to write these in my .cpp too when running mpirun with multiple GPUs?
I think it is not necessary for a CPU-only mpirun, but I am not sure for GPU.
Moreover, I would like to know where the OpenMPI and OpenMP instructions are located in OpenLB (for both CPU and GPU).
    Best

    #7623
    Adrian
    Keymaster

    No worries 🙂 So the multi-GPU execution now works with the additional flag?

    The gpu::cuda::device::synchronize calls are conditionally enabled only if GPU support is enabled. In general this is an artifact of the work-in-progress nature of heterogeneous computation support in OpenLB. Both this and the SuperLattice::setProcessingContext calls will be transparently hidden in the future.
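
To illustrate the pattern (a hedged sketch following the cavity3dBenchmark example mentioned above; the lattice variable name, the ProcessingContext::Simulation argument and the PLATFORM_GPU_CUDA guard are assumptions, not a definitive reference):

`
// Push host-side initial data to the device before the main loop
// (sLattice and ProcessingContext::Simulation are assumed here).
sLattice.setProcessingContext(ProcessingContext::Simulation);

#ifdef PLATFORM_GPU_CUDA
// Wait for all pending device work, e.g. before timing or VTK output.
gpu::cuda::device::synchronize();
#endif
`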

What do you mean exactly by your last question? Ignoring the mpirun / hardware setup issues, multi-GPU support in OpenLB is transparent in the sense that if A) CUDA-aware MPI support is enabled during compilation and B) the application works on a single GPU, then it will also work in multi-GPU mode.

    OpenMP is only used for CPU-side parallelization on shared memory systems. Most commonly we use it in HYBRID mode for CPU-only simulations (i.e. each socket of a cluster is assigned a single OpenMPI process using OpenMP parallelization internally).
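
As a rough illustration, a CPU-only HYBRID build could use a config.mk along these lines (a sketch only; consult the shipped example configs for the exact options, and note that the OMPFLAGS line in particular is an assumption here):

`
CXX      := mpic++
CXXFLAGS := -O3 -std=c++17 -march=native

# MPI between sockets/nodes, OpenMP threads within each rank
PARALLEL_MODE := HYBRID
OMPFLAGS      := -fopenmp

PLATFORMS := CPU_SISD

USE_EMBEDDED_DEPENDENCIES := ON
`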

    The performance-critical parts of OpenMPI usage are contained in the SuperCommunicator (and its support infrastructure). This is the class responsible for handling all overlap communication between the individual blocks of the decomposition.

    #7625
    Yuji
    Participant

    Dear Adrian,

Yes, the multi-GPU execution now works without NVLink when using the additional flag -mca btl_smcuda_use_cuda_ipc 0. Thank you.

Thank you for the detailed answers.

My last question was about where the code for parallelization lives. Is it in src/communication? I ask because I cannot yet understand why the example cases in OpenLB can run with mpirun without any OpenMPI API calls for a CPU-only simulation (I am looking at cavity3d.cpp in examples/laminar/cavity3d). In that case, only int noCuboids = singleton::mpi().getSize(); is written. In other words, I would like to understand why the other OpenMPI and OpenMP API calls are not written in the example case .cpp files for CPU-only simulation.

Let me confirm my understanding of your answer:
1) For a CPU-only simulation, it is not necessary to write OpenMPI API calls in my case .cpp file (for example MPI_Comm_rank(MPI_COMM_WORLD, &rank); and so on).
2) For a single-GPU simulation, it is also not necessary.
3) For a multi-GPU simulation, it is necessary to write OpenMPI API calls in my case .cpp file. If I do not write them and run with mpirun (for example mpirun -np 2), each GPU runs the same simulation (with 2 GPUs, both would do the same run and produce the same results).

Are my understandings right?

    Best
