Multi-GPU MPI library is not CUDA-aware

  • #10765
    alex.ws
    Participant

    Hello,

    We are trying to run some of the multi-GPU examples (aorta3d) and get the following error on execution:
    [GPU_CUDA] The used MPI Library is not CUDA-aware. Multi-GPU execution will fail.

    Some info that may be useful:
    System is a 48-core EPYC with 2x NVIDIA RTX PRO 6000 Blackwell GPUs, 384GB memory.
    Ubuntu 24.04
    CUDA 13.0 with drivers 580.65.06
    OpenMPI 5.0.8

    Running the command nvidia-smi I get:

    ubuntu@hpc:~/OpenLB_GPU/examples/turbulence/aorta3d$ nvidia-smi
    Mon Oct  6 14:14:23 2025       
    +-----------------------------------------------------------------------------------------+
    | NVIDIA-SMI 580.65.06              Driver Version: 580.65.06      CUDA Version: 13.0     |
    +-----------------------------------------+------------------------+----------------------+
    | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
    |                                         |                        |               MIG M. |
    |=========================================+========================+======================|
    |   0  NVIDIA RTX PRO 6000 Blac...    Off |   00000000:01:00.0 Off |                  Off |
    | 30%   25C    P8             12W /  300W |      15MiB /  97887MiB |      0%      Default |
    |                                         |                        |                  N/A |
    +-----------------------------------------+------------------------+----------------------+
    |   1  NVIDIA RTX PRO 6000 Blac...    Off |   00000000:02:00.0 Off |                  Off |
    | 30%   24C    P8              9W /  300W |      15MiB /  97887MiB |      0%      Default |
    |                                         |                        |                  N/A |
    +-----------------------------------------+------------------------+----------------------+
    
    +-----------------------------------------------------------------------------------------+
    | Processes:                                                                              |
    |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
    |        ID   ID                                                               Usage      |
    |=========================================================================================|
    |    0   N/A  N/A            2309      G   /usr/lib/xorg/Xorg                        4MiB |
    |    1   N/A  N/A            2309      G   /usr/lib/xorg/Xorg                        4MiB |
    +-----------------------------------------------------------------------------------------+
    

    NVCC:

    ubuntu@hpc:~/OpenLB_GPU/examples/turbulence/aorta3d$ nvcc --version
    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2025 NVIDIA Corporation
    Built on Wed_Aug_20_01:58:59_PM_PDT_2025
    Cuda compilation tools, release 13.0, V13.0.88
    Build cuda_13.0.r13.0/compiler.36424714_0

    ompi_info:

    ubuntu@hpc:~/OpenLB_GPU/examples/turbulence/aorta3d$ ompi_info | grep -i cuda
      Configure command line: '--prefix=/opt/openmpi-5.0.8' '--with-cuda=/usr/local/cuda'
              MPI extensions: affinity, cuda, ftmpi, rocm, shortfloat
             MCA accelerator: cuda (MCA v2.1.0, API v1.0.0, Component v5.0.8)
                     MCA btl: smcuda (MCA v2.1.0, API v3.3.0, Component v5.0.8)
                    MCA coll: cuda (MCA v2.1.0, API v2.4.0, Component v5.0.8)
    ubuntu@hpc:~/OpenLB_GPU/examples/turbulence/aorta3d$ ompi_info --parsable --all | grep mpi_built_with_cuda_support:value
    mca:mpi:base:param:mpi_built_with_cuda_support:value:true

    OpenLB config.mk:

    CXX             := nvcc
    CC              := nvcc
    
    CXXFLAGS        := -O3
    CXXFLAGS        += -std=c++20 --forward-unknown-to-host-compiler
    
    PARALLEL_MODE   := MPI
    
    # CPU/MPI compiler flags
    CXXFLAGS += -I/opt/openmpi-5.0.8/include   
    CCFLAGS  += -I/opt/openmpi-5.0.8/include
    
    # MPI linker flags
    LDFLAGS += -L/opt/openmpi-5.0.8/lib -lmpi
    
    PLATFORMS       := CPU_SISD GPU_CUDA
    
    USE_CUDA_AWARE_MPI := ON
    
    # for e.g. RTX 30* (Ampere), see table in 'rules.mk' for other options
    CUDA_ARCH       := 100
    
    FLOATING_POINT_TYPE := float
    
    USE_EMBEDDED_DEPENDENCIES := ON

    Single-GPU simulations run fine, and as far as I can tell OpenMPI is installed with CUDA support enabled. However, the compiled OpenLB examples do not detect the CUDA-aware installation for multi-GPU runs. Any advice is appreciated; hopefully I have provided enough information.

    Thanks in advance,
    Alex

    #10767
    Adrian
    Keymaster

    The command outputs (thanks!) all look fine, so this may just be the automated check failing despite CUDA-awareness being available. The logic we use to check this

    
    #ifdef PARALLEL_MODE_MPI
    #if defined(MPIX_CUDA_AWARE_SUPPORT) && MPIX_CUDA_AWARE_SUPPORT
      if (!MPIX_Query_cuda_support()) {
        clout << "The used MPI Library is not CUDA-aware. Multi-GPU execution will fail." << std::endl;
      }
    #endif
    #if defined(MPIX_CUDA_AWARE_SUPPORT) && !MPIX_CUDA_AWARE_SUPPORT
      clout << "The used MPI Library is not CUDA-aware. Multi-GPU execution will fail." << std::endl;
    #endif
    #if !defined(MPIX_CUDA_AWARE_SUPPORT)
      clout << "Unable to check for CUDA-aware MPI support. Multi-GPU execution may fail." << std::endl;
    #endif
    #endif // PARALLEL_MODE_MPI
    

    can definitely have gaps. Does the program proceed as usual in multi-GPU mode after this message? (If CUDA-awareness is indeed not working despite the command output, I would expect it to segfault immediately on the first communication.)
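    For reference, a standalone probe along these lines (just a minimal sketch, assuming Open MPI's optional mpi-ext.h extension header; compile it with your /opt/openmpi-5.0.8/bin/mpic++) reports the same compile-time and run-time information outside of OpenLB:

    // cuda_aware_check.cpp -- minimal sketch: query CUDA-awareness outside OpenLB
    #include <cstdio>
    #include <mpi.h>
    #if defined(OPEN_MPI) && OPEN_MPI
    #include <mpi-ext.h>   // Open MPI extension header: MPIX_CUDA_AWARE_SUPPORT, MPIX_Query_cuda_support()
    #endif

    int main(int argc, char** argv) {
      MPI_Init(&argc, &argv);
    #if defined(MPIX_CUDA_AWARE_SUPPORT) && MPIX_CUDA_AWARE_SUPPORT
      std::printf("compile-time CUDA support: yes\n");
    #elif defined(MPIX_CUDA_AWARE_SUPPORT)
      std::printf("compile-time CUDA support: no\n");
    #else
      std::printf("compile-time CUDA support: unknown (MPIX_CUDA_AWARE_SUPPORT not defined)\n");
    #endif
    #if defined(MPIX_CUDA_AWARE_SUPPORT)
      std::printf("run-time CUDA support:     %s\n", MPIX_Query_cuda_support() ? "yes" : "no");
    #endif
      MPI_Finalize();
      return 0;
    }

    If that probe reports run-time support but aorta3d still prints the warning, the example binary is probably picking up a different mpi.h than the one from /opt/openmpi-5.0.8.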

    #10768
    alex.ws
    Participant

    Hi Adrian,

    No, it fails with a segmentation fault. Full output below:

    ubuntu@hpc:~/OpenLB_GPU/examples/turbulence/aorta3d$ mpirun -np 2 bash -c 'export CUDA_VISIBLE_DEVICES=${OMPI_COMM_WORLD_LOCAL_RANK}; ./aorta3d'
    [MpiManager] Sucessfully initialized, numThreads=2
    [ThreadPool] Sucessfully initialized, numThreads=1
    [GPU_CUDA] The used MPI Library is not CUDA-aware. Multi-GPU execution will fail.
    [UnitConverter] ----------------- UnitConverter information -----------------
    [UnitConverter] -- Parameters:
    [UnitConverter] Resolution:                       N=              40
    [UnitConverter] Lattice velocity:                 latticeU=       0.0225
    [UnitConverter] Lattice relaxation frequency:     omega=          1.99697
    [UnitConverter] Lattice relaxation time:          tau=            0.50076
    [UnitConverter] Characteristical length(m):       charL=          0.02246
    [UnitConverter] Characteristical speed(m/s):      charU=          0.45
    [UnitConverter] Phys. kinematic viscosity(m^2/s): charNu=         2.8436e-06
    [UnitConverter] Phys. density(kg/m^d):            charRho=        1055
    [UnitConverter] Characteristical pressure(N/m^2): charPressure=   0
    [UnitConverter] Mach number:                      machNumber=     0.0389711
    [UnitConverter] Reynolds number:                  reynoldsNumber= 3554.29
    [UnitConverter] Knudsen number:                   knudsenNumber=  1.09645e-05
    [UnitConverter] Characteristical CFL number:      charCFLnumber=  0.0225
    [UnitConverter] 
    [UnitConverter] -- Conversion factors:
    [UnitConverter] Voxel length(m):                  physDeltaX=     0.0005615
    [UnitConverter] Time step(s):                     physDeltaT=     2.8075e-05
    [UnitConverter] Velocity factor(m/s):             physVelocity=   20
    [UnitConverter] Density factor(kg/m^3):           physDensity=    1055
    [UnitConverter] Mass factor(kg):                  physMass=       1.86768e-07
    [UnitConverter] Viscosity factor(m^2/s):          physViscosity=  0.01123
    [UnitConverter] Force factor(N):                  physForce=      0.133049
    [UnitConverter] Pressure factor(N/m^2):           physPressure=   422000
    [UnitConverter] -------------------------------------------------------------
    [UnitConverter] WARNING:
    [UnitConverter] Potentially UNSTABLE combination of relaxation time (tau=0.50076)
    [UnitConverter] and characteristical CFL number (lattice velocity) charCFLnumber=0.0225!
    [UnitConverter] Potentially maximum characteristical CFL number (maxCharCFLnumber=0.00607729)
    [UnitConverter] Actual characteristical CFL number (charCFLnumber=0.0225) > 0.00607729
    [UnitConverter] Please reduce the the cell size or the time step size!
    [UnitConverter] We recommend to use the cell size of 0.000151659 m and the time step size of 7.58293e-06 s.
    [UnitConverter] -------------------------------------------------------------
    [STLreader] Voxelizing ...
    [STLmesh] nTriangles=2654; maxDist2=0.000610779
    [STLmesh] minPhysR(StlMesh)=(0.199901,0.0900099,0.0117236); maxPhysR(StlMesh)=(0.243584,0.249987,0.0398131)
    [Octree] radius=0.143744; center=(0.221602,0.169858,0.025628)
    [STLreader] voxelSize=0.0005615; stlSize=0.001
    [STLreader] minPhysR(VoxelMesh)=(0.199984,0.0904055,0.0118712); maxPhysR(VoxelMesh)=(0.24322,0.249873,0.0393848)
    [STLreader] Voxelizing ... OK
    [prepareGeometry] Prepare Geometry ...
    [SuperGeometry3D] cleaned 0 outer boundary voxel(s)
    [SuperGeometry3D] cleaned 0 outer boundary voxel(s)
    [SuperGeometry3D] cleaned 0 inner boundary voxel(s) of Type 3
    [SuperGeometryStatistics3D] updated
    [SuperGeometry3D] the model is correct!
    [CuboidDecomposition] ---Cuboid Structure Statistics---
    [CuboidDecomposition]  Number of Cuboids: 	16
    [CuboidDecomposition]  Delta       : 		0.0005615
    [CuboidDecomposition]  Ratio  (min): 		0.529412
    [CuboidDecomposition]         (max): 		1.77778
    [CuboidDecomposition]  Nodes  (min): 		16704
    [CuboidDecomposition]         (max): 		35640
    [CuboidDecomposition]  Weight (min): 		10726
    [CuboidDecomposition]         (max): 		20749
    [CuboidDecomposition] --------------------------------
    [SuperGeometryStatistics3D] materialNumber=0; count=160731; minPhysR=(0.199984,0.089844,0.0113097); maxPhysR=(0.243781,0.250433,0.0399462)
    [SuperGeometryStatistics3D] materialNumber=1; count=171226; minPhysR=(0.200546,0.0904055,0.0118712); maxPhysR=(0.24322,0.249872,0.0393847)
    [SuperGeometryStatistics3D] materialNumber=2; count=41080; minPhysR=(0.199984,0.089844,0.0113097); maxPhysR=(0.243781,0.250433,0.0399462)
    [SuperGeometryStatistics3D] materialNumber=3; count=1059; minPhysR=(0.208407,0.250433,0.0124327); maxPhysR=(0.228059,0.250433,0.0332082)
    [SuperGeometryStatistics3D] materialNumber=4; count=245; minPhysR=(0.200546,0.089844,0.0298392); maxPhysR=(0.210653,0.089844,0.0388232)
    [SuperGeometryStatistics3D] materialNumber=5; count=239; minPhysR=(0.234236,0.089844,0.0287162); maxPhysR=(0.24322,0.089844,0.0388232)
    [SuperGeometryStatistics3D] countTotal[1e6]=0.37458
    [prepareGeometry] Prepare Geometry ... OK
    [prepareLattice] Prepare Lattice ...
    [prepareLattice] Prepare Lattice ... OK
    [Timer] 
    [Timer] ----------------Summary:Timer----------------
    [Timer] measured time (rt) : 0.295s
    [Timer] measured time (cpu): 0.295s
    [Timer] ---------------------------------------------
    [main] starting simulation...
    [hpc:08008] *** Process received signal ***
    [hpc:08008] Signal: Segmentation fault (11)
    [hpc:08008] Signal code: Invalid permissions (2)
    [hpc:08008] Failing at address: 0x318d27e00
    [hpc:08009] *** Process received signal ***
    [hpc:08009] Signal: Segmentation fault (11)
    [hpc:08009] Signal code: Invalid permissions (2)
    [hpc:08009] Failing at address: 0x318d49200
    [hpc:08009] [ 0] [hpc:08008] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x45330)[0x71e29ba45330]
    [hpc:08008] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x45330)[0x74362a445330]
    [hpc:08009] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x1a440d)[0x71e29bba440d]
    [hpc:08008] [ 2] /lib/x86_64-linux-gnu/libc.so.6(+0x1a440d)[0x74362a5a440d]
    [hpc:08009] [ 2] /opt/openmpi-5.0.8/lib/libopen-pal.so.80(+0xcf985)[0x71e2a23b9985]
    [hpc:08008] [ 3] /opt/openmpi-5.0.8/lib/libopen-pal.so.80(+0xcf985)[0x74362afb9985]
    [hpc:08009] [ 3] /opt/openmpi-5.0.8/lib/libmpi.so.40(mca_pml_ob1_send_request_schedule_once+0x24a)[0x74363105601a]
    [hpc:08009] [ 4] /opt/openmpi-5.0.8/lib/libmpi.so.40(mca_pml_ob1_send_request_schedule_once+0x24a)[0x71e2a265601a]
    [hpc:08008] [ 4] /opt/openmpi-5.0.8/lib/libmpi.so.40(mca_pml_ob1_recv_frag_callback_ack+0x151)[0x71e2a264d681]
    [hpc:08008] [ 5] /opt/openmpi-5.0.8/lib/libopen-pal.so.80(mca_btl_sm_poll_handle_frag+0x9b)[0x71e2a23bacab]
    [hpc:08008] [ 6] /opt/openmpi-5.0.8/lib/libopen-pal.so.80(+0xd118b)[0x71e2a23bb18b]
    [hpc:08008] [ 7] /opt/openmpi-5.0.8/lib/libopen-pal.so.80(opal_progress+0x34)[0x71e2a230ec84]
    [hpc:08008] [ 8] /opt/openmpi-5.0.8/lib/libmpi.so.40(mca_pml_ob1_recv_frag_callback_ack+0x151)[0x74363104d681]
    [hpc:08009] [ 5] /opt/openmpi-5.0.8/lib/libopen-pal.so.80(mca_btl_sm_poll_handle_frag+0x9b)[0x74362afbacab]
    [hpc:08009] [ 6] /opt/openmpi-5.0.8/lib/libopen-pal.so.80(+0xd118b)[0x74362afbb18b]
    [hpc:08009] [ 7] /opt/openmpi-5.0.8/lib/libopen-pal.so.80(opal_progress+0x34)[0x74362af0ec84]
    [hpc:08009] [ 8] /opt/openmpi-5.0.8/lib/libmpi.so.40(ompi_request_default_test+0x51)[0x71e2a2490ae1]
    [hpc:08008] [ 9] /opt/openmpi-5.0.8/lib/libmpi.so.40(PMPI_Test+0x4a)[0x71e2a24d72aa]
    [hpc:08008] [10] /opt/openmpi-5.0.8/lib/libmpi.so.40(ompi_request_default_test+0x51)[0x743630e90ae1]
    [hpc:08009] [ 9] ./aorta3d(+0x1338ca)[0x5abd5e10d8ca]
    [hpc:08008] [11] ./aorta3d(+0xc31c2)[0x5abd5e09d1c2]
    [hpc:08008] [12] ./aorta3d(+0x1adb5e)[0x5abd5e187b5e]
    [hpc:08008] [13] ./aorta3d(+0x33866)[0x5abd5e00d866]
    [hpc:08008] [14] /lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca)[0x71e29ba2a1ca]
    [hpc:08008] [15] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x8b)[0x71e29ba2a28b]
    [hpc:08008] [16] ./aorta3d(+0x35905)[0x5abd5e00f905]
    [hpc:08008] *** End of error message ***
    /opt/openmpi-5.0.8/lib/libmpi.so.40(PMPI_Test+0x4a)[0x743630ed72aa]
    [hpc:08009] [10] ./aorta3d(+0x1338ca)[0x625743dc38ca]
    [hpc:08009] [11] ./aorta3d(+0xc31c2)[0x625743d531c2]
    [hpc:08009] [12] ./aorta3d(+0x1adb5e)[0x625743e3db5e]
    [hpc:08009] [13] ./aorta3d(+0x33866)[0x625743cc3866]
    [hpc:08009] [14] /lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca)[0x74362a42a1ca]
    [hpc:08009] [15] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x8b)[0x74362a42a28b]
    [hpc:08009] [16] ./aorta3d(+0x35905)[0x625743cc5905]
    [hpc:08009] *** End of error message ***
    --------------------------------------------------------------------------
    prterun noticed that process rank 1 with PID 8009 on node hpc exited on
    signal 11 (Segmentation fault).
    --------------------------------------------------------------------------
    #10769
    Adrian
    Keymaster

    Ok, weird. I just re-tested the release on my dual-GPU system and the example works as it should.

    One other possibility is that the nvcc selected in your environment is a different one than the CUDA installation in /usr/local/cuda. You could also try the “mixed” mode (see the example configs in config/) to use your mpic++ directly together with nvcc.

    Did you test any other CUDA-aware MPI apps in the same environment?
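    If not, a quick device-buffer smoke test along these lines (again only a sketch; it assumes two ranks on a single node with both GPUs visible, and that something like nvcc -I/opt/openmpi-5.0.8/include -L/opt/openmpi-5.0.8/lib gpu_pingpong.cu -o gpu_pingpong -lmpi links against the right MPI) should crash in the same way as aorta3d if CUDA-awareness is actually missing at run time:

    // gpu_pingpong.cu -- minimal sketch: pass a device pointer directly to MPI
    #include <cstdio>
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char** argv) {
      MPI_Init(&argc, &argv);
      int rank = 0;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      // one GPU per rank on a single node (do not mask CUDA_VISIBLE_DEVICES for this test)
      cudaSetDevice(rank);

      const int n = 1 << 20;
      float* d_buf = nullptr;
      cudaMalloc(&d_buf, n * sizeof(float));
      cudaMemset(d_buf, 0, n * sizeof(float));

      // a CUDA-aware MPI accepts the device pointer here; a non-aware one typically segfaults
      if (rank == 0) {
        MPI_Send(d_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
      } else if (rank == 1) {
        MPI_Recv(d_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        std::printf("received device buffer without crashing\n");
      }

      cudaFree(d_buf);
      MPI_Finalize();
      return 0;
    }

    If that test passes but aorta3d still segfaults, I would check whether the example was compiled against the same MPI headers and libraries as the mpirun you launch it with.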
