Multi-GPU MPI library is not CUDA-aware
October 6, 2025 at 3:20 pm #10765 | alex.ws (Participant)
Hello,
We are trying to run some multi-GPU examples (aorta3d) and get the following error upon execution:
```
[GPU_CUDA] The used MPI Library is not CUDA-aware. Multi-GPU execution will fail.
```

Some info that may be useful:
System is a 48-core EPYC with 2x NVIDIA RTX PRO 6000 Blackwell GPUs, 384GB memory.
Ubuntu 24.04
CUDA 13.0 with driver 580.65.06
OpenMPI 5.0.8

Running the command `nvidia-smi` I get:
```
ubuntu@hpc:~/OpenLB_GPU/examples/turbulence/aorta3d$ nvidia-smi
Mon Oct  6 14:14:23 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.65.06              Driver Version: 580.65.06      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX PRO 6000 Blac...    Off |   00000000:01:00.0 Off |                  Off |
| 30%   25C    P8             12W / 300W  |      15MiB /  97887MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA RTX PRO 6000 Blac...    Off |   00000000:02:00.0 Off |                  Off |
| 30%   24C    P8              9W / 300W  |      15MiB /  97887MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            2309      G   /usr/lib/xorg/Xorg                        4MiB |
|    1   N/A  N/A            2309      G   /usr/lib/xorg/Xorg                        4MiB |
+-----------------------------------------------------------------------------------------+
```

NVCC:
```
ubuntu@hpc:~/OpenLB_GPU/examples/turbulence/aorta3d$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Wed_Aug_20_01:58:59_PM_PDT_2025
Cuda compilation tools, release 13.0, V13.0.88
Build cuda_13.0.r13.0/compiler.36424714_0
```

ompi_info:
```
ubuntu@hpc:~/OpenLB_GPU/examples/turbulence/aorta3d$ ompi_info | grep -i cuda
  Configure command line: '--prefix=/opt/openmpi-5.0.8' '--with-cuda=/usr/local/cuda'
          MPI extensions: affinity, cuda, ftmpi, rocm, shortfloat
         MCA accelerator: cuda (MCA v2.1.0, API v1.0.0, Component v5.0.8)
                 MCA btl: smcuda (MCA v2.1.0, API v3.3.0, Component v5.0.8)
                MCA coll: cuda (MCA v2.1.0, API v2.4.0, Component v5.0.8)

ubuntu@hpc:~/OpenLB_GPU/examples/turbulence/aorta3d$ ompi_info --parsable --all | grep mpi_built_with_cuda_support:value
mca:mpi:base:param:mpi_built_with_cuda_support:value:true
```

OpenLB config.mk:
```
CXX := nvcc
CC := nvcc

CXXFLAGS := -O3
CXXFLAGS += -std=c++20 --forward-unknown-to-host-compiler

PARALLEL_MODE := MPI

# CPU/MPI compiler flags
CXXFLAGS += -I/opt/openmpi-5.0.8/include
CCFLAGS += -I/opt/openmpi-5.0.8/include

# MPI linker flags
LDFLAGS += -L/opt/openmpi-5.0.8/lib -lmpi

PLATFORMS := CPU_SISD GPU_CUDA

USE_CUDA_AWARE_MPI := ON

# for e.g. RTX 30* (Ampere), see table in 'rules.mk' for other options
CUDA_ARCH := 100

FLOATING_POINT_TYPE := float

USE_EMBEDDED_DEPENDENCIES := ON
```

Single-GPU simulations run fine, and as far as I can tell OpenMPI is installed with CUDA support enabled. However, when compiling the OpenLB examples, the CUDA-aware installation is not detected for multi-GPU runs. Any advice is appreciated; hopefully I have provided enough information.
Thanks in advance,
Alex
October 6, 2025 at 3:29 pm #10767 | Adrian (Keymaster)

The command outputs (thanks!) all look fine, so this may just be the automated check failing despite CUDA-awareness being available. The logic we use to check this
```cpp
#ifdef PARALLEL_MODE_MPI
#if defined(MPIX_CUDA_AWARE_SUPPORT) && MPIX_CUDA_AWARE_SUPPORT
  if (!MPIX_Query_cuda_support()) {
    clout << "The used MPI Library is not CUDA-aware. Multi-GPU execution will fail." << std::endl;
  }
#endif
#if defined(MPIX_CUDA_AWARE_SUPPORT) && !MPIX_CUDA_AWARE_SUPPORT
  clout << "The used MPI Library is not CUDA-aware. Multi-GPU execution will fail." << std::endl;
#endif
#if !defined(MPIX_CUDA_AWARE_SUPPORT)
  clout << "Unable to check for CUDA-aware MPI support. Multi-GPU execution may fail." << std::endl;
#endif
#endif // PARALLEL_MODE_MPI
```

can definitely have gaps. Does the program proceed as usual in multi-GPU after this? (If CUDA-awareness is indeed not working for some reason despite the command output, I would expect it to instantly segfault on communication.)
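If you want to isolate this check from OpenLB, a standalone sketch along the same lines (assuming Open MPI's `<mpi-ext.h>` extension header, which provides `MPIX_CUDA_AWARE_SUPPORT` and `MPIX_Query_cuda_support()`; the file name is just for illustration) compiled against the same installation should tell you what the headers and runtime actually report:

```cpp
// check_cuda_aware.cpp -- standalone sketch, not part of OpenLB.
// Build and run (paths per your installation):
//   mpic++ check_cuda_aware.cpp -o check_cuda_aware && mpirun -np 1 ./check_cuda_aware
#include <cstdio>
#include <mpi.h>
#include <mpi-ext.h>  // Open MPI extension header defining MPIX_CUDA_AWARE_SUPPORT

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
#if defined(MPIX_CUDA_AWARE_SUPPORT) && MPIX_CUDA_AWARE_SUPPORT
  // Compile-time support is advertised; also query the runtime answer
  std::printf("compile time: CUDA-aware, runtime query: %d\n", MPIX_Query_cuda_support());
#elif defined(MPIX_CUDA_AWARE_SUPPORT)
  std::printf("compile time: built without CUDA-aware support\n");
#else
  std::printf("MPIX_CUDA_AWARE_SUPPORT is not defined by this mpi-ext.h\n");
#endif
  MPI_Finalize();
  return 0;
}
```

If the runtime query prints 1, our check is at fault; if it prints 0, the library really is refusing CUDA-awareness at runtime despite being built with it.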
October 6, 2025 at 3:34 pm #10768 | alex.ws (Participant)

Hi Adrian,
No, it fails with a segmentation fault. Full output below:
```
ubuntu@hpc:~/OpenLB_GPU/examples/turbulence/aorta3d$ mpirun -np 2 bash -c 'export CUDA_VISIBLE_DEVICES=${OMPI_COMM_WORLD_LOCAL_RANK}; ./aorta3d'
[MpiManager] Sucessfully initialized, numThreads=2
[ThreadPool] Sucessfully initialized, numThreads=1
[GPU_CUDA] The used MPI Library is not CUDA-aware. Multi-GPU execution will fail.
[UnitConverter] ----------------- UnitConverter information -----------------
[UnitConverter] -- Parameters:
[UnitConverter] Resolution:                       N=              40
[UnitConverter] Lattice velocity:                 latticeU=       0.0225
[UnitConverter] Lattice relaxation frequency:     omega=          1.99697
[UnitConverter] Lattice relaxation time:          tau=            0.50076
[UnitConverter] Characteristical length(m):       charL=          0.02246
[UnitConverter] Characteristical speed(m/s):      charU=          0.45
[UnitConverter] Phys. kinematic viscosity(m^2/s): charNu=         2.8436e-06
[UnitConverter] Phys. density(kg/m^d):            charRho=        1055
[UnitConverter] Characteristical pressure(N/m^2): charPressure=   0
[UnitConverter] Mach number:                      machNumber=     0.0389711
[UnitConverter] Reynolds number:                  reynoldsNumber= 3554.29
[UnitConverter] Knudsen number:                   knudsenNumber=  1.09645e-05
[UnitConverter] Characteristical CFL number:      charCFLnumber=  0.0225
[UnitConverter]
[UnitConverter] -- Conversion factors:
[UnitConverter] Voxel length(m):                  physDeltaX=     0.0005615
[UnitConverter] Time step(s):                     physDeltaT=     2.8075e-05
[UnitConverter] Velocity factor(m/s):             physVelocity=   20
[UnitConverter] Density factor(kg/m^3):           physDensity=    1055
[UnitConverter] Mass factor(kg):                  physMass=       1.86768e-07
[UnitConverter] Viscosity factor(m^2/s):          physViscosity=  0.01123
[UnitConverter] Force factor(N):                  physForce=      0.133049
[UnitConverter] Pressure factor(N/m^2):           physPressure=   422000
[UnitConverter] -------------------------------------------------------------
[UnitConverter] WARNING:
[UnitConverter] Potentially UNSTABLE combination of relaxation time (tau=0.50076)
[UnitConverter] and characteristical CFL number (lattice velocity) charCFLnumber=0.0225!
[UnitConverter] Potentially maximum characteristical CFL number (maxCharCFLnumber=0.00607729)
[UnitConverter] Actual characteristical CFL number (charCFLnumber=0.0225) > 0.00607729
[UnitConverter] Please reduce the the cell size or the time step size!
[UnitConverter] We recommend to use the cell size of 0.000151659 m and the time step size of 7.58293e-06 s.
[UnitConverter] -------------------------------------------------------------
[STLreader] Voxelizing ...
[STLmesh] nTriangles=2654; maxDist2=0.000610779
[STLmesh] minPhysR(StlMesh)=(0.199901,0.0900099,0.0117236); maxPhysR(StlMesh)=(0.243584,0.249987,0.0398131)
[Octree] radius=0.143744; center=(0.221602,0.169858,0.025628)
[STLreader] voxelSize=0.0005615; stlSize=0.001
[STLreader] minPhysR(VoxelMesh)=(0.199984,0.0904055,0.0118712); maxPhysR(VoxelMesh)=(0.24322,0.249873,0.0393848)
[STLreader] Voxelizing ... OK
[prepareGeometry] Prepare Geometry ...
[SuperGeometry3D] cleaned 0 outer boundary voxel(s)
[SuperGeometry3D] cleaned 0 outer boundary voxel(s)
[SuperGeometry3D] cleaned 0 inner boundary voxel(s) of Type 3
[SuperGeometryStatistics3D] updated
[SuperGeometry3D] the model is correct!
[CuboidDecomposition] ---Cuboid Structure Statistics---
[CuboidDecomposition] Number of Cuboids: 16
[CuboidDecomposition] Delta : 0.0005615
[CuboidDecomposition] Ratio (min): 0.529412
[CuboidDecomposition]       (max): 1.77778
[CuboidDecomposition] Nodes (min): 16704
[CuboidDecomposition]       (max): 35640
[CuboidDecomposition] Weight (min): 10726
[CuboidDecomposition]        (max): 20749
[CuboidDecomposition] --------------------------------
[SuperGeometryStatistics3D] materialNumber=0; count=160731; minPhysR=(0.199984,0.089844,0.0113097); maxPhysR=(0.243781,0.250433,0.0399462)
[SuperGeometryStatistics3D] materialNumber=1; count=171226; minPhysR=(0.200546,0.0904055,0.0118712); maxPhysR=(0.24322,0.249872,0.0393847)
[SuperGeometryStatistics3D] materialNumber=2; count=41080; minPhysR=(0.199984,0.089844,0.0113097); maxPhysR=(0.243781,0.250433,0.0399462)
[SuperGeometryStatistics3D] materialNumber=3; count=1059; minPhysR=(0.208407,0.250433,0.0124327); maxPhysR=(0.228059,0.250433,0.0332082)
[SuperGeometryStatistics3D] materialNumber=4; count=245; minPhysR=(0.200546,0.089844,0.0298392); maxPhysR=(0.210653,0.089844,0.0388232)
[SuperGeometryStatistics3D] materialNumber=5; count=239; minPhysR=(0.234236,0.089844,0.0287162); maxPhysR=(0.24322,0.089844,0.0388232)
[SuperGeometryStatistics3D] countTotal[1e6]=0.37458
[prepareGeometry] Prepare Geometry ... OK
[prepareLattice] Prepare Lattice ...
[prepareLattice] Prepare Lattice ... OK
[Timer]
[Timer] ----------------Summary:Timer----------------
[Timer] measured time (rt) : 0.295s
[Timer] measured time (cpu): 0.295s
[Timer] ---------------------------------------------
[main] starting simulation...
[hpc:08008] *** Process received signal ***
[hpc:08008] Signal: Segmentation fault (11)
[hpc:08008] Signal code: Invalid permissions (2)
[hpc:08008] Failing at address: 0x318d27e00
[hpc:08009] *** Process received signal ***
[hpc:08009] Signal: Segmentation fault (11)
[hpc:08009] Signal code: Invalid permissions (2)
[hpc:08009] Failing at address: 0x318d49200
[hpc:08009] [ 0]
[hpc:08008] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x45330)[0x71e29ba45330]
[hpc:08008] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x45330)[0x74362a445330]
[hpc:08009] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x1a440d)[0x71e29bba440d]
[hpc:08008] [ 2] /lib/x86_64-linux-gnu/libc.so.6(+0x1a440d)[0x74362a5a440d]
[hpc:08009] [ 2] /opt/openmpi-5.0.8/lib/libopen-pal.so.80(+0xcf985)[0x71e2a23b9985]
[hpc:08008] [ 3] /opt/openmpi-5.0.8/lib/libopen-pal.so.80(+0xcf985)[0x74362afb9985]
[hpc:08009] [ 3] /opt/openmpi-5.0.8/lib/libmpi.so.40(mca_pml_ob1_send_request_schedule_once+0x24a)[0x74363105601a]
[hpc:08009] [ 4] /opt/openmpi-5.0.8/lib/libmpi.so.40(mca_pml_ob1_send_request_schedule_once+0x24a)[0x71e2a265601a]
[hpc:08008] [ 4] /opt/openmpi-5.0.8/lib/libmpi.so.40(mca_pml_ob1_recv_frag_callback_ack+0x151)[0x71e2a264d681]
[hpc:08008] [ 5] /opt/openmpi-5.0.8/lib/libopen-pal.so.80(mca_btl_sm_poll_handle_frag+0x9b)[0x71e2a23bacab]
[hpc:08008] [ 6] /opt/openmpi-5.0.8/lib/libopen-pal.so.80(+0xd118b)[0x71e2a23bb18b]
[hpc:08008] [ 7] /opt/openmpi-5.0.8/lib/libopen-pal.so.80(opal_progress+0x34)[0x71e2a230ec84]
[hpc:08008] [ 8] /opt/openmpi-5.0.8/lib/libmpi.so.40(mca_pml_ob1_recv_frag_callback_ack+0x151)[0x74363104d681]
[hpc:08009] [ 5] /opt/openmpi-5.0.8/lib/libopen-pal.so.80(mca_btl_sm_poll_handle_frag+0x9b)[0x74362afbacab]
[hpc:08009] [ 6] /opt/openmpi-5.0.8/lib/libopen-pal.so.80(+0xd118b)[0x74362afbb18b]
[hpc:08009] [ 7] /opt/openmpi-5.0.8/lib/libopen-pal.so.80(opal_progress+0x34)[0x74362af0ec84]
[hpc:08009] [ 8] /opt/openmpi-5.0.8/lib/libmpi.so.40(ompi_request_default_test+0x51)[0x71e2a2490ae1]
[hpc:08008] [ 9] /opt/openmpi-5.0.8/lib/libmpi.so.40(PMPI_Test+0x4a)[0x71e2a24d72aa]
[hpc:08008] [10] /opt/openmpi-5.0.8/lib/libmpi.so.40(ompi_request_default_test+0x51)[0x743630e90ae1]
[hpc:08009] [ 9] ./aorta3d(+0x1338ca)[0x5abd5e10d8ca]
[hpc:08008] [11] ./aorta3d(+0xc31c2)[0x5abd5e09d1c2]
[hpc:08008] [12] ./aorta3d(+0x1adb5e)[0x5abd5e187b5e]
[hpc:08008] [13] ./aorta3d(+0x33866)[0x5abd5e00d866]
[hpc:08008] [14] /lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca)[0x71e29ba2a1ca]
[hpc:08008] [15] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x8b)[0x71e29ba2a28b]
[hpc:08008] [16] ./aorta3d(+0x35905)[0x5abd5e00f905]
[hpc:08008] *** End of error message ***
/opt/openmpi-5.0.8/lib/libmpi.so.40(PMPI_Test+0x4a)[0x743630ed72aa]
[hpc:08009] [10] ./aorta3d(+0x1338ca)[0x625743dc38ca]
[hpc:08009] [11] ./aorta3d(+0xc31c2)[0x625743d531c2]
[hpc:08009] [12] ./aorta3d(+0x1adb5e)[0x625743e3db5e]
[hpc:08009] [13] ./aorta3d(+0x33866)[0x625743cc3866]
[hpc:08009] [14] /lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca)[0x74362a42a1ca]
[hpc:08009] [15] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x8b)[0x74362a42a28b]
[hpc:08009] [16] ./aorta3d(+0x35905)[0x625743cc5905]
[hpc:08009] *** End of error message ***
--------------------------------------------------------------------------
prterun noticed that process rank 1 with PID 8009 on node hpc exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
```

October 6, 2025 at 4:33 pm #10769 | Adrian (Keymaster)

Ok, weird. I just re-tested the release on my dual-GPU system and the example works as it should.
One other possibility is that the `nvcc` selected in the environment is a different one than the one in your `/usr/local/cuda`. You could also try the "mixed" mode (see the example configs in `config/`) to directly use your `mpic++` together with `nvcc`.

Did you test any other CUDA-aware MPI apps in the same environment?
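If not, a minimal device-buffer ping-pong is a quick way to exercise exactly the path that is segfaulting, outside of OpenLB. Something along these lines (a sketch, not OpenLB code; build e.g. with `nvcc -ccbin mpic++ pingpong.cu -o pingpong` and run with `mpirun -np 2 ./pingpong`, names are just for illustration):

```cpp
// pingpong.cu -- CUDA-aware MPI smoke test (sketch): rank 0 sends a GPU
// buffer directly to rank 1. If the MPI library is not actually CUDA-aware,
// this typically segfaults or errors inside MPI_Send/MPI_Recv.
#include <cstdio>
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  // One GPU per rank; collapses to device 0 if CUDA_VISIBLE_DEVICES already filters
  int deviceCount = 0;
  cudaGetDeviceCount(&deviceCount);
  if (deviceCount > 0) cudaSetDevice(rank % deviceCount);

  const int n = 1 << 20;
  double* buf = nullptr;
  cudaMalloc(&buf, n * sizeof(double));  // device pointer handed directly to MPI below

  if (rank == 0) {
    cudaMemset(buf, 0, n * sizeof(double));
    MPI_Send(buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    std::printf("rank 0: MPI_Send from device buffer OK\n");
  }
  else if (rank == 1) {
    MPI_Recv(buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    std::printf("rank 1: MPI_Recv into device buffer OK\n");
  }

  cudaFree(buf);
  MPI_Finalize();
  return 0;
}
```

If this also crashes with the same backtrace, the problem is in the environment rather than in OpenLB.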
