GPU and calculation time
March 10, 2025 at 8:46 pm #9923
sfraniatte (Participant)
Dear community,
I’m trying to run some examples (cavity3dBenchmark and nozzle3d) on a GPU (NVIDIA RTX A6000), but I’m observing an unexpected trend in the MLUPS. When the number of voxels is small (matching the default values in the examples), the computation on the GPU is indeed faster than on 32 CPU cores. However, when I set the resolution to N=15 for the nozzle example, the CPU computation becomes faster than the GPU computation.
Moreover, the MLUPS decrease as the resolution increases, and so does GPU activity: the average GPU utilization percentage drops as the number of voxels grows (I checked this using nvtop). Is this normal?
I should mention that I am correctly using gpu_only.mk for compilation and that I run make clean; make every time I compile. Additionally, for the nozzle3d example, I reach an average of 185 MLUPS (with peaks at 1271), which seems quite low compared to the numbers achieved with NVIDIA A100 GPUs, especially since I’m running in single precision.
Best regards,
Sylvain

March 10, 2025 at 9:43 pm #9924
Adrian (Keymaster)
This is unusual. What is the exact command you use to run the nozzle case? Did you increase the frequency of VTK output compared to the default?
Are you using single or double precision for the value type? The low average leads me to believe that there is some issue with the VTK output (e.g. it is not performed asynchronously, or it is performed too often to be hidden).
The nozzle case uses a computationally complex turbulent inlet condition, such that heterogeneous CPU-GPU execution can provide an advantage over GPU-only execution.
I assume that you observe this only for the nozzle case and that the cavity benchmark behaves as one would expect?
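For reference, the output frequency in the examples is controlled in their results routine, roughly along the following lines. This is only a minimal sketch of the usual pattern, assuming a recent OpenLB release; the function name, the writer name "nozzle3d", the 0.1 s period and the D3Q19 descriptor are illustrative and not copied from nozzle3d itself:

#include "olb3D.h"
#include "olb3D.hh"

using namespace olb;

using T = float;
using DESCRIPTOR = descriptors::D3Q19<>;

// Sketch of the usual output pattern: convert a physical output period into
// lattice steps and only write VTK data every vtkIter steps. Writing much
// more often than necessary makes the run I/O-bound and hides the GPU speed-up.
void writeVTK( SuperLattice<T,DESCRIPTOR>& sLattice,
               UnitConverter<T,DESCRIPTOR> const& converter,
               std::size_t iT )
{
  SuperVTMwriter3D<T> vtmWriter( "nozzle3d" );
  // illustrative 0.1 s output period converted to lattice time steps
  const std::size_t vtkIter = converter.getLatticeTime( 0.1 );
  if ( iT % vtkIter == 0 ) {
    // on GPU_CUDA the data has to be made available on the host before evaluation
    sLattice.setProcessingContext( ProcessingContext::Evaluation );
    SuperLatticePhysVelocity3D<T,DESCRIPTOR> velocity( sLattice, converter );
    vtmWriter.addFunctor( velocity );
    vtmWriter.write( iT );
  }
}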
March 11, 2025 at 3:12 pm #9929
sfraniatte (Participant)
I think I have solved my problem. The mistake was that I had left the line "FEATURES :=" uncommented. To be exhaustive, here are the exact commands that I used:
- ./nozzle3d --resolution 5 --max-phys-t 10: the calculation duration (measured CPU time) was 29.278 s, with an average of 376.907 MLUPS
- ./nozzle3d --resolution 10 --max-phys-t 10: the calculation duration (measured CPU time) was 501.328 s, with an average of 686.931 MLUPS
I ran these calculations today and the trend now looks as expected. Thanks a lot for your time, and I hope this will be useful for someone.
Best regards,
Sylvain

March 12, 2025 at 1:52 am #9931
Adrian (Keymaster)
Happy to hear that it performs better now.
However, this is very unlikely to be the actual reason. The FEATURES list is not involved in this in any way.
Can you post your two configs?
March 12, 2025 at 10:26 am #9932
sfraniatte (Participant)
Yes, I can. Here is the working .mk file:
# Example build config for OpenLB using CUDA on single GPU systems
#
# Tested using CUDA 11.4
#
# Usage:
#  - Copy this file to OpenLB root as config.mk
#  - Adjust CUDA_ARCH to match your specific GPU
#  - Run make clean; make
#  - Switch to example directory, e.g. examples/laminar/cavity3dBenchmark
#  - Run make
#  - Start the simulation using ./cavity3d

CXX := nvcc
CC := nvcc

CXXFLAGS := -O3
CXXFLAGS += -std=c++17 --forward-unknown-to-host-compiler

PARALLEL_MODE := NONE

PLATFORMS := CPU_SISD GPU_CUDA

# for e.g. RTX 30* (Ampere), see table in rules.mk for other options
CUDA_ARCH := 86

FLOATING_POINT_TYPE := float
USE_EMBEDDED_DEPENDENCIES := ON
###########################################################################################
Here is the first one, which does not work very well:
# OpenLB build configuration
#
# This file sets up the necessary build flags for compiling OpenLB with
# the GNU C++ compiler and sequential execution. For more complex setups
# edit this file or consult the example configs provided in config/.
#
# Basic usage:
#  - Edit variables to fit desired configuration
#  - Run make clean; make to clean up any previous artifacts and compile the dependencies
#  - Switch to example directory, e.g. examples/laminar/poiseuille2d
#  - Run make
#  - Start the simulation using ./poiseuille2d

# Compiler to use for C++ files, change to mpic++ when using OpenMPI and GCC
#~ # parallel CPU or hybrid
#~ CXX := mpic++
# GPU
CXX := nvcc

# Compiler to use for C files (used for embedded dependencies)
# parallel CPU or hybrid
#~ CC := gcc
# GPU
CC := nvcc

# Suggested optimized build flags for GCC, consult config/ for further examples
# parallel CPU or hybrid
#~ CXXFLAGS := -O3 -Wall -march=native -mtune=native
# GPU
CXXFLAGS := -O3
CXXFLAGS += --forward-unknown-to-host-compiler

# Uncomment to add debug symbols and enable runtime asserts
#~ #CXXFLAGS += -g -DOLB_DEBUG

# OpenLB requires support for C++17
# works in:
#  * gcc 9 or later (https://gcc.gnu.org/projects/cxx-status.html#cxx17)
#  * icc 19.0 or later (https://software.intel.com/en-us/articles/c17-features-supported-by-intel-c-compiler)
#  * clang 7 or later (https://clang.llvm.org/cxx_status.html#cxx17)
CXXFLAGS += -std=c++17

# optional linker flags
LDFLAGS :=

# Parallelization mode, must be one of: OFF, MPI, OMP, HYBRID
# Note that for MPI and HYBRID the compiler also needs to be adapted.
# See e.g. config/cpu_gcc_openmpi.mk
# parallel CPU
#~ PARALLEL_MODE := MPI
# GPU
PARALLEL_MODE := NONE
#~ # hybrid
#~ PARALLEL_MODE := HYBRID

# optional MPI and OpenMP flags
# parallel CPU
#~ MPIFLAGS :=
#~ OMPFLAGS := -fopenmp

# Options: CPU_SISD, CPU_SIMD, GPU_CUDA
# Both CPU_SIMD and GPU_CUDA require system-specific adjustment of compiler flags.
# See e.g. config/cpu_simd_intel_mpi.mk or config/gpu_only.mk for examples.
# CPU_SISD must always be present.
# parallel CPU
#~ PLATFORMS := CPU_SISD
# GPU
PLATFORMS := CPU_SISD GPU_CUDA
# hybrid
#~ PLATFORMS := CPU_SISD CPU_SIMD GPU_CUDA

#~ # Compiler to use for CUDA-enabled files
#~ CUDA_CXX := nvcc
#~ CUDA_CXXFLAGS := -O3 -std=c++17
#~ # Adjust to enable resolution of libcuda, libcudart, libcudadevrt
#~ CUDA_LDFLAGS := -L/run/libcuda/lib
#~ CUDA_LDFLAGS += -fopenmp

#~ # GPU or hybrid
CUDA_ARCH := 86

# Fundamental arithmetic data type
# Common options are float or double
# parallel CPU
#~ FLOATING_POINT_TYPE := double
# GPU or hybrid
FLOATING_POINT_TYPE := float

# Any entries are passed to the compiler as -DFEATURE_* declarations
# Used to enable some alternative code paths and dependencies
FEATURES :=

# Set to OFF if libz and tinyxml are provided by the system (optional)
USE_EMBEDDED_DEPENDENCIES := ON
###################################################################################
Also, I am trying to run my own case on the GPU, but it is really too slow. The main differences from the aorta example (which now works well for me) are the inlet surface, which is much larger, and the fact that there are external edges and corners (the inlet has 5 faces). I am working on cleaning up my code so that it is structured like the nozzle example (with stlReader). That may be the cause, but I am not sure. Thank you!
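For reference, a rough sketch of an STL-based geometry setup in the style of the shipped aorta3d example is given below. The file name "myCase.stl", the material-number convention and the exact class names are assumptions to be checked against the example code that ships with your OpenLB release:

#include "olb3D.h"
#include "olb3D.hh"

using namespace olb;

using T = float;
using DESCRIPTOR = descriptors::D3Q19<>;

// Sketch: read the surface mesh, extend the domain by one voxel layer and
// assign material numbers (0 outside, 2 on the hull, 1 in the fluid interior),
// following the convention used by the aorta3d example.
void prepareGeometry( UnitConverter<T,DESCRIPTOR> const& converter,
                      SuperGeometry<T,3>& superGeometry )
{
  // placeholder file name; the voxel size is taken from the unit converter
  STLreader<T> stlReader( "myCase.stl", converter.getConversionFactorLength() );
  // extend the domain by one lattice spacing so boundary voxels are captured
  IndicatorLayer3D<T> extendedDomain( stlReader, converter.getConversionFactorLength() );

  superGeometry.rename( 0, 2, extendedDomain );
  superGeometry.rename( 2, 1, stlReader );
  superGeometry.clean();
  superGeometry.innerClean();
  superGeometry.checkForErrors();
}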