GPU Examples
This topic has 6 replies, 2 voices, and was last updated 2 years, 8 months ago by Adrian.
May 29, 2022 at 6:51 pm #6589 | Mosemb (Participant)
Hi, y’all.
Which flags should be enabled in the config.mk file when running the GPU examples? I am not sure which ones to turn on or off. So far I have not been able to run any of the examples, in particular cavity3dBenchmark; after compilation I get the following error at runtime:

terminate called after throwing an instance of 'std::runtime_error'
what(): invalid device symbol
Aborted (core dumped)

May 29, 2022 at 7:18 pm #6590 | Adrian (Keymaster)
For a basic GPU-only setup you can compare e.g. the config/gpu_only.mk example config.
In your case everything but the CUDA_ARCH setting seems to already be correct. Which GPU are you using? (There is also a reference table for CUDA_ARCH settings in rules.mk)
May 29, 2022 at 10:50 pm #6591 | Mosemb (Participant)
Thanks, Adrian. Right now I am using 4 NVIDIA A100-SXM4-40GB GPUs. I am going to be running everything on a cluster, so I am testing things out on one node first and will then move to the full cluster.
May 29, 2022 at 11:41 pm #6592 | Mosemb (Participant)
I have used the same configuration as stated in the file but still get the same error:
terminate called after throwing an instance of ‘std::runtime_error’
what(): invalid device symbol
Aborted (core dumped)

May 30, 2022 at 8:01 am #6593 | Adrian (Keymaster)
You need to adapt the CUDA_ARCH value to match your target GPU. In the case of the A100 you need to set CUDA_ARCH to 80 (lower values can also work but lead to additional bytecode translation at startup). See e.g. the reference table in rules.mk and the comments in the example configs.
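As a quick way to look up the right value: on reasonably recent NVIDIA drivers, nvidia-smi can report the compute capability directly, and CUDA_ARCH is simply that number with the dot removed. A small sketch (the nvidia-smi query is shown as a comment and the A100's reported value is hard-coded, so the snippet runs without a GPU):

```shell
# On recent drivers you can query the compute capability directly:
#   cap=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n1)
# Hard-coded here to the value an A100 reports:
cap="8.0"
# CUDA_ARCH is the compute capability with the dot stripped: 8.0 -> 80
echo "CUDA_ARCH := $(echo "$cap" | tr -d .)"
```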
For multi GPU usage you will need to link against MPI with CUDA support, see e.g. config/gpu_openmpi.mk for a starting point but consult cluster documentation for further guidance on how to set up and execute MPI + CUDA on the specific system. E.g. in my tests on HoreKa the Nvidia HPC SDK provided the best results for multi node execution.
However, testing the single GPU setup first is a good idea and should work as soon as you adjust the CUDA_ARCH value and recompile.
May 31, 2022 at 2:06 pm #6596 | Mosemb (Participant)
Hey Adrian, thanks. I was able to change the config.mk file as below:
CXX := nvcc
CC := nvcc
CXXFLAGS := -O3
CXXFLAGS += -std=c++17
PARALLEL_MODE := MPI
MPIFLAGS := -lmpi_cxx -lmpi
PLATFORMS := CPU_SISD GPU_CUDA
# for e.g. RTX 30* (Ampere), see table in rules.mk for other options
CUDA_ARCH := 80
USE_EMBEDDED_DEPENDENCIES := ON

The idea is to run the application with CUDA in parallel. But when I run the application with

mpirun -np 2 bash -c 'export CUDA_VISIBLE_DEVICES=${OMPI_COMM_WORLD_LOCAL_RANK}; ./bstep2d'

the output does not seem to be parallelized. I am using 2 nodes at this point, and here is the output:

[Timer]
[Timer] —————-Summary:Timer—————-
[Timer] measured time (rt) : 649.823s
[Timer] measured time (cpu): 634.863s
[Timer] average MLUPs : 181.855
[Timer] average MLUPps: 181.855
[Timer] ———————————————
[SuperPlaneIntegralFluxVelocity2D] regionSize[m]=0.00468; flowRate[m^2/s]=0.00485398; meanVelocity[m/s]=1.03718
[SuperPlaneIntegralFluxPressure2D] regionSize[m]=0.00468; force[N]=0.0182643; meanPressure[Pa]=3.90263
[Timer] step=576920; percent=99.9995; passedTime=653.267; remTime=0.00339701; MLUPs=186.57
[LatticeStatistics] step=576920; t=1.99999; uMax=0.0301884; avEnergy=9.06721e-05; avRho=1.00147
[Timer]
[Timer] —————-Summary:Timer—————-
[Timer] measured time (rt) : 653.461s
[Timer] measured time (cpu): 639.569s
[Timer] average MLUPs : 180.842
[Timer] average MLUPps: 180.842
[Timer] ———————————————

My expectation would be to see a single value for MLUPs and MLUPps, but I get two values, one per node. How can I fix this?
May 31, 2022 at 3:11 pm #6597 | Adrian (Keymaster)
Good to hear that at least the CUDA_ARCH issue is solved.
As for the MPI usage: You are correct in that this is not actually parallelized in the output. Did you recompile (“make clean; make”) after switching to the MPI-enabled config?
How did you configure the SLURM (?) script on your cluster? That you used two nodes with 4 A100s each but only get two outputs instead of 8 leads me to believe that something is also wrong there (we need one MPI process per GPU).
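For reference, a batch script for this setup (2 nodes, 4 A100s each, one MPI rank per GPU, so 8 ranks in total) might look roughly like the sketch below. The partition and module names are placeholders and the GPU-binding convention varies between clusters, so treat this only as a starting point and consult your cluster documentation:

```shell
#!/bin/bash
#SBATCH --nodes=2                # two A100 nodes
#SBATCH --ntasks-per-node=4      # one MPI rank per GPU
#SBATCH --gres=gpu:4             # request all four GPUs on each node
#SBATCH --time=01:00:00
#SBATCH --partition=accelerated  # placeholder: your cluster's GPU partition

# Placeholder modules; load whatever provides CUDA-aware MPI on your system
module load compiler/nvhpc mpi/openmpi

# Bind each rank to its node-local GPU via SLURM_LOCALID (0..3 per node)
srun bash -c 'export CUDA_VISIBLE_DEVICES=${SLURM_LOCALID}; ./cavity3dBenchmark'
```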
Sadly, setting up a multi-GPU simulation correctly is not as straightforward as a plain CPU-only application. However, once we have found a working config for your situation it should work for all apps.