
GPU Examples

    #6589
    Mosemb
    Participant

    Hi, y'all.
    Which flags should be enabled in the config.mk file when running the GPU examples? I'm not sure which ones to turn on or off. So far I have not been able to run any of the examples, especially cavity3dBenchmark; after compilation I get the following error at runtime.

    terminate called after throwing an instance of 'std::runtime_error'
    what(): invalid device symbol
    Aborted (core dumped)

    #6590
    Adrian
    Keymaster

    For a basic GPU-only setup you can compare against e.g. the config/gpu_only.mk example config.

    In your case everything but the CUDA_ARCH setting seems to already be correct. Which GPU are you using? (There is also a reference table for CUDA_ARCH settings in rules.mk)
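    For orientation, a minimal GPU-only configuration along the lines of config/gpu_only.mk could look roughly like the sketch below. This is only a sketch: the variable names are the same ones used in config.mk, PARALLEL_MODE := NONE is an assumption for a single-process setup, and the CUDA_ARCH value is a placeholder that must be matched to your GPU.

    CXX := nvcc
    CC := nvcc
    CXXFLAGS := -O3
    CXXFLAGS += -std=c++17
    PARALLEL_MODE := NONE
    PLATFORMS := CPU_SISD GPU_CUDA
    # placeholder, adapt to your GPU (see the table in rules.mk)
    CUDA_ARCH := 80
    USE_EMBEDDED_DEPENDENCIES := ON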

    #6591
    Mosemb
    Participant

    Thanks, Adrian. Right now I am using 4 NVIDIA A100-SXM4-40GB GPUs. I am going to be running everything on a cluster, so I am testing things out first on one node and will then move to the full cluster.

    #6592
    Mosemb
    Participant

    I have used the same configuration as stated in the file but still get the same error:

    terminate called after throwing an instance of 'std::runtime_error'
    what(): invalid device symbol
    Aborted (core dumped)

    #6593
    Adrian
    Keymaster

    You need to adapt the CUDA_ARCH value to match your target GPU. In the case of the A100 you need to set CUDA_ARCH to 80 (lower values can also work but lead to additional bytecode translation at startup). See e.g. the reference table in rules.mk and the comments in the example configs.
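    If you are unsure about the compute capability of your GPU, and assuming a driver recent enough to support the compute_cap query field, nvidia-smi can report it directly; the reported value maps to CUDA_ARCH (e.g. 8.0 becomes CUDA_ARCH := 80). A sketch of the query, the exact output may vary:

    nvidia-smi --query-gpu=name,compute_cap --format=csv
    # name, compute_cap
    # NVIDIA A100-SXM4-40GB, 8.0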

    For multi-GPU usage you will need to link against MPI with CUDA support; see e.g. config/gpu_openmpi.mk for a starting point, but consult the cluster documentation for further guidance on how to set up and execute MPI + CUDA on the specific system. E.g. in my tests on HoreKa the Nvidia HPC SDK provided the best results for multi-node execution.
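    One sanity check before debugging further (assuming an Open MPI build, as suggested by config/gpu_openmpi.mk): ompi_info can report whether the MPI library itself was built with CUDA support.

    ompi_info --parsable --all | grep mpi_built_with_cuda_support:value
    # a CUDA-aware build reports a line ending in :value:true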

    However, testing the single GPU setup first is a good idea and should work as soon as you adjust the CUDA_ARCH value and recompile.

    #6596
    Mosemb
    Participant

    Hey Adrian, thanks. I was able to change the config.mk file as shown below:
    CXX := nvcc
    CC := nvcc
    CXXFLAGS := -O3
    CXXFLAGS += -std=c++17
    PARALLEL_MODE := MPI
    MPIFLAGS := -lmpi_cxx -lmpi
    PLATFORMS := CPU_SISD GPU_CUDA
    # 80 targets the A100 (Ampere); see the table in rules.mk for other options
    CUDA_ARCH := 80
    USE_EMBEDDED_DEPENDENCIES := ON

    The idea is to run the application with CUDA in parallel. But when I run the application with
    mpirun -np 2 bash -c 'export CUDA_VISIBLE_DEVICES=${OMPI_COMM_WORLD_LOCAL_RANK}; ./bstep2d'
    the output does not seem to be parallelized. I am using 2 nodes at this point, and here is the output below:

    [Timer]
    [Timer] ----------------Summary:Timer----------------
    [Timer] measured time (rt) : 649.823s
    [Timer] measured time (cpu): 634.863s
    [Timer] average MLUPs : 181.855
    [Timer] average MLUPps: 181.855
    [Timer] ---------------------------------------------
    [SuperPlaneIntegralFluxVelocity2D] regionSize[m]=0.00468; flowRate[m^2/s]=0.00485398; meanVelocity[m/s]=1.03718
    [SuperPlaneIntegralFluxPressure2D] regionSize[m]=0.00468; force[N]=0.0182643; meanPressure[Pa]=3.90263
    [Timer] step=576920; percent=99.9995; passedTime=653.267; remTime=0.00339701; MLUPs=186.57
    [LatticeStatistics] step=576920; t=1.99999; uMax=0.0301884; avEnergy=9.06721e-05; avRho=1.00147
    [Timer]
    [Timer] ----------------Summary:Timer----------------
    [Timer] measured time (rt) : 653.461s
    [Timer] measured time (cpu): 639.569s
    [Timer] average MLUPs : 180.842
    [Timer] average MLUPps: 180.842
    [Timer] ---------------------------------------------

    My expectation would be to get a single value for MLUPs and MLUPps, but instead I get two values, one for each individual node. How can I fix this?

    #6597
    Adrian
    Keymaster

    Good to hear that at least the CUDA_ARCH issue is solved.

    As for the MPI usage: You are correct in that this is not actually parallelized in the output. Did you recompile ("make clean; make") after switching to the MPI-enabled config?

    How did you configure the SLURM (?) script on your cluster? The fact that you used two nodes with 4 A100s each but only get two outputs instead of 8 leads me to believe that something is also wrong there (we need one MPI process per GPU).

    Sadly, setting up a multi-GPU simulation correctly is not as straightforward as a plain CPU-only application. However, once we have found a working config for your situation it should work for all apps.
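    As a rough illustration only (time limit, module names and the binary path are placeholders that need to be adapted to your cluster), a SLURM batch script that starts one MPI rank per GPU on two nodes with four A100s each might look like this:

    #!/bin/bash
    #SBATCH --nodes=2                # two nodes ...
    #SBATCH --ntasks-per-node=4      # ... with one MPI rank per GPU
    #SBATCH --gres=gpu:4             # four A100s per node
    #SBATCH --time=01:00:00

    # placeholder, load the CUDA-aware MPI / NVIDIA HPC SDK modules of your cluster
    # module load nvhpc

    # one rank per allocated task; each local rank is pinned to its own GPU,
    # analogous to the mpirun call quoted above
    mpirun bash -c 'export CUDA_VISIBLE_DEVICES=${OMPI_COMM_WORLD_LOCAL_RANK}; ./bstep2d'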
