
Speedup in multi-calculation-nodes


  • #1954
    steed188
    Participant

    Hi, everyone,
    I’ve tried several cases, each with over 100 million lattice grid points. I found that when they were run on a single compute node with 16 CPU threads, the calculation sped up very well when using 1, 2, 4, 8 and 16 threads. But when I tried to use 2 compute nodes with 32 threads, or 4 nodes with 64 threads, the speed was almost the same as with 1 node and 16 threads. There was no improvement in speed.

    I think the problem is similar to Mr. jepson’s in
    optilb.org/openlb/forum?mingleforumaction=viewtopic&t=12

    I would like to know why it does not speed up on more than one node. Is it because of the communication volume between nodes, or a limitation on the lattice count? Can it be improved?

    With best regards,
    steed188

    #2760
    Markus Mohrhard
    Participant

    Hey steed188,

    Are you using OpenMP or MPI for your simulations?

    Our current OpenMP code is not really efficient, and it is generally recommended to use MPI with the current releases (we are working on an improved hybrid OpenMP + MPI mode).

    In general we scale quite well (at least for weak scaling), but as soon as you move from one node to two, you incur communication overhead, since the data exchange can no longer be implemented through shared-memory copy operations.

    However, I think the performance should be fairly stable for most cases. I will try to post some numbers from our own HPC system soon.
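
    As a minimal sketch of what a multi-node MPI run can look like (assuming an OpenMPI installation; the hostnames, slot counts and example binary below are placeholders), the build and launch would be roughly:

    Code:
    # In Makefile.inc: build with the MPI compiler wrapper and enable MPI mode
    #   CXX := mpic++
    #   PARALLEL_MODE := MPI
    make clean && make

    # hosts.txt tells mpirun which nodes to use and how many ranks each takes:
    #   node01 slots=16
    #   node02 slots=16

    # Launch 32 ranks spread over both nodes
    mpirun -np 32 --hostfile hosts.txt ./cylinder2d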

    #2761
    Markus Mohrhard
    Participant

    A quick run with an adapted cylinder2d example has finally finished. I only changed the value of N to 8 in examples/cylinder2d/cylinder2d.cpp and left everything else in the standard OpenLB 1.1 state.

    On the HPC cluster I ran the job with 1 node/8 cores, 1 node/16 cores and 2 nodes/16 cores each. The following performance results were obtained:

    1n/8c : 131.4 MLUPs => 16.4 MLUPps
    1n/16c: 240.7 MLUPs => 15.0 MLUPps
    2n/32c: 450.4 MLUPs => 14.1 MLUPps

    Another result that I still have from a test run with N set to 20 and 16 nodes, each using 16 cores (256 cores total):

    16n/256c: 2018.4 MLUPs => 7.9 MLUPps

    This shows that while we don’t have perfect scaling (especially for such small problems; the last run had fewer than 14k grid points per core), we still scale quite well to several nodes and a few hundred cores.
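
    For reference, reading MLUPps as MLUPs per process, the per-process figures above are simply the total throughput divided by the number of cores used:

    \[
    \text{MLUPps} = \frac{\text{MLUPs}}{n_\text{cores}}, \qquad \text{e.g.}\ \frac{131.4}{8} \approx 16.4, \quad \frac{2018.4}{256} \approx 7.9 .
    \]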

    To help you with your scaling problem, I would need some more info:

    • What type of cluster are you using?
    • Which code are you running?
    • Which compiler options are you using in Makefile.inc?

    Regards,
    Markus

    #2762
    steed188
    Participant

    Dear Markus
    I used OpenMPI’s mpiexec for the simulation. My case is like a rectangular wind tunnel with a single box inside it. I used the Smagorinsky model, Re = 125,000, and the mesh has 16 million cells.

    For 16 CPUs on one node, it looks like this:
    [Timer] Lattice-Timesteps | CPU time/estim | REAL time/estim | ETA | MLUPs
    [Timer] 5000/600500 ( 0%) |1059.57/127254.36 | 1060.94/127419.13 |126359 | 0.00

    For 64 CPUs across 4 nodes, it looks like this:
    [Timer] Lattice-Timesteps | CPU time/estim | REAL time/estim | ETA | MLUPs
    [Timer] 5000/600500 ( 0%) |3143.54/377539.15 | 3149.02/378197.78 |375049 | 0.00

    It seems that the calculation time on 4 nodes is even 3 times that of 1 node?

    I also tried the cylinder2d example with 1000 times the mesh count of the original. The calculation time on 4 nodes is almost the same as on 1 node.

    My cluster has 4 nodes. Each node is an HP ProLiant DL360 Gen9 with two Xeon E5-2667 (3.2 GHz) processors, each processor has 8 cores, and each node has 128 GB of memory.

    My Makefile.inc looks like this:

    Code:
    #CXX := g++
    #CXX := icpc -D__aligned__=ignored
    #CXX := mpiCC
    CXX := mpic++

    CC := gcc # necessary for zlib, for Intel use icc

    OPTIM := -O3 -Wall -march=native -mtune=native # for gcc
    #OPTIM := -O3 -Wall -xHost # for Intel compiler
    DEBUG := -g -DOLB_DEBUG

    CXXFLAGS := $(OPTIM)
    #CXXFLAGS := $(DEBUG)

    CXXFLAGS += -std=c++0x
    #CXXFLAGS += -std=c++11

    #CXXFLAGS += -fdiagnostics-color=auto
    #CXXFLAGS += -std=gnu++14

    ARPRG := ar
    #ARPRG := xiar # mandatory for intel compiler

    LDFLAGS :=

    #PARALLEL_MODE := OFF
    PARALLEL_MODE := MPI
    #PARALLEL_MODE := OMP
    #PARALLEL_MODE := HYBRID

    MPIFLAGS :=
    OMPFLAGS := -fopenmp

    #BUILDTYPE := precompiled
    BUILDTYPE := generic

    best wishes,
    steed188

    #2763
    Markus Mohrhard
    Participant

    Hey,

    Already the absolute performance seems way too low. A reported 0.00 MLUPs points to some other problem.

    Additionally, cavity2d is known to scale quite well even for small grid sizes; for larger grid sizes even strong scaling should be fairly good. One thing that might be a problem for you is the interconnect between the nodes. We see serious scaling problems when we switch from our InfiniBand network to our normal Ethernet network.

    In general, I would start by inspecting why the parallel version of cavity2d or cavity3d does not scale on your hardware. These examples are known to scale quite well.
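
    As a quick sanity check (a sketch assuming OpenMPI; hosts.txt is a placeholder hostfile listing your nodes), you can first verify that the ranks really get distributed across the nodes, and then compare the MLUPs reported by cavity3d on one node versus two:

    Code:
    # Each rank prints the node it runs on; the output should show both hostnames
    mpirun -np 32 --hostfile hosts.txt hostname

    # Compare the MLUPs column of the [Timer] output for 1 node vs. 2 nodes
    mpirun -np 16 --hostfile hosts.txt ./cavity3d
    mpirun -np 32 --hostfile hosts.txt ./cavity3d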

    Regards,
    Markus
