Speedup in multi-calculation-nodes
November 7, 2017 at 1:17 pm #1954
steed188 (Participant)
Hi, everyone,
I’ve tried several cases, all with over 100 million lattice grid points. I found that when they were simulated on a single computer node with 16 CPU threads, the calculation sped up very well using 1, 2, 4, 8 and 16 threads. But when I tried to use 2 computer nodes with 32 threads, or 4 nodes with 64 threads, the speed was almost the same as with 1 node and 16 threads. There was no improvement in speed. I think the problem is similar to Mr. jepson’s in
optilb.org/openlb/forum?mingleforumaction=viewtopic&t=12
I would like to know why it does not speed up on more than one node. Is it because of the large communication between nodes, or a limitation on lattice quantity? Can it be improved?
with best
steed188

November 7, 2017 at 8:22 pm #2760
Markus Mohrhard (Participant)

Hey steed188,
are you using OpenMP or MPI for your simulations?
In general our current OpenMP code is not really efficient, and it is generally recommended to use MPI with the current releases (we are working on an improved hybrid OpenMP + MPI mode).
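For reference, a minimal sketch of an MPI build and run (assuming PARALLEL_MODE := MPI is set in Makefile.inc and that the example binary takes the name of its directory, e.g. cavity2d):
Code:
# rebuild the library and one example with MPI enabled
make clean && make
cd examples/cavity2d
make clean && make
# start one rank per core on a single node
mpirun -np 16 ./cavity2d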
In general we scale quite well (at least for weak scaling), but obviously as soon as you move from one node to two nodes you get an overhead from communication that can no longer be implemented through shared-memory copy operations.
However I think in general the performance for most cases should be somewhat stable. I will try to post some numbers from our own HPC system soon.
November 7, 2017 at 8:46 pm #2761
Markus Mohrhard (Participant)

A quick run with an adapted cylinder2d example has now finished. I only changed the value of N to 8 in examples/cylinder2d/cylinder2d.cpp and left the rest in the normal OpenLB 1.1 state.
On the HPC cluster I ran the job with 1 node/8 cores, 1 node/16 cores and 2 nodes/16 cores each. The following performance results can be obtained:
1n/8c : 131.4 MLUPs => 16.4 MLUPps
1n/16c: 240.7 MLUPs => 15.0 MLUPps
2n/32c: 450.4 MLUPs => 14.1 MLUPps

Another result that I still have, from a test run with N set to 20 and 16 nodes each using 16 cores (256 cores in total):
16n/256c: 2018.4 MLUPs => 7.9 MLUPps
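For orientation, the per-core figure is simply the total rate divided by the number of cores, e.g. 131.4 MLUPs / 8 cores ≈ 16.4 MLUPps and 2018.4 MLUPs / 256 cores ≈ 7.9 MLUPps, so the decline in MLUPps directly shows the loss in parallel efficiency.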
This shows that while we don’t have perfect scaling (especially for such small problems; the last run had less than 14k grid points per core), we still scale quite well to several nodes and a few hundred cores.
To help you with your scaling problem I would need some more info:
- What type of cluster are you using?
- Which code are you running?
- Which compiler options are you using in Makefile.inc?
Regards,
Markus

November 9, 2017 at 9:07 am #2762
steed188 (Participant)

Dear Markus,
I used mpiexec from OpenMPI to run the simulation. My case is basically a rectangular wind tunnel with a single box in it. I used the Smagorinsky model, Re = 125000, and the mesh has 16 million cells.

For the 16 CPUs of one node, the output looks like this:
[Timer] Lattice-Timesteps | CPU time/estim | REAL time/estim | ETA | MLUPs
[Timer] 5000/600500 ( 0%) |1059.57/127254.36 | 1060.94/127419.13 |126359 | 0.00

For the 64 CPUs of 4 nodes, it looks like this:
[Timer] Lattice-Timesteps | CPU time/estim | REAL time/estim | ETA | MLUPs
[Timer] 5000/600500 ( 0%) |3143.54/377539.15 | 3149.02/378197.78 |375049 | 0.00

It seems that the calculation time on 4 nodes is even 3 times that of 1 node?
I also tried the cylinder2d example with a mesh 1000 times the size of the original one. The calculation time on 4 nodes is almost the same as on 1 node.
My cluster has 4 nodes. Each node is an HP ProLiant DL360 Gen9 with 2 Xeon E5-2667 processors at 3.2 GHz, every processor has 8 cores, and every node has 128 GB memory.
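For completeness, a minimal sketch of how a 64-rank job across the four nodes is typically launched with OpenMPI (the host names and the binary name windTunnel3d are placeholders, not the actual case):
Code:
# hosts file, one line per node, 16 slots each (placeholder names):
#   node01 slots=16
#   node02 slots=16
#   node03 slots=16
#   node04 slots=16
mpiexec -np 64 --hostfile hosts ./windTunnel3d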
My Makefile.inc looks like this:
Code:
#CXX := g++
#CXX := icpc -D__aligned__=ignored
#CXX := mpiCC
CXX := mpic++

CC := gcc # necessary for zlib, for Intel use icc

OPTIM := -O3 -Wall -march=native -mtune=native # for gcc
#OPTIM := -O3 -Wall -xHost # for Intel compiler
DEBUG := -g -DOLB_DEBUG

CXXFLAGS := $(OPTIM)
#CXXFLAGS := $(DEBUG)

CXXFLAGS += -std=c++0x
#CXXFLAGS += -std=c++11
#CXXFLAGS += -fdiagnostics-color=auto
#CXXFLAGS += -std=gnu++14

ARPRG := ar
#ARPRG := xiar # mandatory for intel compiler

LDFLAGS :=

#PARALLEL_MODE := OFF
PARALLEL_MODE := MPI
#PARALLEL_MODE := OMP
#PARALLEL_MODE := HYBRID

MPIFLAGS :=
OMPFLAGS := -fopenmp

#BUILDTYPE := precompiled
BUILDTYPE := generic

best wishes,
steed188

November 9, 2017 at 2:33 pm #2763
Markus Mohrhard (Participant)

Hey,
already the absolute performance seems to be way too low. A reported rate of 0.00 MLUPs points to some other problem.
Additionally, cavity2d is known to scale quite well even for small grid sizes, and for larger grid sizes even strong scaling should be fairly good. One thing that might be a problem for you is the connection between the nodes: I think we see serious scaling problems if we switch from our InfiniBand network to our normal Ethernet network.
In general I would start inspecting why the parallel version of cavity2d or cavity3d does not scale on your hardware. These examples are known to scale quite well.
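As a starting point, a minimal sketch of such a check (assuming OpenMPI, a hostfile named hosts listing two of the nodes, and that the example binary takes the name of its directory):
Code:
cd examples/cavity3d
make clean && make
mpirun -np 16 ./cavity3d                     # 1 node, 16 ranks
mpirun -np 32 --hostfile hosts ./cavity3d    # 2 nodes, 16 ranks each
# if the two-node run collapses, check whether MPI falls back to plain TCP/Ethernet,
# e.g. by disabling the TCP transport and seeing whether the job still starts:
mpirun -np 32 --hostfile hosts --mca btl ^tcp ./cavity3d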
Regards,
Markus