Re: Speedup in multi-calculation-nodes
Dear Markus,
I ran the simulation with mpiexec from OpenMPI. My case is a rectangular wind tunnel with a single box in it. I used the Smagorinsky model at Re = 125000, and the mesh has 16 million cells.
With 16 CPU cores on one node, the timer output looks like this:
[Timer] Lattice-Timesteps | CPU time/estim | REAL time/estim | ETA | MLUPs
[Timer] 5000/600500 ( 0%) |1059.57/127254.36 | 1060.94/127419.13 |126359 | 0.00
With 64 CPU cores on 4 nodes, the timer output looks like this:
[Timer] Lattice-Timesteps | CPU time/estim | REAL time/estim | ETA | MLUPs
[Timer] 5000/600500 ( 0%) |3143.54/377539.15 | 3149.02/378197.78 |375049 | 0.00
So the calculation on 4 nodes seems to take about 3 times as long as on 1 node?
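To put numbers on this, here is a small Python sketch that computes the speedup and parallel efficiency from the REAL time values in the two [Timer] lines above. The function names are just my own helpers, not part of OpenLB. The throughput figure assumes that the mesh has exactly 16 million cells and that the 5000 steps are representative, so treat it as a rough estimate only.

```python
# Wall-clock (REAL) seconds for the first 5000 time steps,
# copied from the two [Timer] lines above.
REAL_TIME_1_NODE = 1060.94    # 16 MPI processes on 1 node
REAL_TIME_4_NODES = 3149.02   # 64 MPI processes on 4 nodes

STEPS = 5000
CELLS = 16_000_000            # assumption: "16 million" taken as an exact cell count


def speedup(t_ref, t_new):
    """Speedup of the new run relative to the reference run (>1 means faster)."""
    return t_ref / t_new


def efficiency(t_ref, p_ref, t_new, p_new):
    """Parallel efficiency: measured speedup divided by the ideal speedup p_new / p_ref."""
    return speedup(t_ref, t_new) / (p_new / p_ref)


def mlups(cells, steps, seconds):
    """Million lattice updates per second, estimated from cell count, steps and wall time."""
    return cells * steps / seconds / 1e6


s = speedup(REAL_TIME_1_NODE, REAL_TIME_4_NODES)
e = efficiency(REAL_TIME_1_NODE, 16, REAL_TIME_4_NODES, 64)
m1 = mlups(CELLS, STEPS, REAL_TIME_1_NODE)
m4 = mlups(CELLS, STEPS, REAL_TIME_4_NODES)

print(f"speedup 16 -> 64 processes: {s:.3f} (ideal would be 4)")
print(f"parallel efficiency:        {e:.3f} (ideal would be 1)")
print(f"estimated MLUPs, 1 node:    {m1:.1f}")
print(f"estimated MLUPs, 4 nodes:   {m4:.1f}")
```

A speedup below 1 means the 4-node run is slower in absolute terms, not just less efficient. The estimated MLUPs are also clearly nonzero, unlike the 0.00 shown in the MLUPs column of the timer output.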
I also tried the cylinder2d example with 1000 times as many mesh cells as the original. The calculation time on 4 nodes is almost the same as on 1 node.
My cluster has 4 nodes. Each node is an HP ProLiant DL360 Gen9 with 2 Xeon E5-2667 3.2 GHz processors, 8 cores per processor, and 128 GB of memory.
My Makefile.inc is as follows:
#CXX := icpc -D__aligned__=ignored
#CXX := mpiCC
CXX := mpic++
CC := gcc # necessary for zlib, for Intel use icc
OPTIM := -O3 -Wall -march=native -mtune=native # for gcc
#OPTIM := -O3 -Wall -xHost # for Intel compiler
DEBUG := -g -DOLB_DEBUG
CXXFLAGS := $(OPTIM)
#CXXFLAGS := $(DEBUG)
CXXFLAGS += -std=c++0x
#CXXFLAGS += -std=c++11
#CXXFLAGS += -fdiagnostics-color=auto
#CXXFLAGS += -std=gnu++14
ARPRG := ar
#ARPRG := xiar # mandatory for intel compiler
LDFLAGS :=
#PARALLEL_MODE := OFF
PARALLEL_MODE := MPI
#PARALLEL_MODE := OMP
#PARALLEL_MODE := HYBRID
MPIFLAGS :=
OMPFLAGS := -fopenmp
#BUILDTYPE := precompiled
BUILDTYPE := generic
Best wishes,
steed188