Re: Speedup in multi-calculation-nodes

November 9, 2017 at 9:07 am #2762

Participant

Dear Markus,
I ran the simulation with mpiexec from OpenMPI. My case is a rectangular wind tunnel with a single box in it. I used the Smagorinsky model at Re = 125000, and the mesh has 16 million cells.

With 16 cores on one node, the timer output looks like this:
[Timer] Lattice-Timesteps | CPU time/estim | REAL time/estim | ETA | MLUPs
[Timer] 5000/600500 ( 0%) |1059.57/127254.36 | 1060.94/127419.13 |126359 | 0.00

With 64 cores on 4 nodes, it looks like this:
[Timer] Lattice-Timesteps | CPU time/estim | REAL time/estim | ETA | MLUPs
[Timer] 5000/600500 ( 0%) |3143.54/377539.15 | 3149.02/378197.78 |375049 | 0.00

So the calculation time on 4 nodes seems to be about 3 times that on 1 node?
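The slowdown can be read off the REAL time column of the two timer lines above. The Python snippet below is only a rough sketch of that arithmetic. It treats "16 million" as an exact cell count, which is an assumption, and the names `mlups` and `runs` are made up for illustration and are not part of OpenLB. The MLUPs column in the log shows 0.00, so the throughput here is estimated from the cell count instead.

```python
# Timer values copied from the two "[Timer]" lines above.
STEPS = 5000            # lattice time steps completed in both runs
CELLS = 16_000_000      # assumption: "16 million" taken as an exact cell count

runs = {
    # label: (number of MPI processes, REAL time in seconds for STEPS steps)
    "1 node": (16, 1060.94),
    "4 nodes": (64, 3149.02),
}


def mlups(cells, steps, seconds):
    """Estimated million lattice-cell updates per second."""
    return cells * steps / seconds / 1e6


base_procs, base_time = runs["1 node"]
for label, (procs, seconds) in runs.items():
    speedup = base_time / seconds                # > 1 means faster than 1 node
    efficiency = speedup * base_procs / procs    # relative to the 1-node run
    print(f"{label}: {mlups(CELLS, STEPS, seconds):.1f} MLUPs, "
          f"speedup {speedup:.2f}, efficiency {efficiency:.1%}")
```

Under these assumptions the 4-node run takes roughly 3 times as long for the same 5000 steps, so the speedup relative to one node is well below 1 rather than close to 4.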

I also tried the cylinder2d example with a mesh 1000 times larger than the original one. The calculation time on 4 nodes is almost the same as on 1 node.

My cluster has 4 nodes. Each node is an HP ProLiant DL360 Gen9 with 2 Xeon E5-2667 processors at 3.2 GHz. Each processor has 8 cores, and each node has 128 GB of memory.

My Makefile.inc is shown below.

Code:

#CXX := g++
#CXX := icpc -D__aligned__=ignored
#CXX := mpiCC
CXX := mpic++

CC := gcc # necessary for zlib, for Intel use icc

OPTIM := -O3 -Wall -march=native -mtune=native # for gcc
#OPTIM := -O3 -Wall -xHost # for Intel compiler
DEBUG := -g -DOLB_DEBUG

CXXFLAGS := $(OPTIM)
#CXXFLAGS := $(DEBUG)

CXXFLAGS += -std=c++0x
#CXXFLAGS += -std=c++11

#CXXFLAGS += -fdiagnostics-color=auto
#CXXFLAGS += -std=gnu++14

ARPRG := ar
#ARPRG := xiar # mandatory for intel compiler

LDFLAGS :=

#PARALLEL_MODE := OFF
PARALLEL_MODE := MPI
#PARALLEL_MODE := OMP
#PARALLEL_MODE := HYBRID

MPIFLAGS :=
OMPFLAGS := -fopenmp

#BUILDTYPE := precompiled
BUILDTYPE := generic

Best wishes,
steed188