Speedup in multi-calculation-nodes
November 7, 2017 at 1:17 pm #1954
steed188 (Participant)
Hi, everyone,
I’ve tried several cases, all with over 100 million lattice grid points. I found that when they were simulated on a single computer node with 16 CPU threads, the calculation sped up very well using 1, 2, 4, 8 and 16 threads. But when I tried to use 2 computer nodes with 32 threads, or 4 nodes with 64 threads, the speed was almost the same as with 1 node and 16 threads. There was no improvement in speed. I think the problem is similar to Mr. jepson’s in
optilb.org/openlb/forum?mingleforumaction=viewtopic&t=12
I would like to know why it does not speed up on more than one node. Is it because of the large communication between nodes, or a limitation on lattice quantity? Can it be improved?
with best
steed188

November 7, 2017 at 8:22 pm #2760
Markus Mohrhard (Participant)

Hey steed188,
are you using OpenMP or MPI for your simulations?
In general our current OpenMP code is not really efficient, and it is generally recommended to use MPI with the current releases (we are working on an improved hybrid OpenMP + MPI mode).
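For reference, a minimal sketch of an MPI build and run (assuming PARALLEL_MODE := MPI is set in Makefile.inc and that the example binary takes the name of its directory, e.g. cavity2d):
Code:
# rebuild the library and one example with MPI enabled
make clean && make
cd examples/cavity2d
make clean && make
# start one rank per core on a single node
mpirun -np 16 ./cavity2d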
In general we scale quite well (at least for weak scaling), but obviously as soon as you move from one node to two nodes you get an overhead from communication that can no longer be implemented through shared-memory copy operations.
However I think in general the performance for most cases should be somewhat stable. I will try to post some numbers from our own HPC system soon.
November 7, 2017 at 8:46 pm #2761
Markus Mohrhard (Participant)

A quick run with an adapted cylinder2d example has now finished. I only changed the value of N to 8 in examples/cylinder2d/cylinder2d.cpp and left the rest in the normal OpenLB 1.1 state.
On the HPC cluster I ran the job with 1 node/8 cores, 1 node/16 cores and 2 nodes/16 cores each. The following performance results can be obtained:
1n/8c : 131.4 MLUPs => 16.4 MLUPps
1n/16c: 240.7 MLUPs => 15.0 MLUPps
2n/32c: 450.4 MLUPs => 14.1 MLUPps

Another result that I still have, from a test run with N set to 20 and 16 nodes each using 16 cores (256 cores in total):
16n/256c: 2018.4 MLUPs => 7.9 MLUPps
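For orientation, the per-core figure is simply the total rate divided by the number of cores, e.g. 131.4 MLUPs / 8 cores ≈ 16.4 MLUPps and 2018.4 MLUPs / 256 cores ≈ 7.9 MLUPps, so the decline in MLUPps directly shows the loss in parallel efficiency.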
This shows that while we don’t have perfect scaling (especially for such small problems; the last run had less than 14k grid points per core), we still scale quite well to several nodes and a few hundred cores.
To help you with your scaling problem I would need some more info:
- What type of cluster are you using?
- Which code are you running?
- Which compiler options are you using in Makefile.inc?
Regards,
Markus

November 9, 2017 at 9:07 am #2762
steed188 (Participant)

Dear Markus,
I used mpiexec from OpenMPI to run the simulation. My case is basically a rectangular wind tunnel with a single box in it. I used the Smagorinsky model, Re = 125000, and the mesh has 16 million cells.

For the 16 CPUs of one node, the output looks like this:
[Timer] Lattice-Timesteps | CPU time/estim | REAL time/estim | ETA | MLUPs
[Timer] 5000/600500 ( 0%) |1059.57/127254.36 | 1060.94/127419.13 |126359 | 0.00

For the 64 CPUs of 4 nodes, it looks like this:
[Timer] Lattice-Timesteps | CPU time/estim | REAL time/estim | ETA | MLUPs
[Timer] 5000/600500 ( 0%) |3143.54/377539.15 | 3149.02/378197.78 |375049 | 0.00

It seems that the calculation time on 4 nodes is even 3 times that of 1 node?
I also tried the cylinder2d example with a mesh 1000 times the size of the original one. The calculation time on 4 nodes is almost the same as on 1 node.
My cluster has 4 nodes. Each node is an HP ProLiant DL360 Gen9 with 2 Xeon E5-2667 processors at 3.2 GHz, every processor has 8 cores, and every node has 128 GB memory.
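For completeness, a minimal sketch of how a 64-rank job across the four nodes is typically launched with OpenMPI (the host names and the binary name windTunnel3d are placeholders, not the actual case):
Code:
# hosts file, one line per node, 16 slots each (placeholder names):
#   node01 slots=16
#   node02 slots=16
#   node03 slots=16
#   node04 slots=16
mpiexec -np 64 --hostfile hosts ./windTunnel3d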
My Makefile.inc looks like this:
Code:
#CXX := g++
#CXX := icpc -D__aligned__=ignored
#CXX := mpiCC
CXX := mpic++

CC := gcc # necessary for zlib, for Intel use icc

OPTIM := -O3 -Wall -march=native -mtune=native # for gcc
#OPTIM := -O3 -Wall -xHost # for Intel compiler
DEBUG := -g -DOLB_DEBUG

CXXFLAGS := $(OPTIM)
#CXXFLAGS := $(DEBUG)

CXXFLAGS += -std=c++0x
#CXXFLAGS += -std=c++11
#CXXFLAGS += -fdiagnostics-color=auto
#CXXFLAGS += -std=gnu++14

ARPRG := ar
#ARPRG := xiar # mandatory for intel compiler

LDFLAGS :=

#PARALLEL_MODE := OFF
PARALLEL_MODE := MPI
#PARALLEL_MODE := OMP
#PARALLEL_MODE := HYBRID

MPIFLAGS :=
OMPFLAGS := -fopenmp

#BUILDTYPE := precompiled
BUILDTYPE := generic

best wishes,
steed188

November 9, 2017 at 2:33 pm #2763
Markus Mohrhard (Participant)

Hey,
already the absolute performance seems to be way too low. A reported rate of 0.00 MLUPs points to some other problem.
Additionally, cavity2d is known to scale quite well even for small grid sizes, and for larger grid sizes even strong scaling should be fairly good. One thing that might be a problem for you is the connection between the nodes: I think we see serious scaling problems if we switch from our InfiniBand network to our normal Ethernet network.
In general I would start inspecting why the parallel version of cavity2d or cavity3d does not scale on your hardware. These examples are known to scale quite well.
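As a starting point, a minimal sketch of such a check (assuming OpenMPI, a hostfile named hosts listing two of the nodes, and that the example binary takes the name of its directory):
Code:
cd examples/cavity3d
make clean && make
mpirun -np 16 ./cavity3d                     # 1 node, 16 ranks
mpirun -np 32 --hostfile hosts ./cavity3d    # 2 nodes, 16 ranks each
# if the two-node run collapses, check whether MPI falls back to plain TCP/Ethernet,
# e.g. by disabling the TCP transport and seeing whether the job still starts:
mpirun -np 32 --hostfile hosts --mca btl ^tcp ./cavity3d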
Regards,
Markus