Example issues on Cluster

    #6291
    Anand
    Participant

    Dear Developer,
    I am running OpenLB on the cluster in parallel and non-parallel mode.

    In non-parallel mode, everything works properly.
    In parallel mode, some examples do not work. For instance, in the bifurcation example the Euler-Euler case works, but the Euler-Lagrange case does not and shows the following error:

    .
    .
    .
    .
    [prepareGeometry] Prepare Geometry … OK
    [prepareLattice] Prepare Lattice …
    [prepareLattice] Prepare Lattice … OK
    [main] Prepare Particles …
    [main] Prepare Particles … OK
    [bear-pg0105u24b:1466329:0:1466329] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x6)
    ==== backtrace (tid:1466329) ====
    0 0x00000000000211ce ucs_debug_print_backtrace() /dev/shm/edmondac-admin-2021a-EL8/UCX/1.10.0/GCCcore-10.3.0/ucx-1.10.0/src/ucs/debug/debug.c:656
    1 0x0000000000012c20 .annobin_sigaction.c() sigaction.c:0
    2 0x0000000000482f09 olb::OuterVelocityCornerProcessor3D<double, olb::descriptors::D3Q19<>, 1, 1, -1>::process() ???:0
    3 0x00000000004307a8 olb::BlockLattice3D<double, olb::descriptors::D3Q19<> >::initialize() ???:0
    4 0x000000000041b50d setBoundaryValues() ???:0
    5 0x00000000004098e0 main() ???:0
    6 0x0000000000023493 __libc_start_main() ???:0
    7 0x000000000040a41e _start() ???:0
    =================================
    [bear-pg0105u24b:1466329] *** Process received signal ***
    [bear-pg0105u24b:1466329] Signal: Segmentation fault (11)
    [bear-pg0105u24b:1466329] Signal code: (-6)
    [bear-pg0105u24b:1466329] Failing at address: 0x9043000165fd9
    [bear-pg0105u24b:1466329] [ 0] /lib64/libpthread.so.0(+0x12c20)[0x14e898a2ec20]
    [bear-pg0105u24b:1466329] [ 1] ./bifurcation3d[0x482f09]
    [bear-pg0105u24b:1466329] [ 2] ./bifurcation3d[0x4307a8]
    [bear-pg0105u24b:1466329] [ 3] ./bifurcation3d[0x41b50d]
    [bear-pg0105u24b:1466329] [ 4] ./bifurcation3d[0x4098e0]
    [bear-pg0105u24b:1466329] [ 5] /lib64/libc.so.6(__libc_start_main+0xf3)[0x14e89957f493]
    [bear-pg0105u24b:1466329] [ 6] ./bifurcation3d[0x40a41e]
    [bear-pg0105u24b:1466329] *** End of error message ***
    --------------------------------------------------------------------------
    Primary job terminated normally, but 1 process returned
    a non-zero exit code. Per user-direction, the job has been aborted.
    --------------------------------------------------------------------------
    --------------------------------------------------------------------------
    mpirun noticed that process rank 0 with PID 1466329 on node bear-pg0105u24b exited on signal 11 (Segmentation fault).
    --------------------------------------------------------------------------

    Also, when I run my own case in both parallel and non-parallel mode, it stops after one timestep and shows:

    Unkown material number
    --------------------------------------------------------------------------
    Primary job terminated normally, but 1 process returned
    a non-zero exit code. Per user-direction, the job has been aborted.
    --------------------------------------------------------------------------
    --------------------------------------------------------------------------
    mpirun detected that one or more processes exited with non-zero status, thus causing
    the job to be terminated. The first process to do so was:

    Process name: [[39073,1],15]
    Exit code: 255

    But the same code works very well on my personal PC (Ubuntu).

    Could you please guide me on how to deal with this issue?
    Thank you
    Regards
    Ananda

    #6295
    Anand
    Participant

    Issue resolved.

    #6296
    Adrian
    Keymaster

    How did you resolve it?

    We discussed this very issue quite some time ago in our internal issue tracker and implemented a general fix (however one can also often use e.g. a slightly different resolution as a workaround for specific examples). This is one of many fixes that will be included in the upcoming release but we may be able to provide a patch sooner if required.
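
    For specific examples such as bifurcation3d, the workaround amounts to rebuilding with a slightly different resolution. A minimal sketch, assuming the example exposes its resolution as a constant near the top of the .cpp file (the exact name and shipped value may differ in your copy):

    // hypothetical: nudge the resolution constant in the example's .cpp
    const int N = 20;   // one step away from the shipped value is often enough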

    #6299
    Anand
    Participant

    Hi Adrian,
    I am running the simulations on a cluster.
    Our cluster provides a module with OpenLB installed.
    After that, I unzipped the software on my network drive and changed config.mk to run the simulation in parallel mode (activated MPI).
    Then I created a folder "examples/turbulent/Pipe", where I put my .cpp, .stl, Makefile, module.mk and .sh files.

    My .sh file is as follows:

    #!/bin/bash
    #SBATCH --qos bbdefault
    #SBATCH --ntasks 72
    #SBATCH --nodes 1
    #SBATCH --time 72:00:00
    #SBATCH --mail-type ALL
    #SBATCH --account=accountname
    module purge; module load bluebear
    module load OpenLB/1.4-0-foss-2021a
    module load gnuplot/5.4.1-GCCcore-10.2.0
    export OMP_NUM_THREADS=1
    make clean
    make cleanbuild
    make
    mpiexec -np 72 ./Pipe

    I just submit it and it works.

    I have noticed some issues on the cluster. For example:
    If I use the following line in the code, my simulation diverges, but the same line works fine on my computer.
    "CirclePowerLawTurbulent3D<T> uF( superGeometry, 3, maxVelocity[0], 7, 0.05, T(0) );"

    So I just replaced this line with the following lines, and it worked very well on the cluster.

    std::vector<T> origin = { pipelength, piperadius, piperadius};
    std::vector<T> axis = { 1, 0, 0 };
    CirclePowerLawTurbulent3D<T> POu(origin, axis, piperadius, maxVelocity[0], 7, 0.05, T(1));

    Please let me know what you think.
    Thank you
    Regards
    Ananda

    #6334
    Adrian
    Keymaster

    Thanks for the update.

    With respect to the CirclePowerLawTurbulent3D functor usage: assuming that the variable piperadius actually contains a radius, the sequence of arguments seems to be mixed up:

    
    CirclePowerLawTurbulent3D(std::vector<T> axisPoint_, std::vector<T> axisDirection,  T maxVelocity, T radius, T n = 7, T turbulenceIntensity = 0.05, T scale = T(1));
    

    Be aware that this functor uses random numbers internally, which may be a source of platform-specific problems.
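
    Following that signature, a call matching the variables from your snippet would look like this (a sketch only, assuming piperadius holds the radius and origin/axis are defined as above):

    // maxVelocity and radius swapped into the documented order
    CirclePowerLawTurbulent3D<T> POu( origin, axis, maxVelocity[0], piperadius, 7, 0.05, T(1) );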

    #6336
    Anand
    Participant

    Dear Adrian,
    Thank you for your reply.
    I noticed this mistake.
    Now, everything is working properly on a cluster.

    Regards
    Ananda

    #6409
    jflorezgi
    Participant

    Hi Adrian, I'm working on thermal indoor applications with the OpenLB libraries, but I have issues with MPI runs on a cluster. On my personal computer I have no problems loading the checkpoint files, even when running in parallel mode, but on the cluster it generates the following error:

    [prepareGeometry] Prepare Geometry … OK
    [prepareLattice] Prepare Lattice …
    [prepareLattice] Prepare Lattice … OK
    [theclimatebox-ubuntu5:08595] *** Process received signal ***
    [theclimatebox-ubuntu5:08595] Signal: Segmentation fault (11)
    [theclimatebox-ubuntu5:08595] Signal code: (128)
    [theclimatebox-ubuntu5:08595] Failing at address: (nil)
    [theclimatebox-ubuntu5:08595] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x430c0)[0x7fe8d349c0c0]
    [theclimatebox-ubuntu5:08595] [ 1] ./challenge2022-3DTurb(+0x7d612)[0x561067232612]
    [theclimatebox-ubuntu5:08595] [ 2] ./challenge2022-3DTurb(+0x86bc6)[0x56106723bbc6]
    [theclimatebox-ubuntu5:08595] [ 3] ./challenge2022-3DTurb(+0x2b05d)[0x5610671e005d]
    [theclimatebox-ubuntu5:08595] [ 4] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7fe8d347d0b3]
    [theclimatebox-ubuntu5:08595] [ 5] ./challenge2022-3DTurb(+0x2b58e)[0x5610671e058e]
    [theclimatebox-ubuntu5:08595] *** End of error message ***
    --------------------------------------------------------------------------
    Primary job terminated normally, but 1 process returned
    a non-zero exit code. Per user-direction, the job has been aborted.
    --------------------------------------------------------------------------
    --------------------------------------------------------------------------
    mpirun noticed that process rank 24 with PID 0 on node theclimatebox-ubuntu5 exited on signal 11 (Segmentation fault).

    As you say, I think it is necessary to include a patch as soon as possible. In my particular case I am working on a server that restarts every 24 hours, so this functionality is vital for my work.

    Thank you for your attention, I will be waiting for your answer.

    #6411
    Adrian
    Keymaster

    Can you provide a backtrace / describe in more detail what you are doing (w.r.t. checkpoint loading)? This sounds like a different issue than the one discussed previously in this thread.

    In any case, the next release 1.5 is finally just around the corner (planned for ~middle of this month, currently doing the last merges and testing)

    #6412
    jflorezgi
    Participant

    Hi Adrian, thank you for your reply. I'm working on a simulation intended to mimic a violent expiratory event resembling a mild cough, so for now I have two coupled lattices (NSLattice and ADLattice) to evolve the flow and temperature fields. I'm using the D3Q19 descriptor for the NSLattice and D3Q7 for the ADLattice; SmagorinskyForceMRTDynamics and SmagorinskyMRTDynamics are used for the respective dynamics, and I'm using SmagorinskyBoussinesqCouplingGenerator3D to couple the two lattices, calculating the buoyancy force (Boussinesq approximation) in the NSLattice and passing the convective velocity to the ADLattice.

    The code to load the checkpoint just after the prepareLattice function call is:

    // === 4th Step: Main Loop with Timer ===
    std::size_t iT = 0;
    Timer<T> timer( converter.getLatticeTime( maxPhysT ), superGeometry.getStatistics().getNvoxel() );

    // checks whether there is already data of the fluid from an earlier calculation
    if ( !(NSLattice.load("NSChallenge2022Coupled.checkpoint"))){

    // if there is no data available, it is generated
    timer.start();

    for ( ; iT <= converter.getLatticeTime( maxPhysT ); ++iT ) {

    // === 5th Step: Definition of Initial and Boundary Conditions ===
    setBoundaryValues( converter, NSLattice, ADLattice, superGeometry, iT );

    // === 6th Step: Collide and Stream Execution ===
    NSLattice.collideAndStream();
    ADLattice.collideAndStream();

    NSLattice.executeCoupling();

    // === 7th Step: Computation and Output of the Results ===
    getResults( NSLattice, ADLattice, cuboidGeometry, converter, iT, superGeometry, timer, file );
    }

    timer.stop();
    timer.printSummary();

    delete bulkDynamics;
    delete TbulkDynamics;
    }

    // if there exists already data of the fluid from an earlier calculation, this is used
    else{
    NSLattice.load("NSChallenge2022Coupled.checkpoint");
    ADLattice.load("ADChallenge2022Coupled.checkpoint");
    NSLattice.postLoad();
    ADLattice.postLoad();

    iT = lastCPTime;
    timer.update(iT);

    for ( ; iT <= converter.getLatticeTime( maxPhysT ); ++iT ) {

    // === 5th Step: Definition of Initial and Boundary Conditions ===
    setBoundaryValues( converter, NSLattice, ADLattice, superGeometry, iT );

    // === 6th Step: Collide and Stream Execution ===
    NSLattice.collideAndStream();
    ADLattice.collideAndStream();
    NSLattice.executeCoupling();

    // === 7th Step: Computation and Output of the Results ===
    getResults( NSLattice, ADLattice, cuboidGeometry, converter, iT, superGeometry, timer, file );
    }

    timer.stop();
    timer.printSummary();

    delete bulkDynamics;
    delete TbulkDynamics;
    }

    The checkpoint save is called in the getResults function. I'm glad to know that the next release is coming soon. I'm trying to build the outflow boundary conditions (M. Junk and Z. Yang (2008)) for open boundaries, but I have some questions; I think it is better to open a new thread on that topic. Finally, I want to know whether the next release makes it possible to use GPU parallelization.

    Thanks for your help.

    #6413
    Adrian
    Keymaster

    Yes, amongst many other new features and improvements OpenLB 1.5 will include support for Nvidia GPUs in addition to vectorization on CPUs.

    As for the checkpointing issue I suspect that the process config / assignment of blocks to processes is not the same between runs. Unfortunately this is a restriction of OpenLB’s current serialization system but in any case it should not segfault (and doesn’t when I just tested it using the pre-release version). Can you save and load a snapshot during the same HPC job to confirm that this is the issue?
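
    For instance, a minimal sketch of such a round trip within one job, reusing the file names from your snippet (the save interval is purely illustrative):

    // in getResults(): periodically write a snapshot
    if ( iT > 0 && iT % converter.getLatticeTime( 10.0 ) == 0 ) {
      NSLattice.save( "NSChallenge2022Coupled.checkpoint" );
      ADLattice.save( "ADChallenge2022Coupled.checkpoint" );
    }

    // later, still inside the same job and with the same number of processes:
    NSLattice.load( "NSChallenge2022Coupled.checkpoint" );
    ADLattice.load( "ADChallenge2022Coupled.checkpoint" );
    NSLattice.postLoad();
    ADLattice.postLoad();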

    #6414
    jflorezgi
    Participant

    Yes, I have been doing some tests and noticed that if I run the simulation with more than four processes, the segmentation fault appears after loading the data from the saved files; it doesn't matter whether I'm running on the cluster or on my computer. The loading call works with four or fewer processes. I don't know how to deal with this problem; if you have any ideas I'd appreciate it.

    Thanks in advance

    #6494
    mathias
    Keymaster

    Can you check with olb 1.5 now?

    #6505
    jflorezgi
    Participant

    Hi Mathias, I have been doing some tests to load the saved files, but so far I haven't been able to load them properly, running either serially or in parallel (MPI). The output generates an error like the following:

    [prepareGeometry] Prepare Geometry … OK
    [prepareLattice] defining dynamics
    [prepareLattice] Prepare Lattice … OK
    [LAPTOP-H5JAJF81:01813] *** Process received signal ***
    [LAPTOP-H5JAJF81:01813] Signal: Segmentation fault (11)
    [LAPTOP-H5JAJF81:01813] Signal code: Address not mapped (1)
    [LAPTOP-H5JAJF81:01813] Failing at address: 0x1a8
    [LAPTOP-H5JAJF81:01813] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x46210)[0x7fba20e16210]
    [LAPTOP-H5JAJF81:01813] [ 1] ./prueba2D(+0x5b150)[0x7fba21525150]
    [LAPTOP-H5JAJF81:01813] [ 2] ./prueba2D(+0x70dec)[0x7fba2153adec]
    [LAPTOP-H5JAJF81:01813] [ 3] ./prueba2D(+0x2e773)[0x7fba214f8773]
    [LAPTOP-H5JAJF81:01813] [ 4] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7fba20df70b3]
    [LAPTOP-H5JAJF81:01813] [ 5] ./prueba2D(+0x2ebfe)[0x7fba214f8bfe]
    [LAPTOP-H5JAJF81:01813] *** End of error message ***
    --------------------------------------------------------------------------
    Primary job terminated normally, but 1 process returned
    a non-zero exit code. Per user-direction, the job has been aborted.
    --------------------------------------------------------------------------
    [LAPTOP-H5JAJF81:01812] *** Process received signal ***
    [LAPTOP-H5JAJF81:01812] Signal: Segmentation fault (11)
    [LAPTOP-H5JAJF81:01812] Signal code: Address not mapped (1)
    [LAPTOP-H5JAJF81:01812] Failing at address: 0x1a8
    [LAPTOP-H5JAJF81:01812] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x46210)[0x7f87fe5a6210]
    [LAPTOP-H5JAJF81:01812] [ 1] ./prueba2D(+0x5b150)[0x7f87fecaf150]
    [LAPTOP-H5JAJF81:01812] [ 2] ./prueba2D(+0x70dec)[0x7f87fecc4dec]
    [LAPTOP-H5JAJF81:01812] [ 3] ./prueba2D(+0x2e773)[0x7f87fec82773]
    [LAPTOP-H5JAJF81:01812] [ 4] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7f87fe5870b3]
    [LAPTOP-H5JAJF81:01812] [ 5] ./prueba2D(+0x2ebfe)[0x7f87fec82bfe]
    [LAPTOP-H5JAJF81:01812] *** End of error message ***
    --------------------------------------------------------------------------
    mpirun noticed that process rank 2 with PID 0 on node LAPTOP-H5JAJF81 exited on signal 11 (Segmentation fault).

    I don't know if I'm calling the load function properly, so I'll leave you part of the code; I'd appreciate it if you could check this part:

    NSlattice.addLatticeCoupling(coupling, ADlattice);

    prepareLattice(converter, NSlattice, ADlattice, superGeometry);

    /// === 4th Step: Main Loop with Timer ===
    std::size_t iT = 0;
    util::Timer<T> timer(converter.getLatticeTime(maxPhysT), superGeometry.getStatistics().getNvoxel() );
    util::ValueTracer<T> converge(converter.getLatticeTime(0.01),epsilon);

    // checks whether there is already data of the fluid from an earlier calculation
    if ( !(NSlattice.load("NSprueba2DCoupled"))){
    // if there is no data available, it is generated
    timer.start();

    for ( ; iT < converter.getLatticeTime(maxPhysT); ++iT) {

    if (converge.hasConverged()) {
    clout << "Simulation converged." << std::endl;
    getResults(converter, NSlattice, ADlattice, iT, superGeometry, timer, file, converge.hasConverged());

    clout << "Time " << iT << "." << std::endl;

    break;
    }

    /// === 5th Step: Definition of Initial and Boundary Conditions ===
    setBoundaryValues(converter, NSlattice, ADlattice, iT, superGeometry);

    /// === 6th Step: Collide and Stream Execution ===
    ADlattice.collideAndStream();
    NSlattice.collideAndStream();
    NSlattice.executeCoupling();

    /// === 7th Step: Computation and Output of the Results ===
    getResults(converter, NSlattice, ADlattice, iT, superGeometry, timer, file, converge.hasConverged());
    converge.takeValue(ADlattice.getStatistics().getAverageEnergy(),true);
    }
    timer.stop();
    timer.printSummary();
    }
    // if there exists already data of the fluid from an earlier calculation, this is used
    else{
    NSlattice.load("NSprueba2DCoupled");
    ADlattice.load("ADprueba2DCoupled");
    NSlattice.postLoad();
    ADlattice.postLoad();

    iT = lastCPTime;
    timer.update(iT);

    for ( ; iT < converter.getLatticeTime(maxPhysT); ++iT) {

    if (converge.hasConverged()) {
    clout << "Simulation converged." << std::endl;
    getResults(converter, NSlattice, ADlattice, iT, superGeometry, timer, file, converge.hasConverged());

    clout << "Time " << iT << "." << std::endl;

    break;
    }

    /// === 5th Step: Definition of Initial and Boundary Conditions ===
    setBoundaryValues(converter, NSlattice, ADlattice, iT, superGeometry);

    /// === 6th Step: Collide and Stream Execution ===
    ADlattice.collideAndStream();
    NSlattice.collideAndStream();
    NSlattice.executeCoupling();

    /// === 7th Step: Computation and Output of the Results ===
    getResults(converter, NSlattice, ADlattice, iT, superGeometry, timer, file, converge.hasConverged());
    converge.takeValue(ADlattice.getStatistics().getAverageEnergy(),true);
    }
    timer.stop();
    timer.printSummary();
    }

    #6520
    mathias
    Keymaster

    Can you check our examples, e.g. bstep2d / bstep3d?

    #6524
    jflorezgi
    Participant

    Hi Mathias, I tested the bstep2d and rayleighBenard2d examples, and these load the saved checkpoint files correctly, so I reworked my code using the second example as a template and now it is working properly. I think the problem was related to the way I defined the lastCPTime variable, but I'm not sure.
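
    For reference, a minimal sketch of one way to persist lastCPTime next to the checkpoint files so a restart resumes at the right timestep (the file name and plain-text format are illustrative, not part of OpenLB; requires <fstream>):

    // when writing the checkpoint in getResults(), also record the iteration
    std::ofstream cpTimeOut( "lastCPTime.dat" );
    cpTimeOut << iT;

    // on restart, read it back before entering the main loop
    std::size_t lastCPTime = 0;
    std::ifstream cpTimeIn( "lastCPTime.dat" );
    if ( cpTimeIn ) {
      cpTimeIn >> lastCPTime;
    }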

    Thanks for your help.
