Example issues on Cluster

    #6291
    Anand
    Participant

    Dear Developer,
    I am running OpenLB on the cluster in parallel and non-parallel mode.

    In non-parallel mode, everything works properly.
    In parallel mode, some examples do not work. For instance, in the bifurcation example the Euler-Euler case works, but the Euler-Lagrange case does not and shows the following error:

    .
    .
    .
    .
    [prepareGeometry] Prepare Geometry … OK
    [prepareLattice] Prepare Lattice …
    [prepareLattice] Prepare Lattice … OK
    [main] Prepare Particles …
    [main] Prepare Particles … OK
    [bear-pg0105u24b:1466329:0:1466329] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x6)
    ==== backtrace (tid:1466329) ====
    0 0x00000000000211ce ucs_debug_print_backtrace() /dev/shm/edmondac-admin-2021a-EL8/UCX/1.10.0/GCCcore-10.3.0/ucx-1.10.0/src/ucs/debug/debug.c:656
    1 0x0000000000012c20 .annobin_sigaction.c() sigaction.c:0
    2 0x0000000000482f09 olb::OuterVelocityCornerProcessor3D<double, olb::descriptors::D3Q19<>, 1, 1, -1>::process() ???:0
    3 0x00000000004307a8 olb::BlockLattice3D<double, olb::descriptors::D3Q19<> >::initialize() ???:0
    4 0x000000000041b50d setBoundaryValues() ???:0
    5 0x00000000004098e0 main() ???:0
    6 0x0000000000023493 __libc_start_main() ???:0
    7 0x000000000040a41e _start() ???:0
    =================================
    [bear-pg0105u24b:1466329] *** Process received signal ***
    [bear-pg0105u24b:1466329] Signal: Segmentation fault (11)
    [bear-pg0105u24b:1466329] Signal code: (-6)
    [bear-pg0105u24b:1466329] Failing at address: 0x9043000165fd9
    [bear-pg0105u24b:1466329] [ 0] /lib64/libpthread.so.0(+0x12c20)[0x14e898a2ec20]
    [bear-pg0105u24b:1466329] [ 1] ./bifurcation3d[0x482f09]
    [bear-pg0105u24b:1466329] [ 2] ./bifurcation3d[0x4307a8]
    [bear-pg0105u24b:1466329] [ 3] ./bifurcation3d[0x41b50d]
    [bear-pg0105u24b:1466329] [ 4] ./bifurcation3d[0x4098e0]
    [bear-pg0105u24b:1466329] [ 5] /lib64/libc.so.6(__libc_start_main+0xf3)[0x14e89957f493]
    [bear-pg0105u24b:1466329] [ 6] ./bifurcation3d[0x40a41e]
    [bear-pg0105u24b:1466329] *** End of error message ***
    --------------------------------------------------------------------------
    Primary job terminated normally, but 1 process returned
    a non-zero exit code. Per user-direction, the job has been aborted.
    --------------------------------------------------------------------------
    --------------------------------------------------------------------------
    mpirun noticed that process rank 0 with PID 1466329 on node bear-pg0105u24b exited on signal 11 (Segmentation fault).
    --------------------------------------------------------------------------

    Also, when I run my own case in both parallel and non-parallel mode, it stops after one timestep and shows:

    Unkown material number
    --------------------------------------------------------------------------
    Primary job terminated normally, but 1 process returned
    a non-zero exit code. Per user-direction, the job has been aborted.
    --------------------------------------------------------------------------
    --------------------------------------------------------------------------
    mpirun detected that one or more processes exited with non-zero status, thus causing
    the job to be terminated. The first process to do so was:

    Process name: [[39073,1],15]
    Exit code: 255

    But the same code works very well on my personal PC (Ubuntu).

    Could you please guide me on how to deal with this issue?
    Thank you
    Regards
    Ananda

    #6295
    Anand
    Participant

    Issue resolved.

    #6296
    Adrian
    Keymaster

    How did you resolve it?

    We discussed this very issue quite some time ago in our internal issue tracker and implemented a general fix (however one can also often use e.g. a slightly different resolution as a workaround for specific examples). This is one of many fixes that will be included in the upcoming release but we may be able to provide a patch sooner if required.
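
    For specific examples such as bifurcation3d, the workaround amounts to rebuilding with a slightly different resolution. A minimal sketch, assuming the example exposes its resolution as a constant near the top of the .cpp file (the exact name and shipped value may differ in your copy):

    // hypothetical: nudge the resolution constant in the example's .cpp
    const int N = 20;   // one step away from the shipped value is often enough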

    #6299
    Anand
    Participant

    Hi Adrian,
    I am running the simulations on a cluster.
    Our cluster provides a module with OpenLB installed.
    After that, I unzipped the software on my network drive and changed config.mk to run the simulation in parallel mode (activated MPI).
    Then I created a folder "examples/turbulent/Pipe", where I put my .cpp, .stl, Makefile, module.mk and .sh files.

    My .sh file is as follows:

    #!/bin/bash
    #SBATCH --qos bbdefault
    #SBATCH --ntasks 72
    #SBATCH --nodes 1
    #SBATCH --time 72:00:00
    #SBATCH --mail-type ALL
    #SBATCH --account=accountname
    module purge; module load bluebear
    module load OpenLB/1.4-0-foss-2021a
    module load gnuplot/5.4.1-GCCcore-10.2.0
    export OMP_NUM_THREADS=1
    make clean
    make cleanbuild
    make
    mpiexec -np 72 ./Pipe

    I just submit it and it works.

    I have noticed some issues on the cluster. For example:
    If I use the following line in the code, my simulation diverges, but the same line works fine on my computer.
    "CirclePowerLawTurbulent3D<T> uF( superGeometry, 3, maxVelocity[0], 7, 0.05, T(0) );"

    So I just replaced this line with the following lines, and it worked very well on the cluster.

    std::vector<T> origin = { pipelength, piperadius, piperadius};
    std::vector<T> axis = { 1, 0, 0 };
    CirclePowerLawTurbulent3D<T> POu(origin, axis, piperadius, maxVelocity[0], 7, 0.05, T(1));

    Please let me know what you think.
    Thank you
    Regards
    Ananda

    #6334
    Adrian
    Keymaster

    Thanks for the update.

    With respect to the CirclePowerLawTurbulent3D functor usage: assuming that the variable piperadius actually contains a radius, the sequence of arguments seems to be mixed up:

    
    CirclePowerLawTurbulent3D(std::vector<T> axisPoint_, std::vector<T> axisDirection,  T maxVelocity, T radius, T n = 7, T turbulenceIntensity = 0.05, T scale = T(1));
    

    Be aware that this functor uses random numbers internally, which may be a source of platform-specific problems.
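
    Following that signature, a call matching the variables from your snippet would look like this (a sketch only, assuming piperadius holds the radius and origin/axis are defined as above):

    // maxVelocity and radius swapped into the documented order
    CirclePowerLawTurbulent3D<T> POu( origin, axis, maxVelocity[0], piperadius, 7, 0.05, T(1) );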

    #6336
    Anand
    Participant

    Dear Adrian,
    Thank you for your reply.
    I noticed this mistake.
    Now, everything is working properly on a cluster.

    Regards
    Ananda

    #6409
    jflorezgi
    Participant

    Hi Adrian, I'm working on thermal indoor applications with the OpenLB libraries, but I have issues with MPI runs on a cluster. On my personal computer I have no problems loading the checkpoint files, even when running in parallel mode, but on the cluster it generates the following error:

    [prepareGeometry] Prepare Geometry … OK
    [prepareLattice] Prepare Lattice …
    [prepareLattice] Prepare Lattice … OK
    [theclimatebox-ubuntu5:08595] *** Process received signal ***
    [theclimatebox-ubuntu5:08595] Signal: Segmentation fault (11)
    [theclimatebox-ubuntu5:08595] Signal code: (128)
    [theclimatebox-ubuntu5:08595] Failing at address: (nil)
    [theclimatebox-ubuntu5:08595] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x430c0)[0x7fe8d349c0c0]
    [theclimatebox-ubuntu5:08595] [ 1] ./challenge2022-3DTurb(+0x7d612)[0x561067232612]
    [theclimatebox-ubuntu5:08595] [ 2] ./challenge2022-3DTurb(+0x86bc6)[0x56106723bbc6]
    [theclimatebox-ubuntu5:08595] [ 3] ./challenge2022-3DTurb(+0x2b05d)[0x5610671e005d]
    [theclimatebox-ubuntu5:08595] [ 4] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7fe8d347d0b3]
    [theclimatebox-ubuntu5:08595] [ 5] ./challenge2022-3DTurb(+0x2b58e)[0x5610671e058e]
    [theclimatebox-ubuntu5:08595] *** End of error message ***
    --------------------------------------------------------------------------
    Primary job terminated normally, but 1 process returned
    a non-zero exit code. Per user-direction, the job has been aborted.
    --------------------------------------------------------------------------
    --------------------------------------------------------------------------
    mpirun noticed that process rank 24 with PID 0 on node theclimatebox-ubuntu5 exited on signal 11 (Segmentation fault).

    As you say, I think it is necessary to include a patch as soon as possible. In my particular case I am working on a server that restarts every 24 hours, so this functionality is vital for my work.

    Thank you for your attention, I will be waiting for your answer.

    #6411
    Adrian
    Keymaster

    Can you provide a backtrace / describe in more detail what you are doing (w.r.t. checkpoint loading)? This sounds like a different issue than the one discussed previously in this thread.

    In any case, the next release 1.5 is finally just around the corner (planned for ~middle of this month, currently doing the last merges and testing)

    #6412
    jflorezgi
    Participant

    Hi Adrian, thank you for your reply. I'm working on a simulation intended to mimic a violent expiratory event resembling a mild cough, so for now I have two coupled lattices (NSLattice and ADLattice) to evolve the flow and temperature fields. I'm using the D3Q19 descriptor for the NSLattice and D3Q7 for the ADLattice; SmagorinskyForceMRTDynamics and SmagorinskyMRTDynamics are used for the respective dynamics, and I'm using SmagorinskyBoussinesqCouplingGenerator3D to couple the two lattices, calculating the buoyancy force (Boussinesq approximation) in the NSLattice and passing the convective velocity to the ADLattice.

    The code to load the checkpoint just after the prepareLattice function call is:

    // === 4th Step: Main Loop with Timer ===
    std::size_t iT = 0;
    Timer<T> timer( converter.getLatticeTime( maxPhysT ), superGeometry.getStatistics().getNvoxel() );

    // checks whether there is already data of the fluid from an earlier calculation
    if ( !(NSLattice.load("NSChallenge2022Coupled.checkpoint"))){

    // if there is no data available, it is generated
    timer.start();

    for ( ; iT <= converter.getLatticeTime( maxPhysT ); ++iT ) {

    // === 5th Step: Definition of Initial and Boundary Conditions ===
    setBoundaryValues( converter, NSLattice, ADLattice, superGeometry, iT );

    // === 6th Step: Collide and Stream Execution ===
    NSLattice.collideAndStream();
    ADLattice.collideAndStream();

    NSLattice.executeCoupling();

    // === 7th Step: Computation and Output of the Results ===
    getResults( NSLattice, ADLattice, cuboidGeometry, converter, iT, superGeometry, timer, file );
    }

    timer.stop();
    timer.printSummary();

    delete bulkDynamics;
    delete TbulkDynamics;
    }

    // if there exists already data of the fluid from an earlier calculation, this is used
    else{
    NSLattice.load("NSChallenge2022Coupled.checkpoint");
    ADLattice.load("ADChallenge2022Coupled.checkpoint");
    NSLattice.postLoad();
    ADLattice.postLoad();

    iT = lastCPTime;
    timer.update(iT);

    for ( ; iT <= converter.getLatticeTime( maxPhysT ); ++iT ) {

    // === 5th Step: Definition of Initial and Boundary Conditions ===
    setBoundaryValues( converter, NSLattice, ADLattice, superGeometry, iT );

    // === 6th Step: Collide and Stream Execution ===
    NSLattice.collideAndStream();
    ADLattice.collideAndStream();
    NSLattice.executeCoupling();

    // === 7th Step: Computation and Output of the Results ===
    getResults( NSLattice, ADLattice, cuboidGeometry, converter, iT, superGeometry, timer, file );
    }

    timer.stop();
    timer.printSummary();

    delete bulkDynamics;
    delete TbulkDynamics;
    }

    The checkpoint save is called in the getResults function. I'm glad to know that the next release is coming soon. I'm trying to build the outflow boundary conditions (M. Junk and Z. Yang (2008)) for open boundaries, but I have some questions; I think it is better to open a new thread on that topic. Finally, I want to know whether the next release makes it possible to use GPU parallelization.

    Thanks for your help.

    #6413
    Adrian
    Keymaster

    Yes, amongst many other new features and improvements OpenLB 1.5 will include support for Nvidia GPUs in addition to vectorization on CPUs.

    As for the checkpointing issue I suspect that the process config / assignment of blocks to processes is not the same between runs. Unfortunately this is a restriction of OpenLB’s current serialization system but in any case it should not segfault (and doesn’t when I just tested it using the pre-release version). Can you save and load a snapshot during the same HPC job to confirm that this is the issue?
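
    For instance, a minimal sketch of such a round trip within one job, reusing the file names from your snippet (the save interval is purely illustrative):

    // in getResults(): periodically write a snapshot
    if ( iT > 0 && iT % converter.getLatticeTime( 10.0 ) == 0 ) {
      NSLattice.save( "NSChallenge2022Coupled.checkpoint" );
      ADLattice.save( "ADChallenge2022Coupled.checkpoint" );
    }

    // later, still inside the same job and with the same number of processes:
    NSLattice.load( "NSChallenge2022Coupled.checkpoint" );
    ADLattice.load( "ADChallenge2022Coupled.checkpoint" );
    NSLattice.postLoad();
    ADLattice.postLoad();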

    #6414
    jflorezgi
    Participant

    Yes, I have been doing some tests and noticed that if I run the simulation with more than four processes, the segmentation fault appears after loading the data from the saved files; it doesn't matter whether I'm running on the cluster or on my computer. The loading call works with four or fewer processes. I don't know how to deal with this problem; if you have any ideas I'd appreciate it.

    Thanks in advance

    #6494
    mathias
    Keymaster

    Can you check with olb 1.5 now?

    #6505
    jflorezgi
    Participant

    Hi Mathias, I have been doing some tests to load the saved files, but so far I haven't been able to load them properly, running either serially or in parallel (MPI). The output generates an error like the following:

    [prepareGeometry] Prepare Geometry … OK
    [prepareLattice] defining dynamics
    [prepareLattice] Prepare Lattice … OK
    [LAPTOP-H5JAJF81:01813] *** Process received signal ***
    [LAPTOP-H5JAJF81:01813] Signal: Segmentation fault (11)
    [LAPTOP-H5JAJF81:01813] Signal code: Address not mapped (1)
    [LAPTOP-H5JAJF81:01813] Failing at address: 0x1a8
    [LAPTOP-H5JAJF81:01813] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x46210)[0x7fba20e16210]
    [LAPTOP-H5JAJF81:01813] [ 1] ./prueba2D(+0x5b150)[0x7fba21525150]
    [LAPTOP-H5JAJF81:01813] [ 2] ./prueba2D(+0x70dec)[0x7fba2153adec]
    [LAPTOP-H5JAJF81:01813] [ 3] ./prueba2D(+0x2e773)[0x7fba214f8773]
    [LAPTOP-H5JAJF81:01813] [ 4] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7fba20df70b3]
    [LAPTOP-H5JAJF81:01813] [ 5] ./prueba2D(+0x2ebfe)[0x7fba214f8bfe]
    [LAPTOP-H5JAJF81:01813] *** End of error message ***
    --------------------------------------------------------------------------
    Primary job terminated normally, but 1 process returned
    a non-zero exit code. Per user-direction, the job has been aborted.
    --------------------------------------------------------------------------
    [LAPTOP-H5JAJF81:01812] *** Process received signal ***
    [LAPTOP-H5JAJF81:01812] Signal: Segmentation fault (11)
    [LAPTOP-H5JAJF81:01812] Signal code: Address not mapped (1)
    [LAPTOP-H5JAJF81:01812] Failing at address: 0x1a8
    [LAPTOP-H5JAJF81:01812] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x46210)[0x7f87fe5a6210]
    [LAPTOP-H5JAJF81:01812] [ 1] ./prueba2D(+0x5b150)[0x7f87fecaf150]
    [LAPTOP-H5JAJF81:01812] [ 2] ./prueba2D(+0x70dec)[0x7f87fecc4dec]
    [LAPTOP-H5JAJF81:01812] [ 3] ./prueba2D(+0x2e773)[0x7f87fec82773]
    [LAPTOP-H5JAJF81:01812] [ 4] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7f87fe5870b3]
    [LAPTOP-H5JAJF81:01812] [ 5] ./prueba2D(+0x2ebfe)[0x7f87fec82bfe]
    [LAPTOP-H5JAJF81:01812] *** End of error message ***
    --------------------------------------------------------------------------
    mpirun noticed that process rank 2 with PID 0 on node LAPTOP-H5JAJF81 exited on signal 11 (Segmentation fault).

    I don't know if I'm calling the load function properly, so I'll leave you part of the code; I'd appreciate it if you could check this part:

    NSlattice.addLatticeCoupling(coupling, ADlattice);

    prepareLattice(converter, NSlattice, ADlattice, superGeometry);

    /// === 4th Step: Main Loop with Timer ===
    std::size_t iT = 0;
    util::Timer<T> timer(converter.getLatticeTime(maxPhysT), superGeometry.getStatistics().getNvoxel() );
    util::ValueTracer<T> converge(converter.getLatticeTime(0.01),epsilon);

    // checks whether there is already data of the fluid from an earlier calculation
    if ( !(NSlattice.load("NSprueba2DCoupled"))){
    // if there is no data available, it is generated
    timer.start();

    for ( ; iT < converter.getLatticeTime(maxPhysT); ++iT) {

    if (converge.hasConverged()) {
    clout << "Simulation converged." << std::endl;
    getResults(converter, NSlattice, ADlattice, iT, superGeometry, timer, file, converge.hasConverged());

    clout << "Time " << iT << "." << std::endl;

    break;
    }

    /// === 5th Step: Definition of Initial and Boundary Conditions ===
    setBoundaryValues(converter, NSlattice, ADlattice, iT, superGeometry);

    /// === 6th Step: Collide and Stream Execution ===
    ADlattice.collideAndStream();
    NSlattice.collideAndStream();
    NSlattice.executeCoupling();

    /// === 7th Step: Computation and Output of the Results ===
    getResults(converter, NSlattice, ADlattice, iT, superGeometry, timer, file, converge.hasConverged());
    converge.takeValue(ADlattice.getStatistics().getAverageEnergy(),true);
    }
    timer.stop();
    timer.printSummary();
    }
    // if there exists already data of the fluid from an earlier calculation, this is used
    else{
    NSlattice.load("NSprueba2DCoupled");
    ADlattice.load("ADprueba2DCoupled");
    NSlattice.postLoad();
    ADlattice.postLoad();

    iT = lastCPTime;
    timer.update(iT);

    for ( ; iT < converter.getLatticeTime(maxPhysT); ++iT) {

    if (converge.hasConverged()) {
    clout << "Simulation converged." << std::endl;
    getResults(converter, NSlattice, ADlattice, iT, superGeometry, timer, file, converge.hasConverged());

    clout << "Time " << iT << "." << std::endl;

    break;
    }

    /// === 5th Step: Definition of Initial and Boundary Conditions ===
    setBoundaryValues(converter, NSlattice, ADlattice, iT, superGeometry);

    /// === 6th Step: Collide and Stream Execution ===
    ADlattice.collideAndStream();
    NSlattice.collideAndStream();
    NSlattice.executeCoupling();

    /// === 7th Step: Computation and Output of the Results ===
    getResults(converter, NSlattice, ADlattice, iT, superGeometry, timer, file, converge.hasConverged());
    converge.takeValue(ADlattice.getStatistics().getAverageEnergy(),true);
    }
    timer.stop();
    timer.printSummary();
    }

    #6520
    mathias
    Keymaster

    Can you check our examples, e.g. bstep2d / bstep3d?

    #6524
    jflorezgi
    Participant

    Hi Mathias, I tested the bstep2d and rayleighBenard2d examples, and these load the saved checkpoint files correctly, so I reworked my code using the second example as a template and now it is working properly. I think the problem was related to the way I defined the lastCPTime variable, but I'm not sure.
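
    For reference, a minimal sketch of one way to persist lastCPTime next to the checkpoint files so a restart resumes at the right timestep (the file name and plain-text format are illustrative, not part of OpenLB; requires <fstream>):

    // when writing the checkpoint in getResults(), also record the iteration
    std::ofstream cpTimeOut( "lastCPTime.dat" );
    cpTimeOut << iT;

    // on restart, read it back before entering the main loop
    std::size_t lastCPTime = 0;
    std::ifstream cpTimeIn( "lastCPTime.dat" );
    if ( cpTimeIn ) {
      cpTimeIn >> lastCPTime;
    }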

    Thanks for your help.
