
Multi GPUs Calculation

  • #7626
    Yuji
    Participant

    Sorry again. Additionally, I don't understand SuperLattice::setProcessingContext. How does it work?
    I looked it up in the Doxygen documentation, but I could not understand it. Sorry.

    #7628
    Adrian
    Keymaster

    You mean why mpirun starts the program even if MPI support is not enabled during compilation? If so, this is because mpirun basically only starts the given executable N times and provides each instance with the environment required for communication – whether this is actually utilized is up to the individual program and is not verified by mpirun.

    The communication code is primarily contained in src/communication and to some degree in src/core/platform. That you cannot easily see it is rather the point of the abstraction, as otherwise each example case would have to duplicate its own version of the communication code, which goes against the idea of building a framework. (If I understand your question correctly.)

    1) Yes, you should never need to directly use MPI commands when using OpenLB as a simulation framework utilizing the existing feature set. (Ignoring that the number of ranks is passed to the cuboid geometry during construction)

    2) Yes

    3) No, same as 1 and 2. From the user side there is no MPI-related code difference between CPU-only and (multi)GPU mode.

    #7629
    Adrian
    Keymaster

    SuperLattice::setProcessingContext is only a shorthand for triggering the synchronization between the host side (i.e. the CPU) and the device side (i.e. the GPU). You only see it in the cases where e.g. time-dependent boundary conditions are updated or the results are written out as VTK files.
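
    A minimal sketch of that pattern (an excerpt, not a complete case; setBoundaryValues, getResults, vtkIter and maxPhysT stand for the usual per-case helpers and parameters and are assumptions here, not code from this thread):

    for (std::size_t iT = 0; iT <= converter.getLatticeTime(maxPhysT); ++iT) {
      // Update time-dependent boundary data on the host ...
      setBoundaryValues(sLattice, converter, iT, superGeometry);
      // ... then push the changed host data to the device so the next
      // collide-and-stream steps see the new values.
      sLattice.setProcessingContext(ProcessingContext::Simulation);

      sLattice.collideAndStream();

      if (iT % vtkIter == 0) {
        // Pull device data back to the host before evaluating or writing VTK.
        sLattice.setProcessingContext(ProcessingContext::Evaluation);
        getResults(sLattice, converter, iT, superGeometry, timer);
      }
    }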

    #7631
    Yuji
    Participant

    Thank you for your kind reply.

    To answer your question: no. I meant that the .cpp file examples/laminar/cavity3d/cavity3d.cpp contains no MPI API calls, yet it can still calculate with MPI (when the MPI parallel mode is set in the Makefile). By “MPI API” I mean calls like the ones below (this is a part of my previous CFD code, written in Fortran 90):
    ——————————————————————-
    #ifdef MPI
    call mpi_comm_rank(icomm_xi,mpi_local_rank_xi,ierror)
    call mpi_comm_size(icomm_xi,mpi_local_size_xi,ierror)
    icomm_et=mpi_et_comm
    call mpi_comm_rank(icomm_et,mpi_local_rank_et,ierror)
    call mpi_comm_size(icomm_et,mpi_local_size_et,ierror)
    icomm_zt=mpi_zt_comm
    call mpi_comm_rank(icomm_zt,mpi_local_rank_zt,ierror)
    call mpi_comm_size(icomm_zt,mpi_local_size_zt,ierror)
    #else
    mpi_local_rank_xi = 0
    mpi_local_rank_et = 0
    mpi_local_rank_zt = 0
    #endif
    —————————-
    In my understanding, when I used MPI before, I always needed to put MPI instructions directly in my code. In OpenLB, however, the MPI instructions are written in src/core/platform and src/communication, and we can use them through “olb3D.h” and “olb3D.hh”. Is my understanding right?

    Regarding your answer (“3) No, same as 1 and 2. From the user side there is no MPI-related code difference between CPU-only and (multi)GPU mode.”), I am confused because there are MPI API calls written in examples/laminar/cavity3dBenchmark.
    If I delete the MPI API calls, will this code still work well with GPUs?

    Please give me some time to understand the OpenLB code more deeply, because I don't understand the point yet. I want to understand the code structure. I think a debugger (LLDB or GDB) in VS Code can show the flow of the code. How does the OpenLB team go about understanding the code structure?
    As for parallel computing, I notice that I lack knowledge there. Are there any recommended web pages, books, or papers for understanding basic parallel computing?

    #7632
    Yuji
    Participant

    In terms of SuperLattice::setProcessingContext, what is the difference between sLattice.setProcessingContext<Array<momenta::FixedVelocityMomentumGeneric::VELOCITY>>(ProcessingContext::Simulation); and sLattice.setProcessingContext(ProcessingContext::Simulation);?
    I would like to set the value of the velocity at the inlet and the value of the pressure (or rho) at the outlet.

    #7634
    Adrian
    Keymaster

    Yes, OpenLB of course uses MPI API calls internally, but they are mostly hidden behind more convenient abstractions. Otherwise one would have to copy-and-paste the same communication (and spatial decomposition, load balancing, setup, …) code into each new simulation case. Making this more convenient and easier to maintain is one of the main points of having a software library for LBM. (In the same way, the MPI software library abstracts the details of how to efficiently communicate messages across diverse hardware.)

    The calls to our MPI wrapper functions in cavity3dBenchmark are only used to A) instantiate the cuboid geometry with the appropriate number of blocks (same as in all example cases) and B) provide the user with output on the degree of parallelization for benchmark logging. A sketch of A) follows below.
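
    For A), the pattern in the example cases looks roughly like this sketch (paraphrased, not copied from cavity3dBenchmark; “cavity” stands for whatever indicator functor the case uses to describe its domain):

    #ifdef PARALLEL_MODE_MPI
      // One cuboid (block) per MPI rank, so each rank gets a piece of the domain.
      const int noOfCuboids = singleton::mpi().getSize();
    #else
      // Sequential fallback: a single block.
      const int noOfCuboids = 1;
    #endif
      CuboidGeometry3D<T> cuboidGeometry(cavity, converter.getConversionFactorLength(),
                                         noOfCuboids);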

    If you delete the MPI API usage it will still work for single GPUs / CPUs, just not for multiple of them. (This refers to all MPI usage in OpenLB, not just the logging in the cavity3d benchmark case.)

    In general you do not and should not need to understand the entire library structure in order to use it to set up your own cases. I would recommend starting by modifying an existing case for your application (or continuing with the case we set up together at the spring school); you will learn the required internals on the way. The point of these abstractions is exactly that: abstraction. This way you can focus on your problem instead of the data structures, how to efficiently parallelize execution, and so on.

    For parallel computing basics I would suggest checking out an introductory university course; every computer science department should offer at least one, and there are likely many recordings available on the net, e.g. from MIT (just quickly searched on the web).

    The templated setProcessingContext call only synchronizes the given field. This is a performance optimization. Until you gain more experience I would recommend synchronizing all data at the points where synchronization is necessary (the non-templated version of the function).
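
    To illustrate, using the two calls quoted above:

    // Synchronizes only the VELOCITY field of the FixedVelocityMomentumGeneric
    // momenta: cheaper, but you need to know exactly which fields were modified.
    sLattice.setProcessingContext<
      Array<momenta::FixedVelocityMomentumGeneric::VELOCITY>>(
        ProcessingContext::Simulation);

    // Synchronizes all lattice data between host and device: the simpler,
    // recommended starting point.
    sLattice.setProcessingContext(ProcessingContext::Simulation);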

    #7638
    Yuji
    Participant

    Dear Adrian,

    I appreciate your kind answer.
    Let me check it, and please give me some time to understand it.

    As to your answers (“3) No, same as 1 and 2. From the user side there is no MPI-related code difference between CPU-only and (multi)GPU mode.” and “If you delete MPI API usage it will still work for single GPUs / CPUs, just not multiple of them. (referring to all MPI usage in OpenLB not just the logging in the cavity3d benchmark case)”): if I want to use multiple GPUs, do I need to write MPI wrapper functions like in examples/laminar/cavity3dBenchmark? Is my understanding right?

    This is why I am confused. According to your answer “3) No, same as 1 and 2. From the user side there is no MPI-related code difference between CPU-only and (multi)GPU mode.”, I do not need to write MPI wrapper functions: if PARALLEL_MODE is set to MPI in config.mk, the case runs on multiple GPUs even though no MPI wrapper is written in the .cpp file.
    However, according to “If you delete MPI API usage it will still work for single GPUs / CPUs, just not multiple of them. (referring to all MPI usage in OpenLB not just the logging in the cavity3d benchmark case)”, I do have to write MPI wrappers in the .cpp file.
    Which of these answers is right?
    Sorry, I am still confused…

    #7640
    Adrian
    Keymaster

    No, you do not need to add or change anything in order to use any kind of parallel processing in OpenLB (be it CPU-only, Vectorization, OpenMP, Single GPU, Multiple GPU, whatever).

    OpenLB contains everything that is needed to transparently parallelize LBM execution. You do not need to write any new wrappers or change anything at all to go from a sequentially executed case to a parallel one.

    This is one of the main points of why there is an OpenLB library in the first place. You can fully focus on your model implementation. Efficient parallelization is already implemented and done for you.

    #7672
    Yuji
    Participant

    Dear Adrian,

    Thank you for your reply.

    I’m catching on.

    You said that “No, you do not need to add or change anything in order to use any kind of parallel processing in OpenLB”. In examples/laminar/cavity3dBenchmark, why is an MPI API call such as gpu::cuda::device::synchronize(); used then? Following your helpful comments, I think it should not be necessary.
    I apologize that my comprehension is lacking.

    #7675
    Adrian
    Keymaster

    gpu::cuda::device::synchronize is not an MPI function (wrapper). It only synchronizes the default CUDA execution stream with the host-side (CPU) execution.

    You may still be correct that this is not needed anymore in the current version of the code, but that is separate from any MPI concerns. The execution is synchronized implicitly on any processing context switch in any case.

    #7685
    Yuji
    Participant

    Dear Adrian,

    Thank you for your very kind reply. I am gradually getting to understand it.

    I would like to run an LBM calculation with 2.5*10^11 grid points (I know this means a very large memory consumption).
    I reckon “gpu_hybrid_mixed.mk” is useful for my case, because when compiled with gpu_hybrid_mixed.mk both the CPU and the GPU are used for the LBM calculation.
    Currently I have two GPUs and my CPU has 8 cores.
    In my understanding, the command “mpirun -np 2 ./cavity3d” uses 2 CPU cores and 1 GPU (GPU No. 1 is not used), while the command “mpirun -np 2 bash -c 'export CUDA_VISIBLE_DEVICES=${OMPI_COMM_WORLD_LOCAL_RANK}; ./cavity3d'” uses 2 GPUs (GPU No. 0 and No. 1) but the CPU is not used for the LBM calculation.
    I would like to use 2 GPUs and 8 CPU cores. Do you have any recommended commands? And is my understanding correct?
    Thank you.
    Yuji

    #7687
    Adrian
    Keymaster

    Indeed, 2.5e11 cells is an extremely large problem. You will need about 19 TB of GPU memory for the lattice alone (without any overhead), so you will need access to a large fraction of a large GPU cluster for this.
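
    For reference, a back-of-the-envelope version of that estimate (assuming a D3Q19 lattice stored in single precision; these assumptions are mine, not stated above):

    #include <cstdio>

    int main() {
      const double cells       = 2.5e11; // requested grid points
      const double populations = 19;     // D3Q19 velocity set
      const double bytesPerPop = 4;      // single precision (float)
      // 2.5e11 * 19 * 4 bytes = 1.9e13 bytes, i.e. roughly 19 TB
      std::printf("%.1f TB\n", cells * populations * bytesPerPop / 1e12);
    }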

    Heterogeneous processing (dividing the lattice between CPUs and GPUs) is supported by OpenLB but load balancing in this context is a work in progress (I actually talked about this last week at the DSFD conference).

    I can aid you in getting the heterogeneous setup running, but whether this is an advantage for you heavily depends on the specific application. I would recommend focusing on multi-GPU-only execution for now (with a problem size orders of magnitude smaller for local workstation usage).
