Heterogeneous Load Balancing

Heterogeneous Load Balancing in OpenLB: Cooperatively Utilizing CPUs and GPUs for a Turbulent Mixing Simulation

Following up on the previously showcased turbulent micromixer simulation, the present video illustrates OpenLB’s heterogeneous computation capabilities.

The performance of the simulation case improves by up to 87% with heterogeneous CPU-GPU execution compared to GPU-only execution. This is achieved by distributing the two computationally expensive turbulent inlet regions onto CPUs, while the comparatively cheaper bulk regions are processed on GPUs. The underlying inhomogeneous spatial domain decomposition was obtained using a novel genetic algorithm for cost-aware optimization.
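To illustrate why moving expensive regions off the GPU can pay off, the following minimal sketch models a cost-aware assignment of blocks to devices. It is not OpenLB's implementation and not the genetic algorithm from the preprint: it uses an exhaustive search over a handful of blocks, and all block names and per-step costs are hypothetical values chosen only for illustration.

```python
from itertools import product

# Hypothetical per-step cost of each block on each device (arbitrary time units).
# Assumption: inlet blocks are costly everywhere but relatively cheaper on the CPU,
# bulk blocks are far cheaper on the GPU. These numbers are made up.
BLOCKS = [
    {"name": "inlet_a", "cpu": 3.0, "gpu": 4.0},
    {"name": "inlet_b", "cpu": 3.0, "gpu": 4.0},
    {"name": "bulk_0", "cpu": 8.0, "gpu": 1.0},
    {"name": "bulk_1", "cpu": 8.0, "gpu": 1.0},
    {"name": "bulk_2", "cpu": 8.0, "gpu": 1.0},
    {"name": "bulk_3", "cpu": 8.0, "gpu": 1.0},
]
DEVICES = ("cpu", "gpu")


def makespan(assignment):
    """Time per step: the busiest device determines it, as devices run concurrently."""
    load = {device: 0.0 for device in DEVICES}
    for block, device in zip(BLOCKS, assignment):
        load[device] += block[device]
    return max(load.values())


def best_assignment():
    """Exhaustively try every block-to-device mapping (feasible only for tiny cases)."""
    candidates = product(DEVICES, repeat=len(BLOCKS))
    return min(candidates, key=makespan)


gpu_only = ("gpu",) * len(BLOCKS)
best = best_assignment()
speedup = makespan(gpu_only) / makespan(best) - 1.0

for block, device in zip(BLOCKS, best):
    print(f"{block['name']:8s} -> {device}")
print(f"GPU-only: {makespan(gpu_only)}, heterogeneous: {makespan(best)}, speedup: {speedup:.0%}")
```

With these made-up costs the search places both inlet blocks on the CPU and all bulk blocks on the GPU. An exhaustive search scales exponentially with the number of blocks, which is one reason a heuristic such as a genetic algorithm becomes attractive for realistic decompositions.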

A single accelerated CPU-GPU node of the HoreKa supercomputer (2x Intel Xeon Platinum 8368, 4x NVIDIA A100) was used for the showcased simulation consisting of 355 million lattice cells.
OpenLB enabled the cooperative use of MPI, OpenMP, AVX-512 vectorization and CUDA, reaching a throughput of ~19.25 billion cell updates per second for the NSE-only case and ~4.79 billion cell updates per second for the fully coupled case.
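As a rough sanity check, the quoted figures can be converted into wall-clock time per time step. The sketch below assumes that the throughput figures are node-wide totals and that every one of the 355 million cells is updated once per step. It also assumes that "improved by 87%" means the heterogeneous throughput is 1.87 times the GPU-only throughput. The function names are illustrative only.

```python
CELLS = 355e6                 # lattice cells in the showcased simulation
THROUGHPUT_NSE = 19.25e9      # cell updates per second, NSE-only case
THROUGHPUT_COUPLED = 4.79e9   # cell updates per second, fully coupled case


def seconds_per_step(cells, updates_per_second):
    """Wall-clock time of one time step, assuming one update per cell per step."""
    return cells / updates_per_second


def relative_runtime(speedup_percent):
    """Runtime relative to the baseline, assuming speedup is a throughput ratio."""
    return 1.0 / (1.0 + speedup_percent / 100.0)


step_nse = seconds_per_step(CELLS, THROUGHPUT_NSE)
step_coupled = seconds_per_step(CELLS, THROUGHPUT_COUPLED)

print(f"NSE-only:      {step_nse * 1e3:.1f} ms per step")
print(f"Fully coupled: {step_coupled * 1e3:.1f} ms per step")
print(f"Runtime at 87% speedup: {relative_runtime(87):.2f}x of GPU-only")
```

Under these assumptions, one time step takes roughly 18 ms in the NSE-only case and roughly 74 ms in the fully coupled case.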

Simulation setup: Fedor Bukreev
Heterogeneous Load Balancing, Performance engineering, Visualization: Adrian Kummerländer

All further details on the specific load balancing approach are available in our recent preprint:

A. Kummerländer, F. Bukreev, D. Teutscher, M. Dorn, and M.J. Krause. Optimization of Single Node Load Balancing for Lattice Boltzmann Methods on Heterogeneous High Performance Computers.
Preprint, under review (February 2024). DOI:

Abstract: Lattice Boltzmann Methods (LBM) are particularly suited to highly parallel computational fluid dynamics simulations both on SIMD CPUs and GPUs. While heterogeneous systems combining CPUs and GPUs are ubiquitous in high performance computation (HPC), the computationally dominant collide-and-stream loop commonly only utilizes either CPUs or GPUs homogeneously. This article proposes a novel approach utilizing genetic programming for cost-aware optimization of spatial domain decompositions targeting heterogeneous execution environments. The implementation and performance of the genetic algorithm for spatial decomposition, as well as the subsequently derived rank assignment approaches, are discussed in detail. The resulting comprehensive load balancing strategy is implemented in the open source LBM framework OpenLB and applied to turbulent flow reference cases, including a multi-physics reactive mixer benchmark. Evaluation of its computational performance on heterogeneous HPC nodes yields speedups up to 87% compared to homogeneous GPU-only execution.