# Parallelism in DisCoTec

The DisCoTec framework can work with existing MPI-parallelized PDE solver codes
operating on structured grids.
In addition to the parallelism provided by the PDE solver, it adds the combination
technique's parallelism.
This is achieved mainly through *process groups* (pgs):
`MPI_COMM_WORLD` is subdivided into equal-sized process groups
(and optionally, a manager rank).

![schematic of MPI ranks in DisCoTec](../gfx/discotec-ranks.svg)

The image describes the two ways of scaling up:
One can either increase the size or the number of process groups.
Figure originally published in (Pollinger [2024](https://elib.uni-stuttgart.de/handle/11682/14229)).

All component grids are distributed to process groups (either statically, or
dynamically through the manager rank).
Each process group will be responsible for a set of component grids and the
solver instance belonging to each.
Within a process group and each grid, one can observe the exact same domain
decomposition parallelism as when using the solver directly.
(The domain decomposition needs to be the same for all solver instances.)

During a time step, the PDE solver step applies one or multiple time step updates
to the values in each component grid.
During the PDE solver time step and most of the combination step, MPI communication
only happens within the process groups.
Conversely, for the sparse grid reduction using the combination coefficients $c_{\vec{\ell}}^c$,
MPI communication only happens between a rank and its colleagues in the other
process groups, e.g., rank 0 in group 0 will only talk to rank 0 in all other groups.
Thus, major bottlenecks arising from global communication can be avoided altogether.

In practice, the parallelization in DisCoTec is set at runtime, by calling

```cpp
  size_t ngroup = 8;
  size_t nprocs = 32;
  bool withWorldManager = false;
  combigrid::theMPISystem()->initWorldReusable(MPI_COMM_WORLD, ngroup, nprocs, withWorldManager);
```

for example.
If needed, the exact distribution of coordinates to ranks is set in the
`CombiParameters` class.

Combining the two ways of scaling up, DisCoTec's scalability was demonstrated on
several machines, with the experiments comprising up to 524288 cores:

![timings for 6-d advection solver step on HAWK at various
parallelizations](../gfx/times-solver-on-hawk.svg)
![timings for combination step on
HAWK at various parallelizations](../gfx/times-combination-on-hawk.svg)

We see the timings (in seconds) for the solver and combination steps respectively,
for an illustrative (6-D) PDE problem.
This weak scaling experiment used four OpenMP threads per rank, and starts with
one pg of four processes in the upper left corner.
The largest parallelization is 64 pgs of 2048 processes each.
Figure originally published in (Pollinger [2024](https://elib.uni-stuttgart.de/handle/11682/14229)).

This image also describes the challenges in large-scale experiments with DisCoTec:
Generally, the process group size can be chosen so that it matches the solver
and the desired resolutions well.
But if the process groups become too large, the MPI communication of the multiscale
transform starts to dominate the combination time.
Conversely, the number of process groups can be freely chosen, as long as there
is sufficient work for each group to do (one grid per group at a bare minimum).
If there are too many pgs, the combination reduction will dominate the
combination time.
However, the times required for the PDE solver stay relatively constant;
they are determined by the PDE solver's own scaling and the load balancing quality.

There are only few codes that allow weak scaling up to this problem size:
a size that uses most of the available main memory of the entire system.