Tutorial: Using Your Code With DisCoTec
So you want to solve your timestepped PDE problem with the DisCoTec framework. You already have a PDE solver that can do it on a regular structured grid. This Tutorial gives you an outline of the steps required.
Your Code Interface: Init, Timestep, Get/Set Full Grid, Finish
The first step is to prepare your code in a way that it can be called by DisCoTec. Typically, your PDE solver code will look similar to this (in pseudocode):
int num_points_x = ...
int num_points_y = ...
int num_points_z = ...
int num_points_vx = ...
int num_points_vy = ...
int num_points_vz = ...
...
initial_function = ...
grid.initialize(
initial_function,
num_points_x,
num_points_y,
num_points_z,
num_points_vx,
num_points_vy,
num_points_vz,
...
)
helper_data_structures.initialize(...)
float end_time = ...
float time_step = ...
float time_now = 0.0;
while(time_now < end_time) { # time loop
do_timestep(grid, time_step)
time_now += time_step
}
properties = compute_properties(grid)
write_output(properties, grid)
This example assumes that your PDE solver uses three dimensions in both space and velocity.
For use with DisCoTec, we assume nestable power-of-two discretizations, i.e.,
where your grid spacing can be \(2^{-l}\) for \(l \in \mathbb{N}\).
For simplicity, in this tutorial we also assume you use periodic boundary
conditions and all num_points_ are chosen as powers of two.
You need to transform your current solver code into stateful functions, or a
stateful data structure.
Let’s say you introduce a class YourSolver: computations before the time
loop go into its constructor
or an init() function (if your code uses MPI, this is also where a
sub-communicator should be passed for YourSolver to use).
Computation in the time loop goes into its run() function.
Anything after the time loop goes into the destructor or a function
finalize().
Then, the following should do the same:
int num_points_x = ...
int num_points_y = ...
int num_points_z = ...
int num_points_vx = ...
int num_points_vy = ...
int num_points_vz = ...
...
- initial_function = ...
- grid.initialize(
- initial_function,
- num_points_x,
- num_points_y,
- num_points_z,
- num_points_vx,
- num_points_vy,
- num_points_vz,
- ...
- )
- helper_data_structures.initialize(...)
+ my_solver = YourSolver(
+ num_points_x,
+ num_points_y,
+ num_points_z,
+ num_points_vx,
+ num_points_vy,
+ num_points_vz,
+ ...
+ )
float end_time = ...
float time_step = ...
float time_now = 0.0;
while(time_now < end_time) { # time loop
- do_timestep(grid, time_step)
+ my_solver.run(time_step)
time_now += time_step
}
- properties = compute_properties(grid)
- write_output(properties, grid)
+ my_solver.finalize()
This setup assumes that we can pass the number of grid points per dimension in
your high-d contiguous array as arguments.
In addition, you will need an interface that allows DisCoTec to get and set
this data.
The most portable and memory-efficient way of doing so is to provide a pointer
to the beginning of the contiguous array.
Let’s call this getter get_tensor_pointer.
Make Your PDE Solver a DisCoTec Task and Your Grid a DisCoTec DistributedFullGrid
From now on, we assume that the interface to YourSolver is wrapped in a
C++ header YourSolver.h, roughly like this:
#include <vector>
class YourSolver {
public:
YourSolver(int num_points_x, int num_points_y, int num_points_z,
int num_points_vx, int num_points_vy, int num_points_vz,
...);
//...rule of 5...
void run(double time_step);
double* get_tensor_pointer();
void finalize();
};
Now, DisCoTec comes into play. Create a new folder or project that can access
both YourSolver and DisCoTec.
For the combination technique, multiple instances of YourSolver will be
instantiated, each at different resolutions.
The Task class is the interface you will need to implement.
With YourSolver, that will be as simple as:
#include <memory>
// discotec includes, you may have to use a longer path here
#include "discotec/fullgrid/DistributedFullGrid.hpp"
#include "discotec/task/Task.hpp"
#include "discotec/utils/PowerOfTwo.hpp"
#include "discotec/utils/Types.hpp"
#include "YourSolver.h"
// The CombiDataType template parameter defaults to double.
// In this tutorial we use a plain double so the example compiles in isolation.
// In your own code, you can instead alias this to combigrid::CombiDataType
// (from your project's Config.hpp or equivalent) once you include that header.
// DIM is a compile-time constant for the dimensionality (1-6).
using CombiDataType = double;
static constexpr combigrid::DimType DIM = 6;
class YourTask : public combigrid::Task<CombiDataType> {
public:
YourTask(combigrid::LevelVector& l,
const std::vector<combigrid::BoundaryType>& boundary,
combigrid::real coeff,
combigrid::real dt)
: Task<CombiDataType>(DIM, l, boundary, coeff, nullptr, nullptr),
sim_(nullptr), dfg_(nullptr), dt_(dt) {}
virtual ~YourTask(){
if (sim_ != nullptr){
sim_->finalize();
}
}
void init(combigrid::CommunicatorType lcomm,
const std::vector<combigrid::IndexVector>& decomposition =
std::vector<combigrid::IndexVector>()) override {
// DisCoTec assumes Fortran (column-major) ordering,
// if you use C (row-major) ordering, you
// need to assign the levels in reverse order
const auto& l = this->getLevelVector();
int num_points_x = combigrid::powerOfTwoByBitshift(l[0]);
int num_points_y = combigrid::powerOfTwoByBitshift(l[1]);
int num_points_z = combigrid::powerOfTwoByBitshift(l[2]);
int num_points_vx = combigrid::powerOfTwoByBitshift(l[3]);
int num_points_vy = combigrid::powerOfTwoByBitshift(l[4]);
int num_points_vz = combigrid::powerOfTwoByBitshift(l[5]);
// this is the number of MPI processes in each dimension --
// if all are 1, we are only using the parallelism between grids
std::vector<int> p = {1, 1, 1, 1, 1, 1};
// if using MPI within your solver,
// pass p and the lcomm communicator to sim_, too
sim_ =
std::make_unique<YourSolver>(num_points_x, num_points_y, num_points_z,
num_points_vx, num_points_vy, num_points_vz);
// wrap tensor in a DistributedFullGrid
dfg_ = std::make_unique<combigrid::DistributedFullGrid<CombiDataType, DIM>>(
DIM, this->getLevelVector(), lcomm, this->getBoundary(),
sim_->get_tensor_pointer(), p, false, decomposition);
}
void run(combigrid::CommunicatorType lcomm) override { sim_->run(dt_); }
combigrid::DistributedFullGridRef<CombiDataType> getDistributedFullGrid(
size_t n = 0) override {
return *dfg_;
}
void setZero() override { dfg_->setZero(); }
std::unique_ptr<YourSolver> sim_;
std::unique_ptr<combigrid::DistributedFullGrid<CombiDataType, DIM>> dfg_;
double dt_;
};
Note that this also turns your data into a DistributedFullGrid, without
making a copy of the data.
DistributedFullGrid<CombiDataType, DIM> takes two template parameters:
the data type (CombiDataType, typically double) and the dimensionality
DIM as a compile-time constant (between 1 and 6).
The compile-time DIM enables fixed-size std::array types internally
for better performance.
Task<CombiDataType> and ProcessGroupWorker<CombiDataType> are also
templated on the data type (defaulting to double).
DisCoTec just assumes that you pass it a pointer to a correctly-sized
contiguous array.
The size of the whole “global” grid is \(2^{l_d}\), with \(l_d\) the level vector
for each dimension \(d\).
Every MPI process in your solver should then have \(2^{l_d} / p_d\) grid
points, where \(p\) is the Cartesian process vector
containing the number of processes per dimension in each process group.
Make Your MPI Processes Workers
Now, you can use the YourTask class to instantiate many tasks as part of a
combination scheme.
There are a lot of choices to make regarding the combined simulation run,
which is why more than half of the example code is dedicated to defining
parameters and generating input data structures:
#include <string>
#include <vector>
// discotec includes, you may have to use a longer path here
#include "discotec/combischeme/CombiMinMaxScheme.hpp"
#include "discotec/manager/CombiParameters.hpp"
#include "discotec/manager/ProcessGroupWorker.hpp"
#include "discotec/task/Task.hpp"
#include "discotec/utils/Stats.hpp"
#include "discotec/utils/Types.hpp"
// include user task
#include "YourTask.h"
int main(int argc, char** argv) {
static_assert(!combigrid::ENABLE_FT); // no fault tolerance in this example
[[maybe_unused]] auto mpiOnOff = combigrid::MpiOnOff(&argc, &argv); // initialize MPI
// number of process groups and number of processes per group
const size_t ngroup = 8;
const size_t nprocs = 1;
// divide the MPI processes into process group and initialize the
// corresponding communicators
combigrid::theMPISystem()->initWorldReusable(MPI_COMM_WORLD, ngroup, nprocs, false, true);
// input parameters
const combigrid::DimType dim = 6; // six-dimensional problem
const combigrid::LevelVector lmin = {
2, 2, 2,
2, 2, 2}; // minimal level vector for each grid -> have at least 4 points in each dimension
const combigrid::LevelVector lmax = {6, 6, 6, 6, 6, 6}; // maximum level vector -> level vector of target grid
const std::vector<int> p = {
1, 1, 1, 1, 1, 1}; // parallelization of domain (one process per dimension) -> must match nprocs
const std::vector<bool> hierarchizationDims(dim, true); // all dimensions should be hierarchized
const std::vector<combigrid::BoundaryType> boundary(dim,
1); // periodic boundary in every dimension
const combigrid::real dt = 0.01; // time step
const size_t ncombi = 20; // number of combinations
// hierarchical basis functions for intermediate representation
const std::string basis = "biorthogonal_periodic";
// algorithm variant for data exchange between groups
const combigrid::CombinationVariant combinationVariant =
combigrid::CombinationVariant::subspaceReduce;
// generate combination scheme from parameters
std::vector<combigrid::LevelVector> levels; // level vector for each component grid
std::vector<combigrid::real> coeffs; // combination coefficient for each component grid
std::vector<size_t> taskNumbers; // only used in case of static task assignment
{
const auto& pgroupNumber = combigrid::theMPISystem()->getProcessGroupNumber();
auto scheme = combigrid::CombiMinMaxScheme(dim, lmin, lmax);
scheme.createClassicalCombischeme();
size_t totalNumTasks = combigrid::getLoadBalancedLevels(
scheme, pgroupNumber, combigrid::theMPISystem()->getNumGroups(), boundary, levels, coeffs,
taskNumbers);
MASTER_EXCLUSIVE_SECTION {
std::cout << combigrid::getTimeStamp() << " Process group " << pgroupNumber << " will run "
<< levels.size() << " of " << totalNumTasks << " tasks." << std::endl;
}
}
These parameter values shold be suitable for a very small-scale proof-of-concept. The last scope assigns tasks (i.e. your PDE solver instances) to process groups statically.
The remaining part of the code looks a lot like your initial time loop again:
{
// ProcessGroupWorker is also templated on CombiDataType (defaults to double)
combigrid::ProcessGroupWorker<> worker;
// create combiparamters
combigrid::CombiParameters params(dim, lmin, lmax, boundary, levels, coeffs,
hierarchizationDims, taskNumbers, ncombi, 1,
combinationVariant);
setCombiParametersHierarchicalBasesUniform(params, basis);
params.setParallelization(p);
worker.setCombiParameters(std::move(params));
// initialize individual tasks (component grids)
for (size_t i = 0; i < levels.size(); i++) {
worker.initializeTask(
std::unique_ptr<combigrid::Task<>>(new YourTask(levels[i], boundary, coeffs[i], dt)));
}
worker.initCombinedDSGVector();
worker.zeroDsgsData(); // initialize sparse grid data structures
for (size_t i = 0; i < ncombi; ++i) {
// run tasks for next time interval
worker.runAllTasks();
worker.combineAtOnce();
}
MIDDLE_PROCESS_EXCLUSIVE_SECTION {
std::cout << combigrid::getTimeStamp() << " Performed " << ncombi << " combinations."
<< std::endl;
}
}
return 0;
}
Instantiating a ProcessGroupWorker at the beginning of the scope sets up some
data structures, in particular a TaskWorker and a SparseGridWorker.
Coming back to the parallelism in DisCoTec:
The TaskWorker deals with the task execution and other within-group
operations (vertical in the graphics), while the SparseGridWorker handles the
operations across groups, such as the reduction in the combination
(horizontal in the graphics).
The ProcessGroupWorker bundles the two and adds abstracts functions required
for further control hierarchies.
If you want to add further functionality to DisCoTec, this would be a typical
extension point.
Otherwise, it’s just where functions are encapsulated for you to use!
Combine
Compile the code using your familiar build tools. MPI and DisCoTec should be the only immediate dependencies.
$ mpiexec -n 8 ./tutorial_example
MPI: 8 groups with 1 ranks each with 1 threads each and 1-fold nested parallelism; without world manager
[20240912T133902] Process group 5 will run 5 of 35 tasks.
[20240912T133902] Process group 7 will run 4 of 35 tasks.
[20240912T133902] Process group 4 will run 5 of 35 tasks.
[20240912T133902] Process group 6 will run 5 of 35 tasks.
[20240912T133902] Process group 0 will run 4 of 35 tasks.
[20240912T133902] Process group 2 will run 4 of 35 tasks.
[20240912T133902] Process group 3 will run 4 of 35 tasks.
[20240912T133902] Process group 1 will run 4 of 35 tasks.
Found max. 7 different communicators for subspaces
[20240912T133902] Performed 20 combinations.
Depending on your use case, now you need to evaluate: are the results “good” (in your own metric)? You can evaluate this, for instance, by Monte-Carlo integration of the error compared to a reference solution (example here), or by combining output files as a postprocessing step (example here).
This is your starting point for playing with the combination technique:
How do the minimum and maximum resolution affect the results?
Can you see a difference when using different hierarchical basis functions?
Can you do multiple time steps per combination? Do you have to combine at all to get the required accuracy?
If you need to go to a larger scale or want to further customize the combination behavior, keep reading on about advanced topics.