Template Class ProcessManager

Class Documentation

template<typename CombiDataType = double>
class ProcessManager

The ProcessManager class orchestrates the whole simulation when DisCoTec is run with a manager-worker scheme.

It should only be instantiated once, on the manager rank, and it holds one ProcessGroupManager instance for communication with each of the process groups.

All other ranks should instantiate a ProcessGroupWorker and call wait() on the worker, until an exit signal is received.
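The division of labour described above can be illustrated with a toy, single-process model. This is only a sketch under stated assumptions: `Signal`, `ToyWorker`, and `toyManagerSchedule` are hypothetical names, not part of DisCoTec, and a `std::queue` stands in for the MPI communication between the manager and a worker.

```cpp
#include <cassert>
#include <queue>
#include <string>
#include <vector>

// Hypothetical signal set for illustration; DisCoTec's real signals differ.
enum class Signal { RunFirst, Combine, RunNext, Exit };

// Stand-in for a worker rank: handles signals until the exit signal arrives.
struct ToyWorker {
  std::vector<std::string> log;

  // Returns the number of signals handled, including the exit signal.
  int waitUntilExit(std::queue<Signal>& inbox) {
    int handled = 0;
    while (!inbox.empty()) {
      const Signal s = inbox.front();
      inbox.pop();
      ++handled;
      switch (s) {
        case Signal::RunFirst: log.push_back("runfirst"); break;
        case Signal::Combine:  log.push_back("combine");  break;
        case Signal::RunNext:  log.push_back("runnext");  break;
        case Signal::Exit:     log.push_back("exit");     return handled;
      }
    }
    return handled;
  }
};

// Stand-in for the manager rank: queues the signals for a whole simulation
// with the given number of combination steps.
std::queue<Signal> toyManagerSchedule(int numCombinations) {
  assert(numCombinations >= 1);
  std::queue<Signal> inbox;
  inbox.push(Signal::RunFirst);
  inbox.push(Signal::Combine);
  for (int i = 1; i < numCombinations; ++i) {
    inbox.push(Signal::RunNext);
    inbox.push(Signal::Combine);
  }
  inbox.push(Signal::Exit);
  return inbox;
}
```

In this model, the manager's order of calls (first run, combine, further runs and combines, exit) fully determines what the worker does; the worker itself contains no control logic beyond reacting to signals.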

Public Functions

inline ProcessManager(ProcessGroupManagerContainer<CombiDataType> &pgroups, TaskContainer<CombiDataType> &instances, CombiParameters &params, std::unique_ptr<LoadModel> loadModel, std::unique_ptr<TaskRescheduler> rescheduler = std::unique_ptr<TaskRescheduler>(new StaticTaskRescheduler{}))

Constructor.

Parameters:
  • pgroups – a vector of ProcessGroupManager instances

  • instances – a vector of Task instances

  • params – the combination parameters

  • loadModel – The load model to use for (re)scheduling

  • rescheduler – The rescheduler to use for dynamic task rescheduling. By default, the static task rescheduler is used, so no rescheduling is performed.

inline void removeGroups(std::vector<int> removeIndices)

Removes the process groups with the given indices from the simulation.

Used for fault tolerance.

bool runfirst(bool doInitDSGUs = true)

signal to run the first combination step, i.e. initialize and run each task

this is where initial load balancing is done: tasks are sorted by expected runtime and assigned to process groups as they become available again

Parameters:

doInitDSGUs – whether to initialize the DSGUs after the first combination step

Returns:

true if no group failed
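The initial load balancing described for runfirst (sort tasks by expected runtime, hand the next task to whichever group becomes free first) can be sketched as a greedy schedule. The function name `assignTasksGreedy` and the use of plain runtimes as input are assumptions for illustration; in DisCoTec the estimates would come from the LoadModel, and groups report back at run time rather than being simulated.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Returns assignment[i] = index of the group that task i is given to.
std::vector<std::size_t> assignTasksGreedy(
    const std::vector<double>& expectedRuntime, std::size_t numGroups) {
  assert(numGroups > 0);

  // Task indices, sorted by expected runtime, longest first.
  std::vector<std::size_t> order(expectedRuntime.size());
  for (std::size_t i = 0; i < order.size(); ++i) order[i] = i;
  std::stable_sort(order.begin(), order.end(),
                   [&](std::size_t a, std::size_t b) {
                     return expectedRuntime[a] > expectedRuntime[b];
                   });

  // Min-heap of (time at which the group becomes free, group index).
  using Slot = std::pair<double, std::size_t>;
  std::priority_queue<Slot, std::vector<Slot>, std::greater<Slot>> freeAt;
  for (std::size_t g = 0; g < numGroups; ++g) freeAt.push(Slot(0.0, g));

  std::vector<std::size_t> assignment(expectedRuntime.size());
  for (std::size_t task : order) {
    const Slot slot = freeAt.top();  // group that becomes available first
    freeAt.pop();
    assignment[task] = slot.second;
    freeAt.push(Slot(slot.first + expectedRuntime[task], slot.second));
  }
  return assignment;
}
```

For example, with expected runtimes {1, 8, 3, 4} and two groups, the longest task occupies group 0 alone, while the three shorter tasks are processed one after another by group 1.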

void initDsgus()

signal to initialize the sparse grid data structures on the worker ranks

void exit()

signal to exit the simulation

virtual ~ProcessManager() = default

void waitForAllGroupsToWait() const

wait until all groups have signaled completion of the last signal

bool runnext()

signal to run a combination step after the first one

inline void combine()

signal to combine the results of the tasks, according to the CombiParameters

This function performs the so-called recombination. First, the combination solution will be reduced in the given sparse grid space (first within, then across process groups). Also, the local component grids will be updated from the globally combined solution.
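The order of operations in the recombination (reduce within each group, reduce across groups, then update the component grids) can be sketched on toy data. This is a strong simplification, not DisCoTec's implementation: `ToyGrid`, `ToyGroup`, and `toyCombine` are hypothetical, every grid is assumed to hold the same number of values, and plain loops stand in for the MPI reductions on the sparse grid data structures.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct ToyGrid {
  double coefficient;          // combination coefficient of this component grid
  std::vector<double> values;  // toy stand-in for the grid's coefficients
};

using ToyGroup = std::vector<ToyGrid>;  // all grids held by one process group

// Combines all grids and writes the combined solution back into each grid.
std::vector<double> toyCombine(std::vector<ToyGroup>& groups, std::size_t n) {
  std::vector<double> global(n, 0.0);
  for (const ToyGroup& group : groups) {
    // Stage 1: weighted sum within the process group.
    std::vector<double> partial(n, 0.0);
    for (const ToyGrid& grid : group) {
      assert(grid.values.size() == n);
      for (std::size_t i = 0; i < n; ++i) {
        partial[i] += grid.coefficient * grid.values[i];
      }
    }
    // Stage 2: sum of the group results across process groups.
    for (std::size_t i = 0; i < n; ++i) global[i] += partial[i];
  }
  // Update: every component grid receives the globally combined solution.
  for (ToyGroup& group : groups) {
    for (ToyGrid& grid : group) grid.values = global;
  }
  return global;
}
```

Splitting the reduction into two stages keeps most of the communication inside a process group; only one partial result per group has to cross group boundaries.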

inline void combineThirdLevel()

signal to perform a widely-distributed combination

based on TCP/socket setup with third level manager.

Combination with third level parallelism, e.g. between two HPC systems: the process manager first induces a local and global combination. It then signals readiness to the third level manager, which decides which system sends first and which receives first. All process groups (pgs) that do not participate directly in the third level combination idle in a broadcast function and wait for their update from the third level pg.

Different roles of the manager:

Sender's role: the ProcessGroupManager transfers the dsgus from the workers of the third level pg to the third level manager, which forwards them to the remote system. After sending, it receives the remotely reduced data and sends it back to the third level pg.

Receiver's role: the ProcessGroupManager receives the remote dsgus and reduces them with the local solution. Afterwards, it sends the solution back to the remote system and to the local workers.

inline void combineThirdLevelFileBasedWrite(const std::string &filenamePrefixToWrite, const std::string &writeCompleteTokenFileName)

signal to start a widely-distributed combination

based on file-exchange mechanism (without third level manager)

inline void combineThirdLevelFileBasedReadReduce(const std::string &filenamePrefixToRead, const std::string &startReadingTokenFileName)

signal to reduce the results of a widely-distributed combination

based on file-exchange mechanism (without third level manager)

inline void combineThirdLevelFileBased(const std::string &filenamePrefixToWrite, const std::string &writeCompleteTokenFileName, const std::string &filenamePrefixToRead, const std::string &startReadingTokenFileName)

signal to perform a whole widely-distributed combination

based on file-exchange mechanism (without third level manager); equivalent to calling combineThirdLevelFileBasedWrite and combineThirdLevelFileBasedReadReduce in succession
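The role of the token file names in the file-based variants can be illustrated with a minimal handshake: the payload is written completely before the token appears, and the reader only touches the payload once it sees the token. This is a sketch of that general pattern, not DisCoTec's file format or code; `toyWriteWithToken`, `toyTokenExists`, and `toyReadReduce` are hypothetical helpers, the reduction is assumed to be a plain sum, and only one direction of the exchange is shown.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <fstream>
#include <string>
#include <vector>

// Writer side: write the payload first, create the token file last.
void toyWriteWithToken(const std::string& dataFile, const std::string& tokenFile,
                       const std::vector<double>& data) {
  {
    std::ofstream out(dataFile, std::ios::binary | std::ios::trunc);
    assert(out.good());
    out.write(reinterpret_cast<const char*>(data.data()),
              static_cast<std::streamsize>(data.size() * sizeof(double)));
  }  // payload is flushed and closed before the token becomes visible
  std::ofstream token(tokenFile, std::ios::trunc);
  assert(token.good());
}

bool toyTokenExists(const std::string& tokenFile) {
  return std::ifstream(tokenFile).good();
}

// Reader side: if the token is present, add the remote data to the local
// data and remove the token. Returns whether a reduction took place.
bool toyReadReduce(const std::string& dataFile, const std::string& tokenFile,
                   std::vector<double>& local) {
  if (!toyTokenExists(tokenFile)) return false;
  const std::streamsize bytes =
      static_cast<std::streamsize>(local.size() * sizeof(double));
  std::vector<double> remote(local.size());
  std::ifstream in(dataFile, std::ios::binary);
  in.read(reinterpret_cast<char*>(remote.data()), bytes);
  if (in.gcount() != bytes) return false;  // payload shorter than expected
  for (std::size_t i = 0; i < local.size(); ++i) local[i] += remote[i];
  std::remove(tokenFile.c_str());  // consume the token
  return true;
}
```

A real reader would poll for the token instead of checking once, and on a shared file system the visibility of the token relative to the payload depends on that file system's consistency guarantees.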

inline size_t pretendCombineThirdLevelForBroker(std::vector<long long> numDofsToCommunicate, bool checkValues)

signal to pretend a widely-distributed combination

based on TCP/socket setup with third level manager; for testing the widely-distributed combination between the third level manager and the workers in the third level process group

like combineThirdLevel, but without involving any process groups; dummy data is sent instead

inline void pretendCombineThirdLevelForWorkers()

signal to pretend a widely-distributed combination

based on TCP/socket setup with third level manager; for testing the widely-distributed combination between the workers in and outside the third level process group

inline size_t unifySubspaceSizesThirdLevel(bool thirdLevelExtraSparseGrid)

signal to reduce the subspace sizes between the systems

based on TCP/socket setup with third level manager

Unifies the subspace sizes of all dsgus which are collectively combined during third level reduce:

First, the ProcessGroupManager collects the subspace sizes from all workers’ dsgus. This is achieved in a single MPI_Gatherv call. The sizes of the send buffers are gathered beforehand. Afterwards, the process manager signals readiness to the third level manager, which then decides which system sends first and which receives first.

Sender's role: the ProcessGroupManager first sends data to the third level manager and then receives data from it.

Receiver's role: the ProcessGroupManager first receives data from the third level manager and then sends data to it.

In both roles, the manager reduces the data locally and scatters the updated sizes back to the workers of the third level pg, which then distribute them to the other pgs.

inline size_t pretendUnifySubspaceSizesThirdLevel()

signal to pretend a reduction of the subspace sizes between the systems

based on TCP/socket setup with third level manager; for testing.

like unifySubspaceSizesThirdLevel, but without any widely-distributed communication. Instead, the manager sends only zeros to the third level group, so that the group keeps its own sparse grid sizes
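Why sending zeros leaves the sizes unchanged can be seen in a small sketch. The source does not state which reduction operation is applied to the sizes; the code below assumes an element-wise maximum, and `toyUnifySizes` and `toyPretendUnifySizes` are hypothetical names used only for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Element-wise maximum of two subspace size vectors
// (assumed reduction operation, see the text above).
std::vector<std::size_t> toyUnifySizes(
    const std::vector<std::size_t>& localSizes,
    const std::vector<std::size_t>& remoteSizes) {
  assert(localSizes.size() == remoteSizes.size());
  std::vector<std::size_t> unified(localSizes.size());
  std::transform(localSizes.begin(), localSizes.end(), remoteSizes.begin(),
                 unified.begin(),
                 [](std::size_t a, std::size_t b) { return std::max(a, b); });
  return unified;
}

// "Pretend" variant: the remote contribution is all zeros, so the
// result equals the local sizes.
std::vector<std::size_t> toyPretendUnifySizes(
    const std::vector<std::size_t>& localSizes) {
  return toyUnifySizes(localSizes,
                       std::vector<std::size_t>(localSizes.size(), 0));
}
```

Under this assumption, zero acts as the neutral element of the reduction, which is what makes the pretend variant a no-op for the third level group's sizes.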

void monteCarloThirdLevel(size_t numPoints, std::vector<std::vector<real>> &coordinates, std::vector<CombiDataType> &values)

signal to perform a widely-distributed Monte-Carlo interpolation of the current simulation

based on TCP/socket setup with third level manager

inline void combineSystemWide()

signal to perform a system-wide (not widely-distributed) combination

inline void recomputeOptimumCoefficients(std::string prob_name, std::vector<size_t> &faultsID, std::vector<size_t> &redistributefaultsID, std::vector<size_t> &recomputeFaultsID)

recompute coefficients for the combination technique

based on given grid faults using an optimization scheme; used for fault tolerance

inline Task<CombiDataType> *getTask(size_t taskID)

get a pointer to the task with the given ID

void updateCombiParameters()

signal to receive the combination parameters and send new ones

void getGroupFaultIDs(std::vector<size_t> &faultsID, std::vector<ProcessGroupManagerID<CombiDataType>> &groupFaults)

Computes group faults in current combi scheme step.

void parallelEval(const LevelVector &leval, std::string &filename, size_t groupID)

signal one group to interpolate the current solution from the current sparse grid at resolution level leval

writes the solution to a binary file readable with ParaView

void doDiagnostics(size_t taskID)

signal to perform diagnostics on the task with the given ID

can only be used with Tasks that implement the doDiagnostics method

std::map<size_t, double> getLpNorms(int p = 2)

signal to compute the Lp norm of the current component grids, and gather them

Parameters:

p – the p in Lp norm

Returns:

a map from task ID to Lp norm
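The shape of the return value, one norm per task ID, can be sketched as follows. The code assumes the plain discrete p-norm, (sum over |v|^p)^(1/p) for p >= 1, over each grid's values; the exact norm definition used by DisCoTec (for example any weighting, or special values of p) is not given here, and `toyLpNorm` and `toyLpNorms` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <vector>

// Discrete p-norm of a vector of values, for p >= 1 (assumed definition).
double toyLpNorm(const std::vector<double>& values, int p) {
  assert(p >= 1);
  double sum = 0.0;
  for (double v : values) sum += std::pow(std::abs(v), p);
  return std::pow(sum, 1.0 / p);
}

// One norm per task ID, mirroring the documented return type.
std::map<std::size_t, double> toyLpNorms(
    const std::map<std::size_t, std::vector<double>>& gridsByTaskID,
    int p = 2) {
  std::map<std::size_t, double> norms;
  for (const auto& entry : gridsByTaskID) {
    norms[entry.first] = toyLpNorm(entry.second, p);
  }
  return norms;
}
```

Comparing the per-task norms between combination steps is one simple way to spot a component grid whose solution diverges from the others.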

double getLpNorm(int p = 2)

get the Lp norm of the current combined solution from workers

std::vector<CombiDataType> interpolateValues(const std::vector<std::vector<real>> &interpolationCoords)

signal to interpolate the current solution on all component grids at the given interpolationCoords

requires that the component grids are in nodal representation (not hierarchized). The results are sent to the manager rank and returned here.

void writeInterpolatedValuesSingleFile(const std::vector<std::vector<real>> &interpolationCoords, const std::string &filenamePrefix)

signal to interpolate at the given interpolationCoords and write results to file

like interpolateValues, but the last process group writes the results to a file

void writeInterpolatedValuesPerGrid(const std::vector<std::vector<real>> &interpolationCoords, const std::string &filenamePrefix)

signal to interpolate at the given interpolationCoords at each grid and write results to one file per grid

void writeInterpolationCoordinates(const std::vector<std::vector<real>> &interpolationCoords, const std::string &filenamePrefix) const

write the interpolation coordinates to a file

void writeSparseGridMinMaxCoefficients(const std::string &filename)

signal the last group to write minimum and maximum subspace coefficients to a file

Parameters:

filename – the filename to write to

void redistribute(std::vector<size_t> &taskID)

assign tasks to available process groups

used for fault tolerance: if a process group fails, its tasks are redistributed to other groups

void reInitializeGroup(std::vector<ProcessGroupManagerID<CombiDataType>> &taskID, std::vector<size_t> &tasksToIgnore)

signal to reinitialize the group with the given task IDs

used for fault tolerance

void recompute(std::vector<size_t> &taskID, bool failedRecovery, std::vector<ProcessGroupManagerID<CombiDataType>> &recoveredGroups)

signal to recompute the given task IDs on some recovered groups

used for fault tolerance; the tasks will be re-initialized from the current sparse grid solution

bool recoverCommunicators(std::vector<ProcessGroupManagerID<CombiDataType>> failedGroups)

void restoreCombischeme()

void setupThirdLevel()

establish connection to third level manager

based on TCP/socket setup with third level manager

void reschedule()

perform rescheduling using the given rescheduler and load model.

The rescheduling removes tasks from one process group and assigns them to a different process group. The result of the combination is used to restore values of the newly assigned task. Implications:

  • Should only be called after the combination step and before runnext.

  • Accuracy of calculated values is lost if leval is not equal to 0.
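What a rescheduling decision might look like can be sketched with a deliberately simple policy: move the cheapest task of the most loaded group to the least loaded group, but only if that shrinks the gap between the two. This policy is an assumption chosen for illustration and is not the behaviour of any particular TaskRescheduler in DisCoTec; `ToyMove`, `ToyGroupLoad`, `toyTotalRuntime`, and `toyReschedule` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

struct ToyMove {
  std::size_t taskID;
  std::size_t fromGroup;
  std::size_t toGroup;
};

using ToyGroupLoad = std::map<std::size_t, double>;  // task ID -> measured runtime

double toyTotalRuntime(const ToyGroupLoad& group) {
  double sum = 0.0;
  for (const auto& entry : group) sum += entry.second;
  return sum;
}

// Proposes at most one move from the most loaded to the least loaded group.
std::vector<ToyMove> toyReschedule(const std::vector<ToyGroupLoad>& groups) {
  assert(!groups.empty());
  std::size_t slowest = 0, fastest = 0;
  for (std::size_t g = 1; g < groups.size(); ++g) {
    if (toyTotalRuntime(groups[g]) > toyTotalRuntime(groups[slowest])) slowest = g;
    if (toyTotalRuntime(groups[g]) < toyTotalRuntime(groups[fastest])) fastest = g;
  }
  std::vector<ToyMove> moves;
  const ToyGroupLoad& source = groups[slowest];
  if (source.empty()) return moves;

  const double gap =
      toyTotalRuntime(groups[slowest]) - toyTotalRuntime(groups[fastest]);
  auto cheapest = source.begin();
  for (auto it = source.begin(); it != source.end(); ++it) {
    if (it->second < cheapest->second) cheapest = it;
  }
  // Moving a task shorter than the gap leaves both groups below the
  // previous maximum of the two totals.
  if (cheapest->second < gap) {
    moves.push_back({cheapest->first, slowest, fastest});
  }
  return moves;
}
```

In the documented workflow, such a move would be applied after a combination step and before runnext, so that the moved task can be restored from the combined solution on its new group.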

void writeDSGsToDisk(const std::string &filenamePrefix)

signal all groups to write their sparse grid data structures to disk

void readDSGsFromDisk(const std::string &filenamePrefix)

signal all groups to read their sparse grid data structures from disk