Template Class ProcessManager
Defined in File ProcessManager.hpp
Class Documentation
-
template<typename CombiDataType = double>
class ProcessManager
The ProcessManager class orchestrates the whole simulation in case of a manager-worker scheme in DisCoTec.
It should only be instantiated once, on the manager rank, and it holds one ProcessGroupManager instance per process group for communication with that group.
All other ranks should instantiate a ProcessGroupWorker and call wait() on the worker, until an exit signal is received.
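The division of labor described above can be sketched without MPI. The following is a minimal in-process mock, not DisCoTec code: the `Signal` enum, `MockWorker`, and `simulate` are hypothetical names, and the signal order (run first, combine, run next, combine, exit) is an assumption about a typical run loop rather than something this page specifies.

```cpp
#include <cassert>
#include <queue>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration; these are NOT the DisCoTec types.
enum class Signal { RunFirst, Combine, RunNext, Exit };

// A mock worker that handles one signal per wait() call, mimicking the
// "call wait() until an exit signal is received" loop on the worker ranks.
struct MockWorker {
  std::vector<std::string> log;

  // Handles one signal; returns false once Exit has been received.
  bool wait(std::queue<Signal>& inbox) {
    assert(!inbox.empty());
    const Signal s = inbox.front();
    inbox.pop();
    switch (s) {
      case Signal::RunFirst: log.push_back("runfirst"); return true;
      case Signal::Combine:  log.push_back("combine");  return true;
      case Signal::RunNext:  log.push_back("runnext");  return true;
      case Signal::Exit:     log.push_back("exit");     return false;
    }
    return false;
  }
};

// Feeds the mock worker the signal sequence a manager might send for
// `numCombinations` combination steps (assumed order, see lead-in).
std::vector<std::string> simulate(int numCombinations) {
  assert(numCombinations >= 1);
  std::queue<Signal> inbox;
  inbox.push(Signal::RunFirst);
  inbox.push(Signal::Combine);
  for (int i = 1; i < numCombinations; ++i) {
    inbox.push(Signal::RunNext);
    inbox.push(Signal::Combine);
  }
  inbox.push(Signal::Exit);

  MockWorker worker;
  while (worker.wait(inbox)) {
    // keep waiting until the exit signal arrives
  }
  return worker.log;
}
```

In the real library the signals travel over MPI between the manager rank and the process groups; the queue here only stands in for that channel.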
Public Functions
-
inline ProcessManager(ProcessGroupManagerContainer<CombiDataType> &pgroups, TaskContainer<CombiDataType> &instances, CombiParameters &params, std::unique_ptr<LoadModel> loadModel, std::unique_ptr<TaskRescheduler> rescheduler = std::unique_ptr<TaskRescheduler>(new StaticTaskRescheduler{}))
Constructor.
- Parameters:
pgroups – a vector of ProcessGroupManagers
instances – a vector of Tasks
params – the combination parameters
loadModel – The load model to use for (re)scheduling
rescheduler – The rescheduler to use for dynamic task rescheduling. By default, the static task rescheduler is used and therefore no rescheduling is performed.
-
inline void removeGroups(std::vector<int> removeIndices)
Removes the process groups with the given indices from the simulation.
Used for fault tolerance
-
bool runfirst(bool doInitDSGUs = true)
signal to run the first combination step, i.e. initialize and run each task
this is where initial load balancing is done: tasks are sorted by expected runtime and assigned to process groups as they become available again
- Parameters:
doInitDSGUs – whether to initialize the DSGUs after the first combination step
- Returns:
true if no group failed
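The initial load balancing described for runfirst can be illustrated with a greedy list-scheduling sketch. This is not the DisCoTec implementation: `assignTasks` is a hypothetical function, the longest-first sort order is an assumption, and the moment a group "becomes available again" is simulated here from the expected runtimes instead of being observed at run time.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical helper: returns, for each task index, the index of the
// process group the task would be assigned to.
std::vector<std::size_t> assignTasks(const std::vector<double>& expectedRuntime,
                                     std::size_t numGroups) {
  assert(numGroups > 0);

  // Order task indices by expected runtime, longest first (an assumption).
  std::vector<std::size_t> order(expectedRuntime.size());
  for (std::size_t i = 0; i < order.size(); ++i) order[i] = i;
  std::stable_sort(order.begin(), order.end(),
                   [&](std::size_t a, std::size_t b) {
                     return expectedRuntime[a] > expectedRuntime[b];
                   });

  // Min-heap of (time at which the group becomes free, group index).
  using Slot = std::pair<double, std::size_t>;
  std::priority_queue<Slot, std::vector<Slot>, std::greater<Slot>> freeAt;
  for (std::size_t g = 0; g < numGroups; ++g) freeAt.push({0.0, g});

  // Hand each task to whichever group is free earliest.
  std::vector<std::size_t> groupOfTask(expectedRuntime.size());
  for (std::size_t t : order) {
    const Slot s = freeAt.top();
    freeAt.pop();
    groupOfTask[t] = s.second;
    freeAt.push({s.first + expectedRuntime[t], s.second});
  }
  return groupOfTask;
}
```

Scheduling the most expensive tasks first tends to keep the groups' finishing times close together, which is presumably why the tasks are sorted before assignment.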
-
void initDsgus()
signal to initialize the sparse grid data structures on the worker ranks
-
void exit()
signal to exit the simulation
-
virtual ~ProcessManager() = default
-
void waitForAllGroupsToWait() const
wait until all groups have signaled completion on the last signal
-
bool runnext()
signal to run a combination step after the first one
-
inline void combine()
signal to combine the results of the tasks, according to the CombiParameters
This function performs the so-called recombination. First, the combination solution will be reduced in the given sparse grid space (first within, then across process groups). Also, the local component grids will be updated from the globally combined solution.
-
inline void combineThirdLevel()
signal to perform a widely-distributed combination
based on TCP/socket setup with third level manager.
Combination with third-level parallelism, e.g. between two HPC systems: the process manager first triggers a local and a global combination. It then signals readiness to the third level manager, which decides which system sends and which receives first. All process groups (pgs) that do not participate directly in the third level combination idle in a broadcast function and wait for their update from the third level pg.
Different roles of the manager:
Sender's role: the ProcessGroupManager transfers the dsgus from the workers of the third level pg to the third level manager, which forwards them to the remote system. After sending, it receives the remotely reduced data and sends it back to the third level pg.
Receiver's role: the ProcessGroupManager receives the remote dsgus and reduces them with the local solution. Afterwards, it sends the solution back to the remote system and to the local workers.
-
inline void combineThirdLevelFileBasedWrite(const std::string &filenamePrefixToWrite, const std::string &writeCompleteTokenFileName)
signal to start a widely-distributed combination
based on file-exchange mechanism (w/o third level manager)
-
inline void combineThirdLevelFileBasedReadReduce(const std::string &filenamePrefixToRead, const std::string &startReadingTokenFileName)
signal to reduce the results of a widely-distributed combination
based on file-exchange mechanism (w/o third level manager)
-
inline void combineThirdLevelFileBased(const std::string &filenamePrefixToWrite, const std::string &writeCompleteTokenFileName, const std::string &filenamePrefixToRead, const std::string &startReadingTokenFileName)
signal to perform a whole widely-distributed combination
based on file-exchange mechanism (w/o third level manager); equivalent to calling combineThirdLevelFileBasedWrite and combineThirdLevelFileBasedReadReduce in succession
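The write and read-reduce halves of the file-based exchange can be pictured as a token-file handshake. The sketch below is only an illustration under stated assumptions: `fileExists`, `writeWithToken`, and `readReduceIfReady` are hypothetical helpers, the data is a flat vector of doubles rather than a distributed sparse grid, and the reduction is assumed to be an element-wise sum.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helpers for illustration; these are not DisCoTec functions.
bool fileExists(const std::string& name) { return std::ifstream(name).good(); }

// Writes the data first and creates the token file only afterwards, so a
// reader that sees the token can rely on the data file being complete.
void writeWithToken(const std::vector<double>& data, const std::string& dataFile,
                    const std::string& tokenFile) {
  {
    std::ofstream out(dataFile, std::ios::binary | std::ios::trunc);
    out.write(reinterpret_cast<const char*>(data.data()),
              static_cast<std::streamsize>(data.size() * sizeof(double)));
  }  // the data stream is flushed and closed here
  std::ofstream token(tokenFile);  // empty marker file
}

// Returns false if the token is not there yet; otherwise reads the remote
// data, adds it element-wise to `local` (assumed reduction), and removes
// the token.
bool readReduceIfReady(std::vector<double>& local, const std::string& dataFile,
                       const std::string& tokenFile) {
  if (!fileExists(tokenFile)) return false;
  std::vector<double> remote(local.size());
  std::ifstream in(dataFile, std::ios::binary);
  const auto numBytes =
      static_cast<std::streamsize>(remote.size() * sizeof(double));
  in.read(reinterpret_cast<char*>(remote.data()), numBytes);
  assert(in.gcount() == numBytes);  // the data file must hold a full vector
  for (std::size_t i = 0; i < local.size(); ++i) local[i] += remote[i];
  std::remove(tokenFile.c_str());  // consume the token
  return true;
}
```

Creating the token only after the data file is closed is what lets the two systems coordinate through a shared file system without a third level manager.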
-
inline size_t pretendCombineThirdLevelForBroker(std::vector<long long> numDofsToCommunicate, bool checkValues)
signal to pretend a widely-distributed combination
based on TCP/socket setup with third level manager; for testing the widely-distributed combination between the third level manager and the workers in the third level process group
like combineThirdLevel, but without involving any process groups — sending dummy data instead
-
inline void pretendCombineThirdLevelForWorkers()
signal to pretend a widely-distributed combination
based on TCP/socket setup with third level manager; for testing the widely-distributed combination between the workers in and outside the third level process group
-
inline size_t unifySubspaceSizesThirdLevel(bool thirdLevelExtraSparseGrid)
signal to reduce the subspace sizes between the systems
based on TCP/socket setup with third level manager
Unifies the subspace sizes of all dsgus which are collectively combined during third level reduce:
First, the ProcessGroupManager collects the subspace sizes from all workers’ dsgus. This is achieved in a single MPI_Gatherv call; the sizes of the send buffers are gathered beforehand. Afterwards, the process manager signals readiness to the third level manager, which then decides which system sends and which receives first.
Sender's role: the ProcessGroupManager sends data to the third level manager and then receives data from it.
Receiver's role: the ProcessGroupManager receives data from the third level manager and then sends data to it.
In both roles, the manager locally reduces the data and scatters the updated sizes back to the workers of the third level pg, which then distribute them to the other pgs.
-
inline size_t pretendUnifySubspaceSizesThirdLevel()
signal to pretend a reduction of the subspace sizes between the systems
based on TCP/socket setup with third level manager; for testing.
like unifySubspaceSizesThirdLevel, but without sending any data widely. Instead, the manager sends only zeros to the third level group, so that group keeps its own sparse grid sizes.
-
void monteCarloThirdLevel(size_t numPoints, std::vector<std::vector<real>> &coordinates, std::vector<CombiDataType> &values)
signal to perform a widely-distributed Monte-Carlo interpolation of the current simulation
based on TCP/socket setup with third level manager
-
inline void combineSystemWide()
signal to perform a system-wide (not widely-distributed) combination
-
inline void recomputeOptimumCoefficients(std::string prob_name, std::vector<size_t> &faultsID, std::vector<size_t> &redistributefaultsID, std::vector<size_t> &recomputeFaultsID)
recompute coefficients for the combination technique
based on given grid faults using an optimization scheme; used for fault tolerance
-
inline Task<CombiDataType> *getTask(size_t taskID)
get a pointer to the task with the given ID
-
void updateCombiParameters()
signal to receive the combination parameters and send new ones
-
void getGroupFaultIDs(std::vector<size_t> &faultsID, std::vector<ProcessGroupManagerID<CombiDataType>> &groupFaults)
Computes group faults in current combi scheme step.
-
void parallelEval(const LevelVector &leval, std::string &filename, size_t groupID)
signal one group to interpolate the current solution from the current sparse grid at resolution level leval
writes the solution to a binary file readable with Paraview
-
void doDiagnostics(size_t taskID)
signal to perform diagnostics on the task with the given ID
can only be used with Tasks that implement the doDiagnostics method
-
std::map<size_t, double> getLpNorms(int p = 2)
signal to compute the Lp norm of the current component grids, and gather them
- Parameters:
p – the p in Lp norm
- Returns:
a map from task ID to Lp norm
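For reference, the quantity behind the parameter p can be written as a plain discrete vector norm. This is a minimal sketch, assuming an unweighted norm over a flat array of values and p >= 1; `lpNorm` is a hypothetical free function, and the library may well scale or weight the values differently on the actual component grids.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical helper: unweighted discrete Lp norm,
// (sum_i |v_i|^p)^(1/p), for p >= 1.
double lpNorm(const std::vector<double>& values, int p = 2) {
  assert(p >= 1);
  double sum = 0.0;
  for (double v : values) sum += std::pow(std::abs(v), p);
  return std::pow(sum, 1.0 / p);
}
```

With the default p = 2 this is the Euclidean norm, and with p = 1 it is the sum of absolute values.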
-
double getLpNorm(int p = 2)
get the Lp norm of the current combined solution from workers
-
std::vector<CombiDataType> interpolateValues(const std::vector<std::vector<real>> &interpolationCoords)
signal to interpolate the current solution on all component grids at the given interpolationCoords
requires that the component grids are in nodal representation (not hierarchized). The results are sent to the manager rank and returned here.
-
void writeInterpolatedValuesSingleFile(const std::vector<std::vector<real>> &interpolationCoords, const std::string &filenamePrefix)
signal to interpolate at the given interpolationCoords and write results to file
like interpolateValues, but the last process group writes the results to a file
-
void writeInterpolatedValuesPerGrid(const std::vector<std::vector<real>> &interpolationCoords, const std::string &filenamePrefix)
signal to interpolate at the given interpolationCoords at each grid and write results to one file per grid
-
void writeInterpolationCoordinates(const std::vector<std::vector<real>> &interpolationCoords, const std::string &filenamePrefix) const
write the interpolation coordinates to a file
-
void writeSparseGridMinMaxCoefficients(const std::string &filename)
signal the last group to write minimum and maximum subspace coefficients to a file
- Parameters:
filename – the filename to write to
-
void redistribute(std::vector<size_t> &taskID)
assign tasks to available process groups
used for fault tolerance: if a process group fails, its tasks are redistributed to other groups
-
void reInitializeGroup(std::vector<ProcessGroupManagerID<CombiDataType>> &taskID, std::vector<size_t> &tasksToIgnore)
signal to reinitialize the group with the given task IDs
used for fault tolerance
-
void recompute(std::vector<size_t> &taskID, bool failedRecovery, std::vector<ProcessGroupManagerID<CombiDataType>> &recoveredGroups)
signal to recompute the given task IDs on some recovered groups
used for fault tolerance; the tasks will be re-initialized from the current sparse grid solution
-
bool recoverCommunicators(std::vector<ProcessGroupManagerID<CombiDataType>> failedGroups)
-
void restoreCombischeme()
-
void setupThirdLevel()
establish connection to third level manager
based on TCP/socket setup with third level manager
-
void reschedule()
perform rescheduling using the given rescheduler and load model.
The rescheduling removes tasks from one process group and assigns them to a different process group. The result of the combination is used to restore values of the newly assigned task. Implications:
Should only be called after the combination step and before runnext.
Accuracy of calculated values is lost if leval is not equal to 0.
-
void writeDSGsToDisk(const std::string &filenamePrefix)
signal all groups to write their sparse grid data structures to disk
-
void readDSGsFromDisk(const std::string &filenamePrefix)
signal all groups to read their sparse grid data structures from disk
-