Parallel Utilities

The landscape of parallelization within C++ is currently undergoing a change: since C++11 the language has acquired a set of standard extensions which aim to rival OpenMP, namely the thread class. A number of utilities exist which simplify the parallelization of loops by using the C++11 lambda syntax. Some advantages of these functions are the following:

  • direct iteration over containers with random access.
  • allow defining custom reductions, a feature that was previously unavailable due to limitations of the OpenMP support in MSVC.
  • will (in the future) allow switching to a pure C++ implementation (no OpenMP).
  • exception handling in parallel regions (see the sketch below).
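
To illustrate the last point: unlike a plain OpenMP loop, an exception thrown inside the lambda is expected to propagate out of the parallel loop. The following is a minimal sketch, assuming the usual KRATOS_ERROR macro and Kratos::Exception type are available:

try {
    block_for_each(model_part.Nodes(), [](Node<3>& rNode){
        if (rNode.Id() == 0) {
            // errors raised inside the parallel region...
            KRATOS_ERROR << "node ids are expected to be positive";
        }
    });
} catch (const Kratos::Exception&) {
    // ...can be handled (or rethrown) serially, outside of the parallel region
}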

Basic, statically scheduled for loop

The implementation is divided, under the hood, into a partitioner class and a "for" function, both contained in the header "parallel_utilities.h".

See also the tests of the parallel utilities for more details.

As an example, a parallel for loop over some nodes, setting the value of a variable to a prescribed value, would look in the user code as follows:

block_for_each(model_part.Nodes(), [&](Node<3>& rNode){
    noalias(rNode.FastGetSolutionStepValue(rVariable)) = value;
});

It is worth focusing on the use of the C++ lambda function. In the code above everything is captured by reference (the [&]), thus emulating OpenMP's default shared. The user should refer to the C++11 documentation or to a tutorial (for example https://www.cprogramming.com/c++11/c++11-lambda-closures.html) to understand the syntax of lambdas.

A limitation of using C++11 is that the type of the iterated object must be specified explicitly in the lambda function. Once we eventually switch to C++14, the above will simplify to:

block_for_each(model_part.Nodes(), [&](auto& rNode){
    noalias(rNode.FastGetSolutionStepValue(rVariable)) = value;
});

This type of loop is good for implementing simple cases, in which no firstprivate or private variables are needed. The point here is that passing a private value to the lambda can be trivially achieved by capturing the relevant variable by value. For example, we could have made a local copy of value as follows (note that [value] captures by value the variable of the same name):

block_for_each(model_part.Nodes(), [value](auto& rNode){
    noalias(rNode.FastGetSolutionStepValue(rVariable)) = value;
});

Note that this usage has performance implications, since lambda capturing happens _once per execution of the lambda_ and not once per thread as was the case for OpenMP.

For emulating the private and firstprivate of OpenMP, a Thread Local Storage (TLS) object has to be defined and passed to the function. This could be e.g. a Matrix that acts as a temporary container and can be reused within the same thread. It is important to know that the TLS is copy-constructed into each thread.

// here the TLS (a Matrix) is constructed in place, which is the equivalent of "private" in OpenMP
block_for_each(model_part.Elements(), Matrix(), [](Element& rElement, Matrix& rTLSMatrix){
    rElement.CalculateLocalSystem(rTLSMatrix, ...);
});

More complex TLS structures can be realized with e.g. a struct, std::tuple, or other data structures that group data:

struct my_tls {
    Vector mVec;
    Matrix mMat;
};
// here the TLS (an instance of my_tls) is constructed in place, which is the equivalent of "private" in OpenMP
block_for_each(model_part.Elements(), my_tls(), [](Element& rElement, my_tls& rTLS){
    // access the internals with:
    rElement.GetGeometry().LumpingFactors(rTLS.mVec);
});

If the thread local storage has to contain specific data before being passed, the TLS can be initialized beforehand. This is the equivalent of the firstprivate of OpenMP:

// initialize thread local storage
Matrix tls_matrix(2,2);
// initialize the tls as desired
// ...

// then pass the fully initialized TLS to the function
block_for_each(model_part.Elements(), tls_matrix, [](Element& rElement, Matrix& rTLSMatrix){
    rElement.CalculateLocalSystem(rTLSMatrix, ...);
});

Simple Reductions

The proposed functions also provide the essential tools to allow reductions (see the reduction_utilities for what is currently available).

The idea is that:

  • the reduction type is specified by the template parameter
  • the value to be reduced is specified by the return value of the lambda function

Let's imagine that we want to find the maximum value of a given input "data_vector". This is achieved by:

std::vector<double> data_vector(100000);
double max_value = block_for_each<MaxReduction<double>>(data_vector, [](double& item){
    return item; // note that the value to be reduced should be returned
});

The same interface also allows multiple reductions to be combined at the same time. For example, we could compute the sum_value and the max_value at once by:

// here we define how to combine more than one reductions at once
using MultipleReduction = CombinedReduction< 
                              SumReduction<double>,
                              MaxReduction<double>>; 

double sum_value, max_value;
std::tie(sum_value, max_value) = block_for_each<MultipleReduction>(data_vector, [](double& item){
    double to_sum = item;
    double to_max = item;
    return std::make_tuple(to_sum, to_max); // note that these may have different types
});

defining "custom" reductions

The approach described also allows users to define their own custom reductions. This is as simple as defining a "Reducer" class which essentially provides 3 functions:

  • GetValue - returns the reduced value
  • LocalReduce - the reduction local to a thread; this is the "serial" implementation of the reduction, without any need for thread safety
  • ThreadSafeReduce - the threadsafe version of the reduction (called as few times as possible), used to sync the reduced value between the threads

An example implementation, computing at once the reduction of the max_value and of the max_absolute_value, is shown in the following code snippet:

class CustomReducer{
public:
    using value_type = std::tuple<double,double>;
    double max_value = std::numeric_limits<double>::lowest();
    double max_abs = 0.0;

    // returns the (partially) reduced value
    value_type GetValue()
    {
        return std::make_tuple(max_value, max_abs);
    }

    // serial reduction, local to one thread: no thread safety needed
    void LocalReduce(double function_return_value)
    {
        this->max_value = std::max(this->max_value, function_return_value);
        this->max_abs   = std::max(this->max_abs, std::abs(function_return_value));
    }

    // threadsafe reduction: merges the local result of another thread
    void ThreadSafeReduce(CustomReducer& rOther)
    {
        #pragma omp critical
        {
            this->max_value = std::max(this->max_value, rOther.max_value);
            this->max_abs   = std::max(this->max_abs, rOther.max_abs); // rOther.max_abs is already non-negative
        }
    }
};
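
A minimal usage sketch (reusing the data_vector from the previous examples, and assuming the custom reducer can be passed as the template parameter exactly like the built-in reductions):

double max_value, max_abs;
std::tie(max_value, max_abs) = block_for_each<CustomReducer>(data_vector, [](double& item){
    return item; // the returned value is fed to CustomReducer::LocalReduce
});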

Index-based loops

The last relevant feature is the possibility of parallelizing "by integer index". A very simple example, in which we raise every entry of a given vector to the power 0.01, is:

std::vector<double> data_vector(100000);
std::vector<double> output(data_vector.size()); // the results are written to a separate vector
// here we partition the index span of data_vector, and then use [i] access
IndexPartition<unsigned int>(data_vector.size()).for_each([&](unsigned int i){
    output[i] = std::pow(data_vector[i], 0.01);
});

This feature is important since it also opens the way to the definition of "parallel sections": one could take the number of partitions to be equal to the total number of threads to be used.
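
As a sketch of this idea (assuming ParallelUtilities::GetNumThreads() is available for querying the thread count, and with DoIndependentWork as a hypothetical user function):

// one index per thread, each index acting as an independent "section"
const int num_threads = ParallelUtilities::GetNumThreads(); // assumed helper
IndexPartition<int>(num_threads).for_each([&](int ThreadId){
    DoIndependentWork(ThreadId); // hypothetical per-section work
});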

Skipping iterations

In regular loops, continue skips the execution of the remaining loop body for the current iteration:

for (...) {
    // this is always executed
    if (some_condition) {
        // if "some_condition" is true, then the execution 
        // of the remaining loop body is skipped
        continue;
    }
    // everything from here is only executed if "some_condition" was false
}

The continue keyword cannot be used with the parallel utilities, since a lambda is passed to the functions (the lambda is called once per loop iteration). In order to skip part of the loop body, the return keyword has to be used instead:

block_for_each(..., [](auto& item){
    // this is always executed
    if (some_condition) {
        // if "some_condition" is true, then the execution 
        // of the remaining loop body is skipped
        return;
    }
    // everything from here is only executed if "some_condition" was false
});

Note that the same applies to standard library algorithms such as std::for_each.

Synchronization

Synchronization is necessary when a shared resource is modified in parallel, e.g. when two threads want to add to the same variable at the same time.

Currently, atomic operations are provided in the atomic_utilities. In the future, critical sections will also be supported.
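
As a sketch (assuming atomic_utilities provides an AtomicAdd(target, value) helper; check the header for the exact names), a parallel accumulation could look like:

// thread-safe accumulation into a shared variable
double total = 0.0;
block_for_each(model_part.Nodes(), [&](Node<3>& rNode){
    const double contribution = rNode.FastGetSolutionStepValue(rVariable); // a scalar variable is assumed here
    AtomicAdd(total, contribution); // assumed helper from atomic_utilities
});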

Switch from OpenMP-based parallelization to C++-based parallelization

The CMake flag KRATOS_SHARED_MEMORY_PARALLELIZATION can be used to select the mechanism that is used for the shared memory parallelization. The following options are available:

  • OpenMP (currently the default)
  • C++11
  • None: disables shared memory parallelization. This can be useful on systems that don't have OpenMP, or when Kratos is only used with MPI on a cluster, in order to avoid issues from the interaction between shared and distributed memory (aka hybrid) parallelization.
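
For example, the flag could be passed when configuring with CMake (a sketch; the value has to be one of the options listed above):

cmake .. -DKRATOS_SHARED_MEMORY_PARALLELIZATION="OpenMP"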

TODO:

Some features that will (well, let's say "may", also depending on requests) be added in the future are:

  • port the BuilderAndSolvers (here the nowait makes it tricky)
  • add an interface for dynamic scheduling
  • port the remaining OpenMP loops to the parallel_utilities
