
Parallel Utilities

The landscape of parallelization within C++ is currently undergoing a change: since C++11 the language has been acquiring a set of standard extensions which aim to rival OpenMP. We propose a number of utilities which should simplify the parallelization of loops by using the C++11 lambda syntax, while also allowing the direct use of iterators in loops. The new features also allow defining custom reductions, a feature we were previously prevented from using due to the limited OpenMP support in MSVC.

Basic, statically scheduled for loop

The implementation is divided, under the hood, into a partitioner class and a "for" function, both contained in the header "parallel_utilities.h".
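
To fix ideas, a minimal sketch of what the "for" function could look like on top of a simple static partitioning is given below. This is only an illustrative sketch (based on OpenMP, and assuming one chunk per thread), not the actual content of "parallel_utilities.h":

    //illustrative sketch, not the actual Kratos implementation
    #include <algorithm>
    #include <cstddef>
    #include <omp.h>

    template<class TContainer, class TFunction>
    inline void block_parallel_for(TContainer& rContainer, TFunction f)
    {
        const std::ptrdiff_t n = rContainer.size();
        const int nchunks = omp_get_max_threads(); //one chunk per thread (assumption)
        const std::ptrdiff_t block_size = (n + nchunks - 1) / nchunks;

        #pragma omp parallel for
        for(int i = 0; i < nchunks; ++i){
            const auto it_begin = rContainer.begin() + std::min<std::ptrdiff_t>( i    * block_size, n);
            const auto it_end   = rContainer.begin() + std::min<std::ptrdiff_t>((i+1) * block_size, n);
            for(auto it = it_begin; it != it_end; ++it)
                f(*it); //apply the user-provided lambda to every item of the chunk
        }
    }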

As an example, a parallel for loop that sets the value of a variable to a prescribed value would look as follows in the user code:

     block_parallel_for(model_part.Nodes(), [&](Node<3>& rNode){
            noalias(rNode.FastGetSolutionStepValue(rVariable)) = value;
        });

It is worth focusing on the use of the C++ lambda function. In the code above everything is captured by reference (the [&]), thus emulating OpenMP's default(shared). The user should refer to the C++11 documentation or to a tutorial, for example https://www.cprogramming.com/c++11/c++11-lambda-closures.html, to understand the syntax of lambdas.

A limitation of using C++11 is that the type of the lambda argument must be specified explicitly. Once we eventually switch to C++14, the above will simplify to:

     block_parallel_for(model_part.Nodes(), [&](auto& rNode){
            noalias(rNode.FastGetSolutionStepValue(rVariable)) = value;
        });

This type of loop is well suited to simple cases, in which no firstprivate or private variables are needed. The point here is that passing a private value to the lambda can be achieved trivially by capturing the relevant variable by value. For example, we could have made a local copy of value by writing the following (which captures by value the variable of the same name):

     block_parallel_for(model_part.Nodes(), [value, &rVariable](Node<3>& rNode)
        {
            noalias(rNode.FastGetSolutionStepValue(rVariable)) = value;
        }
     );

Note that this usage has performance implications: the lambda body (and hence any setup of "private" data done within it) is executed _once per item_, and not once per thread as is the case with OpenMP's firstprivate. We plan in the future to provide a more general version of this function to allow "per chunk" thread-local storage.

Simple Reductions

The proposed functions also provide the essential tools for reductions.

The idea is that:

  • the reduction type is specified by the template parameter
  • the value to be reduced is specified by the return value of the lambda function

Let's imagine that we want to find the maximum value of a given input "data_vector". This is achieved by:

    double max_value = block_for_reduce<MaxReduction<double>>(data_vector, [](const double& item){
            return item; //NOTE THAT WE RETURN THE VALUE TO BE REDUCED!
        });

The same interface also allows multiple reductions to be combined at once. For example, we could compute the max_value and the sum_value at the same time by:

    //here we define how to combine more than one reduction at once
    typedef CombinedReduction< SumReduction<double>,
                               MaxReduction<double>
            > MultipleReduction;

    double sum_value, max_value;
    std::tie(sum_value, max_value) = block_for_reduce<MultipleReduction>(data_vector, [](const double& item){
                    double to_sum = item;
                    double to_max = item;
                    return std::make_tuple( to_sum, to_max ); //note that these may have different types
                });

Defining "custom" reductions

The approach described also allows users to define their own custom reductions. This is as simple as defining a "Reducer" class which essentially provides 3 functions:

  • GetValue - returns the reduced value
  • LocalReduce - the reduction local to a thread; this is the "serial" implementation of the reduction, with no need for thread safety
  • ThreadSafeReduce - the thread-safe version of the reduction (called as few times as possible), used to sync the reduced value between the threads

An example implementation, computing at once the reduction of the max_value and of the max_absolute_value, is shown in the following code snippet:

    class CustomReducer{
        public:
            typedef std::tuple<double,double> value_type;
            double max_value = -std::numeric_limits<double>::max();
            double max_abs = 0.0;

            value_type GetValue()
            {
                value_type values;
                std::get<0>(values) = max_value;
                std::get<1>(values) = max_abs;
                return values;
            }

            void LocalReduce(double function_return_value){
                this->max_value = std::max(this->max_value,function_return_value);
                this->max_abs   = std::max(this->max_abs,std::abs(function_return_value));
            }
            void ThreadSafeReduce(CustomReducer& rOther){
                #pragma omp critical
                {
                this->max_value = std::max(this->max_value,rOther.max_value);
                this->max_abs   = std::max(this->max_abs,rOther.max_abs); //rOther.max_abs is already non-negative
                }
            }
    };
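
A possible use of this reducer, assuming that block_for_reduce accepts as its template parameter any class exposing the interface above (as it does for the built-in reducers), would then be:

    double max_value, max_abs;
    std::tie(max_value, max_abs) = block_for_reduce<CustomReducer>(data_vector, [](const double& item){
            return item; //the returned value is what CustomReducer::LocalReduce receives
        });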

Index Based loops

The signature of this more flexible function is:

    template<class InputIt, class BinaryFunction>
    inline void block_parallel_for(InputIt first, const int n, const int NChunks, BinaryFunction f)

which is also useful when writing reductions, for example to sum a value obtained from all the nodes:

    array_1d<double, 3> sum_value = ZeroVector(3);
    
    //here a reduction is to be made. This is achieved by using the block parallel for and writing the chunk loop explicitly
    //note also that, in order to be "didactic", all the values are captured explicitly by reference, except for rVar which is captured by value
    Kratos::Parallel::block_parallel_for(rModelPart.NodesBegin(), rModelPart.NumberOfNodes(), nchunks, //nchunks = number of chunks into which the loop is split
                                            [&sum_value, rVar, &rModelPart](auto it_begin, auto it_end){
            array_1d<double, 3> tmp = ZeroVector(3);
            for(auto it = it_begin; it != it_end; ++it){
                noalias(tmp) += it->GetValue(rVar);
            }
            
            Kratos::Parallel::AtomicAddVector(tmp, sum_value); //here we do sum_value += tmp
        }
    );

The important point here is that the array of nodes is subdivided into nchunks portions, each of which is processed independently. The lambda must then process the data of its entire chunk, so that the lambda invocation (and the synchronization of the reduction variable) only happens nchunks times.

The example also shows the use of the AtomicAddVector function. A number of atomic functions exist (a short scalar example is given after the list):

  • AtomicAdd/AtomicSub/AtomicAssign for integers and doubles
  • AtomicAddVector/AtomicSubVector/AtomicAssignVector for vectors
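
For example, a scalar accumulation analogous to the vector example above could look as follows (a sketch only: the argument order of AtomicAdd is assumed to mirror the AtomicAddVector call shown earlier, and rVar is assumed here to be a Variable<double>):

    double total = 0.0;
    Kratos::Parallel::block_parallel_for(rModelPart.NodesBegin(), rModelPart.NumberOfNodes(), nchunks,
                                            [&total, rVar](auto it_begin, auto it_end){
            double tmp = 0.0;
            for(auto it = it_begin; it != it_end; ++it)
                tmp += it->GetValue(rVar);
            Kratos::Parallel::AtomicAdd(tmp, total); //here we do total += tmp
        }
    );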

Crucial difference with respect to OpenMP parallelism

The type of parallelism introduced in C++11 and in the following standards makes a fundamentally different assumption with respect to OpenMP regarding the total number of threads being run. This is a design decision aimed at allowing, in the future, support for GPU-type parallelization within the standard library.

From the user perspective, the practical effect of this is that, by design, the new C++ standards avoid defining a function equivalent to omp_get_thread_num, which in OpenMP returns the integer id of the thread executing the code. The closest equivalent is

      std::thread::id this_id = std::this_thread::get_id(); 

which, as described in http://en.cppreference.com/w/cpp/thread/get_id, returns an opaque id object that can be compared and hashed but does not correspond to a small consecutive integer. The practical implication of this is that the allocation of local arrays of size nthreads, indexed by the thread number, is not viable.
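
As a small self-contained illustration (not Kratos code): the returned id can be printed or hashed, but it cannot be used directly as an index into an array of size nthreads:

    #include <functional>
    #include <iostream>
    #include <thread>

    int main()
    {
        std::thread::id this_id = std::this_thread::get_id();
        std::cout << "thread id: " << this_id << std::endl;                               //opaque id object
        std::cout << "hashed id: " << std::hash<std::thread::id>{}(this_id) << std::endl; //usable e.g. as a map key, but not as a small array index
        return 0;
    }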
