
Add a chapter about threading layer into the docs #2848

Open
wants to merge 17 commits into base: main

Conversation

@Vika-F Vika-F commented Jul 16, 2024

  • Added a chapter about internal threading primitives to the oneDAL documentation
  • Improved the comments in the header that specifies those primitives' APIs (threading.h)

@Vika-F added the docs label (Issue/PR related to oneDAL docs) on Jul 16, 2024
Vika-F commented Jul 30, 2024

@emmwalsh Please review this PR, as it mostly deals with oneDAL docs.

Vika-F commented Jul 30, 2024

CONTRIBUTING.md Outdated

### Threading Layer

In the source code of the algorithms, oneDAL does not use threading primitives directly. All the threading primitives used within oneDAL form so called [threading layer](http://oneapi-src.github.io/oneDAL/contribution/threading.html). Contributors should leverage the primitives from the layer to implement parallel algorithms.
Contributor

Suggested change
In the source code of the algorithms, oneDAL does not use threading primitives directly. All the threading primitives used within oneDAL form so called [threading layer](http://oneapi-src.github.io/oneDAL/contribution/threading.html). Contributors should leverage the primitives from the layer to implement parallel algorithms.
In the source code of the algorithms, oneDAL does not use threading primitives directly. All the threading primitives used within oneDAL form are called the [threading layer](http://oneapi-src.github.io/oneDAL/contribution/threading.html). Contributors should leverage the primitives from the layer to implement parallel algorithms.

virtual ~tls()
{
d->del(voidLambda);
delete d;
_daal_del_tls_ptr(tlsPtr);
}

/// Access a local data of a thread by value
///
/// @return When first ionvoced by a thread, a lambda provided to the constructor is
Contributor

Suggested change
/// @return When first ionvoced by a thread, a lambda provided to the constructor is
/// @return When first invoked by a thread, a lambda provided to the constructor is

threader_for
************

Lets consider you need to compute an elementwise sum of two arrays and store the results
Contributor

Suggested change
Lets consider you need to compute an elementwise sum of two arrays and store the results
Consider a case where you need to compute an elementwise sum of two arrays and store the results


.. include:: ../includes/threading/sum-sequential.rst

There are several options available in oneDAL's threading layer to let the iterations of this code
Contributor

Suggested change
There are several options available in oneDAL's threading layer to let the iterations of this code
There are several options available in the threading layer of oneDAL to let the iterations of this code
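To make the example concrete, here is a hedged sketch of such an elementwise sum parallelized with plain std::thread. `threader_for_sketch` is a hypothetical stand-in that only mirrors the call shape of the documented loop primitive; the real threading layer delegates scheduling to oneTBB.

```cpp
#include <thread>
#include <vector>

// Minimal stand-in for a parallel for-loop primitive: iterations of [0, n)
// are distributed round-robin over a few std::threads. This only illustrates
// the call shape `for(n, [&](int i) { ... })`; it is not oneDAL code.
template <typename Body>
void threader_for_sketch(int n, const Body & body)
{
    const unsigned hw        = std::thread::hardware_concurrency();
    const int      nThreads  = hw ? static_cast<int>(hw) : 2;
    std::vector<std::thread> workers;
    for (int t = 0; t < nThreads; ++t)
    {
        workers.emplace_back([=, &body]() {
            // Thread t handles iterations t, t + nThreads, t + 2 * nThreads, ...
            for (int i = t; i < n; i += nThreads) body(i);
        });
    }
    for (auto & w : workers) w.join();
}

// Elementwise sum c[i] = a[i] + b[i] -- the loop body from the example above.
void parallel_sum(const float * a, const float * b, float * c, int n)
{
    threader_for_sketch(n, [=](int i) { c[i] = a[i] + b[i]; });
}
```

Each iteration writes a distinct element of `c`, so no synchronization between threads is needed here.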

Contributor

The term "oneDAL's" is not allowed by the namesdb.intel.com. There might be different rules if you are referring to the open source version, but for the most part, the open source name should follow the same rules as the branded name.

Contributor Author

That is new information for me. Thank you!
I have fixed all the mentions in this PR.

Parallel computations can be performed in two steps:

1. Compute partial dot product at each threaded.
2. Perform a reduction: Sum the partial results from all threads to compute the final dot product.
Contributor

Instead of "sum the partial results" should it be "add the partial results"?

Contributor Author

Probably yes. Not sure as I am not a native speaker. Modified to "add the partial results".
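The two-step scheme under discussion (partial dot products per thread, then adding the partial results) can be sketched in portable C++. This is a hypothetical illustration using std::thread and a plain vector of per-thread partials, not the actual thread-local-storage primitive from the threading layer.

```cpp
#include <thread>
#include <vector>

// Two-step parallel dot product:
//   1. Each thread accumulates a partial dot product over its iterations.
//   2. The partial results from all threads are added to form the final value.
double parallel_dot_sketch(const double * a, const double * b, int n)
{
    const int nThreads = 4; // fixed thread count keeps the sketch simple
    std::vector<double> partial(nThreads, 0.0); // one slot per thread
    std::vector<std::thread> workers;
    for (int t = 0; t < nThreads; ++t)
    {
        workers.emplace_back([=, &partial]() {
            double local = 0.0; // step 1: thread-local accumulation
            for (int i = t; i < n; i += nThreads) local += a[i] * b[i];
            partial[t] = local; // each thread writes only its own slot
        });
    }
    for (auto & w : workers) w.join();

    // Step 2: add the partial results from all threads.
    double result = 0.0;
    for (double p : partial) result += p;
    return result;
}
```

Writing to distinct slots of `partial` avoids data races; only the final sequential reduction touches all slots.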

by each iteration.

In the cases when it is known that the iterations perform equal amount of work it
might be benefitial to use pre-defined mapping of the loop's iterations to threads.
Contributor

Suggested change
might be benefitial to use pre-defined mapping of the loop's iterations to threads.
might be benefitial to use predefined mapping of the loop's iterations to threads.

******************

It is allowed to have nested parallel loops within oneDAL.
What is important to know is that
Contributor

Suggested change
What is important to know is that
It is important to know that:


-- `oneTBB documentation <https://www.intel.com/content/www/us/en/docs/onetbb/developer-guide-api-reference/2021-13/work-isolation.html>`_

In practice this means that, for example, a thread-local variable might unexpectedly
Contributor

Suggested change
In practice this means that, for example, a thread-local variable might unexpectedly
In practice, this means that a thread-local variable might unexpectedly

calls another parallel algorithm, the inner one, within a parallel region.

The inner algorithm in this case can also be called solely, without additional nesting.
And we do not want to always make it isolated.
Contributor

Suggested change
And we do not want to always make it isolated.
And we do not always want to make it isolated.

The inner algorithm in this case can also be called solely, without additional nesting.
And we do not want to always make it isolated.

For the cases like that oneDAL provides ``daal::ls``. Its ``local()`` method always
Contributor

Suggested change
For the cases like that oneDAL provides ``daal::ls``. Its ``local()`` method always
For the cases like that, oneDAL provides ``daal::ls``. Its ``local()`` method always

@emmwalsh left a comment

LGTM!!

/// Let ``t`` be the number of threads available to oneDAL.
///
/// Then the number of iterations processed by each threads (except maybe the last one)
/// is copmputed as:
Contributor

Suggested change
/// is copmputed as:
/// is computed as:

/// is copmputed as:
/// ``nI = (n + t - 1) / t``
///
/// Here is how the work in split across the threads:
Contributor

Suggested change
/// Here is how the work in split across the threads:
/// Here is how the work is split across the threads:
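The splitting rule from the documented comment can be checked with a small helper. `static_chunk` and `IterationRange` are hypothetical names, shown only to illustrate the arithmetic `nI = (n + t - 1) / t` and how the last thread may receive fewer iterations.

```cpp
// Work split used by a static schedule: with n iterations and t threads,
// each thread (except maybe the last) processes nI = (n + t - 1) / t
// iterations, and thread k gets [k * nI, min(n, (k + 1) * nI)).
struct IterationRange
{
    int first; // index of the first iteration assigned to the thread
    int last;  // one past the last assigned iteration
};

IterationRange static_chunk(int n, int t, int k)
{
    const int nI = (n + t - 1) / t; // ceiling of n / t
    int first = k * nI;
    if (first > n) first = n;       // threads past the end get an empty range
    int last = first + nI;
    if (last > n) last = n;         // the last thread may get fewer iterations
    return { first, last };
}
```

For example, with n = 10 and t = 4, nI = 3, so the threads receive [0, 3), [3, 6), [6, 9), and [9, 10).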

computations on CPU.

But oneTBB is not used in the code of oneDAL algorithms directly. The algorithms rather
use custom primitives that either wrap oneTBB functionality or are inhome developed.
Contributor

Suggested change
use custom primitives that either wrap oneTBB functionality or are inhome developed.
use custom primitives that either wrap oneTBB functionality or are developed in-house.


The API of the layer is defined in
`threading.h <https://github.com/oneapi-src/oneDAL/blob/main/cpp/daal/src/threading/threading.h>`_.
Please be aware that those APIs are not publicly defined. So they can be changed at any time
Contributor

Suggested change
Please be aware that those APIs are not publicly defined. So they can be changed at any time
Please be aware that those APIs are not publicly defined, so they can be changed at any time

Comment on lines 128 to 129
``daal::static_threader_for`` and ``daal::static_tls`` allow to implement static
work scheduling within oneDAL.
Contributor

Suggested change
``daal::static_threader_for`` and ``daal::static_tls`` allow to implement static
work scheduling within oneDAL.
``daal::static_threader_for`` and ``daal::static_tls`` allow implementation of
static work scheduling within oneDAL.

Nested parallelism
******************

It is allowed to have nested parallel loops within oneDAL.
Contributor

Suggested change
It is allowed to have nested parallel loops within oneDAL.
OneDAL supports nested parallel loops.

@Vika-F Vika-F removed the request for review from maria-Petrova July 31, 2024 08:55

Vika-F commented Jul 31, 2024

Thank you @bdmoore1, @ethanglaser, and @emmwalsh for the thorough reviews.
I've made all the proposed modifications. Hope to merge soon.

@napetrov commented

@keeranroth @rakshithgb-fujitsu - please check this one: how clearly the topic is described and whether you would like any additional details.

And for the future: topics beyond threading that might be worth adding.

by each iteration.

In the cases when it is known that the iterations perform an equal amount of work, it
might be benefitial to use predefined mapping of the loop's iterations to threads.
Contributor

Suggested change
might be benefitial to use predefined mapping of the loop's iterations to threads.
might be beneficial to use predefined mapping of the loop's iterations to threads.

@ethanglaser left a comment

Just one more minor spelling adjustment mentioned above. Otherwise this looks great!

Comment on lines 230 to 231
/// @tparam F Lambda function of type ``[/* captures */](int i) -> void``,
/// where ``i`` is the loop's iteration index, ``0 <= i < n``.
Contributor

Just double checking, but I've always used single backtick for monospaced in doxygen. Is the double backtick valid?

Contributor Author

Thanks for noticing this. We use not only Doxygen, but also reStructuredText plus some Python pre-processing.
The double backtick is valid, but it results in italic text.
I'm going to replace it with a single backtick to get monospace.

@@ -223,14 +223,33 @@ inline void threader_func_break(int i, bool & needBreak, const void * a)
lambda(i, needBreak);
}

/// Execute the for loop defined by the input parameters in parallel.
/// The maximal number of iterations in the loop is 2^31 - 1.
/// The work is scheduled dynamically across threads.
Contributor

The comment suggests that the work will be executed in parallel, but it may not be the case. It depends on the implementation behind the abstraction. This function passes the parameters to the threading layer, which may execute the work in parallel. And must the work be scheduled dynamically? Can the threading layer decide to just split n equally into however many threads there are (this is the static schedule in OpenMP parlance). Rewording this to something like:

Pass a function to be executed in a for loop to the threading layer. The work is scheduled over threads by the threading layer.

Also, it would be good to mention loop carried dependencies. Is the assumption here that the work in each loop iteration is independent?

Contributor Author

The idea of this API is that it uses the default scheduling provided by the underlying implementation.
Of course, the execution might not be parallel for various reasons, for example, if only one thread is available.

It is expected that the iterations of the loop are logically independent, i.e. there are no recurrences among the iterations, but they might access the same data for reading or writing.
In the latter case, additional synchronization, like the use of atomics, mutual exclusion, or thread-local storage, is required.

I will reword the description to make it clearer.
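The point about logically independent iterations that still touch shared data can be illustrated with a hedged sketch: here all iterations update one shared counter, so the write needs synchronization. `count_negatives` is a hypothetical example, and std::atomic is just one of the options mentioned above (a mutex or thread-local partial counts would also work).

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Iterations are logically independent (no recurrence between them), but
// they all increment one shared counter, so that write must be synchronized.
int count_negatives(const double * data, int n)
{
    std::atomic<int> count(0); // shared accumulator, updated atomically
    const int nThreads = 4;
    std::vector<std::thread> workers;
    for (int t = 0; t < nThreads; ++t)
    {
        workers.emplace_back([=, &count]() {
            for (int i = t; i < n; i += nThreads)
                if (data[i] < 0.0) count.fetch_add(1, std::memory_order_relaxed);
        });
    }
    for (auto & w : workers) w.join();
    return count.load();
}
```

With a plain `int` counter instead of the atomic, the concurrent increments would race and the result would be unreliable.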

///
/// @param[in] n Number of iterations in the for loop.
/// @param[in] reserved Parameter reserved for the future. Currently unused.
/// @param[in] lambda Lambda function that defines iteration's body.
Contributor

Suggested change
/// @param[in] lambda Lambda function that defines iteration's body.
/// @param[in] lambda Lambda function that defines the loop body.

/// Execute the for loop defined by the input parameters in parallel.
/// The maximal number of iterations in the loop is 2^31 - 1.
/// The work is scheduled dynamically across threads.
/// The iteration space is chunked using oneTBB ``simple_partitioner``
Contributor

Is TBB always going to be the threading layer? It is now, but can this be different in the future? If this function or interface is TBB specific, is this documented somewhere? Might also be worth considering changing the name of the functions, e.g. tbb_for_simple

Contributor Author

I think it is impossible to say whether TBB will always be the threading layer or not. For now we do not have plans to migrate to other threading technologies like OpenMP, C++ std::thread and so on, but who knows what happens in the future?

I think I need to reword this and remove the TBB mention from the description.

The idea of this API is that the iterations are not grouped together; each iteration is considered a separate task.
Other APIs can group iterations so that a consecutive block of iterations is executed by one thread; this API tries to avoid that when possible.

@@ -321,10 +397,20 @@ class tls_deleter_ : public tls_deleter
virtual void del(void * a) { delete static_cast<lambdaType *>(a); }
};

/// Thread-local storage (TLS).
/// Can change its local variable after a nested parallel constructs.
/// @note Use carefully in case of nested parallel regions.
Contributor

How common is it to have nested parallelism, and how much do we want users of oneDAL to use it? If at all possible, I recommend not using nested parallelism, as this gets confusing quickly, and requires a lot of care. I would suggest either putting more of a health warning on here, e.g.

@note Thread local storage in nested parallel regions is, in general, not thread local. Nested parallelism should not be used in most cases, and if it is, extra care must be taken with thread local values.

But maybe just leaving out the note is better, as this starts to open a discussion point

Contributor Author

I think the confusion here is because those APIs are not intended to be used by the users of oneDAL.
The APIs are just not available as part of the oneDAL release header files.
They are only available to the developers.

Inside oneDAL we use nested parallelism in many cases. So it is important to warn the developers when using certain constructs might lead to issues.

Will reformulate the note as well.


.. include:: ../includes/threading/dot-parallel-partial-compute.rst

To compute the final result it is requred to reduce TLS's partial results over all threads
Contributor

Suggested change
To compute the final result it is requred to reduce TLS's partial results over all threads
To compute the final result it is required to reduce each thread's partial results


.. include:: ../includes/threading/dot-parallel-reduction.rst

Local memory of the threads should also be released when it is no longer needed.
Contributor

Suggested change
Local memory of the threads should also be released when it is no longer needed.
Local memory of the threads should be released when it is no longer needed.


Local memory of the threads should also be released when it is no longer needed.

The complete parallel verision of dot product computations would look like:
Contributor

Suggested change
The complete parallel verision of dot product computations would look like:
The complete parallel version of dot product computations would look like:

This strategy is beneficial when it is difficult to estimate the amount of work performed
by each iteration.

In the cases when it is known that the iterations perform an equal amount of work, it
Contributor

In all cases where you know how much work there is to do in the parallel regions, it will be faster to map the work onto threads using a static schedule. Drop the use of "it might be beneficial to use ..." in favour of "it is more performant to use ...".


-- `oneTBB documentation <https://www.intel.com/content/www/us/en/docs/onetbb/developer-guide-api-reference/2021-13/work-isolation.html>`_

In practice, this means that a thread-local variable might unexpectedly
Contributor

Can you define what you mean by thread-local? I usually think of thread local data to mean that only a single thread can read or write the value to thread-local storage. The thread-local storage class tls is intended to be used for thread local storage, but it cannot do so in the case of nested parallelism. The only thread local storage which is guaranteed to be thread local is in the leaf nodes of a graph of threads.

So something along the lines of:

If we define thread-local data to be data that can only be read and written by a single thread, then any thread which spawns other threads in general cannot have thread-local variables. This is true for variables which are defined as thread-local. This means that great care is required when using thread-local storage constructs in nested parallel regions to avoid issues caused by concurrent data access and ownership (e.g. data corruption due to races; segmentation faults; and deadlocks)

Contributor Author

By thread-local data I mean the data that is managed by a particular thread.
The issue here is that there is a difference between threads and logical tasks (the loop's iterations here).
One thread can be assigned to execute several iterations of the loop.

In the case of nested loops, the specific behavior of oneTBB is that when a thread executes an outer iteration i of a nested loop, it becomes a so-called waiting thread while it waits for the completion of the inner parallel loop. This waiting thread can then be assigned to perform another iteration j of the outer loop.
So it might first complete iteration j of the outer loop, which will change the state of the thread-local values, and then return to iteration i.

I tried to reflect this behavior in the documentation's text.

In general, I do not agree that a thread that spawns other threads cannot have thread-local data. But I agree that thread-local data should be used with great care in those cases.


Vika-F commented Aug 13, 2024

@keeranroth, I think I have addressed most of your feedback.
Can you please take a look at this PR one more time?

Labels: docs (Issue/PR related to oneDAL docs)
6 participants