
Add a chapter about threading layer into the docs #2848

Open
wants to merge 17 commits into base: main

Conversation

@Vika-F Vika-F commented Jul 16, 2024

  • Added a chapter about internal threading primitives to the oneDAL documentation
  • Improved the comments in the header that specifies those primitives' APIs (threading.h)

@Vika-F added the docs label (Issue/PR related to oneDAL docs) on Jul 16, 2024
Vika-F commented Jul 30, 2024

@emmwalsh Please review this PR, as it mostly deals with oneDAL docs.

Vika-F commented Jul 30, 2024

CONTRIBUTING.md Outdated

### Threading Layer

In the source code of the algorithms, oneDAL does not use threading primitives directly. All the threading primitives used within oneDAL form so called [threading layer](http://oneapi-src.github.io/oneDAL/contribution/threading.html). Contributors should leverage the primitives from the layer to implement parallel algorithms.
Contributor

Suggested change
In the source code of the algorithms, oneDAL does not use threading primitives directly. All the threading primitives used within oneDAL form so called [threading layer](http://oneapi-src.github.io/oneDAL/contribution/threading.html). Contributors should leverage the primitives from the layer to implement parallel algorithms.
In the source code of the algorithms, oneDAL does not use threading primitives directly. All the threading primitives used within oneDAL form are called the [threading layer](http://oneapi-src.github.io/oneDAL/contribution/threading.html). Contributors should leverage the primitives from the layer to implement parallel algorithms.

virtual ~tls()
{
d->del(voidLambda);
delete d;
_daal_del_tls_ptr(tlsPtr);
}

/// Access a local data of a thread by value
///
/// @return When first ionvoced by a thread, a lambda provided to the constructor is
Contributor

Suggested change
/// @return When first ionvoced by a thread, a lambda provided to the constructor is
/// @return When first invoked by a thread, a lambda provided to the constructor is

threader_for
************

Lets consider you need to compute an elementwise sum of two arrays and store the results
Contributor

Suggested change
Lets consider you need to compute an elementwise sum of two arrays and store the results
Consider a case where you need to compute an elementwise sum of two arrays and store the results


.. include:: ../includes/threading/sum-sequential.rst

There are several options available in oneDAL's threading layer to let the iterations of this code
Contributor

Suggested change
There are several options available in oneDAL's threading layer to let the iterations of this code
There are several options available in the threading layer of oneDAL to let the iterations of this code
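To make the example concrete, here is a hedged sketch of such an elementwise sum parallelized with plain std::thread. `threader_for_sketch` is a hypothetical stand-in that only mirrors the call shape of the documented loop primitive; the real threading layer delegates scheduling to oneTBB.

```cpp
#include <thread>
#include <vector>

// Minimal stand-in for a parallel for-loop primitive: iterations of [0, n)
// are distributed round-robin over a few std::threads. This only illustrates
// the call shape `for(n, [&](int i) { ... })`; it is not oneDAL code.
template <typename Body>
void threader_for_sketch(int n, const Body & body)
{
    const unsigned hw        = std::thread::hardware_concurrency();
    const int      nThreads  = hw ? static_cast<int>(hw) : 2;
    std::vector<std::thread> workers;
    for (int t = 0; t < nThreads; ++t)
    {
        workers.emplace_back([=, &body]() {
            // Thread t handles iterations t, t + nThreads, t + 2 * nThreads, ...
            for (int i = t; i < n; i += nThreads) body(i);
        });
    }
    for (auto & w : workers) w.join();
}

// Elementwise sum c[i] = a[i] + b[i] -- the loop body from the example above.
void parallel_sum(const float * a, const float * b, float * c, int n)
{
    threader_for_sketch(n, [=](int i) { c[i] = a[i] + b[i]; });
}
```

Each iteration writes a distinct element of `c`, so no synchronization between threads is needed here.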

Contributor

The term "oneDAL's" is not allowed by the namesdb.intel.com. There might be different rules if you are referring to the open source version, but for the most part, the open source name should follow the same rules as the branded name.

Contributor Author

That is new information for me. Thank you!
I have fixed all the mentions in this PR.

Parallel computations can be performed in two steps:

1. Compute partial dot product at each threaded.
2. Perform a reduction: Sum the partial results from all threads to compute the final dot product.
Contributor

Instead of "sum the partial results" should it be "add the partial results"?

Contributor Author

Probably yes. Not sure as I am not a native speaker. Modified to "add the partial results".
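The two-step scheme under discussion (partial dot products per thread, then adding the partial results) can be sketched in portable C++. This is a hypothetical illustration using std::thread and a plain vector of per-thread partials, not the actual thread-local-storage primitive from the threading layer.

```cpp
#include <thread>
#include <vector>

// Two-step parallel dot product:
//   1. Each thread accumulates a partial dot product over its iterations.
//   2. The partial results from all threads are added to form the final value.
double parallel_dot_sketch(const double * a, const double * b, int n)
{
    const int nThreads = 4; // fixed thread count keeps the sketch simple
    std::vector<double> partial(nThreads, 0.0); // one slot per thread
    std::vector<std::thread> workers;
    for (int t = 0; t < nThreads; ++t)
    {
        workers.emplace_back([=, &partial]() {
            double local = 0.0; // step 1: thread-local accumulation
            for (int i = t; i < n; i += nThreads) local += a[i] * b[i];
            partial[t] = local; // each thread writes only its own slot
        });
    }
    for (auto & w : workers) w.join();

    // Step 2: add the partial results from all threads.
    double result = 0.0;
    for (double p : partial) result += p;
    return result;
}
```

Writing to distinct slots of `partial` avoids data races; only the final sequential reduction touches all slots.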

by each iteration.

In the cases when it is known that the iterations perform equal amount of work it
might be benefitial to use pre-defined mapping of the loop's iterations to threads.
Contributor

Suggested change
might be benefitial to use pre-defined mapping of the loop's iterations to threads.
might be benefitial to use predefined mapping of the loop's iterations to threads.

******************

It is allowed to have nested parallel loops within oneDAL.
What is important to know is that
Contributor

Suggested change
What is important to know is that
It is important to know that:


-- `oneTBB documentation <https://www.intel.com/content/www/us/en/docs/onetbb/developer-guide-api-reference/2021-13/work-isolation.html>`_

In practice this means that, for example, a thread-local variable might unexpectedly
Contributor

Suggested change
In practice this means that, for example, a thread-local variable might unexpectedly
In practice, this means that a thread-local variable might unexpectedly

calls another parallel algorithm, the inner one, within a parallel region.

The inner algorithm in this case can also be called solely, without additional nesting.
And we do not want to always make it isolated.
Contributor

Suggested change
And we do not want to always make it isolated.
And we do not always want to make it isolated.

The inner algorithm in this case can also be called solely, without additional nesting.
And we do not want to always make it isolated.

For the cases like that oneDAL provides ``daal::ls``. Its ``local()`` method always
Contributor

Suggested change
For the cases like that oneDAL provides ``daal::ls``. Its ``local()`` method always
For the cases like that, oneDAL provides ``daal::ls``. Its ``local()`` method always

@emmwalsh left a comment

LGTM!!

/// Let ``t`` be the number of threads available to oneDAL.
///
/// Then the number of iterations processed by each threads (except maybe the last one)
/// is copmputed as:
Contributor

Suggested change
/// is copmputed as:
/// is computed as:

/// is copmputed as:
/// ``nI = (n + t - 1) / t``
///
/// Here is how the work in split across the threads:
Contributor

Suggested change
/// Here is how the work in split across the threads:
/// Here is how the work is split across the threads:
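The splitting rule from the documented comment can be checked with a small helper. `static_chunk` and `IterationRange` are hypothetical names, shown only to illustrate the arithmetic `nI = (n + t - 1) / t` and how the last thread may receive fewer iterations.

```cpp
// Work split used by a static schedule: with n iterations and t threads,
// each thread (except maybe the last) processes nI = (n + t - 1) / t
// iterations, and thread k gets [k * nI, min(n, (k + 1) * nI)).
struct IterationRange
{
    int first; // index of the first iteration assigned to the thread
    int last;  // one past the last assigned iteration
};

IterationRange static_chunk(int n, int t, int k)
{
    const int nI = (n + t - 1) / t; // ceiling of n / t
    int first = k * nI;
    if (first > n) first = n;       // threads past the end get an empty range
    int last = first + nI;
    if (last > n) last = n;         // the last thread may get fewer iterations
    return { first, last };
}
```

For example, with n = 10 and t = 4, nI = 3, so the threads receive [0, 3), [3, 6), [6, 9), and [9, 10).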

computations on CPU.

But oneTBB is not used in the code of oneDAL algorithms directly. The algorithms rather
use custom primitives that either wrap oneTBB functionality or are inhome developed.
Contributor

Suggested change
use custom primitives that either wrap oneTBB functionality or are inhome developed.
use custom primitives that either wrap oneTBB functionality or are developed in-house.


The API of the layer is defined in
`threading.h <https://github.com/oneapi-src/oneDAL/blob/main/cpp/daal/src/threading/threading.h>`_.
Please be aware that those APIs are not publicly defined. So they can be changed at any time
Contributor

Suggested change
Please be aware that those APIs are not publicly defined. So they can be changed at any time
Please be aware that those APIs are not publicly defined, so they can be changed at any time

Comment on lines 128 to 129
``daal::static_threader_for`` and ``daal::static_tls`` allow to implement static
work scheduling within oneDAL.
Contributor

Suggested change
``daal::static_threader_for`` and ``daal::static_tls`` allow to implement static
work scheduling within oneDAL.
``daal::static_threader_for`` and ``daal::static_tls`` allow implementation of
static work scheduling within oneDAL.

Nested parallelism
******************

It is allowed to have nested parallel loops within oneDAL.
Contributor

Suggested change
It is allowed to have nested parallel loops within oneDAL.
OneDAL supports nested parallel loops.

@Vika-F Vika-F removed the request for review from maria-Petrova July 31, 2024 08:55

Vika-F commented Jul 31, 2024

Thank you @bdmoore1, @ethanglaser, and @emmwalsh for the thorough reviews.
I've made all the proposed modifications. Hope to merge soon.

@napetrov commented

@keeranroth @rakshithgb-fujitsu - please check this one: how clearly the topic is described and whether you would like any additional details.

And for the future: topics beyond threading that might be worth adding.

by each iteration.

In the cases when it is known that the iterations perform an equal amount of work, it
might be benefitial to use predefined mapping of the loop's iterations to threads.
Contributor

Suggested change
might be benefitial to use predefined mapping of the loop's iterations to threads.
might be beneficial to use predefined mapping of the loop's iterations to threads.

@ethanglaser left a comment

Just one more minor spelling adjustment mentioned above. Otherwise this looks great!

Comment on lines 230 to 231
/// @tparam F Lambda function of type ``[/* captures */](int i) -> void``,
/// where ``i`` is the loop's iteration index, ``0 <= i < n``.
Contributor

Just double checking, but I've always used single backtick for monospaced in doxygen. Is the double backtick valid?

Contributor Author

Thanks for noticing this. We use not only Doxygen, but also reStructuredText plus some Python pre-processing.
The double backtick is valid, but it results in italic text.
I'm going to replace it with a single backtick to get monospace.

@@ -223,14 +223,33 @@ inline void threader_func_break(int i, bool & needBreak, const void * a)
lambda(i, needBreak);
}

/// Execute the for loop defined by the input parameters in parallel.
/// The maximal number of iterations in the loop is 2^31 - 1.
/// The work is scheduled dynamically across threads.
Contributor

The comment suggests that the work will be executed in parallel, but it may not be the case. It depends on the implementation behind the abstraction. This function passes the parameters to the threading layer, which may execute the work in parallel. And must the work be scheduled dynamically? Can the threading layer decide to just split n equally into however many threads there are (this is the static schedule in OpenMP parlance). Rewording this to something like:

Pass a function to be executed in a for loop to the threading layer. The work is scheduled over threads by the threading layer.

Also, it would be good to mention loop carried dependencies. Is the assumption here that the work in each loop iteration is independent?

Contributor Author

The idea of this API is that it uses the default scheduling provided by the underlying implementation.
Of course, the execution might not be parallel for various reasons, for example, if only one thread is available.

It is expected that the iterations of the loop are logically independent, i.e. there are no recurrences among the iterations, but they might access the same data for reading or writing.
In the latter case, additional synchronization, like the use of atomics, mutual exclusion, or thread-local storage, is required.

I will reword the description to make it clearer.
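The point about logically independent iterations that still touch shared data can be illustrated with a hedged sketch: here all iterations update one shared counter, so the write needs synchronization. `count_negatives` is a hypothetical example, and std::atomic is just one of the options mentioned above (a mutex or thread-local partial counts would also work).

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Iterations are logically independent (no recurrence between them), but
// they all increment one shared counter, so that write must be synchronized.
int count_negatives(const double * data, int n)
{
    std::atomic<int> count(0); // shared accumulator, updated atomically
    const int nThreads = 4;
    std::vector<std::thread> workers;
    for (int t = 0; t < nThreads; ++t)
    {
        workers.emplace_back([=, &count]() {
            for (int i = t; i < n; i += nThreads)
                if (data[i] < 0.0) count.fetch_add(1, std::memory_order_relaxed);
        });
    }
    for (auto & w : workers) w.join();
    return count.load();
}
```

With a plain `int` counter instead of the atomic, the concurrent increments would race and the result would be unreliable.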

///
/// @param[in] n Number of iterations in the for loop.
/// @param[in] reserved Parameter reserved for the future. Currently unused.
/// @param[in] lambda Lambda function that defines iteration's body.
Contributor

Suggested change
/// @param[in] lambda Lambda function that defines iteration's body.
/// @param[in] lambda Lambda function that defines the loop body.

/// Execute the for loop defined by the input parameters in parallel.
/// The maximal number of iterations in the loop is 2^31 - 1.
/// The work is scheduled dynamically across threads.
/// The iteration space is chunked using oneTBB ``simple_partitioner``
Contributor

Is TBB always going to be the threading layer? It is now, but can this be different in the future? If this function or interface is TBB specific, is this documented somewhere? Might also be worth considering changing the name of the functions, e.g. tbb_for_simple

Contributor Author

I think it is impossible to say whether TBB will always be the threading layer or not. For now we do not have plans to migrate to other threading technologies like OpenMP, C++ std::thread and so on, but who knows what happens in the future?

I think I need to reword this and remove the TBB mention from the description.

The idea of this API is that the iterations are not grouped together; each iteration is considered a separate task.
Other APIs can group iterations so that a consecutive block of iterations is executed by one thread; this API tries to avoid that when possible.

@@ -321,10 +397,20 @@ class tls_deleter_ : public tls_deleter
virtual void del(void * a) { delete static_cast<lambdaType *>(a); }
};

/// Thread-local storage (TLS).
/// Can change its local variable after a nested parallel constructs.
/// @note Use carefully in case of nested parallel regions.
Contributor

How common is it to have nested parallelism, and how much do we want users of oneDAL to use it? If at all possible, I recommend not using nested parallelism, as this gets confusing quickly, and requires a lot of care. I would suggest either putting more of a health warning on here, e.g.

@note Thread local storage in nested parallel regions is, in general, not thread local. Nested parallelism should not be used in most cases, and if it is, extra care must be taken with thread local values.

But maybe just leaving out the note is better, as this starts to open a discussion point

Contributor Author

I think the confusion here is because those APIs are not intended to be used by the users of oneDAL.
The APIs are just not available as part of the oneDAL release header files.
They are only available to the developers.

Inside oneDAL we use nested parallelism in many cases. So it is important to warn the developers when using certain constructs might lead to issues.

Will reformulate the note as well.


.. include:: ../includes/threading/dot-parallel-partial-compute.rst

To compute the final result it is requred to reduce TLS's partial results over all threads
Contributor

Suggested change
To compute the final result it is requred to reduce TLS's partial results over all threads
To compute the final result it is required to reduce each thread's partial results


.. include:: ../includes/threading/dot-parallel-reduction.rst

Local memory of the threads should also be released when it is no longer needed.
Contributor

Suggested change
Local memory of the threads should also be released when it is no longer needed.
Local memory of the threads should be released when it is no longer needed.


Local memory of the threads should also be released when it is no longer needed.

The complete parallel verision of dot product computations would look like:
Contributor

Suggested change
The complete parallel verision of dot product computations would look like:
The complete parallel version of dot product computations would look like:

This strategy is beneficial when it is difficult to estimate the amount of work performed
by each iteration.

In the cases when it is known that the iterations perform an equal amount of work, it
Contributor

In all cases where you know how much work there is to do in the parallel regions, it will be faster to map the work onto threads using a static schedule. Drop the use of "it might be beneficial to use ..." in favour of "it is more performant to use ...".


-- `oneTBB documentation <https://www.intel.com/content/www/us/en/docs/onetbb/developer-guide-api-reference/2021-13/work-isolation.html>`_

In practice, this means that a thread-local variable might unexpectedly
Contributor

Can you define what you mean by thread-local? I usually think of thread local data to mean that only a single thread can read or write the value to thread-local storage. The thread-local storage class tls is intended to be used for thread local storage, but it cannot do so in the case of nested parallelism. The only thread local storage which is guaranteed to be thread local is in the leaf nodes of a graph of threads.

So something along the lines of:

If we define thread-local data to be data that can only be read and written by a single thread, then any thread which spawns other threads in general cannot have thread-local variables. This is true for variables which are defined as thread-local. This means that great care is required when using thread-local storage constructs in nested parallel regions to avoid issues caused by concurrent data access and ownership (e.g. data corruption due to races; segmentation faults; and deadlocks)

Contributor Author

By thread-local data I mean the data that is managed by a particular thread.
The issue here is that there is a difference between threads and logical tasks (the loop's iterations here).
One thread can be assigned to execute several iterations of the loop.

In the case of nested loops, the specific behavior of oneTBB is that when a thread executes an outer iteration i of a nested loop, it becomes a so-called waiting thread while it waits for the completion of the inner parallel loop. This waiting thread can then be assigned to perform another iteration j of the outer loop.
So it might first complete iteration j of the outer loop, which will change the state of the thread-local values, and then return to iteration i.

I tried to reflect this behavior in the documentation's text.

In general, I do not agree that a thread that spawns other threads cannot have thread-local data. But I agree that thread-local data should be used with great care in those cases.


Vika-F commented Aug 13, 2024

@keeranroth, I think I have addressed most of your feedback.
Can you please take a look at this PR one more time?

Labels: docs (Issue/PR related to oneDAL docs)
6 participants