
Fix dpctl.tensor.put race condition #1384

Conversation

PiotrekB416 (Contributor)

As shown in #1360, dpctl.tensor.put has a race condition when indices contains non-unique values.
This PR solves the issue by removing the unnecessary (duplicate) indices; a sketch of the idea follows the checklist below.

  • Have you provided a meaningful PR description?
  • Have you added a test, reproducer or referred to an issue with a reproducer?
  • Have you tested your changes locally for CPU and GPU devices?
  • Have you made sure that new changes do not introduce compiler warnings?
  • Have you checked performance impact of proposed changes?
  • If this PR is a work in progress, are you opening the PR as a draft?
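A minimal sketch of the deduplication idea (hedged: the helper put_unique below is hypothetical and only illustrates the approach, assuming put's (x, indices, vals) calling convention; it is not the literal patch):

import numpy as np
import dpctl.tensor as dpt

def put_unique(x, indices, vals):
    # Drop duplicate indices so that no two writes target the same
    # element; the value paired with the first occurrence of each
    # index wins, making the result deterministic.
    idx_np = dpt.asnumpy(indices)  # single device-to-host copy
    uniq, first = np.unique(idx_np, return_index=True)
    q = x.sycl_queue
    dpt.put(
        x,
        dpt.asarray(uniq, sycl_queue=q),
        dpt.take(vals, dpt.asarray(first, sycl_queue=q)),
    )

x = dpt.zeros(8, dtype="i4")
ind = dpt.asarray([1, 3, 3, 5])                  # index 3 repeats
vals = dpt.asarray([10, 20, 30, 40], dtype="i4")
put_unique(x, ind, vals)                         # x[3] == 20, no racing writes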

@intel-python-devops

Can one of the admins verify this patch?

@oleksandr-pavlyk (Collaborator)

I do not think this is the right change. First, indices can be given as a usm_ndarray instance and np.unique won't work. Second, checking for uniqueness is not cheap, and the cost grows with the size of the index array(s). I think it is best to leave this checking to the user, who can skip it altogether if the indices are known to be unique.

@PiotrekB416 (Contributor, Author)

> indices can be given as a usm_ndarray instance and np.unique won't work.

The function documentation says they are always usm_ndarray, and it does work, since np.unique accepts any array_like type.

Also, I used np.unique because there doesn't seem to be a dpctl implementation.

@ndgrigorian (Collaborator)

> I do not think this is the right change. First, indices can be given as a usm_ndarray instance and np.unique won't work. Second, checking for uniqueness is not cheap, and the cost grows with the size of the index array(s). I think it is best to leave this checking to the user, who can skip it altogether if the indices are known to be unique.

Just something to add to this opinion: I've looked around for the consensus on this matter.
Torch's equivalent of np.put warns that repeated indices are undefined behavior.
The array API specification's discussion on integer indexing regards it as undefined behavior as well, though put is not (yet) in the standard.

There's also no real use case for repeated index values in put that we would be trying to enable.

@PiotrekB416 (Contributor, Author)

Made the uniqueness checking optional (opt-in), so a user who doesn't want the check, or its performance impact, can simply leave it disabled.
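A minimal sketch of the opt-in shape (hedged: the check_unique keyword and this wrapper are hypothetical; the PR's actual parameter name may differ):

import numpy as np
import dpctl.tensor as dpt

def put(x, indices, vals, check_unique=False):
    # The uniqueness check runs only when the caller opts in, so the
    # default path pays no extra cost.
    if check_unique:
        idx_np = dpt.asnumpy(indices)  # device-to-host copy for the check
        if np.unique(idx_np).size != idx_np.size:
            raise ValueError("indices must not contain duplicate values")
    dpt.put(x, indices, vals)

Callers who know their indices are unique never pay for the check.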

@oleksandr-pavlyk (Collaborator)

@PiotrekB416 The reason NumPy functions work on usm_ndarray instances is that NumPy converts the usm_ndarray into an ndarray of dtype=object whose elements are views into the original usm_ndarray:

In [1]: import numpy as np, dpctl.tensor as dpt

In [2]: x = dpt.arange(6)

In [3]: np.asarray(x)
Out[3]:
array([usm_ndarray(0), usm_ndarray(1), usm_ndarray(2), usm_ndarray(3),
       usm_ndarray(4), usm_ndarray(5)], dtype=object)

In [4]: [Out[3][i].usm_data is x.usm_data for i in range(6)]
Out[4]: [True, True, True, True, True, True]

np.unique(x) identifies unique elements by sorting its input array. This is evident when using the onetrace tool from the pti-gpu tool suite:

(dev_dpctl) :~/repos/dpctl$ ~/pti-gpu/tools/onetrace/build/onetrace -t -v --demangle ipython
Device Timeline: start time (CLOCK_MONOTONIC_RAW) [ns] = 20090141451622
Device Timeline: start time (CLOCK_MONOTONIC) [ns] = 20090494043905
Device Timeline: start time (CLOCK_REALTIME) [ns] = 1693942195988714713
Python 3.10.12 | packaged by conda-forge | (main, Jun 23 2023, 22:40:32) [GCC 12.3.0]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.15.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import numpy as np, dpctl.tensor as dpt
Device Timeline: start time (CLOCK_MONOTONIC_RAW) [ns] = 20097356512283
Device Timeline: start time (CLOCK_MONOTONIC) [ns] = 20097709104316
Device Timeline: start time (CLOCK_REALTIME) [ns] = 1693942203203775129

Device Timeline: start time (CLOCK_MONOTONIC_RAW) [ns] = 20097457540440
Device Timeline: start time (CLOCK_MONOTONIC) [ns] = 20097810132419
Device Timeline: start time (CLOCK_REALTIME) [ns] = 1693942203304803237

Device Timeline: start time (CLOCK_MONOTONIC_RAW) [ns] = 20097577545323
Device Timeline: start time (CLOCK_MONOTONIC) [ns] = 20097930137633
Device Timeline: start time (CLOCK_REALTIME) [ns] = 1693942203424808444

Device Timeline: start time (CLOCK_MONOTONIC_RAW) [ns] = 20097725482356
Device Timeline: start time (CLOCK_MONOTONIC) [ns] = 20098078074736
Device Timeline: start time (CLOCK_REALTIME) [ns] = 1693942203572745550


In [2]: x = dpt.arange(6)
Device Timeline (queue: 0x55829e71faa0): dpctl::tensor::kernels::constructors::linear_sequence_step_kernel<long>[SIMD32 {1; 1; 1} {6; 1; 1}]<1.1> [ns] = 9800286076 (append) 9800286076 (submit) 9801676796 (start) 9802016636 (end)

In [3]: y = np.unique(x)
Device Timeline (queue: 0x55829e71faa0): dpctl::tensor::kernels::less::less_contig_kernel<long, long, bool, 4u, 2u>[SIMD32 {1; 1; 1} {64; 1; 1}]<2.1> [ns] = 18289684815 (append) 18289684815 (submit) 18289976255 (start) 18290311135 (end)
Device Timeline (queue: 0x55829d7f4f20): zeCommandListAppendMemoryCopy(D2M)[1 bytes]<3.1> [ns] = 18290838760 (append) 18290838760 (submit) 18290857160 (start) 18290866760 (end)
Device Timeline (queue: 0x55829e71faa0): dpctl::tensor::kernels::greater::greater_contig_kernel<long, long, bool, 4u, 2u>[SIMD32 {1; 1; 1} {64; 1; 1}]<4.1> [ns] = 18369333814 (append) 18369333814 (submit) 18369619494 (start) 18369919814 (end)
Device Timeline (queue: 0x55829d7f4f20): zeCommandListAppendMemoryCopy(D2M)[1 bytes]<5.1> [ns] = 18370199755 (append) 18370199755 (submit) 18370209035 (start) 18370209995 (end)
Device Timeline (queue: 0x55829e71faa0): dpctl::tensor::kernels::less::less_contig_kernel<long, long, bool, 4u, 2u>[SIMD32 {1; 1; 1} {64; 1; 1}]<6.1> [ns] = 18370487070 (append) 18370487070 (submit) 18370513790 (start) 18370517630 (end)
Device Timeline (queue: 0x55829d7f4f20): zeCommandListAppendMemoryCopy(D2M)[1 bytes]<7.1> [ns] = 18370764853 (append) 18370764853 (submit) 18370772773 (start) 18370773653 (end)
Device Timeline (queue: 0x55829e71faa0): dpctl::tensor::kernels::greater::greater_contig_kernel<long, long, bool, 4u, 2u>[SIMD32 {1; 1; 1} {64; 1; 1}]<8.1> [ns] = 18370970659 (append) 18370970659 (submit) 18370994179 (start) 18370998179 (end)
Device Timeline (queue: 0x55829d7f4f20): zeCommandListAppendMemoryCopy(D2M)[1 bytes]<9.1> [ns] = 18371388977 (append) 18371388977 (submit) 18371397377 (start) 18371398257 (end)
Device Timeline (queue: 0x55829e71faa0): dpctl::tensor::kernels::less::less_contig_kernel<long, long, bool, 4u, 2u>[SIMD32 {1; 1; 1} {64; 1; 1}]<10.1> [ns] = 18371594580 (append) 18371594580 (submit) 18371619300 (start) 18371623460 (end)
Device Timeline (queue: 0x55829d7f4f20): zeCommandListAppendMemoryCopy(D2M)[1 bytes]<11.1> [ns] = 18371922174 (append) 18371922174 (submit) 18371929854 (start) 18371930734 (end)
Device Timeline (queue: 0x55829e71faa0): dpctl::tensor::kernels::greater::greater_contig_kernel<long, long, bool, 4u, 2u>[SIMD32 {1; 1; 1} {64; 1; 1}]<12.1> [ns] = 18372096411 (append) 18372096411 (submit) 18372120411 (start) 18372125531 (end)
Device Timeline (queue: 0x55829d7f4f20): zeCommandListAppendMemoryCopy(D2M)[1 bytes]<13.1> [ns] = 18372426845 (append) 18372426845 (submit) 18372434445 (start) 18372435165 (end)
Device Timeline (queue: 0x55829e71faa0): dpctl::tensor::kernels::less::less_contig_kernel<long, long, bool, 4u, 2u>[SIMD32 {1; 1; 1} {64; 1; 1}]<14.1> [ns] = 18372583048 (append) 18372583048 (submit) 18372607208 (start) 18372612328 (end)
Device Timeline (queue: 0x55829d7f4f20): zeCommandListAppendMemoryCopy(D2M)[1 bytes]<15.1> [ns] = 18372884016 (append) 18372884016 (submit) 18372891136 (start) 18372892016 (end)
Device Timeline (queue: 0x55829e71faa0): dpctl::tensor::kernels::greater::greater_contig_kernel<long, long, bool, 4u, 2u>[SIMD32 {1; 1; 1} {64; 1; 1}]<16.1> [ns] = 18373031000 (append) 18373031000 (submit) 18373053560 (start) 18373058040 (end)
Device Timeline (queue: 0x55829d7f4f20): zeCommandListAppendMemoryCopy(D2M)[1 bytes]<17.1> [ns] = 18373345251 (append) 18373345251 (submit) 18373352531 (start) 18373353331 (end)
Device Timeline (queue: 0x55829e71faa0): dpctl::tensor::kernels::less::less_contig_kernel<long, long, bool, 4u, 2u>[SIMD32 {1; 1; 1} {64; 1; 1}]<18.1> [ns] = 18373500261 (append) 18373500261 (submit) 18373520421 (start) 18373524741 (end)
Device Timeline (queue: 0x55829d7f4f20): zeCommandListAppendMemoryCopy(D2M)[1 bytes]<19.1> [ns] = 18373782935 (append) 18373782935 (submit) 18373790375 (start) 18373791175 (end)
Device Timeline (queue: 0x55829e71faa0): dpctl::tensor::kernels::greater::greater_contig_kernel<long, long, bool, 4u, 2u>[SIMD32 {1; 1; 1} {64; 1; 1}]<20.1> [ns] = 18373928881 (append) 18373928881 (submit) 18373949841 (start) 18373951601 (end)
Device Timeline (queue: 0x55829d7f4f20): zeCommandListAppendMemoryCopy(D2M)[1 bytes]<21.1> [ns] = 18374233074 (append) 18374233074 (submit) 18374240434 (start) 18374241234 (end)
Device Timeline (queue: 0x55829e71faa0): dpctl::tensor::kernels::not_equal::not_equal_contig_kernel<long, long, bool, 4u, 2u>[SIMD32 {1; 1; 1} {64; 1; 1}]<22.1> [ns] = 18452361175 (append) 18452361175 (submit) 18452634375 (start) 18452959975 (end)
Device Timeline (queue: 0x55829d7f4f20): zeCommandListAppendMemoryCopy(D2M)[1 bytes]<23.1> [ns] = 18453199924 (append) 18453199924 (submit) 18453208004 (start) 18453208884 (end)
Device Timeline (queue: 0x55829e71faa0): dpctl::tensor::kernels::not_equal::not_equal_contig_kernel<long, long, bool, 4u, 2u>[SIMD32 {1; 1; 1} {64; 1; 1}]<24.1> [ns] = 18453443636 (append) 18453443636 (submit) 18453468436 (start) 18453472436 (end)
Device Timeline (queue: 0x55829d7f4f20): zeCommandListAppendMemoryCopy(D2M)[1 bytes]<25.1> [ns] = 18453757684 (append) 18453757684 (submit) 18453765604 (start) 18453766484 (end)
Device Timeline (queue: 0x55829e71faa0): dpctl::tensor::kernels::not_equal::not_equal_contig_kernel<long, long, bool, 4u, 2u>[SIMD32 {1; 1; 1} {64; 1; 1}]<26.1> [ns] = 18453929437 (append) 18453929437 (submit) 18453953757 (start) 18453957757 (end)
Device Timeline (queue: 0x55829d7f4f20): zeCommandListAppendMemoryCopy(D2M)[1 bytes]<27.1> [ns] = 18454246464 (append) 18454246464 (submit) 18454254064 (start) 18454254944 (end)
Device Timeline (queue: 0x55829e71faa0): dpctl::tensor::kernels::not_equal::not_equal_contig_kernel<long, long, bool, 4u, 2u>[SIMD32 {1; 1; 1} {64; 1; 1}]<28.1> [ns] = 18454404528 (append) 18454404528 (submit) 18454429168 (start) 18454433168 (end)
Device Timeline (queue: 0x55829d7f4f20): zeCommandListAppendMemoryCopy(D2M)[1 bytes]<29.1> [ns] = 18454717472 (append) 18454717472 (submit) 18454724592 (start) 18454725392 (end)
Device Timeline (queue: 0x55829e71faa0): dpctl::tensor::kernels::not_equal::not_equal_contig_kernel<long, long, bool, 4u, 2u>[SIMD32 {1; 1; 1} {64; 1; 1}]<30.1> [ns] = 18454869688 (append) 18454869688 (submit) 18454889608 (start) 18454894408 (end)
Device Timeline (queue: 0x55829d7f4f20): zeCommandListAppendMemoryCopy(D2M)[1 bytes]<31.1> [ns] = 18455011524 (append) 18455011524 (submit) 18455017844 (start) 18455018644 (end)

The pattern here is a flurry of offloads of a large number of light-weight kernels; the time is totally dominated by kernel submission overhead. So while np.unique(x) works, it is a rather inefficient way of accomplishing what we need.

Python 3.10.12 | packaged by conda-forge | (main, Jun 23 2023, 22:40:32) [GCC 12.3.0]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.15.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import numpy as np, dpctl.tensor as dpt

In [2]: x = dpt.arange(10**6, dtype="i4")

In [3]: x0, x1, x2, x3 = x[:10**3], x[:10**4], x[:10**5], x[:10**6]

In [4]: %timeit y = np.unique(x0)
1.18 s ± 249 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [5]: %timeit y = np.unique(x1)
13.3 s ± 455 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [6]: %time y = np.unique(x2)
CPU times: user 2min 42s, sys: 12.8 s, total: 2min 55s
Wall time: 2min 52s

The projected wall-time for np.unique(x3) is 30-35 minutes.
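For comparison, copying the data to the host once with dpt.asnumpy and running np.unique there avoids the per-element kernel flurry entirely (a sketch; timings will vary by device and transfer size):

import numpy as np
import dpctl.tensor as dpt

x = dpt.arange(10**6, dtype="i4")
x_np = dpt.asnumpy(x)   # one device-to-host transfer
y = np.unique(x_np)     # host-side, sort-based unique; no kernel submissions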

In the near future, dpctl.tensor will acquire a SYCL implementation of the unique function, which will do the job in far fewer kernel submissions and hence more efficiently. But even then I would delegate the task of removing duplicates to the user.

@oleksandr-pavlyk (Collaborator)

@PiotrekB416 The set functions, such as tensor.unique_indices, are now implemented in a data-parallel fashion with #1483, opening a performant way to address #1360 with minimal changes to this PR.
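A sketch of how the new set functions might be used to resolve duplicates before the put; this assumes dpt.unique_all with the array API convention that its indices field holds first occurrences, so the exact call the PR settles on may differ:

import dpctl.tensor as dpt

x = dpt.zeros(8, dtype="i4")
ind = dpt.asarray([1, 3, 3, 5])
vals = dpt.asarray([10, 20, 30, 40], dtype="i4")

# unique_all returns (values, indices, inverse_indices, counts);
# `indices` points at the first occurrence of each unique value.
res = dpt.unique_all(ind)
dpt.put(x, res.values, dpt.take(vals, res.indices))  # x[3] == 20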

@oleksandr-pavlyk (Collaborator)

Please install pre-commit locally to ensure that the pre-commit workflow passes. You can do this by running pip install pre-commit and then pre-commit install. This installs git hooks that run the linters on changes being committed.

@oleksandr-pavlyk (Collaborator)

I am going to close this PR, since #1653 has been merged.

Its docstring provides sample code, based on your idea, that shows how to resolve duplicates for 1D input.

The approach won't work as easily when a higher-dimensional array is used and the axis keyword is specified.
