
Fix dpctl.tensor.put race condition #1384

Conversation

PiotrekB416 (Contributor)

As shown in #1360, dpctl.tensor.put has a race condition when indices contains non-unique values.
This PR solves the issue by removing the unnecessary (duplicate) indices; a sketch of the idea follows the checklist below.

  • Have you provided a meaningful PR description?
  • Have you added a test, reproducer or referred to an issue with a reproducer?
  • Have you tested your changes locally for CPU and GPU devices?
  • Have you made sure that new changes do not introduce compiler warnings?
  • Have you checked performance impact of proposed changes?
  • If this PR is a work in progress, are you opening the PR as a draft?
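A minimal sketch of the deduplication idea (hedged: the helper put_unique below is hypothetical and only illustrates the approach, assuming put's (x, indices, vals) calling convention; it is not the literal patch):

import numpy as np
import dpctl.tensor as dpt

def put_unique(x, indices, vals):
    # Drop duplicate indices so that no two writes target the same
    # element; the value paired with the first occurrence of each
    # index wins, making the result deterministic.
    idx_np = dpt.asnumpy(indices)  # single device-to-host copy
    uniq, first = np.unique(idx_np, return_index=True)
    q = x.sycl_queue
    dpt.put(
        x,
        dpt.asarray(uniq, sycl_queue=q),
        dpt.take(vals, dpt.asarray(first, sycl_queue=q)),
    )

x = dpt.zeros(8, dtype="i4")
ind = dpt.asarray([1, 3, 3, 5])                  # index 3 repeats
vals = dpt.asarray([10, 20, 30, 40], dtype="i4")
put_unique(x, ind, vals)                         # x[3] == 20, no racing writes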

@intel-python-devops

Can one of the admins verify this patch?

@oleksandr-pavlyk (Collaborator)

I do not think this is the right change. First, indices can be given as a usm_ndarray instance and np.unique won't work. Second, checking for uniqueness is not cheap, and the cost grows with the size of the index array(s). I think it is best to leave this checking to the user, who can skip it altogether if the indices are known to be unique.

@PiotrekB416 (Contributor, Author)

> indices can be given as a usm_ndarray instance and np.unique won't work.

The function documentation says they are always usm_ndarray, and it does work, since np.unique accepts any array_like type.

Also, I used np.unique because there doesn't seem to be a dpctl implementation.

@ndgrigorian (Collaborator)

> I do not think this is the right change. First, indices can be given as a usm_ndarray instance and np.unique won't work. Second, checking for uniqueness is not cheap, and the cost grows with the size of the index array(s). I think it is best to leave this checking to the user, who can skip it altogether if the indices are known to be unique.

Just something to add to this opinion: I've looked around for the consensus on this matter.
Torch's equivalent of np.put warns that repeated indices are undefined behavior.
The array API specification's discussion on integer indexing regards it as undefined behavior as well, though put is not (yet) in the standard.

There's also no real use case for repeated index values in put that we would be trying to enable.

@PiotrekB416 (Contributor, Author)

Made the uniqueness checking optional (opt-in), so a user who doesn't want the check, or its performance impact, can simply leave it disabled.
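A minimal sketch of the opt-in shape (hedged: the check_unique keyword and this wrapper are hypothetical; the PR's actual parameter name may differ):

import numpy as np
import dpctl.tensor as dpt

def put(x, indices, vals, check_unique=False):
    # The uniqueness check runs only when the caller opts in, so the
    # default path pays no extra cost.
    if check_unique:
        idx_np = dpt.asnumpy(indices)  # device-to-host copy for the check
        if np.unique(idx_np).size != idx_np.size:
            raise ValueError("indices must not contain duplicate values")
    dpt.put(x, indices, vals)

Callers who know their indices are unique never pay for the check.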

@oleksandr-pavlyk (Collaborator)

@PiotrekB416 The reason NumPy functions work on usm_ndarray instances is that NumPy converts the usm_ndarray into an ndarray of dtype=object whose elements are views into the original usm_ndarray:

In [1]: import numpy as np, dpctl.tensor as dpt

In [2]: x = dpt.arange(6)

In [3]: np.asarray(x)
Out[3]:
array([usm_ndarray(0), usm_ndarray(1), usm_ndarray(2), usm_ndarray(3),
       usm_ndarray(4), usm_ndarray(5)], dtype=object)

In [4]: [Out[3][i].usm_data is x.usm_data for i in range(6)]
Out[4]: [True, True, True, True, True, True]

np.unique(x) identifies unique elements by sorting its input array. This is evident when using the onetrace tool from the pti-gpu tool suite:

(dev_dpctl) :~/repos/dpctl$ ~/pti-gpu/tools/onetrace/build/onetrace -t -v --demangle ipython
Device Timeline: start time (CLOCK_MONOTONIC_RAW) [ns] = 20090141451622
Device Timeline: start time (CLOCK_MONOTONIC) [ns] = 20090494043905
Device Timeline: start time (CLOCK_REALTIME) [ns] = 1693942195988714713
Python 3.10.12 | packaged by conda-forge | (main, Jun 23 2023, 22:40:32) [GCC 12.3.0]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.15.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import numpy as np, dpctl.tensor as dpt
Device Timeline: start time (CLOCK_MONOTONIC_RAW) [ns] = 20097356512283
Device Timeline: start time (CLOCK_MONOTONIC) [ns] = 20097709104316
Device Timeline: start time (CLOCK_REALTIME) [ns] = 1693942203203775129

Device Timeline: start time (CLOCK_MONOTONIC_RAW) [ns] = 20097457540440
Device Timeline: start time (CLOCK_MONOTONIC) [ns] = 20097810132419
Device Timeline: start time (CLOCK_REALTIME) [ns] = 1693942203304803237

Device Timeline: start time (CLOCK_MONOTONIC_RAW) [ns] = 20097577545323
Device Timeline: start time (CLOCK_MONOTONIC) [ns] = 20097930137633
Device Timeline: start time (CLOCK_REALTIME) [ns] = 1693942203424808444

Device Timeline: start time (CLOCK_MONOTONIC_RAW) [ns] = 20097725482356
Device Timeline: start time (CLOCK_MONOTONIC) [ns] = 20098078074736
Device Timeline: start time (CLOCK_REALTIME) [ns] = 1693942203572745550


In [2]: x = dpt.arange(6)
Device Timeline (queue: 0x55829e71faa0): dpctl::tensor::kernels::constructors::linear_sequence_step_kernel<long>[SIMD32 {1; 1; 1} {6; 1; 1}]<1.1> [ns] = 9800286076 (append) 9800286076 (submit) 9801676796 (start) 9802016636 (end)

In [3]: y = np.unique(x)
Device Timeline (queue: 0x55829e71faa0): dpctl::tensor::kernels::less::less_contig_kernel<long, long, bool, 4u, 2u>[SIMD32 {1; 1; 1} {64; 1; 1}]<2.1> [ns] = 18289684815 (append) 18289684815 (submit) 18289976255 (start) 18290311135 (end)
Device Timeline (queue: 0x55829d7f4f20): zeCommandListAppendMemoryCopy(D2M)[1 bytes]<3.1> [ns] = 18290838760 (append) 18290838760 (submit) 18290857160 (start) 18290866760 (end)
Device Timeline (queue: 0x55829e71faa0): dpctl::tensor::kernels::greater::greater_contig_kernel<long, long, bool, 4u, 2u>[SIMD32 {1; 1; 1} {64; 1; 1}]<4.1> [ns] = 18369333814 (append) 18369333814 (submit) 18369619494 (start) 18369919814 (end)
Device Timeline (queue: 0x55829d7f4f20): zeCommandListAppendMemoryCopy(D2M)[1 bytes]<5.1> [ns] = 18370199755 (append) 18370199755 (submit) 18370209035 (start) 18370209995 (end)
Device Timeline (queue: 0x55829e71faa0): dpctl::tensor::kernels::less::less_contig_kernel<long, long, bool, 4u, 2u>[SIMD32 {1; 1; 1} {64; 1; 1}]<6.1> [ns] = 18370487070 (append) 18370487070 (submit) 18370513790 (start) 18370517630 (end)
Device Timeline (queue: 0x55829d7f4f20): zeCommandListAppendMemoryCopy(D2M)[1 bytes]<7.1> [ns] = 18370764853 (append) 18370764853 (submit) 18370772773 (start) 18370773653 (end)
Device Timeline (queue: 0x55829e71faa0): dpctl::tensor::kernels::greater::greater_contig_kernel<long, long, bool, 4u, 2u>[SIMD32 {1; 1; 1} {64; 1; 1}]<8.1> [ns] = 18370970659 (append) 18370970659 (submit) 18370994179 (start) 18370998179 (end)
Device Timeline (queue: 0x55829d7f4f20): zeCommandListAppendMemoryCopy(D2M)[1 bytes]<9.1> [ns] = 18371388977 (append) 18371388977 (submit) 18371397377 (start) 18371398257 (end)
Device Timeline (queue: 0x55829e71faa0): dpctl::tensor::kernels::less::less_contig_kernel<long, long, bool, 4u, 2u>[SIMD32 {1; 1; 1} {64; 1; 1}]<10.1> [ns] = 18371594580 (append) 18371594580 (submit) 18371619300 (start) 18371623460 (end)
Device Timeline (queue: 0x55829d7f4f20): zeCommandListAppendMemoryCopy(D2M)[1 bytes]<11.1> [ns] = 18371922174 (append) 18371922174 (submit) 18371929854 (start) 18371930734 (end)
Device Timeline (queue: 0x55829e71faa0): dpctl::tensor::kernels::greater::greater_contig_kernel<long, long, bool, 4u, 2u>[SIMD32 {1; 1; 1} {64; 1; 1}]<12.1> [ns] = 18372096411 (append) 18372096411 (submit) 18372120411 (start) 18372125531 (end)
Device Timeline (queue: 0x55829d7f4f20): zeCommandListAppendMemoryCopy(D2M)[1 bytes]<13.1> [ns] = 18372426845 (append) 18372426845 (submit) 18372434445 (start) 18372435165 (end)
Device Timeline (queue: 0x55829e71faa0): dpctl::tensor::kernels::less::less_contig_kernel<long, long, bool, 4u, 2u>[SIMD32 {1; 1; 1} {64; 1; 1}]<14.1> [ns] = 18372583048 (append) 18372583048 (submit) 18372607208 (start) 18372612328 (end)
Device Timeline (queue: 0x55829d7f4f20): zeCommandListAppendMemoryCopy(D2M)[1 bytes]<15.1> [ns] = 18372884016 (append) 18372884016 (submit) 18372891136 (start) 18372892016 (end)
Device Timeline (queue: 0x55829e71faa0): dpctl::tensor::kernels::greater::greater_contig_kernel<long, long, bool, 4u, 2u>[SIMD32 {1; 1; 1} {64; 1; 1}]<16.1> [ns] = 18373031000 (append) 18373031000 (submit) 18373053560 (start) 18373058040 (end)
Device Timeline (queue: 0x55829d7f4f20): zeCommandListAppendMemoryCopy(D2M)[1 bytes]<17.1> [ns] = 18373345251 (append) 18373345251 (submit) 18373352531 (start) 18373353331 (end)
Device Timeline (queue: 0x55829e71faa0): dpctl::tensor::kernels::less::less_contig_kernel<long, long, bool, 4u, 2u>[SIMD32 {1; 1; 1} {64; 1; 1}]<18.1> [ns] = 18373500261 (append) 18373500261 (submit) 18373520421 (start) 18373524741 (end)
Device Timeline (queue: 0x55829d7f4f20): zeCommandListAppendMemoryCopy(D2M)[1 bytes]<19.1> [ns] = 18373782935 (append) 18373782935 (submit) 18373790375 (start) 18373791175 (end)
Device Timeline (queue: 0x55829e71faa0): dpctl::tensor::kernels::greater::greater_contig_kernel<long, long, bool, 4u, 2u>[SIMD32 {1; 1; 1} {64; 1; 1}]<20.1> [ns] = 18373928881 (append) 18373928881 (submit) 18373949841 (start) 18373951601 (end)
Device Timeline (queue: 0x55829d7f4f20): zeCommandListAppendMemoryCopy(D2M)[1 bytes]<21.1> [ns] = 18374233074 (append) 18374233074 (submit) 18374240434 (start) 18374241234 (end)
Device Timeline (queue: 0x55829e71faa0): dpctl::tensor::kernels::not_equal::not_equal_contig_kernel<long, long, bool, 4u, 2u>[SIMD32 {1; 1; 1} {64; 1; 1}]<22.1> [ns] = 18452361175 (append) 18452361175 (submit) 18452634375 (start) 18452959975 (end)
Device Timeline (queue: 0x55829d7f4f20): zeCommandListAppendMemoryCopy(D2M)[1 bytes]<23.1> [ns] = 18453199924 (append) 18453199924 (submit) 18453208004 (start) 18453208884 (end)
Device Timeline (queue: 0x55829e71faa0): dpctl::tensor::kernels::not_equal::not_equal_contig_kernel<long, long, bool, 4u, 2u>[SIMD32 {1; 1; 1} {64; 1; 1}]<24.1> [ns] = 18453443636 (append) 18453443636 (submit) 18453468436 (start) 18453472436 (end)
Device Timeline (queue: 0x55829d7f4f20): zeCommandListAppendMemoryCopy(D2M)[1 bytes]<25.1> [ns] = 18453757684 (append) 18453757684 (submit) 18453765604 (start) 18453766484 (end)
Device Timeline (queue: 0x55829e71faa0): dpctl::tensor::kernels::not_equal::not_equal_contig_kernel<long, long, bool, 4u, 2u>[SIMD32 {1; 1; 1} {64; 1; 1}]<26.1> [ns] = 18453929437 (append) 18453929437 (submit) 18453953757 (start) 18453957757 (end)
Device Timeline (queue: 0x55829d7f4f20): zeCommandListAppendMemoryCopy(D2M)[1 bytes]<27.1> [ns] = 18454246464 (append) 18454246464 (submit) 18454254064 (start) 18454254944 (end)
Device Timeline (queue: 0x55829e71faa0): dpctl::tensor::kernels::not_equal::not_equal_contig_kernel<long, long, bool, 4u, 2u>[SIMD32 {1; 1; 1} {64; 1; 1}]<28.1> [ns] = 18454404528 (append) 18454404528 (submit) 18454429168 (start) 18454433168 (end)
Device Timeline (queue: 0x55829d7f4f20): zeCommandListAppendMemoryCopy(D2M)[1 bytes]<29.1> [ns] = 18454717472 (append) 18454717472 (submit) 18454724592 (start) 18454725392 (end)
Device Timeline (queue: 0x55829e71faa0): dpctl::tensor::kernels::not_equal::not_equal_contig_kernel<long, long, bool, 4u, 2u>[SIMD32 {1; 1; 1} {64; 1; 1}]<30.1> [ns] = 18454869688 (append) 18454869688 (submit) 18454889608 (start) 18454894408 (end)
Device Timeline (queue: 0x55829d7f4f20): zeCommandListAppendMemoryCopy(D2M)[1 bytes]<31.1> [ns] = 18455011524 (append) 18455011524 (submit) 18455017844 (start) 18455018644 (end)

The pattern here is a flurry of offloads of a large number of light-weight kernels; the time is totally dominated by kernel submission overhead. So while np.unique(x) works, it is a rather inefficient way of accomplishing what we need.

Python 3.10.12 | packaged by conda-forge | (main, Jun 23 2023, 22:40:32) [GCC 12.3.0]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.15.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import numpy as np, dpctl.tensor as dpt

In [2]: x = dpt.arange(10**6, dtype="i4")

In [3]: x0, x1, x2, x3 = x[:10**3], x[:10**4], x[:10**5], x[:10**6]

In [4]: %timeit y = np.unique(x0)
1.18 s ± 249 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [5]: %timeit y = np.unique(x1)
13.3 s ± 455 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [6]: %time y = np.unique(x2)
CPU times: user 2min 42s, sys: 12.8 s, total: 2min 55s
Wall time: 2min 52s

The projected wall-time for np.unique(x3) is 30-35 minutes.
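For comparison, copying the data to the host once with dpt.asnumpy and running np.unique there avoids the per-element kernel flurry entirely (a sketch; timings will vary by device and transfer size):

import numpy as np
import dpctl.tensor as dpt

x = dpt.arange(10**6, dtype="i4")
x_np = dpt.asnumpy(x)   # one device-to-host transfer
y = np.unique(x_np)     # host-side, sort-based unique; no kernel submissions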

In the near future, dpctl.tensor will acquire a SYCL implementation of the unique function, which will do the job in far fewer kernel submissions and hence more efficiently. But even then I would delegate the task of removing duplicates to the user.

@oleksandr-pavlyk (Collaborator)

@PiotrekB416 The set functions, such as tensor.unique_indices, are now implemented in a data-parallel fashion with #1483, opening a performant way to address #1360 with minimal changes to this PR.
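A sketch of how the new set functions might be used to resolve duplicates before the put; this assumes dpt.unique_all with the array API convention that its indices field holds first occurrences, so the exact call the PR settles on may differ:

import dpctl.tensor as dpt

x = dpt.zeros(8, dtype="i4")
ind = dpt.asarray([1, 3, 3, 5])
vals = dpt.asarray([10, 20, 30, 40], dtype="i4")

# unique_all returns (values, indices, inverse_indices, counts);
# `indices` points at the first occurrence of each unique value.
res = dpt.unique_all(ind)
dpt.put(x, res.values, dpt.take(vals, res.indices))  # x[3] == 20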

@oleksandr-pavlyk (Collaborator)

Please install pre-commit locally to ensure that the pre-commit workflow passes. You can do this by running pip install pre-commit and then pre-commit install. This installs git hooks that run the linters on changes being committed.

@oleksandr-pavlyk (Collaborator)

I am going to close this PR, since #1653 has been merged.

Its docstring provides sample code, based on your idea, that shows how to resolve duplicates for 1D input.

The approach won't work as easily when a higher-dimensional array is used and the axis keyword is specified.
