Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Python] pyarrow.compute.subtract_checked overflowing for some duration arrays constructed from numpy #35088

Open
lukemanley opened this issue Apr 12, 2023 · 3 comments

Comments

@lukemanley
Copy link

Describe the bug, including details regarding any error messages, version, and platform.

In the example below, arr2 and arr3 are duration arrays with a single null element.

arr2 is constructed from a list
arr3 is constructed from a numpy array

Once constructed, they evaluate to being equal.

However, they exhibit different behavior once passed to pyarrow.compute.subtract_checked:

import pyarrow as pa
import pyarrow.compute as pc
import numpy as np

data1 = [86400000000]
data2 = [None]
data3 = np.array([None], dtype="timedelta64[ns]")

arr1 = pa.array(data1, type=pa.duration("ns"))
arr2 = pa.array(data2, type=pa.duration("ns"))
arr3 = pa.array(data3, type=pa.duration("ns"))

assert arr2 == arr3

pc.subtract_checked(arr1, arr2)  # ok
pc.subtract_checked(arr1, arr3)  # ArrowInvalid: overflow 

Component(s)

Python

@jorisvandenbossche
Copy link
Member

@lukemanley thanks for the report. This is an interesting bug .. The difference between both arrays that appear to be the same, is that the actual data buffer is different, because of being created differently (but the data is being masked because they are null, and so the actual value "behind" that null shouldn't matter in theory).
"Viewing" the data buffer as an int64 array to see the values:

In [20]: pa.Array.from_buffers(pa.int64(), 1, [None, arr2.buffers()[1]])
Out[20]: 
<pyarrow.lib.Int64Array object at 0x7f4c1af64820>
[
  0
]

In [21]: pa.Array.from_buffers(pa.int64(), 1, [None, arr3.buffers()[1]])
Out[21]: 
<pyarrow.lib.Int64Array object at 0x7f4bf5998dc0>
[
  -9223372036854775808
]

And so my assumption is that the overflow comes from actually subtracting the values in the second case (86400000000 - (-9223372036854775808) would indeed overflow.

However, the way that the "substract_checked" is implemented, should normally only do the actual substraction for data values that are not being masked as null, exactly to avoid situations like the above. But it seems there is a bug in this mechanism to skip values behind nulls.

@jorisvandenbossche jorisvandenbossche added this to the 13.0.0 milestone Apr 13, 2023
@lukemanley
Copy link
Author

Thanks for the explanation. It looks like numpy uses that value (min int64) for NaT:

In [1]: import numpy as np

In [2]: np.datetime64("NaT").astype(int)
Out[2]: -9223372036854775808

In [3]: np.array([-9223372036854775808], dtype="m8[ns]")
Out[3]: array(['NaT'], dtype='timedelta64[ns]')

@westonpace
Copy link
Member

westonpace commented Apr 18, 2023

It's also tied to duration (e.g. you wouldn't get this behavior if you cast to int64). The fix is westonpace@ec9a5a4 although a proper PR should add tests as well as check the other checked functions (e.g. add_checked, etc.)

It turns out that the "skip nulls" behavior is something that has to be specified per-kernel and it wasn't being specified for the duration kernels. Is this something we need to fit into 12.0.0? If so I can try and carve out some time later this week for a PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants