[Python] pyarrow.compute.subtract_checked overflowing for some duration arrays constructed from numpy #35088

lukemanley · 2023-04-12T22:02:06Z

Describe the bug, including details regarding any error messages, version, and platform.

In the example below, arr2 and arr3 are duration arrays with a single null element.

arr2 is constructed from a list
arr3 is constructed from a numpy array

Once constructed, they evaluate to being equal.

However, they exhibit different behavior once passed to pyarrow.compute.subtract_checked:

import pyarrow as pa
import pyarrow.compute as pc
import numpy as np

data1 = [86400000000]
data2 = [None]
data3 = np.array([None], dtype="timedelta64[ns]")

arr1 = pa.array(data1, type=pa.duration("ns"))
arr2 = pa.array(data2, type=pa.duration("ns"))
arr3 = pa.array(data3, type=pa.duration("ns"))

assert arr2 == arr3

pc.subtract_checked(arr1, arr2)  # ok
pc.subtract_checked(arr1, arr3)  # ArrowInvalid: overflow

Component(s)

Python


jorisvandenbossche · 2023-04-13T08:02:57Z

@lukemanley thanks for the report. This is an interesting bug. The two arrays appear to be equal, but their underlying data buffers differ because they were created differently. Since the element is null, that data is masked, so the actual value "behind" the null shouldn't matter in theory.
"Viewing" the data buffer as an int64 array to see the values:

In [20]: pa.Array.from_buffers(pa.int64(), 1, [None, arr2.buffers()[1]])
Out[20]: 
<pyarrow.lib.Int64Array object at 0x7f4c1af64820>
[
  0
]

In [21]: pa.Array.from_buffers(pa.int64(), 1, [None, arr3.buffers()[1]])
Out[21]: 
<pyarrow.lib.Int64Array object at 0x7f4bf5998dc0>
[
  -9223372036854775808
]

And so my assumption is that the overflow comes from actually subtracting the values in the second case: 86400000000 - (-9223372036854775808) would indeed overflow.
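To make the arithmetic concrete, here is a small plain-Python check of that subtraction against the int64 range. It only illustrates the numbers quoted above; the names `INT64_MIN`, `INT64_MAX`, `minuend` and `behind_null` are made up for this sketch and are not pyarrow identifiers.

```python
# int64 bounds (Python ints are arbitrary precision, so nothing wraps here)
INT64_MIN = -(2**63)      # -9223372036854775808, the value found behind the null in arr3
INT64_MAX = 2**63 - 1     #  9223372036854775807

minuend = 86400000000     # the single value in arr1
behind_null = INT64_MIN   # raw value in arr3's data buffer

result = minuend - behind_null

print(result)              # exact mathematical result
print(result > INT64_MAX)  # True means it does not fit in an int64
```

Because the exact result lies above `INT64_MAX`, a checked int64 subtraction that actually looks at this slot has to report an overflow.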

However, the way "subtract_checked" is implemented, it should normally only perform the actual subtraction for data values that are not masked as null, exactly to avoid situations like the above. But it seems there is a bug in this mechanism to skip values behind nulls.
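As a rough illustration of what "skipping values behind nulls" means, the toy model below does a checked subtraction over plain Python lists with a separate validity mask. This is not Arrow's implementation: the function `subtract_checked_model`, its `skip_nulls` flag, and the assumption that only the right-hand operand can be null are all hypothetical simplifications for this sketch.

```python
INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1


def subtract_checked_model(left, right, right_valid, skip_nulls=True):
    """Toy element-wise int64 subtraction with an overflow check.

    left, right  : lists of raw int64 values (the "data buffers")
    right_valid  : list of bools, False where the right element is null
    skip_nulls   : if True, never touch the raw value behind a null
    """
    out = []
    for a, b, valid in zip(left, right, right_valid):
        if skip_nulls and not valid:
            # Null in, null out: the masked value is never read.
            out.append(None)
            continue
        diff = a - b
        if not INT64_MIN <= diff <= INT64_MAX:
            raise OverflowError("overflow")
        # Without skipping, the result is still null, but only after
        # the masked value has already gone through the overflow check.
        out.append(diff if valid else None)
    return out


# Raw value behind the null is int64 min, as in arr3 above.
print(subtract_checked_model([86400000000], [INT64_MIN], [False]))

try:
    subtract_checked_model([86400000000], [INT64_MIN], [False], skip_nulls=False)
except OverflowError as exc:
    print("raised:", exc)
```

With `skip_nulls=True` the masked slot is ignored and the output is simply null; with `skip_nulls=False` the same input raises, which matches the symptom reported for arr3.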

lukemanley · 2023-04-15T22:46:19Z

Thanks for the explanation. It looks like numpy uses that value (min int64) for NaT:

In [1]: import numpy as np

In [2]: np.datetime64("NaT").astype(int)
Out[2]: -9223372036854775808

In [3]: np.array([-9223372036854775808], dtype="m8[ns]")
Out[3]: array(['NaT'], dtype='timedelta64[ns]')

westonpace · 2023-04-18T07:53:20Z

It's also tied to duration (e.g. you wouldn't get this behavior if you cast to int64). The fix is westonpace@ec9a5a4, although a proper PR should add tests and also check the other checked functions (e.g. add_checked).

It turns out that the "skip nulls" behavior is something that has to be specified per-kernel and it wasn't being specified for the duration kernels. Is this something we need to fit into 12.0.0? If so I can try and carve out some time later this week for a PR.

lukemanley added the Type: bug label Apr 12, 2023

github-actions bot added the Component: Python label Apr 12, 2023

jorisvandenbossche added this to the 13.0.0 milestone Apr 13, 2023

lukemanley mentioned this issue Apr 22, 2023

BUG: pyarrow duration arrays constructed from data containing NaT can overflow pandas-dev/pandas#52843

Merged

4 tasks

raulcd modified the milestones: 13.0.0, 14.0.0 Jul 7, 2023

jorisvandenbossche modified the milestones: 14.0.0, 15.0.0 Oct 10, 2023

raulcd modified the milestones: 15.0.0, 16.0.0 Jan 8, 2024

raulcd modified the milestones: 16.0.0, 17.0.0 Apr 8, 2024

jorisvandenbossche mentioned this issue Jun 25, 2024

[C++] Overflow in subtract_checked(timestamp, timestamp) after casting to pandas and back. #43031

Open

raulcd removed this from the 17.0.0 milestone Jun 28, 2024
