[C++] Overflow in `subtract_checked(timestamp, timestamp)` after casting to pandas and back. #43031

randolf-scholz · 2024-06-24T20:40:14Z

Describe the bug, including details regarding any error messages, version, and platform.

from datetime import datetime, timedelta
import pyarrow as pa

timestamps = [None, datetime(2022, 1, 1, 10, 0, 30), datetime(2022, 1, 1, 10, 1, 0)]
x = pa.array(timestamps, type=pa.timestamp("ms"))
pa.compute.subtract_checked(x, x)  # ✅

# convert to pandas and back
y = pa.Array.from_pandas(x.to_pandas())
assert x == y  # ✅
pa.compute.subtract_checked(y, x)  # ✅
pa.compute.subtract_checked(x, y)  # ❌ overflow

python 3.11
pyarrow 16.1.0
pandas 2.2.2
numpy 2.0.0

Component(s)

Python

jorisvandenbossche · 2024-06-25T13:26:18Z

@randolf-scholz thanks for the report.

The root cause is that subtract_checked kernel is not ignoring overflow errors for values masked by the validity bitmap (i.e. nulls). And the roundtrip to/from pandas changed this value. Using nanoarrow to quickly inspect the buffers of x and y:

In [17]: na.array(x).inspect()
<ArrowArray timestamp('ms', '')>
- length: 3
- offset: 0
- null_count: 1
- buffers[2]:
  - validity <bool[1 b] 01100000>
  - data <int64[24 b] 0 1641031230000 1641031260000>
- dictionary: NULL
- children[0]:

In [18]: na.array(y).inspect()
<ArrowArray timestamp('ms', '')>
- length: 3
- offset: 0
- null_count: 1
- buffers[2]:
  - validity <bool[1 b] 01100000>
  - data <int64[24 b] -9223372036854775808 1641031230000 1641031260000>
- dictionary: NULL
- children[0]:

So in the original array, pyarrow uses a default of 0 for the values behind the mask. But on roundtrip to pandas, pandas uses NaT as missing value sentinel, and when then converting back to pyarrow the integer representation of this (i.e. the smallest int64) is kept, as it is masked anyway)

jorisvandenbossche · 2024-06-25T13:26:30Z

And so this is a duplicate of #35088

randolf-scholz · 2024-06-25T14:58:16Z

Interesting. Could it possibly be related to performance optimizations of nullable integer arithmetic? With the convention that nulls have value 0, subtraction $z←x-y$ of nullable integers can be implemented branchless as:

z.valid = x.valid & y.valid
z.value = z.valid * (x.value - y.value) 
        = -int(z.valid) & (x.value - y.value)  (2's complement -1 = 0b11111111)

We never have to worry about overflow of invalid values. I don't know if that's what arrow is doing, just a hypothesis.

randolf-scholz added the Type: bug label Jun 24, 2024

github-actions bot added the Component: Python label Jun 24, 2024

jorisvandenbossche changed the title ~~Overflow in subtract_checked(timestamp, timestamp) after casting to pandas and back.~~ [C++] Overflow in subtract_checked(timestamp, timestamp) after casting to pandas and back. Jun 25, 2024

jorisvandenbossche added the Component: C++ label Jun 25, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[C++] Overflow in `subtract_checked(timestamp, timestamp)` after casting to pandas and back. #43031

[C++] Overflow in `subtract_checked(timestamp, timestamp)` after casting to pandas and back. #43031

randolf-scholz commented Jun 24, 2024 •

edited

Loading

jorisvandenbossche commented Jun 25, 2024

jorisvandenbossche commented Jun 25, 2024

randolf-scholz commented Jun 25, 2024 •

edited

Loading

[C++] Overflow in subtract_checked(timestamp, timestamp) after casting to pandas and back. #43031

[C++] Overflow in subtract_checked(timestamp, timestamp) after casting to pandas and back. #43031

Comments

randolf-scholz commented Jun 24, 2024 • edited Loading

Describe the bug, including details regarding any error messages, version, and platform.

Component(s)

jorisvandenbossche commented Jun 25, 2024

jorisvandenbossche commented Jun 25, 2024

randolf-scholz commented Jun 25, 2024 • edited Loading

[C++] Overflow in `subtract_checked(timestamp, timestamp)` after casting to pandas and back. #43031

[C++] Overflow in `subtract_checked(timestamp, timestamp)` after casting to pandas and back. #43031

randolf-scholz commented Jun 24, 2024 •

edited

Loading

randolf-scholz commented Jun 25, 2024 •

edited

Loading