Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++] Overflow in subtract_checked(timestamp, timestamp) after casting to pandas and back. #43031

Open
randolf-scholz opened this issue Jun 24, 2024 · 3 comments

Comments

@randolf-scholz
Copy link

randolf-scholz commented Jun 24, 2024

Describe the bug, including details regarding any error messages, version, and platform.

Cross post from pandas-dev/pandas#59082

from datetime import datetime, timedelta
import pyarrow as pa

timestamps = [None, datetime(2022, 1, 1, 10, 0, 30), datetime(2022, 1, 1, 10, 1, 0)]
x = pa.array(timestamps, type=pa.timestamp("ms"))
pa.compute.subtract_checked(x, x)  # ✅

# convert to pandas and back
y = pa.Array.from_pandas(x.to_pandas())
assert x == y  # ✅
pa.compute.subtract_checked(y, x)  # ✅
pa.compute.subtract_checked(x, y)  # ❌ overflow
  • python 3.11
  • pyarrow 16.1.0
  • pandas 2.2.2
  • numpy 2.0.0

Component(s)

Python

@jorisvandenbossche jorisvandenbossche changed the title Overflow in subtract_checked(timestamp, timestamp) after casting to pandas and back. [C++] Overflow in subtract_checked(timestamp, timestamp) after casting to pandas and back. Jun 25, 2024
@jorisvandenbossche
Copy link
Member

@randolf-scholz thanks for the report.

The root cause is that subtract_checked kernel is not ignoring overflow errors for values masked by the validity bitmap (i.e. nulls). And the roundtrip to/from pandas changed this value. Using nanoarrow to quickly inspect the buffers of x and y:

In [17]: na.array(x).inspect()
<ArrowArray timestamp('ms', '')>
- length: 3
- offset: 0
- null_count: 1
- buffers[2]:
  - validity <bool[1 b] 01100000>
  - data <int64[24 b] 0 1641031230000 1641031260000>
- dictionary: NULL
- children[0]:

In [18]: na.array(y).inspect()
<ArrowArray timestamp('ms', '')>
- length: 3
- offset: 0
- null_count: 1
- buffers[2]:
  - validity <bool[1 b] 01100000>
  - data <int64[24 b] -9223372036854775808 1641031230000 1641031260000>
- dictionary: NULL
- children[0]:

So in the original array, pyarrow uses a default of 0 for the values behind the mask. But on roundtrip to pandas, pandas uses NaT as missing value sentinel, and when then converting back to pyarrow the integer representation of this (i.e. the smallest int64) is kept, as it is masked anyway)

@jorisvandenbossche
Copy link
Member

And so this is a duplicate of #35088

@randolf-scholz
Copy link
Author

randolf-scholz commented Jun 25, 2024

Interesting. Could it possibly be related to performance optimizations of nullable integer arithmetic? With the convention that nulls have value 0, subtraction $z←x-y$ of nullable integers can be implemented branchless as:

z.valid = x.valid & y.valid
z.value = z.valid * (x.value - y.value) 
        = -int(z.valid) & (x.value - y.value)  (2's complement -1 = 0b11111111)

We never have to worry about overflow of invalid values. I don't know if that's what arrow is doing, just a hypothesis.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants