Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

String dtype: implement object-dtype based StringArray variant with NumPy semantics #58451

Merged

Conversation

jorisvandenbossche
Copy link
Member

@jorisvandenbossche jorisvandenbossche commented Apr 27, 2024

Similarly like #54533 added a variant of the pyarrow-backed string dtype to use numpy nullable (NaN) semantics (now named StringDtype(storage="pyarrow", na_value=np.nan) with ArrowStringArrayNumpySemantics), this PR does the same for our object-dtype numpy array backed StringArray: a new StringDtype(storage="python", na_value=np.nan) with the corresponding StringArrayNumpySemantics array.

Illustration for discussion in #57073

xref #54792

@jorisvandenbossche jorisvandenbossche added the Strings String extension data type and string data label Apr 27, 2024
Comment on lines 808 to 812
# Specifically for StringArrayNumpySemantics, validate here we have a valid array
if isinstance(left.dtype, StringDtype) and left.dtype.storage == "python_numpy":
assert np.all(
[np.isnan(val) for val in left._ndarray[left_na]] # type: ignore[attr-defined]
), "wrong missing value sentinels"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit a custom check (and we don't do anything similarly for other types), but given I initially overlooked a case where we were creating string arrays with the wrong missing value sentinel because the tests don't actually catch that (two arrays with different missing value sentinels still pass as equal in case of EAs), I would prefer keeping this in at least on the short term.

This comment was marked as outdated.

pandas/_libs/lib.pyx Outdated Show resolved Hide resolved
pandas/_testing/asserters.py Outdated Show resolved Hide resolved
left = repr(left)
elif isinstance(left, StringDtype):
left = f"StringDtype(storage={left.storage}, na_value={left.na_value})"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the NA variants, do we really even need to expose/check against the storage or can we always just check that we have a StringDtype with a pd.NA missing value marker?

The main thing I want to decouple users from is relying heavily on things like StringDtype(storage="python"), because it makes it really hard to move them away from our internals.

Taking one of the examples we talked about for a possible implementation in 3.0, if we used nanoarrow to back a StringArray without taking on a pyarrow dependency, having a bunch of scattered calls to "python" makes that much harder than it needs to be, without a ton of benefit (?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realize the storage= keyword is documented in PDEP-14 so not surprised to see it here; I just generally am hoping to minimize how often it appears to users in non-developmental / internal contexts

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the NA variants, do we really even need to expose/check against the storage or can we always just check that we have a StringDtype with a pd.NA missing value marker?

You mean that our asserters (assert_frame_equal et al) would consider a column with string dtype backed by pyarrow vs python as equal? (as long as the na_value is equal)

The main thing I want to decouple users from is relying heavily on things like StringDtype(storage="python"), because it makes it really hard to move them away from our internals.

Yes, I also want to ensure that our general goal is that essentially almost no user should have to be explicit about the storage (that's also one of the reasons that I do not want to include the storage in the string alias for the new-to-be-default string dtype).
But, this is about a developer tool. Currently we regard pd.StringDtype("pyarrow") == pd.StringDtype("python") as False, and so assert_frame_equal will fail for that. And in that case, for the developer UX, the assert error message should be clear.

Right now, you can get something like:

AssertionError: Attributes of DataFrame.iloc[:, 0] (column name="col1") are different

Attribute "dtype" are different
[left]:  string[pyarrow]
[right]: string[pyarrow]

which is not very helpful ...

The reason for that is because I did not bake the pd.NA vs np.nan information in the string alias / representation.

See also #59342 for an issue about this that was just opened (we should probably continue the main discussion about an informative repr there, but short term I want to include something like the above as a short-term solution until we decide on the best way forward to address the issues in #59342)

FWIW I also added this code change in #59352 (to fix up failing tests on main) but with a comment explaining the issue.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean that our asserters (assert_frame_equal et al) would consider a column with string dtype backed by pyarrow vs python as equal? (as long as the na_value is equal)

Yea exactly. I'm not sure I 100% feel that way and I see both sides to the argument, but wanted to raise it for discussion

Understood and agreed on all of your other points

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the asserters and how to treat variants of the same dtype, that is a topic we certainly have to discuss more in general as well when expanding this topic to other dtypes (PDEP 13).

Personally, I think there is indeed something to say about treating those dtypes as equal, or at least have an option to toggle how "strict" the dtype check is.

)
return type(self)(result)
else:
# This is when the result type is object. We reach this when
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this raise an error or not be possible in the first place?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some str methods are weird (i.e. what's In the comment here)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And not only weird, there are some methods that genuinely return an object dtype (of course because of lack of a better proper dtype, but right not with the default dtype this is object dtype). For example ser.str.split() returns list elements.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. The list-returning functions are more good use cases for PDEP-13 #58455

@@ -655,7 +663,11 @@ def test_isin(dtype, fixed_now_ts):
tm.assert_series_equal(result, expected)

result = s.isin(["a", pd.NA])
expected = pd.Series([True, False, True])
if dtype.storage == "python" and dtype.na_value is np.nan:
# TODO what do we want here?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is the best outcome

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In any case, we should be consistent between the object and pyarrow backed version of the NaN-variant of this dtype (currently it is only StringDtype("python", na_value=np.nan) that deviates from the three others).

But given that with plain object dtype we currently also match pd.NA with None, I would maybe rather keep that behaviour?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm OK - maybe we should just have this be a branch between np.nan and pd.NA then?

But given that with plain object dtype we currently also match pd.NA with None, I would maybe rather keep that behaviour?

I see the reasoning for it but I am hesitant to keep repeating this behavior. Seems like its really a quirk of how our historical Python object storage has worked, and I don't see that aging well as we move beyond it

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about this a bit more: when doing ser.isin(["a", pd.NA]), the user is providing a list here, i.e. not something with already a well defined dtype (numpy array or pandas array/series).
So what I expect what would happen here is that the list of values is coerced to the calling ser.dtype. And then this should give the same result as ser.isin(pd.array("a", pd.NA], ser.dtype)).

pd.array("a", pd.NA], ser.dtype) will coerce the NA to a missing value for that dtype, so in case of the new string dtype, this is NaN.

(to be clear, what actually happens is clearly not matching what I describe above as what I would expect)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I agree with @jorisvandenbossche description of the result (and object sucks...)

pandas/tests/strings/test_find_replace.py Outdated Show resolved Hide resolved
jorisvandenbossche and others added 2 commits August 7, 2024 14:36
Co-authored-by: Patrick Hoefler <61934744+phofl@users.noreply.github.com>
@jorisvandenbossche
Copy link
Member Author

This PR is certainly not yet perfect (and further reviews are welcome!), but I am going to merge it: we still have to fix a lot of things in general anyway (all the xfailing tests with the current implementation), and by getting this in now we start addressing those xfails (e.g. #59430), we can fix the tests/implementations for both dtype variants at the same time.

@jorisvandenbossche jorisvandenbossche merged commit 1272cb1 into pandas-dev:main Aug 7, 2024
39 of 45 checks passed
@jorisvandenbossche jorisvandenbossche deleted the string-dtype-object branch August 7, 2024 14:07
na_value=na_value,
dtype=np.dtype(cast(type, dtype)),
)
if na_value_is_na and mask.any():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this method (which has now been refactored to _str_map_nan_semantics) is slightly different in StringArray vs ArrowStringArray and im trying to sort out whether the differences are intentional or just cosmetic. could use some help from the author

  1. the Arrow version handles this doing the check before map_infer_mask and changing the dtype passed there (also doesn't check for na_value_is_na)

  2. the Arrow version sets na_value = np.nan/False on the analogue to L837/839 (again without a na_value_is_na check)

  3. the Arrow version doesn't have the L831 convert = convert and not np.all(mask); AFAICT no existing tests rely on that line

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Woops, my claim in 3 about it not mattering was incorrect. it matters for test_contains_nan and test_empty_str_methods

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could use some help from the author

Although an author who wrote this code almost 4 months ago ;)

Will take a closer look at it later today, but one quick find is that there were changes to the arrow version after I started this PR, so I might not have taken those into account in this version, eg #58483

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ive convinced myself that the arrow version doesnt need the na_value_is_na check bc it is always True

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... and that 'convert' is never used

WillAyd pushed a commit that referenced this pull request Aug 13, 2024
…umPy semantics (#58451)

Co-authored-by: Patrick Hoefler <61934744+phofl@users.noreply.github.com>
WillAyd added a commit to WillAyd/pandas that referenced this pull request Aug 14, 2024
…umPy semantics (pandas-dev#58451)

Co-authored-by: Patrick Hoefler <61934744+phofl@users.noreply.github.com>
WillAyd added a commit to WillAyd/pandas that referenced this pull request Aug 15, 2024
…umPy semantics (pandas-dev#58451)

Co-authored-by: Patrick Hoefler <61934744+phofl@users.noreply.github.com>
WillAyd added a commit to WillAyd/pandas that referenced this pull request Aug 15, 2024
…umPy semantics (pandas-dev#58451)

Co-authored-by: Patrick Hoefler <61934744+phofl@users.noreply.github.com>
WillAyd added a commit to WillAyd/pandas that referenced this pull request Aug 15, 2024
…umPy semantics (pandas-dev#58451)

Co-authored-by: Patrick Hoefler <61934744+phofl@users.noreply.github.com>
WillAyd added a commit to WillAyd/pandas that referenced this pull request Aug 21, 2024
…umPy semantics (pandas-dev#58451)

Co-authored-by: Patrick Hoefler <61934744+phofl@users.noreply.github.com>
WillAyd added a commit to WillAyd/pandas that referenced this pull request Aug 22, 2024
…umPy semantics (pandas-dev#58451)

Co-authored-by: Patrick Hoefler <61934744+phofl@users.noreply.github.com>
WillAyd added a commit to WillAyd/pandas that referenced this pull request Aug 22, 2024
…umPy semantics (pandas-dev#58451)

Co-authored-by: Patrick Hoefler <61934744+phofl@users.noreply.github.com>
WillAyd added a commit to WillAyd/pandas that referenced this pull request Aug 22, 2024
…umPy semantics (pandas-dev#58451)

Co-authored-by: Patrick Hoefler <61934744+phofl@users.noreply.github.com>
WillAyd added a commit to WillAyd/pandas that referenced this pull request Aug 27, 2024
…umPy semantics (pandas-dev#58451)

Co-authored-by: Patrick Hoefler <61934744+phofl@users.noreply.github.com>
WillAyd added a commit to WillAyd/pandas that referenced this pull request Sep 20, 2024
…umPy semantics (pandas-dev#58451)

Co-authored-by: Patrick Hoefler <61934744+phofl@users.noreply.github.com>
jorisvandenbossche pushed a commit to WillAyd/pandas that referenced this pull request Oct 2, 2024
…umPy semantics (pandas-dev#58451)

Co-authored-by: Patrick Hoefler <61934744+phofl@users.noreply.github.com>
jorisvandenbossche pushed a commit to WillAyd/pandas that referenced this pull request Oct 2, 2024
…umPy semantics (pandas-dev#58451)

Co-authored-by: Patrick Hoefler <61934744+phofl@users.noreply.github.com>
jorisvandenbossche pushed a commit to WillAyd/pandas that referenced this pull request Oct 3, 2024
…umPy semantics (pandas-dev#58451)

Co-authored-by: Patrick Hoefler <61934744+phofl@users.noreply.github.com>
jorisvandenbossche pushed a commit to WillAyd/pandas that referenced this pull request Oct 7, 2024
…umPy semantics (pandas-dev#58451)

Co-authored-by: Patrick Hoefler <61934744+phofl@users.noreply.github.com>
jorisvandenbossche pushed a commit that referenced this pull request Oct 9, 2024
…umPy semantics (#58451)

Co-authored-by: Patrick Hoefler <61934744+phofl@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backported Strings String extension data type and string data
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants