Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

StringArray comparisions return BooleanArray #30231

Merged

Conversation

TomAugspurger
Copy link
Contributor

@@ -59,6 +59,37 @@
rxor,
)

# -----------------------------------------------------------------------------
# constants
ARITHMETIC_BINOPS: Set[str] = {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbrockmendel am I reinventing any wheels here? This seems like something useful enough that it would have been done before.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i dont think so, no

doc/source/user_guide/text.rst Outdated Show resolved Hide resolved
pandas/core/arrays/string_.py Show resolved Hide resolved
pandas/tests/arrays/string_/test_string.py Outdated Show resolved Hide resolved
@TomAugspurger
Copy link
Contributor Author

Thanks! Fixed all those I think.

Tangentially related: we'll need a docs section discussing the behavior of pd.NA and arrays using pd.NA in comparison operations (propagate rather than unequal). I think it makes sense to combine user_guide/integer_na.rst and user_guide/boolean.rst into a single "nullable types" document which discusses all of this? What do you think?

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Dec 12, 2019

I think it makes sense to combine user_guide/integer_na.rst and user_guide/boolean.rst into a single "nullable types" document which discusses all of this? What do you think?

I was floating the idea in my head (but didn't make an issue or so yet) to create a "user_guide/data_types" directory and move the integer and boolean files there (so an extra nesting in the left sidebar). And then the main "data types" page could have a short intro to extension dtypes, pointing to those pages.
But, if the content is relatively limited, it might also make sense to just combine both on a single page for now.

@jreback jreback added API Design Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Dec 12, 2019
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you are really testing this. I think you actually need to take some of our existing indexing tests, then paramterize over the existing object based strings and then StringArray, then test indexing. I suspect a fair amount of things will break.

pandas/core/ops/__init__.py Show resolved Hide resolved
@jorisvandenbossche
Copy link
Member

I don't think you are really testing this. I think you actually need to take some of our existing indexing tests,

This does not yet add indexing, this only changes the comparison operations. Indexing is one of the next steps on the list in #29556

@jreback
Copy link
Contributor

jreback commented Dec 13, 2019

I don't think you are really testing this. I think you actually need to take some of our existing indexing tests,

This does not yet add indexing, this only changes the comparison operations. Indexing is one of the next steps on the list in #29556

ok that's fair.

would at like to at least in this PR put into place an xfail test that assures that if we try to do indexing with a BooleanArray it fails

@TomAugspurger
Copy link
Contributor Author

Indexing seems independent right? BooleanArray comparisons already return BooleanArray, and AFAIK those aren't tested in indexing yet.

I'm happy to write those tests / work on the implementation, but it feels like a separate piece of work.

@jorisvandenbossche
Copy link
Member

Yes, let's keep it separate

@jreback
Copy link
Contributor

jreback commented Dec 13, 2019

Indexing seems independent right? BooleanArray comparisons already return BooleanArray, and AFAIK those aren't tested in indexing yet.

I'm happy to write those tests / work on the implementation, but it feels like a separate piece of work.

they are separate until sometries to assign a BooleanArray to a column and use it. We have nothing for this, certainly do in a separate PR. What I am suggesting is a single test to validate that this raises (for now)

@TomAugspurger
Copy link
Contributor Author

Do we know what the defined behavior is yet? In #28778 it sounds like we're contemplating either raising or propagating.

@@ -94,7 +94,12 @@ l. For ``StringDtype``, :ref:`string accessor methods<api.series.str>`
2. Some string methods, like :meth:`Series.str.decode` are not available
on ``StringArray`` because ``StringArray`` only holds strings, not
bytes.

3. In comparision operations, :class:`arrays.StringArray` and ``Series`` backed
by a ``StringArray`` will return an object with :class:`arrays.BooleanDtype`,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
by a ``StringArray`` will return an object with :class:`arrays.BooleanDtype`,
by a ``StringArray`` will return an object with :class:`BooleanDtype`,

3. In comparision operations, :class:`arrays.StringArray` and ``Series`` backed
by a ``StringArray`` will return an object with :class:`arrays.BooleanDtype`,
rather than a ``bool`` dtype object, depending on whether
there are missing values. Missing values in a ``StringArray`` will propagate
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a reason we are not always returning a BooleanArray, or am I mis-reading this? this would be very confusing if this is indeed the case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is stated confusingly. When the dtype is "string" (not object), comparisons always return BooleanDtype.
Tom, I think you can removed the ", depending on whether there are missing values"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

generally lgtm

@@ -59,6 +59,37 @@
rxor,
)

# -----------------------------------------------------------------------------
# constants
ARITHMETIC_BINOPS: Set[str] = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just as FYI if you create a container with literal values mypy will infer that it is a Set[str] for you, so not required to specify. Have to specify if the container is empty on instantiation

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to know!

@TomAugspurger TomAugspurger added the ExtensionArray Extending pandas with custom dtypes or arrays. label Dec 17, 2019
@TomAugspurger
Copy link
Contributor Author

This should be good to go.

@jorisvandenbossche jorisvandenbossche merged commit 0badf06 into pandas-dev:master Dec 18, 2019
@jorisvandenbossche
Copy link
Member

Thanks!

proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019
proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design ExtensionArray Extending pandas with custom dtypes or arrays. Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants