Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: add ExtensionArray.to_numpy to have control over conversion to numpy array #30322

Merged
merged 18 commits into from
Jan 7, 2020

Conversation

jorisvandenbossche
Copy link
Member

xref #30038

Would still need to add this to the other arrays with NA (IntegerArray, StringArray), and need to pass through such option from Series.to_numpy. But already putting this up to check if we are OK with such interface and behaviour.

@jorisvandenbossche jorisvandenbossche added the ExtensionArray Extending pandas with custom dtypes or arrays. label Dec 18, 2019
@jorisvandenbossche jorisvandenbossche added this to the 1.0 milestone Dec 18, 2019
@jorisvandenbossche
Copy link
Member Author

cc @TomAugspurger

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks nice overall, thanks.

Whether to ensure that the returned value is a not a view on
the array. Note that ``copy=False`` does not *ensure* that
``to_numpy()`` is no-copy. Rather, ``copy=True`` ensure that
a copy is made, even if not strictly necessary.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we give some guidance on when no-copy is possible? Is it only when there are no missing values and we're going to the numpy dtype (bool in this case)?

And thinking forward, can a pyarrow array with no NAs be converted to an ndarray without any copies?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think it is only possible with no NAs and bool dtype. I first thought int8 would also be possible, but numpy doesn't seem to do such conversion without copy.

And thinking forward, can a pyarrow array with no NAs be converted to an ndarray without any copies?

For boolean not (since it is bits, not bytes), but in general (eg for IntegerArray without nulls) yes

else:
raise ValueError(
"cannot convert to bool numpy array in presence of missing values"
)
data = self._data.astype(dtype)
data[self._mask] = na_value
if self.isna().any():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you move this up to before L325? It seems like an isna().any() is going to happen at least once, so should avoid ever doing it twice.

@TomAugspurger
Copy link
Contributor

Just a note: The signature here differs from Series/Index.to_numpy and DataFrame.to_numpy, since this includes an na_value.

Should we add na_value to the other versions with a default of lib._no_default which means "the NA value for that dtype"? I can see that being useful for Series. I'm not sure if it's useful for DataFrame, but perhaps.

@jorisvandenbossche
Copy link
Member Author

Would that already be useful for any of the other implementations / dtypes?

Right now I was planning to pass-through additional keywords to the EA version (if the Series is backed with EA and has a to_numpy method, or something). But of course adding the keyword specifically makes it clearer to document / find it.

@TomAugspurger
Copy link
Contributor

Just passing through makes sense. I think we can actually add a default ExtensionArray.to_numpy() that just calls np.asarray(self, dtype=dtype).

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think merge whenever you're comfortable.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. do you need to add to api.rst?

@TomAugspurger
Copy link
Contributor

Shouldn't need to. I think we autodoc all the methods here.

@jorisvandenbossche
Copy link
Member Author

I added the pass-through from Series

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good, but the signature change to kwargs is not great, can you directly specify it?

@@ -780,7 +780,7 @@ def array(self) -> ExtensionArray:

return result

def to_numpy(self, dtype=None, copy=False):
def to_numpy(self, dtype=None, copy=False, **kwargs):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we instead add na_value here?

@@ -780,7 +780,7 @@ def array(self) -> ExtensionArray:

return result

def to_numpy(self, dtype=None, copy=False):
def to_numpy(self, dtype=None, copy=False, **kwargs):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is changing the signature of all EA, but are you adding tests? (e.g. StringArray / IntArray), or as followup?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is changing the signature of all EA, but are you adding tests? (e.g. StringArray / IntArray), or as followup?

Note this is not changing the signature of the EA method (this is the Series/Index method)

I think in general it might be useful to pass through kwargs, in that way ExtensionArray authors can have more specific control over the conversion to numpy (if we do this, we should add a test for it for one of the test EAs, like DecimalArray, with an additional keyword in the to_numpy method).

Specifically for na_value, we could maybe add it to the actual signature. Although it is right now not implemented for most of the dtypes (only for boolean EA dtype).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, actually meant that. yeah I would be explict here I think. the question is can you? e.g. you want the default to be libmissing.NA, but that is not the default for anything but BA now? I think it would be better to be explicit here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's what makes it more complicated here. It's really a EA-specific keyword, and the general solution for that is passing through keywords (as done here), I think.
Of course, for our own EAs, we can make exceptions and explicitly add them.

Now, since the other dtypes don't yet implement this, the default of libmissing.NA might not necessarily be a problem (this default is not valid for other dtypes, but it's not used for those dtypes anyway). Now, having this keyword with such default might also be confusing, since the NA won't be used for most keywords.

If we add it here explicitly, I would probably use the lib._no_default trick

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we add it here explicitly, I would probably use the lib._no_default trick

I think this would be the better long term soln ,yes?

@jreback
Copy link
Contributor

jreback commented Jan 1, 2020

can you rebase & update to comments

@TomAugspurger
Copy link
Contributor

I've pushed a somewhat large change that implements the discussion in #30322 (comment). It adds two main things

  • ExtensionArray.to_numpy to the base ExtensionArray class
  • na_value to Series/Index/DataFrame.to_numpy

A few things to note

  1. ExtensionArray.to_numpy doesn't accept kwargs. I don't think it needs to. Subclasses can accept additional keyword arguments if they choose to (I added one to decimal as an example)
  2. Subclasses will need to import _no_default from pandas._libs. We should make that public.

@TomAugspurger
Copy link
Contributor

The CI failure is unrelated and is being fixed in #30660

@jorisvandenbossche
Copy link
Member Author

Thanks for the updates!

Indeed, the _no_default then needs to be usable by EA authors, that's a bit unfortunate .. Put it in pd.api.types ? (I don't think there is a way to avoid using it if we want to add na_value as a keyword rather than pass it through in the kwargs)

pandas/core/arrays/boolean.py Outdated Show resolved Hide resolved
pandas/core/arrays/numpy_.py Show resolved Hide resolved
pandas/core/base.py Show resolved Hide resolved
doc/source/reference/extensions.rst Outdated Show resolved Hide resolved
@@ -2232,7 +2232,8 @@ def maybe_convert_objects(ndarray[object] objects, bint try_float=0,
return objects


_no_default = object()
# Note: _no_default is exported to the public API in pandas.api.extensions
_no_default = object() #: Sentinel indicating the default value.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we rename this to no_default ? (no leading underscore)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about it, don't have a strong opinion.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i agree we should just rename this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note that we have a number of type we use object as the marker (so should change those in a followup)

@TomAugspurger
Copy link
Contributor

This is passing again.

@TomAugspurger
Copy link
Contributor

Would like to merge this in a few hours, and put up a PR creating a base class for BooleanArray & IntegerArray later today.

@jorisvandenbossche
Copy link
Member Author

and put up a PR creating a base class for BooleanArray & IntegerArray later today.

I already have a branch lying around with an initial take on this, in case you didn't start yet.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. I would do the rename of _no_default -> no_default though

@@ -2232,7 +2232,8 @@ def maybe_convert_objects(ndarray[object] objects, bint try_float=0,
return objects


_no_default = object()
# Note: _no_default is exported to the public API in pandas.api.extensions
_no_default = object() #: Sentinel indicating the default value.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i agree we should just rename this

@@ -2232,7 +2232,8 @@ def maybe_convert_objects(ndarray[object] objects, bint try_float=0,
return objects


_no_default = object()
# Note: _no_default is exported to the public API in pandas.api.extensions
_no_default = object() #: Sentinel indicating the default value.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note that we have a number of type we use object as the marker (so should change those in a followup)

@jorisvandenbossche
Copy link
Member Author

There is still a difference in the handling of NA in conversion to float between integer and boolean.

For boolean, I implemented it as strict. So a float cannot hold NA, so if there are missing values, this errors, and you need to explicitly use na_value=np.nan (but it will work without if there are no missing values):

In [8]: a1 = pd.array([True, False, pd.NA]) 

In [9]: a1.to_numpy(float)   # (not the best error message though)
...
TypeError: float() argument must be a string or a number, not 'NAType'

In [10]: a1.to_numpy(float, na_value=np.nan) 
Out[10]: array([ 1.,  0., nan])

While for integer, you implemented this less strict (it's the base EA.to_numpy implementation that is getting this behaviour of __array__):

In [11]: a2 = pd.array([1, 2, pd.NA])  

In [12]: a2.to_numpy(float) 
Out[12]: array([ 1.,  2., nan])

In [13]: a2.to_numpy(float, na_value=np.nan) 
Out[13]: array([ 1.,  2., nan])

In [14]: a2.to_numpy(float, na_value=pd.NA)
...
TypeError: float() argument must be a string or a number, not 'NAType'

It would be nice to align this. This relates to the discussion in the issue #30038

@jorisvandenbossche
Copy link
Member Author

I would do the rename of _no_default -> no_default though

Yes, +1. If we make it public to use for external EA authors, let's make that clear with the name you can use it (and the actual object itself is already in the private _lib, so no need for the name itself to have a _ there).

@jorisvandenbossche
Copy link
Member Author

It would be nice to align this. This relates to the discussion in the issue #30038

Rereading that again, it seems the conclusion was somewhat that we want to raise for this (so no automatic conversion of pd.NA to np.nan)

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Jan 7, 2020

OK with doing the _no_default -> no_default in a followup?

@TomAugspurger
Copy link
Contributor

FYI, I have a followup PR doing IntegerArray.to_numpy, which will fix the inconsistencies when converting to an ndarary (changing integer to match the behavior here). I think that needs to be its own PR, with a followup moving them to the base class.

OK to merge this?

@jorisvandenbossche
Copy link
Member Author

I think that needs to be its own PR, with a followup moving them to the base class.

You saw my comment above about this? (I have an initial branch for this, in case you didn't start it yet)

But yes, I think that is good as a follow-up.

@jorisvandenbossche jorisvandenbossche merged commit 2e6f53b into pandas-dev:master Jan 7, 2020
@jorisvandenbossche jorisvandenbossche deleted the NA-to_numpy branch January 7, 2020 16:26
@jorisvandenbossche jorisvandenbossche changed the title ENH: add BooleanArray.to_numpy to have control over conversion to numpy array ENH: add ExtensionArray.to_numpy to have control over conversion to numpy array Jan 7, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ExtensionArray Extending pandas with custom dtypes or arrays.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants