ENH: add ExtensionArray.to_numpy to have control over conversion to numpy array #30322

jorisvandenbossche · 2019-12-18T13:45:15Z

Would still need to add this to the other arrays with NA (IntegerArray, StringArray), and need to pass through such option from Series.to_numpy. But already putting this up to check if we are OK with such interface and behaviour.

jorisvandenbossche · 2019-12-18T13:45:27Z

cc @TomAugspurger

TomAugspurger

Looks nice overall, thanks.

TomAugspurger · 2019-12-18T13:55:58Z

pandas/core/arrays/boolean.py

+            Whether to ensure that the returned value is a not a view on
+            the array. Note that ``copy=False`` does not *ensure* that
+            ``to_numpy()`` is no-copy. Rather, ``copy=True`` ensure that
+            a copy is made, even if not strictly necessary.


Can we give some guidance on when no-copy is possible? Is it only when there are no missing values and we're going to the numpy dtype (bool in this case)?

And thinking forward, can a pyarrow array with no NAs be converted to an ndarray without any copies?

Yes, I think it is only possible with no NAs and bool dtype. I first thought int8 would also be possible, but numpy doesn't seem to do such conversion without copy.

And thinking forward, can a pyarrow array with no NAs be converted to an ndarray without any copies?

For boolean not (since it is bits, not bytes), but in general (eg for IntegerArray without nulls) yes

TomAugspurger · 2019-12-18T14:01:28Z

pandas/core/arrays/boolean.py

            else:
                raise ValueError(
                    "cannot convert to bool numpy array in presence of missing values"
                )
-        data = self._data.astype(dtype)
-        data[self._mask] = na_value
+        if self.isna().any():


Can you move this up to before L325? It seems like an isna().any() is going to happen at least once, so should avoid ever doing it twice.

TomAugspurger · 2019-12-18T21:00:47Z

Just a note: The signature here differs from Series/Index.to_numpy and DataFrame.to_numpy, since this includes an na_value.

Should we add na_value to the other versions with a default of lib._no_default which means "the NA value for that dtype"? I can see that being useful for Series. I'm not sure if it's useful for DataFrame, but perhaps.

jorisvandenbossche · 2019-12-18T21:49:21Z

Would that already be useful for any of the other implementations / dtypes?

Right now I was planning to pass-through additional keywords to the EA version (if the Series is backed with EA and has a to_numpy method, or something). But of course adding the keyword specifically makes it clearer to document / find it.

TomAugspurger · 2019-12-20T15:49:22Z

Just passing through makes sense. I think we can actually add a default ExtensionArray.to_numpy() that just calls np.asarray(self, dtype=dtype).

TomAugspurger

I think merge whenever you're comfortable.

jreback

lgtm. do you need to add to api.rst?

TomAugspurger · 2019-12-20T19:26:40Z

Shouldn't need to. I think we autodoc all the methods here.

jorisvandenbossche · 2019-12-22T09:56:00Z

I added the pass-through from Series

jreback

looks good, but the signature change to kwargs is not great, can you directly specify it?

jreback · 2019-12-23T13:08:53Z

pandas/core/base.py

@@ -780,7 +780,7 @@ def array(self) -> ExtensionArray:

        return result

-    def to_numpy(self, dtype=None, copy=False):
+    def to_numpy(self, dtype=None, copy=False, **kwargs):


can we instead add na_value here?

jreback · 2019-12-23T13:10:39Z

pandas/core/base.py

@@ -780,7 +780,7 @@ def array(self) -> ExtensionArray:

        return result

-    def to_numpy(self, dtype=None, copy=False):
+    def to_numpy(self, dtype=None, copy=False, **kwargs):


this is changing the signature of all EA, but are you adding tests? (e.g. StringArray / IntArray), or as followup?

this is changing the signature of all EA, but are you adding tests? (e.g. StringArray / IntArray), or as followup?

Note this is not changing the signature of the EA method (this is the Series/Index method)

I think in general it might be useful to pass through kwargs, in that way ExtensionArray authors can have more specific control over the conversion to numpy (if we do this, we should add a test for it for one of the test EAs, like DecimalArray, with an additional keyword in the to_numpy method).

Specifically for na_value, we could maybe add it to the actual signature. Although it is right now not implemented for most of the dtypes (only for boolean EA dtype).

right, actually meant that. yeah I would be explict here I think. the question is can you? e.g. you want the default to be libmissing.NA, but that is not the default for anything but BA now? I think it would be better to be explicit here.

Yes, that's what makes it more complicated here. It's really a EA-specific keyword, and the general solution for that is passing through keywords (as done here), I think.
Of course, for our own EAs, we can make exceptions and explicitly add them.

Now, since the other dtypes don't yet implement this, the default of libmissing.NA might not necessarily be a problem (this default is not valid for other dtypes, but it's not used for those dtypes anyway). Now, having this keyword with such default might also be confusing, since the NA won't be used for most keywords.

If we add it here explicitly, I would probably use the lib._no_default trick

If we add it here explicitly, I would probably use the lib._no_default trick

I think this would be the better long term soln ,yes?

jreback · 2020-01-01T03:33:30Z

can you rebase & update to comments

* Added ExtensionArray.to_numpy * Added na_value to Series/Index.to_numpy

TomAugspurger · 2020-01-03T19:15:13Z

I've pushed a somewhat large change that implements the discussion in #30322 (comment). It adds two main things

ExtensionArray.to_numpy to the base ExtensionArray class
na_value to Series/Index/DataFrame.to_numpy

A few things to note

ExtensionArray.to_numpy doesn't accept kwargs. I don't think it needs to. Subclasses can accept additional keyword arguments if they choose to (I added one to decimal as an example)
Subclasses will need to import _no_default from pandas._libs. We should make that public.

TomAugspurger · 2020-01-03T21:40:39Z

The CI failure is unrelated and is being fixed in #30660

jorisvandenbossche · 2020-01-06T09:25:37Z

Thanks for the updates!

Indeed, the _no_default then needs to be usable by EA authors, that's a bit unfortunate .. Put it in pd.api.types ? (I don't think there is a way to avoid using it if we want to add na_value as a keyword rather than pass it through in the kwargs)

pandas/core/frame.py

pandas/core/arrays/boolean.py

pandas/core/arrays/numpy_.py

pandas/core/base.py

doc/source/reference/extensions.rst

jorisvandenbossche · 2020-01-06T19:22:08Z

pandas/_libs/lib.pyx

@@ -2232,7 +2232,8 @@ def maybe_convert_objects(ndarray[object] objects, bint try_float=0,
    return objects


-_no_default = object()
+# Note: _no_default is exported to the public API in pandas.api.extensions
+_no_default = object()  #: Sentinel indicating the default value.


Should we rename this to no_default ? (no leading underscore)

I thought about it, don't have a strong opinion.

i agree we should just rename this

note that we have a number of type we use object as the marker (so should change those in a followup)

…andas into NA-to_numpy

TomAugspurger · 2020-01-07T12:16:33Z

This is passing again.

TomAugspurger · 2020-01-07T12:59:32Z

Would like to merge this in a few hours, and put up a PR creating a base class for BooleanArray & IntegerArray later today.

jorisvandenbossche · 2020-01-07T13:02:05Z

and put up a PR creating a base class for BooleanArray & IntegerArray later today.

I already have a branch lying around with an initial take on this, in case you didn't start yet.

jreback

lgtm. I would do the rename of _no_default -> no_default though

jreback · 2020-01-07T13:03:09Z

pandas/_libs/lib.pyx

@@ -2232,7 +2232,8 @@ def maybe_convert_objects(ndarray[object] objects, bint try_float=0,
    return objects


-_no_default = object()
+# Note: _no_default is exported to the public API in pandas.api.extensions
+_no_default = object()  #: Sentinel indicating the default value.


i agree we should just rename this

jreback · 2020-01-07T13:03:44Z

pandas/_libs/lib.pyx

@@ -2232,7 +2232,8 @@ def maybe_convert_objects(ndarray[object] objects, bint try_float=0,
    return objects


-_no_default = object()
+# Note: _no_default is exported to the public API in pandas.api.extensions
+_no_default = object()  #: Sentinel indicating the default value.


note that we have a number of type we use object as the marker (so should change those in a followup)

jorisvandenbossche · 2020-01-07T13:06:56Z

There is still a difference in the handling of NA in conversion to float between integer and boolean.

For boolean, I implemented it as strict. So a float cannot hold NA, so if there are missing values, this errors, and you need to explicitly use na_value=np.nan (but it will work without if there are no missing values):

In [8]: a1 = pd.array([True, False, pd.NA]) 

In [9]: a1.to_numpy(float)   # (not the best error message though)
...
TypeError: float() argument must be a string or a number, not 'NAType'

In [10]: a1.to_numpy(float, na_value=np.nan) 
Out[10]: array([ 1.,  0., nan])

While for integer, you implemented this less strict (it's the base EA.to_numpy implementation that is getting this behaviour of __array__):

In [11]: a2 = pd.array([1, 2, pd.NA])  

In [12]: a2.to_numpy(float) 
Out[12]: array([ 1.,  2., nan])

In [13]: a2.to_numpy(float, na_value=np.nan) 
Out[13]: array([ 1.,  2., nan])

In [14]: a2.to_numpy(float, na_value=pd.NA)
...
TypeError: float() argument must be a string or a number, not 'NAType'

It would be nice to align this. This relates to the discussion in the issue #30038

jorisvandenbossche · 2020-01-07T13:08:05Z

I would do the rename of _no_default -> no_default though

Yes, +1. If we make it public to use for external EA authors, let's make that clear with the name you can use it (and the actual object itself is already in the private _lib, so no need for the name itself to have a _ there).

jorisvandenbossche · 2020-01-07T13:11:24Z

It would be nice to align this. This relates to the discussion in the issue #30038

Rereading that again, it seems the conclusion was somewhat that we want to raise for this (so no automatic conversion of pd.NA to np.nan)

TomAugspurger · 2020-01-07T13:57:50Z

OK with doing the _no_default -> no_default in a followup?

TomAugspurger · 2020-01-07T16:21:02Z

FYI, I have a followup PR doing IntegerArray.to_numpy, which will fix the inconsistencies when converting to an ndarary (changing integer to match the behavior here). I think that needs to be its own PR, with a followup moving them to the base class.

OK to merge this?

jorisvandenbossche · 2020-01-07T16:25:27Z

I think that needs to be its own PR, with a followup moving them to the base class.

You saw my comment above about this? (I have an initial branch for this, in case you didn't start it yet)

But yes, I think that is good as a follow-up.

ENH: add to_numpy to have control over conversion to numpy array

a3dab78

jorisvandenbossche added the ExtensionArray Extending pandas with custom dtypes or arrays. label Dec 18, 2019

jorisvandenbossche added this to the 1.0 milestone Dec 18, 2019

TomAugspurger approved these changes Dec 18, 2019

View reviewed changes

TomAugspurger approved these changes Dec 20, 2019

View reviewed changes

jreback approved these changes Dec 20, 2019

View reviewed changes

jorisvandenbossche mentioned this pull request Dec 21, 2019

API: Uses pd.NA in IntegerArray #29964

Merged

jorisvandenbossche added 3 commits December 21, 2019 12:16

Merge remote-tracking branch 'upstream/master' into NA-to_numpy

8e669fc

move isna check earlier + simplify

fd1c04a

passthrough keywords in Series.to_numpy

92f14d2

jreback requested changes Dec 23, 2019

View reviewed changes

TomAugspurger added 7 commits January 3, 2020 11:08

Merge remote-tracking branch 'upstream/master' into NA-to_numpy

bf9592c

doc note

278447d

update

1da45bd

Merge remote-tracking branch 'upstream/master' into NA-to_numpy

3889986

updates

6832f42

* Added ExtensionArray.to_numpy * Added na_value to Series/Index.to_numpy

update

e8460ac

dataframe

f1e34c6

TomAugspurger added 2 commits January 3, 2020 13:44

compat

1624712

typing

34307ca

jorisvandenbossche commented Jan 6, 2020

View reviewed changes

pandas/core/frame.py Outdated Show resolved Hide resolved

jreback requested changes Jan 6, 2020

View reviewed changes

pandas/core/arrays/boolean.py Outdated Show resolved Hide resolved

pandas/core/arrays/numpy_.py Show resolved Hide resolved

pandas/core/base.py Show resolved Hide resolved

TomAugspurger added 3 commits January 6, 2020 11:00

Merge remote-tracking branch 'upstream/master' into NA-to_numpy

2b43793

fixups

afc7350

remove unused test

008b54f

jorisvandenbossche commented Jan 6, 2020

View reviewed changes

jorisvandenbossche and others added 2 commits January 6, 2020 20:31

Update doc/source/reference/extensions.rst

91a4639

Merge branch 'NA-to_numpy' of https://github.com/jorisvandenbossche/p…

fff20a0

…andas into NA-to_numpy

jreback approved these changes Jan 7, 2020

View reviewed changes

jorisvandenbossche merged commit 2e6f53b into pandas-dev:master Jan 7, 2020

jorisvandenbossche deleted the NA-to_numpy branch January 7, 2020 16:26

jorisvandenbossche changed the title ~~ENH: add BooleanArray.to_numpy to have control over conversion to numpy array~~ ENH: add ExtensionArray.to_numpy to have control over conversion to numpy array Jan 7, 2020

jorisvandenbossche mentioned this pull request Jan 7, 2020

Rename api.extensions._no_default to extension.no_default #30785

Closed

ENH: add ExtensionArray.to_numpy to have control over conversion to numpy array #30322

ENH: add ExtensionArray.to_numpy to have control over conversion to numpy array #30322

Conversation

jorisvandenbossche commented Dec 18, 2019

jorisvandenbossche commented Dec 18, 2019

TomAugspurger left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

TomAugspurger commented Dec 18, 2019

jorisvandenbossche commented Dec 18, 2019

TomAugspurger commented Dec 20, 2019

TomAugspurger left a comment

Choose a reason for hiding this comment

jreback left a comment

Choose a reason for hiding this comment

TomAugspurger commented Dec 20, 2019

jorisvandenbossche commented Dec 22, 2019

jreback left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jreback commented Jan 1, 2020

TomAugspurger commented Jan 3, 2020

TomAugspurger commented Jan 3, 2020

jorisvandenbossche commented Jan 6, 2020

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

TomAugspurger commented Jan 7, 2020

TomAugspurger commented Jan 7, 2020

jorisvandenbossche commented Jan 7, 2020

jreback left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jorisvandenbossche commented Jan 7, 2020

jorisvandenbossche commented Jan 7, 2020

jorisvandenbossche commented Jan 7, 2020

TomAugspurger commented Jan 7, 2020 • edited Loading

TomAugspurger commented Jan 7, 2020

jorisvandenbossche commented Jan 7, 2020

TomAugspurger commented Jan 7, 2020 •

edited

Loading