BUG: astype fill_value for SparseArray.astype #23547

Merged on Nov 12, 2018 (11 commits)

TomAugspurger (Contributor):

I don't think we have a specific issue for this. This is not a fix / change for #23125

This fixes strange things like

```python
In [1]: import pandas as pd; import numpy as np

In [2]: a = pd.SparseArray([0, 1])

In [3]: a.astype(bool)
Out[3]:
[0, True]
Fill: 0
IntIndex
Indices: array([1], dtype=int32)
```

restoring the behavior of 0.23.x
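
The effect is easier to see in a toy model. The sketch below is not the pandas implementation; `ToySparse`, `astype_values_only`, and `astype_with_fill` are hypothetical names, used only to show why the fill value has to be converted together with the stored values.

```python
class ToySparse:
    """Toy sparse container: keeps only the values that differ from fill_value."""

    def __init__(self, sp_values, sp_index, length, fill_value):
        self.sp_values = sp_values    # explicitly stored values
        self.sp_index = sp_index      # positions of the stored values
        self.length = length          # length of the dense array
        self.fill_value = fill_value  # value implied at every other position

    @classmethod
    def from_dense(cls, data, fill_value=0):
        index = [i for i, v in enumerate(data) if v != fill_value]
        return cls([data[i] for i in index], index, len(data), fill_value)

    def to_dense(self):
        dense = [self.fill_value] * self.length
        for i, v in zip(self.sp_index, self.sp_values):
            dense[i] = v
        return dense

    def astype_values_only(self, typ):
        # Problematic variant: the fill value keeps its old type.
        values = [typ(v) for v in self.sp_values]
        return ToySparse(values, self.sp_index, self.length, self.fill_value)

    def astype_with_fill(self, typ):
        # Intended variant: the fill value is converted as well.
        values = [typ(v) for v in self.sp_values]
        return ToySparse(values, self.sp_index, self.length, typ(self.fill_value))


a = ToySparse.from_dense([0, 1])
buggy = a.astype_values_only(bool).to_dense()
fixed = a.astype_with_fill(bool).to_dense()
print(buggy)  # [0, True]
print(fixed)  # [False, True]
```

In the toy model the first result mixes an `int` fill value with `bool` data, which mirrors the `[0, True]` output shown above.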

TomAugspurger added the Sparse (Sparse Data Type) label on Nov 7, 2018
TomAugspurger added this to the 0.24.0 milestone on Nov 7, 2018

@@ -614,7 +614,7 @@ def __array__(self, dtype=None, copy=True):
 # Can't put pd.NaT in a datetime64[ns]
 fill_value = np.datetime64('NaT')
 try:
-dtype = np.result_type(self.sp_values.dtype, fill_value)
+dtype = np.result_type(self.sp_values.dtype, type(fill_value))
TomAugspurger (Contributor Author), Nov 7, 2018:

This was having trouble with string fill values.

dtype).item()
dtype = SparseDtype(dtype, fill_value=fill_value)

# Typically we'll just astype the sp_values to dtype.subtype,
Contributor Author:

This is kind of ugly, but it's backwards compatible, consistent with the rest of pandas, and does what we need.

Basically, unless we want to support actual numpy string dtypes (which we probably don't), then we need a way of differentiating between array.astype(object) and array.astype(str).
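
As a plain-Python analogy (not pandas code; `toy_astype` is a hypothetical helper): casting to `object` leaves every element as it is, while casting to `str` has to convert every element. If strings are stored in an object-typed container, the container's own type cannot tell the two requests apart.

```python
def toy_astype(values, target):
    """Toy cast: `object` keeps elements unchanged, any other type converts them."""
    if target is object:
        return list(values)
    return [target(v) for v in values]


values = [0, 1]
as_object = toy_astype(values, object)
as_str = toy_astype(values, str)
print(as_object)  # [0, 1]
print(as_str)     # ['0', '1']
```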

Contributor:

Can you make this a method on the Dtype itself to avoid cluttering this up here? Maybe dtype.astype_type.
Alternatively, we could actually add .astype_nansafe(value, copy=False) as a Dtype method (kind of makes sense, actually).

Contributor Author:

I was playing around with that earlier (didn't push it though). I called it SparseDtype.astype. I'll give it another shot and see what it de-duplicates.

Contributor Author:

Actually, can you clarify what you had in mind for astype_nansafe(value, copy=False)? What would value here be? An array or a scalar?

I'll have a followup PR soon (hopefully today) for ensuring that the dtype of SparseArray.sp_values is consistent with the type of SparseArray.dtype.fill_value. I think my SparseDtype.astype is more useful there. It wouldn't make sense to use it here, since we're astyping the actual array of values.

Contributor:

Right, I think adding a method on the Dtype itself that returns the dtype of the .astype values is what I am looking for. The conversion still happens in the Array. Basically, the code you added here should be on the Dtype object.

TomAugspurger (Contributor Author), Nov 11, 2018:

Updated to add two methods

  1. SparseDtype.astype: convert from a SparseDtype to a new dtype, taking care to astype self.fill_value if needed.
  2. SparseDtype._subtype_with_str to hold the logic for determining what the "real" subtype is, if we actually want str.

SparseDtype.astype seems reasonably useful to users, so I made it public.
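
A rough sketch of how the two pieces could fit together, written as a toy class rather than the actual pandas code. `ToySparseDtype` and `subtype_with_str` are hypothetical names, and the assumption that strings are stored under `object` is modelled explicitly:

```python
class ToySparseDtype:
    """Toy stand-in for a sparse dtype: an element type plus a fill value."""

    def __init__(self, subtype, fill_value):
        self.subtype = subtype
        self.fill_value = fill_value

    @property
    def subtype_with_str(self):
        # Assumption: strings are stored under `object`, so a str fill value
        # is the only hint that the "real" element type is str.
        if isinstance(self.fill_value, str):
            return str
        return self.subtype

    def astype(self, subtype):
        # Convert the dtype itself, casting fill_value to the new element type.
        stored = object if subtype is str else subtype
        if subtype is object:
            fill = self.fill_value
        else:
            fill = subtype(self.fill_value)
        return ToySparseDtype(stored, fill)

    def __repr__(self):
        return f"ToySparse[{self.subtype.__name__}, {self.fill_value!r}]"


int_dtype = ToySparseDtype(int, 0)
float_dtype = int_dtype.astype(float)
str_dtype = int_dtype.astype(str)
print(float_dtype)                 # ToySparse[float, 0.0]
print(str_dtype)                   # ToySparse[object, '0']
print(str_dtype.subtype_with_str)  # <class 'str'>
```

In this sketch the array-level conversion would ask the dtype for `subtype_with_str` when casting the stored values, so a request for `str` is not lost behind the `object` storage type.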

codecov bot commented Nov 7, 2018

Codecov Report

Merging #23547 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #23547      +/-   ##
==========================================
+ Coverage   92.23%   92.23%   +<.01%     
==========================================
  Files         161      161              
  Lines       51324    51334      +10     
==========================================
+ Hits        47339    47349      +10     
  Misses       3985     3985

Flag         Coverage Δ
#multiple    90.62% <100%> (ø) ⬆️
#single      42.32% <73.68%> (+0.02%) ⬆️

Impacted Files                  Coverage Δ
pandas/core/arrays/sparse.py    91.82% <100%> (+0.1%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update a092e91...173a28a.

@@ -284,6 +284,83 @@ def is_dtype(cls, dtype):
return True
return isinstance(dtype, np.dtype) or dtype == 'Sparse'

def astype(self, dtype):
Contributor:

can you make a method on the base Dtype class as well which just returns .dtype

Contributor:

as this will make it an official part of the interface.

Contributor Author:

Are there any other types that this would be useful for? IMO it's not important enough to add to the interface.

Contributor:

anything with a subtype? so Categorical and Interval?

return dtype

@property
def _subtype_with_str(self):
Contributor:

I guess this is only for Sparse which is ok

TomAugspurger (Contributor Author) commented Nov 11, 2018 via email

jreback (Contributor) commented Nov 11, 2018

ok that sounds good

jreback merged commit a5127b1 into pandas-dev:master on Nov 12, 2018