
Implement _most_ of the EA interface for DTA/TDA #23643

Merged: 11 commits into pandas-dev:master on Nov 14, 2018

Conversation

@jbrockmendel (Member) commented Nov 12, 2018

@TomAugspurger as promised, this implements a handful of tests; a few more can be ported from #23415, but the EA-specific tests haven't been started.

The big missing piece is DatetimeArray._from_sequence, which I'm getting started on now in a new branch.
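For context on that missing piece: _from_sequence is the ExtensionArray hook that builds an array from a sequence of scalars. As an illustration only (it is a private hook, and this snippet is not from this PR), PeriodArray already provides it:

import pandas as pd
from pandas.core.arrays import PeriodArray

# Build a PeriodArray from scalar Periods via the EA hook; the DatetimeArray
# counterpart is the piece described above as still missing.
arr = PeriodArray._from_sequence([pd.Period("2018-01", freq="M"),
                                  pd.Period("2018-02", freq="M")])
print(arr)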

Closes #23586

@pep8speaks commented Nov 12, 2018

Hello @jbrockmendel! Thanks for updating the PR.

Comment last updated on November 12, 2018 at 20:25 UTC

pandas/core/indexes/datetimes.py (resolved review thread)
pandas/core/arrays/datetimelike.py (resolved review thread)
pandas/core/arrays/datetimelike.py (outdated review thread, resolved)
pandas/core/arrays/datetimelike.py (outdated review thread, resolved)

@classmethod
def _concat_same_type(cls, to_concat):
    # for TimedeltaArray and PeriodArray; DatetimeArray overrides
Contributor:

The problem being that DatetimeArray needs to pass through tz info?

For PeriodArray at least (I haven't checked TimedeltaArray) you should be able to implement _concat_same_type just in terms of .dtype. It's hashable and can be passed to PeriodArray.__init__.
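A minimal sketch of that suggestion (not the actual pandas implementation), assuming the stored integer ordinals are reachable via .asi8 and that the constructor accepts an integer ndarray together with the hashable dtype:

import numpy as np

def concat_same_type_via_dtype(to_concat):
    # All inputs share one PeriodDtype, so the dtype alone is enough to
    # rebuild the result; the stored ordinals can simply be concatenated.
    dtype = to_concat[0].dtype
    ordinals = np.concatenate([arr.asi8 for arr in to_concat])
    return type(to_concat[0])(ordinals, dtype=dtype)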

Member Author:

Good call

Member:

Unlike PeriodArray, DatetimeArray/TimedeltaArray do not include freq in their dtype, so I am not sure this check for freq should be done here.

Contributor:

IIRC, the PeriodArray constructor allows a redundant freq and dtype, as long as they match, so

PeriodArray(data, freq='H', dtype=PeriodDtype("H"))

should be fine.

Member Author:

Yah, we added dtype to the PeriodArray constructor specifically so that type(self)(values, freq=self.freq, dtype=self.dtype) would be valid for all three TDA/DTA/PA classes.

Member:

Sorry, that's not what I meant. The difference is the meaning of freq for PeriodArray vs DatetimeArray. For PeriodArray it is part of the dtype and defines how the stored integers are interpreted (and thus needs to match before those integers can simply be concatenated), but for DatetimeArray it is just an informative attribute describing the regularity of the array, not essential to describe it. So I would assume that _concat_same_type needs to handle arrays with different freqs for DatetimeArray.
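A small illustration of that point: both inputs below carry freq="D", but their concatenation is not an evenly spaced range, so the result cannot keep that freq.

import pandas as pd

a = pd.date_range("2018-01-01", periods=3, freq="D")
b = pd.date_range("2018-03-01", periods=3, freq="D")
# The gap between the two ranges breaks the daily regularity.
print(a.append(b).freq)  # None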

pandas/core/arrays/timedeltas.py (resolved review thread)
@TomAugspurger added this to the 0.24.0 milestone Nov 12, 2018
@TomAugspurger added the Datetime and ExtensionArray labels Nov 12, 2018
@jbrockmendel (Member Author):

Ported the last of the relevant tests from #23415.

pandas/core/arrays/datetimelike.py (outdated review thread, resolved)
class SharedTests(object):
    index_cls = None

    def test_take(self):
Member:

We should not need to add such basic tests I think, as those are covered by the base Extension tests (we should of course test datetime-specific aspects).

Is there anything in this test not tested by the base tests?

Member Author:

The class-specific tests also include checks for invalid fill_values:

arr.take([0, 1], allow_fill=True,
         fill_value=pd.Timestamp.now().time)
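A rough standalone rendering of that check (illustrative, not the actual test from this PR), written against released pandas where pd.array returns a DatetimeArray; a datetime.time is not a valid fill value, so take is expected to raise:

import pandas as pd
import pytest

arr = pd.array(pd.date_range("2018-01-01", periods=3))
# An invalid fill_value should raise rather than be silently coerced; the
# exact exception type has varied across versions, hence the tuple.
with pytest.raises((TypeError, ValueError)):
    arr.take([0, -1], allow_fill=True,
             fill_value=pd.Timestamp("2018-01-01").time())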

def test_concat_same_type(self):
Member:

As I think I also commented on one of the previous PRs that started doing this, I don't think we should test _concat_same_type directly here. It is already tested by the base extension tests and by all the tests that actually use it under the hood.

Member Author:

Yah, most of these tests were salvaged from one of those older PRs. I don't see much downside to having the tests, but am pretty happy to pawn this decision/PR off on Tom.

Contributor (@TomAugspurger, Nov 13, 2018):

We still want the fail cases (different dtypes) here.

At this point, we should be able to simplify core/dtypes/concat.py::concat_datetimetz and DatetimeIndex._concat_same_dtype, right? I'll take a look.
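A hypothetical sketch of that simplification (the helper name and shape here are illustrative, not the pandas code): once the array owns _concat_same_type, the Index-level concat can reduce to concatenating the backing arrays and re-wrapping the result.

import pandas as pd

def concat_same_dtype(indexes, name=None):
    arrays = [idx.array for idx in indexes]               # backing DatetimeArrays
    combined = type(arrays[0])._concat_same_type(arrays)  # EA interface method
    return pd.DatetimeIndex(combined, name=name)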


@classmethod
def _from_factorized(cls, values, original):
    return cls(values, dtype=original.dtype, freq=original.freq)
Member:

Does it make sense to pass original.freq here? It seems a bit strange, as this creates a new array which does not necessarily have the same order as the original one.

Although in practice, if you have a freq, that means you started with a regular and unique array, so the factorization is more or less a no-op and the result will still have the same freq? (But I might be missing corner cases.)

Member:

Thinking of a possible corner case, which is currently actually broken: a sorted factorize of a DatetimeIndex with a negative freq:

In [57]: idx = pd.date_range("2012-01-01", periods=3)

In [58]: idx
Out[58]: DatetimeIndex(['2012-01-01', '2012-01-02', '2012-01-03'], dtype='datetime64[ns]', freq='D')

In [59]: pd.factorize(idx)
Out[59]: 
(array([0, 1, 2]),
 DatetimeIndex(['2012-01-01', '2012-01-02', '2012-01-03'], dtype='datetime64[ns]', freq='D'))

In [60]: pd.factorize(idx[::-1])
Out[60]: 
(array([0, 1, 2]),
 DatetimeIndex(['2012-01-03', '2012-01-02', '2012-01-01'], dtype='datetime64[ns]', freq='-1D'))

In [61]: pd.factorize(idx[::-1], sort=True)
Out[61]: 
(array([2, 1, 0]),
 DatetimeIndex(['2012-01-01', '2012-01-02', '2012-01-03'], dtype='datetime64[ns]', freq='-1D'))

@jbrockmendel (Member Author):

@TomAugspurger are you still good with taking this PR over? (BTW, the Travis failures look like unrelated timeouts.)

@TomAugspurger (Contributor):

Sure, will push changes here.

@TomAugspurger (Contributor) commented Nov 13, 2018 via email

@jorisvandenbossche (Member):

> And I suppose we can't even assume that concatenating multiple DatetimeArrays with the same freq will end up with a DatetimeIndex with the same freq.

Indeed, as that would only be the case if they are nicely consecutive ranges.

_concat._concat_datetimetz -> DatetimeIndex._concat_same_dtype -> DatetimeArray._concat_same_type
@TomAugspurger (Contributor):

Pushed some changes:

  1. DatetimelikeArray.copy now copies the underlying data (see the quick check after this list)
  2. Changed the concatenation of DatetimeIndex, hopefully for the better.
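A quick check of point 1 (illustrative, using released pandas where pd.array returns a DatetimeArray): mutating the copy must not touch the original, which requires copy() to duplicate the underlying data.

import pandas as pd

arr = pd.array(pd.date_range("2018-01-01", periods=3))
copied = arr.copy()
copied[0] = pd.Timestamp("2000-01-01")
# The original is unchanged, so the copy does not share its backing data.
assert arr[0] == pd.Timestamp("2018-01-01")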

@codecov (bot) commented Nov 14, 2018

Codecov Report

Merging #23643 into master will increase coverage by <.01%.
The diff coverage is 98.48%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #23643      +/-   ##
==========================================
+ Coverage   92.24%   92.24%   +<.01%     
==========================================
  Files         161      161              
  Lines       51318    51340      +22     
==========================================
+ Hits        47339    47361      +22     
  Misses       3979     3979
Flag | Coverage Δ
#multiple | 90.64% <98.48%> (ø) ⬆️
#single | 42.34% <45.45%> (+0.03%) ⬆️

Impacted Files | Coverage Δ
pandas/core/arrays/datetimelike.py | 96.14% <100%> (+0.21%) ⬆️
pandas/core/arrays/period.py | 98.44% <100%> (-0.05%) ⬇️
pandas/core/arrays/timedeltas.py | 95.63% <100%> (+0.55%) ⬆️
pandas/core/indexes/datetimes.py | 96.11% <100%> (-0.01%) ⬇️
pandas/core/dtypes/concat.py | 96.66% <100%> (+0.4%) ⬆️
pandas/core/arrays/datetimes.py | 98.47% <100%> (+0.03%) ⬆️
pandas/core/indexes/datetimelike.py | 97.74% <85.71%> (-0.28%) ⬇️
pandas/tseries/offsets.py | 96.98% <0%> (-0.09%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update a197837...3cb072e.

pandas/core/arrays/timedeltas.py (resolved review thread)
return _concat._concat_datetimetz(to_concat, name)
# TODO(DatetimeArray)
# - remove the .asi8 here
# - remove the _maybe_box_as_values
Contributor:

Looks like you can remove part of the comment.

@jreback (Contributor) commented Nov 14, 2018

Just a really minor comment; OK to merge.

@TomAugspurger (Contributor):

Going to leave the stale comment for now, if that's OK. That whole if/else should be going away soon.

@jreback merged commit fb4405d into pandas-dev:master Nov 14, 2018
@jreback (Contributor) commented Nov 14, 2018

OK! Thanks @jbrockmendel and @TomAugspurger.

@jbrockmendel deleted the dlike_ea4 branch November 14, 2018 16:25
JustinZhengBC pushed a commit to JustinZhengBC/pandas that referenced this pull request Nov 14, 2018
thoo added a commit to thoo/pandas that referenced this pull request Nov 15, 2018
thoo added a commit to thoo/pandas that referenced this pull request Nov 15, 2018
tm9k1 pushed a commit to tm9k1/pandas that referenced this pull request Nov 19, 2018
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019