Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bitesize offsets #17318

Merged
merged 16 commits into from
Sep 23, 2017
Merged

Bitesize offsets #17318

merged 16 commits into from
Sep 23, 2017

Conversation

jbrockmendel
Copy link
Member

This is the first of several PRs cleaning up tseries.offsets. The ultimate goals of this series of PRs are:

  • Fix slow implementation of DateOffset.__eq__, as that gets called by Period.__eq__.
  • Make DateOffset immutable, since it is attached to a Period object which is supposed to be immutable (TODO: fill in the appropriate GH issue)
  • Move tseries.offsets into cython so that _libs.period and _libs.tslib can import it guilt-free and not need to do run-time imports. See TODO comment in _libs.__init__:
# TODO
# period is directly dependent on tslib and imports python
# modules, so exposing Period as an alias is currently not possible

The biggest impediment to the immutability goal is the kwds attribute, which is just a dict. The first couple of steps in this sequence is focused on whittling down the number of attributes set at runtime.

This PR is mainly fixing typos and removing redundant methods.

  • passes git diff upstream/master -u -- "*.py" | flake8 --diff

@pep8speaks
Copy link

pep8speaks commented Aug 23, 2017

Hello @jbrockmendel! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on September 23, 2017 at 16:21 Hours UTC

@gfyoung gfyoung added Clean Internals Related to non-user accessible pandas implementation labels Aug 23, 2017
plural = 's'
else:
plural = ''

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this whole thing should just be moved to a separate function, much more clear this way

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what you're referring to as "this whole thing".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this formatting function (e.g. repr should just call it, passing parameters in).

@@ -642,9 +623,6 @@ def get_str(td):
else:
return '+' + repr(self.offset)

def isAnchored(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls don't simply remove things. instead in a separate PR deprecate them.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The method is still available. It's inherited verbatim from the parent class.

@codecov
Copy link

codecov bot commented Aug 28, 2017

Codecov Report

Merging #17318 into master will increase coverage by 0.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #17318      +/-   ##
==========================================
+ Coverage   91.01%   91.02%   +0.01%     
==========================================
  Files         162      162              
  Lines       49558    49564       +6     
==========================================
+ Hits        45105    45116      +11     
+ Misses       4453     4448       -5
Flag Coverage Δ
#multiple 88.8% <100%> (+0.02%) ⬆️
#single 40.26% <66.66%> (-0.05%) ⬇️
Impacted Files Coverage Δ
pandas/tseries/offsets.py 97.28% <100%> (+0.13%) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.72% <0%> (-0.1%) ⬇️
pandas/errors/__init__.py 100% <0%> (ø) ⬆️
pandas/plotting/_converter.py 65.05% <0%> (+1.81%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2bec750...c66b842. Read the comment docs.

@codecov
Copy link

codecov bot commented Aug 28, 2017

Codecov Report

Merging #17318 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #17318      +/-   ##
==========================================
- Coverage   91.22%   91.21%   -0.01%     
==========================================
  Files         163      163              
  Lines       49655    49673      +18     
==========================================
+ Hits        45296    45308      +12     
- Misses       4359     4365       +6
Flag Coverage Δ
#multiple 89% <100%> (+0.01%) ⬆️
#single 40.19% <64%> (-0.05%) ⬇️
Impacted Files Coverage Δ
pandas/tseries/frequencies.py 96.11% <ø> (ø) ⬆️
pandas/tseries/offsets.py 97.14% <100%> (+0.14%) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.77% <0%> (-0.1%) ⬇️
pandas/core/series.py 94.92% <0%> (ø) ⬆️
pandas/core/generic.py 91.99% <0%> (ø) ⬆️
pandas/core/indexes/datetimes.py 95.53% <0%> (+0.09%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e2757a2...e52a791. Read the comment docs.

@jbrockmendel
Copy link
Member Author

I think this test error is unrelated. Pls confirm.

@jreback jreback added the Frequency DateOffsets label Sep 14, 2017
@jreback
Copy link
Contributor

jreback commented Sep 14, 2017

this was a while ago, rebase and we will see

@jreback
Copy link
Contributor

jreback commented Sep 15, 2017

needs a rebase

plural = 's'
else:
plural = ''

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this formatting function (e.g. repr should just call it, passing parameters in).

return fstr

def _offset_str(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

try not to add new things which just clutter up

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_offset_str is currently a method of BusinessDay, which has its own implementation of freqstr. Adding this dummy method lets us get rid of the duplicate freqstr method. Same idea with __repr__/_repr_attrs. Strictly less clutter.

@@ -710,6 +689,7 @@ def __init__(self, **kwds):
kwds['end'] = self._validate_time(kwds.get('end', '17:00'))
self.kwds = kwds
self.offset = kwds.get('offset', timedelta(0))
self._offset = self.offset # alias for backward compat
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why don't you just define .offset as a property to return ._offset?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I definitely tried this, don't remember off the top why it didn't work. Maybe when the caffeine kicks in.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, let's try that type of refactor again.

@@ -710,6 +689,7 @@ def __init__(self, **kwds):
kwds['end'] = self._validate_time(kwds.get('end', '17:00'))
self.kwds = kwds
self.offset = kwds.get('offset', timedelta(0))
self._offset = self.offset # alias for backward compat
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, let's try that type of refactor again.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you run the freq asv's to make sure nothing has changed here.

if 'offset' in state:
# Older versions have offset attribute instead of _offset
assert '_offset' not in state, list(state.keys())
state['_offset'] = state['offset']
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

state['_offset'] = state.pop('offset')

can you remove the assert?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call on the pop. Change assert to ValueError if both keys are (somehow) there?

Copy link
Contributor

@jreback jreback Sep 22, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you may want to add a 0.20.3 pickle that adds things for every frequency.

in separate PR.
you can use pandas/tests/io/generate_legacy_storage_files.py, update to add all of the offsets. Then generate and add to the repo. All tests should pass. (this is all with 0.20.3), make the modification locally but run it with the older python

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll make a note to do this once other fixes to offsets are done.

return out
@property
def offset(self):
# Alias for backward compat
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better comment on what this attribute is (its not for backward compat, rather its the API).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I understand. It explicitly is for backward compat since we are trying to standardize on _offset.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

offsetis a user API, the _offset is merely an implementation detail (e.g. how we implement it). add a doc-string on what this returns.

Copy link
Member Author

@jbrockmendel jbrockmendel Sep 23, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. Though to the extent that we can lock down what constitutes the user-facing API, offset probably doesn't belong in it.

@@ -507,8 +513,18 @@ def freqstr(self):
else:
fstr = code

try:
if self._offset:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this still needed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Not all subclasses define the _offset attribute. That is something I intend to standardize, but this PR is explicitly intended to be limited in scope.

@jbrockmendel
Copy link
Member Author

can you run the freq asv's to make sure nothing has changed here.

Begrudgingly...

$ asv continuous -f 1.1 -E virtualenv master HEAD -b freq
[...]
       before           after         ratio
     [2bec750b]       [d1e9161b]
+          72.9ms            126ms     1.73  timeseries.DatetimeIndex.time_infer_freq_none
-          57.9ms           45.0ms     0.78  timeseries.DatetimeIndex.time_infer_freq_business
-          44.7ms           34.3ms     0.77  timeseries.DatetimeIndex.time_infer_freq_daily

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

$ asv continuous -f 1.1 -E virtualenv master HEAD -b freq
[...]
     [2bec750b]       [d1e9161b]
+          31.7ms           43.8ms     1.38  timeseries.DatetimeIndex.time_infer_freq_business
+     1.52±0.04μs      1.99±0.05μs     1.30  timestamp.TimestampProperties.time_freqstr
+          43.1ms           50.4ms     1.17  timeseries.DatetimeIndex.time_infer_freq_none

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

$ asv continuous -f 1.1 -E virtualenv master HEAD -b freq
[...]
     [2bec750b]       [d1e9161b]
+          25.6ms           30.7ms     1.20  timeseries.DatetimeIndex.time_infer_freq_business
-          27.2ms           24.3ms     0.89  timeseries.DatetimeIndex.time_infer_freq_daily

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

$ asv continuous -f 1.1 -E virtualenv master HEAD -b freq
[...]
      before           after         ratio
     [2bec750b]       [d1e9161b]
-          60.6ms           53.9ms     0.89  timeseries.DatetimeIndex.time_infer_freq_none
-          34.6ms           28.7ms     0.83  timeseries.DatetimeIndex.time_infer_freq_business
-          30.9ms           23.4ms     0.76  timeseries.DatetimeIndex.time_infer_freq_daily
-      2.87±0.3μs      2.14±0.07μs     0.75  timestamp.TimestampProperties.time_freqstr

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

@jreback
Copy link
Contributor

jreback commented Sep 22, 2017

This is repeatable.

    before     after       ratio
  [f797c1dc] [d1e9161b]
+  529.00μs    12.36ms     23.36  timeseries.DatetimeIndex.time_infer_freq_business
+   20.68ms    24.21ms      1.17  timeseries.DatetimeIndex.time_infer_freq_none

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

@jreback
Copy link
Contributor

jreback commented Sep 22, 2017

[2bec750b] is a pretty old commit, FYI

@jbrockmendel
Copy link
Member Author

After updating, I get the same timings too. But it looks like this may be due to a bug in master. Manually stepping through the benchmark, timeseries.DatetimeIndex includes the code

rng8 = date_range(start='1/1/1700', freq='B', periods=100000)

which for me (py27, ubuntu...) is giving:

OverflowError: Python int too large to convert to C long
Exception OverflowError: 'Python int too large to convert to C long' in 'pandas._libs.tslib._delta_to_nanoseconds' ignored

and the resulting rng8 object is a DatetimeIndex object with 5 entries. Before anything else, can you reproduce this?

@jbrockmendel
Copy link
Member Author

Same on py35

@jbrockmendel
Copy link
Member Author

git bisect tells me the first bad commit was b59f107

@jreback
Copy link
Contributor

jreback commented Sep 23, 2017

yeah looks like this was a bug introduced there, hmm.

@jreback
Copy link
Contributor

jreback commented Sep 23, 2017

let me see if I can work around: #17637

@jbrockmendel
Copy link
Member Author

After #17637, benchmarks are now unaffected.

@jreback
Copy link
Contributor

jreback commented Sep 23, 2017

After #17637, benchmarks are now unaffected.

does this mean they are the same (worse) or the same as master?

@jbrockmendel
Copy link
Member Author

asv continuous -f 1.1 -E virtualenv master HEAD -b freq
[...]
BENCHMARKS NOT SIGNIFICANTLY CHANGED.

@jbrockmendel
Copy link
Member Author

Just pushed. Besides the requested docstring, had to edit the asv to avoid the new OverflowError.

@@ -56,7 +56,7 @@ def setup(self):
self.no_freq = self.rng7[:50000].append(self.rng7[50002:])
self.d_freq = self.rng7[:50000].append(self.rng7[50000:])

self.rng8 = date_range(start='1/1/1700', freq='B', periods=100000)
self.rng8 = date_range(start='1/1/1700', freq='B', periods=75000)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok great.

@jreback jreback added this to the 0.21.0 milestone Sep 23, 2017
@jreback
Copy link
Contributor

jreback commented Sep 23, 2017

lgtm. ping on green.

side note, I believe you can make _params readonly_cached if i read it correctly (another PR of course :>)

@jbrockmendel
Copy link
Member Author

side note, I believe you can make _params readonly_cached if i read it correctly (another PR of course :>)

That is pretty much the original motivation here. There are a couple more steps between here and there, since until we get kwds locked down (see WIP #17458) and make relevant attrs immutable, cache invalidation is a PITA.

@jreback jreback merged commit 2eb568a into pandas-dev:master Sep 23, 2017
@jreback
Copy link
Contributor

jreback commented Sep 23, 2017

thanks!

@jbrockmendel jbrockmendel deleted the bitesize_offsets branch October 30, 2017 16:23
alanbato pushed a commit to alanbato/pandas that referenced this pull request Nov 10, 2017
No-Stream pushed a commit to No-Stream/pandas that referenced this pull request Nov 28, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Clean Frequency DateOffsets Internals Related to non-user accessible pandas implementation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants