Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: Add CalendarDay ('CD') offset #22288

Merged
merged 39 commits into from
Sep 7, 2018

Conversation

mroeschke
Copy link
Member

@mroeschke mroeschke commented Aug 12, 2018

Adds CalendarDay and 'CD' offset and alias respectively and changes Day ('D') in date_range to respect calendar day.

With regards to CalendarDay added tests for:

  • CalendarDay arithmetic with itself, datetimes, and timedeltas (which should raise)
  • 'CD' with use in date_range and timedelta_range (which should raise)
  • resampling with 'CD'

Also refactored _generate_range for date_range since we no longer have to check for Day

@codecov
Copy link

codecov bot commented Aug 14, 2018

Codecov Report

Merging #22288 into master will increase coverage by <.01%.
The diff coverage is 93.02%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #22288      +/-   ##
==========================================
+ Coverage   92.04%   92.05%   +<.01%     
==========================================
  Files         169      169              
  Lines       50782    50787       +5     
==========================================
+ Hits        46742    46751       +9     
+ Misses       4040     4036       -4
Flag Coverage Δ
#multiple 90.46% <93.02%> (ø) ⬆️
#single 42.29% <76.74%> (-0.02%) ⬇️
Impacted Files Coverage Δ
pandas/core/indexes/period.py 93.5% <ø> (ø) ⬆️
pandas/core/indexes/interval.py 94.16% <ø> (ø) ⬆️
pandas/core/indexes/timedeltas.py 91.16% <ø> (ø) ⬆️
pandas/core/arrays/datetimes.py 96.6% <100%> (+1.07%) ⬆️
pandas/core/indexes/datetimes.py 95.65% <50%> (+0.1%) ⬆️
pandas/tseries/offsets.py 97.03% <90.9%> (-0.12%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e2e1a10...9ed681c. Read the comment docs.

@mroeschke
Copy link
Member Author

Should be ready for a look @jbrockmendel

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks pretty good. I think need a bit of docs in timeseries.rst e.g. about when to use CD.

@@ -3174,3 +3180,60 @@ def test_last_week_of_month_on_offset():
slow = (ts + offset) - offset == ts
fast = offset.onOffset(ts)
assert fast == slow


@pytest.mark.parametrize('box, assert_func', [
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you use tm.assert_equal instead

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am testing 2 array like and 1 scalar here, and tm.assert_equal is not defined for scalars:

https://github.com/pandas-dev/pandas/blob/master/pandas/util/testing.py#L1497

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you then split this test , it is very confusing

@@ -2,6 +2,7 @@
from datetime import date, datetime, timedelta

import pytest
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a fixture that need to add CalendarDay, which systematically tests lots of things

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will exisit here since CalendarDay is in offsets.__all__

@pytest.fixture(params=[getattr(offsets, o) for o in offsets.__all__])
def offset_types(request):

"""
Apply scalar arithmetic with CalendarDay offset. Incoming datetime
objects can be tz-aware or naive.
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about adding a CD and a D which crosses dst?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A test with CD that crosses dst exists here:

def test_CalendarDay_with_timezone(box, assert_func):

A test with D that crosses dst exists here:

Day: ['11/4/2012', '11/4/2012 23:00']}.items()

@mroeschke
Copy link
Member Author

Looks like I am running into a Windows/platform specific bug:

On OSX

In [3]: ts = date_range('1/1/2000', '4/1/2000', tz=tzlocal())

In [4]: ts.to_period()
Out[4]:
PeriodIndex(['2000-01-01', '2000-01-02', '2000-01-03', '2000-01-04',
             '2000-01-05', '2000-01-06', '2000-01-07', '2000-01-08',
             '2000-01-09', '2000-01-10', '2000-01-11', '2000-01-12',
             '2000-01-13', '2000-01-14', '2000-01-15', '2000-01-16',
             '2000-01-17', '2000-01-18', '2000-01-19', '2000-01-20',
             '2000-01-21', '2000-01-22', '2000-01-23', '2000-01-24',
             '2000-01-25', '2000-01-26', '2000-01-27', '2000-01-28',
             '2000-01-29', '2000-01-30', '2000-01-31', '2000-02-01',
             '2000-02-02', '2000-02-03', '2000-02-04', '2000-02-05',
             '2000-02-06', '2000-02-07', '2000-02-08', '2000-02-09',
             '2000-02-10', '2000-02-11', '2000-02-12', '2000-02-13',
             '2000-02-14', '2000-02-15', '2000-02-16', '2000-02-17',
             '2000-02-18', '2000-02-19', '2000-02-20', '2000-02-21',
             '2000-02-22', '2000-02-23', '2000-02-24', '2000-02-25',
             '2000-02-26', '2000-02-27', '2000-02-28', '2000-02-29',
             '2000-03-01', '2000-03-02', '2000-03-03', '2000-03-04',
             '2000-03-05', '2000-03-06', '2000-03-07', '2000-03-08',
             '2000-03-09', '2000-03-10', '2000-03-11', '2000-03-12',
             '2000-03-13', '2000-03-14', '2000-03-15', '2000-03-16',
             '2000-03-17', '2000-03-18', '2000-03-19', '2000-03-20',
             '2000-03-21', '2000-03-22', '2000-03-23', '2000-03-24',
             '2000-03-25', '2000-03-26', '2000-03-27', '2000-03-28',
             '2000-03-29', '2000-03-30', '2000-03-31', '2000-04-01'],
            dtype='period[D]', freq='D')

But on the AppVeyor test (Win32)

PeriodIndex(['2000-01-01', '2000-01-02', '2000-01-03', '2000-01-04',
                    '2000-01-05', '2000-01-06', '2000-01-07', '2000-01-08',
                    '2000-01-09', '2000-01-10', '2000-01-11', '2000-01-12',
                    '2000-01-13', '2000-01-14', '2000-01-15', '2000-01-16',
                    '2000-01-17', '2000-01-18', '2000-01-19', '2000-01-20',
                    '2000-01-21', '2000-01-22', '2000-01-23', '2000-01-24',
                    '2000-01-25', '2000-01-26', '2000-01-27', '2000-01-28',
                    '2000-01-29', '2000-01-30', '2000-01-31', '2000-02-01',
                    '2000-02-02', '2000-02-03', '2000-02-04', '2000-02-05',
                    '2000-02-06', '2000-02-07', '2000-02-08', '2000-02-09',
                    '2000-02-10', '2000-02-11', '2000-02-12', '2000-02-13',
                    '2000-02-14', '2000-02-15', '2000-02-16', '2000-02-17',
                    '2000-02-18', '2000-02-19', '2000-02-20', '2000-02-21',
                    '2000-02-22', '2000-02-23', '2000-02-24', '2000-02-25',
                    '2000-02-26', '2000-02-27', '2000-02-28', '2000-02-29',
                    '2000-03-01', '2000-03-02', '2000-03-03', '2000-03-04',
                    '2000-03-05', '2000-03-06', '2000-03-07', '2000-03-08',
                    '2000-03-09', '2000-03-10', '2000-03-11', '2000-03-12',
                    '2000-03-13', '2000-03-14', '2000-03-15', '2000-03-16',
                    '2000-03-17', '2000-03-18', '2000-03-19', '2000-03-20',
                    '2000-03-21', '2000-03-22', '2000-03-23', '2000-03-24',
                    '2000-03-25', '2000-03-26', '2000-03-27', '2000-03-28',
                    '2000-03-29', '2000-03-30', '2000-03-31'],
                   dtype='period[D]', freq='D')

Unfortunately I can't reproduce this locally. Guessing it may due to either:

  1. tzlocal
  2. Some calculation is Windows 32 bit specific.

@jbrockmendel
Copy link
Member

This may be a hassle to check, but does ts match before the to_period call?

@mroeschke
Copy link
Member Author

Hmm not sure I quite understand @jbrockmendel. Check that ts matches what before the to_period call?

@jbrockmendel
Copy link
Member

Check that ts matches what before the to_period call?

The issue is that ts.to_period('D') on Appveyor does not ts.to_period('D') locally right? So my question was if ts on Appveyor matches ts locally.

@mroeschke
Copy link
Member Author

@jreback all green.

index = cls._simple_new(arr.astype('M8[ns]'), freq=None, tz=tz)
else:
# Create a linearly spaced date_range in local time
arr = np.linspace(start.value, end.value, periods)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you do this (does it work?)

end = end.tz_localize(tz, ambiguous=False)
if tz is not None:
# Localize the start and end arguments
if start is not None and start.tz is None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you could use _maybe_localize_point(start, getattr(start, 'tz' None)) ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm I think that should work!

@mroeschke
Copy link
Member Author

@jreback Addressed your comments and all green.

# 1) freq = a Timedelta-like frequency (Tick)
# 2) freq = None i.e. generating a linspaced range
if isinstance(freq, Tick) or freq is None:
localize_args = {'tz': tz, 'ambiguous': False}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think its worthile to add freq to _maybe_localize_point then you don;t need localize_args; it will be inside that function

else:
# Create a linearly spaced date_range in local time
arr = np.linspace(start.value, end.value, periods)
index = cls._simple_new(arr.astype('M8[ns]'), freq=None, tz=tz)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can add copy=False I think

Returns
-------
ts : Timestamp
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

e.g. figure out localize_args here (from a passed freq)

@pep8speaks
Copy link

pep8speaks commented Sep 5, 2018

Hello @mroeschke! Thanks for updating the PR.

Line 389:17: W504 line break after binary operator
Line 390:17: W504 line break after binary operator

Comment last updated on September 05, 2018 at 16:52 Hours UTC

@mroeschke
Copy link
Member Author

@jreback made the requested changes and all green

@jreback jreback merged commit 05a7229 into pandas-dev:master Sep 7, 2018
@jreback
Copy link
Contributor

jreback commented Sep 7, 2018

thanks! @mroeschke nice addition

@jorisvandenbossche
Copy link
Member

Sorry that I did not comment earlier (also didn't read all review comments), but I think this is a much bigger break in behaviour than we think / is documented.

This completely changes the default behaviour of date_range when using tz aware dates:

Now on master:

In [4]: pd.date_range("2016-10-30", periods=4, tz='Europe/Helsinki')
Out[4]: 
DatetimeIndex(['2016-10-30 00:00:00+03:00', '2016-10-30 23:00:00+02:00',
               '2016-10-31 23:00:00+02:00', '2016-11-01 23:00:00+02:00'],
              dtype='datetime64[ns, Europe/Helsinki]', freq='D')

Previously:

In [6]: pd.date_range("2016-10-30", periods=4, tz='Europe/Helsinki')
Out[6]: 
DatetimeIndex(['2016-10-30 00:00:00+03:00', '2016-10-31 00:00:00+02:00',
               '2016-11-01 00:00:00+02:00', '2016-11-02 00:00:00+02:00'],
              dtype='datetime64[ns, Europe/Helsinki]', freq='D')

This also changed the inferring of freq (now '24H' instead of 'D'), changes the behaviour of resample('D'), ...

@jorisvandenbossche
Copy link
Member

(I would personally revert this PR until we have better discussed it)

@mroeschke
Copy link
Member Author

I left a #22274 (comment) with the rational.

In terms of documentation, there's a section in the whatsnew v0.24 that notes this API change:

.. _whatsnew_0240.api_breaking.calendarday:
CalendarDay Offset
^^^^^^^^^^^^^^^^^^
:class:`Day` and associated frequency alias ``'D'`` were documented to represent
a calendar day; however, arithmetic and operations with :class:`Day` sometimes
respected absolute time instead (i.e. ``Day(n)`` and acted identically to ``Timedelta(days=n)``).
*Previous Behavior*:
.. code-block:: ipython
In [2]: ts = pd.Timestamp('2016-10-30 00:00:00', tz='Europe/Helsinki')
# Respects calendar arithmetic
In [3]: pd.date_range(start=ts, freq='D', periods=3)
Out[3]:
DatetimeIndex(['2016-10-30 00:00:00+03:00', '2016-10-31 00:00:00+02:00',
'2016-11-01 00:00:00+02:00'],
dtype='datetime64[ns, Europe/Helsinki]', freq='D')
# Respects absolute arithmetic
In [4]: ts + pd.tseries.frequencies.to_offset('D')
Out[4]: Timestamp('2016-10-30 23:00:00+0200', tz='Europe/Helsinki')
:class:`CalendarDay` and associated frequency alias ``'CD'`` are now available
and respect calendar day arithmetic while :class:`Day` and frequency alias ``'D'``
will now respect absolute time (:issue:`22274`, :issue:`20596`, :issue:`16980`, :issue:`8774`)
See the :ref:`documentation here <timeseries.dayvscalendarday>` for more information.
Addition with :class:`CalendarDay` across a daylight savings time transition:
.. ipython:: python
ts = pd.Timestamp('2016-10-30 00:00:00', tz='Europe/Helsinki')
ts + pd.offsets.Day(1)
ts + pd.offsets.CalendarDay(1)

And new reference in timeseries.rst that discusses the difference between Day and CalendarDay

https://github.com/pandas-dev/pandas/blob/master/doc/source/timeseries.rst#day-vs-calendarday

@jorisvandenbossche
Copy link
Member

Yeah, there were certainly nice new docs about the CalendarDay, but the illustration of the changed behaviour was only for, IMO, a minor use case of arithmetic with offset objects, while the more relevant user-facing changes as date_range and resample were not mentioned.

@mroeschke
Copy link
Member Author

That's a fair concern; I can certainly give more attention to this impact on date_range and resample in the whatsnew and the CalendarDay section in timeseries.rst

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

API: Add Calendar Day Frequency ("CD")
7 participants