Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF: add shortcut to Timestamp constructor #30676

Merged
merged 12 commits into from
Jan 26, 2020
19 changes: 19 additions & 0 deletions asv_bench/benchmarks/tslibs/timestamp.py
Original file line number Diff line number Diff line change
@@ -1,12 +1,19 @@
import datetime

import dateutil
import numpy as np
import pytz

from pandas import Timestamp


class TimestampConstruction:
def setup(self):
self.npdatetime64 = np.datetime64("2020-01-01 00:00:00")
self.dttime_unaware = datetime.datetime(2020, 1, 1, 0, 0, 0)
self.dttime_aware = datetime.datetime(2020, 1, 1, 0, 0, 0, 0, pytz.UTC)
self.ts = Timestamp("2020-01-01 00:00:00")

def time_parse_iso8601_no_tz(self):
Timestamp("2017-08-25 08:16:14")

Expand All @@ -28,6 +35,18 @@ def time_fromordinal(self):
def time_fromtimestamp(self):
Timestamp.fromtimestamp(1515448538)

def time_from_npdatetime64(self):
Timestamp(self.npdatetime64)

def time_from_datetime_unaware(self):
Timestamp(self.dttime_unaware)

def time_from_datetime_aware(self):
Timestamp(self.dttime_aware)

def time_from_pd_timestamp(self):
Timestamp(self.ts)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbrockmendel
Was this what you had in mind? I think we should leave only the scalar transformer for benchmarking this shortcut. Don't see much sense in adding the benchmarks for transforming a Series.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks good

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perfect, thanks



class TimestampProperties:
_tzs = [None, pytz.timezone("Europe/Amsterdam"), pytz.UTC, dateutil.tz.tzutc()]
Expand Down
2 changes: 2 additions & 0 deletions doc/source/whatsnew/v1.1.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -109,7 +109,9 @@ Deprecations

Performance improvements
~~~~~~~~~~~~~~~~~~~~~~~~

- Performance improvement in :class:`Timedelta` constructor (:issue:`30543`)
- Performance improvement in :class:`Timestamp` constructor (:issue:`30543`)
-
-

Expand Down
13 changes: 12 additions & 1 deletion pandas/_libs/tslibs/timestamps.pyx
Original file line number Diff line number Diff line change
Expand Up @@ -391,7 +391,18 @@ class Timestamp(_Timestamp):
# User passed tzinfo instead of tz; avoid silently ignoring
tz, tzinfo = tzinfo, None

if isinstance(ts_input, str):
# GH 30543 if pd.Timestamp already passed, return it
# check that only ts_input is passed
# checking verbosely, because cython doesn't optimize
# list comprehensions (as of cython 0.29.x)
if (isinstance(ts_input, Timestamp) and freq is None and
tz is None and unit is None and year is None and
month is None and day is None and hour is None and
minute is None and second is None and
microsecond is None and nanosecond is None and
tzinfo is None):
Copy link
Member Author

@AlexKirko AlexKirko Jan 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbrockmendel This appears the way to go if we need maximum performance.
We do lose a bit of speed (between 10 and 20 percent) because we implement the shortcut after errorchecks and _date_attributes.
Is this fine or should we hoist it after (or before) _date_attributes = [year, month, day, hour, minute, second, microsecond, nanosecond]? I think the current way is tidier, but it's a tradeoff.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure all of these extra checks are worth adding to improve perf when passing a Timestamp to a Timestamp constructor - @AlexKirko how often are you expecting that to actually happen?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how often are you expecting that to actually happen?

For internal usage there are a lot of places where we do:

if isinstance(obj, (datetime, np.datetime64)):
    obj = Timestamp(obj)

That's not exactly the usage being checked here, but could benefit in the same way from an optimized no-other-arguments-passed check.

Copy link
Member Author

@AlexKirko AlexKirko Jan 17, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Admittedly, I don't know the library well enough to comment on internal usage, but @jbrockmendel has already done that.

However, what I've done repeatedly in my own projects when on a deadline is take a Dataframe column or a list and just cast the elements into the type I need, trusting that if it already is that class, the performance loss won't be noticeable in the larger program (I do lots of ad-hoc data science modeling). I don't think it's as much of a problem in production-quality code, but a lot of people I work with use pandas to quickly preprocess data for sklearn.

You tend to rely on being able to cut corners when working with a well-supported package, and, currently, calling Timestamp on a Timestamp is more than 10 000 times slower than the proposed shortcut, which can be a nasty shock for someone expecting to just blitz through type conversions during data wrangling.

@WillAyd We could make the code a bit more compact with what was proposed in 4c9eb70, which you can look up above, but this incurs about a 25% performance loss. I believe that if we do the shortcut, might as well add extra two lines. It's a bit grating but the gain is worth it, I think.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a comment to the effect of "we do this verbose thing because cython wont optimize a list comprehension (as of cython 0.29.x)"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbrockmendel Added the comment you requested.

return ts_input
elif isinstance(ts_input, str):
# User passed a date string to parse.
# Check that the user didn't also pass a date attribute kwarg.
if any(arg is not None for arg in _date_attributes):
Expand Down
7 changes: 7 additions & 0 deletions pandas/tests/indexes/datetimes/test_constructors.py
Original file line number Diff line number Diff line change
Expand Up @@ -957,3 +957,10 @@ def test_timedelta_constructor_identity():
expected = pd.Timedelta(np.timedelta64(1, "s"))
result = pd.Timedelta(expected)
assert result is expected


def test_timestamp_constructor_identity():
# Test for #30543
expected = pd.Timestamp("2017-01-01T12")
result = pd.Timestamp(expected)
assert result is expected
12 changes: 1 addition & 11 deletions pandas/tests/indexes/datetimes/test_timezones.py
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,6 @@
Tests for DatetimeIndex timezone-related methods
"""
from datetime import date, datetime, time, timedelta, tzinfo
from distutils.version import LooseVersion

import dateutil
from dateutil.tz import gettz, tzlocal
Expand All @@ -11,7 +10,6 @@
import pytz

from pandas._libs.tslibs import conversion, timezones
from pandas.compat._optional import _get_version
import pandas.util._test_decorators as td

import pandas as pd
Expand Down Expand Up @@ -583,15 +581,7 @@ def test_dti_construction_ambiguous_endpoint(self, tz):
["US/Pacific", "shift_forward", "2019-03-10 03:00"],
["dateutil/US/Pacific", "shift_forward", "2019-03-10 03:00"],
["US/Pacific", "shift_backward", "2019-03-10 01:00"],
pytest.param(
"dateutil/US/Pacific",
"shift_backward",
"2019-03-10 01:00",
marks=pytest.mark.xfail(
LooseVersion(_get_version(dateutil)) < LooseVersion("2.7.0"),
reason="GH 31043",
),
),
["dateutil/US/Pacific", "shift_backward", "2019-03-10 01:00"],
Copy link
Member Author

@AlexKirko AlexKirko Jan 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shortcut, along with the fix to #24329 makes this exception no longer necessary, as a correct value gets returned without an error.
UPDATE: #31155 made the check succeed with dateutil.__version__ >= 2.7.0 . With this shortcut, the exception is not necessary with any version of dateutil. What happened before was that during making a date_range we would call pd.Timestamp twice and this would alter the object in case of a dateutil timezone near a winter/summer DST switch, which would make the test fail. #31155 made sure that the object didn't get altered with an updated version of dateutil and this shortcut eliminates the danger of the Timestamp being altered altogether, because the shortcut simply returns that same object.

["US/Pacific", timedelta(hours=1), "2019-03-10 03:00"],
],
)
Expand Down