
REF/ENH: Constructors for DatetimeArray/TimedeltaArray #23493

Closed · wants to merge 20 commits
60 changes: 60 additions & 0 deletions pandas/core/arrays/datetimelike.py
@@ -801,6 +801,25 @@ def _evaluate_compare(self, other, op):
# -------------------------------------------------------------------
# Shared Constructor Helpers

def scalar_data_error(scalar, cls):
"""
Produce the error message to issue when raising a TypeError if a scalar
is passed to an array constructor.

Parameters
----------
scalar : object
cls : class

Returns
-------
message : str
"""
return ('{cls}() must be called with a '
'collection of some kind, {data} was passed'
.format(cls=cls.__name__, data=repr(scalar)))
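The helper above can be exercised standalone. In this sketch the `DatetimeArray` stand-in class is purely illustrative; only the message formatting mirrors the diff:

```python
# Standalone sketch of scalar_data_error; DatetimeArray here is a dummy
# stand-in class, not the real pandas array.
def scalar_data_error(scalar, cls):
    # Build the TypeError message for a scalar passed to an array constructor.
    return ('{cls}() must be called with a '
            'collection of some kind, {data} was passed'
            .format(cls=cls.__name__, data=repr(scalar)))


class DatetimeArray:  # dummy stand-in for illustration
    pass


msg = scalar_data_error(5, DatetimeArray)
```

The caller is expected to wrap this in `raise TypeError(...)`, as the DatetimeArray/TimedeltaArray `__new__` hunks below do.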


def validate_periods(periods):
"""
If a `periods` argument is passed to the Datetime/Timedelta Array/Index
@@ -888,6 +907,47 @@ def maybe_infer_freq(freq):
return freq, freq_infer


def maybe_define_freq(freq_infer, result):
"""
If appropriate, infer the frequency of the given Datetime/Timedelta Array
and pin it to the object at the end of the construction.

Parameters
----------
freq_infer : bool
result : DatetimeArray or TimedeltaArray

Notes
-----
This may alter `result` in-place, should only ever be called
from __new__/__init__.
"""
if freq_infer:
inferred = result.inferred_freq
if inferred:
result.freq = frequencies.to_offset(inferred)
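The pinning logic can be sketched with a dummy result object; here a plain string stands in for the `DateOffset` that `frequencies.to_offset` would normally produce, and all names are illustrative:

```python
class FakeArray:
    # Dummy stand-in for a Datetime/Timedelta array whose inferred_freq
    # has already been computed from the data.
    freq = None

    def __init__(self, inferred_freq):
        self.inferred_freq = inferred_freq


def maybe_define_freq(freq_infer, result):
    # Pin the inferred frequency only when inference was requested and
    # actually produced a result (the real code wraps it in to_offset()).
    if freq_infer:
        inferred = result.inferred_freq
        if inferred:
            result.freq = inferred


pinned = FakeArray('D')
maybe_define_freq(True, pinned)     # freq gets pinned
untouched = FakeArray('D')
maybe_define_freq(False, untouched)  # inference not requested; freq stays None
```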


def maybe_validate_freq(result, verify, freq, freq_infer, **kwargs):
"""
If a frequency was passed by the user and not inferred or extracted
from the underlying data, then validate that the data is consistent with
the user-provided frequency.

Parameters
----------
result : DatetimeIndex or TimedeltaIndex
verify : bool
freq : DateOffset or None
freq_infer : bool
**kwargs : arguments to pass to `_validate_frequency`
For DatetimeIndex this is just "ambiguous", empty for TimedeltaIndex
"""
if verify and len(result) > 0:
if freq is not None and not freq_infer:
result._validate_frequency(result, freq, **kwargs)
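The real check delegates to `result._validate_frequency`, which regenerates the expected range; as a rough illustration (hypothetical names, nanosecond integers standing in for timestamps, a fixed-tick `freq`), the consistency test amounts to:

```python
def maybe_validate_freq(values_i8, verify, freq_ns, freq_infer):
    # Only validate when asked to, and only when the freq was supplied by
    # the user rather than inferred from (or carried by) the data.
    if verify and len(values_i8) > 0:
        if freq_ns is not None and not freq_infer:
            if any(b - a != freq_ns for a, b in zip(values_i8, values_i8[1:])):
                raise ValueError('data does not conform to the passed frequency')


# Evenly spaced data conforming to the passed frequency: no error.
maybe_validate_freq([0, 10, 20], verify=True, freq_ns=10, freq_infer=False)

# Non-conforming data raises.
try:
    maybe_validate_freq([0, 10, 25], verify=True, freq_ns=10, freq_infer=False)
except ValueError as exc:
    caught = exc
```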


def validate_tz_from_dtype(dtype, tz):
"""
If the given dtype is a DatetimeTZDtype, extract the implied
46 changes: 37 additions & 9 deletions pandas/core/arrays/datetimes.py
@@ -119,7 +119,8 @@ def wrapper(self, other):
if isinstance(other, list):
# FIXME: This can break for object-dtype with mixed types
other = type(self)(other)
elif not isinstance(other, (np.ndarray, ABCIndexClass, ABCSeries)):
elif not isinstance(other, (np.ndarray, ABCIndexClass, ABCSeries,
DatetimeArrayMixin)):
# Following Timestamp convention, __eq__ is all-False
# and __ne__ is all True, others raise TypeError.
return ops.invalid_comparison(self, other, op)
@@ -170,6 +171,8 @@ class DatetimeArrayMixin(dtl.DatetimeLikeArrayMixin):
# Constructors

_attributes = ["freq", "tz"]
_freq = None
_tz = None

@classmethod
def _simple_new(cls, values, freq=None, tz=None, **kwargs):
@@ -193,11 +196,16 @@ def _simple_new(cls, values, freq=None, tz=None, **kwargs):
result._tz = timezones.tz_standardize(tz)
return result

def __new__(cls, values, freq=None, tz=None, dtype=None):
def __new__(cls, values, freq=None, tz=None, dtype=None, copy=False):
> **Member:** BTW, if you want to simplify this already a bit more, you could rename the current `__new__` to `__init__` (you might need some dummy init/new on Index overriding it to make sure it works with the current inheritance), and then we can already merge the `__init__` with `_simple_new`.

if isinstance(values, (list, tuple)) or is_object_dtype(values):
values = cls._from_sequence(values, copy=copy)
# TODO: Can we set copy=False here to avoid re-copying?
> **Contributor:** IIUC, then yes, you're OK setting copy=False here. By definition, the conversion to datetime64[ns] will involve a copy.

> **Member:** Further question: is it not (yet) possible to simply remove this case? (Eventually we should not call the DatetimeArray constructor with an array-like of scalars.)

> **Member Author:** Not if we want to share the extant arithmetic tests (which we do). As for eventually not calling the constructor with an array-like of scalars: I don't share this opinion, and would prefer to delay this discussion until it is absolutely necessary.

> **Member:** Then please raise this in the appropriate issue, as we have been discussing this before (I think it is #23212, although there is probably some more scattered discussion on other related PRs). It is here that you are redesigning the constructors for the array refactor, IIUC, so if there is a time we should discuss it, it is now, I think. Also, can you clarify "not if we want to share the extant arithmetic tests" a little bit? At what point do the arithmetic tests need to deal with arrays of objects? E.g. for boxing the constructed values into Series/Index/Array, a properly dtyped array can be used?

> **Member Author:** The pertinent word here is "extant". Many of the tests in tests/arithmetic pass a list into `tm.box_expected` or `klass`.

> **Contributor:** Ignoring the tests for a moment, I thought we were all on board with the goal of `DatetimelikeArray.__init__` being no inference and no copy. Back to the tests: it looks like you can add an entry to `box_expected` for DatetimeArray to return `expected = DatetimeArray._from_sequence(expected)`?

> **Member Author:** My comment to Joris below about mothballing this conversation applies. But the short answer is no: I did not get on board with that.


if tz is None and hasattr(values, 'tz'):
# e.g. DatetimeIndex
# e.g. DatetimeArray, DatetimeIndex
tz = values.tz

# TODO: what about if freq == 'infer'?
> **Member:** then we should also get the freq from the values as a "cheap" inference? Or would there be cases where an inferred frequency can be different from the actual frequency?

> **Member Author:** That's what I'm thinking, yeah.

if freq is None and hasattr(values, "freq"):
# i.e. DatetimeArray, DatetimeIndex
freq = values.freq
Expand All @@ -207,26 +215,46 @@ def __new__(cls, values, freq=None, tz=None, dtype=None):
# if dtype has an embedded tz, capture it
tz = dtl.validate_tz_from_dtype(dtype, tz)

if isinstance(values, DatetimeArrayMixin):
if lib.is_scalar(values):
raise TypeError(dtl.scalar_data_error(values, cls))
elif isinstance(values, ABCSeries):
> **Member:** I would get out the `_values`, and then treat that the same as directly passing a DatetimeIndex/DatetimeArray?

> **Member Author:** I'll see if there is a graceful way to do this in the next pass (if I ever manage to catch up with all these comments!)

# extract nanosecond unix timestamps
if tz is None:
# TODO: Try to do this in just one place
tz = values.dt.tz
values = np.array(values.view('i8'))
elif isinstance(values, DatetimeArrayMixin):
> **Contributor:** `DatetimeArrayMixin` -> `cls`?

> **Member:** And you don't need to get the tz in this case?

> **Member Author:** No. For the moment we are still using inheritance, so this would mess up for DatetimeIndex == DatetimeArray. When we change to composition, this check will have to become `isinstance(values, (DatetimeArray, ABCDatetimeIndex))`.

# extract nanosecond unix timestamps
values = values.asi8

if values.dtype == 'i8':
values = values.view('M8[ns]')

assert isinstance(values, np.ndarray), type(values)
assert is_datetime64_dtype(values) # not yet assured nanosecond
values = conversion.ensure_datetime64ns(values, copy=False)
values = conversion.ensure_datetime64ns(values, copy=copy)

result = cls._simple_new(values, freq=freq, tz=tz)
if freq_infer:
inferred = result.inferred_freq
if inferred:
result.freq = to_offset(inferred)
dtl.maybe_define_freq(freq_infer, result)

# NB: Among other things not yet ported from the DatetimeIndex
# constructor, this does not call _deepcopy_if_needed
return result

@classmethod
def _from_sequence(cls, scalars, dtype=None, copy=False):
# list, tuple, or object-dtype ndarray/Index
> **Contributor:** why do you need to turn into an object array here? `to_datetime` handles all of these cases.

> **Member Author:** You're right, we could make do without it. I like doing this explicitly because `to_datetime` is already overloaded and circular.

> **Contributor:** this is horribly inefficient and unnecessary

> **Member Author:** If we don't do it here, `to_datetime` is going to do this. It may be unnecessary, but it is not horribly inefficient. What is a code smell is the circularity involved in calling `to_datetime`.

> **Contributor:** then just call `array_to_datetime` and don't force the conversion to array.

> **Contributor:** So is the root problem (referenced in your "circularity" comment, and down below in `TimedeltaIndex.__new__`) that `to_datetime` / `to_timedelta` returns an Index instead of an EA? Could we have the public `to_datetime` just be a simple
>
>     array = _to_datetime(...)
>     return DatetimeIndex(array)
>
> so the internal `_to_datetime` returns the array?

> **Member Author:** It's not the fact that it's an Index so much as that it is a circular dependency. I think I can resolve this in an upcoming commit.
>
> > Looking through `to_datetime` and `_convert_listlike_datetimes`, I don't see a conversion to `ndarray[object]`.
>
> `_convert_listlike_datetimes` calls `ensure_object`.
>
> > Sorry, `to_datetime` has an intermediate datetime64[ns] -> object -> datetime64[ns] conversion? That seems unnecessary.
>
> Not sure what you're referring to. As implemented, `_from_sequence` is specifically for list, tuple, or object-dtype ndarray/Index; datetime64-dtype goes through a different path.

> **Contributor:** "`_convert_listlike_datetimes` calls `ensure_object`" — that's after
>
>     # these are shortcutable
>     if is_datetime64tz_dtype(arg):
>         if not isinstance(arg, DatetimeIndex):
>             return DatetimeIndex(arg, tz=tz, name=name)
>         if tz == 'utc':
>             arg = arg.tz_convert(None).tz_localize(tz)
>         return arg
>
>     elif is_datetime64_ns_dtype(arg):
>         if box and not isinstance(arg, DatetimeIndex):
>             try:
>                 return DatetimeIndex(arg, tz=tz, name=name)
>             except ValueError:
>                 pass
>
>         return arg
>
> So those both avoid conversion to object.

> **Contributor:** `ExtensionArray._from_sequence` is for any sequence of scalar objects, including an ndarray with a specialized type like datetime64[ns]. It'll be used, for example, in `factorize(ExtensionArray)`.

> **Member Author:** @TomAugspurger thank you for clarifying; I was under the mistaken impression that it was specifically list/tuple/object-dtype. Are there any restrictions on kwargs that can be added to it? In particular I'm thinking of `freq` and `tz`.

values = np.array(scalars, dtype=np.object_, copy=copy)
if values.ndim != 1:
raise TypeError("Values must be 1-dimensional")

# TODO: See if we can decrease circularity
from pandas.core.tools.datetimes import to_datetime
values = to_datetime(values)

# pass dtype to constructor in order to convert timezone if necessary
return cls(values, dtype=dtype)
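A rough standalone analogue of these two steps (coerce to a 1-D object sequence, then parse each scalar), with `datetime.fromisoformat` standing in for `to_datetime`; the `from_sequence` name and everything else here is illustrative:

```python
# Sketch only: fromisoformat plays the role of to_datetime, and a plain
# list plays the role of the object-dtype ndarray.
from datetime import datetime


def from_sequence(scalars):
    values = list(scalars)  # analogue of np.array(scalars, dtype=object)
    return [datetime.fromisoformat(v) if isinstance(v, str) else v
            for v in values]


parsed = from_sequence(['2018-01-01', '2018-01-02'])
```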

@classmethod
def _generate_range(cls, start, end, periods, freq, tz=None,
normalize=False, ambiguous='raise', closed=None):
4 changes: 3 additions & 1 deletion pandas/core/arrays/period.py
@@ -165,7 +165,9 @@ class PeriodArray(dtl.DatetimeLikeArrayMixin, ExtensionArray):

# --------------------------------------------------------------------
# Constructors
def __init__(self, values, freq=None, copy=False):
def __init__(self, values, freq=None, dtype=None, copy=False):
freq = dtl.validate_dtype_freq(dtype, freq)

if freq is not None:
freq = Period._maybe_convert_freq(freq)

88 changes: 73 additions & 15 deletions pandas/core/arrays/timedeltas.py
@@ -3,23 +3,22 @@

import numpy as np

from pandas._libs import tslibs
from pandas._libs import tslibs, lib, algos
from pandas._libs.tslibs import Timedelta, Timestamp, NaT
from pandas._libs.tslibs.fields import get_timedelta_field
from pandas._libs.tslibs.timedeltas import array_to_timedelta64

from pandas import compat

from pandas.core.dtypes.common import (
_TD_DTYPE, is_list_like)
_TD_DTYPE, is_list_like, is_object_dtype, is_timedelta64_dtype)
from pandas.core.dtypes.generic import ABCSeries
from pandas.core.dtypes.missing import isna

import pandas.core.common as com
from pandas.core.algorithms import checked_add_with_arr

from pandas.tseries.offsets import Tick
from pandas.tseries.frequencies import to_offset

from . import datetimelike as dtl

@@ -112,9 +111,7 @@ def dtype(self):

@classmethod
def _simple_new(cls, values, freq=None, dtype=_TD_DTYPE):
# `dtype` is passed by _shallow_copy in corner cases, should always
# be timedelta64[ns] if present
assert dtype == _TD_DTYPE
_require_m8ns_dtype(dtype)
assert isinstance(values, np.ndarray), type(values)

if values.dtype == 'i8':
@@ -127,22 +124,48 @@ def _simple_new(cls, values, freq=None, dtype=_TD_DTYPE):
result._freq = freq
return result

def __new__(cls, values, freq=None):
def __new__(cls, values, freq=None, dtype=_TD_DTYPE, copy=False):
_require_m8ns_dtype(dtype)

if isinstance(values, (list, tuple)) or is_object_dtype(values):
values = cls._from_sequence(values, copy=copy)._data
> **Member:** you cannot return here directly?

# TODO: can we set copy=False to avoid re-copying?

freq, freq_infer = dtl.maybe_infer_freq(freq)

values = np.array(values, copy=False)
if values.dtype == np.object_:
values = array_to_timedelta64(values)
if lib.is_scalar(values):
raise TypeError(dtl.scalar_data_error(values, cls))
elif isinstance(values, TimedeltaArrayMixin):
if freq is None and values.freq is not None:
freq = values.freq
freq_infer = False
values = values._data

result = cls._simple_new(values, freq=freq)
if freq_infer:
inferred = result.inferred_freq
if inferred:
result.freq = to_offset(inferred)
values = np.array(values, copy=copy)

if values.dtype == 'i8':
pass
elif not is_timedelta64_dtype(values):
raise TypeError(values.dtype)
elif values.dtype != _TD_DTYPE:
# i.e. non-nano unit
# TODO: use tslibs.conversion func? watch out for overflows
values = values.astype(_TD_DTYPE)

result = cls._simple_new(values, freq=freq)
dtl.maybe_define_freq(freq_infer, result)
return result
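The dtype branch above (i8 data passes through, non-timedelta dtypes raise, non-nano units are cast) can be sketched in isolation; `coerce_td64ns` is a hypothetical name and the `kind != 'm'` test stands in for `is_timedelta64_dtype`:

```python
import numpy as np

_TD_DTYPE = np.dtype('m8[ns]')


def coerce_td64ns(values):
    # i8 data passes through untouched; non-timedelta dtypes raise;
    # non-nano timedelta units are cast to timedelta64[ns].
    values = np.array(values, copy=False)
    if values.dtype == 'i8':
        return values
    elif values.dtype.kind != 'm':  # stand-in for is_timedelta64_dtype
        raise TypeError(values.dtype)
    elif values.dtype != _TD_DTYPE:
        values = values.astype(_TD_DTYPE)  # e.g. 'm8[s]' -> 'm8[ns]'
    return values


nano = coerce_td64ns(np.array([1, 2], dtype='m8[s]'))
```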

@classmethod
def _from_sequence(cls, scalars, dtype=_TD_DTYPE, copy=False):
# list, tuple, or object-dtype ndarray/Index
values = np.array(scalars, dtype=np.object_, copy=copy)
> **Contributor:** same

> **Member Author:** This doesn't call `to_timedelta`, so this does require that we pass an object array.

> **Contributor:** then it should; let's not reinvent the wheel.

> **Member Author:** No, it shouldn't. `to_timedelta` will just end up calling `array_to_timedelta64` like this does, but only after doing a bunch of unnecessary dtype checks. Besides, this is what `TimedeltaIndex.__new__` currently calls.

if values.ndim != 1:
raise TypeError("Values must be 1-dimensional")

result = array_to_timedelta64(values)
return cls(result, dtype=dtype)

@classmethod
def _generate_range(cls, start, end, periods, freq, closed=None):

@@ -180,6 +203,23 @@ def _generate_range(cls, start, end, periods, freq, closed=None):

return cls._simple_new(index, freq=freq)

# ----------------------------------------------------------------
# Array-Like Methods
# NB: these are appreciably less efficient than the TimedeltaIndex versions
> **Contributor:** Because of (lack of) caching? This comment makes it seem like it's slower in general, when (if it's caching) it's just slower on repeated use.

> **Member:** BTW (as mentioned elsewhere), I am not sure we should add them as public methods. If we do so, we should add them to all our EAs, or actually even to the EA interface, and not only to TimedeltaArray (or datetimelike arrays).

> **Member Author:** On adding them to the EA interface: I'm not necessarily opposed to this, but it isn't obvious to me. On caching: because the Index version defines monotonic_increasing, monotonic_decreasing, and is_unique in a single call via `_engine`.


@property
def is_monotonic_increasing(self):
return algos.is_monotonic(self.asi8, timelike=True)[0]

@property
def is_monotonic_decreasing(self):
return algos.is_monotonic(self.asi8, timelike=True)[1]

@property
def is_unique(self):
from pandas.core.algorithms import unique1d
return len(unique1d(self.asi8)) == len(self)
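Pure-NumPy equivalents of the three properties above, computed on the int64 view of the data (`asi8`); this only illustrates the semantics, since the pandas versions additionally handle NaT via `timelike=True`:

```python
import numpy as np

# A small i8 view with a duplicate, as might back a TimedeltaArray.
asi8 = np.array([10, 20, 20, 30], dtype=np.int64)

is_monotonic_increasing = bool(np.all(asi8[1:] >= asi8[:-1]))
is_monotonic_decreasing = bool(np.all(asi8[1:] <= asi8[:-1]))
is_unique = len(np.unique(asi8)) == len(asi8)
```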

# ----------------------------------------------------------------
# Arithmetic Methods

@@ -413,3 +453,21 @@ def _generate_regular_range(start, end, periods, offset):

data = np.arange(b, e, stride, dtype=np.int64)
return data


def _require_m8ns_dtype(dtype):
"""
`dtype` is included in the constructor signature for consistency with
DatetimeArray and PeriodArray, but only timedelta64[ns] is considered
valid. Raise if anything else is passed.

Parameters
----------
dtype : dtype

Raises
------
ValueError
"""
if dtype != _TD_DTYPE:
raise ValueError("Only timedelta64[ns] dtype is valid.", dtype)

> **Contributor:** AssertionError, no?

> **Member Author:** When called from `_simple_new` this is internal, so AssertionError would make sense, but it is also called from `__new__`, so it is in principle user-facing. Either way I need to add tests for this.

> **Contributor:** well, this should never happen; all conversions should be before this, so it should assert.

> **Member Author:** `dtype` is part of the signature of `TimedeltaArray.__new__`, which is/will be user-facing. If the user passes the wrong dtype, it's a ValueError.

> **Contributor:** no, my point is that there should be (and I think currently there already is) conversion. If it's wrong at this point, it's not a user error but an incorrect path taken.

> **Member Author:** My point is that this check function is called two times, one of which is the very first thing in `TimedeltaArray.__new__`.

> **Member:** Apart from the discussion above, is it worth having a 15-line function (including docstrings :-)) for a 2-liner used in two places? I would maybe simply leave it how it was; I think reading something like `assert dtype == _TD_DTYPE` in `TimedeltaArray._simple_new` is clearer than calling into a helper function.

> **Member Author:** Reasonable. But hey, it's a nice docstring.