Support freq in DatetimeIndex #14593

Merged

Contributor

@shwina commented Dec 7, 2023

When a DatetimeIndex has a fixed frequency offset, pandas sets its .freq attribute by default. Because cuDF doesn't support that attribute, we raise in pandas-compatible mode.

As a result, working with datetimes is practically impossible in pandas-compatible mode, because so many datetime operations (resample, groupby) involve setting a datetime column as the index.

This PR adds rudimentary support for the freq attribute.

@github-actions bot added the Python (Affects Python cuDF API.) label Dec 7, 2023
Contributor

@bdice left a comment

Can we add some tests to this PR?

@@ -2142,6 +2141,8 @@ def __init__(
        if yearfirst is not False:
            raise NotImplementedError("yearfirst == True is not yet supported")

        self._freq = _validate_freq(freq)
Contributor

While looking into adding freq support before, I found that some APIs change freq to a new value and return new results. (I only vaguely remember, but I think that happens in binops?) Should we add a TODO comment here noting that this is not fully functional yet and that freq support needs to be added in the rest of the codebase?

Contributor Author

Yes, although maybe the default behaviour could be for DatetimeIndex to infer freq from its values. Then this should just work.

Also, we should probably only do that in compatibility mode for perf reasons.
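
To make the "infer freq from its values" idea concrete, here is a minimal standard-library sketch. The helper name `infer_fixed_freq` is hypothetical and this is not cuDF's or pandas' implementation; it only shows that inference means checking every consecutive gap, which is the O(n) cost behind the perf concern above.

```python
from datetime import datetime, timedelta
from typing import Optional, Sequence


def infer_fixed_freq(values: Sequence[datetime]) -> Optional[timedelta]:
    """Return the common step between consecutive values, or None.

    Hypothetical helper for illustration only.
    """
    if len(values) < 3:
        # Too few points to call the spacing regular.
        return None
    step = values[1] - values[0]
    if step == timedelta(0):
        # Repeated timestamps do not form a usable frequency.
        return None
    # Every remaining gap must equal the first one.
    for prev, curr in zip(values[1:], values[2:]):
        if curr - prev != step:
            return None
    return step


hourly = [datetime(2023, 12, 7) + timedelta(hours=i) for i in range(4)]
print(infer_fixed_freq(hourly))  # 1:00:00
```

An irregular sequence yields None, which would correspond to an index with no freq.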

@galipremsagar self-assigned this Dec 7, 2023
@galipremsagar marked this pull request as ready for review December 7, 2023 22:08
@galipremsagar requested a review from a team as a code owner December 7, 2023 22:08
@galipremsagar added the improvement (Improvement / enhancement to an existing function) and non-breaking (Non-breaking change) labels Dec 7, 2023
}
)
),
reason="Nanosecond offsets being dropped by pandas, which is "
Contributor

Is this better solved by fixing the condition on the parameter, which should be "pandas < 2.0"?

https://github.com/shwina/cudf/blob/ed3ba3ff17cf686d1e6e38f01073d27b1be64799/python/cudf/cudf/tests/test_datetime.py#L1512

Contributor

I wanted to do that, but the failure happens only for a few parameter combinations, and we currently treat xpass/xfail strictly. That's the reason for the current approach.

Contributor

I know we have two diverging approaches in the same place, but I plan on dropping these in the pandas-2.0 feature branch.

Contributor

Okay. We can clean it up later.

@@ -463,13 +463,19 @@ class DateOffset:
    }

    _CODES_TO_UNITS = {
        "N": "nanoseconds",
Contributor

@bdice Dec 7, 2023

I have some vague recollection that we left these out on purpose... hmm. I think there was some pandas behavior for which "L" and "ms" were okay but "N", "U", "T", etc. were not supported. We'd probably be able to tell if there are any newly failing pandas tests? I'd just check to see where _CODES_TO_UNITS is used and if there are any inconsistencies with this across different APIs.
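
For context on what a table like `_CODES_TO_UNITS` is used for, the sketch below maps a frequency string such as "5T" to a unit name and a count. Only the `"N": "nanoseconds"` entry comes from the diff above. The other entries follow pandas' legacy offset aliases and are assumed here for illustration, and `CODES_TO_UNITS` and `parse_freq` are hypothetical names, not cuDF's API.

```python
import re

# Illustrative subset; only "N" is taken from the diff in this PR.
CODES_TO_UNITS = {
    "N": "nanoseconds",
    "U": "microseconds",
    "L": "milliseconds",
    "ms": "milliseconds",
    "S": "seconds",
    "T": "minutes",
    "H": "hours",
    "D": "days",
}

# Optional integer multiplier followed by an alphabetic code, e.g. "5T".
_FREQ_RE = re.compile(r"(\d*)([A-Za-z]+)")


def parse_freq(freq: str) -> dict:
    """Turn a string like "5T" into {"minutes": 5} (hypothetical helper)."""
    match = _FREQ_RE.fullmatch(freq)
    if match is None:
        raise ValueError(f"Invalid frequency string: {freq!r}")
    count, code = match.groups()
    if code not in CODES_TO_UNITS:
        raise ValueError(f"Unsupported frequency code: {code!r}")
    # A missing multiplier means a count of one.
    return {CODES_TO_UNITS[code]: int(count) if count else 1}


print(parse_freq("5T"))  # {'minutes': 5}
print(parse_freq("N"))   # {'nanoseconds': 1}
```

If a code is missing from the table, it is rejected outright, so any API that shares the table changes behavior when entries are added. That is the kind of inconsistency the comment above suggests checking for.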

Contributor

There were a bunch of failing tests without these changes; with these units added, the cudf pytests pass.

There is only a slight increase in pandas-pytest failures:

# This PR:
= 12094 failed, 174794 passed, 3850 skipped, 3314 xfailed, 8 xpassed, 21406 warnings, 102 errors in 1516.39s (0:25:16) =

# `branch-24.02`:
= 11607 failed, 175286 passed, 3849 skipped, 3312 xfailed, 11 xpassed, 21414 warnings, 97 errors in 1493.35s (0:24:53) =
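
To quantify "slight", the two summary lines can be compared directly. The snippet below is only a convenience sketch; `parse_summary` is a hypothetical helper, and the strings are copied from the summaries above.

```python
import re

# Summary lines copied from the two pytest runs quoted above.
THIS_PR = (
    "= 12094 failed, 174794 passed, 3850 skipped, 3314 xfailed, "
    "8 xpassed, 21406 warnings, 102 errors in 1516.39s (0:25:16) ="
)
BASE = (
    "= 11607 failed, 175286 passed, 3849 skipped, 3312 xfailed, "
    "11 xpassed, 21414 warnings, 97 errors in 1493.35s (0:24:53) ="
)


def parse_summary(line: str) -> dict:
    """Extract "<count> <outcome>" pairs from a pytest summary line."""
    pairs = re.findall(r"\b(\d+) ([a-z]+)\b", line)
    return {name: int(count) for count, name in pairs}


pr, base = parse_summary(THIS_PR), parse_summary(BASE)
delta = {name: pr[name] - base[name] for name in pr}
print(delta["failed"], delta["passed"])  # 487 -492
```

So this PR has 487 more failures and 492 fewer passes than `branch-24.02`, out of roughly 187,000 failed-or-passed tests.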

Contributor

Sounds good, thanks for checking.

Contributor

@bdice left a comment

Approving with a few final comments.

Co-authored-by: Bradley Dice <bdice@bradleydice.com>
Contributor

@wence- left a comment

I am confused by some of the validation steps.

Contributor

@bdice left a comment

Fix the repr -- then this is good from my side.

@galipremsagar
Contributor

/merge

@rapids-bot rapids-bot bot merged commit a9dc521 into rapidsai:branch-24.02 Dec 12, 2023
67 checks passed
Labels
improvement Improvement / enhancement to an existing function non-breaking Non-breaking change Python Affects Python cuDF API.
4 participants