Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support timezone aware pandas inputs in cudf #15935

Merged
merged 3 commits into from
Jun 10, 2024

Conversation

mroeschke
Copy link
Contributor

Description

closes #13611

(This technically does not support pandas objects have interval types that are timezone aware)

@rjzamora let me know if the test I adapted from your PR in #15929 is adequate

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@mroeschke mroeschke added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change cuDF (Python) labels Jun 5, 2024
@mroeschke mroeschke requested review from a team as code owners June 5, 2024 21:01
@mroeschke mroeschke requested review from bdice and isVoid June 5, 2024 21:01
@github-actions github-actions bot added the Python Affects Python cuDF API. label Jun 5, 2024
}
)
pdf.to_parquet(path)
# cudf.read_parquet does not support reading timezone aware types yet, so check dtypes
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, so now that cudf.DataFrame.from_pandas works for timezone-aware pandas dtypes, dask_cudf.read_parquet(path).meta.dtypes will no longer agree with the computed result. Am I understanding correctly?

This means that we will still need to modify the dask_cudf.read_parquet logic a bit to erase the timezone information until cudf.read_parquet also supports timezone-aware types. It's probably fine to make that change after this PR is merged.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really familiar with how dask cudf's parquet reader works under the hood, but if by "computed result" it means relative to using cudf's parquet reader then yes. I've hopefully clarified the test to show that dask_cudf result maintains the timezone in the data type while the cudf results drops the timezone in the data type (IMO the dask_cudf result is "more correct")

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really familiar with how dask cudf's parquet reader works under the hood, but if by "computed result" ...

Sorry, there is no reason for the behavior of dask_cudf to be obvious to anyone. So, I should definitely clarify my statements a bit:

A dask_cudf.DataFrame collection is "lazy" in the sense that we will not actually read the parquet data when you call dask_cudf.read_parquet. However, in order to keep track of the current column names and dtypes, the collection will keep an empty pd/cudf DataFrame object in memory called meta. This meta object is currently populated in the dask_cudf.read_parquet call by reading the Parquet footer metadata with pyarrow, and then converting to cudf (via pandas).

The problem is that the meta object created by dask_cudf.read_parquet can be "wrong", because dask_cudf will simply use cudf.read_parquet to actually read in the data when compute/persist is called on the collection. In other words: The dask_cudf result is not actually more correct than the cudf result, it's eager meta just happens to be more correct by accident :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for that explanation!

The dask_cudf result is not actually more correct than the cudf result, it's eager meta just happens to be more correct by accident

Ah yes I think this conclusion is correct and what you alluded to in your prior comment. In a follow up to this PR, the timezone aware type in meta probably needs to be converted to a timezone naive type in dask_cudf

Copy link
Contributor

@wence- wence- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this makes sense, thanks!

@mroeschke
Copy link
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit e3ba131 into rapidsai:branch-24.08 Jun 10, 2024
70 checks passed
@mroeschke mroeschke deleted the enh/pandas/tz_aware branch June 10, 2024 16:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
improvement Improvement / enhancement to an existing function non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Add support for timezone-aware data to from_pandas
4 participants