
Fix dask_cudf.read_parquet regression for legacy timestamp data #15929

Merged
rjzamora merged 7 commits into rapidsai:branch-24.08 from timezone-read-parquet on Jun 11, 2024

Conversation

@rjzamora (Member) commented Jun 5, 2024

Description

cudf does not currently support timezone-aware datetime columns. For example:

    pdf = pd.DataFrame(
        {
            "time": pd.to_datetime(
                ["1996-01-02", "1996-12-01"],
                utc=True,
            ),
            "x": [1, 2],
        }
    )
    cudf.DataFrame.from_pandas(pdf)
NotImplementedError: cuDF does not yet support timezone-aware datetimes

However, cudf.read_parquet does allow you to read this same data from a Parquet file. This PR adds a simple fix so that the same data can also be read with dask_cudf. The dask_cudf version was previously "broken" because it relied on upstream pyarrow logic to construct meta as a pandas DataFrame (which we then convert from pandas to cudf). As illustrated in the example above, that direct conversion fails when one or more columns contain timezone information.
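The workaround described above can be sketched with plain pandas. This is a hedged illustration (not the actual PR diff): the idea is that any timezone-aware datetime columns in the pandas `meta` are normalized before the pandas-to-cudf conversion, which would otherwise raise `NotImplementedError`.

```python
import pandas as pd

# Hedged sketch: dask_cudf builds its ``meta`` as a pandas DataFrame via
# pyarrow, so tz-aware datetime columns must be normalized before that
# meta is converted to cudf.
meta = pd.DataFrame(
    {
        "time": pd.to_datetime(["1996-01-02", "1996-12-01"], utc=True),
        "x": [1, 2],
    }
)

# Strip the timezone from tz-aware datetime columns only; naive columns
# are left untouched.
for col in meta.columns:
    if isinstance(meta[col].dtype, pd.DatetimeTZDtype):
        meta[col] = meta[col].dt.tz_localize(None)

print(meta["time"].dtype)  # datetime64[ns]
```

After this normalization, `cudf.DataFrame.from_pandas(meta)` no longer sees a timezone-aware dtype.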

Important Context
The actual motivation for this PR is to fix a regression in 24.06+ for older Parquet files containing "legacy" timestamp types (e.g. TIMESTAMP_MILLIS and TIMESTAMP_MICROS). In pyarrow 14.0.2 (used by cudf-24.04), these legacy types were not automatically translated to timezone-aware dtypes by pyarrow. In pyarrow 16.1.0 (used by cudf-24.06+), the legacy types ARE automatically translated. Therefore, after moving from cudf-24.04 to cudf-24.06+, some dask_cudf users will find that they can no longer read a Parquet file containing legacy timestamp data that previously worked.

I'm not entirely sure whether cudf should always allow users to read Parquet data with timezone-aware dtypes (e.g. when the timezone is not UTC), but it definitely makes sense for cudf to ignore automatic/unnecessary timezone translations.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@rjzamora rjzamora added bug Something isn't working 2 - In Progress Currently a work in progress non-breaking Non-breaking change labels Jun 5, 2024
@rjzamora rjzamora self-assigned this Jun 5, 2024
@rjzamora rjzamora requested a review from a team as a code owner June 5, 2024 15:52
@github-actions github-actions bot added the Python Affects Python cuDF API. label Jun 5, 2024
@rjzamora (Member, Author) commented Jun 5, 2024

Note: I tried to add this to the proper "cuDF/Dask/..." project and got an error: "Sorry! We were unable to add the pull request to the selected project. Projects cannot have more than 1200 items."

@rjzamora rjzamora added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Jun 5, 2024
@mroeschke (Contributor) commented:
So IIRC, if we do a deeper fix and allow .from_pandas to accept timezone-aware pandas objects, that would also fix the issue? IMO that fix should be relatively straightforward, since there is already some timezone support in cudf.

@rjzamora (Member, Author) commented Jun 5, 2024

Thanks for the review, @mroeschke!

> So IIRC, if we do a deeper fix and allow .from_pandas to accept timezone-aware pandas objects, that would also fix the issue? IMO that fix should be relatively straightforward, since there is already some timezone support in cudf.

Yes. I would prefer a fix in cudf if you think that is reasonable :)

rjzamora and others added 3 commits June 5, 2024 11:34
@mroeschke (Contributor) commented:
I opened up #15935 to hopefully supersede this PR

@wence- (Contributor) commented Jun 6, 2024

> Note: I tried to add this to the proper "cuDF/Dask/..." project and got an error: "Sorry! We were unable to add the pull request to the selected project. Projects cannot have more than 1200 items."

We had not archived any completed items recently. I just cleared out a backlog of around 600 "DONE" items, which should have helped.

rapids-bot bot pushed a commit that referenced this pull request Jun 10, 2024
closes #13611

(This technically does not support pandas objects that have timezone-aware interval types.)

@rjzamora let me know if the test I adapted from your PR in #15929 is adequate

Authors:
  - Matthew Roeschke (https://github.com/mroeschke)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Lawrence Mitchell (https://github.com/wence-)

URL: #15935
@wence- (Contributor) left a comment:

I think (to the best of my knowledge of timezones), this makes sense.

@rjzamora rjzamora added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels Jun 11, 2024
@rjzamora (Member, Author) commented:
/merge

@rapids-bot rapids-bot bot merged commit 8efa64e into rapidsai:branch-24.08 Jun 11, 2024
69 checks passed
@rjzamora rjzamora deleted the timezone-read-parquet branch June 11, 2024 16:59