
Fix dask_cudf.read_parquet regression for legacy timestamp data #15929

Merged
rjzamora merged 7 commits into rapidsai:branch-24.08 from timezone-read-parquet on Jun 11, 2024

Conversation

@rjzamora (Member) commented Jun 5, 2024

Description

cudf does not currently support timezone-aware datetime columns. For example:

    pdf = pd.DataFrame(
        {
            "time": pd.to_datetime(
                ["1996-01-02", "1996-12-01"],
                utc=True,
            ),
            "x": [1, 2],
        }
    )
    cudf.DataFrame.from_pandas(pdf)
NotImplementedError: cuDF does not yet support timezone-aware datetimes

However, cudf.read_parquet does allow you to read this same data from a Parquet file. This PR adds a simple fix so that the same data can also be read with dask_cudf. The dask_cudf version was previously "broken" because it relied on upstream pyarrow logic to construct meta as a pandas DataFrame (which we then convert from pandas to cudf). As illustrated in the example above, that direct conversion fails when one or more columns contain timezone information.
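The workaround described above can be sketched with plain pandas. This is a hedged illustration (not the actual PR diff): the idea is that any timezone-aware datetime columns in the pandas `meta` are normalized before the pandas-to-cudf conversion, which would otherwise raise `NotImplementedError`.

```python
import pandas as pd

# Hedged sketch: dask_cudf builds its ``meta`` as a pandas DataFrame via
# pyarrow, so tz-aware datetime columns must be normalized before that
# meta is converted to cudf.
meta = pd.DataFrame(
    {
        "time": pd.to_datetime(["1996-01-02", "1996-12-01"], utc=True),
        "x": [1, 2],
    }
)

# Strip the timezone from tz-aware datetime columns only; naive columns
# are left untouched.
for col in meta.columns:
    if isinstance(meta[col].dtype, pd.DatetimeTZDtype):
        meta[col] = meta[col].dt.tz_localize(None)

print(meta["time"].dtype)  # datetime64[ns]
```

After this normalization, `cudf.DataFrame.from_pandas(meta)` no longer sees a timezone-aware dtype.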

Important Context
The actual motivation for this PR is to fix a regression in 24.06+ for older Parquet files containing "legacy" timestamp types (e.g. TIMESTAMP_MILLIS and TIMESTAMP_MICROS). In pyarrow 14.0.2 (used by cudf-24.04), these legacy types were not automatically translated to timezone-aware dtypes by pyarrow. In pyarrow 16.1.0 (used by cudf-24.06+), the legacy types ARE automatically translated. Therefore, after moving from cudf-24.04 to cudf-24.06+, some dask_cudf users will find that they can no longer read a Parquet file containing legacy timestamp data that previously worked.

I'm not entirely sure whether cudf should always allow users to read Parquet data with timezone-aware dtypes (e.g. when the timezone is not UTC), but it definitely makes sense for cudf to ignore automatic/unnecessary timezone translations.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@rjzamora rjzamora added bug Something isn't working 2 - In Progress Currently a work in progress non-breaking Non-breaking change labels Jun 5, 2024
@rjzamora rjzamora self-assigned this Jun 5, 2024
@rjzamora rjzamora requested a review from a team as a code owner June 5, 2024 15:52
@github-actions github-actions bot added the Python Affects Python cuDF API. label Jun 5, 2024
@rjzamora (Member, Author) commented Jun 5, 2024

Note: I tried to add this to the proper "cuDF/Dask/..." project and got an error: "Sorry! We were unable to add the pull request to the selected project. Projects cannot have more than 1200 items."

@rjzamora rjzamora added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Jun 5, 2024
@mroeschke (Contributor) commented:
So IIRC, if we do a deeper fix and allow .from_pandas to accept timezone-aware pandas objects, that would also fix the issue? IMO that fix should be relatively straightforward, since there is already some timezone support in cudf.

@rjzamora (Member, Author) commented Jun 5, 2024

Thanks for the review, @mroeschke!

> So IIRC, if we do a deeper fix and allow .from_pandas to accept timezone-aware pandas objects, that would also fix the issue? IMO that fix should be relatively straightforward, since there is already some timezone support in cudf.

Yes. I would prefer a fix in cudf if you think that is reasonable :)

rjzamora and others added 3 commits June 5, 2024 11:34
@mroeschke (Contributor) commented:
I opened up #15935 to hopefully supersede this PR

@wence- (Contributor) commented Jun 6, 2024

> Note: I tried to add this to the proper "cuDF/Dask/..." project and got an error: "Sorry! We were unable to add the pull request to the selected project. Projects cannot have more than 1200 items."

We had not archived any completed items recently. I just cleared out a backlog of around 600 "DONE" items, which should have helped.

rapids-bot bot pushed a commit that referenced this pull request Jun 10, 2024
closes #13611

(This technically does not support pandas objects that have timezone-aware interval types.)

@rjzamora let me know if the test I adapted from your PR in #15929 is adequate

Authors:
  - Matthew Roeschke (https://github.com/mroeschke)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Lawrence Mitchell (https://github.com/wence-)

URL: #15935
@wence- (Contributor) left a comment:

I think (to the best of my knowledge of timezones), this makes sense.

@rjzamora rjzamora added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels Jun 11, 2024
@rjzamora (Member, Author) commented:
/merge

@rapids-bot rapids-bot bot merged commit 8efa64e into rapidsai:branch-24.08 Jun 11, 2024
69 checks passed
@rjzamora rjzamora deleted the timezone-read-parquet branch June 11, 2024 16:59