Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TST: Test non-nanosecond datetimes in PyArrow Parquet dataframes #59393

Merged
merged 10 commits into from
Aug 19, 2024

Conversation

EduardAkhmetshin
Copy link
Contributor

This PR tests that the current version of pandas supports non-nanosecond PyArrow dataframes without using timestamp_as_object param.

@EduardAkhmetshin
Copy link
Contributor Author

@natmokval @WillAyd Please review when you have time. Thank you!

@EduardAkhmetshin
Copy link
Contributor Author

EduardAkhmetshin commented Aug 3, 2024

All tests pass in my Docker container, but the CI build fails. I'll look into it.
Update: I see other people's PRs fail with similar errors. Is there a problem with CI?

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR, though I'm wondering if this is that much different than the test_read_dtype_backend_pyarrow_config that already exists?

@jbrockmendel does the most with datetimes so may want to chime in here as well

pandas/tests/io/test_parquet.py Outdated Show resolved Hide resolved
pandas/tests/io/test_parquet.py Outdated Show resolved Hide resolved
@WillAyd
Copy link
Member

WillAyd commented Aug 3, 2024

Also this does potentially solve the example from the OP in #49236 , although we may not want to make this as closing that. Not sure if there is a need for the requested keyword beyond this one example

@EduardAkhmetshin
Copy link
Contributor Author

@WillAyd Thank you very much for your review!

I'm wondering if this is that much different than the test_read_dtype_backend_pyarrow_config that already exists?

While test_read_dtype_backend_pyarrow_config tests microseconds as well, it uses a datetime that is in the normal nanosecond date range (1678 - 2262 as mentioned in PyArrow docs). My test uses the date 1600-01-01 from the issue, which is outside the normal range.
Let me know if you think it's redundant and test_read_dtype_backend_pyarrow_config covers everything.

@WillAyd
Copy link
Member

WillAyd commented Aug 3, 2024

Sounds good and thanks for explaining. I think it's ok to add - generally more tests are good.

@mroeschke mroeschke added the Testing pandas testing functions or related to the test suite label Aug 5, 2024
@WillAyd
Copy link
Member

WillAyd commented Aug 5, 2024

Whoops sorry for the bad guidance on ensure_clean()

@EduardAkhmetshin
Copy link
Contributor Author

@mroeschke thanks for the suggestions!

@EduardAkhmetshin
Copy link
Contributor Author

EduardAkhmetshin commented Aug 9, 2024

In the failed environment, pandas 1.5.3 is used, which fails with the same error from the original issue ticket.
Is using an old version of pandas in that environment intended for testing backwards compatibility? Should I somehow isolate my test to run in pandas 2 environments only? @WillAyd

@EduardAkhmetshin
Copy link
Contributor Author

@mroeschke could you please help with my question above? Thanks!

@mroeschke
Copy link
Member

You'll probably need specify the minversion argument in pytest.importorskip("pyarrow") call. There's probably a version of pyarrow where this behavior was fixed

@mroeschke mroeschke added this to the 3.0 milestone Aug 19, 2024
@mroeschke mroeschke merged commit 3097190 into pandas-dev:main Aug 19, 2024
39 of 47 checks passed
@mroeschke
Copy link
Member

Thanks @EduardAkhmetshin

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Testing pandas testing functions or related to the test suite
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants