Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ORC writer API changes for granular statistics #10058

Merged
merged 34 commits into from
Jan 20, 2022

Conversation

mythrocks
Copy link
Contributor

Depends on #10041.

The erstwhile ORC writer API exposed only a binary choice to choose
the level of statistics: ENABLED/DISABLED.
This commit allows the ORC writer to further choose whether statistics
are collected at the ROW_GROUP or STRIPE level.

This commit also includes the relevant changes to java/ and python/.

@mythrocks mythrocks added improvement Improvement / enhancement to an existing function breaking Breaking change labels Jan 15, 2022
@mythrocks mythrocks requested a review from vuule January 15, 2022 02:39
@mythrocks mythrocks self-assigned this Jan 15, 2022
@mythrocks mythrocks requested review from a team as code owners January 15, 2022 02:39
@github-actions github-actions bot added Java Affects Java cuDF API. Python Affects Python cuDF API. libcudf Affects libcudf (C++/CUDA) code. labels Jan 15, 2022
@mythrocks
Copy link
Contributor Author

This branch includes the changes from #10041, since it's based on vuule:fea-rowgroup-stats.

Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, just a few small suggestions.

Kind of scary that tests did not require any changes. I'd like to add a test with ~1M rows and various data types that runs with different stat frequency. Would be even better to reuse an existing test, will look into this.

cpp/benchmarks/io/orc/orc_writer_benchmark.cpp Outdated Show resolved Hide resolved
cpp/include/cudf/io/orc.hpp Show resolved Hide resolved
cpp/include/cudf/io/orc.hpp Show resolved Hide resolved
python/cudf/cudf/_fuzz_testing/tests/fuzz_test_orc.py Outdated Show resolved Hide resolved
@mythrocks
Copy link
Contributor Author

(Please pardon the force push.)
It appears that when the statistics frequency is set to STRIPE, there we have trouble reading back the statistics, as evidenced by the failures in test_orc_writer_statistics_frequency():

/home/mithunr/anaconda3/envs/cudf-dev-11.5-3/lib/python3.8/site-packages/pyarrow/orc.py:75: in __init__
    self.reader.open(source)
pyarrow/_orc.pyx:58: in pyarrow._orc.ORCReader.open
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>   ???
E   OSError: Failed to parse the footer from ArrowInputFile

pyarrow/error.pxi:114: OSError

This surfaced when attempting to add a non-default statistics frequency to test_orc_write_statistics().
Please note that test_chunked_orc_writer_statistics_frequency[STRIPE] does succeed.

@vuule vuule added the 5 - DO NOT MERGE Hold off on merging; see PR for details label Jan 19, 2022
@vuule
Copy link
Contributor

vuule commented Jan 20, 2022

rerun tests

1 similar comment
@vyasr
Copy link
Contributor

vyasr commented Jan 20, 2022

rerun tests

@mythrocks
Copy link
Contributor Author

I'm working on resolving the merge conflict in orc.hpp. Need a moment to re-run the tests locally.

@vuule vuule removed the 5 - DO NOT MERGE Hold off on merging; see PR for details label Jan 20, 2022
Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔥 🔥

@mythrocks
Copy link
Contributor Author

Merging shortly.

@mythrocks
Copy link
Contributor Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 53a31d1 into rapidsai:branch-22.02 Jan 20, 2022
@mythrocks
Copy link
Contributor Author

I've just merged this change.
@vuule: Thank you for the guidance, reviews, and help with the bugs!
@galipremsagar, @devavret, @jlowe, @cwharris: Thank you for the reviews.

@mythrocks mythrocks deleted the fea-orc-stats-api branch January 20, 2022 23:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking Breaking change improvement Improvement / enhancement to an existing function Java Affects Java cuDF API. libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants