Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support reading matching projected and filter cols from Parquet files with otherwise mismatched schemas #16394

Merged

Conversation

mhaseeb123
Copy link
Member

@mhaseeb123 mhaseeb123 commented Jul 25, 2024

Description

Closes #16269.

This PR adds support to read (matching) projected/selected and filter columns from Parquet files with otherwise mismatching schemas.

Solution Description

We create a std::vector<unordered_maps<int32_t, int32_t>>, one per file except 0th file. We then co-walk schema trees and populate the map with corresponding (one-to-one mapped) schema_idx of valid selected (projection and filter) column between 0th and the rest of the files. The same unordered_map is used to get the schema_idx of the same columns across files when creating ColumnChunkDesc and copying column chunk metadata into the page decoder.

Known Limitation

CC @wence-

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@mhaseeb123 mhaseeb123 self-assigned this Jul 25, 2024
@github-actions github-actions bot added libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API. pylibcudf Issues specific to the pylibcudf package labels Jul 25, 2024
@mhaseeb123 mhaseeb123 added 2 - In Progress Currently a work in progress cuIO cuIO issue improvement Improvement / enhancement to an existing function non-breaking Non-breaking change cuDF (Python) and removed Python Affects Python cuDF API. pylibcudf Issues specific to the pylibcudf package labels Jul 25, 2024
@github-actions github-actions bot added Python Affects Python cuDF API. pylibcudf Issues specific to the pylibcudf package labels Jul 25, 2024
@@ -1041,18 +1068,19 @@ aggregate_reader_metadata::select_columns(
std::optional<std::vector<std::string>> const& filter_columns_names,
bool include_index,
bool strings_to_categorical,
type_id timestamp_type_id) const
type_id timestamp_type_id)
Copy link
Member Author

@mhaseeb123 mhaseeb123 Jul 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const removed as we will now be populating schema_idx_maps in this function.

Comment on lines 286 to 290
source_info=plc.io.SourceInfo(new_bufs),
columns=columns,
row_groups=row_groups,
use_pandas_metadata=use_pandas_metadata,
read_mismatched_pq_schemas=read_mismatched_pq_schemas,
Copy link
Member Author

@mhaseeb123 mhaseeb123 Jul 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had to do this to properly propagate read_mismatched_pq_schemas to ChunkedParquetReader. Not sure why it wouldn't propagate otherwise if I do the following. Suggestions welcome. By propagate, I mean if it's true here, it would not be true in ChunkedParquetReader.

        plc.io.SourceInfo(new_bufs),
        columns,
        row_groups,
        use_pandas_metadata,
        read_mismatched_pq_schemas, <- doesn't propagate to ChunkedParquetReader

@mhaseeb123 mhaseeb123 marked this pull request as ready for review July 25, 2024 21:23
@mhaseeb123 mhaseeb123 requested review from a team as code owners July 25, 2024 21:23
@mhaseeb123 mhaseeb123 removed the 2 - In Progress Currently a work in progress label Jul 25, 2024
Copy link
Contributor

@wence- wence- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, to the best of my understanding in the parquet reader, this looks good. Some minor nits around documentation.

auto const& schema_idx_map = schema_idx_maps[src_idx - 1];
CUDF_EXPECTS(schema_idx_map.find(schema_idx) != schema_idx_map.end(),
"Unmapped schema index encountered in the specified source tree",
std::out_of_range);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one looks good to me.

// Check the schema elements to be equal except their number of children as we only care about
// the specific column paths in the schema trees.
CUDF_EXPECTS(equal_to_except_num_children(src_schema_elem, dst_schema_elem),
"Encountered mismatching SchemaElement properties encountered for a column in "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"Encountered mismatching SchemaElement properties encountered for a column in "
"Encountered mismatching SchemaElement properties for a column in "

if (col_name_info == nullptr or col_name_info->children.empty()) {
// Check the number of children to be equal here.
CUDF_EXPECTS(src_schema_elem.num_children == dst_schema_elem.num_children,
"Encountered mismatching number of children encountered for a "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"Encountered mismatching number of children encountered for a "
"Encountered mismatching number of children for a "

@@ -183,7 +190,8 @@ class aggregate_reader_metadata {

public:
aggregate_reader_metadata(host_span<std::unique_ptr<datasource> const> sources,
bool use_arrow_schema);
bool use_arrow_schema,
bool has_cols_from_mismatched_srcs);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the common case where all the schemas match, is this extra code a performance hit? Or, said another way, should we turn on read_mismatched_pq_schema by default?

auto const& schema_idx_map = schema_idx_maps[src_idx - 1];
CUDF_EXPECTS(schema_idx_map.find(schema_idx) != schema_idx_map.end(),
"Unmapped schema index encountered in the specified source tree",
std::out_of_range);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Can we document what exceptions these functions now throw?

@mhaseeb123
Copy link
Member Author

CC @etseidl

@mhaseeb123 mhaseeb123 added 4 - Needs Review Waiting for reviewer to review or respond and removed 3 - Ready for Review Ready for review by team labels Jul 31, 2024
* @return `true` if mismatched projected and filter columns will be read from mismatched Parquet
* sources.
*/
[[nodiscard]] bool is_enabled_allow_mismatched_pq_schemas() const
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The “pq” feels redundant because this is code for the parquet reader. Can we remove that from the name?

Suggested change
[[nodiscard]] bool is_enabled_allow_mismatched_pq_schemas() const
[[nodiscard]] bool is_enabled_allow_mismatched_schemas() const

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We now also support reading (default on) arrow_schema in our Parquet reader so I added the pq keyword to better disambiguate between them (though we don't really check for mismatched arrow schemas per se but I thought it would be better to know which schema we are talking about here). I don't really have a strong feeling about pq here one way or the other.

Copy link
Contributor

@bdice bdice Aug 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you think it helps disambiguate, then it’s fine to leave it. I am not familiar enough with the format to know what users expect. Is this mismatched schema feature something that other readers implement? How do they name it?

Copy link
Member Author

@mhaseeb123 mhaseeb123 Aug 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think any other readers implement this feature (I should double check this statement). Afaik, this request comes from a use-case in cudf-polars where we might want to read some matching columns from otherwise mismatching parquet files. Maybe @wence- can shine some light on it's actual application.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could see this being useful in the case of evolving schemas...newer files add or remove a field, but it's too costly to migrate the old data. This would at least allow queries against the common fields.

cpp/src/io/parquet/reader_impl_helpers.cpp Outdated Show resolved Hide resolved
auto const& schema_idx_map = schema_idx_maps[src_idx - 1];
CUDF_EXPECTS(schema_idx_map.find(schema_idx) != schema_idx_map.end(),
"Unmapped schema index encountered in the specified source tree",
std::out_of_range);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don’t understand why a runtime error is not permitted here. I read the linked thread. This feels like it should be a RuntimeError in Python. Maybe a ValueError (invalid value in C++, I think). Out of range feels wrong (here and below).

cpp/src/io/parquet/reader_impl_helpers.cpp Outdated Show resolved Hide resolved
cpp/src/io/parquet/reader_impl_helpers.cpp Show resolved Hide resolved
cpp/src/io/parquet/reader_impl_helpers.cpp Outdated Show resolved Hide resolved
cpp/src/io/parquet/reader_impl_helpers.cpp Outdated Show resolved Hide resolved
cpp/src/io/parquet/reader_impl_helpers.cpp Outdated Show resolved Hide resolved
cpp/src/io/parquet/reader_impl_helpers.hpp Outdated Show resolved Hide resolved
cpp/src/io/parquet/reader_impl_helpers.hpp Show resolved Hide resolved
Copy link

copy-pr-bot bot commented Aug 6, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

auto const& schema_idx_map = schema_idx_maps[src_idx - 1];
CUDF_EXPECTS(schema_idx_map.find(schema_idx) != schema_idx_map.end(),
"Unmapped schema index encountered in the specified source tree",
std::range_error);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed this to std::range_error. Can't say it's any better than std::out_of_range

Copy link
Contributor

@etseidl etseidl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only found 2 nits 😄. Thanks @mhaseeb123, this may come in very handy down the road. LGTM.

cpp/src/io/parquet/reader_impl_helpers.cpp Outdated Show resolved Hide resolved
cpp/src/io/parquet/reader_impl_helpers.cpp Outdated Show resolved Hide resolved
Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>
@mhaseeb123 mhaseeb123 closed this Aug 19, 2024
@mhaseeb123 mhaseeb123 deleted the fea-pq-reader-mismatched-schema branch August 19, 2024 18:22
@mhaseeb123 mhaseeb123 restored the fea-pq-reader-mismatched-schema branch August 19, 2024 18:22
@mhaseeb123 mhaseeb123 reopened this Aug 19, 2024
@github-actions github-actions bot removed the pylibcudf Issues specific to the pylibcudf package label Aug 19, 2024
@mhaseeb123
Copy link
Member Author

/ok to test

@mhaseeb123
Copy link
Member Author

/merge

@rapids-bot rapids-bot bot merged commit fbd6114 into rapidsai:branch-24.10 Aug 28, 2024
82 checks passed
@mhaseeb123 mhaseeb123 deleted the fea-pq-reader-mismatched-schema branch August 28, 2024 22:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
4 - Needs Review Waiting for reviewer to review or respond cuDF (Python) cuIO cuIO issue improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
Status: Done
6 participants