-
Notifications
You must be signed in to change notification settings - Fork 883
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support reading matching projected and filter cols from Parquet files with otherwise mismatched schemas #16394
Support reading matching projected and filter cols from Parquet files with otherwise mismatched schemas #16394
Conversation
@@ -1041,18 +1068,19 @@ aggregate_reader_metadata::select_columns( | |||
std::optional<std::vector<std::string>> const& filter_columns_names, | |||
bool include_index, | |||
bool strings_to_categorical, | |||
type_id timestamp_type_id) const | |||
type_id timestamp_type_id) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const
removed as we will now be populating schema_idx_maps
in this function.
python/cudf/cudf/_lib/parquet.pyx
Outdated
source_info=plc.io.SourceInfo(new_bufs), | ||
columns=columns, | ||
row_groups=row_groups, | ||
use_pandas_metadata=use_pandas_metadata, | ||
read_mismatched_pq_schemas=read_mismatched_pq_schemas, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Had to do this to properly propagate read_mismatched_pq_schemas
to ChunkedParquetReader
. Not sure why it wouldn't propagate otherwise if I do the following. Suggestions welcome. By propagate, I mean if it's true here, it would not be true in ChunkedParquetReader.
plc.io.SourceInfo(new_bufs),
columns,
row_groups,
use_pandas_metadata,
read_mismatched_pq_schemas, <- doesn't propagate to ChunkedParquetReader
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, to the best of my understanding in the parquet reader, this looks good. Some minor nits around documentation.
auto const& schema_idx_map = schema_idx_maps[src_idx - 1]; | ||
CUDF_EXPECTS(schema_idx_map.find(schema_idx) != schema_idx_map.end(), | ||
"Unmapped schema index encountered in the specified source tree", | ||
std::out_of_range); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one looks good to me.
// Check the schema elements to be equal except their number of children as we only care about | ||
// the specific column paths in the schema trees. | ||
CUDF_EXPECTS(equal_to_except_num_children(src_schema_elem, dst_schema_elem), | ||
"Encountered mismatching SchemaElement properties encountered for a column in " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Encountered mismatching SchemaElement properties encountered for a column in " | |
"Encountered mismatching SchemaElement properties for a column in " |
if (col_name_info == nullptr or col_name_info->children.empty()) { | ||
// Check the number of children to be equal here. | ||
CUDF_EXPECTS(src_schema_elem.num_children == dst_schema_elem.num_children, | ||
"Encountered mismatching number of children encountered for a " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Encountered mismatching number of children encountered for a " | |
"Encountered mismatching number of children for a " |
@@ -183,7 +190,8 @@ class aggregate_reader_metadata { | |||
|
|||
public: | |||
aggregate_reader_metadata(host_span<std::unique_ptr<datasource> const> sources, | |||
bool use_arrow_schema); | |||
bool use_arrow_schema, | |||
bool has_cols_from_mismatched_srcs); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the common case where all the schemas match, is this extra code a performance hit? Or, said another way, should we turn on read_mismatched_pq_schema
by default?
auto const& schema_idx_map = schema_idx_maps[src_idx - 1]; | ||
CUDF_EXPECTS(schema_idx_map.find(schema_idx) != schema_idx_map.end(), | ||
"Unmapped schema index encountered in the specified source tree", | ||
std::out_of_range); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Can we document what exceptions these functions now throw?
CC @etseidl |
* @return `true` if mismatched projected and filter columns will be read from mismatched Parquet | ||
* sources. | ||
*/ | ||
[[nodiscard]] bool is_enabled_allow_mismatched_pq_schemas() const |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The “pq” feels redundant because this is code for the parquet reader. Can we remove that from the name?
[[nodiscard]] bool is_enabled_allow_mismatched_pq_schemas() const | |
[[nodiscard]] bool is_enabled_allow_mismatched_schemas() const |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We now also support reading (default on) arrow_schema
in our Parquet reader so I added the pq
keyword to better disambiguate between them (though we don't really check for mismatched arrow schemas per se but I thought it would be better to know which schema we are talking about here). I don't really have a strong feeling about pq
here one way or the other.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you think it helps disambiguate, then it’s fine to leave it. I am not familiar enough with the format to know what users expect. Is this mismatched schema feature something that other readers implement? How do they name it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think any other readers implement this feature (I should double check this statement). Afaik, this request comes from a use-case in cudf-polars
where we might want to read some matching columns from otherwise mismatching parquet files. Maybe @wence- can shine some light on it's actual application.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could see this being useful in the case of evolving schemas...newer files add or remove a field, but it's too costly to migrate the old data. This would at least allow queries against the common fields.
auto const& schema_idx_map = schema_idx_maps[src_idx - 1]; | ||
CUDF_EXPECTS(schema_idx_map.find(schema_idx) != schema_idx_map.end(), | ||
"Unmapped schema index encountered in the specified source tree", | ||
std::out_of_range); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don’t understand why a runtime error is not permitted here. I read the linked thread. This feels like it should be a RuntimeError in Python. Maybe a ValueError (invalid value in C++, I think). Out of range feels wrong (here and below).
…o fea-pq-reader-mismatched-schema
…o fea-pq-reader-mismatched-schema
…mhaseeb123/cudf into fea-pq-reader-mismatched-schema
auto const& schema_idx_map = schema_idx_maps[src_idx - 1]; | ||
CUDF_EXPECTS(schema_idx_map.find(schema_idx) != schema_idx_map.end(), | ||
"Unmapped schema index encountered in the specified source tree", | ||
std::range_error); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed this to std::range_error
. Can't say it's any better than std::out_of_range
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only found 2 nits 😄. Thanks @mhaseeb123, this may come in very handy down the road. LGTM.
Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>
…o fea-pq-reader-mismatched-schema
…mhaseeb123/cudf into fea-pq-reader-mismatched-schema
/ok to test |
/merge |
Description
Closes #16269.
This PR adds support to read (matching) projected/selected and filter columns from Parquet files with otherwise mismatching schemas.
Solution Description
We create a
std::vector<unordered_maps<int32_t, int32_t>>
, one per file except 0th file. We then co-walk schema trees and populate the map with corresponding (one-to-one mapped)schema_idx
of valid selected (projection and filter) column between 0th and the rest of the files. The sameunordered_map
is used to get theschema_idx
of the same columns across files when creatingColumnChunkDesc
and copying column chunk metadata into the page decoder.Known Limitation
CC @wence-
Checklist