Support reading matching projected and filter cols from Parquet files with otherwise mismatched schemas #16394

mhaseeb123 · 2024-07-25T03:35:09Z

Description

This PR adds support to read (matching) projected/selected and filter columns from Parquet files with otherwise mismatching schemas.

Solution Description

We create a std::vector<std::unordered_map<int32_t, int32_t>> with one map per file, except for the 0th file. We then co-walk the schema trees and populate each map with the one-to-one correspondence between the schema_idx of every valid selected (projection and filter) column in the 0th file and its schema_idx in that file. The same maps are then used to look up the schema_idx of a given column across files when creating ColumnChunkDesc and when copying column chunk metadata into the page decoder.
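The co-walk described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the libcudf implementation: SchemaNode, find_child, map_column_path and build_schema_idx_maps are hypothetical names, and the schema is reduced to a flattened vector of named nodes with child indices.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical, simplified stand-in for a flattened Parquet schema:
// element 0 is the root and `children` holds indices into the same vector.
struct SchemaNode {
  std::string name;
  std::vector<int32_t> children;
};
using Schema       = std::vector<SchemaNode>;
using SchemaIdxMap = std::unordered_map<int32_t, int32_t>;

// Index of the child of `parent` called `name`, or -1 if there is none.
int32_t find_child(Schema const& schema, int32_t parent, std::string const& name)
{
  for (int32_t child : schema[parent].children) {
    if (schema[child].name == name) { return child; }
  }
  return -1;
}

// Walk one selected column path through both trees in lockstep, recording
// file-0 schema_idx -> other-file schema_idx for every node on the path.
void map_column_path(Schema const& file0,
                     Schema const& other,
                     std::vector<std::string> const& path,
                     SchemaIdxMap& map)
{
  int32_t idx0 = 0, idx_other = 0;  // both walks start at the root
  for (auto const& name : path) {
    idx0      = find_child(file0, idx0, name);
    idx_other = find_child(other, idx_other, name);
    if (idx0 < 0 || idx_other < 0) {
      throw std::invalid_argument("Selected column missing from a source: " + name);
    }
    map[idx0] = idx_other;
  }
}

// One map per file except file 0, so the map for file i lives at maps[i - 1].
std::vector<SchemaIdxMap> build_schema_idx_maps(
  std::vector<Schema> const& schemas,
  std::vector<std::vector<std::string>> const& selected_paths)
{
  assert(!schemas.empty());
  std::vector<SchemaIdxMap> maps;
  for (std::size_t file = 1; file < schemas.size(); ++file) {
    maps.emplace_back();
    for (auto const& path : selected_paths) {
      map_column_path(schemas[0], schemas[file], path, maps.back());
    }
  }
  return maps;
}
```

Only nodes on selected column paths are mapped, so unrelated columns may differ freely between files in this sketch.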

Known Limitation

Nullability across files: each selected column must still be consistently nullable or consistently non-nullable across all files. See "[FEA] Allow user to control "not null" constraint on Parquet columns" (#12702), which is also described in #dask/9935.

CC @wence-

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

mhaseeb123 · 2024-07-25T20:10:48Z

cpp/src/io/parquet/reader_impl_helpers.cpp

@@ -1041,18 +1068,19 @@ aggregate_reader_metadata::select_columns(
  std::optional<std::vector<std::string>> const& filter_columns_names,
  bool include_index,
  bool strings_to_categorical,
-  type_id timestamp_type_id) const
+  type_id timestamp_type_id)


The const qualifier is removed because this function now populates schema_idx_maps.

mhaseeb123 · 2024-07-25T20:14:12Z

python/cudf/cudf/_lib/parquet.pyx

+        source_info=plc.io.SourceInfo(new_bufs),
+        columns=columns,
+        row_groups=row_groups,
+        use_pandas_metadata=use_pandas_metadata,
+        read_mismatched_pq_schemas=read_mismatched_pq_schemas,


I had to pass these as keyword arguments to properly propagate read_mismatched_pq_schemas to ChunkedParquetReader. I'm not sure why the value doesn't propagate when the arguments are passed positionally, as in the following line. Suggestions welcome. By "propagate", I mean that if it is true here, it would not be true in ChunkedParquetReader.

plc.io.SourceInfo(new_bufs), columns, row_groups, use_pandas_metadata, read_mismatched_pq_schemas, <- doesn't propagate to ChunkedParquetReader

wence-

Thanks. To the best of my understanding of the Parquet reader, this looks good. Some minor nits around documentation.

wence- · 2024-07-31T14:48:03Z

cpp/src/io/parquet/reader_impl_helpers.cpp

+    auto const& schema_idx_map = schema_idx_maps[src_idx - 1];
+    CUDF_EXPECTS(schema_idx_map.find(schema_idx) != schema_idx_map.end(),
+                 "Unmapped schema index encountered in the specified source tree",
+                 std::out_of_range);


This one looks good to me.

wence- · 2024-07-31T14:55:05Z

cpp/src/io/parquet/reader_impl_helpers.cpp

+      // Check the schema elements to be equal except their number of children as we only care about
+      // the specific column paths in the schema trees.
+      CUDF_EXPECTS(equal_to_except_num_children(src_schema_elem, dst_schema_elem),
+                   "Encountered mismatching SchemaElement properties encountered for a column in "


Suggested change

-                   "Encountered mismatching SchemaElement properties encountered for a column in "
+                   "Encountered mismatching SchemaElement properties for a column in "
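For context, a comparison that ignores the child count, as the code comment in this hunk describes, might look like the sketch below. The SchemaElement here is a hypothetical, trimmed-down struct (the real one carries more fields), and the set of fields compared is an assumption rather than what equal_to_except_num_children actually checks.

```cpp
#include <cstdint>
#include <string>
#include <tuple>

// Hypothetical, trimmed-down schema element; field names are illustrative.
struct SchemaElement {
  std::string name;
  int32_t type;             // physical type id
  int32_t repetition_type;  // required / optional / repeated
  int32_t num_children;
};

// Compare every field except num_children, since only the selected column
// paths need to agree, not the full set of siblings under each node.
bool equal_to_except_num_children(SchemaElement const& lhs, SchemaElement const& rhs)
{
  return std::tie(lhs.name, lhs.type, lhs.repetition_type) ==
         std::tie(rhs.name, rhs.type, rhs.repetition_type);
}
```

In this sketch the repetition type is part of the comparison, which would be consistent with the nullability limitation noted in the PR description: a column that is optional in one file and required in another would be rejected.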

wence- · 2024-07-31T14:55:32Z

cpp/src/io/parquet/reader_impl_helpers.cpp

+      if (col_name_info == nullptr or col_name_info->children.empty()) {
+        // Check the number of children to be equal here.
+        CUDF_EXPECTS(src_schema_elem.num_children == dst_schema_elem.num_children,
+                     "Encountered mismatching number of children encountered for a "


Suggested change

-                     "Encountered mismatching number of children encountered for a "
+                     "Encountered mismatching number of children for a "

wence- · 2024-07-31T14:57:14Z

cpp/src/io/parquet/reader_impl_helpers.hpp

@@ -183,7 +190,8 @@ class aggregate_reader_metadata {

 public:
  aggregate_reader_metadata(host_span<std::unique_ptr<datasource> const> sources,
-                            bool use_arrow_schema);
+                            bool use_arrow_schema,
+                            bool has_cols_from_mismatched_srcs);


For the common case where all the schemas match, is this extra code a performance hit? Or, said another way, should we turn on read_mismatched_pq_schema by default?

wence- · 2024-07-31T14:58:21Z

cpp/src/io/parquet/reader_impl_helpers.cpp

+    auto const& schema_idx_map = schema_idx_maps[src_idx - 1];
+    CUDF_EXPECTS(schema_idx_map.find(schema_idx) != schema_idx_map.end(),
+                 "Unmapped schema index encountered in the specified source tree",
+                 std::out_of_range);


nit: Can we document what exceptions these functions now throw?
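As an illustration of what such documentation could look like, here is a sketch of the lookup with a Doxygen @throws entry. The helper name map_schema_idx and its signature are hypothetical, the identity mapping for source 0 is an assumption based on the PR description (no map is stored for the 0th file), and plain throw statements stand in for CUDF_EXPECTS.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <unordered_map>
#include <vector>

using SchemaIdxMap = std::unordered_map<int32_t, int32_t>;

/**
 * @brief Translate a file-0 schema index into the matching index in source `src_idx`.
 *
 * Hypothetical helper for illustration only; not the libcudf function itself.
 *
 * @throws std::out_of_range if `src_idx` has no map, or if `schema_idx` was
 *         never mapped for that source
 */
int32_t map_schema_idx(std::vector<SchemaIdxMap> const& schema_idx_maps,
                       int32_t schema_idx,
                       std::size_t src_idx)
{
  // Assumption: file 0 is the reference, so its indices map to themselves.
  if (src_idx == 0) { return schema_idx; }
  if (src_idx > schema_idx_maps.size()) {
    throw std::out_of_range("Source index has no schema index map");
  }
  // No map is stored for file 0, hence the offset by one.
  auto const& schema_idx_map = schema_idx_maps[src_idx - 1];
  auto const it              = schema_idx_map.find(schema_idx);
  if (it == schema_idx_map.end()) {
    throw std::out_of_range("Unmapped schema index encountered in the specified source tree");
  }
  return it->second;
}
```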

mhaseeb123 · 2024-07-31T18:31:49Z

CC @etseidl

bdice · 2024-08-06T00:41:00Z

cpp/include/cudf/io/parquet.hpp

+   * @return `true` if mismatched projected and filter columns will be read from mismatched Parquet
+   * sources.
+   */
+  [[nodiscard]] bool is_enabled_allow_mismatched_pq_schemas() const


The “pq” feels redundant because this is code for the parquet reader. Can we remove that from the name?

Suggested change

-  [[nodiscard]] bool is_enabled_allow_mismatched_pq_schemas() const
+  [[nodiscard]] bool is_enabled_allow_mismatched_schemas() const

mhaseeb123 · 2024-08-06

We now also support reading arrow_schema (on by default) in our Parquet reader, so I added the pq keyword to disambiguate between the two. We don't really check for mismatched arrow schemas per se, but I thought it would be better to make clear which schema we are talking about here. I don't feel strongly about pq one way or the other.

bdice · 2024-08-06

If you think it helps disambiguate, then it's fine to leave it. I am not familiar enough with the format to know what users expect. Is this mismatched schema feature something that other readers implement? How do they name it?

mhaseeb123 · 2024-08-06

I don't think any other readers implement this feature (I should double-check this statement). AFAIK, this request comes from a use case in cudf-polars where we might want to read some matching columns from otherwise mismatching Parquet files. Maybe @wence- can shed some light on its actual application.

etseidl · 2024-08-06

I could see this being useful in the case of evolving schemas: newer files add or remove a field, but it's too costly to migrate the old data. This would at least allow queries against the common fields.

bdice · 2024-08-06T00:48:38Z

cpp/src/io/parquet/reader_impl_helpers.cpp

+    auto const& schema_idx_map = schema_idx_maps[src_idx - 1];
+    CUDF_EXPECTS(schema_idx_map.find(schema_idx) != schema_idx_map.end(),
+                 "Unmapped schema index encountered in the specified source tree",
+                 std::out_of_range);


I don't understand why a runtime error is not permitted here. I read the linked thread. This feels like it should be a RuntimeError in Python, or maybe a ValueError (std::invalid_argument in C++, I think). Out of range feels wrong (here and below).

copy-pr-bot · 2024-08-06T15:45:51Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

mhaseeb123 · 2024-08-06T17:21:38Z

cpp/src/io/parquet/reader_impl_helpers.cpp

+    auto const& schema_idx_map = schema_idx_maps[src_idx - 1];
+    CUDF_EXPECTS(schema_idx_map.find(schema_idx) != schema_idx_map.end(),
+                 "Unmapped schema index encountered in the specified source tree",
+                 std::range_error);


Changed this to std::range_error. I can't say it's any better than std::out_of_range.
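One concrete difference between the two types is where they sit in the standard exception hierarchy: std::out_of_range derives from std::logic_error, while std::range_error derives from std::runtime_error. The sketch below checks this at compile time and shows which catch clause each type reaches; classify is a hypothetical helper written only for this illustration.

```cpp
#include <stdexcept>
#include <string>
#include <type_traits>

// Compile-time checks of the standard hierarchy.
static_assert(std::is_base_of<std::logic_error, std::out_of_range>::value,
              "std::out_of_range is a std::logic_error");
static_assert(std::is_base_of<std::runtime_error, std::range_error>::value,
              "std::range_error is a std::runtime_error");

// Run `thrower` and report which standard family the exception belongs to.
template <typename Thrower>
std::string classify(Thrower thrower)
{
  try {
    thrower();
  } catch (std::runtime_error const&) {
    return "runtime_error";
  } catch (std::logic_error const&) {
    return "logic_error";
  }
  return "none";
}
```

So a caller that catches std::runtime_error would see the new std::range_error but would not have seen the earlier std::out_of_range. How either type surfaces in Python depends on the exception translation used by the bindings, which this sketch does not cover.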

etseidl

Only found 2 nits 😄. Thanks @mhaseeb123, this may come in very handy down the road. LGTM.

mhaseeb123 · 2024-08-19T21:21:08Z

/ok to test

mhaseeb123 · 2024-08-28T22:06:05Z

/merge

Add support to read select cols from mismatched PQ sources

ef0e732

mhaseeb123 self-assigned this Jul 25, 2024

github-actions bot added the libcudf, Python, and pylibcudf labels Jul 25, 2024

minor refactoring and fix for possible segfault in gtests

ab5657b

github-actions bot added the Python and pylibcudf labels Jul 25, 2024

Add pytest with struct data

a2db0fa

mhaseeb123 commented Jul 25, 2024

mhaseeb123 commented Jul 25, 2024

mhaseeb123 and others added 4 commits July 25, 2024 20:37

Revert back to auto

2c6e944

Merge branch 'branch-24.10' into fea-pq-reader-mismatched-schema

fb05d17

Minor improvements

6d5f1be

Minor improvements

56d28a9

mhaseeb123 marked this pull request as ready for review July 25, 2024 21:23

mhaseeb123 requested review from a team as code owners July 25, 2024 21:23

mhaseeb123 requested review from bdice, galipremsagar and vuule July 25, 2024 21:23

mhaseeb123 removed the "2 - In Progress" label Jul 25, 2024

mhaseeb123 added 2 commits July 30, 2024 20:58

Merge branch 'branch-24.10' into fea-pq-reader-mismatched-schema

137f4b8

Merge branch 'branch-24.10' into fea-pq-reader-mismatched-schema

eb89e06

wence- approved these changes Jul 31, 2024

mhaseeb123 added 2 commits July 31, 2024 18:27

docs updates

110ca57

Merge branch 'branch-24.10' into fea-pq-reader-mismatched-schema

3fdb874

mhaseeb123 added the "4 - Needs Review" label and removed the "3 - Ready for Review" label Jul 31, 2024

bdice reviewed Aug 6, 2024

mhaseeb123 added 3 commits August 6, 2024 15:42

Merge branch 'branch-24.10' of https://github.com/mhaseeb123/cudf into fea-pq-reader-mismatched-schema

c5eb3b6

Merge branch 'branch-24.10' of https://github.com/mhaseeb123/cudf into fea-pq-reader-mismatched-schema

cbf181a

Merge branch 'fea-pq-reader-mismatched-schema' of https://github.com/mhaseeb123/cudf into fea-pq-reader-mismatched-schema

b020c47

Address PR review comments

88f2003

mhaseeb123 commented Aug 6, 2024

etseidl reviewed Aug 6, 2024

Apply suggestions from code review

b94db35

Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>

mhaseeb123 closed this Aug 19, 2024

mhaseeb123 deleted the fea-pq-reader-mismatched-schema branch August 19, 2024 18:22

mhaseeb123 restored the fea-pq-reader-mismatched-schema branch August 19, 2024 18:22

mhaseeb123 reopened this Aug 19, 2024

github-actions bot removed the pylibcudf label Aug 19, 2024

mhaseeb123 and others added 3 commits August 19, 2024 18:24

Merge branch 'branch-24.10' of https://github.com/mhaseeb123/cudf into fea-pq-reader-mismatched-schema

d461480

Merge branch 'fea-pq-reader-mismatched-schema' of https://github.com/mhaseeb123/cudf into fea-pq-reader-mismatched-schema

3132e5c

Merge branch 'branch-24.10' into fea-pq-reader-mismatched-schema

a2317a8

vuule approved these changes Aug 28, 2024

rapids-bot bot merged commit fbd6114 into rapidsai:branch-24.10 Aug 28, 2024
82 checks passed

mhaseeb123 deleted the fea-pq-reader-mismatched-schema branch August 28, 2024 22:06
