
[C++][Parquet] Default to compliant nested types in Parquet writer #29781

Closed
asfimport opened this issue Oct 1, 2021 · 12 comments · Fixed by #35146
asfimport commented Oct 1, 2021

In C++ there is already a "compliant_nested_types" option (to have list columns follow the Parquet specification), and ARROW-11497 exposed this option in Python.

It is still set to False by default, but the source contains the comment "TODO: At some point we should flip this.", and in ARROW-11497 there was also some discussion about what it would take to change the default.

cc @emkornfield @pitrou

Reporter: Will Jones / @wjones127
Assignee: Will Jones / @wjones127

Note: This issue was originally created as ARROW-14196. Please see the migration documentation for further details.

Joris Van den Bossche / @jorisvandenbossche:
Some relevant quotes from ARROW-11497:

[Micah] I think the main reason it isn't enabled by default is it breaks round trips for arrow data. This could potentially be fixed on the reader side as well.

[Antoine] Perhaps we could convert the field name at the Arrow<->Parquet boundary.
[Micah] This should be possible but it potentially needs another flag. I think in the short term plumbing the additional flag through to python makes sense and we can figure out a longer term solution if this becomes a larger problem.

[Antoine] It should simply be the default (and obviously right) behaviour. Am I missing something?
[Micah] Backwards compatibility? It might be possible to make some inferences (I haven't thought about it deeply). But I think if we were reading a conforming Java-produced Parquet file, then we would get different column names if we transformed on the border (maybe there can be some rules around Arrow metadata being present). I think we can make the default the conforming behavior, but we should give users some level of control to preserve the old behavior.

I am not super familiar with this, but the simplest option would be to just switch the default of the compliant_nested_types option in ArrowWriterProperties. What would be the (possibly backwards-incompatible) consequences of that?
We would start writing a different Parquet file (but one that actually follows the spec). But I suppose that when reading such a file, you would then also get a different name for the sub-lists (which can impact selecting a sub-list with a nested field reference?).
To avoid a breaking change on the read path, we could by default also convert the names at the Parquet->Arrow boundary (like the compliant_nested_types option already does on the Arrow->Parquet boundary). However, doing that can also break code for people who are currently already reading compliant Parquet files...
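(A minimal sketch of what such a Parquet->Arrow conversion could look like in user code, purely illustrative: read_with_arrow_names and _rename_list_values are hypothetical helper names, not pyarrow API, and this assumes Table.cast can cast between list types that differ only in the child field name.)

import pyarrow as pa
import pyarrow.parquet as pq

def _rename_list_values(field, name="item"):
    # Recursively rename the value field of list columns back to Arrow's
    # conventional "item", leaving non-list fields untouched.
    if pa.types.is_list(field.type):
        value = _rename_list_values(field.type.value_field.with_name(name), name)
        return pa.field(field.name, pa.list_(value), nullable=field.nullable)
    return field

def read_with_arrow_names(path):
    # Read a (possibly compliant) file and rename list children at the
    # Parquet->Arrow boundary, so "a: list<element: int64>" comes back
    # as "a: list<item: int64>".
    table = pq.read_table(path)
    return table.cast(pa.schema([_rename_list_values(f) for f in table.schema]))

Code written against the old "item" names could then keep working, at the cost of the breakage for already-compliant files mentioned above.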

Antoine Pitrou / @pitrou:
One possible investigation would be to write two different files, one with the option turned on, the other with the option turned off, and find out whether they are readable by 1) current version of Arrow 2) past versions of Arrow.

Joris Van den Bossche / @jorisvandenbossche:
I wrote this with the latest pyarrow (master):

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({'a': [[1, 2], [3, 4, 5]]})
pq.write_table(table, "test_nested_noncompliant.parquet")
pq.write_table(table, "test_nested_compliant.parquet", use_compliant_nested_type=True)

In the latest pyarrow they both read fine, but they have different names (which can impact e.g. nested field references):

>>> pq.read_table("test_nested_noncompliant.parquet")
pyarrow.Table
a: list<item: int64>
  child 0, item: int64

>>> pq.read_table("test_nested_compliant.parquet")
pyarrow.Table
a: list<element: int64>
  child 0, element: int64

So e.g. doing pq.read_table("test_nested_noncompliant.parquet", columns=["a.list.item"], use_legacy_dataset=True) works for the noncompliant file, but doesn't select anything with the compliant file.

Those files also read fine (and result in the same difference in list field names) with older versions of Arrow (tested down to Arrow 1.0).
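(As an illustration of working around the name difference, a caller could derive the column path from the file's own schema; list_column_path is a hypothetical helper, not pyarrow API, and assumes the single-level list layout of the example above.)

import pyarrow.parquet as pq

def list_column_path(path, column):
    # Build the Parquet column path for a single-level list column using
    # whatever element name the file actually uses ("item" or "element").
    value_name = pq.read_schema(path).field(column).type.value_field.name
    return f"{column}.list.{value_name}"

# list_column_path("test_nested_noncompliant.parquet", "a") -> "a.list.item"
# list_column_path("test_nested_compliant.parquet", "a")    -> "a.list.element"

The same path can then be passed as columns=[...] regardless of which convention the file was written with.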

Antoine Pitrou / @pitrou:
Is it really common to read only a list's child column without the list column itself? Otherwise we can live with the minor compatibility break, IMHO.

Antoine Pitrou / @pitrou:
I'm also surprised that a non-existing column name wouldn't return an error instead of selecting nothing?

Joris Van den Bossche / @jorisvandenbossche:

I'm also surprised that a non-existing column name wouldn't return an error instead of selecting nothing?

With the new datasets API, it actually raises an error if a column name is not found. With the legacy implementation (or the plain ParquetFile interface), it ignores those.
(and I was using use_legacy_dataset=True because with the Datasets API we don't yet support selecting nested fields...)

Truc Lam Nguyen / @trucnguyenlam:
[~judah.rand] Sorry, I don't really understand your comment; could you please explain a little bit more? Thanks.

Micah Kornfield / @emkornfield:
Yeah, I think it is really only the name change that I'm concerned about. https://issues.apache.org/jira/browse/ARROW-13151 has another example where people were trying to reference things by path, which was broken for other reasons.

A few items that don't really solve all cases but would make things better, or at least adaptable in the long term:

1. Add an option that translates the "compliant nested type" names back to Arrow's naming scheme.

2. Make it possible to select columns by eliding the list name components (see the sketch after this comment).

Another question, which is dataset-specific: if one file was written with compliant nested types and one was not, and both were read in the same dataset, are the results sane (do the schemas get unified)?
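(A rough, purely illustrative sketch of what suggestion 2 could mean: match requested names against physical column paths after dropping the components introduced by the list encoding. A real implementation would walk the schema instead of filtering on names, since a struct field could legitimately be called "item" or "element".)

LIST_COMPONENTS = {"list", "item", "element"}

def elide_list_components(path):
    # "a.list.item" and "a.list.element" both reduce to "a";
    # "outer.list.element.inner.list.item" reduces to "outer.inner".
    return ".".join(p for p in path.split(".") if p not in LIST_COMPONENTS)

def select_columns(requested, parquet_paths):
    # Match user-facing names against physical column paths, ignoring
    # whichever list naming convention the file happens to use.
    wanted = {elide_list_components(r) for r in requested}
    return [p for p in parquet_paths if elide_list_components(p) in wanted]

# select_columns(["a"], ["a.list.item"])    -> ["a.list.item"]
# select_columns(["a"], ["a.list.element"]) -> ["a.list.element"]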

Micah Kornfield / @emkornfield:
I think, similar to other default changes we have discussed, a very public advertisement of the intended change and its consequences would be good. At some point we should just bite the bullet. I think the first feature (taking a compliant nested type and renaming it on read) would give users a temporary fallback.

Micah Kornfield / @emkornfield:
Also CC @jpivarski for thoughts on how this might impact Awkward Array and ways to mitigate.

Jim Pivarski / @jpivarski:
If it's changing a default, I can just set the option explicitly, right?

If it's changing how columns are named in the file (currently "fieldA.list.fieldB.list", etc.), then that would require adjustment on my side, but it would be an adjustment that depends on how the file was written, not the pyarrow version, right? If so, I'd have to support both naming conventions because both would exist in the wild.

If there is a change that Awkward Array has to adjust to, then now may be a good time to do it, because we're going to be rewriting the "from_parquet" function soon as part of integrating with Dask (https://github.com/ContinuumIO/dask-awkward), adopting fsspec, and using pyarrow's Dataset API (to replace our manual implementation). If there's something that we can set to get the new behavior, we can turn that on now while developing the new version and be ready for the change.
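(For that last point: as the example earlier in this thread shows, the behavior the new default would enable can already be turned on explicitly today when writing with pyarrow.)

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"a": [[1, 2], [3, 4, 5]]})

# Opt in to the future default now: list children are written with the
# Parquet-compliant name ("element") instead of Arrow's "item".
pq.write_table(table, "compliant_now.parquet", use_compliant_nested_type=True)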

Joris Van den Bossche / @jorisvandenbossche:

Make it possible to select columns by eliding the list name components.

Currently, I think the C++ API only deals with column indices (at least for the Python bindings, the translation of column names to field indices happens in Python). For Python that should be relatively straightforward to implement; I opened ARROW-14286 for this.

If so, I'd have to support both naming conventions because both would exist in the wild.

@jpivarski yes, but that's already the case right now as well. Parquet files written by (py)arrow will use a different name for the list element compared to parquet files written by other tools (that's actually what we are trying to harmonize). So if you select a subfield of a list field by name, you already need to take into account potentially different names at the moment.

wjones127 added the "Breaking Change" label (includes a breaking change to the API) on Apr 14, 2023
pitrou added a commit that referenced this issue May 12, 2023
…ult (#35146)

### Rationale for this change

This has been a long-standing TODO.

### What changes are included in this PR?

### Are these changes tested?

### Are there any user-facing changes?

**This PR includes breaking changes to public APIs.**
* Closes: #29781

Lead-authored-by: Will Jones <willjones127@gmail.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Signed-off-by: Antoine Pitrou <antoine@python.org>
pitrou added this to the 13.0.0 milestone May 12, 2023
ArgusLi pushed a commit to Bit-Quill/arrow that referenced this issue May 15, 2023
rtpsw pushed a commit to rtpsw/arrow that referenced this issue May 16, 2023