Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Align buffers from Python (FFI) #6472

Merged
merged 4 commits into from
Oct 2, 2024

Conversation

EnricoMi
Copy link
Contributor

Which issue does this PR close?

Together with #6462 fixes #6471.

Rationale for this change

Data coming from Python could be misaligned. Using them will fail (see #6471):

Memory pointer from external source (e.g, FFI) is not aligned with the specified scalar type.
Before importing buffer through FFI, please make sure the allocation is aligned.

Costly copying the data is better than failing and not being able to use the data.

What changes are included in this PR?

Call existing ArrayData.align_buffers right after array data are retrieved via FFI (as suggested by the error message).

Are there any user-facing changes?

Data become useable.

@github-actions github-actions bot added the arrow Changes to the arrow crate label Sep 27, 2024
@tustvold
Copy link
Contributor

I think there is an expectation that FFI doesn't copy buffers, and that data provided is aligned as per the arrow specification.

This is actually a bug in pyarrow and there are some solutions on the linked issue - apache/arrow#32276 (comment)

That being said it might be good to document this further as this bug has been around for a while now

@EnricoMi
Copy link
Contributor Author

@tustvold How does Rust Arrow Flight avoid mis-aligned buffers given it also uses gRPC?

@tustvold
Copy link
Contributor

tustvold commented Sep 27, 2024

Tonic, our gRPC implementation, provides a zero-copy decode mechanism using Bytes, this internally works by taking refcounted slices of the underlying HTTP frame's payload. This is then fed into the IPC decoder which then constructs Buffer from this, copying when necessary for alignment. This does mean that potentially some data is copied twice, once from the network socket and once into the destination Buffer, but ultimately one of these copies is unavoidable with a networked protocol and practically the network throughput will be the major bottleneck regardless.

It is also worth noting we enforce alignment based on the encoded data, this means that string data, which often comprises the majority of data by volume, will never be copied more than once as there is no alignment requirements for bytes.

@EnricoMi
Copy link
Contributor Author

EnricoMi commented Oct 1, 2024

This is then fed into the IPC decoder which then constructs Buffer from this, copying when necessary for alignment.

@tustvold Maybe C++ Flight client should do the same: copy buffers if needed when handed to IPC: apache/arrow#44279

@tustvold
Copy link
Contributor

tustvold commented Oct 1, 2024

That makes sense to me, although the final call will lie with the maintainers of the C++ implementation

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @EnricoMi -- this looks great. The only thing I think missing is a test (perhaps you can adapt some part of your wonderful reproducer from #6471?)

@@ -363,13 +363,14 @@ impl FromPyArrow for RecordBatch {

let schema_ptr = unsafe { schema_capsule.reference::<FFI_ArrowSchema>() };
let ffi_array = unsafe { FFI_ArrowArray::from_raw(array_capsule.pointer().cast()) };
let array_data = unsafe { ffi::from_ffi(ffi_array, schema_ptr) }.map_err(to_py_err)?;
let mut array_data = unsafe { ffi::from_ffi(ffi_array, schema_ptr) }.map_err(to_py_err)?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @EnricoMi -- this fix looks great to me (if the alternatives are error or copy the data, I think we should copy the data).

As @tustvold says ideally the C++ should properly align things but this seems reasonable to me

@alamb
Copy link
Contributor

alamb commented Oct 1, 2024

It also appears we need to run cargo fmt to get the Ci to pass

@tustvold
Copy link
Contributor

tustvold commented Oct 1, 2024

I feel fairly strongly that we should keep the current contract where a misalignment yields an error, as it will be surprising to users if buffers get copied across an FFI boundary, which is reasonably expected to keep the buffers as-is.

That being said, should it not be possible to make progress on apache/arrow#44279 which fixes the actual upstream issue we can potentially move forward on this, perhaps gated behind a feature flag.

@alamb
Copy link
Contributor

alamb commented Oct 1, 2024

I feel fairly strongly that we should keep the current contract where a misalignment yields an error, as it will be surprising to users if buffers get copied across an FFI boundary, which is reasonably expected to keep the buffers as-is.

FWIW I think this PR is very specific to pyarrow (it will align misaligned data from pyarrow but not other arbitrary FFI boundaries).

which fixes the actual upstream issue we can potentially move forward on this, perhaps gated behind a feature flag.

It seems reasonable to me to make a feature flag like "ffi_strict" that would throw an error if pyarrow provides misaligned memory.

I think it should be disabled by default (aka accept data made by pyarrow by default) so as not to penalize arrow-rs users for a bug in pyarrow.

@tustvold
Copy link
Contributor

tustvold commented Oct 1, 2024

FWIW I think this PR is very specific to pyarrow

Aah, I missed that detail, in which case this is probably fine.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I merged this PR up from main, ran cargo fmt and added some comments

I plan to create the next arrow release candidate today and would like to include this fix as well

It would still be good to have tests for this scenario, but I think creating the testing framework (python flight client, server, etc) might be overly burdensome. @EnricoMi let me know your thoughts.

@tustvold tustvold merged commit 2c524f1 into apache:master Oct 2, 2024
23 checks passed
@EnricoMi EnricoMi deleted the align-buffers-from-python branch October 2, 2024 12:29
@EnricoMi
Copy link
Contributor Author

EnricoMi commented Oct 2, 2024

... but I think creating the testing framework (python flight client, server, etc) might be overly burdensome. @EnricoMi let me know your thoughts.

That is actually quite simple: https://github.com/EnricoMi/arrow-rs/pull/1/files#diff-6a2014ebd05cb0c56d1154780d5836d04ac1f4d8626de3da67edc64bb03414a0

But I don't know how to pass the data through the FromPyArrow path as that test takes a different route.

@alamb
Copy link
Contributor

alamb commented Oct 2, 2024

I will be honest I don't understand how to invoke this code either 🤔

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Flight data retrieved via Python client (wrapping C++) cannot be used by Rust Arrow
3 participants