Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Interchange protocol fixes and updates #2150

Merged
merged 12 commits into from
Oct 5, 2022

Conversation

honno
Copy link
Contributor

@honno honno commented Aug 4, 2022

The interchange protocol had some spec changes recently (data-apis/dataframe-api#74), so this PR namely updates vaex to conform with them.

There's new behaviour with datatimes (NaTs are now sentinel values), but they first require vaex to support interchanging datetimes generally, so would be better served in a future PR.

cc @maartenbreddels

@honno
Copy link
Contributor Author

honno commented Aug 4, 2022

I didn't realise CI has 3.6 and 3.7 jobs. I see #1802, so assume these versions want to be generally maintained for? I have used f-stirngs in this PR, although note the df protocol implementation and tests already used them.

Copy link
Member

@maartenbreddels maartenbreddels left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really happy to see this PR, sorry for taking so long to review this.

I'll push two small commits, you only need to accept 1 suggestion (if you agree)

packages/vaex-core/vaex/dataframe_protocol.py Outdated Show resolved Hide resolved
tests/dataframe_protocol_test.py Outdated Show resolved Hide resolved
@honno
Copy link
Contributor Author

honno commented Sep 5, 2022

Forgot to update some refs of size (i.e. use size() instead), so made the necessary updates and wrote a test. Found a pre-existing bug with Column.get_chunks() that now has some xfailed cases... probably something to fix in a future PR I imagine.

Otherwise I think I've addressed your comments now :)

As an aside, might I ask if you still need to maintain for out-dated Python versions? From a selfish perspective, removing those jobs would bring CI to green 😅

@maartenbreddels
Copy link
Member

Yeah, Python 3.6 is still being used by clients of a client, so I'm still stuck with that :)
Note that 3.6 has f-strings, just not that = syntax in the f-string.

@maartenbreddels
Copy link
Member

pyarrow.lib.ArrowInvalid: Dictionary array invalid: Invalid: First or last binary offset out of bounds

This seems like a legit issue :/

@honno
Copy link
Contributor Author

honno commented Sep 6, 2022

pyarrow.lib.ArrowInvalid: Dictionary array invalid: Invalid: First or last binary offset out of bounds

This seems like a legit issue :/

I wasn't even scrolling down the failing jobs heh, didn't notice all win+mac jobs were failing. Seems to be a pre-existing problem, looking at the log for an older CI run, so I'd suggest we leave it for a future PR. Ideally we'd xfail those tests for those platforms—is there any platform mechanism for that already exists in vaex that I should use, or shall I try come up with something? Seems like an introduced issue in how I conform to the spec changes. Given the OS requirements and my ignorance on vaex/pyarrow here, I'm not particularly motivated to debug this haha, so platform-specific xfails still feel appropriate (seeing this PR is still a plus in that it conforms to the spec and can actually mostly work for a lot of interchange, if just for linux).

@maartenbreddels
Copy link
Member

Ok, sorry this takes so long, but apart from being very busy, this was difficult to debug.

The issue seems to be that convert_categorical_column calls convert_string_column, calling buffer_to_ndarray which passes the pointer to numpy, but but the nobody has a reference to the _VaexBuffer anymore, which releases the reference to the arrow array, which then release the underlying memory.

In this specific case, we get the Arrow buffers via bitmap_buffer, offsets, string_bytes = self._col.evaluate().buffers() which returns an arrow array Vaex' doesn't hold the reference to any more (as will be the case for virtual column as well).

So the question is, using the .ptr mechanism, who's responsible for holding a reference to the underlying memory?

@honno
Copy link
Contributor Author

honno commented Sep 30, 2022

Looking at the diff again, this seems like it was an underlying issue that's only come up because this PR handles string labels correctly for categoricals? Mind I'm pretty ignorant when it comes to memory management (seems like a nightmare to debug heh).

So if this isn't an introduced bug and what was working still works, I'd recommend to just xfail the failing test cases if you don't have the bandwidth to fix this right now. I did introduce tests covering string behaviour generally, which should help prevent regressions when attempting to fix this.

@maartenbreddels
Copy link
Member

Yeah, feel free to xfail that test for now, on all platform

@honno
Copy link
Contributor Author

honno commented Oct 4, 2022

Yeah, feel free to xfail that test for now, on all platform

Done!

Copy link
Member

@maartenbreddels maartenbreddels left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks ! ❤️

@maartenbreddels maartenbreddels merged commit b3567d6 into vaexio:master Oct 5, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants