
FEAT: Add how="any","all" to df.dropna, dropinf, etc #2104

Merged (5 commits, Jun 27, 2022)

Conversation


@NickCrews commented Jun 24, 2022

Fixes #2084

I only added tests for dropna(), since I didn't think the verbosity of also testing dropinf etc. would be worth it. But let me know what you think.
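As an aside, the how='any' / how='all' row-keeping semantics can be sketched in plain numpy (a hypothetical illustration of the behaviour only, not vaex's implementation; the data and names are made up):

```python
import numpy as np

# Two toy columns; NaN marks a missing value (hypothetical data).
x = np.array([1.1, np.nan, np.nan, 4.4])
y = np.array([1.0, 2.2, np.nan, 4.0])

missing = np.isnan(np.column_stack([x, y]))  # shape (4, 2), True where missing

# how='any': drop a row if ANY column is missing.
keep_any = ~missing.any(axis=1)
# how='all': drop a row only if ALL columns are missing.
keep_all = ~missing.all(axis=1)

print(keep_any.tolist())  # [True, False, False, True]
print(keep_all.tolist())  # [True, True, False, True]
```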

I also thought about moving the if/else check in _filter_all outside of the loop, to save a bit of overhead, but I figured that even looping through ~1000 columns would still be overshadowed by the actual pass through the data.
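Hoisting the branch out of the loop would look roughly like this (a hypothetical numpy sketch of the idea, not vaex's actual _filter_all; the function name is made up):

```python
import functools
import numpy as np

def filter_rows(columns, how):
    # Per-column "is valid" masks; computing these dominates the cost,
    # so a per-column branch on `how` would be negligible either way.
    masks = (~np.isnan(col) for col in columns)
    if how == "any":
        # Drop rows with any missing value: keep rows valid in every column.
        return functools.reduce(np.logical_and, masks)
    if how == "all":
        # Drop rows only if every value is missing: keep rows valid somewhere.
        return functools.reduce(np.logical_or, masks)
    raise ValueError("how must be 'any' or 'all'")
```

With the branch outside the loop, each reduce does one vectorised pass per column; the work saved is only a single Python-level comparison per column.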

@NickCrews NickCrews force-pushed the dropna-all branch 2 times, most recently from 6e64d8c to f0a3d8a Compare June 24, 2022 19:31
@maartenbreddels (Member)

Love it ❤️

@NickCrews (Contributor, Author)

Need me to pull in that fixup commit that addresses the pyarrow 6 failures in CI?

@JovanVeljanoski (Member)

@NickCrews yes please - please rebase on master.

Looks nice tho. I will review and test over the weekend. Thanks!

@JovanVeljanoski (Member)

Can you rebase on master again please? (I think the first time I suggested this, the fix for where was not merged yet, though I thought it was...)

@JovanVeljanoski (Member) left a review comment:

Nice! Overall I like the PR and would be happy for this to go through. Some considerations tho..

  • @maartenbreddels should we use the factory thingy to test this drop stuff with both arrow and numpy data (since we've had various problems with it in the past)? And should we add tests for dropnan and dropmissing?

Performance notes:

  • On the taxi dataset (1.1B rows, 18 cols, hdf5), dropna(how='all') takes 4-5 minutes to run. A lot of that time is spent with only one core running.
  • On (our) gdelt dataset (0.6B rows, 61 cols, hdf5), dropna(how='all') takes 1m 30s on average.
  • On randomly generated datasets with sklearn (300M rows, 31 cols, in memory), dropna(how='all') takes a couple of seconds (2-3 on average).

Does this performance make sense to you @maartenbreddels? In any case I am somewhat worried by what I see on the taxi dataset: monitoring CPU, either only one core is running, or multiple cores run at a very low rate (a few % at best, mostly lower). For the gdelt data, you can see all the cores near 10-20% utilisation. Maybe a deeper issue, or something worth looking into?

In any case, great job @NickCrews !

packages/vaex-core/vaex/dataframe.py: 4 review threads (outdated, resolved)
@pytest.fixture
def df_with_missings():
x = [1.1, np.nan, np.nan, 4.4, 5.5]
y = ["dog", "dog", None, "cat", None]
A Member commented:

Please revert back to single quotes :)

I wonder if we should use the factory here to generate a dataframe with all kinds of N/A values via both arrow and numpy, so we can test this properly.. or is that overkill?

@NickCrews (Contributor, Author) Jun 25, 2022:

OK re: quotes :) PS have you considered applying black formatting? I love it. It can solve some merge conflicts, in addition to (in my opinion) making the code more readable.

Does this factory exist already? Even if not, this seems like a worthwhile investment to write, since it would/should get used all over the place. Perhaps it should even be expanded to a df_with_nan_like that contains all of nan, NA, missing, and inf.

A Member replied:

Regarding black: yes we've considered it, but we will probably not go for it. We (or at least I) do not agree with all its stuff, and find it too restrictive. I do use it for some other projects tho. But for vaex, nah.
The quotes thing is just a personal preference tho :P (I find the single ones more aesthetically pleasing, and since I look at this a lot.. :D ).

As for the factory stuff, please look at common.py in the test folder. We also used it in the new tests for where, which are in compute_test.py (see e.g. test_where_primitive_masked_condition)
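For illustration, such a factory could look something like this (a minimal hypothetical sketch: the helper name and the numpy-masked backend are made up here, and the real pattern lives in vaex's tests/common.py):

```python
import numpy as np

def make_missing_array(values, backend="numpy"):
    """Build the same data under different missing-value representations
    (hypothetical helper for illustration only)."""
    arr = np.array([np.nan if v is None else v for v in values], dtype=float)
    if backend == "numpy":
        return arr                        # NaN encodes the missing entries
    if backend == "numpy-masked":
        return np.ma.masked_invalid(arr)  # an explicit mask encodes them
    raise ValueError(f"unknown backend: {backend}")
```

A pytest fixture parametrized over the backends could then feed both representations through the same dropna tests.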

@NickCrews (Contributor, Author):

Cool, re black, no problem!

I'll take a look at the factory

@NickCrews (Contributor, Author)

OK, that push to 349a939 has all the basic suggestions. I can try the factory test fixture too if you point me at an example that uses it.

re: performance: @JovanVeljanoski do you get the same numbers when doing how="any"? Or are you saying that this performance should be looked at regardless of this change?

@JovanVeljanoski (Member)

For the in-memory dataset, they are roughly the same.

For the gdelt dataset, how='any' is 20-30s faster (consistent across ~10 runs).

For the taxi dataset, how='any' performance is erratic.. sometimes it is even a bit slower than how='all'. I don't mean to block this, just raising some concerns. Maybe it is a deeper issue, maybe I don't understand everything about what's happening.

Just wanted to make sure this is "expected behavior". Perhaps it is a symptom of a deeper problem.

@NickCrews (Contributor, Author)

Thanks @JovanVeljanoski for those detailed tests! That's very perplexing to me, hopefully @maartenbreddels has all the answers :)

If there is a fixable cause of the slowdown, then I would say we should merge this and fix it later. The public API shouldn't change, so I don't see a danger of being backwards incompatible.

If the slowdown isn't fixable, i.e. it is just inherent in what we're doing, then it's less obvious what to do. Don't offer the option because it leads to slowness? But someone might just plain need to do this, so they'd be stuck. Just note in the docstring that it can be slower?

(commit message) Now we test against numpy, arrow, and chunked arrow arrays.
@NickCrews (Contributor, Author)

OK, see if that tweak is what we're looking for with the test factory

@JovanVeljanoski JovanVeljanoski self-requested a review June 27, 2022 13:59
@maartenbreddels maartenbreddels merged commit 35c250d into vaexio:master Jun 27, 2022
@maartenbreddels (Member)

Awesome, thanks!

Successfully merging this pull request may close these issues.

Question: What does dropna() do?
3 participants