
FEAT: Add how="any","all" to df.dropna, dropinf, etc #2104

Merged (5 commits, Jun 27, 2022)

Conversation


@NickCrews commented Jun 24, 2022

Fixes #2084

I only added tests for dropna(), since I didn't think the verbosity of also testing dropinf etc. would be worth it. But let me know what you think.
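As an aside, the how='any' / how='all' row-keeping semantics can be sketched in plain numpy (a hypothetical illustration of the behaviour only, not vaex's implementation; the data and names are made up):

```python
import numpy as np

# Two toy columns; NaN marks a missing value (hypothetical data).
x = np.array([1.1, np.nan, np.nan, 4.4])
y = np.array([1.0, 2.2, np.nan, 4.0])

missing = np.isnan(np.column_stack([x, y]))  # shape (4, 2), True where missing

# how='any': drop a row if ANY column is missing.
keep_any = ~missing.any(axis=1)
# how='all': drop a row only if ALL columns are missing.
keep_all = ~missing.all(axis=1)

print(keep_any.tolist())  # [True, False, False, True]
print(keep_all.tolist())  # [True, True, False, True]
```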

I also thought about moving the if/else check in _filter_all outside of the loop, to save a bit of overhead, but I figured that even looping through ~1000 columns would still be overshadowed by the actual pass through the data.
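Hoisting the branch out of the loop would look roughly like this (a hypothetical numpy sketch of the idea, not vaex's actual _filter_all; the function name is made up):

```python
import functools
import numpy as np

def filter_rows(columns, how):
    # Per-column "is valid" masks; computing these dominates the cost,
    # so a per-column branch on `how` would be negligible either way.
    masks = (~np.isnan(col) for col in columns)
    if how == "any":
        # Drop rows with any missing value: keep rows valid in every column.
        return functools.reduce(np.logical_and, masks)
    if how == "all":
        # Drop rows only if every value is missing: keep rows valid somewhere.
        return functools.reduce(np.logical_or, masks)
    raise ValueError("how must be 'any' or 'all'")
```

With the branch outside the loop, each reduce does one vectorised pass per column; the work saved is only a single Python-level comparison per column.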

@NickCrews NickCrews force-pushed the dropna-all branch 2 times, most recently from 6e64d8c to f0a3d8a Compare June 24, 2022 19:31
@maartenbreddels (Member)

Love it ❤️

@NickCrews (Contributor, Author)

Need me to pull in that fixup commit that addresses the pyarrow 6 failures in CI?

@JovanVeljanoski (Member)

@NickCrews yes please - please rebase on master.

Looks nice tho. I will review and test over the weekend. Thanks!

@JovanVeljanoski (Member)

Can you rebase on master again please? (I think the first time I suggested this, the fix for where was not merged yet, though I thought it was...)

@JovanVeljanoski (Member) left a review comment:

Nice! Overall I like the PR and would be happy for this to go through. Some considerations tho..

  • @maartenbreddels should we use the factory thingy to test this drop stuff with both arrow and numpy data (since we've had various problems with it in the past)? And should we add tests for dropnan and dropmissing?

Performance notes:

  • On the taxi dataset (1.1B rows, 18 cols, hdf5), dropna(how='all') takes 4-5 minutes to run. A lot of that time is spent with only one core running.
  • On (our) gdelt dataset (0.6B rows, 61 cols, hdf5), dropna(how='all') takes 1m 30s on average.
  • On randomly generated datasets with sklearn (300M rows, 31 cols, in memory), dropna(how='all') takes a couple of seconds (2-3 on average).

Does this performance make sense to you @maartenbreddels? In any case I am somewhat worried by what I see on the taxi dataset: monitoring CPU, either only one core is running, or multiple cores run at a very low rate (a few % at best, mostly lower). For the gdelt data, you can see all the cores near 10-20% utilisation. Maybe a deeper issue, or something worth looking into?

In any case, great job @NickCrews !

packages/vaex-core/vaex/dataframe.py: 4 review threads (outdated, resolved)
@pytest.fixture
def df_with_missings():
x = [1.1, np.nan, np.nan, 4.4, 5.5]
y = ["dog", "dog", None, "cat", None]
A Member commented:

Please revert back to single quotes :)

I wonder if we should use the factory here to generate a dataframe with all kinds of N/A values via both arrow and numpy, so we can test this properly.. or is that overkill?

@NickCrews (Contributor, Author) Jun 25, 2022:

OK re: quotes :) PS have you considered applying black formatting? I love it. It can solve some merge conflicts, in addition to (in my opinion) making the code more readable.

Does this factory exist already? Even if not, this seems like a worthwhile investment to write, since it would/should get used all over the place. Perhaps it should even be expanded to a df_with_nan_like that contains all of nan, NA, missing, and inf.

A Member replied:

Regarding black: yes we've considered it, but we will probably not go for it. We (or at least I) do not agree with all its stuff, and find it too restrictive. I do use it for some other projects tho. But for vaex, nah.
The quotes thing is just a personal preference tho :P (I find the single ones more aesthetically pleasing, and since I look at this a lot.. :D ).

As for the factory stuff, please look at common.py in the test folder. We also used it in the new tests for where, which are in compute_test.py (see e.g. test_where_primitive_masked_condition)
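For illustration, such a factory could look something like this (a minimal hypothetical sketch: the helper name and the numpy-masked backend are made up here, and the real pattern lives in vaex's tests/common.py):

```python
import numpy as np

def make_missing_array(values, backend="numpy"):
    """Build the same data under different missing-value representations
    (hypothetical helper for illustration only)."""
    arr = np.array([np.nan if v is None else v for v in values], dtype=float)
    if backend == "numpy":
        return arr                        # NaN encodes the missing entries
    if backend == "numpy-masked":
        return np.ma.masked_invalid(arr)  # an explicit mask encodes them
    raise ValueError(f"unknown backend: {backend}")
```

A pytest fixture parametrized over the backends could then feed both representations through the same dropna tests.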

@NickCrews (Contributor, Author):

Cool, re black, no problem!

I'll take a look at the factory

@NickCrews (Contributor, Author)

OK, that push to 349a939 has all the basic suggestions. I can try the factory test fixture too if you point me at an example that uses it.

re: performance: @JovanVeljanoski do you get the same numbers when doing how="any"? Or are you saying that this performance should be looked at regardless of this change?

@JovanVeljanoski (Member)

For the in-memory dataset, they are roughly the same.

For the gdelt dataset, how='any' is 20-30s faster (consistent across ~10 runs).

For the taxi dataset, how='any' performance is erratic.. sometimes it is even a bit slower than how='all'. I don't mean to block this, just raising some concerns. Maybe it is a deeper issue, maybe I don't understand everything about what's happening.

Just wanted to make sure this is "expected behavior". Perhaps it is a symptom of a deeper problem.

@NickCrews (Contributor, Author)

Thanks @JovanVeljanoski for those detailed tests! That's very perplexing to me, hopefully @maartenbreddels has all the answers :)

If there is a fixable cause of the slowdown, then I would say we should merge this and fix it later. The public API shouldn't change, so I don't see a danger of being backwards incompatible.

If the slowdown isn't fixable, i.e. it is just inherent in what we're doing, then it's less obvious what to do. Don't offer the option because it leads to slowness? But someone might just plain need to do this, so they'd be stuck. Just note in the docstring that it can be slower?

(commit message) Now we test against numpy, arrow, and chunked arrow arrays.
@NickCrews (Contributor, Author)

OK, see if that tweak is what we're looking for with the test factory

@JovanVeljanoski JovanVeljanoski self-requested a review June 27, 2022 13:59
@maartenbreddels maartenbreddels merged commit 35c250d into vaexio:master Jun 27, 2022
@maartenbreddels (Member)

Awesome, thanks!

Successfully merging this pull request may close these issues.

Question: What does dropna() do?
3 participants