Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix nunique for MultiIndex, DataFrame, and all NA case with dropna=False #15962

Merged
merged 17 commits into from
Jun 14, 2024

Conversation

mroeschke
Copy link
Contributor

Description

Fixes 3 bugs with nunique

  • MultiIndex.nunique returning a dict instead of an int
  • .nunique(dropna=False) with all NAs returning 0 instead of 1
  • DataFrame.nunique preserving column class and type in the resulting Series.index

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@mroeschke mroeschke added bug Something isn't working non-breaking Non-breaking change cuDF (Python) labels Jun 10, 2024
@mroeschke mroeschke requested review from a team as code owners June 10, 2024 20:04
@github-actions github-actions bot added libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API. labels Jun 10, 2024
Copy link
Contributor

@galipremsagar galipremsagar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will have to investigate 1+% decrease in pass rate too..

Comment on lines 2855 to 2857
data = [np.nan, np.nan]
pd_ser = pd.Series(data)
cudf_ser = cudf.Series(data, nan_as_null=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we are actually testing nan case here or do you want to test only NA cases?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was hoping to just test the NA case here i.e. where dropna would impact the nunique count

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, okay. Can we use None then instead of np.nan?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure thing. Changed to use None

@@ -2162,3 +2162,14 @@ def test_multi_index_contains_hashable():
lfunc_args_and_kwargs=((),),
rfunc_args_and_kwargs=((),),
)


@pytest.mark.parametrize("array", [[1, 2], [1, np.nan], [np.nan] * 2])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same concern as below

@mroeschke
Copy link
Contributor Author

mroeschke commented Jun 12, 2024

We will have to investigate 1+% decrease in pass rate too..

As of 72b7ac3, it looks like the diff job is reporting a 0% change. https://github.com/rapidsai/cudf/actions/runs/9486926113

(Looks like there's still some noise in test failures rate between commits e.g. https://github.com/rapidsai/cudf/actions/runs/9473554707 vs https://github.com/rapidsai/cudf/actions/runs/9486926113)

Copy link
Contributor

@galipremsagar galipremsagar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

python approval. I think we need two cpp approvals for the libcudf changes.

Copy link
Contributor

@davidwendt davidwendt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor suggestion.

Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few minor suggestions, generally looks good.

python/cudf/cudf/tests/test_multiindex.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/multiindex.py Outdated Show resolved Hide resolved
Co-authored-by: David Wendt <45795991+davidwendt@users.noreply.github.com>
mroeschke and others added 2 commits June 14, 2024 09:52
Co-authored-by: Bradley Dice <bdice@bradleydice.com>
Co-authored-by: Bradley Dice <bdice@bradleydice.com>
Copy link
Contributor

@shrshi shrshi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

@mroeschke
Copy link
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit 2ad502e into rapidsai:branch-24.08 Jun 14, 2024
77 checks passed
@mroeschke mroeschke deleted the bug/nunique branch July 26, 2024 02:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

6 participants