Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow fillna to validate for CategoricalColumn.fillna #15683

Merged
merged 3 commits into from
May 7, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
8 changes: 5 additions & 3 deletions python/cudf/cudf/core/column/categorical.py
Original file line number Diff line number Diff line change
Expand Up @@ -1045,9 +1045,6 @@ def fillna(
"""
Fill null values with *fill_value*
"""
if not self.nullable:
return self

if fill_value is not None:
fill_is_scalar = np.isscalar(fill_value)

Expand Down Expand Up @@ -1079,6 +1076,11 @@ def fillna(
self.codes.dtype
)

# Validation of `fill_value` will have to be performed
# before returning self.
Comment on lines +1079 to +1080
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this comment suggests that validation of the value must still be performed. But I think it has been in the code above.

So perhaps

Even if there are no nulls, we must validate that the provided value is valid (done above) to match pandas

WDYT?

if not self.nullable:
return self

return super().fillna(fill_value, method=method)

def indices_of(
Expand Down
15 changes: 11 additions & 4 deletions python/cudf/cudf/core/frame.py
Original file line number Diff line number Diff line change
Expand Up @@ -762,10 +762,17 @@ def fillna(
else:
replace_val = None
should_fill = (
col_name in value
and col.has_nulls(include_nan=True)
and not libcudf.scalar._is_null_host_scalar(replace_val)
) or method is not None
(
col_name in value
and col.has_nulls(include_nan=True)
and not libcudf.scalar._is_null_host_scalar(replace_val)
)
or method is not None
or (
isinstance(col, cudf.core.column.CategoricalColumn)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: This special-cases categorical columns for fillna both to not care about whether the column has nulls and whether the column name is in the provided dict-like (?) of values.

It so happens that the _is_null_host_scalar check will be true in the latter case, but it's somewhat confusing.

Do you want to just provide a carve-out for categorical columns in the first condition as:

col_name in value
and (col.has_nulls(include_nan=True) or isinstance(col, CategoricalColumn))
and not libcudf.scalar._is_null_host_scalar(replace_val)

?

and not libcudf.scalar._is_null_host_scalar(replace_val)
)
)
if should_fill:
filled_data[col_name] = col.fillna(replace_val, method)
else:
Expand Down
16 changes: 16 additions & 0 deletions python/cudf/cudf/tests/test_categorical.py
Original file line number Diff line number Diff line change
Expand Up @@ -859,3 +859,19 @@ def test_cat_from_scalar(scalar):
gs = cudf.Series(scalar, dtype="category")

assert_eq(ps, gs)


def test_cat_groupby_fillna():
ps = pd.Series(["a", "b", "c"], dtype="category")
gs = cudf.from_pandas(ps)

with pytest.warns(FutureWarning):
pg = ps.groupby(ps)
gg = gs.groupby(gs)

assert_exceptions_equal(
lfunc=pg.fillna,
rfunc=gg.fillna,
lfunc_args_and_kwargs=(("d",), {}),
rfunc_args_and_kwargs=(("d",), {}),
)
Loading