Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Allow Iterable[Hashable] in drop_duplicates #59392

Merged
merged 9 commits into from
Aug 13, 2024

Conversation

kirill-bash
Copy link
Contributor

@kirill-bash kirill-bash commented Aug 2, 2024

Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR!

Comment on lines 6797 to 6800
if isinstance(subset, set):
elem = subset.pop()
else:
elem = subset[0]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't calling .pop mutate the passed argument? In any case, I'd recommend using next(iter(subset)) in both cases.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! Updated.

pandas/core/frame.py Outdated Show resolved Hide resolved
@rhshadrach rhshadrach added Enhancement duplicated duplicated, drop_duplicates labels Aug 5, 2024
@rhshadrach
Copy link
Member

Also - can you give this PR an informative title. Something like ENH: Allow sets in drop_duplicates.

@kirill-bash kirill-bash changed the title ENH: pandas-dev#59237 ENH: Allow sets in drop_duplicates - pandas-dev#59237 Aug 7, 2024
Copy link
Contributor

@Dr-Irv Dr-Irv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should also update the docs to indicate that any iterable of labels is accepted.

@@ -6534,7 +6534,7 @@ def dropna(
@overload
def drop_duplicates(
self,
subset: Hashable | Sequence[Hashable] | None = ...,
subset: Hashable | Sequence[Hashable] | set | None = ...,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you should just replace Sequence[Hashable] with Iterable[Hashable] here (and in the other overloads), because even a dict works now.

Also, I suggest making a separate PR in pandas-stubs once this PR is accepted.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Dr-Irv,

Updated. Could you specify which docs I should be updating to reflect these changes?

Also, what is the difference between pandas and pandas-stubs?
I am a new contributor and am happy to raise the PR there but would like to understand more about each repo.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, what is the difference between pandas and pandas-stubs? I am a new contributor and am happy to raise the PR there but would like to understand more about each repo.

pandas is the code that executes.

pandas-stubs is a set of typing declarations meant for users to use when type checking their code. It is bundled within Visual Studio Code. See https://github.com/pandas-dev/pandas-stubs/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for clarifying. I will do as you suggested and open a PR there once this one is merged.

@kirill-bash kirill-bash changed the title ENH: Allow sets in drop_duplicates - pandas-dev#59237 ENH: Allow sets + Iterable[Hashable] in drop_duplicates - pandas-dev#59237 Aug 7, 2024
@kirill-bash kirill-bash requested a review from Dr-Irv August 7, 2024 16:39
Copy link
Contributor

@Dr-Irv Dr-Irv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with this. Will let someone else make the final call.

@@ -6535,7 +6534,7 @@ def dropna(
@overload
def drop_duplicates(
self,
subset: Hashable | Sequence[Hashable] | AbstractSet | None = ...,
subset: Hashable | Iterable[Hashable] | set | None = ...,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for set anymore now that you've changed Sequence to Iterable, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's true. Updated.

@@ -51,7 +51,7 @@ Other enhancements
- :meth:`Series.cummin` and :meth:`Series.cummax` now supports :class:`CategoricalDtype` (:issue:`52335`)
- :meth:`Series.plot` now correctly handle the ``ylabel`` parameter for pie charts, allowing for explicit control over the y-axis label (:issue:`58239`)
- Restore support for reading Stata 104-format and enable reading 103-format dta files (:issue:`58554`)
- Support passing a :class:`Set` input to :meth:`DataFrame.drop_duplicates` (:issue:`59237`)
- Support passing a :class:`Set` and :class:`Iterable[Hashable]` input to :meth:`DataFrame.drop_duplicates` (:issue:`59237`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here - can remove Set now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated. Thanks!

@kirill-bash kirill-bash changed the title ENH: Allow sets + Iterable[Hashable] in drop_duplicates - pandas-dev#59237 ENH: Allow Iterable[Hashable] in drop_duplicates - pandas-dev#59237 Aug 9, 2024
Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CI is failing on code checks. I think it's the line subset = self.columns # type: ignore[assignment] where the type: ignore can now be removed.

@rhshadrach rhshadrach added this to the 3.0 milestone Aug 13, 2024
Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, can be merged on green.

@rhshadrach rhshadrach changed the title ENH: Allow Iterable[Hashable] in drop_duplicates - pandas-dev#59237 ENH: Allow Iterable[Hashable] in drop_duplicates Aug 13, 2024
@rhshadrach rhshadrach merged commit 614939a into pandas-dev:main Aug 13, 2024
47 checks passed
@rhshadrach
Copy link
Member

Thanks @kirill-bash!

@kirill-bash
Copy link
Contributor Author

Appreciate your help, @rhshadrach!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
duplicated duplicated, drop_duplicates Enhancement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: Reduce type requirements for the subset parameter in drop_duplicates/duplicated
3 participants