Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CLN: improve is_unique check in Index.intersection #38154

Merged
merged 5 commits into from
Dec 2, 2020

Conversation

phofl
Copy link
Member

@phofl phofl commented Nov 29, 2020

  • passes black pandas
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
       before           after         ratio
     [e99e5ab3]       [79d6a50f]
     <cln_intersection~1>       <cln_intersection>
-        918±20ns          819±2ns     0.89  index_object.Range.time_min_trivial
-         465±7ns          414±7ns     0.89  index_object.Range.time_get_loc_dec
-         464±6ns          411±7ns     0.89  index_object.Range.time_get_loc_inc
-      16.6±0.7ms      12.6±0.03ms     0.76  index_object.SetOperations.time_operation('date_string', 'union')

Seems to have no impact.

cc @jreback

@jreback jreback added this to the 1.2 milestone Nov 29, 2020
@jreback
Copy link
Contributor

jreback commented Nov 29, 2020

ok let's merge this for 1.2, thought it would have more of an impact.

if not result.is_unique:
result = result.unique()._values
else:
result = result._values
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

im surprised this makes a difference. Index.unique should already be doing this check internally i guess

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback suggested to check for uniqueness before calling unique() to save a bit of time.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually @jbrockmendel makes a very good point here. Let's repurpose this PR to actually do a uniquess check internal to unique (of course this may only make sense for > 1000 elements or something).

@jreback jreback modified the milestones: 1.2, Contributions Welcome Nov 29, 2020
@jreback
Copy link
Contributor

jreback commented Nov 29, 2020

moving this off 1.2 for a refactor

@phofl
Copy link
Member Author

phofl commented Nov 29, 2020

Added the unique check to Index.unique. Categeorical.unique does remove values which are not within the categories. If we want to keep this, we can not add a check for uniqueness

@jreback
Copy link
Contributor

jreback commented Nov 29, 2020

can you run relevant benchmarks and see if this makes any difference?

@@ -2379,6 +2379,10 @@ def unique(self, level=None):
"""
if level is not None:
self._validate_index_level(level)

if self.is_unique:
return self
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this needs to return self._shallow_copy()

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx.

@phofl
Copy link
Member Author

phofl commented Nov 29, 2020

asvs in index_object:


       before           after         ratio
     [e99e5ab3]       [aba0f2a3]
     <master>         <cln_intersection>
-        47.6±1ms       41.4±0.2ms     0.87  index_object.SetOperations.time_operation('strings', 'intersection')
-      17.0±0.2ms       13.7±0.5ms     0.80  index_object.SetOperations.time_operation('date_string', 'union')

@jbrockmendel
Copy link
Member

LGTM

@jreback
Copy link
Contributor

jreback commented Dec 2, 2020

thanks @phofl

@phofl phofl deleted the cln_intersection branch December 2, 2020 12:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants