ENH: Add return_inverse to cython-unique; unify unique/factorize-code #23400

Merged
merged 28 commits into pandas-dev:master on Nov 29, 2018

Conversation

@h-vetinari (Contributor) commented Oct 28, 2018

This is a continuation/split-off from #22986, where I tried to deduplicate the cython code for unique/factorize and add a return_inverse kwarg to unique at the same time. That didn't fully work because it came with a performance impact that was deemed unacceptable. I've opened cython/cython#2660 to figure out why (or, alternatively, to get a directive that allows force-compiling specializations for different parameter values, which would solve this more elegantly), and @robertwb told me the culprit is likely the use of kwargs, but also suggested using explicit templating on return_inverse.

First I tried unification without any kwargs in 858f54e - this still had the same perf impact as in #22986.
Then I added the explicit templating in 4ed354a, which successfully avoids the perf impact but is a bit uglier in the pxi.in. Finally, I re-added the kwargs in 906cd50 to keep the diff here smaller; this has no measurable impact.
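For illustration, explicit templating here means letting tempita emit a fully specialized function per value of return_inverse, so the branch is resolved at code-generation time rather than inside the hot loop. A minimal, hypothetical fragment (not the actual pxi.in contents; names are made up):

{{for return_inverse in [True, False]}}

def unique{{'_with_inverse' if return_inverse else ''}}(self, ndarray values):
    # ... hashtable fill loop over values ...
    {{if return_inverse}}
    # only emitted into the specialized variant: record, for each input
    # position, the index of its value among the uniques
    labels[i] = idx
    {{endif}}

{{endfor}}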

I've had many unsuccessful attempts at running the ASVs for all those combinations over the weekend, so I'm falling back to timing the unique code explicitly (code at the end). The overview is as follows (for completeness I've added #22986 and the commit immediately preceding it; all results are in milliseconds):

>>> df
                  e5196aaac6 99e640186e caea25a8b9 858f54e5e4 4ed354a954  906cd50e83 0d6dad0ca1
                    #22986~1     #22986     master  no kwargs  templated tmpl+kwargs  PR+master
StringIndex        16.220276  16.023982  15.967190  15.706223  15.624902   14.952832  15.480224
CategoricalIndex    1.274368   1.289882   1.231592   1.400420   1.247882    1.250814   1.327520
IntIndex            2.743178   2.778960   2.722282   3.115781   2.818848    2.802402   2.877664
UIntIndex           2.760333   2.773203   2.714546   3.037279   2.786909    2.800675   2.914427
RangeIndex          2.765113   2.773916   2.804379   3.071501   2.801661    2.803809   2.816012
FloatIndex          4.734558   4.576122   4.575152   5.043915   4.518481    4.548134   4.732967
TimedeltaIndex      4.361350   4.238442   4.187650   4.601932   4.369402    4.409724   4.422439
StringSeries       59.779114  58.508761  57.742805  58.343838  58.266719   57.043169  58.619018
CategoricalSeries   2.708816   2.666240   2.627852   2.895955   2.664937    2.658149   2.715396
IntSeries           5.078922   4.991856   4.928836   5.480455   5.108894    5.096692   5.434841
UIntSeries          5.127376   4.984227   5.027707   5.445736   5.155703    5.080125   5.211029
RangeSeries         5.218271   5.013379   4.973759   5.531086   5.193241    5.091924   5.197651
FloatSeries         7.126269   6.959626   6.988052   7.738704   6.977395    6.978753   7.215513
TimedeltaSeries     5.146969   4.985955   5.024334   5.551364   5.153091    5.108740   5.158593

or, relative to the master commit this PR is based on:

>>> df.divide(df.iloc[:, 2], axis=0)
                  e5196aaac6 99e640186e caea25a8b9 858f54e5e4 4ed354a954  906cd50e83 0d6dad0ca1
                    #22986~1     #22986     master  no kwargs  templated tmpl+kwargs  PR+master
StringIndex         1.015850   1.003557        1.0   0.983656   0.978563    0.936472   0.969502
CategoricalIndex    1.034733   1.047329        1.0   1.137081   1.013227    1.015607   1.077889
IntIndex            1.007676   1.020820        1.0   1.144547   1.035472    1.029431   1.057078
UIntIndex           1.016868   1.021608        1.0   1.118890   1.026658    1.031729   1.073633
RangeIndex          0.985999   0.989138        1.0   1.095252   0.999031    0.999797   1.004148
FloatIndex          1.034842   1.000212        1.0   1.102458   0.987613    0.994095   1.034494
TimedeltaIndex      1.041479   1.012129        1.0   1.098930   1.043402    1.053031   1.056067
StringSeries        1.035265   1.013265        1.0   1.010409   1.009073    0.987884   1.015174
CategoricalSeries   1.030810   1.014608        1.0   1.102024   1.014112    1.011529   1.033314
IntSeries           1.030450   1.012786        1.0   1.111917   1.036531    1.034056   1.102662
UIntSeries          1.019824   0.991352        1.0   1.083145   1.025458    1.010426   1.036462
RangeSeries         1.049160   1.007966        1.0   1.112053   1.044128    1.023758   1.045015
FloatSeries         1.019779   0.995932        1.0   1.107419   0.998475    0.998669   1.032550
TimedeltaSeries     1.024408   0.992361        1.0   1.104895   1.025627    1.016799   1.026722

So, long story short, this PR prepares the hashtable backend to support return_inverse=True, which plays into #21357 #21645 #22824, and will also make it easy to solve #21720.
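To make the intended semantics concrete, here is a rough usage sketch at the hashtable level (treat the exact method/kwarg spelling as an assumption; the point is that the inverse comes as an additional, appended output that reconstructs the input):

import numpy as np
from pandas._libs import hashtable as ht

values = np.array([3, 1, 3, 2, 1], dtype=np.int64)
table = ht.Int64HashTable()

uniques = table.unique(values)                                # unchanged default
uniques, inverse = table.unique(values, return_inverse=True)  # new capability
assert (uniques[inverse] == values).all()                     # inverse reconstructs the input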

Code for the above timings:

import numpy as np
import pandas as pd
import pandas.util.testing as tm
import timeit

# identify the commit being benchmarked; results are written to a per-commit CSV
commit = pd.__git_version__[:10]

k = 10 ** 5    # number of elements per index
rep = 10       # timeit repetitions (the first one is discarded as warm-up)
number = 100   # calls per repetition

tags = ['String', 'Categorical', 'Int', 'UInt',
        'Range', 'Float', 'Timedelta']
df = pd.DataFrame(index=[x + 'Index' for x in tags], columns=['mean', 'std'])

np.random.seed(55555)
with tm.RNGContext(55555):
    for tag in tags:
        # time Index.unique on a freshly generated index of the given type
        idx = getattr(tm, f'make{tag}Index')(k=k)
        t = timeit.repeat('idx.unique()', setup='from __main__ import idx',
                          repeat=rep + 1, number=number)[1:]
        df.loc[tag + 'Index', 'mean'] = pd.Series(t).mean() / number
        df.loc[tag + 'Index', 'std'] = pd.Series(t).std() / number

        # time Series.unique on a 2x oversampled (hence non-unique) version
        s = pd.Series(idx).sample(frac=2, replace=True)
        t = timeit.repeat('s.unique()', setup='from __main__ import s',
                          repeat=rep + 1, number=number)[1:]
        df.loc[tag + 'Series', 'mean'] = pd.Series(t).mean() / number
        df.loc[tag + 'Series', 'std'] = pd.Series(t).std() / number

df.to_csv(f'test_{commit}.csv')
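The per-commit CSVs produced by this script were then combined into the comparison tables at the top; one way to do that (a sketch, using the commit hashes from the table above):

import pandas as pd

hashes = ['e5196aaac6', '99e640186e', 'caea25a8b9', '858f54e5e4',
          '4ed354a954', '906cd50e83', '0d6dad0ca1']
# one column of mean timings per commit, converted from seconds to milliseconds
df = pd.concat({h: pd.read_csv(f'test_{h}.csv', index_col=0)['mean']
                for h in hashes}, axis=1) * 1000
# normalized against the master commit this PR is based on (third column)
rel = df.divide(df.iloc[:, 2], axis=0)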

@pep8speaks

Hello @h-vetinari! Thanks for submitting the PR.

@h-vetinari changed the title from "Unique inverse cython" to "ENH: Add return_inverse to cython-unique; unify unique/factorize-code" on Oct 28, 2018
@codecov (bot) commented Oct 28, 2018

Codecov Report

Merging #23400 into master will not change coverage.
The diff coverage is 100%.


@@           Coverage Diff           @@
##           master   #23400   +/-   ##
=======================================
  Coverage   92.31%   92.31%           
=======================================
  Files         161      161           
  Lines       51483    51483           
=======================================
  Hits        47525    47525           
  Misses       3958     3958
Flag Coverage Δ
#multiple 90.7% <100%> (ø) ⬆️
#single 42.43% <100%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/algorithms.py 95.11% <100%> (ø) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@gfyoung added labels Internals, Clean on Oct 29, 2018
@gfyoung added labels Enhancement, Algos on Oct 29, 2018
@jbrockmendel (Member)

I'll take a look at this.

@h-vetinari a couple of preliminary, totally uninformed questions. Does it make sense to implement asvs for this? Can any of the tempita-using code be implemented with fused types (which I for one find more readable)?

@h-vetinari (Contributor, Author) commented Oct 31, 2018

Finally got an ASV run to complete without being too noisy (see #23412 for an earlier attempt).

       before           after         ratio
     [7191af9b]       [19c7c1f8]
     <master>         <unique_inverse_cython>
+      10.2±0.8ms         109±10ms    10.77  frame_ctor.FromRecords.time_frame_from_records_generator(None)
+      1.41±0.2ms       3.12±0.8ms     2.22  index_object.Indexing.time_get_loc_non_unique_sorted('Int')
+        547±80μs         1.09±0ms     2.00  indexing_engines.NumericEngineIndexing.time_get_loc((<class 'pandas._libs.index.UInt8Engine'>, <class 'numpy.uint8'>), 'non_monotonic')
+      12.5±0.8ms         15.6±0ms     1.25  groupby.Transform.time_transform_multi_key1
+        12.5±0ms         15.6±0ms     1.25  replace.FillNa.time_replace(False)
+        14.1±0ms         15.6±0ms     1.11  io.sql.ReadSQLTableDtypes.time_read_sql_table_column('float')
-        15.6±0ms       14.1±0.6ms     0.90  groupby.Float32.time_sum
-         156±0μs          141±0μs     0.90  groupby.GroupByMethods.time_dtype_as_field('float', 'shift', 'direct')
-        156±60μs          141±0μs     0.90  indexing.NumericSeriesIndexing.time_ix_slice(<class 'pandas.core.indexes.numeric.Int64Index'>, 'unique_monotonic_inc')
-        1.56±0ms         1.41±0ms     0.90  indexing_engines.NumericEngineIndexing.time_get_loc((<class 'pandas._libs.index.Int8Engine'>, <class 'numpy.int8'>), 'monotonic_incr')
-        1.56±0ms         1.41±0ms     0.90  replace.FillNa.time_fillna(True)
-        15.6±0ms         12.5±0ms     0.80  binary_ops.AddOverflowScalar.time_add_overflow_scalar(-1)
-        15.6±0ms         12.5±0ms     0.80  timeseries.AsOf.time_asof('DataFrame')
-         625±0μs          188±0μs     0.30  indexing.NumericSeriesIndexing.time_ix_slice(<class 'pandas.core.indexes.numeric.Int64Index'>, 'nonunique_monotonic_inc')
-       469±200μs          141±0μs     0.30  series_methods.SeriesConstructor.time_constructor(None)
-         625±0μs          109±6μs     0.17  indexing.NumericSeriesIndexing.time_loc_scalar(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')

What remains is noise IMO, with the likely exception of frame_ctor.FromRecords.time_frame_from_records_generator(None). I've had a look at DataFrame.from_records and can't tell at first glance where it would be calling unique at all.

The reason I'm quite confident it's noise is that I've done a couple of different runs (specifically re-checking the indexing part as well) in which the above slowdowns did not show up, e.g.:
asv continuous -f 1.1 upstream/master HEAD -b "^(re)?index"

       before           after         ratio
     [360e7271]       [19c7c1f8]
     <master>         <unique_inverse_cython>
+      3.12±0.2ms         15.6±2ms     5.00  index_object.Indexing.time_get_loc_non_unique_sorted('Float')
+        20.3±1μs         93.8±0μs     4.62  indexing.NonNumericSeriesIndexing.time_getitem_scalar('datetime', 'nonunique_monotonic_inc')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

@h-vetinari (Contributor, Author) commented Oct 31, 2018

@jbrockmendel

I'll take a look at this.

Thanks! :)

Does it make sense to implement asvs for this?

Absolutely, haven't done so yet.

Can any of the tempita-using code be implemented with fused types (which I for one find more readable)?

I totally agree, but I gave up after some preliminary investigation (and I may not be fully aware of all the capabilities of fused types). There are a bunch of classes:

  • {{dtype}}VectorData is already declared fused (after some initial templating)
  • {{dtype}}Vector templates on {{for name, dtype, arg, idtype in dtypes}} and uses some {{if dtype == 'int64'}} conditions
  • {{name}}HashTable templates on {{for name, dtype, float_group, default_na_value in dtypes}} and uses the khash hashtable code, whose methods are explicitly typed, e.g. k = kh_put_{{dtype}}(self.table, key, &ret).

Not sure how flexible the fused-type machinery is (e.g. would one have to use tuples or structs to capture all the fields being templated over?), but it seems to me that in any case one would probably need to fuse-type khash first.
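For context, a heavily condensed illustration of the dtype templating in question (not the actual pxi.in code): every khash call is a separate, dtype-suffixed C function, so a fused-types version would first need a generic khash layer to dispatch to.

# illustrative tempita fragment only; the real hashtable_class_helper.pxi.in
# templates over more fields (float_group, default_na_value, ...)
{{for name, dtype in [('Int64', 'int64'), ('UInt64', 'uint64'), ('Float64', 'float64')]}}

cdef class {{name}}HashTable(HashTable):

    cpdef get_item(self, {{dtype}}_t val):
        cdef khiter_t k
        k = kh_get_{{dtype}}(self.table, val)      # explicitly typed khash call
        if k != self.table.n_buckets:
            return self.table.vals[k]
        raise KeyError(val)

{{endfor}}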

@jreback (Contributor) left a comment

I would be more amenable to this if it were templated on the type. you are duplicating lots of code for the object / non-object cases. Further you haven't proven your performance claims here. You are adding lots of paths to reduce signature and such, but it is just making it more complicated. We want less complicated code, sure performance matters, and some hotspot optimizations are worth it, but unless they can be fully templatized / use fused types they just add complexity.

Once code is added to pandas and it is complex, it is rarely maintained. This is why we bend over backwards to have code simplicity.

@h-vetinari (Contributor, Author)

@jreback
Thanks for the response.
TLDR: I think it's better to have one core function definition rather than 3 slightly different ones for factorize / unique (with and without inverse).

I would be more amenable to this if it were templated on the type. you are duplicating lots of code for the object / non-object cases.

This is already templated on the dtype. The code for strings/object is already different (and uses different constructions, i.e. no const or nogil)

Further you haven't proven your performance claims here.

How can I prove it other than by ASVs and a reproducible(!) timing script for just the most sensitive unique branch (see OP)?

You are adding lots of paths to reduce signature and such, but it is just making it more complicated.

The separate functions for unique and factorize already exist; I just moved them above the templated def {{func_name}} to make it easier to read.

We want less complicated code, sure performance matters, and some hotspot optimizations are worth it, but unless they can be fully templatized / use fused types they just add complexity.

Having return_inverse for unique without templating currently carries about a 10% perf penalty. If that is seen as alright, then we're back to the situation of #22986, where you said "perf will dictate this". The dtype templating via tempita is also the current state of the code; using fused types would be a substantial undertaking (see my response to @jbrockmendel above) because of the explicitly typed khash methods.

Once code is added to pandas and it is complex, it is rarely maintained. This is why we bend over backwards to have code simplicity.

Fair enough - I'd love for this to work without templating (not templating on return_inverse may be an acceptable perf/maintainability tradeoff; the dtypes will be harder to get rid of).

In any case, I think one function (albeit templated) is easier to maintain than having slightly different versions for each of unique / factorize / get_labels (the latter two were already unified in #22986), and thin wrappers (3 LOC) that only slightly adapt the signature are IMO very easy to follow.
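For reference, the kind of thin wrapper meant here (a simplified sketch modeled on the _unique call shown further down in the review; in the real code the vector type is templated per dtype):

def factorize(self, ndarray values, Py_ssize_t na_sentinel=-1, object na_value=None):
    # thin wrapper: the templated _unique does all the work, this only
    # fixes the kwargs and passes through the (uniques, labels) result
    uniques = Int64Vector()          # dtype-specific vector in the real template
    return self._unique(values, uniques, na_sentinel=na_sentinel,
                        na_value=na_value, ignore_na=True, return_inverse=True)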

@jreback (Contributor) commented Oct 31, 2018

In any case, I think one function (albeit templated) is easier to maintain than having slightly different versions for each of unique / factorize / get_labels (the latter two were already unified in #22986), and thin wrappers (3 LOC) that only slightly adapt the signature are IMO very easy to follow.

I agree. I am willing to have 10% penalty for this, though I think this is basically what we have now?

@h-vetinari (Contributor, Author)

I agree. I am willing to have 10% penalty for this, though I think this is basically what we have now?

No, not yet. Adding a return_inverse kwarg to unique would add this penalty on top of the current situation.

@h-vetinari (Contributor, Author)

@jreback
The latest version is definitely easier on the eyes. I'll try another ASV run to see the impact.

@h-vetinari (Contributor, Author)

Here's the ASV output of the latest run:

       before           after         ratio
     [9019582c]       [c7327fd0]
     <master>         <unique_inverse_cython>
!        9.38±0μs           failed      n/a  categoricals.CategoricalSlicing.time_getitem_list_like('monotonic_decr')
!         219±8ms           failed      n/a  frame_methods.SortValues.time_frame_sort_values(True)
!             n/a           failed      n/a  groupby.GroupByMethods.time_dtype_as_field('datetime', 'cummax', 'transformation')
!         367±8ms           failed      n/a  indexing.NonNumericSeriesIndexing.time_getitem_scalar('string', 'nonunique_monotonic_inc')
!      12.5±0.6ms           failed      n/a  io.csv.ReadUint64Integers.time_read_uint64_na_values
!        3.12±0ms           failed      n/a  join_merge.Merge.time_merge_dataframe_integer_key(True)
!        6.25±0ms           failed      n/a  reindex.DropDuplicates.time_frame_drop_dups_na(True)
!        39.1±3ms           failed      n/a  rolling.VariableWindowMethods.time_rolling('DataFrame', '50s', 'float', 'std')
!      4.69±0.6ms           failed      n/a  stat_ops.SeriesMultiIndexOps.time_op(1, 'sum')
!        391±80ns           failed      n/a  timestamp.TimestampProperties.time_week(<DstTzInfo 'Europe/Amsterdam' LMT+0:20:00 STD>, None)
+         203±0μs          938±0μs     4.62  indexing.NonNumericSeriesIndexing.time_getitem_label_slice('datetime', 'nonunique_monotonic_inc')
+      3.12±0.2ms         12.5±2ms     4.00  index_object.Indexing.time_get_loc_non_unique('Float')
+         141±6μs          156±0μs     1.11  groupby.GroupByMethods.time_dtype_as_field('datetime', 'last', 'transformation')
+         141±0μs          156±0μs     1.11  series_methods.SeriesConstructor.time_constructor(None)
+        14.1±0ms         15.6±0ms     1.11  timeseries.AsOf.time_asof_nan('DataFrame')
+         531±8ms          578±8ms     1.09  frame_methods.Nunique.time_frame_nunique
-        15.6±0ms         14.1±0ms     0.90  multiindex_object.Sortlevel.time_sortlevel_one
-         156±0μs          141±0μs     0.90  indexing.NumericSeriesIndexing.time_iloc_array(<class 'pandas.core.indexes.numeric.Int64Index'>, 'nonunique_monotonic_inc')
-        938±80μs         625±80μs     0.67  inference.NumericInferOps.time_subtract(<class 'numpy.int16'>)
-         938±0μs         625±60μs     0.67  period.PeriodProperties.time_property('min', 'end_time')
-        62.5±0μs       20.3±0.6μs     0.32  indexing.NumericSeriesIndexing.time_getitem_scalar(<class 'pandas.core.indexes.numeric.Int64Index'>, 'nonunique_monotonic_inc')
-        12.5±2ms       3.20±0.2ms     0.26  index_object.Indexing.time_get_loc_non_unique_sorted('Float')
-        7.81±0ms         1.88±0ms     0.24  reindex.DropDuplicates.time_frame_drop_dups_int(True)
-      5.47±0.8ms         1.25±0ms     0.23  indexing.NumericSeriesIndexing.time_ix_array(<class 'pandas.core.indexes.numeric.Int64Index'>, 'unique_monotonic_inc')
-        93.8±0μs         20.3±0μs     0.22  indexing.NonNumericSeriesIndexing.time_getitem_scalar('datetime', 'nonunique_monotonic_inc')
-      9.38±0.8ms         1.25±0ms     0.13  frame_ctor.FromRecords.time_frame_from_records_generator(None)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

@h-vetinari (Contributor, Author)

Doesn't look too bad, IMO (ASV failures are pretty arbitrary when they happen; the test suite ran fine). Another look, @jreback @jbrockmendel, please? :)

@h-vetinari (Contributor, Author)

Reverting 30de418 due to the performance degradation; it was just for demo purposes.

@jreback (Contributor) left a comment

ok i am onboard with the implementation. it is straightforward. change the order of the returns in unique to match factorize. you may not like it but that's how it is. having different return orderings is not going to work.

Two resolved review threads on pandas/_libs/hashtable_class_helper.pxi.in
@jorisvandenbossche (Member)

flip the order of the return here, make this like factorize, otherwise this is very confusing.

@jreback if you are asking for that only for the internal hashtable method, I don't mind that much (@h-vetinari so you can do it here as Jeff asks; for the internal code it doesn't matter that much).
But IMO we certainly don't want to do that for the user-facing pd.unique (we could flip the order there as well, although I don't think that necessarily makes the code clearer). For pd.unique, I think we want to follow the return order of np.unique.

@h-vetinari (Contributor, Author) commented Nov 20, 2018

@jorisvandenbossche @jreback

For pd.unique, I think we want to follow the return order of np.unique.

That's my point exactly, and even more so: adding the inverse as the first component of the tuple based on a kwarg is bound to lead to lots and lots of confusion. The only sane option IMO is to do what np.unique does and append the extra output.

I think we have to live with this discrepancy, and then it just becomes a question of where the signature break sits in the code. I'd argue it should be at the lowest level possible (i.e. between Hashtable._unique and Hashtable.factorize), so that the vast majority of users/developers need not be concerned with it at all.
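For reference, the np.unique convention being argued for: optional outputs are appended to the return value based on the kwargs, and the inverse reconstructs the input.

import numpy as np

values = np.array([3, 1, 3, 2, 1])
uniques = np.unique(values)                                # just the uniques
uniques, inverse = np.unique(values, return_inverse=True)  # inverse appended at the end
assert (uniques[inverse] == values).all()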

@h-vetinari (Contributor, Author)

@jbrockmendel

@h-vetinari will do, thanks for your patience

Care to take another look? :)


Returns
-------
labels : ndarray[int64]
Contributor:

ok i agree now, flip the order here and just change it where it is called (e.g. in the python code)

Contributor Author:

@jreback
Just so I understand, you'd like to have the signature break between Hashtable.factorize (returning uniques, labels) and pandas.factorize (returning labels, uniques)? Then there'd still be a couple of places in the pandas code where this would have to be explained.

If the break were only between Hashtable._unique and Hashtable.factorize, then the break would be contained to this module, and would not be visible to anyone who's not working on the cython code. I think that would be the cleanest solution.

Contributor:

Then there'd still be a couple of places in the pandas code where this would have to be explained.

I wouldn't explain it, rather I would simply change it. The idea would be to change it everywhere internally, the only place this would not be the case would be the actual pd.factorize() where we can never change this, but that is user facing and not a big deal.


Returns
-------
labels : ndarray[int64]
Contributor:

same here

uniques, labels = self._unique(values, uniques_vector,
                               na_sentinel=na_sentinel,
                               na_value=na_value, ignore_na=True,
                               return_inverse=True)
Contributor:

see above, just the cython signature, we are not going to (nor need to) change the user facing signature

        return np.asarray(labels)
    if return_inverse:
        return uniques.to_array(), np.asarray(labels)
    return uniques.to_array()
Member:

Is the non-constant signature here (and above) avoidable?

Member:

OK, now that I read the conversation I see this has been addressed several times. The only change in the python code is in the tests; what part of the non-cython code is affected by this decision?

Contributor Author:

@jbrockmendel
The non-constant signature follows np.unique and will eventually extend to pd.unique, Series.unique, etc. This PR prepares the cython backend to have the necessary capability. Among other things, this will also allow solving #21720, and works towards #21357 / #22824.

@h-vetinari (Contributor, Author) left a comment

@jreback
Thanks for the review. Please have a look at my argument for containing the signature change to this module. If you disagree, I'll change it as requested.

@jbrockmendel
Thanks for taking a look, I responded in the comment


@h-vetinari (Contributor, Author)

@jbrockmendel

@h-vetinari a couple of preliminary, totally uninformed questions. Does it make sense to implement asvs for this?

I remembered this question, and checked the ASV benchmarks for existing calls to unique. There's nothing in algorithms.py, and actually, not many occurrences at all:

(pandas-dev) C:\[...]\pddev>findstr /r /S /N "\.unique\(" asv_bench/benchmarks/*.py
asv_bench/benchmarks/period.py:116:        self.index.unique()
asv_bench/benchmarks/timeseries.py:41:        self.index.unique()

Seems one should add some ASVs to asv_bench/benchmarks/algorithms.py, but I'd suggest leaving that for a follow-up.
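Such a follow-up could look roughly like this (a hypothetical sketch for asv_bench/benchmarks/algorithms.py, not part of this PR):

import numpy as np
import pandas as pd


class Unique:
    # benchmark Series.unique / pd.unique for a few dtypes with duplicates
    params = ['int64', 'float64', 'object']
    param_names = ['dtype']

    def setup(self, dtype):
        N = 10 ** 5
        data = np.random.randint(0, N // 10, size=N)
        self.series = pd.Series(data).astype(dtype)

    def time_series_unique(self, dtype):
        self.series.unique()

    def time_pd_unique(self, dtype):
        pd.unique(self.series.values)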

@h-vetinari (Contributor, Author)

The failure is a flaky hypothesis test.



@h-vetinari (Contributor, Author)

@jreback
Turns out hashtable.factorize was only used in algorithms.py; the other calls go either to _factorize_array or to libhashtable.Factorizer.factorize (or of course to the user-facing algorithms.factorize), none of which are affected.

@h-vetinari (Contributor, Author)

@jreback All green.

@h-vetinari (Contributor, Author)

@jreback
Ping

@jreback added this to the 0.24.0 milestone on Nov 29, 2018
@jreback merged commit 0365828 into pandas-dev:master on Nov 29, 2018
@jreback (Contributor) commented Nov 29, 2018

thanks @h-vetinari

@h-vetinari (Contributor, Author)

Cool, thanks.
