ENH: Add return_inverse to cython-unique; unify unique/factorize-code #23400

Merged
merged 28 commits into pandas-dev:master on Nov 29, 2018

Conversation

@h-vetinari (Contributor) commented Oct 28, 2018

This is a continuation/split-off from #22986, where I tried to deduplicate the cython code for unique/factorize and add a return_inverse kwarg to unique at the same time. That didn't fully work because it came with a performance impact that was deemed unacceptable. I've opened cython/cython#2660 to figure out why (or, alternatively, to get a directive that allows force-compiling specializations for different parameter values, which would solve this more elegantly), and @robertwb told me the culprit is likely the use of kwargs, but also suggested using explicit templating on return_inverse.

First I tried unification without any kwargs in 858f54e - this still had the same perf impact as in #22986.
Then I added the explicit templating in 4ed354a, which successfully avoids the perf impact but is a bit uglier in the pxi.in. Finally, I re-added the kwargs in 906cd50 to keep the diff here smaller; this has no measurable impact.
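For illustration, explicit templating here means letting tempita emit a fully specialized function per value of return_inverse, so the branch is resolved at code-generation time rather than inside the hot loop. A minimal, hypothetical fragment (not the actual pxi.in contents; names are made up):

{{for return_inverse in [True, False]}}

def unique{{'_with_inverse' if return_inverse else ''}}(self, ndarray values):
    # ... hashtable fill loop over values ...
    {{if return_inverse}}
    # only emitted into the specialized variant: record, for each input
    # position, the index of its value among the uniques
    labels[i] = idx
    {{endif}}

{{endfor}}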

I've had many unsuccessful attempts at running the ASVs for all those combinations over the weekend, so I'm falling back to timing the unique code explicitly (code at the end). The overview is as follows (for completeness I've added #22986 and the commit immediately preceding it; all results are in milliseconds):

>>> df
                  e5196aaac6 99e640186e caea25a8b9 858f54e5e4 4ed354a954  906cd50e83 0d6dad0ca1
                    #22986~1     #22986     master  no kwargs  templated tmpl+kwargs  PR+master
StringIndex        16.220276  16.023982  15.967190  15.706223  15.624902   14.952832  15.480224
CategoricalIndex    1.274368   1.289882   1.231592   1.400420   1.247882    1.250814   1.327520
IntIndex            2.743178   2.778960   2.722282   3.115781   2.818848    2.802402   2.877664
UIntIndex           2.760333   2.773203   2.714546   3.037279   2.786909    2.800675   2.914427
RangeIndex          2.765113   2.773916   2.804379   3.071501   2.801661    2.803809   2.816012
FloatIndex          4.734558   4.576122   4.575152   5.043915   4.518481    4.548134   4.732967
TimedeltaIndex      4.361350   4.238442   4.187650   4.601932   4.369402    4.409724   4.422439
StringSeries       59.779114  58.508761  57.742805  58.343838  58.266719   57.043169  58.619018
CategoricalSeries   2.708816   2.666240   2.627852   2.895955   2.664937    2.658149   2.715396
IntSeries           5.078922   4.991856   4.928836   5.480455   5.108894    5.096692   5.434841
UIntSeries          5.127376   4.984227   5.027707   5.445736   5.155703    5.080125   5.211029
RangeSeries         5.218271   5.013379   4.973759   5.531086   5.193241    5.091924   5.197651
FloatSeries         7.126269   6.959626   6.988052   7.738704   6.977395    6.978753   7.215513
TimedeltaSeries     5.146969   4.985955   5.024334   5.551364   5.153091    5.108740   5.158593

or, relative to the master commit this PR is based on:

>>> df.divide(df.iloc[:, 2], axis=0)
                  e5196aaac6 99e640186e caea25a8b9 858f54e5e4 4ed354a954  906cd50e83 0d6dad0ca1
                    #22986~1     #22986     master  no kwargs  templated tmpl+kwargs  PR+master
StringIndex         1.015850   1.003557        1.0   0.983656   0.978563    0.936472   0.969502
CategoricalIndex    1.034733   1.047329        1.0   1.137081   1.013227    1.015607   1.077889
IntIndex            1.007676   1.020820        1.0   1.144547   1.035472    1.029431   1.057078
UIntIndex           1.016868   1.021608        1.0   1.118890   1.026658    1.031729   1.073633
RangeIndex          0.985999   0.989138        1.0   1.095252   0.999031    0.999797   1.004148
FloatIndex          1.034842   1.000212        1.0   1.102458   0.987613    0.994095   1.034494
TimedeltaIndex      1.041479   1.012129        1.0   1.098930   1.043402    1.053031   1.056067
StringSeries        1.035265   1.013265        1.0   1.010409   1.009073    0.987884   1.015174
CategoricalSeries   1.030810   1.014608        1.0   1.102024   1.014112    1.011529   1.033314
IntSeries           1.030450   1.012786        1.0   1.111917   1.036531    1.034056   1.102662
UIntSeries          1.019824   0.991352        1.0   1.083145   1.025458    1.010426   1.036462
RangeSeries         1.049160   1.007966        1.0   1.112053   1.044128    1.023758   1.045015
FloatSeries         1.019779   0.995932        1.0   1.107419   0.998475    0.998669   1.032550
TimedeltaSeries     1.024408   0.992361        1.0   1.104895   1.025627    1.016799   1.026722

So, long story short, this PR prepares the hashtable backend to support return_inverse=True, which plays into #21357 #21645 #22824, and will also make it easy to solve #21720.
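To make the intended semantics concrete, here is a rough usage sketch at the hashtable level (treat the exact method/kwarg spelling as an assumption; the point is that the inverse comes as an additional, appended output that reconstructs the input):

import numpy as np
from pandas._libs import hashtable as ht

values = np.array([3, 1, 3, 2, 1], dtype=np.int64)
table = ht.Int64HashTable()

uniques = table.unique(values)                                # unchanged default
uniques, inverse = table.unique(values, return_inverse=True)  # new capability
assert (uniques[inverse] == values).all()                     # inverse reconstructs the input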

Code for the above timings:

import numpy as np
import pandas as pd
import pandas.util.testing as tm
import timeit

# identify the commit being benchmarked; results are written to a per-commit CSV
commit = pd.__git_version__[:10]

k = 10 ** 5    # number of elements per index
rep = 10       # timeit repetitions (the first one is discarded as warm-up)
number = 100   # calls per repetition

tags = ['String', 'Categorical', 'Int', 'UInt',
        'Range', 'Float', 'Timedelta']
df = pd.DataFrame(index=[x + 'Index' for x in tags], columns=['mean', 'std'])

np.random.seed(55555)
with tm.RNGContext(55555):
    for tag in tags:
        # time Index.unique on a freshly generated index of the given type
        idx = getattr(tm, f'make{tag}Index')(k=k)
        t = timeit.repeat('idx.unique()', setup='from __main__ import idx',
                          repeat=rep + 1, number=number)[1:]
        df.loc[tag + 'Index', 'mean'] = pd.Series(t).mean() / number
        df.loc[tag + 'Index', 'std'] = pd.Series(t).std() / number

        # time Series.unique on a 2x oversampled (hence non-unique) version
        s = pd.Series(idx).sample(frac=2, replace=True)
        t = timeit.repeat('s.unique()', setup='from __main__ import s',
                          repeat=rep + 1, number=number)[1:]
        df.loc[tag + 'Series', 'mean'] = pd.Series(t).mean() / number
        df.loc[tag + 'Series', 'std'] = pd.Series(t).std() / number

df.to_csv(f'test_{commit}.csv')
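The per-commit CSVs produced by this script were then combined into the comparison tables at the top; one way to do that (a sketch, using the commit hashes from the table above):

import pandas as pd

hashes = ['e5196aaac6', '99e640186e', 'caea25a8b9', '858f54e5e4',
          '4ed354a954', '906cd50e83', '0d6dad0ca1']
# one column of mean timings per commit, converted from seconds to milliseconds
df = pd.concat({h: pd.read_csv(f'test_{h}.csv', index_col=0)['mean']
                for h in hashes}, axis=1) * 1000
# normalized against the master commit this PR is based on (third column)
rel = df.divide(df.iloc[:, 2], axis=0)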

@pep8speaks

Hello @h-vetinari! Thanks for submitting the PR.

@h-vetinari changed the title from "Unique inverse cython" to "ENH: Add return_inverse to cython-unique; unify unique/factorize-code" on Oct 28, 2018
@codecov (bot) commented Oct 28, 2018

Codecov Report

Merging #23400 into master will not change coverage.
The diff coverage is 100%.


@@           Coverage Diff           @@
##           master   #23400   +/-   ##
=======================================
  Coverage   92.31%   92.31%           
=======================================
  Files         161      161           
  Lines       51483    51483           
=======================================
  Hits        47525    47525           
  Misses       3958     3958
Flag Coverage Δ
#multiple 90.7% <100%> (ø) ⬆️
#single 42.43% <100%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/algorithms.py 95.11% <100%> (ø) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@gfyoung added labels Internals, Clean on Oct 29, 2018
@gfyoung added labels Enhancement, Algos on Oct 29, 2018
@jbrockmendel (Member)

I'll take a look at this.

@h-vetinari a couple of preliminary, totally uninformed questions. Does it make sense to implement asvs for this? Can any of the tempita-using code be implemented with fused types (which I for one find more readable)?

@h-vetinari (Contributor, Author) commented Oct 31, 2018

Finally got an ASV run to complete without being too noisy (see #23412 for an earlier attempt).

       before           after         ratio
     [7191af9b]       [19c7c1f8]
     <master>         <unique_inverse_cython>
+      10.2±0.8ms         109±10ms    10.77  frame_ctor.FromRecords.time_frame_from_records_generator(None)
+      1.41±0.2ms       3.12±0.8ms     2.22  index_object.Indexing.time_get_loc_non_unique_sorted('Int')
+        547±80μs         1.09±0ms     2.00  indexing_engines.NumericEngineIndexing.time_get_loc((<class 'pandas._libs.index.UInt8Engine'>, <class 'numpy.uint8'>), 'non_monotonic')
+      12.5±0.8ms         15.6±0ms     1.25  groupby.Transform.time_transform_multi_key1
+        12.5±0ms         15.6±0ms     1.25  replace.FillNa.time_replace(False)
+        14.1±0ms         15.6±0ms     1.11  io.sql.ReadSQLTableDtypes.time_read_sql_table_column('float')
-        15.6±0ms       14.1±0.6ms     0.90  groupby.Float32.time_sum
-         156±0μs          141±0μs     0.90  groupby.GroupByMethods.time_dtype_as_field('float', 'shift', 'direct')
-        156±60μs          141±0μs     0.90  indexing.NumericSeriesIndexing.time_ix_slice(<class 'pandas.core.indexes.numeric.Int64Index'>, 'unique_monotonic_inc')
-        1.56±0ms         1.41±0ms     0.90  indexing_engines.NumericEngineIndexing.time_get_loc((<class 'pandas._libs.index.Int8Engine'>, <class 'numpy.int8'>), 'monotonic_incr')
-        1.56±0ms         1.41±0ms     0.90  replace.FillNa.time_fillna(True)
-        15.6±0ms         12.5±0ms     0.80  binary_ops.AddOverflowScalar.time_add_overflow_scalar(-1)
-        15.6±0ms         12.5±0ms     0.80  timeseries.AsOf.time_asof('DataFrame')
-         625±0μs          188±0μs     0.30  indexing.NumericSeriesIndexing.time_ix_slice(<class 'pandas.core.indexes.numeric.Int64Index'>, 'nonunique_monotonic_inc')
-       469±200μs          141±0μs     0.30  series_methods.SeriesConstructor.time_constructor(None)
-         625±0μs          109±6μs     0.17  indexing.NumericSeriesIndexing.time_loc_scalar(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')

What remains is noise IMO, with the likely exception of frame_ctor.FromRecords.time_frame_from_records_generator(None). I've had a look at DataFrame.from_records and can't tell at first glance where it would be calling unique at all.

The reason I'm quite confident it's noise is that I've done a couple of different runs (specifically re-checking the indexing part as well) in which the above slowdowns did not show up, e.g.:
asv continuous -f 1.1 upstream/master HEAD -b "^(re)?index"

       before           after         ratio
     [360e7271]       [19c7c1f8]
     <master>         <unique_inverse_cython>
+      3.12±0.2ms         15.6±2ms     5.00  index_object.Indexing.time_get_loc_non_unique_sorted('Float')
+        20.3±1μs         93.8±0μs     4.62  indexing.NonNumericSeriesIndexing.time_getitem_scalar('datetime', 'nonunique_monotonic_inc')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

@h-vetinari (Contributor, Author) commented Oct 31, 2018

@jbrockmendel

I'll take a look at this.

Thanks! :)

Does it make sense to implement asvs for this?

Absolutely, haven't done so yet.

Can any of the tempita-using code be implemented with fused types (which I for one find more readable)?

I totally agree, but I gave up after some preliminary investigation (and I may not be fully aware of all the capabilities of fused types). There are a bunch of classes:

  • {{dtype}}VectorData is already declared fused (after some initial templating)
  • {{dtype}}Vector templates on {{for name, dtype, arg, idtype in dtypes}} and uses some {{if dtype == 'int64'}} conditions
  • {{name}}HashTable templates on {{for name, dtype, float_group, default_na_value in dtypes}} and uses the khash hashtable code, whose methods are explicitly typed, e.g. k = kh_put_{{dtype}}(self.table, key, &ret).

Not sure how flexible the fused-type machinery is (e.g. would one have to use tuples or structs to capture all the fields being templated over?), but it seems to me that in any case one would probably need to fuse-type khash first.
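For context, a heavily condensed illustration of the dtype templating in question (not the actual pxi.in code): every khash call is a separate, dtype-suffixed C function, so a fused-types version would first need a generic khash layer to dispatch to.

# illustrative tempita fragment only; the real hashtable_class_helper.pxi.in
# templates over more fields (float_group, default_na_value, ...)
{{for name, dtype in [('Int64', 'int64'), ('UInt64', 'uint64'), ('Float64', 'float64')]}}

cdef class {{name}}HashTable(HashTable):

    cpdef get_item(self, {{dtype}}_t val):
        cdef khiter_t k
        k = kh_get_{{dtype}}(self.table, val)      # explicitly typed khash call
        if k != self.table.n_buckets:
            return self.table.vals[k]
        raise KeyError(val)

{{endfor}}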

@jreback (Contributor) left a comment

I would be more amenable to this if it were templated on the type. you are duplicating lots of code for the object / non-object cases. Further you haven't proven your performance claims here. You are adding lots of paths to reduce signature and such, but it is just making it more complicated. We want less complicated code, sure performance matters, and some hotspot optimizations are worth it, but unless they can be fully templatized / use fused types they just add complexity.

Once code is added to pandas and it is complex, it is rarely maintained. This is why we bend over backwards to have code simplicity.

@h-vetinari (Contributor, Author)

@jreback
Thanks for the response.
TLDR: I think it's better to have one core function definition rather than 3 slightly different ones for factorize / unique (with and without inverse).

I would be more amenable to this if it were templated on the type. you are duplicating lots of code for the object / non-object cases.

This is already templated on the dtype. The code for strings/object is already different (and uses different constructions, i.e. no const or nogil)

Further you haven't proven your performance claims here.

How can I prove it other than by ASVs and a reproducible(!) timing script for just the most sensitive unique branch (see OP)?

You are adding lots of paths to reduce signature and such, but it is just making it more complicated.

The separate functions for unique and factorize already exist; I just moved them above the templated def {{func_name}} to make it easier to read.

We want less complicated code, sure performance matters, and some hotspot optimizations are worth it, but unless they can be fully templatized / use fused types they just add complexity.

Having return_inverse for unique without templating currently carries about a 10% perf penalty. If that is seen as alright, then we're back to the situation of #22986, where you said "perf will dictate this". The dtype templating via tempita is also the current state of the code; using fused types would be a substantial undertaking (see my response to @jbrockmendel above) because of the explicitly typed khash methods.

Once code is added to pandas and it is complex, it is rarely maintained. This is why we bend over backwards to have code simplicity.

Fair enough - I'd love for this to work without templating (not templating on return_inverse may be an acceptable perf/maintainability tradeoff; the dtypes will be harder to get rid of).

In any case, I think one function (albeit templated) is easier to maintain than having slightly different versions for each of unique / factorize / get_labels (the latter two were already unified in #22986), and thin wrappers (3 LOC) that only slightly adapt the signature are IMO very easy to follow.
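For reference, the kind of thin wrapper meant here (a simplified sketch modeled on the _unique call shown further down in the review; in the real code the vector type is templated per dtype):

def factorize(self, ndarray values, Py_ssize_t na_sentinel=-1, object na_value=None):
    # thin wrapper: the templated _unique does all the work, this only
    # fixes the kwargs and passes through the (uniques, labels) result
    uniques = Int64Vector()          # dtype-specific vector in the real template
    return self._unique(values, uniques, na_sentinel=na_sentinel,
                        na_value=na_value, ignore_na=True, return_inverse=True)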

@jreback (Contributor) commented Oct 31, 2018

In any case, I think one function (albeit templated) is easier to maintain than having slightly different versions for each of unique / factorize / get_labels (the latter two were already unified in #22986), and thin wrappers (3 LOC) that only slightly adapt the signature are IMO very easy to follow.

I agree. I am willing to have 10% penalty for this, though I think this is basically what we have now?

@h-vetinari (Contributor, Author)

I agree. I am willing to have 10% penalty for this, though I think this is basically what we have now?

No, not yet. Adding a return_inverse kwarg to unique would add this penalty on top of the current situation.

@h-vetinari (Contributor, Author)

@jreback
The latest version is definitely easier on the eyes. I'll try another ASV run to see the impact.

@h-vetinari (Contributor, Author)

Here's the ASV output of the latest run:

       before           after         ratio
     [9019582c]       [c7327fd0]
     <master>         <unique_inverse_cython>
!        9.38±0μs           failed      n/a  categoricals.CategoricalSlicing.time_getitem_list_like('monotonic_decr')
!         219±8ms           failed      n/a  frame_methods.SortValues.time_frame_sort_values(True)
!             n/a           failed      n/a  groupby.GroupByMethods.time_dtype_as_field('datetime', 'cummax', 'transformation')
!         367±8ms           failed      n/a  indexing.NonNumericSeriesIndexing.time_getitem_scalar('string', 'nonunique_monotonic_inc')
!      12.5±0.6ms           failed      n/a  io.csv.ReadUint64Integers.time_read_uint64_na_values
!        3.12±0ms           failed      n/a  join_merge.Merge.time_merge_dataframe_integer_key(True)
!        6.25±0ms           failed      n/a  reindex.DropDuplicates.time_frame_drop_dups_na(True)
!        39.1±3ms           failed      n/a  rolling.VariableWindowMethods.time_rolling('DataFrame', '50s', 'float', 'std')
!      4.69±0.6ms           failed      n/a  stat_ops.SeriesMultiIndexOps.time_op(1, 'sum')
!        391±80ns           failed      n/a  timestamp.TimestampProperties.time_week(<DstTzInfo 'Europe/Amsterdam' LMT+0:20:00 STD>, None)
+         203±0μs          938±0μs     4.62  indexing.NonNumericSeriesIndexing.time_getitem_label_slice('datetime', 'nonunique_monotonic_inc')
+      3.12±0.2ms         12.5±2ms     4.00  index_object.Indexing.time_get_loc_non_unique('Float')
+         141±6μs          156±0μs     1.11  groupby.GroupByMethods.time_dtype_as_field('datetime', 'last', 'transformation')
+         141±0μs          156±0μs     1.11  series_methods.SeriesConstructor.time_constructor(None)
+        14.1±0ms         15.6±0ms     1.11  timeseries.AsOf.time_asof_nan('DataFrame')
+         531±8ms          578±8ms     1.09  frame_methods.Nunique.time_frame_nunique
-        15.6±0ms         14.1±0ms     0.90  multiindex_object.Sortlevel.time_sortlevel_one
-         156±0μs          141±0μs     0.90  indexing.NumericSeriesIndexing.time_iloc_array(<class 'pandas.core.indexes.numeric.Int64Index'>, 'nonunique_monotonic_inc')
-        938±80μs         625±80μs     0.67  inference.NumericInferOps.time_subtract(<class 'numpy.int16'>)
-         938±0μs         625±60μs     0.67  period.PeriodProperties.time_property('min', 'end_time')
-        62.5±0μs       20.3±0.6μs     0.32  indexing.NumericSeriesIndexing.time_getitem_scalar(<class 'pandas.core.indexes.numeric.Int64Index'>, 'nonunique_monotonic_inc')
-        12.5±2ms       3.20±0.2ms     0.26  index_object.Indexing.time_get_loc_non_unique_sorted('Float')
-        7.81±0ms         1.88±0ms     0.24  reindex.DropDuplicates.time_frame_drop_dups_int(True)
-      5.47±0.8ms         1.25±0ms     0.23  indexing.NumericSeriesIndexing.time_ix_array(<class 'pandas.core.indexes.numeric.Int64Index'>, 'unique_monotonic_inc')
-        93.8±0μs         20.3±0μs     0.22  indexing.NonNumericSeriesIndexing.time_getitem_scalar('datetime', 'nonunique_monotonic_inc')
-      9.38±0.8ms         1.25±0ms     0.13  frame_ctor.FromRecords.time_frame_from_records_generator(None)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

@h-vetinari (Contributor, Author)

Doesn't look too bad, IMO (ASV failures are pretty arbitrary when they happen; the test suite ran fine). Another look, @jreback @jbrockmendel, please? :)

@h-vetinari (Contributor, Author)

Reverting 30de418 due to the performance degradation; it was just for demo purposes.

@jreback (Contributor) left a comment

ok i am onboard with the implementation. it is straightforward. change the order of the returns in unique to match factorize. you may not like it but that's how it is. having different return orderings is not going to work.

Two resolved review threads on pandas/_libs/hashtable_class_helper.pxi.in
@jorisvandenbossche (Member)

flip the order of the return here, make this like factorize, otherwise this is very confusing.

@jreback if you are asking for that only for the internal hashtable method, I don't mind that much (@h-vetinari so you can do it here as Jeff asks; for the internal code it doesn't matter that much).
But IMO we certainly don't want to do that for the user-facing pd.unique (we could flip the order there as well, although I don't think that necessarily makes the code clearer). For pd.unique, I think we want to follow the return order of np.unique.

@h-vetinari (Contributor, Author) commented Nov 20, 2018

@jorisvandenbossche @jreback

For pd.unique, I think we want to follow the return order of np.unique.

That's my point exactly, and even more so: adding the inverse as the first component of the tuple based on a kwarg is bound to lead to lots and lots of confusion. The only sane option IMO is to do what np.unique does and append the extra output.

I think we have to live with this discrepancy, and then it just becomes a question of where the signature break sits in the code. I'd argue it should be at the lowest level possible (i.e. between Hashtable._unique and Hashtable.factorize), so that the vast majority of users/developers need not be concerned with it at all.
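For reference, the np.unique convention being argued for: optional outputs are appended to the return value based on the kwargs, and the inverse reconstructs the input.

import numpy as np

values = np.array([3, 1, 3, 2, 1])
uniques = np.unique(values)                                # just the uniques
uniques, inverse = np.unique(values, return_inverse=True)  # inverse appended at the end
assert (uniques[inverse] == values).all()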

@h-vetinari (Contributor, Author)

@jbrockmendel

@h-vetinari will do, thanks for your patience

Care to take another look? :)


Returns
-------
labels : ndarray[int64]
Contributor:

ok i agree now, flip the order here and just change it where it is called (e.g. in the python code)

Contributor Author:

@jreback
Just so I understand, you'd like to have the signature break between Hashtable.factorize (returning uniques, labels) and pandas.factorize (returning labels, uniques)? Then there'd still be a couple of places in the pandas code where this would have to be explained.

If the break were only between Hashtable._unique and Hashtable.factorize, then the break would be contained to this module, and would not be visible to anyone who's not working on the cython code. I think that would be the cleanest solution.

Contributor:

Then there'd still be a couple of places in the pandas code where this would have to be explained.

I wouldn't explain it, rather I would simply change it. The idea would be to change it everywhere internally, the only place this would not be the case would be the actual pd.factorize() where we can never change this, but that is user facing and not a big deal.


Returns
-------
labels : ndarray[int64]
Contributor:

same here

uniques, labels = self._unique(values, uniques_vector,
                               na_sentinel=na_sentinel,
                               na_value=na_value, ignore_na=True,
                               return_inverse=True)
Contributor:

see above, just the cython signature, we are not going to (nor need to) change the user facing signature

        return np.asarray(labels)
    if return_inverse:
        return uniques.to_array(), np.asarray(labels)
    return uniques.to_array()
Member:

Is the non-constant signature here (and above) avoidable?

Member:

OK, now that I read the conversation I see this has been addressed several times. The only change in the python code is in the tests; what part of the non-cython code is affected by this decision?

Contributor Author:

@jbrockmendel
The non-constant signature follows np.unique and will eventually extend to pd.unique, Series.unique, etc. This PR prepares the cython backend to have the necessary capability. Among other things, this will also allow solving #21720, and works towards #21357 / #22824.

@h-vetinari (Contributor, Author) left a comment

@jreback
Thanks for the review. Please have a look at my argument for containing the signature change to this module. If you disagree, I'll change it as requested.

@jbrockmendel
Thanks for taking a look, I responded in the comment


@h-vetinari (Contributor, Author)

@jbrockmendel

@h-vetinari a couple of preliminary, totally uninformed questions. Does it make sense to implement asvs for this?

I remembered this question, and checked the ASV benchmarks for existing calls to unique. There's nothing in algorithms.py, and actually, not many occurrences at all:

(pandas-dev) C:\[...]\pddev>findstr /r /S /N "\.unique\(" asv_bench/benchmarks/*.py
asv_bench/benchmarks/period.py:116:        self.index.unique()
asv_bench/benchmarks/timeseries.py:41:        self.index.unique()

Seems one should add some ASVs to asv_bench/benchmarks/algorithms.py, but I'd suggest leaving that for a follow-up.
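Such a follow-up could look roughly like this (a hypothetical sketch for asv_bench/benchmarks/algorithms.py, not part of this PR):

import numpy as np
import pandas as pd


class Unique:
    # benchmark Series.unique / pd.unique for a few dtypes with duplicates
    params = ['int64', 'float64', 'object']
    param_names = ['dtype']

    def setup(self, dtype):
        N = 10 ** 5
        data = np.random.randint(0, N // 10, size=N)
        self.series = pd.Series(data).astype(dtype)

    def time_series_unique(self, dtype):
        self.series.unique()

    def time_pd_unique(self, dtype):
        pd.unique(self.series.values)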

@h-vetinari (Contributor, Author)

The failure is a flaky hypothesis test.



@h-vetinari (Contributor, Author)

@jreback
Turns out hashtable.factorize was only used in algorithms.py; the other calls go either to _factorize_array or to libhashtable.Factorizer.factorize (or of course to the user-facing algorithms.factorize), none of which are affected.

@h-vetinari (Contributor, Author)

@jreback All green.

@h-vetinari (Contributor, Author)

@jreback
Ping

@jreback added this to the 0.24.0 milestone on Nov 29, 2018
@jreback merged commit 0365828 into pandas-dev:master on Nov 29, 2018
@jreback (Contributor) commented Nov 29, 2018

thanks @h-vetinari

@h-vetinari (Contributor, Author)

Cool, thanks.
