Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CLN: prepare unifying hashtable.factorize and .unique; add doc-strings #22986

Merged
merged 33 commits into from
Oct 18, 2018

Conversation

h-vetinari
Copy link
Contributor

@h-vetinari h-vetinari commented Oct 3, 2018

For adding a return_inverse-kwarg to unique (#21357 / #21645), I originally didn't want to have to add something to the templated cython code, preferring to add a solution using numpy.unique as a first step (with the idea of improving the performance later).

@jorisvandenbossche then remarked:

Which led me think: pd.unique with a return_inverse argument is actually basically the same as pd.factorize?

I didn't know what factorize was doing, so I had no idea about this connection. From looking at the cython code in hashtable_class_helper.pxi.in, I found that there's substantial overlap between unique and get_labels (the core of factorize) - the only difference is the handling of NaN/None/etc.

I think that these could/should be unified for less code duplication. Alternatively, the first commit of this PR could be split off into a separate PR - it is just adding a return_inverse-kwarg to unique -- though effectively by duplicating the factorize code.

Both variants pass the the test suite locally, the only question is whether there's appreciable differences in the ASVs. Since the only extra complexity of unifying the code is reading the value of some bint within the loop, I don't think that it'll be relevant. I got some noisy data the first time 'round, will let it run again when I have the possibility.

So long story short, this PR prepares the hashtable-backend to support return_inverse=True, which plays into #21357 #21645 #22824, and will also allow to easily solve #21720.

Side note: this already builds on #22978 to be able to run the ASVs at all.

@pep8speaks
Copy link

pep8speaks commented Oct 3, 2018

Hello @h-vetinari! Thanks for updating the PR.

Comment last updated on October 11, 2018 at 20:09 Hours UTC

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this absolutely needs performance tests run
you might be able to always compute the labels
and remove the return_inverse keyword (and just always return to the calling function) as these are all private functions

but perf will dictate this

@h-vetinari
Copy link
Contributor Author

Here's the full ASV run on a cleanly restarted machine with nothing else running.

Bit surprised at the spread (both positive and negative):

      before           after         ratio
     [640162fa]       [9918d52b]
+        488±30μs         967±70μs     1.98  inference.NumericInferOps.time_subtract(<class 'numpy.uint16'>)
+         622±0μs         1.14±0ms     1.83  inference.NumericInferOps.time_multiply(<class 'numpy.int16'>)
+        579±30μs         977±80μs     1.69  inference.NumericInferOps.time_multiply(<class 'numpy.uint16'>)
+      25.7±0.5μs         40.6±3μs     1.58  indexing.CategoricalIndexIndexing.time_getitem_list_like('monotonic_decr')
+      1.98±0.1ms       3.08±0.2ms     1.55  indexing.CategoricalIndexIndexing.time_get_loc_scalar('monotonic_decr')
+         110±8μs          171±0μs     1.55  indexing.DataFrameNumericIndexing.time_iloc
+      2.08±0.1ms       3.12±0.4ms     1.51  groupby.RankWithTies.time_rank_ties('int64', 'max')
+      9.94±0.7μs         14.6±2μs     1.47  indexing.DataFrameStringIndexing.time_getitem_scalar
+      1.95±0.1ms       2.86±0.2ms     1.47  groupby.RankWithTies.time_rank_ties('int64', 'min')
+        2.67±0ms       3.91±0.8ms     1.46  indexing.DataFrameNumericIndexing.time_loc_dups
+         613±0μs          893±0μs     1.46  indexing.DataFrameNumericIndexing.time_bool_indexer
+     4.20±0.05μs       6.10±0.1μs     1.45  indexing.CategoricalIndexIndexing.time_getitem_scalar('monotonic_incr')
+          2.48ms           3.57ms     1.44  index_object.Indexing.time_get_loc_non_unique_sorted('Float')
+        1.23±0ms       1.75±0.1ms     1.42  indexing.CategoricalIndexIndexing.time_get_loc_scalar('monotonic_incr')
+        26.0±2μs         36.6±4μs     1.41  indexing.CategoricalIndexIndexing.time_getitem_list_like('non_monotonic')
+        1.83±0μs       2.57±0.3μs     1.41  indexing.GetItemSingleColumn.time_frame_getitem_single_column_int
+        977±70μs      1.36±0.03ms     1.39  indexing.CategoricalIndexIndexing.time_getitem_list('non_monotonic')
+        220±10μs         306±20μs     1.39  indexing.DataFrameNumericIndexing.time_iloc_dups
+          7.50ms           10.4ms     1.39  index_object.Indexing.time_boolean_array('Float')
+        97.1±0μs          135±8μs     1.39  indexing.DataFrameNumericIndexing.time_loc
+        4.24±0μs       5.86±0.2μs     1.38  indexing.CategoricalIndexIndexing.time_getitem_scalar('non_monotonic')
+         268±0μs         368±20μs     1.37  indexing.DataFrameStringIndexing.time_boolean_rows_object
+        14.3±1μs       19.5±0.9μs     1.36  indexing.DataFrameStringIndexing.time_get_value
+        24.4±0μs         33.3±2μs     1.36  indexing.DataFrameStringIndexing.time_ix
+        1.07±0ms      1.46±0.03ms     1.36  indexing.CategoricalIndexIndexing.time_getitem_list('monotonic_decr')
+     4.20±0.05μs       5.69±0.1μs     1.36  indexing.CategoricalIndexIndexing.time_getitem_scalar('monotonic_decr')
+          1.83μs           2.44μs     1.33  index_object.Indexing.time_get('Float')
+        28.2±2μs         36.6±2μs     1.30  indexing.CategoricalIndexIndexing.time_getitem_list_like('monotonic_incr')
+        13.4±0μs         17.2±1μs     1.29  indexing.CategoricalIndexIndexing.time_getitem_slice('non_monotonic')
+        12.2±0ms         15.6±1ms     1.29  algorithms.Duplicated.time_duplicated_int(False)
+         368±0μs         467±60μs     1.27  groupby.GroupByMethods.time_dtype_as_group('int', 'mean', 'direct')
+         342±0μs         431±30μs     1.26  indexing.CategoricalIndexIndexing.time_getitem_bool_array('monotonic_decr')
+     1.07±0.08ms         1.35±0ms     1.26  indexing.CategoricalIndexIndexing.time_getitem_list('monotonic_incr')
+     1.09±0.06ms      1.37±0.02ms     1.25  indexing.CategoricalIndexIndexing.time_getitem_bool_array('non_monotonic')
+         342±0μs          426±8μs     1.25  indexing.CategoricalIndexIndexing.time_getitem_bool_array('monotonic_incr')
+      18.3±0.4μs       22.6±0.4μs     1.23  indexing.NonNumericSeriesIndexing.time_getitem_scalar('datetime', 'nonunique_monotonic_inc')
+      8.01±0.4μs         9.77±1μs     1.22  indexing.DataFrameStringIndexing.time_loc
+        2.14±0ms       2.60±0.1ms     1.21  indexing.CategoricalIndexIndexing.time_get_loc_scalar('non_monotonic')
+        14.1±0μs       17.1±0.3μs     1.21  indexing.CategoricalIndexIndexing.time_getitem_slice('monotonic_decr')
+           832ns            994ns     1.19  index_object.Indexing.time_get('String')
+        2.68±0ms         3.12±0ms     1.17  index_object.SetOperations.time_operation('datetime', 'intersection')
+        3.12±0ms         3.65±0ms     1.17  io.csv.ReadCSVFloatPrecision.time_read_csv(',', '_', 'high')
+          7.06μs           8.07μs     1.14  categoricals.CategoricalSlicing.time_getitem_list_like('non_monotonic')
+        27.3±0ms         31.2±0ms     1.14  binary_ops.Timeseries.time_timestamp_ops_diff('US/Eastern')
+          13.7ms           15.6ms     1.14  categoricals.Isin.time_isin_categorical('int64')
+           109ms            125ms     1.14  index_object.IndexAppend.time_append_obj_list
+         109±0ms          125±0ms     1.14  index_object.SetOperations.time_operation('strings', 'union')
+           518μs            587μs     1.13  categoricals.CategoricalSlicing.time_getitem_list('non_monotonic')
+        62.5±0ms         70.3±0ms     1.12  eval.Eval.time_and('numexpr', 'all')
+          4.58μs           5.13μs     1.12  categoricals.CategoricalSlicing.time_getitem_slice('monotonic_incr')
+          85.4μs           95.6μs     1.12  frame_methods.XS.time_frame_xs(0)
+           141ms            156ms     1.11  index_object.Indexing.time_get_loc_non_unique('String')
+        85.2±0μs         94.7±6μs     1.11  indexing.NumericSeriesIndexing.time_iloc_array(<class 'pandas.core.indexes.numeric.Float64Index'>, 'unique_monotonic_inc')
+      3.29±0.2ms         3.65±0ms     1.11  io.csv.ReadCSVFloatPrecision.time_read_csv(',', '.', None)
+         383±4ms          422±0ms     1.10  io.stata.Stata.time_read_stata('tq')
-          52.1ms           46.9ms     0.90  eval.Query.time_query_with_boolean_selection
-         375±8ms          336±4ms     0.90  io.excel.Excel.time_write_excel('xlsxwriter')
-          22.0μs           19.5μs     0.89  ctors.SeriesDtypesConstructors.time_index_from_array_floats
-        440±40μs          391±0μs     0.89  groupby.GroupByMethods.time_dtype_as_field('int', 'sum', 'direct')
-      2.44±0.2ms         2.14±0ms     0.88  replace.FillNa.time_replace(True)
-          15.6μs           13.6μs     0.87  index_object.Indexing.time_slice_step('Float')
-          4.84μs           4.20μs     0.87  frame_methods.XS.time_frame_xs(1)
-          5.13μs           4.39μs     0.86  categoricals.CategoricalSlicing.time_getitem_slice('monotonic_decr')
-         183±0μs         155±10μs     0.85  groupby.GroupByMethods.time_dtype_as_field('int', 'shift', 'transformation')
-           1.33s            1.12s     0.85  groupby.GroupByMethods.time_dtype_as_group('float', 'mad', 'direct')
-         406±8ms          344±0ms     0.85  indexing.NonNumericSeriesIndexing.time_getitem_scalar('string', 'unique_monotonic_inc')
-     1.16±0.07ms          977±0μs     0.84  groupby.GroupByMethods.time_dtype_as_field('int', 'rank', 'direct')
-         188±8ms          156±0ms     0.83  groupby.GroupByMethods.time_dtype_as_field('int', 'skew', 'direct')
-         188±4ms          156±0ms     0.83  groupby.GroupByMethods.time_dtype_as_field('int', 'skew', 'transformation')
-        188±70ms          156±8ms     0.83  groupby.GroupByMethods.time_dtype_as_field('int', 'unique', 'direct')
-          5.86μs           4.86μs     0.83  index_object.Indexing.time_slice_step('String')
-          10.1μs           8.32μs     0.83  categoricals.CategoricalSlicing.time_getitem_list_like('monotonic_incr')
-        419±30μs         344±40μs     0.82  groupby.GroupByMethods.time_dtype_as_field('int', 'bfill', 'direct')
-         574±0μs          469±0μs     0.82  frame_methods.Iteration.time_iteritems_cached
-        426±20μs          342±0μs     0.80  groupby.GroupByMethods.time_dtype_as_field('int', 'mean', 'transformation')
-        426±30μs          342±0μs     0.80  groupby.GroupByMethods.time_dtype_as_field('int', 'prod', 'transformation')
-        133±10μs          107±6μs     0.80  groupby.GroupByMethods.time_dtype_as_field('int', 'size', 'direct')
-         334±0μs         267±20μs     0.80  groupby.GroupByMethods.time_dtype_as_field('float', 'bfill', 'direct')
-           1.41s            1.12s     0.80  groupby.Datelike.time_sum('period_range')
-        488±30μs          391±0μs     0.80  groupby.GroupByMethods.time_dtype_as_field('int', 'min', 'direct')
-          26.0ms           20.8ms     0.80  eval.Query.time_query_datetime_column
-          30.3μs           24.2μs     0.80  index_object.Indexing.time_get_loc('Float')
-        91.6±0μs         72.9±3μs     0.80  indexing.NonNumericSeriesIndexing.time_getitem_pos_slice('datetime', 'nonunique_monotonic_inc')
-        537±50μs          426±0μs     0.79  groupby.GroupByMethods.time_dtype_as_field('int', 'tail', 'direct')
-         536±0μs         426±20μs     0.79  groupby.GroupByMethods.time_dtype_as_field('int', 'nunique', 'transformation')
-        513±30μs         402±20μs     0.78  groupby.GroupByMethods.time_dtype_as_field('int', 'tail', 'transformation')
-         141±0μs          110±0μs     0.78  groupby.GroupByMethods.time_dtype_as_field('int', 'size', 'transformation')
-        868±30μs         671±40μs     0.77  groupby.GroupByMethods.time_dtype_as_field('int', 'pct_change', 'transformation')
-        868±30μs          671±0μs     0.77  groupby.GroupByMethods.time_dtype_as_field('int', 'sem', 'direct')
-        509±20μs          391±0μs     0.77  groupby.GroupByMethods.time_dtype_as_field('int', 'min', 'transformation')
-         407±5μs         311±20μs     0.76  groupby.GroupByMethods.time_dtype_as_field('int', 'std', 'transformation')
-         107±8μs         80.1±0μs     0.75  groupby.GroupByMethods.time_dtype_as_field('float', 'all', 'direct')
-        458±10μs          344±0μs     0.75  groupby.GroupByMethods.time_dtype_as_field('int', 'mean', 'direct')
-          4.29ms           3.22ms     0.75  index_object.Indexing.time_get_loc_non_unique('Float')
-        897±50μs         671±40μs     0.75  groupby.GroupByMethods.time_dtype_as_field('int', 'pct_change', 'direct')
-      3.57±0.2ms         2.67±0ms     0.75  groupby.Datelike.time_sum('date_range_tz')
-      2.88±0.2ms       2.14±0.2ms     0.75  groupby.Datelike.time_sum('date_range')
-      8.88±0.9ms         6.58±0ms     0.74  io.hdf.HDFStoreDataFrame.time_query_store_table
-        556±30μs          407±0μs     0.73  groupby.GroupByMethods.time_dtype_as_field('int', 'median', 'transformation')
-        446±20μs          326±0μs     0.73  groupby.GroupByMethods.time_dtype_as_field('int', 'ffill', 'transformation')
-        340±50μs         244±10μs     0.72  groupby.GroupByMethods.time_dtype_as_field('int', 'var', 'transformation')
-     1.20±0.09ms         862±60μs     0.72  groupby.GroupByMethods.time_dtype_as_field('int', 'cummax', 'direct')
-        119±10μs         85.4±5μs     0.72  groupby.GroupByMethods.time_dtype_as_field('int', 'any', 'transformation')
-        570±30μs         407±20μs     0.71  groupby.GroupByMethods.time_dtype_as_field('int', 'head', 'direct')
-        570±40μs         407±40μs     0.71  groupby.GroupByMethods.time_dtype_as_field('int', 'last', 'direct')
-        951±70μs          679±0μs     0.71  inference.NumericInferOps.time_add(<class 'numpy.int16'>)
-        586±50μs          407±0μs     0.69  groupby.GroupByMethods.time_dtype_as_field('int', 'median', 'direct')
-     1.22±0.09ms         837±60μs     0.69  groupby.GroupByMethods.time_dtype_as_field('int', 'cumprod', 'direct')
-        536±70μs          367±0μs     0.68  groupby.GroupByMethods.time_dtype_as_field('int', 'last', 'transformation')
-           719ms          492±4ms     0.68  groupby.GroupByMethods.time_dtype_as_field('int', 'mad', 'transformation')
-         471±0μs          322±0μs     0.68  groupby.GroupByMethods.time_dtype_as_field('int', 'bfill', 'transformation')
-         114±7μs         77.7±0μs     0.68  groupby.GroupByMethods.time_dtype_as_field('int', 'any', 'direct')
-        575±50μs          388±5μs     0.67  groupby.GroupByMethods.time_dtype_as_field('int', 'max', 'direct')
-        586±70μs          391±0μs     0.67  groupby.GroupByMethods.time_dtype_as_field('int', 'max', 'transformation')
-      1.29±0.1ms         854±60μs     0.66  groupby.GroupByMethods.time_dtype_as_field('int', 'cummin', 'transformation')
-           1.94s            1.28s     0.66  groupby.GroupByMethods.time_dtype_as_field('int', 'describe', 'transformation')
-      1.22±0.1ms         804±60μs     0.66  groupby.GroupByMethods.time_dtype_as_field('int', 'cummin', 'direct')
-        20.8±0ms         13.7±0ms     0.66  groupby.Float32.time_sum
-     1.25±0.08ms         804±40μs     0.64  groupby.GroupByMethods.time_dtype_as_field('int', 'cumprod', 'transformation')
-         220±0μs          141±8μs     0.64  groupby.GroupByMethods.time_dtype_as_field('int', 'count', 'direct')
-      1.71±0.1ms         1.09±0ms     0.64  groupby.GroupByMethods.time_dtype_as_field('int', 'value_counts', 'transformation')
-         382±0μs         244±10μs     0.64  groupby.GroupByMethods.time_dtype_as_field('int', 'var', 'direct')
-           2.03s            1.28s     0.63  groupby.GroupByMethods.time_dtype_as_field('int', 'describe', 'direct')
-        467±30μs         293±20μs     0.63  groupby.GroupByMethods.time_dtype_as_field('int', 'ffill', 'direct')
-         250±8ms          156±8ms     0.62  groupby.GroupByMethods.time_dtype_as_field('int', 'unique', 'transformation')
-        613±30μs          382±0μs     0.62  groupby.GroupByMethods.time_dtype_as_field('int', 'first', 'transformation')
-     1.34±0.08ms         822±60μs     0.61  groupby.GroupByMethods.time_dtype_as_field('int', 'cumsum', 'transformation')
-           844ms          500±0ms     0.59  groupby.GroupByMethods.time_dtype_as_field('int', 'mad', 'direct')
-        710±40μs         420±20μs     0.59  groupby.GroupByMethods.time_dtype_as_field('int', 'head', 'transformation')
-        239±10μs          141±0μs     0.59  groupby.GroupByMethods.time_dtype_as_field('int', 'count', 'transformation')
-        1.82±0ms         1.07±0ms     0.59  groupby.GroupByMethods.time_dtype_as_field('int', 'value_counts', 'direct')
-        1.44±0ms         839±60μs     0.58  groupby.GroupByMethods.time_dtype_as_field('int', 'cummax', 'transformation')
-     1.52±0.08ms          854±0μs     0.56  groupby.GroupByMethods.time_dtype_as_field('int', 'cumsum', 'direct')
-         153±0μs         80.1±0μs     0.53  groupby.GroupByMethods.time_dtype_as_field('int', 'all', 'transformation')
-         168±0μs         77.7±0μs     0.46  groupby.GroupByMethods.time_dtype_as_field('float', 'all', 'transformation')
-         171±9μs         73.2±6μs     0.43  groupby.GroupByMethods.time_dtype_as_field('float', 'any', 'direct')
-         171±9μs         73.2±6μs     0.43  groupby.GroupByMethods.time_dtype_as_field('float', 'any', 'transformation')
-     1.17±0.08ms          412±0μs     0.35  frame_ctor.FromRecords.time_frame_from_records_generator(1000)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

Since #22978 wasn't merged yet 10h ago, I tested against master+#22978 see
640162f
9918d52

@h-vetinari
Copy link
Contributor Author

I'm guessing that the perf decrease for some indexing methods will be a dealbreaker, and I don't believe that always returning the inverse will be faster than reading the bint. Might have to have separate codes for unique without inverse vs. factorize/unique with inverse...

@jorisvandenbossche
Copy link
Member

Might have to have separate codes for unique without inverse vs. factorize/unique with inverse...

Yes (which is basically what it was before? but with some additions to get_labels to satisfy the behaviour of unique)

But first, in addition to asv, I would recommend to run a few targetted benchmarks (eg unique and factorize on a single array of some dtype) manually using %timeit before / after. I think that will give you a better idea of the performance cost.

@codecov
Copy link

codecov bot commented Oct 4, 2018

Codecov Report

Merging #22986 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #22986      +/-   ##
==========================================
- Coverage   92.19%   92.19%   -0.01%     
==========================================
  Files         169      169              
  Lines       50954    50952       -2     
==========================================
- Hits        46975    46973       -2     
  Misses       3979     3979
Flag Coverage Δ
#multiple 90.61% <100%> (-0.01%) ⬇️
#single 42.27% <100%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/algorithms.py 95.11% <100%> (-0.02%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1546c35...6d0e86b. Read the comment docs.

@h-vetinari
Copy link
Contributor Author

h-vetinari commented Oct 4, 2018

After trying to force compilation of the different bint code paths (unsuccessfully...) I now added the original (no-inverse-)unique code back in.

But first, in addition to asv, I would recommend to run a few targetted benchmarks (eg unique and factorize on a single array of some dtype)

Following the advice of @jorisvandenbossche, I ran the following benchmarks for the commits in question (first is master + #22978; then the fully unified function; then the different functions for different bint, and reduplicating the no-inverse code, and finally, adding @cython.wraparound(False)). The hashes are the same as in this PR

Relative to master:

                   640162fab9  9918d52b96  52ae84e7f0  dbe4e0ed81  8481e19619
StringIndex               1.0    0.987317    0.977104    0.983180    0.970452
CategoricalIndex          1.0    1.049310    1.050426    0.983520    0.988915
IntIndex                  1.0    1.127943    1.143860    1.013043    0.986293
UIntIndex                 1.0    1.149187    1.144719    1.003053    0.989413
RangeIndex                1.0    1.126137    1.137957    0.996779    0.990026
FloatIndex                1.0    1.127896    1.097702    1.026405    1.026518
TimedeltaIndex            1.0    1.075186    1.078420    1.006917    0.988882
StringSeries              1.0    0.959835    0.983661    0.968425    0.954817
CategoricalSeries         1.0    1.074586    1.084046    1.023071    1.032467
IntSeries                 1.0    1.045707    1.043808    0.951452    0.944722
UIntSeries                1.0    1.100339    1.117216    1.001963    0.983911
RangeSeries               1.0    1.099110    1.107786    0.991575    0.980506
FloatSeries               1.0    1.085278    1.123008    0.994891    1.018042
TimedeltaSeries           1.0    1.067808    1.099913    0.996012    0.992843

Absolute values:

                        640162fab9       9918d52b96       52ae84e7f0       dbe4e0ed81       8481e19619
StringIndex        16.08 ± 0.19 ms  15.88 ± 0.34 ms  15.71 ± 0.20 ms  15.81 ± 0.21 ms  15.61 ± 0.14 ms
CategoricalIndex    1.28 ± 0.05 ms   1.35 ± 0.07 ms   1.35 ± 0.07 ms   1.26 ± 0.06 ms   1.27 ± 0.06 ms
IntIndex            2.75 ± 0.03 ms   3.10 ± 0.04 ms   3.14 ± 0.05 ms   2.78 ± 0.08 ms   2.71 ± 0.05 ms
UIntIndex           2.71 ± 0.03 ms   3.12 ± 0.03 ms   3.11 ± 0.08 ms   2.72 ± 0.03 ms   2.69 ± 0.02 ms
RangeIndex          2.75 ± 0.04 ms   3.10 ± 0.03 ms   3.13 ± 0.05 ms   2.74 ± 0.06 ms   2.72 ± 0.08 ms
FloatIndex          4.62 ± 0.08 ms   5.21 ± 0.33 ms   5.07 ± 0.08 ms   4.74 ± 0.43 ms   4.74 ± 0.32 ms
TimedeltaIndex      4.29 ± 0.04 ms   4.61 ± 0.09 ms   4.63 ± 0.06 ms   4.32 ± 0.08 ms   4.24 ± 0.08 ms
StringSeries       34.96 ± 0.50 ms  33.56 ± 0.36 ms  34.39 ± 1.19 ms  33.86 ± 0.31 ms  33.38 ± 0.51 ms
CategoricalSeries   1.22 ± 0.03 ms   1.31 ± 0.08 ms   1.32 ± 0.05 ms   1.25 ± 0.02 ms   1.26 ± 0.07 ms
IntSeries           3.41 ± 0.17 ms   3.57 ± 0.04 ms   3.56 ± 0.05 ms   3.24 ± 0.05 ms   3.22 ± 0.06 ms
UIntSeries          3.24 ± 0.03 ms   3.56 ± 0.05 ms   3.62 ± 0.07 ms   3.24 ± 0.04 ms   3.19 ± 0.04 ms
RangeSeries         3.25 ± 0.03 ms   3.57 ± 0.07 ms   3.60 ± 0.08 ms   3.22 ± 0.03 ms   3.19 ± 0.03 ms
FloatSeries         4.74 ± 0.06 ms   5.14 ± 0.11 ms   5.32 ± 0.47 ms   4.71 ± 0.05 ms   4.82 ± 0.37 ms
TimedeltaSeries     4.17 ± 0.04 ms   4.45 ± 0.10 ms   4.59 ± 0.11 ms   4.15 ± 0.04 ms   4.14 ± 0.03 ms

Code:

import pandas as pd
import pandas.util.testing as tm
import timeit
hash = pd.__git_version__[:10]

def pps(mean, std):
    system = [(1, 's'), (1e-3, 'ms'), (1e-6, 'μs'), (1e-9, 'ns'), (1e-12, 'ps')]
    for factor, suffix in system:
        if mean >= factor:
            break
    amount = round(mean / factor, 2)
    return f'{mean / factor:.2f} ± {std / factor:.2f} {suffix}'

k = 10 ** 5
rep = 10
number = 100
tic = timeit.default_timer()
tags = ['String', 'Categorical', 'Int', 'UInt',
         'Range', 'Float', 'Timedelta']
df = pd.DataFrame(index = [x+'Index' for x in tags], columns = ['mean', 'std'])
for tag in tags:
    idx = getattr(tm, f'make{tag}Index')(k=k)
    t = timeit.repeat('idx.unique()', setup='from __main__ import idx', repeat = rep+1, number = number)[1:]
    df.loc[tag+'Index', 'mean'] = pd.Series(t).mean() / number
    df.loc[tag+'Index', 'std'] = pd.Series(t).std() / number
    
    s = pd.Series(idx).sample(frac=1.1, replace=True)
    t = timeit.repeat('s.unique()', setup='from __main__ import s', repeat = rep+1, number = number)[1:]
    df.loc[tag+'Series', 'mean'] = pd.Series(t).mean() / number
    df.loc[tag+'Series', 'std'] = pd.Series(t).std() / number

res = df.apply(lambda r: pps(*tuple(r)), axis=1)

@h-vetinari
Copy link
Contributor Author

h-vetinari commented Oct 5, 2018

Here's the ASV run with of 8481e19 against master + #22978. I ran it with -f 1.05 because ~10% for indexing ops would still be sensitive.

The spread still indicates to me that the results are noisy, but in any case, it's clearly skewed towards being faster now. I'll probably rerun against just -b ^index to see how much of the indexing stuff remains after a rerun.

BTW, is it possible to do an ASV run for a PR on a server? Or could someone else rerun it locally? Not sure how representative my machine is.

      before           after         ratio
     [640162fa]       [8481e196]
!      13.9±0.9ms           failed      n/a  algorithms.Duplicated.time_duplicated_float('first')
!      13.9±0.2ms           failed      n/a  algorithms.Duplicated.time_duplicated_float('last')
!        10.9±0ms           failed      n/a  binary_ops.Ops.time_frame_add(False, 'default')
+        536±40μs          744±0μs     1.39  inference.NumericInferOps.time_add(<class 'numpy.int16'>)
+          52.1ms           70.3ms     1.35  eval.Query.time_query_with_boolean_selection
+        13.7±0ms         18.2±1ms     1.33  stat_ops.Rank.time_rank('DataFrame', False)
+        13.7±0ms         18.2±0ms     1.33  stat_ops.Rank.time_rank('DataFrame', True)
+          26.6μs           34.1μs     1.28  index_object.Indexing.time_get_loc_sorted('Float')
+        85.9±0ms          109±4ms     1.27  io.csv.ReadCSVCategorical.time_convert_post
+         477±4ms          602±4ms     1.26  groupby.GroupByMethods.time_dtype_as_field('float', 'mad', 'transformation')
+          2.60ms           3.28ms     1.26  index_object.Indexing.time_get_loc_non_unique('Float')
+        78.1±1ms         97.7±4ms     1.25  binary_ops.Ops.time_frame_comparison(False, 1)
+           562ms            703ms     1.25  gil.ParallelReadCSV.time_read_csv('float')
+          2.03μs           2.44μs     1.20  index_object.Indexing.time_get('Float')
+          10.4ms           12.5ms     1.20  categoricals.CategoricalSlicing.time_getitem_bool_array('non_monotonic')
+      5.76±0.4ms         6.84±0ms     1.19  frame_methods.Fillna.time_frame_fillna(True, 'pad')
+        93.8±6ms            109ms     1.17  io.sas.SAS.time_read_msgpack('sas7bdat')
+        1.08±0ms       1.26±0.2ms     1.16  indexing.CategoricalIndexIndexing.time_getitem_list('non_monotonic')
+           2.34s            2.70s     1.15  gil.ParallelGroups.time_get_groups(4)
+          13.4μs           15.3μs     1.14  ctors.SeriesDtypesConstructors.time_dtindex_from_series
+           109ms            125ms     1.14  index_object.IndexAppend.time_append_obj_list
+        36.5±0ms         41.7±0ms     1.14  join_merge.MergeAsof.time_on_int
+         281±0ms         320±10ms     1.14  io.stata.Stata.time_write_stata('th')
+     1.71±0.06ms         1.93±0ms     1.13  reindex.DropDuplicates.time_frame_drop_dups_int(True)
+        41.7±0ms         46.9±0ms     1.12  frame_methods.Iteration.time_itertuples
+          4.58μs           5.13μs     1.12  index_object.Indexing.time_slice_step('String')
+        28.1±0ms         31.2±0ms     1.11  index_object.SetOperations.time_operation('date_string', 'symmetric_difference')
+        18.8±0ms         20.8±0ms     1.11  io.hdf.HDFStoreDataFrame.time_read_store
+           4.52s            5.00s     1.11  gil.ParallelGroups.time_get_groups(8)
+           103μs            114μs     1.11  frame_methods.XS.time_frame_xs(0)
+        68.7±0μs         72.9±0μs     1.06  indexing.NumericSeriesIndexing.time_getitem_slice(<class 'pandas.core.indexes.numeric.Float64Index'>, 'unique_monotonic_inc')
+        73.2±0μs         77.7±0μs     1.06  indexing.NumericSeriesIndexing.time_ix_scalar(<class 'pandas.core.indexes.numeric.Float64Index'>, 'unique_monotonic_inc')
+          4.40μs           4.66μs     1.06  categoricals.CategoricalSlicing.time_getitem_slice('monotonic_decr')
+        397±20μs          419±0μs     1.05  stat_ops.SeriesOps.time_op('mean', 'float', False)
+           1.19s            1.25s     1.05  join_merge.I8Merge.time_i8merge('left')
-        20.7±0μs       19.7±0.4μs     0.95  indexing.NonNumericSeriesIndexing.time_getitem_scalar('datetime', 'nonunique_monotonic_inc')
-         312±0ms          297±0ms     0.95  io.json.ToJSON.time_delta_int_tstamp_lines('index')
-         312±0ms          297±0ms     0.95  io.json.ToJSON.time_delta_int_tstamp_lines('split')
-      2.82±0.2μs         2.67±0μs     0.95  categoricals.Contains.time_categorical_index_contains
-          8.27ms           7.81ms     0.94  index_object.Indexing.time_boolean_array('Float')
-        4.24±0μs         4.01±0μs     0.94  io.hdf.HDFStoreDataFrame.time_store_repr
-         365±0μs         344±20μs     0.94  indexing.CategoricalIndexIndexing.time_getitem_bool_array('monotonic_decr')
-          7.77μs           7.30μs     0.94  categoricals.CategoricalSlicing.time_getitem_list_like('non_monotonic')
-        83.3±3ms         78.1±0ms     0.94  rolling.Quantile.time_quantile('DataFrame', 1000, 'float', 0.5, 'nearest')
-        1.23±0ms      1.15±0.08ms     0.93  categoricals.Constructor.time_datetimes_with_nat
-        9.77±1ms       9.11±0.2ms     0.93  strings.Cat.time_cat(0, None, '-', 0.001)
-        4.52±0ms       4.21±0.3ms     0.93  algorithms.Hashing.time_series_int
-     1.34±0.08μs       1.24±0.1μs     0.93  period.PeriodProperties.time_property('min', 'dayofweek')
-        372±30ns          344±0ns     0.92  timestamp.TimestampProperties.time_is_quarter_start(None, None)
-         102±0ms         93.8±0ms     0.92  frame_methods.Count.time_count_level_multi(1)
-        4.56±0ms       4.21±0.3ms     0.92  algorithms.Hashing.time_series_timedeltas
-         188±0ms          172±0ms     0.92  io.hdf.HDFStoreDataFrame.time_write_store_table_dc
-         671±0μs          613±0μs     0.91  indexing.DataFrameNumericIndexing.time_bool_indexer
-           1.23s            1.12s     0.91  groupby.GroupByMethods.time_dtype_as_group('float', 'mad', 'transformation')
-      2.29±0.2ms         2.08±0ms     0.91  stat_ops.SeriesOps.time_op('var', 'float', True)
-        7.81±0ms       7.10±0.2ms     0.91  index_object.Ops.time_add('float')
-        61.0±0μs         55.5±1μs     0.91  indexing.NumericSeriesIndexing.time_loc_scalar(<class 'pandas.core.indexes.numeric.Int64Index'>, 'nonunique_monotonic_inc')
-         123±0μs          111±1μs     0.90  indexing.NumericSeriesIndexing.time_loc_slice(<class 'pandas.core.indexes.numeric.Int64Index'>, 'nonunique_monotonic_inc')
-      4.32±0.3ms       3.91±0.1ms     0.90  rolling.Quantile.time_quantile('DataFrame', 1000, 'float', 1, 'lower')
-         156±0ms          141±6ms     0.90  strings.Split.time_split(False)
-      9.38±0.6ms         8.41±0ms     0.90  timeseries.DatetimeIndex.time_normalize('repeated')
-        383±30ns         343±10ns     0.90  timestamp.TimestampProperties.time_quarter(None, None)
-        4.74±0μs         4.24±0μs     0.89  indexing.CategoricalIndexIndexing.time_getitem_scalar('monotonic_decr')
-          13.7ms           12.2ms     0.89  index_object.Indexing.time_boolean_array('String')
-          13.7ms           12.2ms     0.89  index_object.Indexing.time_boolean_series('String')
-         141±0ms          125±0ms     0.89  io.excel.Excel.time_read_excel('openpyxl')
-      93.8±0.7ms         83.3±0ms     0.89  rolling.Quantile.time_quantile('DataFrame', 1000, 'float', 0.5, 'linear')
-      93.8±0.7ms         83.3±3ms     0.89  rolling.Quantile.time_quantile('DataFrame', 1000, 'float', 0.5, 'midpoint')
-        46.9±4ms         41.7±0ms     0.89  strings.Methods.time_title
-         541±0μs          478±0μs     0.88  indexing.NumericSeriesIndexing.time_ix_list_like(<class 'pandas.core.indexes.numeric.Int64Index'>, 'unique_monotonic_inc')
-        88.5±5ms         78.1±0ms     0.88  rolling.Quantile.time_quantile('DataFrame', 1000, 'float', 0.5, 'higher')
-        88.5±5ms         78.1±3ms     0.88  rolling.Quantile.time_quantile('DataFrame', 1000, 'float', 0.5, 'lower')
-         121±0μs          107±0μs     0.88  panel_methods.PanelMethods.time_shift('items')
-        391±20ns         343±20ns     0.88  timestamp.TimestampProperties.time_quarter(<DstTzInfo 'Europe/Amsterdam' LMT+0:20:00 STD>, None)
-          4.88ms           4.29ms     0.88  index_object.Indexing.time_get_loc_non_unique_sorted('Float')
-          20.8ms           18.2ms     0.88  eval.Query.time_query_datetime_column
-      10.4±0.9ms         9.11±0ms     0.88  strings.Cat.time_cat(0, ',', '-', 0.0)
-           125ms            109ms     0.88  strings.Repeat.time_repeat('int')
-        15.6±1ms       13.7±0.7ms     0.88  timeseries.AsOf.time_asof('DataFrame')
-      3.91±0.2ms         3.42±0ms     0.88  timeseries.ToDatetimeCache.time_dup_string_dates(True)
-        458±10ns         401±30ns     0.87  timestamp.TimestampProperties.time_dayofyear(None, None)
-        78.7±0μs         68.7±0μs     0.87  indexing.NumericSeriesIndexing.time_iloc_list_like(<class 'pandas.core.indexes.numeric.Int64Index'>, 'unique_monotonic_inc')
-        48.6±3μs         42.4±2μs     0.87  timestamp.TimestampProperties.time_is_year_start(<DstTzInfo 'Europe/Amsterdam' LMT+0:20:00 STD>, 'B')
-      4.51±0.1ms      3.91±0.09ms     0.87  rolling.Quantile.time_quantile('DataFrame', 1000, 'float', 0, 'midpoint')
-         406±0ms          352±4ms     0.87  timeseries.ToDatetimeISO8601.time_iso8601_tz_spaceformat
-         117±6μs          101±2μs     0.86  indexing.NumericSeriesIndexing.time_getitem_slice(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')
-        473±50μs          407±0μs     0.86  groupby.GroupByMethods.time_dtype_as_group('float', 'sum', 'transformation')
-        54.7±1ms         46.9±0ms     0.86  rolling.Quantile.time_quantile('Series', 10, 'int', 0.5, 'linear')
-        54.7±0ms         46.9±3ms     0.86  strings.Methods.time_center
-        54.7±0ms         46.9±3ms     0.86  strings.Methods.time_pad
-        36.5±0ms         31.2±0ms     0.86  strings.Methods.time_strip
-        13.7±3ms         11.7±0ms     0.86  timedelta.ToTimedelta.time_convert_string_seconds
-      4.56±0.2ms         3.91±0ms     0.86  timeseries.ToDatetimeCache.time_dup_string_tzoffset_dates(True)
-      3.12±0.3ms       2.67±0.2ms     0.86  timeseries.DatetimeIndex.time_add_timedelta('tz_naive')
-        49.1±5μs       42.0±0.5μs     0.86  timestamp.TimestampProperties.time_is_quarter_end(<DstTzInfo 'Europe/Amsterdam' LMT+0:20:00 STD>, 'B')
-        313±30ns          267±0ns     0.85  timedelta.DatetimeAccessor.time_dt_accessor
-        50.0±6μs         42.4±0μs     0.85  timestamp.TimestampProperties.time_is_year_end(<DstTzInfo 'Europe/Amsterdam' LMT+0:20:00 STD>, 'B')
-      7.40±0.2ms       6.25±0.1ms     0.84  timeseries.ResampleSeries.time_resample('period', '5min', 'mean')
-           1.17s            984ms     0.84  groupby.Apply.time_copy_function_multi_col
-      4.65±0.1ms       3.91±0.1ms     0.84  rolling.Quantile.time_quantile('DataFrame', 1000, 'float', 1, 'midpoint')
-         466±0ns         391±20ns     0.84  timestamp.TimestampProperties.time_week(None, None)
-      7.25±0.6ms       6.08±0.4ms     0.84  timeseries.DatetimeIndex.time_timeseries_is_month_start('tz_naive')
-        21.9±0ms         18.2±0ms     0.83  inference.DateInferOps.time_timedelta_plus_datetime
-        10.9±0ms       9.11±0.7ms     0.83  strings.Cat.time_cat(0, None, '-', 0.0)
-      10.9±0.8ms         9.11±0ms     0.83  strings.Cat.time_cat(0, None, None, 0.001)
-      10.9±0.6ms       9.11±0.5ms     0.83  timeseries.DatetimeAccessor.time_dt_accessor_normalize
-        93.8±0ms         78.1±1ms     0.83  strings.Cat.time_cat(3, ',', '-', 0.15)
-         375±8ms          312±0ms     0.83  strings.Split.time_split(True)
-         188±4ms          156±8ms     0.83  timeseries.DatetimeIndex.time_timeseries_is_month_start('tz_aware')
-           703ms          586±4ms     0.83  timeseries.ToDatetimeFormat.time_no_exact
-        17.0±1μs         14.1±0μs     0.83  timestamp.TimestampProperties.time_is_year_end(None, 'B')
-           1.56s            1.30s     0.83  groupby.GroupByMethods.time_dtype_as_field('int', 'describe', 'transformation')
-           734ms            609ms     0.83  timeseries.ToDatetimeFormat.time_exact
-        52.7±2ms         43.7±2ms     0.83  rolling.Quantile.time_quantile('Series', 10, 'int', 0.5, 'higher')
-        23.6±1μs       19.5±0.9μs     0.83  timeseries.AsOf.time_asof_single('Series')
-         180±4ms          148±4ms     0.83  frame_methods.Iteration.time_iteritems_indexing
-        486±30ns          401±0ns     0.82  timestamp.TimestampProperties.time_week(<DstTzInfo 'Europe/Amsterdam' LMT+0:20:00 STD>, 'B')
-        66.4±4ms         54.7±4ms     0.82  strings.Cat.time_cat(3, None, '-', 0.001)
-      7.81±0.8ms       6.43±0.1ms     0.82  timeseries.ToDatetimeYYYYMMDD.time_format_YYYYMMDD
-        568±30μs         467±30μs     0.82  multiindex_object.Values.time_datetime_level_values_sliced
-        343±30ns          282±0ns     0.82  timestamp.TimestampProperties.time_microsecond(None, None)
-        172±20μs          142±0μs     0.82  groupby.GroupByMethods.time_dtype_as_field('float', 'count', 'transformation')
-        378±30ns         311±20ns     0.82  timestamp.TimestampProperties.time_tz(None, None)
-      3.51±0.2ms         2.88±0ms     0.82  timeseries.DatetimeIndex.time_add_timedelta('repeated')
-      10.4±0.9ms       8.52±0.5ms     0.82  strings.Cat.time_cat(0, None, None, 0.0)
-      8.37±0.6ms       6.84±0.4ms     0.82  algorithms.Hashing.time_series_categorical
-        491±60ns          401±0ns     0.82  timestamp.TimestampProperties.time_week(None, 'B')
-         211±8ms          172±0ms     0.81  timeseries.ToDatetimeCache.time_dup_string_tzoffset_dates(False)
-           1.09s            891ms     0.81  timedelta.TimedeltaOps.time_add_td_ts
-           250ms            203ms     0.81  gil.ParallelKth.time_kth_smallest
-     1.35±0.07ms         1.09±0ms     0.81  timeseries.InferFreq.time_infer_freq(None)
-      4.84±0.2ms       3.91±0.1ms     0.81  rolling.Quantile.time_quantile('DataFrame', 1000, 'float', 0, 'higher')
-        10.4±0ms       8.41±0.5ms     0.81  strings.Cat.time_cat(0, ',', '-', 0.001)
-      5.02±0.3ms       4.04±0.1ms     0.80  rolling.Quantile.time_quantile('DataFrame', 1000, 'float', 0, 'nearest')
-           797ms          641±0ms     0.80  inference.MaybeConvertNumeric.time_convert
-      4.62±0.4ms       3.70±0.2ms     0.80  timeseries.AsOf.time_asof_nan_single('DataFrame')
-      3.76±0.1ms       3.01±0.1ms     0.80  rolling.Quantile.time_quantile('DataFrame', 1000, 'int', 1, 'midpoint')
-      2.74±0.2ms         2.20±0ms     0.80  stat_ops.SeriesOps.time_op('var', 'int', True)
-        458±60ns         366±20ns     0.80  timestamp.TimestampProperties.time_days_in_month(None, None)
-        78.1±4ms         62.5±2ms     0.80  rolling.Quantile.time_quantile('DataFrame', 1000, 'int', 0.5, 'midpoint')
-        58.6±4ms         46.9±0ms     0.80  rolling.Quantile.time_quantile('Series', 10, 'int', 0.5, 'midpoint')
-        78.1±0ms         62.5±4ms     0.80  strings.Cat.time_cat(3, None, '-', 0.15)
-        305±20ns         244±10ns     0.80  timedelta.TimedeltaProperties.time_timedelta_seconds
-        31.2±2ms         25.0±0ms     0.80  timeseries.DatetimeIndex.time_to_date('repeated')
-        54.7±0ms         43.7±2ms     0.80  rolling.Quantile.time_quantile('Series', 10, 'int', 0.5, 'lower')
-        27.3±2ms         21.9±0ms     0.80  strings.Contains.time_contains(False)
-      4.63±0.1ms      3.70±0.05ms     0.80  timeseries.ResampleSeries.time_resample('period', '1D', 'mean')
-      3.98±0.1ms      3.17±0.09ms     0.80  rolling.Quantile.time_quantile('DataFrame', 1000, 'int', 1, 'nearest')
-           4.09s            3.27s     0.80  timeseries.ToDatetimeNONISO8601.time_different_offset
-        1.33±0ms         1.06±0ms     0.80  timeseries.InferFreq.time_infer_freq('D')
-        430±20ns         342±20ns     0.79  timestamp.TimestampProperties.time_is_quarter_start(<DstTzInfo 'Europe/Amsterdam' LMT+0:20:00 STD>, None)
-        44.3±5ms         35.2±1ms     0.79  strings.Methods.time_endswith
-        375±10ms          297±2ms     0.79  strings.Methods.time_extract
-           24.8s            19.6s     0.79  strings.Dummies.time_get_dummies
-      3.76±0.2ms      2.97±0.02ms     0.79  rolling.Quantile.time_quantile('DataFrame', 1000, 'int', 1, 'lower')
-        477±10ms          375±4ms     0.79  groupby.Groups.time_series_groups('int64_large')
-        219±10ms          172±0ms     0.79  timeseries.DatetimeIndex.time_to_date('tz_aware')
-      5.07±0.2ms       3.98±0.1ms     0.79  rolling.Quantile.time_quantile('DataFrame', 1000, 'float', 0, 'linear')
-        16.4±1μs       12.8±0.9μs     0.79  timestamp.TimestampOps.time_replace_None('US/Eastern')
-           2.17s            1.70s     0.78  timeseries.ToDatetimeNONISO8601.time_same_offset
-        342±20ns          267±0ns     0.78  timestamp.TimestampProperties.time_microsecond(None, 'B')
-        31.1±0μs         24.3±0μs     0.78  timestamp.TimestampAcrossDst.time_replace_across_dst
-        2.50±0μs         1.95±0μs     0.78  timedelta.TimedeltaConstructor.time_from_missing
-          4.39μs           3.44μs     0.78  categoricals.CategoricalSlicing.time_getitem_scalar('monotonic_decr')
-      7.81±0.5μs         6.10±0μs     0.78  timedelta.TimedeltaConstructor.time_from_unit
-        938±60ns         732±60ns     0.78  timestamp.TimestampOps.time_to_pydatetime(None)
-         343±0ns          267±0ns     0.78  timestamp.TimestampProperties.time_microsecond(<DstTzInfo 'Europe/Amsterdam' LMT+0:20:00 STD>, None)
-          17.6ms           13.7ms     0.78  categoricals.Isin.time_isin_categorical('int64')
-        70.3±5ms         54.7±0ms     0.78  rolling.Quantile.time_quantile('DataFrame', 1000, 'int', 0.5, 'higher')
-        6.87±0μs       5.34±0.4μs     0.78  timedelta.TimedeltaConstructor.time_from_int
-        80.7±3ms         62.5±0ms     0.77  rolling.Quantile.time_quantile('DataFrame', 1000, 'int', 0.5, 'linear')
-        484±10ms          375±4ms     0.77  groupby.Apply.time_copy_overhead_single_col
-         442±0ns          342±0ns     0.77  timestamp.TimestampProperties.time_is_quarter_end(<DstTzInfo 'Europe/Amsterdam' LMT+0:20:00 STD>, None)
-        674±40μs          521±0μs     0.77  indexing.NumericSeriesIndexing.time_ix_list_like(<class 'pandas.core.indexes.numeric.Int64Index'>, 'nonunique_monotonic_inc')
-        443±30ns          342±0ns     0.77  timestamp.TimestampProperties.time_is_year_start(<DstTzInfo 'Europe/Amsterdam' LMT+0:20:00 STD>, None)
-      5.36±0.2ms         4.09±0ms     0.76  rolling.Quantile.time_quantile('Series', 10, 'int', 1, 'linear')
-        13.0±1ms       9.94±0.7ms     0.76  strings.Cat.time_cat(0, None, None, 0.15)
-        1.91±0ms         1.46±0ms     0.76  timeseries.InferFreq.time_infer_freq('B')
-      3.98±0.1ms       3.01±0.1ms     0.76  rolling.Quantile.time_quantile('DataFrame', 1000, 'int', 1, 'higher')
-      4.52±0.1ms         3.42±0ms     0.76  timeseries.AsOf.time_asof_single('DataFrame')
-        487±30μs         366±20μs     0.75  timeseries.DatetimeIndex.time_add_timedelta('dst')
-        31.2±1ms         23.4±0ms     0.75  algorithms.Hashing.time_frame
-        93.8±4ms         70.3±4ms     0.75  strings.Cat.time_cat(3, ',', '-', 0.0)
-        62.5±8ms         46.9±5ms     0.75  strings.Cat.time_cat(3, ',', None, 0.15)
-        41.7±3ms         31.2±2ms     0.75  strings.Methods.time_len
-        46.9±3ms         35.2±2ms     0.75  strings.Methods.time_startswith
-      5.58±0.5ms       4.15±0.1ms     0.74  rolling.Quantile.time_quantile('Series', 10, 'int', 1, 'higher')
-      7.35±0.5ms       5.47±0.2ms     0.74  timeseries.ToDatetimeISO8601.time_iso8601_format
-      4.11±0.2ms       3.05±0.1ms     0.74  rolling.Quantile.time_quantile('DataFrame', 1000, 'int', 0, 'midpoint')
-        70.3±0ms         52.1±0ms     0.74  strings.Cat.time_cat(3, None, None, 0.0)
-      14.1±0.8ms         10.4±0ms     0.74  strings.Cat.time_cat(0, None, '-', 0.15)
-      2.47±0.1ms       1.82±0.1ms     0.74  stat_ops.SeriesOps.time_op('var', 'float', False)
-      5.21±0.3ms       3.83±0.1ms     0.74  rolling.Quantile.time_quantile('DataFrame', 1000, 'float', 0, 'lower')
-        75.5±3ms         54.7±0ms     0.72  rolling.Quantile.time_quantile('DataFrame', 1000, 'int', 0.5, 'lower')
-      4.12±0.2ms      2.98±0.09ms     0.72  rolling.Quantile.time_quantile('DataFrame', 1000, 'int', 1, 'linear')
-        204±30ns          147±0ns     0.72  timeseries.DatetimeAccessor.time_dt_accessor
-      3.47±0.4ms         2.49±0ms     0.72  timeseries.ToDatetimeCache.time_dup_string_dates_and_format(False)
-        21.9±2ms         15.6±0ms     0.71  algorithms.Hashing.time_series_string
-      6.49±0.4μs         4.58±0μs     0.71  timeseries.DatetimeIndex.time_get('tz_naive')
-        488±20ns          343±0ns     0.70  timestamp.TimestampProperties.time_is_year_end(None, None)
-          52.1ms           36.5ms     0.70  eval.Query.time_query_datetime_index
-      2.69±0.4μs       1.83±0.1μs     0.68  timestamp.TimestampConstruction.time_parse_iso8601_no_tz
-      7.29±0.3ms         4.97±0ms     0.68  timeseries.ToDatetimeISO8601.time_iso8601_format_no_sep
-        46.9±8ms         31.2±0ms     0.67  strings.Methods.time_lstrip
-      7.49±0.6ms       4.84±0.5ms     0.65  rolling.Quantile.time_quantile('Series', 10, 'float', 0, 'nearest')
-      2.67±0.2ms       1.71±0.2ms     0.64  timedelta.DatetimeAccessor.time_timedelta_nanoseconds
-      2.68±0.2ms       1.66±0.1ms     0.62  timedelta.DatetimeAccessor.time_timedelta_days
-       997±200μs          613±0μs     0.61  inference.NumericInferOps.time_multiply(<class 'numpy.int16'>)
-      2.60±0.1ms         1.59±0ms     0.61  timedelta.DatetimeAccessor.time_timedelta_seconds
-           1.05s          609±0ms     0.58  timeseries.Iteration.time_iter(<function date_range at 0x000001C81A7DB7B8>)
-        195±70μs          111±0μs     0.57  series_methods.SeriesConstructor.time_constructor(None)
-         977±0μs          518±0μs     0.53  inference.NumericInferOps.time_subtract(<class 'numpy.int16'>)
-        1.09±0ms          570±0μs     0.52  inference.NumericInferOps.time_multiply(<class 'numpy.uint16'>)
-         117±8ms         33.9±0ms     0.29  join_merge.Align.time_series_align_left_monotonic 

@h-vetinari
Copy link
Contributor Author

And here's the one with @cython.wraparound(False):
On some calls, this can completely avoid python interaction, it seems - there's a ~180'000 speed-up for multiindex_object.Integer.time_is_monotonic, for example. The outliers on the upper end seem more harmless to me - and less index-related.

Purely from what's happening in the code:

  • hashtable.unique goes through one more call layer to hashtable._unique_no_inverse (which is unchanged from before), with the layer just forwarding the arguments.
  • hashtable.get_labels goes through hashtable._unique_with_inverse and has a tiny overhead for evaluating the bint ignore_na in the loop (which should however be small compared to (val != val or (use_na_value and val == na_value2) in the exact same if-condition).

To me, these changes shouldn't amount to much difference at all - I'm guessing the spread should mostly be the result of randomness (i.e. what else is happening on the computer in that exact moment), although the mean is (luckily) skewed towards being faster.

What do you all think? Also repeating my previous point:

BTW, is it possible to do an ASV run for a PR on a server? Or could someone else rerun it locally? Not sure how representative my machine is.

      before           after         ratio
     [640162fa]       [27ceb4d6]
!        9.11±0ms           failed      n/a  algorithms.Duplicated.time_duplicated_int('first')
!      9.11±0.3ms           failed      n/a  algorithms.Duplicated.time_duplicated_int('last')
!        258±10ms           failed      n/a  io.json.ToJSON.time_floats_with_dt_index_lines('index')
+         105±9μs        434±100μs     4.13  series_methods.SeriesConstructor.time_constructor(None)
+        133±10μs         244±20μs     1.83  groupby.GroupByMethods.time_dtype_as_field('float', 'count', 'direct')
+          36.5ms           62.5ms     1.71  eval.Query.time_query_datetime_index
+         142±8μs         229±30μs     1.61  groupby.GroupByMethods.time_dtype_as_field('float', 'count', 'transformation')
+        17.1±0μs         27.5±3μs     1.61  offset.OffestDatetimeArithmetic.time_apply(<BusinessMonthEnd>)
+         484±0μs        767±100μs     1.59  inference.NumericInferOps.time_add(<class 'numpy.uint16'>)
+          46.9ms           70.3ms     1.50  eval.Query.time_query_with_boolean_selection
+      1.46±0.1ms       2.20±0.1ms     1.50  timedelta.DatetimeAccessor.time_timedelta_days
+        23.4±0ms       33.9±0.7ms     1.44  io.sql.WriteSQLDtypes.time_to_sql_dataframe_column('sqlite', 'float')
+     1.49±0.09ms         2.14±0ms     1.44  timedelta.DatetimeAccessor.time_timedelta_microseconds
+         243±0μs          344±0μs     1.42  groupby.GroupByMethods.time_dtype_as_field('float', 'cumcount', 'transformation')
+          4.26ms           5.95ms     1.40  categoricals.CategoricalSlicing.time_getitem_bool_array('monotonic_decr')
+        26.1±2μs         35.5±2μs     1.36  offset.OffestDatetimeArithmetic.time_add_10(<SemiMonthEnd: day_of_month=15>)
+         250±0ms          328±4ms     1.31  io.json.ToJSON.time_floats_with_dt_index_lines('columns')
+           486μs            592μs     1.22  categoricals.CategoricalSlicing.time_getitem_list('monotonic_decr')
+         500±8ms         594±20ms     1.19  groupby.GroupByMethods.time_dtype_as_field('int', 'mad', 'direct')
+      6.58±0.4ms         7.81±0ms     1.19  index_object.Ops.time_add('float')
+          4.39μs           5.13μs     1.17  categoricals.CategoricalSlicing.time_getitem_slice('monotonic_incr')
+          3.12ms           3.65ms     1.17  index_object.Indexing.time_get_loc_non_unique('Float')
+          13.7ms           15.6ms     1.14  categoricals.Isin.time_isin_categorical('int64')
+          12.2ms           13.9ms     1.14  index_object.Indexing.time_boolean_series('String')
+        15.6±0ms         17.9±0ms     1.14  index_object.SetOperations.time_operation('date_string', 'union')
+      9.23±0.4ms           10.4ms     1.13  index_object.Indexing.time_boolean_array('Float')
+        31.2±0ms         35.2±0ms     1.12  eval.Eval.time_chained_cmp('numexpr', 1)
+           125ms            141ms     1.12  index_object.Indexing.time_get_loc_sorted('String')
+          10.2ms         11.4±0ms     1.11  index_object.Indexing.time_boolean_series('Float')
+         828±8ms          906±0ms     1.09  gil.ParallelGroups.time_get_groups(2)
+         102±0μs          111±0μs     1.09  indexing.NumericSeriesIndexing.time_ix_scalar(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')
+          6.46μs           7.06μs     1.09  categoricals.CategoricalSlicing.time_getitem_list_like('non_monotonic')
+        404±20ns         442±30ns     1.09  timestamp.TimestampProperties.time_freqstr(None, 'B')
+        91.6±1μs         99.4±0μs     1.09  indexing.NumericSeriesIndexing.time_iloc_array(<class 'pandas.core.indexes.numeric.Float64Index'>, 'unique_monotonic_inc')
+          4.30μs           4.66μs     1.08  categoricals.CategoricalSlicing.time_getitem_slice('non_monotonic')
+          28.3μs           30.5μs     1.08  ctors.SeriesDtypesConstructors.time_dtindex_from_index_with_series
+        47.5±0μs         50.9±1μs     1.07  indexing.NumericSeriesIndexing.time_loc_scalar(<class 'pandas.core.indexes.numeric.Float64Index'>, 'unique_monotonic_inc')
+          11.7ms           12.5ms     1.07  index_object.Indexing.time_boolean_array('String')
+           547ms            578ms     1.06  gil.ParallelReadCSV.time_read_csv('float')
+        10.7±0ms         11.3±0ms     1.06  rolling.Methods.time_rolling('Series', 10, 'int', 'kurt')
+          6.72μs           7.06μs     1.05  categoricals.CategoricalSlicing.time_getitem_list_like('monotonic_incr')
-         160±8μs          153±0μs     0.95  indexing.NumericSeriesIndexing.time_ix_slice(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')
-          4.82μs           4.58μs     0.95  frame_methods.XS.time_frame_xs(1)
-     1.19±0.02ms         1.13±0ms     0.95  frame_methods.Isnull.time_isnull
-         430±4ms          406±0ms     0.95  io.json.ReadJSONLines.time_read_json_lines('int')
-        8.27±0ms       7.81±0.4ms     0.94  indexing.NumericSeriesIndexing.time_getitem_lists(<class 'pandas.core.indexes.numeric.Int64Index'>, 'unique_monotonic_inc')
-        3.17±0ms      2.97±0.02ms     0.94  rolling.Quantile.time_quantile('DataFrame', 10, 'int', 1, 'lower')
-      4.63±0.1ms       4.32±0.2ms     0.93  reindex.DropDuplicates.time_frame_drop_dups_na(True)
-      22.9±0.5μs         21.4±0μs     0.93  indexing.NumericSeriesIndexing.time_getitem_scalar(<class 'pandas.core.indexes.numeric.Float64Index'>, 'unique_monotonic_inc')
-           1.12s            1.05s     0.93  stat_ops.FrameMultiIndexOps.time_op([0, 1], 'kurt')
-         219±0ms          203±0ms     0.93  frame_methods.Iteration.time_iterrows
-           219ms            203ms     0.93  gil.ParallelKth.time_kth_smallest
-          7.86μs           7.28μs     0.93  categoricals.CategoricalSlicing.time_getitem_list_like('monotonic_decr')
-         528±0μs          488±0μs     0.92  frame_methods.Iteration.time_iteritems_cached
-         432±0μs          398±0μs     0.92  frame_ctor.FromRecords.time_frame_from_records_generator(1000)
-        4.51±0ms         4.15±0ms     0.92  rolling.Methods.time_rolling('DataFrame', 10, 'int', 'skew')
-        4.97±0ms         4.56±0ms     0.92  stat_ops.SeriesMultiIndexOps.time_op(0, 'var')
-         172±0ms          156±0ms     0.91  indexing.NonNumericSeriesIndexing.time_getitem_list_like('datetime', 'nonunique_monotonic_inc')
-          3.81μs           3.43μs     0.90  categoricals.CategoricalSlicing.time_getitem_scalar('monotonic_decr')
-           156ms            141ms     0.90  index_object.Indexing.time_get_loc_non_unique('String')
-        13.7±0ms         12.2±0ms     0.89  replace.FillNa.time_replace(False)
-         211±4ms          188±0ms     0.89  frame_methods.Dropna.time_dropna_axis_mixed_dtypes('any', 0)
-        46.9±0ms         41.7±0ms     0.89  io.hdf.HDFStoreDataFrame.time_write_store_table
-          35.5μs           31.5μs     0.89  index_object.Indexing.time_get_loc_sorted('Float')
-         346±0μs          306±5μs     0.88  categoricals.Constructor.time_from_codes_all_int8
-          4.98μs           4.39μs     0.88  categoricals.CategoricalSlicing.time_getitem_slice('monotonic_decr')
-           117μs            103μs     0.88  frame_methods.XS.time_frame_xs(0)
-         108±0μs         95.4±0μs     0.88  indexing.NumericSeriesIndexing.time_ix_slice(<class 'pandas.core.indexes.numeric.Float64Index'>, 'unique_monotonic_inc')
-          14.6μs           12.9μs     0.88  ctors.SeriesDtypesConstructors.time_dtindex_from_series
-      32.6±0.7μs         28.6±0μs     0.88  indexing.NumericSeriesIndexing.time_getitem_scalar(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')
-          2.91ms           2.55ms     0.88  index_object.Indexing.time_get_loc_non_unique_sorted('Float')
-         122±0μs          105±0μs     0.86  indexing.NumericSeriesIndexing.time_loc_slice(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')
-      7.25±0.8μs       6.22±0.3μs     0.86  timedelta.TimedeltaConstructor.time_from_unit
-          37.9μs           32.3μs     0.85  index_object.Indexing.time_get_loc('Float')
-         903±0μs         766±10μs     0.85  indexing.NumericSeriesIndexing.time_getitem_list_like(<class 'pandas.core.indexes.numeric.Int64Index'>, 'nonunique_monotonic_inc')
-      7.72±0.4μs       6.48±0.4μs     0.84  timedelta.TimedeltaConstructor.time_from_string
-         135±5μs          113±4μs     0.84  index_object.SetOperations.time_operation('datetime', 'union')
-          5.54μs           4.58μs     0.83  index_object.Indexing.time_slice_step('String')
-        17.7±0μs         14.6±0μs     0.82  timedelta.TimedeltaConstructor.time_from_components
-          15.3μs           12.5μs     0.82  index_object.Indexing.time_slice('Float')
-         179±4μs          147±3μs     0.82  indexing.NumericSeriesIndexing.time_ix_slice(<class 'pandas.core.indexes.numeric.Int64Index'>, 'nonunique_monotonic_inc')
-      6.25±0.4ms       5.08±0.4ms     0.81  io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine(',', '_', None)
-        39.1±3ms         31.2±0ms     0.80  strings.Methods.time_strip
-        27.3±3ms         21.9±0ms     0.80  timedelta.ToTimedelta.time_convert_string_days
-        62.5±4ms         49.5±3ms     0.79  strings.Cat.time_cat(3, None, None, 0.15)
-        93.8±0ms         74.2±4ms     0.79  series_methods.IsInFloat64.time_isin_nan_values
-        182±10μs         144±10μs     0.79  timedelta.ToTimedelta.time_convert_int
-      13.9±0.9ms         10.9±0ms     0.79  algorithms.Duplicated.time_duplicated_float('last')
-           1.16s            906ms     0.78  timedelta.TimedeltaOps.time_add_td_ts
-        3.51±0ms         2.73±0ms     0.78  timeseries.DatetimeIndex.time_add_timedelta('tz_naive')
-      5.73±0.4μs       4.41±0.2μs     0.77  timedelta.TimedeltaConstructor.time_from_np_timedelta
-        15.8±0μs       12.1±0.9μs     0.77  timedelta.TimedeltaConstructor.time_from_iso_format
-        7.21±0μs       5.50±0.4μs     0.76  timedelta.TimedeltaConstructor.time_from_int
-      9.38±0.6μs         7.06±0μs     0.75  timedelta.TimedeltaConstructor.time_from_datetime_timedelta
-       93.8±20ms         70.3±0ms     0.75  strings.Cat.time_cat(3, None, '-', 0.15)
-        41.7±3ms         31.2±0ms     0.75  strings.Methods.time_len
-        62.5±1ms         46.9±3ms     0.75  strings.Methods.time_pad
-      12.2±0.7ms       9.11±0.5ms     0.75  timeseries.DatetimeAccessor.time_dt_accessor_normalize
-        691±70μs          518±0μs     0.75  inference.NumericInferOps.time_subtract(<class 'numpy.uint16'>)
-      3.91±0.3ms       2.90±0.2ms     0.74  timeseries.DatetimeIndex.time_add_timedelta('tz_aware')
-        70.3±2ms         52.1±1ms     0.74  strings.Cat.time_cat(3, ',', None, 0.0)
-        70.3±0ms       52.1±0.7ms     0.74  strings.Cat.time_cat(3, ',', None, 0.001)
-         422±0ms          312±0ms     0.74  strings.Split.time_split(True)
-         234±8ms          172±8ms     0.73  strings.Slice.time_vector_slice
-         109±4ms         78.1±0ms     0.71  strings.Cat.time_cat(3, ',', '-', 0.15)
-      2.77±0.1μs         1.94±0μs     0.70  timedelta.TimedeltaConstructor.time_from_missing
-        78.1±1ms         54.7±0ms     0.70  strings.Cat.time_cat(3, None, '-', 0.0)
-        78.1±3ms         54.7±0ms     0.70  strings.Methods.time_get
-        39.1±5ms         27.3±0ms     0.70  strings.Methods.time_slice
-           156ms            109ms     0.70  strings.Repeat.time_repeat('int')
-         102±6ms         70.3±4ms     0.69  timedelta.ToTimedeltaErrors.time_convert('ignore')
-        14.0±2μs         9.71±0μs     0.69  timeseries.AsOf.time_asof_single_early('Series')
-        74.2±7ms         50.8±4ms     0.68  strings.Cat.time_cat(3, None, '-', 0.001)
-      5.76±0.7ms         3.91±0ms     0.68  timeseries.AsOf.time_asof_nan_single('Series')
-        54.7±8ms         36.5±3ms     0.67  strings.Methods.time_replace
-        188±20ms          125±2ms     0.67  strings.Methods.time_findall
-         188±8ms          125±8ms     0.67  strings.Split.time_split(False)
-        41.7±3ms         27.3±0ms     0.66  strings.Methods.time_upper
-       886±100μs          570±0μs     0.64  inference.NumericInferOps.time_multiply(<class 'numpy.uint16'>)
-         109±8ms         70.3±0ms     0.64  strings.Methods.time_count
-        221±30ns          141±0ns     0.64  timeseries.DatetimeAccessor.time_dt_accessor
-        46.9±5ms         29.3±2ms     0.62  strings.Methods.time_rstrip
-        21.9±2ms         13.7±0ms     0.62  timeseries.AsOf.time_asof('DataFrame')
-        21.9±0ms         13.7±0ms     0.62  timeseries.AsOf.time_asof_nan('DataFrame')
-        403±60μs         248±20μs     0.62  strings.Encode.time_encode_decode
-        17.9±2ms       10.9±0.8ms     0.61  timedelta.ToTimedelta.time_convert_string_seconds
-           31.2s            19.1s     0.61  strings.Dummies.time_get_dummies
-        31.2±3ms         18.8±1ms     0.60  stat_ops.FrameOps.time_op('mad', 'int', 0, True)
-        93.8±8ms         54.7±0ms     0.58  strings.Contains.time_contains(True)
-        9.94±0ms         5.76±0ms     0.58  timeseries.AsOf.time_asof('Series')
-      9.94±0.5ms       5.76±0.3ms     0.58  timeseries.AsOf.time_asof_nan('Series')
-        516±20ms          297±0ms     0.58  strings.Methods.time_extract
-        82.0±6ms         46.9±1ms     0.57  strings.Cat.time_cat(3, ',', None, 0.15)
-        41.7±3ms         23.4±1ms     0.56  strings.Contains.time_contains(False)
-        6.08±0ms       3.42±0.2ms     0.56  timeseries.AsOf.time_asof_nan_single('DataFrame')
-         453±4ms          250±8ms     0.55  multiindex_object.Values.time_datetime_level_values_copy
-      6.51±0.3ms       3.50±0.2ms     0.54  timeseries.AsOf.time_asof_single('DataFrame')
-        117±20ms         62.5±4ms     0.53  strings.Methods.time_match
-         125±8ms         66.4±4ms     0.53  strings.Cat.time_cat(3, ',', '-', 0.001)
-        70.3±8ms         36.5±3ms     0.52  strings.Methods.time_endswith
-         255±0μs          114±7μs     0.45  timeseries.AsOf.time_asof_single_early('DataFrame')
-        51.1±4μs         20.8±1μs     0.41  timeseries.AsOf.time_asof_single('Series')
-        93.8±0ms          418±0μs     0.00  frame_ctor.FromRecords.time_frame_from_records_generator(None)
-        93.8±6ms          525±0ns     0.00  multiindex_object.Integer.time_is_monotonic

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this change is just moving code around w/o combing anything (which IIUC was the original intention), so why should this PR be accepted?

@h-vetinari
Copy link
Contributor Author

@jreback:
The unifying aspect has been weakened for perf reasons (and because it didn't work to force cython to compile the different bint paths in 52ae84e ).

However, this still gives hashtable.unique a working return_inverse kwarg, and does so with minimal code additions by sharing the implementation of hashtable.get_labels. As such, it will allow fixing #21720 and proceed on all the return_inverse stuff.

@jreback
Copy link
Contributor

jreback commented Oct 6, 2018

However, this still gives hashtable.unique a working return_inverse kwarg, and does so with minimal code additions by sharing the implementation of hashtable.get_labels. As such, it will allow fixing #21720 and proceed on all the return_inverse stuff.

and how does this differ from factorize? which requires no code changes

@h-vetinari
Copy link
Contributor Author

h-vetinari commented Oct 6, 2018

and how does this differ from factorize? which requires no code changes

The difference between .factorize and .unique is the reason for #21720 - .factorize puts all flavours of NaN in one bucket, while unique keeps them separate. You can see the difference between them by looking for the ignore_na-kwarg that I introduced for _unique_with_inverse.

And for implementing a return_inverse kwarg on python level, it'd be nice (and much clearer) to have a cython-level .unique that supports that as well (rather than complicated wrappers around get_labels).

@jreback
Copy link
Contributor

jreback commented Oct 6, 2018

The difference between .factorize and .unique is the reason for #21720 - .factorize puts all flavours of NaN in one bucket, while unique keeps them separate. You can see the difference between them by looking for the ignore_na-kwarg that I introduced for _unique_with_inverse.

then ok with adding the keyword to the existing functions. I just see a giant moving around code with new names for little gain here.

let's try to minimize the changes.

@h-vetinari
Copy link
Contributor Author

then ok with adding the keyword to the existing functions. I just see a giant moving around code with new names for little gain here.

I don't want to implement this within get_labels. It's clunky to have to hand over your own instance of a HashVector (corresponding to the right dtype), not get the uniques returned, etc. This might be needed somewhere else, but is wholly useless for the purpose of .unique. I can "unmove" the existing unique implementations, just tried to group the functions together logically.

@@ -851,20 +885,40 @@ cdef class PyObjectHashTable(HashTable):
val = values[i]
hash(val)

if ((val != val or val is None) or
(use_na_value and val == na_value)):
if ignore_na and ((val != val or val is None)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this keyword is completely untested

Copy link
Contributor Author

@h-vetinari h-vetinari Oct 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would argue that it is (at least for ignore_na=True) because all the tests for factorize need to run through this.

Do we explicitly test cython code? I can if necessary. This PR does the required preparations on cython level to continue with return_inverse later on. It will be excessively tested there (because unique(return_inverse=True) calls _unique_with_inverse(ignore_na=False)).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes of course, you added a keyword.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am talking about the ignore_na keyword

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know that I added this keyword to switch between .unique/.factorize behaviour.

Can you point me to where the cython hashtable code is tested? I could add some tests there. That's why I asked "Do we explicitly test cython code?", because I haven't come across it yet.

Otherwise, I would add copious tests for the behaviour as soon as the keyword makes its way into the pandas API, but this is just the preparation for that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@h-vetinari : IMO tests/generic/test_generic.py or tests/generic/test_frame.py could be good places to put any kinds of tests for this refactoring for now.

If it is possible to hit those keyword arguments from Python land, then do add some tests.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's drop the ignore_na for now. and add it when I see that you actually need it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback
This aspect is the whole point of the PR - to not duplicate the code between .factorize (=hashtable._unique_with_inverse(ignore_na=True)) and the not-yet-implemented-because-this-PR-is-starting-the-process .unique(return_inverse=True) (=hashtable._unique_with_inverse(ignore_na=False))
hashtable._unique_with_inverse(ignore_na=False)

Copy link
Contributor Author

@h-vetinari h-vetinari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for review; undid the moving. Hopefully clearer now.

@@ -851,20 +885,40 @@ cdef class PyObjectHashTable(HashTable):
val = values[i]
hash(val)

if ((val != val or val is None) or
(use_na_value and val == na_value)):
if ignore_na and ((val != val or val is None)
Copy link
Contributor Author

@h-vetinari h-vetinari Oct 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would argue that it is (at least for ignore_na=True) because all the tests for factorize need to run through this.

Do we explicitly test cython code? I can if necessary. This PR does the required preparations on cython level to continue with return_inverse later on. It will be excessively tested there (because unique(return_inverse=True) calls _unique_with_inverse(ignore_na=False)).

@gfyoung gfyoung added the Refactor Internal refactoring of code label Oct 6, 2018
@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Oct 12, 2018

I think we should stop arguing about that point, as I completely understood your point, but clearly didn't make my point clear :-)
(as I also didn't mean to update usages of Factorizer.factorize)

For testing complicated behaviour, a non-trivial test is justified IMO

But is it that complicated?
For a set of values like ['a', 'c', None, np.nan, 'a', 'b', np.NaT, 'c'] (for object dtype), I can write the expected result (and one could rather easily write such a set of values for other dtypes that gives the same expected labels, so you can still parametrize over the different dtypes).
I completely agree that good test coverage is important, but readable / easily maintainable tests are important as well, and I don't really see the added value here.

@h-vetinari
Copy link
Contributor Author

@jorisvandenbossche

For a set of values like ['a', 'c', None, np.nan, 'a', 'b', np.NaT, 'c'] (for object dtype), I can write the expected result

Absolutely (I added this test for you). It's just that then you're only testing the object dtype, plus there might be bugs that only occur for larger examples etc.

So I've made another try at simplifying but keeping the other test - I switched to testing the inverse implicitly through reconstruction - that's essentially just as thorough, but the test becomes much shorter/easier

@h-vetinari
Copy link
Contributor Author

@jorisvandenbossche @jreback How does this look to you now?

@jreback
Copy link
Contributor

jreback commented Oct 12, 2018

i will look soon


Returns
-------
uniques : ndarray[object]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be dtype here

s_duplicated.values.setflags(write=writable)
na_mask = s_duplicated.isna().values

result_inverse, result_unique = htable().factorize(s_duplicated.values)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i would rather you not test unrelated things directly here; these are pure unit tests

it just makes the test confusing. add with the other duplicated tests if you must
though i think these are already well covered, but if not you can add

Copy link
Contributor Author

@h-vetinari h-vetinari Oct 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not unrelated - it's the central functionality of factorize, and a bit more thorough than the few hand-made tests further up.

The code for duplicated is completely separate (in cython) and so this should have a separate test as well IMO

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i know you really want to show that this works, but these tests need hard coded expectations. It is very hard for a reader to assert what you are doing, because you are computing the results here as well. I think its pretty easy to cook an example where you start with a duplicated array and just can assert the results directly. We do lots of these kinds of things already in the drop_duplicates testing.

Copy link
Contributor Author

@h-vetinari h-vetinari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx for the review

s_duplicated.values.setflags(write=writable)
na_mask = s_duplicated.isna().values

result_inverse, result_unique = htable().factorize(s_duplicated.values)
Copy link
Contributor Author

@h-vetinari h-vetinari Oct 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not unrelated - it's the central functionality of factorize, and a bit more thorough than the few hand-made tests further up.

The code for duplicated is completely separate (in cython) and so this should have a separate test as well IMO

@h-vetinari
Copy link
Contributor Author

Failures are one hypothesis timeout and one server timeout, so unrelated.

@jorisvandenbossche
Copy link
Member

Failures are one hypothesis timeout and one server timeout, so unrelated.

If you merge in master, that should be solved

@h-vetinari
Copy link
Contributor Author

This has been passing (aside from flaky CI) for the last 3 commits - anything left to do @jreback @jorisvandenbossche ?

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i am uncomfortable with these tests as is (the last one). I want a bit less computation to produce duplicates and more hard coding of the actual indexers. I don't mind these in addition.

s_duplicated.values.setflags(write=writable)
na_mask = s_duplicated.isna().values

result_inverse, result_unique = htable().factorize(s_duplicated.values)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i know you really want to show that this works, but these tests need hard coded expectations. It is very hard for a reader to assert what you are doing, because you are computing the results here as well. I think its pretty easy to cook an example where you start with a duplicated array and just can assert the results directly. We do lots of these kinds of things already in the drop_duplicates testing.

@jreback
Copy link
Contributor

jreback commented Oct 17, 2018

This has been passing (aside from flaky CI) for the last 3 commits - anything left to do @jreback @jorisvandenbossche ?

we have limited reviewers and 230 odd PR's. Certainly welcome for you to review other PR's to help out.

@h-vetinari
Copy link
Contributor Author

h-vetinari commented Oct 17, 2018

i am uncomfortable with these tests as is (the last one). I want a bit less computation to produce duplicates and more hard coding of the actual indexers. I don't mind these in addition.

So, adding hard-coded tests, and leaving the ones already in the PR as-is? Do I understand that correctly?

This has been passing (aside from flaky CI) for the last 3 commits - anything left to do @jreback @jorisvandenbossche ?

we have limited reviewers and 230 odd PR's. Certainly welcome for you to review other PR's to help out.

Didn't mean to be pushy - was really just asking what's left (after several rounds of reviews).

@jreback
Copy link
Contributor

jreback commented Oct 18, 2018

So, adding hard-coded tests, and leaving the ones already in the PR as-is? Do I understand that correctly?

yes that's fine.

@h-vetinari
Copy link
Contributor Author

yes that's fine.

Cool. I already did that in the latest commit.

@jreback jreback merged commit 99e6401 into pandas-dev:master Oct 18, 2018
@jreback
Copy link
Contributor

jreback commented Oct 18, 2018

thanks @h-vetinari

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Internals Related to non-user accessible pandas implementation Refactor Internal refactoring of code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants