
ENH: adding .unique() to DF (or return_inverse for duplicated) #21357

Open
h-vetinari opened this issue Jun 7, 2018 · 11 comments
Comments

h-vetinari (Contributor) commented Jun 7, 2018

When deduplicating, I need to get the mapping from the deduplicated records back to the originals. Unfortunately, that mapping is expensive to recover with the current API.

Let's say I have a large DataFrame (df) with several object columns (say, sig) that form the deduplication signature (the subset parameter of duplicated).

Then I get the following timings (here only for a sample of 100k records):

import timeit
import numpy as np
import pandas as pd

sig = ['list', 'of', 'column', 'names']
N = 100000
tic = timeit.default_timer()
### no reverse mapping
df[sig].sample(N).duplicated(keep=False)
toc = timeit.default_timer()
print(f'duplicated: {toc-tic:.2f} sec.')

tic = timeit.default_timer()
### no reverse mapping
pd.unique([x for x in df[sig].sample(N).itertuples(index=False, name=None)])
toc = timeit.default_timer()
print(f'pd.unique: {toc-tic:.2f} sec.')

tic = timeit.default_timer()
### this creates the reverse mapping
dups = df.sample(N).groupby(sig).groups.values() # indices per group
redup = pd.concat([pd.Series(x[0], index = x) for x in dups], axis=0) # assume sorted index; otherwise "x.min()"
toc = timeit.default_timer()
print(f'with groupby: {toc-tic:.2f} sec.')

resulting in:

duplicated: 2.45 sec.
pd.unique: 1.64 sec.
with groupby: 16.89 sec.

The performance gets worse with size, so for a few million records I have to wait ~20 min to get the inverse, whereas a pure duplicated call only takes around 10 sec.

This caused me to hunt down the inner workings of duplicated, to see where the inverse gets lost. It turns out that until almost the very end of duplicated, the indices are there:

ids = get_group_index(labels, shape, sort=False, xnull=False)

I extracted the guts of duplicated, and once I have ids for the whole set (~10 sec), I get the following:

[calculate ids]
tic = timeit.default_timer()
### no reverse mapping
duplicated_int64(ids, keep=False)
toc = timeit.default_timer()
print(f'pd.duplicated_int64: {toc-tic:.2f} sec.')

tic = timeit.default_timer()
### no reverse mapping
np.unique(ids)
toc = timeit.default_timer()
print(f'np.unique, no inverse: {toc-tic:.2f} sec.')

tic = timeit.default_timer()
### this creates the reverse mapping
np.unique(ids, return_inverse=True)
toc = timeit.default_timer()
print(f'np.unique, with inverse: {toc-tic:.2f} sec.')

yielding

pd.duplicated_int64: 1.60 sec.
np.unique, no inverse: 0.31 sec.
np.unique, with inverse: 0.64 sec.

So, exposing a return_inverse argument -- either in .duplicated, in .drop_duplicates, or in a separate .unique method for DataFrame -- would accelerate my problem from ~20 min to around 11 sec.

I would imagine the following signature:

df.unique(subset=None, keep={'first'|'last'}, return_inverse=False)

If return_inverse were added to drop_duplicates, it would have to raise an error for the combination keep=False, return_inverse=True, because no inverse can be returned in this case.

I wouldn't know how to adapt the cython code for duplicated_int64, but since it's 5x slower than np.unique anyway, there's no point, IMHO. I would, however, know how to call the numpy function at the appropriate point...

Edit: just saw that np.unique sorts, which makes the 'first'|'last' distinction a fair bit harder...
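
That said, np.unique(..., return_index=True) reports the positions of first occurrences even though the uniques come back sorted, so at least keep='first' should be recoverable. A minimal sketch, assuming ids is positionally aligned with the rows of df:

_, first_pos, inverse = np.unique(ids, return_index=True, return_inverse=True)
### label of the first occurrence of each row's duplication group
reverse = pd.Series(df.index[first_pos[inverse]], index=df.index)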

WillAyd (Member) commented Jun 7, 2018

Can you clarify what you are looking for with a small example?

WillAyd added the Needs Info label on Jun 7, 2018
h-vetinari (Contributor, Author)

@WillAyd

Here's a small example - I'm looking for the quantity reverse:

df = pd.DataFrame({'A': [0, 1, 1, 2, 0], 'B': ['a', 'b', 'b', 'c', 'a']}, index=[1, 4, 9, 16, 25])
df
#     A  B
# 1   0  a
# 4   1  b
# 9   1  b
# 16  2  c
# 25  0  a
### deduplicate
dfdd = df.drop_duplicates()
dfdd
#     A  B
# 1   0  a
# 4   1  b
# 16  2  c
reverse = pd.Series([1, 4, 4, 16, 1], index=[1, 4, 9, 16, 25])
reverse
# 1      1
# 4      4
# 9      4
# 16    16
# 25     1
# dtype: int64
### with .loc/reindex, the values match, but not the indices
dfdd.reindex(reverse)
#     A  B
# 1   0  a
# 4   1  b
# 4   1  b
# 16  2  c
# 1   0  a
### full reconstruction -> the goal
dfdd.loc[reverse].set_index(reverse.index).equals(df)
# True

### turn column B into numbers to avoid numpy issue #641
df.B = [10, 11, 11, 12, 10]
dfdd_np, rev_np = np.unique(df.values, axis=0, return_inverse=True)
dfdd_np
# array([[ 0, 10],
#        [ 1, 11],
#        [ 2, 12]], dtype=int64)
rev_np
# array([0, 1, 1, 2, 0], dtype=int64)
### these outputs are conceptually correct, but need to be mapped to the original indices...
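
For completeness, the positional outputs can be lifted back to the original labels via return_index (a sketch; first_pos holds the position of each unique row's first occurrence):

_, first_pos, rev_np = np.unique(df.values, axis=0,
                                 return_index=True, return_inverse=True)
reverse_np = pd.Series(df.index[first_pos[rev_np]], index=df.index)
reverse_np
# 1      1
# 4      4
# 9      4
# 16    16
# 25     1
# dtype: int64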

PS. As an unrelated remark, dfdd.reindex(labels=reverse, index=reverse.index) should produce the full reconstruction right away, or at least throw an error for receiving contradictory labels/index information (also, surprisingly, index takes precedence even though it's the second argument).

WillAyd (Member) commented Jun 7, 2018

Still a pretty large example, so I can't say I've read through every detail of it, but why don't you just reset the index on dfdd and merge it back onto df, if it's reverse that you are trying to reconstruct?
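
Roughly, on the small example above (a sketch, merging on the signature columns A/B):

merged = df.merge(dfdd.reset_index(), on=['A', 'B'], how='left')
reverse = pd.Series(merged['index'].values, index=df.index)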

h-vetinari (Contributor, Author)

Well, I tried to be thorough with the example (twice)...

Merging a couple million records simultaneously on 6 object columns in the presence of NaNs might get towards the result, but it's a huge waste of time, because the indices are effectively just lying there in the penultimate step of duplicated - surely the merge would take an order of magnitude more time than the np.unique(ids, return_inverse=True) call I demonstrated in the OP.

h-vetinari (Contributor, Author)

Also, it's established functionality in numpy. It's often useful to be able to go back and forth easily between duplicated and deduplicated data.

WillAyd (Member) commented Jun 7, 2018

OK - PRs are always welcome if you have an idea on how to do it. Probably best served in drop_duplicates rather than creating a new unique method for frames.

h-vetinari (Contributor, Author)

@WillAyd
The problem with adding it to drop_duplicates is that I'd have to change the output signature depending on whether the boolean is True or False -- this is what numpy does as well, but I don't know if there's any precedent for that in pandas. Also, this would make it difficult w.r.t. inplace=True.

PS. Can you replace the "Needs Info" tag with something more reflective of the situation? :)

WillAyd added the Ideas and Enhancement labels and removed the Needs Info label on Jun 8, 2018
h-vetinari (Contributor, Author)

@WillAyd @jreback @TomAugspurger

I'm interested in tackling this. Is there any precedent in pandas for changing the output signature of a function depending on whether a boolean is True/False (say, from Series to a tuple of Series)?

This is how numpy.unique does it, but it feels "wrong" to me. Another argument, IMO, for not putting this into duplicated is that an inverse would not make sense for keep=False, and it would be surprising from a user perspective that some kwarg combinations are mutually incompatible. Finally, since the goal is to return a reverse mapping, the obvious counterpart would be a forward mapping as well (instead of a Series of booleans).

With these considerations, I'd suggest a new function along the lines of

duplicate_map(self, subset=None, keep={'first'|'last'})

[...]

Returns
-------
tuple: (Index [or Series?] mapping from original indices to indices of unique values,
        Series mapping from unique indices to original index)

With such a function we could then do (abstractly):

orig2unique, unique2orig = df.duplicate_map(subset=mysubset)
unique = df.reindex(orig2unique)
[processing; e.g. aggregation of values outside mysubset across duplicates]
reduplicated = unique.reindex(unique2orig.values)  # has duplicates in index
df_processed = reduplicated.set_index(unique2orig.index)  # restored index of original

In terms of the example above (without the processing step):

df = pd.DataFrame({'A': [0, 1, 1, 2, 0], 'B': ['a', 'b', 'b', 'c', 'a']},
                  index=[1, 4, 9, 16, 25])
#     A  B
# 1   0  a
# 4   1  b
# 9   1  b
# 16  2  c
# 25  0  a

orig2unique, unique2orig = df.duplicate_map()
orig2unique
# Int64Index([1, 4, 16], dtype='int64')

unique2orig
# 1      1
# 4      4
# 9      4
# 16    16
# 25     1
# dtype: int64

unique = df.reindex(orig2unique)
unique 
#     A  B
# 1   0  a
# 4   1  b
# 16  2  c

reduplicated = unique.reindex(unique2orig.values) # has duplicates in index
reduplicated
#     A  B
# 1   0  a
# 4   1  b
# 4   1  b
# 16  2  c
# 1   0  a

df_restored = reduplicated.set_index(unique2orig.index)  # restored index of original
df_restored
#     A  B
# 1   0  a
# 4   1  b
# 9   1  b
# 16  2  c
# 25  0  a
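
For illustration, the proposed semantics can already be emulated in user space - a rough sketch for keep='first' only, not an existing pandas method:

def duplicate_map(df, subset=None):
    ### sketch of the proposed method (hypothetical; keep='first' only)
    cols = df if subset is None else df[subset]
    ### integer id per row, numbered in order of first appearance
    ids, _ = pd.factorize(list(cols.itertuples(index=False, name=None)))
    _, first_pos, inverse = np.unique(ids, return_index=True, return_inverse=True)
    orig2unique = df.index[first_pos]
    unique2orig = pd.Series(df.index[first_pos[inverse]], index=df.index)
    return orig2unique, unique2orig

On the example above, this returns exactly the orig2unique and unique2orig shown.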

jreback (Contributor) commented Jun 21, 2018

-1 on adding any new methods, but I would be ok with adding a return_inverse kwarg (clearly defaulting to False) to .duplicated(), to easily recover the indexer.

h-vetinari (Contributor, Author)

@jreback Do I understand you correctly that you would output a tuple if return_inverse=True?

In other words:

isdup = df.duplicated()
isdup, inverse = df.duplicated(return_inverse=True)

?

jreback (Contributor) commented Jun 21, 2018

yes
