
ENH: adding .unique() to DF (or return_inverse for duplicated) #21357

Open
h-vetinari opened this issue Jun 7, 2018 · 11 comments
Comments

h-vetinari (Contributor) commented Jun 7, 2018

When deduplicating, I need to get the mapping from the deduplicated records back to the originals. Unfortunately, that mapping is expensive to recover with the current API.

Let's say I have a large DataFrame (df) with several object columns (say, sig) that form the deduplication signature (the subset parameter of duplicated).

Then I get the following timings (here only for a sample of 100k records):

import timeit
import numpy as np
import pandas as pd

sig = ['list', 'of', 'column', 'names']
N = 100000
tic = timeit.default_timer()
### no reverse mapping
df[sig].sample(N).duplicated(keep=False)
toc = timeit.default_timer()
print(f'duplicated: {toc-tic:.2f} sec.')

tic = timeit.default_timer()
### no reverse mapping
pd.unique([x for x in df[sig].sample(N).itertuples(index=False, name=None)])
toc = timeit.default_timer()
print(f'pd.unique: {toc-tic:.2f} sec.')

tic = timeit.default_timer()
### this creates the reverse mapping
dups = df.sample(N).groupby(sig).groups.values() # indices per group
redup = pd.concat([pd.Series(x[0], index = x) for x in dups], axis=0) # assume sorted index; otherwise "x.min()"
toc = timeit.default_timer()
print(f'with groupby: {toc-tic:.2f} sec.')

resulting in:

duplicated: 2.45 sec.
pd.unique: 1.64 sec.
with groupby: 16.89 sec.

The performance gets worse with size, so for a few million records I have to wait ~20 min to get the inverse, whereas a pure duplicated call only takes around 10 sec.

This caused me to hunt down the inner workings of duplicated, to see where the inverse gets lost. It turns out that until almost the very end of duplicated, the indices are there:

ids = get_group_index(labels, shape, sort=False, xnull=False)

I extracted the guts of duplicated, and once I have ids for the whole set (~10 sec), I get the following:

[calculate ids]
tic = timeit.default_timer()
### no reverse mapping
duplicated_int64(ids, keep=False)
toc = timeit.default_timer()
print(f'pd.duplicated_int64: {toc-tic:.2f} sec.')

tic = timeit.default_timer()
### no reverse mapping
np.unique(ids)
toc = timeit.default_timer()
print(f'np.unique, no inverse: {toc-tic:.2f} sec.')

tic = timeit.default_timer()
### this creates the reverse mapping
np.unique(ids, return_inverse=True)
toc = timeit.default_timer()
print(f'np.unique, with inverse: {toc-tic:.2f} sec.')

yielding

pd.duplicated_int64: 1.60 sec.
np.unique, no inverse: 0.31 sec.
np.unique, with inverse: 0.64 sec.

So, exposing a return_inverse argument -- either in .duplicated, in .drop_duplicates, or in a separate .unique method for DataFrame -- would accelerate my problem from ~20 min to around 11 sec.

I would imagine the following signature:

df.unique(subset=None, keep={'first'|'last'}, return_inverse=False)

If return_inverse were added to drop_duplicates, it would have to raise an error for the combination keep=False, return_inverse=True, because no inverse can be returned in this case.

I wouldn't know how to adapt the cython code for duplicated_int64, but since it's 5x slower than np.unique anyway, there's no point, IMHO. I would, however, know how to call the numpy function at the appropriate point...

Edit: just saw that np.unique sorts, which makes the 'first'|'last' distinction a fair bit harder...
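
That said, np.unique(..., return_index=True) reports the positions of first occurrences even though the uniques come back sorted, so at least keep='first' should be recoverable. A minimal sketch, assuming ids is positionally aligned with the rows of df:

_, first_pos, inverse = np.unique(ids, return_index=True, return_inverse=True)
### label of the first occurrence of each row's duplication group
reverse = pd.Series(df.index[first_pos[inverse]], index=df.index)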

WillAyd (Member) commented Jun 7, 2018

Can you clarify what you are looking for with a small example?

WillAyd added the Needs Info label on Jun 7, 2018
h-vetinari (Contributor, Author)

@WillAyd

Here's a small example - I'm looking for the quantity reverse:

df = pd.DataFrame({'A': [0, 1, 1, 2, 0], 'B': ['a', 'b', 'b', 'c', 'a']}, index=[1, 4, 9, 16, 25])
df
#     A  B
# 1   0  a
# 4   1  b
# 9   1  b
# 16  2  c
# 25  0  a
### deduplicate
dfdd = df.drop_duplicates()
dfdd
#     A  B
# 1   0  a
# 4   1  b
# 16  2  c
reverse = pd.Series([1, 4, 4, 16, 1], index=[1, 4, 9, 16, 25])
reverse
# 1      1
# 4      4
# 9      4
# 16    16
# 25     1
# dtype: int64
### with .loc/reindex, the values match, but not the indices
dfdd.reindex(reverse)
#     A  B
# 1   0  a
# 4   1  b
# 4   1  b
# 16  2  c
# 1   0  a
### full reconstruction -> the goal
dfdd.loc[reverse].set_index(reverse.index).equals(df)
# True

### turn column B into numbers to avoid numpy issue #641
df.B = [10, 11, 11, 12, 10]
dfdd_np, rev_np = np.unique(df.values, axis=0, return_inverse=True)
dfdd_np
# array([[ 0, 10],
#        [ 1, 11],
#        [ 2, 12]], dtype=int64)
rev_np
# array([0, 1, 1, 2, 0], dtype=int64)
### these outputs are conceptually correct, but need to be mapped to the original indices...
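
For completeness, the positional outputs can be lifted back to the original labels via return_index (a sketch; first_pos holds the position of each unique row's first occurrence):

_, first_pos, rev_np = np.unique(df.values, axis=0,
                                 return_index=True, return_inverse=True)
reverse_np = pd.Series(df.index[first_pos[rev_np]], index=df.index)
reverse_np
# 1      1
# 4      4
# 9      4
# 16    16
# 25     1
# dtype: int64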

PS. As an unrelated remark, dfdd.reindex(labels=reverse, index=reverse.index) should produce the full reconstruction right away, or at least throw an error for receiving contradictory labels/index information (also, surprisingly, index takes precedence even though it's the second argument).

WillAyd (Member) commented Jun 7, 2018

Still a pretty large example, so I can't say I've read through every detail of it, but why don't you just reset the index on dfdd and merge it back onto df, if it's reverse that you are trying to reconstruct?
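
Roughly, on the small example above (a sketch, merging on the signature columns A/B):

merged = df.merge(dfdd.reset_index(), on=['A', 'B'], how='left')
reverse = pd.Series(merged['index'].values, index=df.index)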

h-vetinari (Contributor, Author)

Well, I tried to be thorough with the example (twice)...

Merging a couple million records simultaneously on 6 object columns in the presence of NaNs might get towards the result, but it's a huge waste of time, because the indices are effectively just lying there in the penultimate step of duplicated - surely the merge would take an order of magnitude more time than the np.unique(ids, return_inverse=True) call I demonstrated in the OP.

h-vetinari (Contributor, Author)

Also, it's established functionality in numpy. It's often useful to be able to go back and forth easily between duplicated and deduplicated data.

WillAyd (Member) commented Jun 7, 2018

OK - PRs are always welcome if you have an idea on how to do it. Probably best served in drop_duplicates rather than creating a new unique method for frames.

h-vetinari (Contributor, Author)

@WillAyd
The problem with adding it to drop_duplicates is that I'd have to change the output signature depending on whether the boolean is True or False -- this is what numpy does as well, but I don't know if there's any precedent for that in pandas. Also, this would make it difficult w.r.t. inplace=True.

PS. Can you replace the "Needs Info" tag with something more reflective of the situation? :)

WillAyd added the Ideas and Enhancement labels and removed the Needs Info label on Jun 8, 2018
h-vetinari (Contributor, Author)

@WillAyd @jreback @TomAugspurger

I'm interested in tackling this. Is there any precedent in pandas for changing the output signature of a function depending on whether a boolean is True/False (say, from Series to a tuple of Series)?

This is how numpy.unique does it, but it feels "wrong" to me. Another argument, IMO, for not putting this into duplicated is that an inverse would not make sense for keep=False, and it would be surprising from a user perspective that some kwarg combinations are mutually incompatible. Finally, since the goal is to return a reverse mapping, the obvious counterpart would be a forward mapping as well (instead of a Series of booleans).

With these considerations, I'd suggest a new function along the lines of

duplicate_map(self, subset=None, keep={'first'|'last'})

[...]

Returns
-------
tuple: (Index [or Series?] mapping from original indices to indices of unique values,
        Series mapping from unique indices to original index)

With such a function we could then do (abstractly):

orig2unique, unique2orig = df.duplicate_map(subset=mysubset)
unique = df.reindex(orig2unique)
[processing; e.g. aggregation of values outside mysubset across duplicates]
reduplicated = unique.reindex(unique2orig.values)  # has duplicates in index
df_processed = reduplicated.set_index(unique2orig.index)  # restored index of original

In terms of the example above (without the processing step):

df = pd.DataFrame({'A': [0, 1, 1, 2, 0], 'B': ['a', 'b', 'b', 'c', 'a']},
                  index=[1, 4, 9, 16, 25])
#     A  B
# 1   0  a
# 4   1  b
# 9   1  b
# 16  2  c
# 25  0  a

orig2unique, unique2orig = df.duplicate_map()
orig2unique
# Int64Index([1, 4, 16], dtype='int64')

unique2orig
# 1      1
# 4      4
# 9      4
# 16    16
# 25     1
# dtype: int64

unique = df.reindex(orig2unique)
unique 
#     A  B
# 1   0  a
# 4   1  b
# 16  2  c

reduplicated = unique.reindex(unique2orig.values) # has duplicates in index
reduplicated
#     A  B
# 1   0  a
# 4   1  b
# 4   1  b
# 16  2  c
# 1   0  a

df_restored = reduplicated.set_index(unique2orig.index)  # restored index of original
df_restored
#     A  B
# 1   0  a
# 4   1  b
# 9   1  b
# 16  2  c
# 25  0  a
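
For illustration, the proposed semantics can already be emulated in user space - a rough sketch for keep='first' only, not an existing pandas method:

def duplicate_map(df, subset=None):
    ### sketch of the proposed method (hypothetical; keep='first' only)
    cols = df if subset is None else df[subset]
    ### integer id per row, numbered in order of first appearance
    ids, _ = pd.factorize(list(cols.itertuples(index=False, name=None)))
    _, first_pos, inverse = np.unique(ids, return_index=True, return_inverse=True)
    orig2unique = df.index[first_pos]
    unique2orig = pd.Series(df.index[first_pos[inverse]], index=df.index)
    return orig2unique, unique2orig

On the example above, this returns exactly the orig2unique and unique2orig shown.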

jreback (Contributor) commented Jun 21, 2018

-1 on adding any new methods, but I would be ok with adding a return_inverse kwarg (clearly defaulting to False) to .duplicated(), to easily recover the indexer.

h-vetinari (Contributor, Author)

@jreback Do I understand you correctly that you would output a tuple if return_inverse=True?

In other words:

isdup = df.duplicated()
isdup, inverse = df.duplicated(return_inverse=True)

?

jreback (Contributor) commented Jun 21, 2018

yes
