API/ ENH: Unambiguous indexing should be allowed, even if duplicates are present #38797

ivirshup opened this issue Dec 30, 2020 · 2 comments

ivirshup (Contributor) commented Dec 30, 2020

Is your feature request related to a problem?

While working on #38745, I was expecting this to be allowed:

import pandas as pd

result = pd.concat(
    [
        pd.Series(0, index=[0, 0, 1, 2]),
        pd.Series(1, index=[1, 2]),
    ],
    axis=1,
    join="inner",
)
expected = pd.DataFrame({0: [0, 0], 1: [1, 1]}, index=[1, 2])

pd.testing.assert_frame_equal(result, expected)

Because the intersection of those indices is well defined. However, it turns out this does not work: pd.concat raises an error, and has done so for a while (it also errors in v1.1.5). This is related to allowing .get_indexer when the result is uniquely specified (#38773), but I'll go into that more below.

Describe the solution you'd like

I would like this to work.

I think this would also require these to work (they currently error):

import numpy as np

pd.Series(np.arange(4), index=[0, 0, 1, 2]).reindex([1, 2])

pd.Index([0, 0, 1, 2]).get_indexer([1, 2])

This could coincide with better error messages when the results are ambiguous, e.g. a DuplicateLabelError that explicitly names the offending duplicates.
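As a rough sketch of what a more explicit error could look like (the message format here is hypothetical, not existing pandas behaviour; DuplicateLabelError itself does exist in pandas.errors):

import pandas as pd
from pandas.errors import DuplicateLabelError

idx = pd.Index([0, 0, 1, 2])
target = [0, 1]

# Name the labels that make the request ambiguous
offending = sorted(set(idx[idx.duplicated()]) & set(target))
if offending:
    raise DuplicateLabelError(
        f"Cannot resolve {target}: label(s) {offending} "
        f"appear multiple times in the index"
    )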

API breaking implications

I think this is a subtle change to the API which should not break much. It will mostly turn some errors into working cases, and those cases have unambiguous results.

I'd divide the behaviour changes into indexing functions and API functions. These indexing functions would change:

  • Index.get_indexer
  • obj.reindex
  • obj.reindex_like

These API functions would change, though any function that relies on the intersection of indices could be affected:

  • pd.concat(..., join="inner")
  • pd.crosstab

Describe alternatives you've considered

  • There could be no change: these operations continue to error, as they did in v1.1.5 and do currently. Maybe it's better to just encourage unique indices in all cases, but sometimes duplicates do happen.
  • There could be a narrower change (e.g. only pd.concat changes, handling the index alignment itself).

Additional context

The exact behavior of combining indices with set operations has a long history here, so I can't provide full context. I'll go into the bits of that history I've come across while looking into this issue.

join="inner" (Specific to concat)

It is a bit strange that concat(..., join="inner") takes the intersection of the indices, since an inner join would behave somewhat differently. It's documented that concat uses set operations on the "other axis", so I don't think this is a huge issue, but the value "inner" does suggest join-like behaviour, where duplicates would always be allowed.

I think this has been a source of confusion about concat's handling of duplicates (#36263). With set operations, there's no way to handle duplicates (unless we take some other definition of a set (#6963 (comment))). With joins, you'd take the cross product, which is often undesirable (#36290).
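To make the distinction concrete, here is a minimal sketch contrasting the two semantics (using DataFrame.join to stand in for an SQL-style inner join):

import pandas as pd

left = pd.DataFrame({"v": [0, 1]}, index=[0, 0])
right = pd.DataFrame({"w": [2, 3]}, index=[0, 0])

# Join semantics: the duplicated key 0 produces the 2 x 2
# cross product, i.e. four rows
left.join(right, how="inner")

# Set semantics: the intersection collapses the duplicates instead
left.index.intersection(right.index)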

Performance

There should only be a performance penalty when:

  • There are duplicated indices, so we need to check whether they would be in the result.
    • This may happen multiple times. In concat it would happen once when intersecting indices, and again in get_indexer. Is this so bad? Alternatively, there could be an "unsafe" get_indexer/reindex where the user promises the passed index has a unique solution, and no check occurs.
  • We check whether indexes are unique, but that result is cached. The check also wouldn't be necessary when allows_duplicate_labels=False. A sketch of where this cost would sit follows the list.
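A minimal sketch, assuming a hypothetical duplicate-tolerant variant of get_indexer (the function name is made up; is_unique is cached on Index, so unique indexes pay nothing extra):

import numpy as np
import pandas as pd

def get_indexer_tolerant(source: pd.Index, target) -> np.ndarray:
    # Fast path: cached uniqueness check, current behaviour
    if source.is_unique:
        return source.get_indexer(target)
    # Slow path, only for duplicated indexes: verify the request
    # is unambiguous, then resolve the labels one by one
    duplicated = set(source[source.duplicated()])
    if any(lab in duplicated for lab in target):
        raise ValueError(f"ambiguous: duplicated labels in {list(target)}")
    return np.array([source.get_loc(lab) for lab in target])

get_indexer_tolerant(pd.Index([0, 0, 1, 2]), [1, 2])  # array([2, 3])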

.reindex, .get_indexer

Ultimately, I think this is an issue about which operations are unambiguous when there are duplicate values in the indices. Because of this, there may need to be other behaviour changes to make this request consistent with the rest of the library and reasonable to implement. For instance:

s1 = pd.Series(0, index=[0, 0, 1, 2])
s2 = pd.Series(1, index=[1, 2])

shared_indices = s1.index.intersection(s2.index)

s1.reindex(shared_indices)  # This currently errors

This currently errors, since the behaviour follows this logic (stated in #28257):

... .reindex should allow targets to be duplicated but not the sources. However, currently it allows the source to be duplicated but not the targets, which is exactly the wrong behaviour.

This is a bit similar to this statement (#8849):

I propose that when a series has a duplicate index, the method reindex should always raise an exception, because when a series with a duplicate index is to be conformed to a new index, the intended behaviour is always ambiguous.

I think neither is quite right. I think that reindex, get_indexer, etc. should allow any operation where the result is uniquely defined. If the source has duplicates and the target includes those duplicates, the result is ambiguous. If the source has duplicates but the target does not contain them, the result is uniquely specified.

# ambiguous, should not be allowed
pd.Index([0, 0, 1, 2]).get_indexer([0, 1])

# unambiguous, should be allowed
pd.Index([0, 0, 1, 2]).get_indexer([1, 2])

Additional example with pd.crosstab

Current behaviour of pd.crosstab

pd.crosstab(
    pd.Series(np.arange(4), index=[0, 1, 1, 2]),
    pd.Series(np.arange(2), index=[0, 2]),
)
# ValueError: cannot reindex from a duplicate axis

pd.crosstab(
    pd.Series(np.arange(4), index=[0, 1, 1, 2]),
    pd.Series(np.arange(2), index=[3, 4]),
)
# Empty DataFrame
# Columns: []
# Index: []

This proposal would also allow the first example to work, resulting in:

col_0  0  1
row_0      
0      1  0
3      0  1
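For comparison, doing the alignment by hand already works today, since the duplicated label 1 is not in the intersection; the proposal would effectively perform this restriction automatically:

import numpy as np
import pandas as pd

s1 = pd.Series(np.arange(4), index=[0, 1, 1, 2])
s2 = pd.Series(np.arange(2), index=[0, 2])

# Restrict both series to the shared labels first; each shared
# label occurs exactly once in s1, so the selection is unambiguous
shared = s2.index.intersection(s1.index)
pd.crosstab(s1.loc[shared], s2.loc[shared])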
ivirshup added the Enhancement and Needs Triage labels on Dec 30, 2020
jreback (Contributor) commented Dec 30, 2020

thanks for the examples @ivirshup

currently we allow concat et al. when either:

  • all indexes are unique
  • all indices are exactly equal (duplicates or not)

adding unambiguous duplicates, while a good idea in theory, I am afraid makes this value-dependent, e.g. you could happen to change the columns slightly and it works, while a different change makes it fail.
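for instance (a sketch of the concern; the comments describe what the proposal implies, since both calls error today):

import pandas as pd

idx = pd.Index([0, 0, 1, 2])

idx.get_indexer([1, 2])  # would succeed: no requested label is duplicated
idx.get_indexer([0, 1])  # would still fail: 0 is duplicated in idx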

ivirshup (Contributor, Author) commented Jan 4, 2021

I guess I see the presence of duplicates as a value dependency already. Though I'm thinking of things as value-dependent vs. type-dependent; were you thinking of a different kind of value dependency?

I'd also say my main point is more that:

# If this can exist
df = pd.DataFrame(np.zeros((1, 4)), columns=list("aabc"))

# And this is allowed
df[["b", "c"]]
#      b    c
# 0  0.0  0.0

# It would make sense if this worked, though it currently does not
df.columns.get_indexer(["b", "c"])
# InvalidIndexError: Reindexing only valid with uniquely valued Index objects 

And the implications of this extend to a number of functions which deal with intersections and unions of indexes.

I also think it is ultimately less work to use and maintain a consistent set of rules (e.g. when labels can be resolved) since more code can be shared.
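For a concrete sketch of that shared logic: get_loc already resolves labels that are unique within a duplicated index, which is essentially the positional answer get_indexer could return:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((1, 4)), columns=list("aabc"))

# df[["b", "c"]] resolves these labels internally today; get_loc
# gives their positions even though the columns are not unique
[df.columns.get_loc(lab) for lab in ["b", "c"]]  # [2, 3]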

mroeschke removed the Needs Triage label on Aug 14, 2021