Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regression in DataFrame.set_index with class instance column keys #24969

Closed
wkschwartz opened this issue Jan 27, 2019 · 22 comments · Fixed by #24984
Closed

Regression in DataFrame.set_index with class instance column keys #24969

wkschwartz opened this issue Jan 27, 2019 · 22 comments · Fixed by #24984
Labels
Regression Functionality that used to work in a prior pandas version
Milestone

Comments

@wkschwartz
Copy link

wkschwartz commented Jan 27, 2019

The following code worked in Pandas 0.23.4 but not in Pandas 0.24.0 (I'm on Python 3.7.2).

import pandas as pd

class Thing:
    # (Production code would also ensure a Thing instance's hash
    # and equality testing depended on name and color)

    def __init__(self, name, color):
        self.name = name
        self.color = color
    
    def __str__(self):
        return "<Thing %r>" % (self.name,)

thing1 = Thing('One', 'red')
thing2 = Thing('Two', 'blue')
df = pd.DataFrame({thing1: [0, 1], thing2: [2, 3]})
df.set_index([thing2])

In Pandas 0.23.4, I get the following correct result:

               <Thing 'One'>
<Thing 'Two'>               
2                          0
3                          1

In Pandas 0.24.0, I get the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../venv/lib/python3.7/site-packages/pandas/core/frame.py", line 4153, in set_index
    raise ValueError(err_msg)
ValueError: The parameter "keys" may be a column key, one-dimensional array, or a list containing only valid column keys and one-dimensional arrays.

After looking at Pandas 0.24.0's implementation of DataFrame.set_index:

pandas/pandas/core/frame.py

Lines 4144 to 4153 in 83eb242

for col in keys:
if (is_scalar(col) or isinstance(col, tuple)):
# if col is a valid column key, everything is fine
# tuples are always considered keys, never as list-likes
if col not in self:
missing.append(col)
elif (not isinstance(col, (ABCIndexClass, ABCSeries,
np.ndarray, list))
or getattr(col, 'ndim', 1) > 1):
raise ValueError(err_msg)

I noticed that is_scalar returns False for thing1 in Pandas 0.24.0:

>>> from pandas.core.dtypes.common import is_scalar
>>> is_scalar(thing1)
False

I suspect that it is incorrect to test DataFrame column keys using is_scalar.

Output of pd.show_versions()

pd.show_versions() from Pandas 0.23.4

INSTALLED VERSIONS

commit: None
python: 3.7.2.final.0
python-bits: 64
OS: Darwin
OS-release: 17.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: None
pip: 18.1
setuptools: 40.4.3
Cython: None
numpy: 1.16.0
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 7.2.0
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: 1.1.2
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

pd.show_versions() from Pandas 0.24.0

INSTALLED VERSIONS

commit: None
python: 3.7.2.final.0
python-bits: 64
OS: Darwin
OS-release: 17.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.0
pytest: None
pip: 18.1
setuptools: 40.4.3
Cython: None
numpy: 1.16.0
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 7.2.0
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: 1.1.2
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

@jorisvandenbossche
Copy link
Member

We have had quite some discussion lately about set_index (see eg #24046), and the actual change (that started to use is_scalar, I think) that caused this regression is #22486 and #24762

In general the usage is_scalar gives problems with custom objects. Eg we also fixed this in fillna (#20411).

cc @h-vetinari

@jorisvandenbossche jorisvandenbossche added the Regression Functionality that used to work in a prior pandas version label Jan 28, 2019
@jorisvandenbossche jorisvandenbossche added this to the 0.24.1 milestone Jan 28, 2019
@h-vetinari
Copy link
Contributor

@jorisvandenbossche

Does pandas support custom objects as labels? I think that's bound to break in many places. The code previously tried everything it got as a key, so in this sense this is a regression, yes.

I'm a bit stumped as for how to deal with this. Column keys should IMO clearly be scalar (or tuples, grudgingly) - and that's the only reason is_scalar is there. CC @jreback

@wkschwartz

Your object (at least the toy version) looks a bit like a tuple. As an immediate workaround, I'd suggest to try inheriting Thing from tuple, then the isinstance(..., tuple)-side should work at least.

@wkschwartz
Copy link
Author

In my production code, I use dataclasses as custom objects in both column keys and row indices, which worked throughout Pandas 0.23.4. If Pandas 0.24 or later drop support for custom classes in row/column indices, I would be stuck at 0.23.4 forever. This is why I view the change as a regression.

@TomAugspurger
Copy link
Contributor

Does pandas support custom objects as labels?

We didn't disallow it previously, so yes.

@jreback
Copy link
Contributor

jreback commented Jan 28, 2019

This may have happened to work, but we don't support custom objects as labels explicity. Not against reverting this, but its buyer beware here.

@wkschwartz
Copy link
Author

I never could find anything in the documentation that takes a stance on what can or can’t be column keys, except the general notion that DataFrames are dict-like. From this I surmised that column keys should be hashable and immutable. Did I miss something in the documentation?

@h-vetinari
Copy link
Contributor

I've added a fix for the regression in #24984, with the caveat that I think that this should be immediately deprecated. For a method that's supposed to intelligently process column labels, arrays and list of labels/arrays, it's opening Pandora's box to try to support custom types. There's no end to the (hashable) horrors that someone might implement (see below).

Allowing dataclasses, as in the usecase of @wkschwartz is a separate consideration, but for much the same reasons, I'd suggest to deprecate even tuples in labels: #24688

See also #24702, which also ties into the following:

what can or can’t be column keys

This should be clearly defined, documented, and enforced (same as list-like).


About the stuff that pandas already "supports", how about lists-as-labels...?

>>> from pyfrozen import FrozenList
>>> class FroList(FrozenList):
...     def __init__(self, values):
...         super().__init__()
...         self.extend(values)
...         self.freeze()
...     def __hash__(self):
...         return hash(tuple(self))
...
>>> pd.DataFrame({FroList([1, 2, 3]): ['a', 'b', 'c']})
  [1, 2, 3]
0         a
1         b
2         c

@wkschwartz
Copy link
Author

Obviously I would prefer no deprecation of custom label types as they are integral to my company’s applications. However, I would urge strongly that if you do decide to deprecate the feature, you do so starting only in the next major release (presumably 0.25.0) rather than in a minor release (0.24.1).

If you do stop supporting custom label types and I am not to be stuck at Pandas 0.23.4 forever, I could theoretically undertake the (expensive) refactoring to use unique IDs (my production code has the equivalent of the name field from my toy example in the OP). However, other users whose code would break might not have convenient unique IDs to switch to. Please do not remove this feature lightly.

@toobaz
Copy link
Member

toobaz commented Jan 28, 2019

About the stuff that pandas already "supports", how about lists-as-labels...?

FroList is just a mistake, because it has a __hash__ which changes during its lifetime. If a user does stuff like this, he's looking for trouble: fair enough, not our fault.

@WillAyd
Copy link
Member

WillAyd commented Jan 28, 2019

Hmm well this conversation is somewhat split between this and the PR but I would be -1 on supporting something like this. I understand that it may have worked previously and wasn't explicitly disallowed, but I don't think that means we have to support it going forward, when I would think there are already an infinite other more idiomatic ways of going about this.

Even if fixed in this particular instance trying to guarantee support for something like this as an index item is going to cause a lot of unnecessary complexity in other parts of the code base, I would think particularly on subsequent indexing operations

@h-vetinari
Copy link
Contributor

It does not change, that's what the .freeze is for...

It would work. Not that it should, of course, much less that it should be supported.

>>> ell = FroList([1, 2, 3])
>>> ell[1] = 'a'
Traceback (most recent call last):
[...]
RuntimeError: Cannot modify frozen list

@toobaz
Copy link
Member

toobaz commented Jan 28, 2019

It does not change, that's what the .freeze is for...

I meant: it can contain a proper list. But you're right this is irrelevant (tuples can too). And indeed it works, and that's great. Users will need to remember only a simple rule: "hashable -> works". By the way: it's a rule they already know from dicts.

@wkschwartz
Copy link
Author

I understand that it may have worked previously and wasn't explicitly disallowed, but I don't think that means we have to support it going forward

If your definition of backward compatibility means maintaining the functionality of old Pandas-based code, then you shouldn't care about why folks used Pandas as they did, only that it worked before and should continue to work as long as the major version of Pandas doesn't change. This is the way Go handles compatibility. In that case, whether my use case is Pythonic or elegant shouldn't matter.

If your goals for the backward compatibility of Pandas only encompases the examples in the documentation and the tests, then I refer you to my next point.

when I would think there are already an infinite other more idiomatic ways of going about this

Using custom classes as column headers in my application has been great because there's a bunch of extra data that is automatically carried around as I make (modified) copies of DataFrames. For my use case, custom classes have been elegant and, IMO, very Pythonic.

@h-vetinari
Copy link
Contributor

@wkschwartz

This is the way Go handles compatibility.

Go is statically typed, and that gives a very different baseline from which to make such a promise, compared to python.

Pandas API is under massive development (see the whole ExtensionArray effort), and still has far too many nooks and crannies (whether due to legacy reasons or simple oversight), such that promising backwards compatibility for "anything that works" is suicide from a maintenance perspective. That being said, there's a substantial amount of effort that things are being deprecated gently whenever possible.

@WillAyd
Copy link
Member

WillAyd commented Jan 29, 2019

Users will need to remember only a simple rule: "hashable -> works".

Just giving this some more thought today...is this all we should require? There are a lot of index operations that also require the concept of sortability and that manifests itself in quite a few of the reshaping ops. I feel like hashable alone would only work with a subset of the API that would require intense care to work around.

Using custom classes as column headers in my application has been great because there's a bunch of extra data that is automatically carried around as I make (modified) copies of DataFrames

Sure but what parts of the DataFrame API are you subsequently using? Outside of masking operations I'm kind of curious what the advantage of the DataFrame is versus just a dict, given IO operations with the former are going to be lossy and as mentioned I would think you'd have to be very careful which parts of the API you ultimately navigate.

@wkschwartz
Copy link
Author

hashable alone would only work with a subset of the API

Maybe categorical dtypes are useful analogy here. Some categoricals are ordered and thus sortable, others are not. Which operations you can do on a categorical column depends on whether the dtype is ordered.

Perhaps there's a hiererarchy for keys:

  1. All column keys must have __hash__ (and maybe __eq__?)
  2. Sorting/reshaping operations require __lt__ and __eq__. (You could insist that the class have all six rich comparison methods if that makes Pandas code simpler.)

Pandas can then assume compatibility among __hash__, __lt__ (if present), and __eq__ (if required?).

what parts of the DataFrame API are you subsequently using?

I make substantial use of row and column indexing (both boolean masking and by keys), row and column sums, merging on row indices, broadcasting comparisons (e.g., (df[list_of_my_custom_objects] < 0).any()), transposes, and, most importantly, matrix multiplication (e.g., df[list_of_my_custom_objects] @ some_vector). I make some use of groupby. At the boundaries of my API and in my CLI, I use some IO features, and some of the guarantees about dtypes.

Most of those operations are way harder with lists of dicts, which I know because I've written this application both ways.

@toobaz
Copy link
Member

toobaz commented Jan 30, 2019

Users will need to remember only a simple rule: "hashable -> works".

Just giving this some more thought today...is this all we should require? There are a lot of index operations that also require the concept of sortability

Mixed type indexes are non-sortable since Python 3: to my eyes the problem is analogous, and not particularly worrisome. For my experience, there are some reshaping ops in which we try to sort, and fallback to not sorting, and this is (I think) OK.

@toobaz toobaz closed this as completed Jan 30, 2019
@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Feb 1, 2019

@toobaz is there a reason you closed this? As I don't think the issue is resolved (or the typical wrong button mistake; github UI should put them farther apart :-))

@toobaz
Copy link
Member

toobaz commented Feb 1, 2019

@toobaz is there a reason you closed this?

Lack of sleep probably ;-) Thanks for reopening.

@jorisvandenbossche
Copy link
Member

There has been some interesting discussion, here and in the PR, on the general topic of "should we support custom objects in Index" (or, how do we define and specify what we support in Index).

And I think it is a discussion we should continue (should we open a new issue for that?); but, we also need to decide what to do on the short term.

Do we agree that this was an unintended regression that we would like to fix for 0.24.x, awaiting more general discussion later on? (-> which means working towards merging #24984)

Tom asked the same question on the PR a few days ago (#24984 (comment)), and there was not really reaction on it, so I somewhat assumed that it was the case (but I also agree with it). But @WillAyd mentioned on gitter that he has his doubts on the need to fix it.
(the discussion on the PR since then is only about the technical details of how to allow it again)

@jreback jreback closed this as completed Feb 1, 2019
@jreback jreback reopened this Feb 1, 2019
@jreback jreback removed this from the 0.24.1 milestone Feb 1, 2019
@jreback jreback added this to the 0.24.2 milestone Feb 1, 2019
@jreback jreback modified the milestones: 0.24.2, 0.25.0 Feb 19, 2019
@h-vetinari
Copy link
Contributor

h-vetinari commented Feb 24, 2019

@jreback
As I mentioned a few times in #24894, this was actually fixed for 0.24.1 by #25085.

@h-vetinari: The custom classes were re-enabled by #25085 (which took over the tests from this PR [#24894]), which closed the regression #24969, and has a corresponding whatsnew note. I guess the issue didn't get closed yet, because I only noted that #25085 was an alternative to this PR [#24894] (at the time, for solving #24969), but didn't add the "closes #24969" explicitly - sorry.

@jorisvandenbossche jorisvandenbossche modified the milestones: 0.25.0, 0.24.1 Feb 25, 2019
@jorisvandenbossche
Copy link
Member

@h-vetinari indeed, changed the milestone accordingly

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Regression Functionality that used to work in a prior pandas version
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants