Regression in DataFrame.set_index with class instance column keys #24969

wkschwartz · 2019-01-27T22:57:50Z

The following code worked in Pandas 0.23.4 but not in Pandas 0.24.0 (I'm on Python 3.7.2).

import pandas as pd

class Thing:
    # (Production code would also ensure a Thing instance's hash
    # and equality testing depended on name and color)

    def __init__(self, name, color):
        self.name = name
        self.color = color
    
    def __str__(self):
        return "<Thing %r>" % (self.name,)

thing1 = Thing('One', 'red')
thing2 = Thing('Two', 'blue')
df = pd.DataFrame({thing1: [0, 1], thing2: [2, 3]})
df.set_index([thing2])

In Pandas 0.23.4, I get the following correct result:

               <Thing 'One'>
<Thing 'Two'>               
2                          0
3                          1

In Pandas 0.24.0, I get the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../venv/lib/python3.7/site-packages/pandas/core/frame.py", line 4153, in set_index
    raise ValueError(err_msg)
ValueError: The parameter "keys" may be a column key, one-dimensional array, or a list containing only valid column keys and one-dimensional arrays.

After looking at Pandas 0.24.0's implementation of DataFrame.set_index:

pandas/pandas/core/frame.py

Lines 4144 to 4153 in 83eb242

    
           for col in keys: 
        
               if (is_scalar(col) or isinstance(col, tuple)): 
        
                   # if col is a valid column key, everything is fine 
        
                   # tuples are always considered keys, never as list-likes 
        
                   if col not in self: 
        
                       missing.append(col) 
        
               elif (not isinstance(col, (ABCIndexClass, ABCSeries, 
        
                                          np.ndarray, list)) 
        
                     or getattr(col, 'ndim', 1) > 1): 
        
                   raise ValueError(err_msg)

I noticed that is_scalar returns False for thing1 in Pandas 0.24.0:

>>> from pandas.core.dtypes.common import is_scalar
>>> is_scalar(thing1)
False

I suspect that it is incorrect to test DataFrame column keys using is_scalar.

Output of `pd.show_versions()`

`pd.show_versions()` from Pandas 0.23.4

INSTALLED VERSIONS

commit: None
python: 3.7.2.final.0
python-bits: 64
OS: Darwin
OS-release: 17.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: None
pip: 18.1
setuptools: 40.4.3
Cython: None
numpy: 1.16.0
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 7.2.0
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: 1.1.2
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

`pd.show_versions()` from Pandas 0.24.0

INSTALLED VERSIONS

commit: None
python: 3.7.2.final.0
python-bits: 64
OS: Darwin
OS-release: 17.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.0
pytest: None
pip: 18.1
setuptools: 40.4.3
Cython: None
numpy: 1.16.0
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 7.2.0
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: 1.1.2
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

The text was updated successfully, but these errors were encountered:

jorisvandenbossche · 2019-01-28T07:17:27Z

We have had quite some discussion lately about set_index (see eg #24046), and the actual change (that started to use is_scalar, I think) that caused this regression is #22486 and #24762

In general the usage is_scalar gives problems with custom objects. Eg we also fixed this in fillna (#20411).

cc @h-vetinari

h-vetinari · 2019-01-28T07:35:22Z

@jorisvandenbossche

Does pandas support custom objects as labels? I think that's bound to break in many places. The code previously tried everything it got as a key, so in this sense this is a regression, yes.

I'm a bit stumped as for how to deal with this. Column keys should IMO clearly be scalar (or tuples, grudgingly) - and that's the only reason is_scalar is there. CC @jreback

@wkschwartz

Your object (at least the toy version) looks a bit like a tuple. As an immediate workaround, I'd suggest to try inheriting Thing from tuple, then the isinstance(..., tuple)-side should work at least.

wkschwartz · 2019-01-28T13:58:40Z

In my production code, I use dataclasses as custom objects in both column keys and row indices, which worked throughout Pandas 0.23.4. If Pandas 0.24 or later drop support for custom classes in row/column indices, I would be stuck at 0.23.4 forever. This is why I view the change as a regression.

TomAugspurger · 2019-01-28T14:03:43Z

Does pandas support custom objects as labels?

We didn't disallow it previously, so yes.

jreback · 2019-01-28T14:06:46Z

This may have happened to work, but we don't support custom objects as labels explicity. Not against reverting this, but its buyer beware here.

wkschwartz · 2019-01-28T14:16:55Z

I never could find anything in the documentation that takes a stance on what can or can’t be column keys, except the general notion that DataFrames are dict-like. From this I surmised that column keys should be hashable and immutable. Did I miss something in the documentation?

h-vetinari · 2019-01-28T18:13:27Z

I've added a fix for the regression in #24984, with the caveat that I think that this should be immediately deprecated. For a method that's supposed to intelligently process column labels, arrays and list of labels/arrays, it's opening Pandora's box to try to support custom types. There's no end to the (hashable) horrors that someone might implement (see below).

Allowing dataclasses, as in the usecase of @wkschwartz is a separate consideration, but for much the same reasons, I'd suggest to deprecate even tuples in labels: #24688

See also #24702, which also ties into the following:

what can or can’t be column keys

This should be clearly defined, documented, and enforced (same as list-like).

About the stuff that pandas already "supports", how about lists-as-labels...?

>>> from pyfrozen import FrozenList
>>> class FroList(FrozenList):
...     def __init__(self, values):
...         super().__init__()
...         self.extend(values)
...         self.freeze()
...     def __hash__(self):
...         return hash(tuple(self))
...
>>> pd.DataFrame({FroList([1, 2, 3]): ['a', 'b', 'c']})
  [1, 2, 3]
0         a
1         b
2         c

wkschwartz · 2019-01-28T18:27:54Z

Obviously I would prefer no deprecation of custom label types as they are integral to my company’s applications. However, I would urge strongly that if you do decide to deprecate the feature, you do so starting only in the next major release (presumably 0.25.0) rather than in a minor release (0.24.1).

If you do stop supporting custom label types and I am not to be stuck at Pandas 0.23.4 forever, I could theoretically undertake the (expensive) refactoring to use unique IDs (my production code has the equivalent of the name field from my toy example in the OP). However, other users whose code would break might not have convenient unique IDs to switch to. Please do not remove this feature lightly.

toobaz · 2019-01-28T23:25:42Z

About the stuff that pandas already "supports", how about lists-as-labels...?

FroList is just a mistake, because it has a __hash__ which changes during its lifetime. If a user does stuff like this, he's looking for trouble: fair enough, not our fault.

WillAyd · 2019-01-28T23:38:32Z

Hmm well this conversation is somewhat split between this and the PR but I would be -1 on supporting something like this. I understand that it may have worked previously and wasn't explicitly disallowed, but I don't think that means we have to support it going forward, when I would think there are already an infinite other more idiomatic ways of going about this.

Even if fixed in this particular instance trying to guarantee support for something like this as an index item is going to cause a lot of unnecessary complexity in other parts of the code base, I would think particularly on subsequent indexing operations

h-vetinari · 2019-01-28T23:39:23Z

It does not change, that's what the .freeze is for...

It would work. Not that it should, of course, much less that it should be supported.

>>> ell = FroList([1, 2, 3])
>>> ell[1] = 'a'
Traceback (most recent call last):
[...]
RuntimeError: Cannot modify frozen list

toobaz · 2019-01-28T23:55:40Z

It does not change, that's what the .freeze is for...

I meant: it can contain a proper list. But you're right this is irrelevant (tuples can too). And indeed it works, and that's great. Users will need to remember only a simple rule: "hashable -> works". By the way: it's a rule they already know from dicts.

wkschwartz · 2019-01-29T00:05:59Z

I understand that it may have worked previously and wasn't explicitly disallowed, but I don't think that means we have to support it going forward

If your definition of backward compatibility means maintaining the functionality of old Pandas-based code, then you shouldn't care about why folks used Pandas as they did, only that it worked before and should continue to work as long as the major version of Pandas doesn't change. This is the way Go handles compatibility. In that case, whether my use case is Pythonic or elegant shouldn't matter.

If your goals for the backward compatibility of Pandas only encompases the examples in the documentation and the tests, then I refer you to my next point.

when I would think there are already an infinite other more idiomatic ways of going about this

Using custom classes as column headers in my application has been great because there's a bunch of extra data that is automatically carried around as I make (modified) copies of DataFrames. For my use case, custom classes have been elegant and, IMO, very Pythonic.

h-vetinari · 2019-01-29T07:26:57Z

@wkschwartz

This is the way Go handles compatibility.

Go is statically typed, and that gives a very different baseline from which to make such a promise, compared to python.

Pandas API is under massive development (see the whole ExtensionArray effort), and still has far too many nooks and crannies (whether due to legacy reasons or simple oversight), such that promising backwards compatibility for "anything that works" is suicide from a maintenance perspective. That being said, there's a substantial amount of effort that things are being deprecated gently whenever possible.

WillAyd · 2019-01-29T23:36:13Z

Users will need to remember only a simple rule: "hashable -> works".

Just giving this some more thought today...is this all we should require? There are a lot of index operations that also require the concept of sortability and that manifests itself in quite a few of the reshaping ops. I feel like hashable alone would only work with a subset of the API that would require intense care to work around.

Using custom classes as column headers in my application has been great because there's a bunch of extra data that is automatically carried around as I make (modified) copies of DataFrames

Sure but what parts of the DataFrame API are you subsequently using? Outside of masking operations I'm kind of curious what the advantage of the DataFrame is versus just a dict, given IO operations with the former are going to be lossy and as mentioned I would think you'd have to be very careful which parts of the API you ultimately navigate.

wkschwartz · 2019-01-30T01:34:43Z

hashable alone would only work with a subset of the API

Maybe categorical dtypes are useful analogy here. Some categoricals are ordered and thus sortable, others are not. Which operations you can do on a categorical column depends on whether the dtype is ordered.

Perhaps there's a hiererarchy for keys:

All column keys must have __hash__ (and maybe __eq__?)
Sorting/reshaping operations require __lt__ and __eq__. (You could insist that the class have all six rich comparison methods if that makes Pandas code simpler.)

Pandas can then assume compatibility among __hash__, __lt__ (if present), and __eq__ (if required?).

what parts of the DataFrame API are you subsequently using?

I make substantial use of row and column indexing (both boolean masking and by keys), row and column sums, merging on row indices, broadcasting comparisons (e.g., (df[list_of_my_custom_objects] < 0).any()), transposes, and, most importantly, matrix multiplication (e.g., df[list_of_my_custom_objects] @ some_vector). I make some use of groupby. At the boundaries of my API and in my CLI, I use some IO features, and some of the guarantees about dtypes.

Most of those operations are way harder with lists of dicts, which I know because I've written this application both ways.

toobaz · 2019-01-30T07:49:43Z

Users will need to remember only a simple rule: "hashable -> works".

Just giving this some more thought today...is this all we should require? There are a lot of index operations that also require the concept of sortability

Mixed type indexes are non-sortable since Python 3: to my eyes the problem is analogous, and not particularly worrisome. For my experience, there are some reshaping ops in which we try to sort, and fallback to not sorting, and this is (I think) OK.

jorisvandenbossche · 2019-02-01T08:01:04Z

@toobaz is there a reason you closed this? As I don't think the issue is resolved (or the typical wrong button mistake; github UI should put them farther apart :-))

toobaz · 2019-02-01T09:48:58Z

@toobaz is there a reason you closed this?

Lack of sleep probably ;-) Thanks for reopening.

jorisvandenbossche · 2019-02-01T11:24:33Z

There has been some interesting discussion, here and in the PR, on the general topic of "should we support custom objects in Index" (or, how do we define and specify what we support in Index).

And I think it is a discussion we should continue (should we open a new issue for that?); but, we also need to decide what to do on the short term.

Do we agree that this was an unintended regression that we would like to fix for 0.24.x, awaiting more general discussion later on? (-> which means working towards merging #24984)

Tom asked the same question on the PR a few days ago (#24984 (comment)), and there was not really reaction on it, so I somewhat assumed that it was the case (but I also agree with it). But @WillAyd mentioned on gitter that he has his doubts on the need to fix it.
(the discussion on the PR since then is only about the technical details of how to allow it again)

h-vetinari · 2019-02-24T10:10:33Z

@jreback
As I mentioned a few times in #24894, this was actually fixed for 0.24.1 by #25085.

@h-vetinari: The custom classes were re-enabled by #25085 (which took over the tests from this PR [#24894]), which closed the regression #24969, and has a corresponding whatsnew note. I guess the issue didn't get closed yet, because I only noted that #25085 was an alternative to this PR [#24894] (at the time, for solving #24969), but didn't add the "closes #24969" explicitly - sorry.

jorisvandenbossche · 2019-02-25T17:28:09Z

@h-vetinari indeed, changed the milestone accordingly

jorisvandenbossche added the Regression Functionality that used to work in a prior pandas version label Jan 28, 2019

jorisvandenbossche added this to the 0.24.1 milestone Jan 28, 2019

h-vetinari mentioned this issue Jan 28, 2019

API/ERR: allow iterators in df.set_index & improve errors #24984

Merged

3 tasks

h-vetinari mentioned this issue Jan 30, 2019

Towards "pandas 1.0" #10000

Closed

toobaz closed this as completed Jan 30, 2019

jorisvandenbossche reopened this Feb 1, 2019

jreback closed this as completed Feb 1, 2019

jreback reopened this Feb 1, 2019

TomAugspurger mentioned this issue Feb 1, 2019

RLS: 0.24.x #24949

Closed

jreback removed this from the 0.24.1 milestone Feb 1, 2019

jreback added this to the 0.24.2 milestone Feb 1, 2019

h-vetinari mentioned this issue Feb 3, 2019

CLN: Use ABCs in set_index #25128

Merged

jreback modified the milestones: 0.24.2, 0.25.0 Feb 19, 2019

jreback closed this as completed in #24984 Feb 24, 2019

jorisvandenbossche modified the milestones: 0.25.0, 0.24.1 Feb 25, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Regression in DataFrame.set_index with class instance column keys #24969

Regression in DataFrame.set_index with class instance column keys #24969

wkschwartz commented Jan 27, 2019 •

edited

Loading

`pd.show_versions()` from Pandas 0.23.4

INSTALLED VERSIONS

`pd.show_versions()` from Pandas 0.24.0

INSTALLED VERSIONS

jorisvandenbossche commented Jan 28, 2019

h-vetinari commented Jan 28, 2019

wkschwartz commented Jan 28, 2019

TomAugspurger commented Jan 28, 2019

jreback commented Jan 28, 2019

wkschwartz commented Jan 28, 2019

h-vetinari commented Jan 28, 2019

wkschwartz commented Jan 28, 2019

toobaz commented Jan 28, 2019 •

edited

Loading

WillAyd commented Jan 28, 2019

h-vetinari commented Jan 28, 2019

toobaz commented Jan 28, 2019

wkschwartz commented Jan 29, 2019

h-vetinari commented Jan 29, 2019

WillAyd commented Jan 29, 2019

wkschwartz commented Jan 30, 2019

toobaz commented Jan 30, 2019 •

edited

Loading

jorisvandenbossche commented Feb 1, 2019 •

edited

Loading

toobaz commented Feb 1, 2019

jorisvandenbossche commented Feb 1, 2019

h-vetinari commented Feb 24, 2019 •

edited

Loading

jorisvandenbossche commented Feb 25, 2019

Regression in DataFrame.set_index with class instance column keys #24969

Regression in DataFrame.set_index with class instance column keys #24969

Comments

wkschwartz commented Jan 27, 2019 • edited Loading

Output of pd.show_versions()

pd.show_versions() from Pandas 0.23.4

INSTALLED VERSIONS

pd.show_versions() from Pandas 0.24.0

INSTALLED VERSIONS

jorisvandenbossche commented Jan 28, 2019

h-vetinari commented Jan 28, 2019

wkschwartz commented Jan 28, 2019

TomAugspurger commented Jan 28, 2019

jreback commented Jan 28, 2019

wkschwartz commented Jan 28, 2019

h-vetinari commented Jan 28, 2019

wkschwartz commented Jan 28, 2019

toobaz commented Jan 28, 2019 • edited Loading

WillAyd commented Jan 28, 2019

h-vetinari commented Jan 28, 2019

toobaz commented Jan 28, 2019

wkschwartz commented Jan 29, 2019

h-vetinari commented Jan 29, 2019

WillAyd commented Jan 29, 2019

wkschwartz commented Jan 30, 2019

toobaz commented Jan 30, 2019 • edited Loading

jorisvandenbossche commented Feb 1, 2019 • edited Loading

toobaz commented Feb 1, 2019

jorisvandenbossche commented Feb 1, 2019

h-vetinari commented Feb 24, 2019 • edited Loading

jorisvandenbossche commented Feb 25, 2019

wkschwartz commented Jan 27, 2019 •

edited

Loading

Output of `pd.show_versions()`

`pd.show_versions()` from Pandas 0.23.4

`pd.show_versions()` from Pandas 0.24.0

toobaz commented Jan 28, 2019 •

edited

Loading

toobaz commented Jan 30, 2019 •

edited

Loading

jorisvandenbossche commented Feb 1, 2019 •

edited

Loading

h-vetinari commented Feb 24, 2019 •

edited

Loading