Strange groupby apply behaviour #31111

Closed
endremborza opened this issue Jan 17, 2020 · 9 comments · Fixed by #34897
Labels: good first issue, Groupby, Needs Tests (Unit test(s) needed to prevent regressions)


@endremborza
Contributor

Minimal reproducible example

import pandas as pd
import numpy as np

tdf = pd.DataFrame({"tree": [0, 0, 0, 0, 1, 1, 1, 1],
                    "into": ["0-2", "0-4", "0-10", np.nan, "1-2", "1-3", "1-7", np.nan],
                    "leaf_value": [0,0,0,3,0,0,0,4]},
                   index=["0-0", "0-2", "0-4", "0-10", "1-0", "1-1", "1-2", "1-7"])

def deduce_tree(df):
    print("DEDUCING TREE WITH INDEX:\n",df.index)
    next_id = df.index[0]
    while isinstance(next_id, str):
        print(next_id)
        next_node = df.loc[next_id, :]
        next_id = df.loc[next_id, "into"]
    print("RETURNING:\n", next_node)
    return next_node

tdf.groupby("tree").apply(deduce_tree)

DEDUCING TREE WITH INDEX:
Index(['0-0', '0-2', '0-4', '0-10'], dtype='object')
0-0
0-2
0-4
0-10
RETURNING:
tree 0
into NaN
leaf_value 3
Name: 0-10, dtype: object
DEDUCING TREE WITH INDEX:
Index(['1-0', '1-1', '1-2', '1-7'], dtype='object')
1-0
DEDUCING TREE WITH INDEX:
Index(['0-0', '0-2', '0-4', '0-10'], dtype='object')
0-0
0-2
0-4
0-10
RETURNING:
tree 0
into NaN
leaf_value 3
Name: 0-10, dtype: object
DEDUCING TREE WITH INDEX:
Index(['1-0', '1-1', '1-2', '1-7'], dtype='object')
1-0
1-2
1-7
RETURNING:
tree 1
into NaN
leaf_value 4
Name: 1-7, dtype: object

Problem description

This came up while analyzing the internals of a boosted tree model. With the print statements in place, the output shows that when the applied function reaches the line next_node = df.loc[next_id, :] for group 1, deduce_tree is called again from the start with group 0.
It prints DEDUCING TREE WITH INDEX 3 times as opposed to 2, and for some reason the function is interrupted.

The result of tdf.groupby("tree").apply(deduce_tree) is correct, but it seems to do unnecessary work, and any side effects I add to deduce_tree get messed up.

Can anyone explain why it works like this? Is this a bug? How can a .loc interrupt a function?
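
To make the extra calls easier to see without reading through the printed output, here is a minimal sketch that records every invocation instead of printing. The names calls and deduce_tree_recording are hypothetical helpers introduced only for this illustration, and the explicit column selection after groupby is an assumption made so the snippet behaves the same across pandas versions; neither is part of the original report.

```python
import numpy as np
import pandas as pd

tdf = pd.DataFrame(
    {
        "tree": [0, 0, 0, 0, 1, 1, 1, 1],
        "into": ["0-2", "0-4", "0-10", np.nan, "1-2", "1-3", "1-7", np.nan],
        "leaf_value": [0, 0, 0, 3, 0, 0, 0, 4],
    },
    index=["0-0", "0-2", "0-4", "0-10", "1-0", "1-1", "1-2", "1-7"],
)

# Hypothetical helper: stores the first index label of every group the
# function is called with, so repeated calls show up as repeated entries.
calls = []


def deduce_tree_recording(df):
    calls.append(df.index[0])
    next_id = df.index[0]
    # Follow the "into" pointers until a missing value (NaN) is reached.
    while isinstance(next_id, str):
        next_node = df.loc[next_id, :]
        next_id = df.loc[next_id, "into"]
    return next_node


# Selecting the columns explicitly is an assumption made for portability.
result = tdf.groupby("tree")[["into", "leaf_value"]].apply(deduce_tree_recording)

n_groups = tdf["tree"].nunique()
print(f"{len(calls)} call(s) for {n_groups} group(s): {calls}")
print(result)
```

If len(calls) is larger than n_groups, the function ran more often than once per group, which is the situation described above. Keeping the applied function free of side effects, or making the side effects safe to repeat, avoids depending on how many times pandas calls it.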

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.3.final.0
python-bits : 64
OS : Linux
OS-release : 5.2.1-1.el7.elrepo.x86_64
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 0.25.3
numpy : 1.18.1
pytz : 2018.4
dateutil : 2.7.3
pip : 19.2.3
setuptools : 41.0.1
Cython : None
pytest : 4.1.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : 1.0.1
pymysql : None
psycopg2 : 2.8.1 (dt dec pq3 ext lo64)
jinja2 : 2.10
IPython : 7.8.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.1.2
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.2.8
tables : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None

@MarcoGorelli
Member

Thanks for your report.

it prints out DEDUCING TREE WITH INDEX 3 times as opposed to 2

It gets printed 4 times when I run it on master - could you please include your output?

@endremborza
Contributor Author

endremborza commented Jan 19, 2020

You can see my output under the first details tab. It turns out that's also 4 times; I just misread it for some reason. Still, the behavior is strange.

@MarcoGorelli
Member

Thanks, will look into it

@alonme
Contributor

alonme commented May 18, 2020

This seems to have been somehow fixed between 1.0.3 and master (currently - 239b6a7)

@mroeschke mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed Apply Apply, Aggregate, Transform, Map Groupby labels May 18, 2020
@nthiagar

Hello,
I've tested the code on version 1.0.1, and DEDUCING TREE WITH INDEX was printed 4 times. Since I am a new contributor to pandas, I am still unsure what the masters version is. Can someone let me know what it is?
Thank you

@MarcoGorelli
Member

Hi @nthiagar

I am still unsure what the masters version is

It's the master branch of the git repo. It's what you'll get if you build the project from source, rather than, say, with pip install pandas.
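
As a quick way to tell which kind of install is active, the version string can be inspected. This is only a rough sketch: looks_like_dev_build is a hypothetical helper, and the idea that a source build's version string contains a "dev" or "+" marker is a heuristic assumption, not a documented guarantee.

```python
import pandas as pd


def looks_like_dev_build(version: str) -> bool:
    """Heuristic (an assumption): source builds usually carry a 'dev' or '+' marker."""
    return "dev" in version or "+" in version


# pd.__version__ is the version string of the pandas that is actually imported.
print(pd.__version__, looks_like_dev_build(pd.__version__))
```

A plain release number such as 1.0.3 would print False here, which suggests a released package rather than a build of the master branch.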

@MarcoGorelli
Member

This seems to have been somehow fixed between 1.0.3 and master (currently - 239b6a7)

If I've done git bisect correctly, it was fixed in fa48f5f (#32611)

@Rohith295
Contributor

take

@jreback
Contributor

jreback commented Jun 20, 2020

this is already fixed on master
