BUG: dataframe.apply() loops on first row when applied method attempts to modify the row #35462

Closed
nicolasrozain opened this issue Jul 29, 2020 · 5 comments · Fixed by #35633
Labels
Apply Apply, Aggregate, Transform, Map Bug Regression Functionality that used to work in a prior pandas version

@nicolasrozain

With pandas 1.1.0 on Python 3.6.8:

>>> import pandas as pd
>>> df = pd.DataFrame({'a': list(range(0,100)), 'b': list(range(100,200))})
>>> def func(row):
...     row.loc['a'] += 1
...     return row
... 
>>> df
     a    b
0    0  100
1    1  101
2    2  102
3    3  103
4    4  104
..  ..  ...
95  95  195
96  96  196
97  97  197
98  98  198
99  99  199
[100 rows x 2 columns]
>>> df.apply(func, axis=1)
      a    b
0   100  100
1   100  100
2   100  100
3   100  100
4   100  100
..  ...  ...
95  100  100
96  100  100
97  100  100
98  100  100
99  100  100
[100 rows x 2 columns]
>>> df
      a    b
0   100  100
1     1  101
2     2  102
3     3  103
4     4  104
..  ...  ...
95   95  195
96   96  196
97   97  197
98   98  198
99   99  199
[100 rows x 2 columns]
>>> pd.show_versions()
INSTALLED VERSIONS
------------------
commit           : d9fff2792bf16178d4e450fe7384244e50635733
python           : 3.6.8.final.0
python-bits      : 64
OS               : Windows
OS-release       : 10
Version          : 10.0.17763
machine          : AMD64
processor        : Intel64 Family 6 Model 142 Stepping 10, GenuineIntel
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : None.None
pandas           : 1.1.0
numpy            : 1.18.2
pytz             : 2019.3
dateutil         : 2.8.1
pip              : 20.0.2
setuptools       : 42.0.2
Cython           : None
pytest           : 5.4.1
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : 4.5.0
html5lib         : 1.0.1
pymysql          : None
psycopg2         : None
jinja2           : None
IPython          : None
pandas_datareader: None
bs4              : 4.9.1
bottleneck       : 1.3.2
fsspec           : None
fastparquet      : None
gcsfs            : None
matplotlib       : 3.2.1
numexpr          : 2.7.1
odfpy            : None
openpyxl         : 2.5.14
pandas_gbq       : None
pyarrow          : None
pytables         : None
pyxlsb           : None
s3fs             : None
scipy            : 1.3.3
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : 1.2.0
xlwt             : 1.3.0
numba            : None

With pandas 1.0.5:

>>> import pandas as pd
>>> df = pd.DataFrame({'a': list(range(0,100)), 'b': list(range(100,200))})
>>> def func(row):
...     row.loc['a'] += 1
...     return row
... 
>>> df
     a    b
0    0  100
1    1  101
2    2  102
3    3  103
4    4  104
..  ..  ...
95  95  195
96  96  196
97  97  197
98  98  198
99  99  199
[100 rows x 2 columns]
>>> df.apply(func, axis=1)
      a    b
0     1  100
1     2  101
2     3  102
3     4  103
4     5  104
..  ...  ...
95   96  195
96   97  196
97   98  197
98   99  198
99  100  199
[100 rows x 2 columns]
>>> df
      a    b
0     1  100
1     2  101
2     3  102
3     4  103
4     5  104
..  ...  ...
95   96  195
96   97  196
97   98  197
98   99  198
99  100  199
[100 rows x 2 columns]
>>> pd.show_versions()
INSTALLED VERSIONS
------------------
commit           : None
python           : 3.6.8.final.0
python-bits      : 64
OS               : Windows
OS-release       : 10
machine          : AMD64
processor        : Intel64 Family 6 Model 142 Stepping 10, GenuineIntel
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : None.None
pandas           : 1.0.5
numpy            : 1.18.2
pytz             : 2019.3
dateutil         : 2.8.1
pip              : 20.0.2
setuptools       : 42.0.2
Cython           : None
pytest           : 5.4.1
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : 4.5.0
html5lib         : 1.0.1
pymysql          : None
psycopg2         : None
jinja2           : None
IPython          : None
pandas_datareader: None
bs4              : 4.9.1
bottleneck       : 1.3.2
fastparquet      : None
gcsfs            : None
lxml.etree       : 4.5.0
matplotlib       : 3.2.1
numexpr          : 2.7.1
odfpy            : None
openpyxl         : 2.5.14
pandas_gbq       : None
pyarrow          : None
pytables         : None
pytest           : 5.4.1
pyxlsb           : None
s3fs             : None
scipy            : 1.3.3
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : 1.2.0
xlwt             : 1.3.0
xlsxwriter       : None
numba            : None

I expected the 1.0.5 behavior in 1.1.0. Did I misunderstand the apply method?
Thank you for your help.

@nicolasrozain nicolasrozain changed the title BUG: dataframe.apply() loops on first row when apply method attempts to modify the row BUG: dataframe.apply() loops on first row when applied method attempts to modify the row Jul 29, 2020
@simonjayhawkins simonjayhawkins added Apply Apply, Aggregate, Transform, Map Regression Functionality that used to work in a prior pandas version Bug labels Jul 30, 2020
@simonjayhawkins simonjayhawkins added this to the 1.1.1 milestone Jul 30, 2020
@simonjayhawkins
Member

Thanks @nicolasrozain for the report. Is adding row = row.copy() to the function before the mutation a suitable workaround in the short term?
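
A minimal sketch of that workaround, assuming a smaller frame than the one in the report. The name `func_copy` is made up for illustration. Copying the row first means the increment lands on a private Series rather than on whatever object `apply` passes in.

```python
import pandas as pd

df = pd.DataFrame({"a": list(range(0, 3)), "b": list(range(100, 103))})
orig = df.copy()


def func_copy(row):
    # Hypothetical variant of func from the report: work on a copy so the
    # object handed over by apply is never modified.
    row = row.copy()
    row.loc["a"] += 1
    return row


res = df.apply(func_copy, axis=1)
print(res)
```

With the copy in place, the result should hold the incremented values while `df` keeps its original contents.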

@simonjayhawkins
Member

@nicolasrozain in 1.0.5 this function mutates the original DataFrame, so I'm removing the regression tag pending further discussion/investigation of what the correct behaviour is and whether we want to restore the 1.0.5 behaviour.

>>> import pandas as pd
>>> import pandas._testing as tm
>>> pd.__version__
'1.0.5'
>>>
>>> df = pd.DataFrame({"a": list(range(0, 3)), "b": list(range(100, 103))})
>>> orig = df.copy()
>>>
>>>
>>> def func(row):
...     row.loc["a"] += 1
...     return row
...
>>>
>>> df
   a    b
0  0  100
1  1  101
2  2  102
>>>
>>> res = df.apply(func, axis=1)
>>> print(res)
   a    b
0  1  100
1  2  101
2  3  102
>>>
>>> df
   a    b
0  1  100
1  2  101
2  3  102
>>>
>>> tm.assert_frame_equal(df, orig)
Traceback (most recent call last):
...
AssertionError: DataFrame.iloc[:, 0] (column name="a") are different
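
The same check can also be written against the public `pandas.testing` module rather than the internal `tm` alias. This is only a sketch: the helper name `leaves_input_untouched` and the function `add_one_on_copy` are hypothetical, and the helper reports whether the frame changed, not which behaviour is the correct one.

```python
import pandas as pd
from pandas.testing import assert_frame_equal


def leaves_input_untouched(df, func):
    """Return True if df.apply(func, axis=1) leaves df unchanged (hypothetical helper)."""
    before = df.copy()
    df.apply(func, axis=1)
    try:
        assert_frame_equal(df, before)
    except AssertionError:
        return False
    return True


def add_one_on_copy(row):
    # Copies first, so the row passed in by apply is not modified.
    row = row.copy()
    row.loc["a"] += 1
    return row


df = pd.DataFrame({"a": list(range(0, 3)), "b": list(range(100, 103))})
print(leaves_input_untouched(df, add_one_on_copy))
```

Passing the original `func` from the report instead would show whether a given pandas version lets the mutation reach the input frame.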

@simonjayhawkins simonjayhawkins removed this from the 1.1.1 milestone Jul 31, 2020
@simonjayhawkins simonjayhawkins added Needs Discussion Requires discussion from core team before further action and removed Regression Functionality that used to work in a prior pandas version labels Jul 31, 2020
@simonjayhawkins
Member

OK, so the commit that caused the change is not the one I expected. It is #34909. That PR is labelled PERF and so should not have changed behaviour. Re-instating the regression tag. cc @jbrockmendel

91802a9 is the first bad commit
commit 91802a9
Author: jbrockmendel <jbrockmendel@gmail.com>
Date: Thu Jun 25 16:06:10 2020 -0700

PERF: avoid creating many Series in apply_standard (#34909)

@simonjayhawkins simonjayhawkins added Regression Functionality that used to work in a prior pandas version and removed Needs Discussion Requires discussion from core team before further action labels Jul 31, 2020
@simonjayhawkins simonjayhawkins added this to the 1.1.1 milestone Jul 31, 2020
@GusBite

GusBite commented Aug 3, 2020

I have experienced the same behaviour. Modifying a copy of the series instead of the original solved the issue in the short term. I guess I'll wait for a fix before enjoying the long-awaited 'apply does not alter the first row twice' functionality. Keep up the good work, guys.

@seal9

seal9 commented Aug 8, 2020

Same problem for me. I feel like the 1.0.5 behavior was much better. I often use the apply method to modify rows like this, as I feel the ability to call an arbitrary Python function on any row is very powerful. I'll downgrade to 1.0.5 until this is fixed.
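
For the simple increment used in this report, a column-wise expression avoids the row-wise call altogether. This is a sketch of an alternative under that assumption, not the behaviour a fix would restore, and it does not cover arbitrary per-row Python logic.

```python
import pandas as pd

df = pd.DataFrame({"a": list(range(0, 3)), "b": list(range(100, 103))})

# assign returns a new frame with column "a" replaced; df itself is left as it was.
res = df.assign(a=df["a"] + 1)
print(res)
```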
