BUG: dataframe.apply() loops on first row when applied method attempts to modify the row #35462

nicolasrozain · 2020-07-29T15:21:58Z

With pandas 1.1.0 on Python 3.6.8:

>>> import pandas as pd
>>> df = pd.DataFrame({'a': list(range(0,100)), 'b': list(range(100,200))})
>>> def func(row):
...     row.loc['a'] += 1
...     return row
... 
>>> df
     a    b
0    0  100
1    1  101
2    2  102
3    3  103
4    4  104
..  ..  ...
95  95  195
96  96  196
97  97  197
98  98  198
99  99  199
[100 rows x 2 columns]
>>> df.apply(func, axis=1)
      a    b
0   100  100
1   100  100
2   100  100
3   100  100
4   100  100
..  ...  ...
95  100  100
96  100  100
97  100  100
98  100  100
99  100  100
[100 rows x 2 columns]
>>> df
      a    b
0   100  100
1     1  101
2     2  102
3     3  103
4     4  104
..  ...  ...
95   95  195
96   96  196
97   97  197
98   98  198
99   99  199
[100 rows x 2 columns]
>>> pd.show_versions()
INSTALLED VERSIONS
------------------
commit           : d9fff2792bf16178d4e450fe7384244e50635733
python           : 3.6.8.final.0
python-bits      : 64
OS               : Windows
OS-release       : 10
Version          : 10.0.17763
machine          : AMD64
processor        : Intel64 Family 6 Model 142 Stepping 10, GenuineIntel
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : None.None
pandas           : 1.1.0
numpy            : 1.18.2
pytz             : 2019.3
dateutil         : 2.8.1
pip              : 20.0.2
setuptools       : 42.0.2
Cython           : None
pytest           : 5.4.1
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : 4.5.0
html5lib         : 1.0.1
pymysql          : None
psycopg2         : None
jinja2           : None
IPython          : None
pandas_datareader: None
bs4              : 4.9.1
bottleneck       : 1.3.2
fsspec           : None
fastparquet      : None
gcsfs            : None
matplotlib       : 3.2.1
numexpr          : 2.7.1
odfpy            : None
openpyxl         : 2.5.14
pandas_gbq       : None
pyarrow          : None
pytables         : None
pyxlsb           : None
s3fs             : None
scipy            : 1.3.3
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : 1.2.0
xlwt             : 1.3.0
numba            : None

With pandas 1.0.5:

>>> import pandas as pd
>>> df = pd.DataFrame({'a': list(range(0,100)), 'b': list(range(100,200))})
>>> def func(row):
...     row.loc['a'] += 1
...     return row
... 
>>> df
     a    b
0    0  100
1    1  101
2    2  102
3    3  103
4    4  104
..  ..  ...
95  95  195
96  96  196
97  97  197
98  98  198
99  99  199
[100 rows x 2 columns]
>>> df.apply(func, axis=1)
      a    b
0     1  100
1     2  101
2     3  102
3     4  103
4     5  104
..  ...  ...
95   96  195
96   97  196
97   98  197
98   99  198
99  100  199
[100 rows x 2 columns]
>>> df
      a    b
0     1  100
1     2  101
2     3  102
3     4  103
4     5  104
..  ...  ...
95   96  195
96   97  196
97   98  197
98   99  198
99  100  199
[100 rows x 2 columns]
>>> pd.show_versions()
INSTALLED VERSIONS
------------------
commit           : None
python           : 3.6.8.final.0
python-bits      : 64
OS               : Windows
OS-release       : 10
machine          : AMD64
processor        : Intel64 Family 6 Model 142 Stepping 10, GenuineIntel
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : None.None
pandas           : 1.0.5
numpy            : 1.18.2
pytz             : 2019.3
dateutil         : 2.8.1
pip              : 20.0.2
setuptools       : 42.0.2
Cython           : None
pytest           : 5.4.1
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : 4.5.0
html5lib         : 1.0.1
pymysql          : None
psycopg2         : None
jinja2           : None
IPython          : None
pandas_datareader: None
bs4              : 4.9.1
bottleneck       : 1.3.2
fastparquet      : None
gcsfs            : None
lxml.etree       : 4.5.0
matplotlib       : 3.2.1
numexpr          : 2.7.1
odfpy            : None
openpyxl         : 2.5.14
pandas_gbq       : None
pyarrow          : None
pytables         : None
pytest           : 5.4.1
pyxlsb           : None
s3fs             : None
scipy            : 1.3.3
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : 1.2.0
xlwt             : 1.3.0
xlsxwriter       : None
numba            : None

I expected the behavior of 1.0.5 in 1.1.0. Did I misunderstand the apply method?
Thank you for your help.


simonjayhawkins · 2020-07-30T12:48:10Z

Thanks @nicolasrozain for the report. Is adding row = row.copy() to the function before the mutation a suitable workaround in the short term?
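
For reference, the suggested workaround could look like the sketch below. It reuses the `func` and the small frame from this thread; the `orig` and `res` names are only illustrative. It assumes that copying the row first is enough to keep `apply` from seeing any in-place change, which is what the suggestion above relies on.

```python
import pandas as pd

df = pd.DataFrame({"a": list(range(0, 3)), "b": list(range(100, 103))})
orig = df.copy()  # kept only to compare against after the apply call


def func(row):
    # Work on a copy so the object handed over by apply is never mutated.
    row = row.copy()
    row.loc["a"] += 1
    return row


res = df.apply(func, axis=1)
print(res)
```

With the copy in place, `res` holds the incremented values and `df` should still compare equal to `orig`.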

simonjayhawkins · 2020-07-31T12:44:42Z

@nicolasrozain in 1.0.5 the original DataFrame is being mutated by this function, so I am removing the regression tag pending further discussion/investigation of the correct behaviour and of whether we want to restore the 1.0.5 behaviour.

>>> import pandas as pd
>>> import pandas._testing as tm
>>> pd.__version__
'1.0.5'
>>>
>>> df = pd.DataFrame({"a": list(range(0, 3)), "b": list(range(100, 103))})
>>> orig = df.copy()
>>>
>>>
>>> def func(row):
...     row.loc["a"] += 1
...     return row
...
>>>
>>> df
   a    b
0  0  100
1  1  101
2  2  102
>>>
>>> res = df.apply(func, axis=1)
>>> print(res)
   a    b
0  1  100
1  2  101
2  3  102
>>>
>>> df
   a    b
0  1  100
1  2  101
2  3  102
>>>
>>> tm.assert_frame_equal(df, orig)
Traceback (most recent call last):
...
AssertionError: DataFrame.iloc[:, 0] (column name="a") are different

simonjayhawkins · 2020-07-31T13:32:41Z

OK, so the commit that caused the change is not the one I expected: it is #34909. This PR is labelled PERF and so should not have changed behaviour. Re-instating the regression tag. cc @jbrockmendel

91802a9 is the first bad commit
commit 91802a9
Author: jbrockmendel <jbrockmendel@gmail.com>
Date: Thu Jun 25 16:06:10 2020 -0700

PERF: avoid creating many Series in apply_standard (#34909)

GusBite · 2020-08-03T20:11:51Z

I have experienced the same behaviour. Modifying a copy of the series instead of the original solved the issue in the short term. I guess I'll wait for a fix before enjoying the long-awaited 'apply does not alter the first row twice' functionality. Keep up the good work, guys.

seal9 · 2020-08-08T10:36:34Z

Same problem for me. I feel like the 1.0.5 behavior was much better. I often use the apply method to modify rows like this, as I feel the ability to call an arbitrary Python function on any row is very powerful. I'll downgrade to 1.0.5 until this is fixed.

nicolasrozain changed the title ~~BUG: dataframe.apply() loops on first row when apply method attempts to modify the row~~ BUG: dataframe.apply() loops on first row when applied method attempts to modify the row Jul 29, 2020

simonjayhawkins added the Apply, Regression, and Bug labels Jul 30, 2020

simonjayhawkins added this to the 1.1.1 milestone Jul 30, 2020

simonjayhawkins removed this from the 1.1.1 milestone Jul 31, 2020

simonjayhawkins added the Needs Discussion label and removed the Regression label Jul 31, 2020

simonjayhawkins mentioned this issue Jul 31, 2020

QST: is the new behavior of df.apply(my_func, axis=1) in v1.1.0 intended? #35483

Closed

2 tasks

simonjayhawkins added the Regression label and removed the Needs Discussion label Jul 31, 2020

simonjayhawkins added this to the 1.1.1 milestone Jul 31, 2020

This was referenced Aug 2, 2020

BUG: Pandas 1.0.5 → 1.1.0 behavior change on DataFrame.apply() where fn returns np.ndarray #35517

Closed

BUG: #35526

Closed

jbrockmendel mentioned this issue Aug 8, 2020

BUG: DataFrame.apply with func altering row in-place #35633

Merged

4 tasks

simonjayhawkins mentioned this issue Aug 10, 2020

BUG: Pandas 1.1.0 apply function with axis=1 seems to mishandle the rows #35634

Closed

jreback closed this as completed in #35633 Aug 11, 2020

rhshadrach mentioned this issue Jan 16, 2021

BUG: .apply with collections #39166

Open
