BUG: DataFrame.apply(raw=True) applies function twice on first row/column #34506

alonme · 2020-05-31T19:46:31Z

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.

Code Sample, a copy-pastable example

In [1]: import pandas as pd
   ...:
   ...: df = pd.DataFrame({'a': [1,2,3]})
   ...:
   ...: def mul2(x):
   ...:     print('Hello')
   ...:     return x*2
   ...:
   ...: print('Apply:')
   ...: df.apply(mul2)
   ...:
   ...: print('\nApply Raw:')
   ...: df.apply(mul2, raw=True)

Apply:
Hello

Apply Raw:
Hello
Hello

Problem description

Because of an implementation detail, the applied function is applied twice on the first row/column,
This results in side effects happening twice.

Other than the apply_raw code path, this issues should be fixed by #34183.

Expected Output

Apply:
Hello

Apply Raw:
Hello

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : 07edc87
python : 3.7.5.final.0
python-bits : 64
OS : Darwin
OS-release : 19.5.0
Version : Darwin Kernel Version 19.5.0: Thu Apr 30 18:25:59 PDT 2020; root:xnu-6153.121.1~7/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.1.0.dev0+1706.g07edc871f.dirty
numpy : 1.18.3
pytz : 2020.1
dateutil : 2.8.1
pip : 19.2.3
setuptools : 41.2.0
Cython : 0.29.17
pytest : 5.4.1
hypothesis : 5.11.0
sphinx : 3.0.3
blosc : 1.9.1
feather : None
xlsxwriter : 1.2.8
lxml.etree : 4.5.0
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.14.0
pandas_datareader: None
bs4 : 4.9.0
bottleneck : 1.3.2
fastparquet : 0.3.3
gcsfs : None
matplotlib : 3.2.1
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.3
pandas_gbq : None
pyarrow : 0.17.0
pytables : None
pyxlsb : None
s3fs : 0.4.2
scipy : 1.4.1
sqlalchemy : 1.3.16
tables : 3.6.1
tabulate : 0.8.7
xarray : 0.15.1
xlrd : 1.2.0
xlwt : 1.3.0
numba : 0.49.0

The text was updated successfully, but these errors were encountered:

arw2019 · 2020-05-31T21:45:32Z

The reason why you see the side effect twice is because in pandas/core/apply.py's apply_raw function

pandas/pandas/core/apply.py

Line 221 in 48c00ea

def apply_raw(self):

mul2 gets called twice, first inside the try statement with pandas._libs's compute_reduction, and then after an error is thrown, in the except block.

I haven't understood the error in compute_reduction here entirely. I tracked it down as far as this function call:

pandas/pandas/_libs/reduction.pyx

Line 629 in 48c00ea

reducer = Reducer(arr, f, axis=axis, dummy=dummy, labels=labels)

However, my impression from the comments apply_raw is that it's thrown intentionally, and apply_raw is working as expected.

I'm happy to work on a pull request/documentation update if we think there are changes or improvements to be made!

alonme · 2020-06-01T06:19:32Z

Hey @arw2019, I am familiar,
I solved this issue except for raw=True in #34183
Hope it will be merged soon, and then not much would be left to be done in apply_raw

alonme · 2020-06-03T05:04:56Z

@arw2019 My PR was merged #34183
You can look at how i solved the non-raw path, raw can work in a very similar fashion.

arw2019 · 2020-06-06T19:28:34Z

@alonme Cool! I'll work on a pull request and loop you in!

alonme · 2020-07-12T07:02:07Z

Per #34913, and comments on that issue, i think we can close this issue.

simonjayhawkins · 2020-09-09T20:10:19Z

on master

>>> df = pd.DataFrame({"a": [1, 2, 3]})
>>>
>>>
>>> def mul2(x):
...     print("Hello")
...     return x * 2
...
>>>
>>> print("Apply:")
Apply:
>>> df.apply(mul2)
Hello
   a
0  2
1  4
2  6
>>>
>>> print("\nApply Raw:")

Apply Raw:
>>> df.apply(mul2, raw=True)
Hello
   a
0  2
1  4
2  6
>>> pd.__version__
'1.2.0.dev0+314.g44e933a765'
>>>

alonme · 2020-09-09T20:24:01Z

Hey @simonjayhawkins,

I already added a test for this issue with an XFAIL, it seems that the XFAIL was removed when this was fixed.

7d0ee96

Do you think more tests are needed?

simonjayhawkins · 2020-09-09T20:31:31Z

There seemed to be reluctance to close this issue in #34913 without a test. #34913 (comment) @jbrockmendel ?

jbrockmendel · 2020-09-11T19:45:16Z

I'm fine with closing this

alonme added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels May 31, 2020

jorisvandenbossche mentioned this issue Jun 22, 2020

REF: dont use compute_reduction #34913

Merged

jreback mentioned this issue Jun 23, 2020

PERF: avoid creating many Series in apply_standard #34909

Merged

5 tasks

jbrockmendel added Apply Apply, Aggregate, Transform, Map and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Sep 3, 2020

simonjayhawkins added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed Apply Apply, Aggregate, Transform, Map Bug labels Sep 9, 2020

simonjayhawkins added this to the Contributions Welcome milestone Sep 9, 2020

simonjayhawkins closed this as completed Sep 24, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: DataFrame.apply(raw=True) applies function twice on first row/column #34506

BUG: DataFrame.apply(raw=True) applies function twice on first row/column #34506

alonme commented May 31, 2020 •

edited

Loading

INSTALLED VERSIONS

arw2019 commented May 31, 2020

alonme commented Jun 1, 2020

alonme commented Jun 3, 2020

arw2019 commented Jun 6, 2020

alonme commented Jul 12, 2020

simonjayhawkins commented Sep 9, 2020

alonme commented Sep 9, 2020

simonjayhawkins commented Sep 9, 2020

jbrockmendel commented Sep 11, 2020

BUG: DataFrame.apply(raw=True) applies function twice on first row/column #34506

BUG: DataFrame.apply(raw=True) applies function twice on first row/column #34506

Comments

alonme commented May 31, 2020 • edited Loading

Code Sample, a copy-pastable example

Problem description

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS

arw2019 commented May 31, 2020

alonme commented Jun 1, 2020

alonme commented Jun 3, 2020

arw2019 commented Jun 6, 2020

alonme commented Jul 12, 2020

simonjayhawkins commented Sep 9, 2020

alonme commented Sep 9, 2020

simonjayhawkins commented Sep 9, 2020

jbrockmendel commented Sep 11, 2020

alonme commented May 31, 2020 •

edited

Loading

Output of `pd.show_versions()`