Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: DataFrame.apply(raw=True) applies function twice on first row/column #34506

Closed
3 tasks done
alonme opened this issue May 31, 2020 · 9 comments
Closed
3 tasks done
Labels
good first issue Needs Tests Unit test(s) needed to prevent regressions

Comments

@alonme
Copy link
Contributor

alonme commented May 31, 2020

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

In [1]: import pandas as pd
   ...:
   ...: df = pd.DataFrame({'a': [1,2,3]})
   ...:
   ...: def mul2(x):
   ...:     print('Hello')
   ...:     return x*2
   ...:
   ...: print('Apply:')
   ...: df.apply(mul2)
   ...:
   ...: print('\nApply Raw:')
   ...: df.apply(mul2, raw=True)

Apply:
Hello

Apply Raw:
Hello
Hello

Problem description

Because of an implementation detail, the applied function is applied twice on the first row/column,
This results in side effects happening twice.

Other than the apply_raw code path, this issues should be fixed by #34183.

Expected Output

Apply:
Hello

Apply Raw:
Hello

Output of pd.show_versions()

INSTALLED VERSIONS

commit : 07edc87
python : 3.7.5.final.0
python-bits : 64
OS : Darwin
OS-release : 19.5.0
Version : Darwin Kernel Version 19.5.0: Thu Apr 30 18:25:59 PDT 2020; root:xnu-6153.121.1~7/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.1.0.dev0+1706.g07edc871f.dirty
numpy : 1.18.3
pytz : 2020.1
dateutil : 2.8.1
pip : 19.2.3
setuptools : 41.2.0
Cython : 0.29.17
pytest : 5.4.1
hypothesis : 5.11.0
sphinx : 3.0.3
blosc : 1.9.1
feather : None
xlsxwriter : 1.2.8
lxml.etree : 4.5.0
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.14.0
pandas_datareader: None
bs4 : 4.9.0
bottleneck : 1.3.2
fastparquet : 0.3.3
gcsfs : None
matplotlib : 3.2.1
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.3
pandas_gbq : None
pyarrow : 0.17.0
pytables : None
pyxlsb : None
s3fs : 0.4.2
scipy : 1.4.1
sqlalchemy : 1.3.16
tables : 3.6.1
tabulate : 0.8.7
xarray : 0.15.1
xlrd : 1.2.0
xlwt : 1.3.0
numba : 0.49.0

@alonme alonme added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels May 31, 2020
@arw2019
Copy link
Member

arw2019 commented May 31, 2020

The reason why you see the side effect twice is because in pandas/core/apply.py's apply_raw function

def apply_raw(self):

mul2 gets called twice, first inside the try statement with pandas._libs's compute_reduction, and then after an error is thrown, in the except block.

I haven't understood the error in compute_reduction here entirely. I tracked it down as far as this function call:

reducer = Reducer(arr, f, axis=axis, dummy=dummy, labels=labels)
However, my impression from the comments apply_raw is that it's thrown intentionally, and apply_raw is working as expected.

I'm happy to work on a pull request/documentation update if we think there are changes or improvements to be made!

@alonme
Copy link
Contributor Author

alonme commented Jun 1, 2020

Hey @arw2019, I am familiar,
I solved this issue except for raw=True in #34183
Hope it will be merged soon, and then not much would be left to be done in apply_raw

@alonme
Copy link
Contributor Author

alonme commented Jun 3, 2020

@arw2019 My PR was merged #34183
You can look at how i solved the non-raw path, raw can work in a very similar fashion.

@arw2019
Copy link
Member

arw2019 commented Jun 6, 2020

@alonme Cool! I'll work on a pull request and loop you in!

@alonme
Copy link
Contributor Author

alonme commented Jul 12, 2020

Per #34913, and comments on that issue, i think we can close this issue.

@jbrockmendel jbrockmendel added Apply Apply, Aggregate, Transform, Map and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Sep 3, 2020
@simonjayhawkins
Copy link
Member

on master

>>> df = pd.DataFrame({"a": [1, 2, 3]})
>>>
>>>
>>> def mul2(x):
...     print("Hello")
...     return x * 2
...
>>>
>>> print("Apply:")
Apply:
>>> df.apply(mul2)
Hello
   a
0  2
1  4
2  6
>>>
>>> print("\nApply Raw:")

Apply Raw:
>>> df.apply(mul2, raw=True)
Hello
   a
0  2
1  4
2  6
>>> pd.__version__
'1.2.0.dev0+314.g44e933a765'
>>>

@simonjayhawkins simonjayhawkins added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed Apply Apply, Aggregate, Transform, Map Bug labels Sep 9, 2020
@simonjayhawkins simonjayhawkins added this to the Contributions Welcome milestone Sep 9, 2020
@alonme
Copy link
Contributor Author

alonme commented Sep 9, 2020

Hey @simonjayhawkins,

I already added a test for this issue with an XFAIL, it seems that the XFAIL was removed when this was fixed.

7d0ee96

Do you think more tests are needed?

@simonjayhawkins
Copy link
Member

There seemed to be reluctance to close this issue in #34913 without a test. #34913 (comment) @jbrockmendel ?

@jbrockmendel
Copy link
Member

I'm fine with closing this

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Needs Tests Unit test(s) needed to prevent regressions
Projects
None yet
Development

No branches or pull requests

4 participants