BUG: Wrong expression np.multiply to validate overflow exception in pandas.core.reshape.reshape.py#L118 #33694

Closed
3 tasks done
chjinche opened this issue Apr 21, 2020 · 3 comments
Labels
Bug Duplicate Report Duplicate issue or pull request

Comments

@chjinche

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

Code Sample, a copy-pastable example

# bug code block: pandas.core.reshape.reshape.py#L118
        # GH20601: This forces an overflow if the number of cells is too high.
        num_cells = np.multiply(num_rows, num_columns, dtype=np.int32)

        if num_rows > 0 and num_columns > 0 and num_cells <= 0:
            raise ValueError("Unstacked DataFrame is too big, causing int32 overflow")

# Repro code. The sign of the `np.multiply` result is not a reliable overflow check. For example:
>>> num_rows = 66000
>>> num_columns = 66000
>>> num_cells = np.multiply(num_rows, num_columns, dtype=np.int32)
>>> num_cells
61032704
>>> num_rows > 0 and num_columns > 0 and num_cells <= 0
False
# 66000*66000 overflows np.int32
>>> np.int32(66000*66000) 
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OverflowError: Python int too large to convert to C long

Problem description

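The sign of the np.multiply result is not a reliable overflow check. When the product overflows, the int32 result is the true product wrapped modulo 2**32 into the signed 32-bit range, so it can be negative, zero, or positive depending on which bits survive (in particular the sign bit).
In the example above, 66000*66000 overflows np.int32, yet the wrapped result is positive, so the `num_cells <= 0` check does not raise.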
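The wrap-around can be reproduced with plain Python integers. This is a minimal sketch for illustration only: `wrap_int32` is a hypothetical helper, not part of NumPy or pandas, and it assumes the usual two's-complement interpretation of a 32-bit result.

```python
def wrap_int32(value: int) -> int:
    """Interpret an arbitrary Python int as a two's-complement signed 32-bit value."""
    # Shift into [0, 2**32), reduce modulo 2**32, then shift back to [-2**31, 2**31).
    return (value + 2**31) % 2**32 - 2**31


# The product overflows int32 in all three cases, but the sign of the wrapped value differs.
print(wrap_int32(56333 * 56333))  # -1121560407 (negative: caught by `num_cells <= 0`)
print(wrap_int32(66000 * 66000))  # 61032704 (positive: slips past the check)
print(wrap_int32(65536 * 65536))  # 0 (exactly 2**32 wraps to zero)
```

The first two values match the np.multiply outputs quoted in this issue, which suggests the check only fires when the wrapped value happens to land at or below zero.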

Expected Output

Output of pd.show_versions()


@chjinche chjinche added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 21, 2020
@jreback
Contributor

jreback commented Apr 21, 2020

Please show actual pandas code that demonstrates the issue.

@chjinche
Author

@jreback you could refer to this code example,

>>> import pandas as pd
>>> size = 56333  # also try 66000
>>> df = pd.DataFrame({'f0': [str(i) for i in range(size)], 'f1': [str(i) for i in range(size, 2*size)]})
>>> pd.crosstab(df['f0'], df['f1'])

If size=56333, the ValueError "Unstacked DataFrame is too big, causing int32 overflow" is raised. However, if size increases to, say, 66000, the example gets past the overflow check without that error, which is really unexpected.

>>> np.multiply(56333, 56333, dtype=np.int32)
-1121560407
>>> np.multiply(66000, 66000, dtype=np.int32)
61032704
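One way to make the check independent of the wrapped value's sign is to compute the product with Python's arbitrary-precision integers and compare it against the int32 maximum. This is only a sketch under that assumption, not the fix adopted in pandas; `exceeds_int32` and `INT32_MAX` are hypothetical names introduced here.

```python
# Largest value representable in a signed 32-bit integer
# (the same number as np.iinfo(np.int32).max).
INT32_MAX = 2**31 - 1


def exceeds_int32(num_rows: int, num_columns: int) -> bool:
    """Return True if num_rows * num_columns does not fit in a signed 32-bit int."""
    # Python ints never overflow, so the comparison sees the true product.
    return int(num_rows) * int(num_columns) > INT32_MAX


print(exceeds_int32(56333, 56333))  # True
print(exceeds_int32(66000, 66000))  # True
print(exceeds_int32(1000, 1000))    # False
```

With this comparison both sizes from the example are flagged, regardless of whether the int32 product wraps to a negative or a positive number.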

@jreback
Contributor

jreback commented Apr 21, 2020

duplicate of #26314

@jreback jreback added Duplicate Report Duplicate issue or pull request and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 21, 2020
@jreback jreback added this to the No action milestone Apr 21, 2020
@jreback jreback closed this as completed Apr 21, 2020
2 participants