BUG: Pandas 1.1.3 read_csv raises a TypeError when dtype, and index_col are provided, and file has >1M rows #37094

Closed
2 of 3 tasks
mgeplf opened this issue Oct 13, 2020 · 7 comments · Fixed by #37499
Labels: Algos (Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff), Regression (Functionality that used to work in a prior pandas version)

@mgeplf

mgeplf commented Oct 13, 2020

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

import pandas as pd
import numpy as np

ROWS = 1000001  #  <--------- with 1000000, it works

with open('out.dat', 'w') as fd:
    for i in range(ROWS):
        fd.write('%d\n' % i)

df = pd.read_csv('out.dat', names=['a'], dtype={'a': np.float64}, index_col=['a'])

Problem description

When ROWS = 1000001, I get the following traceback:

Traceback (most recent call last):
  File "try.py", line 10, in <module>
    df = pd.read_csv('out.dat', names=['a'], dtype={'a': np.float64}, index_col=['a'])
  File "/tmp/new_pandas/lib64/python3.6/site-packages/pandas/io/parsers.py", line 686, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/tmp/new_pandas/lib64/python3.6/site-packages/pandas/io/parsers.py", line 458, in _read
    data = parser.read(nrows)
  File "/tmp/new_pandas/lib64/python3.6/site-packages/pandas/io/parsers.py", line 1196, in read
    ret = self._engine.read(nrows)
  File "/tmp/new_pandas/lib64/python3.6/site-packages/pandas/io/parsers.py", line 2231, in read
    index, names = self._make_index(data, alldata, names)
  File "/tmp/new_pandas/lib64/python3.6/site-packages/pandas/io/parsers.py", line 1677, in _make_index
    index = self._agg_index(index)
  File "/tmp/new_pandas/lib64/python3.6/site-packages/pandas/io/parsers.py", line 1770, in _agg_index
    arr, _ = self._infer_types(arr, col_na_values | col_na_fvalues)
  File "/tmp/new_pandas/lib64/python3.6/site-packages/pandas/io/parsers.py", line 1871, in _infer_types
    mask = algorithms.isin(values, list(na_values))
  File "/tmp/new_pandas/lib64/python3.6/site-packages/pandas/core/algorithms.py", line 443, in isin
    if np.isnan(values).any():
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

Expected Output

With pandas 1.1.2, or ROWS = 1000000, it works fine.

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit : db08276
python : 3.6.3.final.0
python-bits : 64
OS : Linux
OS-release : 3.10.0-957.38.3.el7.x86_64
Version : #1 SMP Mon Nov 11 12:01:33 EST 2019
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.1.3
numpy : 1.19.2
pytz : 2020.1
dateutil : 2.8.1
pip : 20.2.3
setuptools : 50.3.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 7.16.1
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

@mgeplf added the Bug and Needs Triage labels Oct 13, 2020
@asishm
Contributor

asishm commented Oct 13, 2020

Possibly caused by #36266 (having some trouble running bisects on my end)

The problem arises because the default na_values for read_csv are an object-dtype array of strings, which np.isnan cannot handle: array(['', 'NULL', '#N/A', 'N/A', '1.#QNAN', 'nan', '#NA', '-1.#QNAN', '<NA>', '1.#IND', 'n/a', '-nan', '-1.#IND', '#N/A N/A', 'null', '-NaN', 'NaN', 'NA'], dtype=object)
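To illustrate the point above, here is a minimal sketch of why such an array trips np.isnan. It assumes, as the traceback suggests, that the failing call receives an object-dtype array of strings; the shortened na_values list and the helper isnan_fails are stand-ins for illustration, not pandas internals.

```python
import numpy as np

# Assumption: a shortened stand-in for the default na_values listed above.
na_values = np.array(["", "NULL", "NaN", "NA"], dtype=object)


def isnan_fails(arr):
    """Hypothetical helper: return True if np.isnan rejects the array."""
    try:
        np.isnan(arr)
    except TypeError:
        return True
    return False


print(isnan_fails(na_values))                # object array of strings
print(isnan_fails(np.array([1.0, np.nan])))  # plain float array
```

np.isnan is only defined for numeric dtypes, so the object array fails regardless of what the strings contain.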

@mgeplf
Author

mgeplf commented Oct 13, 2020

@jorisvandenbossche
Member

jorisvandenbossche commented Oct 13, 2020

Confirmed this is a regression compared to 1.0.x. Thanks for the report!

@jorisvandenbossche added the IO CSV and Regression labels and removed the Bug and Needs Triage labels Oct 13, 2020
@jorisvandenbossche added this to the 1.1.4 milestone Oct 13, 2020
@mujina93

mujina93 commented Oct 14, 2020

I can confirm I have the same problem, which arises as soon as I pass the threshold of 1M rows.

I only need to specify index_col to get the bug, though. Specifying dtypes is not needed.

Pandas 1.1.3


And as a temporary workaround, I am reading without index_col, and then setting my index. E.g.

import pandas as pd

df = pd.read_csv(filepath, nrows=1000001)
df = df.set_index(df.columns[0])  # set_index returns a new frame; here the first column becomes the index
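A self-contained sketch of this workaround, using a small in-memory file in place of out.dat from the original report. The five-row buffer is an assumption for illustration only; the actual failure needs more than 1M rows, but the read-then-set_index pattern is the same.

```python
import io

import numpy as np
import pandas as pd

# Small in-memory stand-in for 'out.dat'; the real failure needs > 1M rows.
buf = io.StringIO("\n".join(str(i) for i in range(5)))

# Read without index_col, then build the index afterwards.
df = pd.read_csv(buf, names=["a"], dtype={"a": np.float64})
df = df.set_index("a")  # set_index returns a new DataFrame, so reassign

print(df.index)
```

Setting the index after the read keeps the index column out of the NA-inference step that appears in the traceback (_agg_index and _infer_types).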

@MahsaSeifikar

I have the same problem. Is there a solution?

@alor

alor commented Oct 17, 2020

I have the same problem. any solution?

My solution was to downgrade to 1.1.2, which works.

@simonjayhawkins
Member

A more minimal example that does not involve read_csv:

>>> import numpy as np
>>> import pandas as pd
>>> pd.__version__
'1.1.3'
>>> ser = pd.Series([1, 2, np.nan] * 1_000_000)
>>> ser.isin({"foo", "bar"})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\simon\anaconda3\envs\pandas-1.1.3\lib\site-packages\pandas\core\series.py", line 4685, in isin
    result = algorithms.isin(self, values)
  File "C:\Users\simon\anaconda3\envs\pandas-1.1.3\lib\site-packages\pandas\core\algorithms.py", line 443, in isin
    if np.isnan(values).any():
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
>>>
>>> import numpy as np
>>> import pandas as pd
>>> pd.__version__
'1.0.5'
>>> ser = pd.Series([1, 2, np.nan] * 1_000_000)
>>> ser.isin({"foo", "bar"})
C:\Users\simon\anaconda3\envs\pandas-1.0.5\lib\site-packages\numpy\lib\arraysetops.py:580: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  mask |= (ar1 == a)
0          False
1          False
2          False
3          False
4          False
           ...
2999995    False
2999996    False
2999997    False
2999998    False
2999999    False
Length: 3000000, dtype: bool
>>>
>>>
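Both tracebacks in this thread end at the same np.isnan(values).any() check. As a hedged sketch, one way to make such a check tolerate non-numeric input is to use pd.isna, which accepts object arrays. The helper contains_nan is hypothetical and is not claimed to be the change made in #37499.

```python
import numpy as np
import pandas as pd


def contains_nan(values):
    """Hypothetical helper: NaN check that tolerates non-numeric input."""
    arr = np.asarray(list(values), dtype=object)
    # pd.isna handles object arrays of strings, where np.isnan raises TypeError.
    return bool(pd.isna(arr).any())


print(contains_nan(["foo", "bar"]))
print(contains_nan([1.0, float("nan")]))
```

With a guard like this, strings such as "foo" and "bar" are simply reported as not missing instead of raising.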
