BUG: Pandas 1.1.3 read_csv raises a TypeError when dtype, and index_col are provided, and file has >1M rows #37094

mgeplf · 2020-10-13T07:22:20Z

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.

Code Sample, a copy-pastable example

import pandas as pd
import numpy as np

ROWS = 1000001  #  <--------- with 1000000, it works

with open('out.dat', 'w') as fd:
    for i in range(ROWS):
        fd.write('%d\n' % i)

df = pd.read_csv('out.dat', names=['a'], dtype={'a': np.float64}, index_col=['a'])

Problem description

When ROWS = 1000001, I get the following traceback:

Traceback (most recent call last):
  File "try.py", line 10, in <module>
    df = pd.read_csv('out.dat', names=['a'], dtype={'a': np.float64}, index_col=['a'])
  File "/tmp/new_pandas/lib64/python3.6/site-packages/pandas/io/parsers.py", line 686, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/tmp/new_pandas/lib64/python3.6/site-packages/pandas/io/parsers.py", line 458, in _read
    data = parser.read(nrows)
  File "/tmp/new_pandas/lib64/python3.6/site-packages/pandas/io/parsers.py", line 1196, in read
    ret = self._engine.read(nrows)
  File "/tmp/new_pandas/lib64/python3.6/site-packages/pandas/io/parsers.py", line 2231, in read
    index, names = self._make_index(data, alldata, names)
  File "/tmp/new_pandas/lib64/python3.6/site-packages/pandas/io/parsers.py", line 1677, in _make_index
    index = self._agg_index(index)
  File "/tmp/new_pandas/lib64/python3.6/site-packages/pandas/io/parsers.py", line 1770, in _agg_index
    arr, _ = self._infer_types(arr, col_na_values | col_na_fvalues)
  File "/tmp/new_pandas/lib64/python3.6/site-packages/pandas/io/parsers.py", line 1871, in _infer_types
    mask = algorithms.isin(values, list(na_values))
  File "/tmp/new_pandas/lib64/python3.6/site-packages/pandas/core/algorithms.py", line 443, in isin
    if np.isnan(values).any():
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

Expected Output

With pandas 1.1.2, or ROWS = 1000000, it works fine.
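
The expected behaviour can be written as a small check. This is only a sketch: it builds the same single-column data in memory (via `io.StringIO`, an assumption made to avoid writing `out.dat`) and reads it with the same `read_csv` arguments as the report. On an affected version the `read_csv` call is expected to raise the TypeError shown above; on an unaffected version every row ends up in the index.

```python
import io

import numpy as np
import pandas as pd

ROWS = 1_000_001  # one more row than the size at which the report says it still works

# Same content as out.dat in the report, built in memory instead of on disk.
csv_text = "\n".join(str(i) for i in range(ROWS)) + "\n"

df = pd.read_csv(
    io.StringIO(csv_text),
    names=["a"],
    dtype={"a": np.float64},
    index_col=["a"],
)

# The only column became the index, so the frame has ROWS rows and no columns.
print(len(df), list(df.columns))
```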

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit : db08276
python : 3.6.3.final.0
python-bits : 64
OS : Linux
OS-release : 3.10.0-957.38.3.el7.x86_64
Version : #1 SMP Mon Nov 11 12:01:33 EST 2019
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.1.3
numpy : 1.19.2
pytz : 2020.1
dateutil : 2.8.1
pip : 20.2.3
setuptools : 50.3.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 7.16.1
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None


asishm · 2020-10-13T08:56:23Z

Possibly caused by #36266 (having some trouble running bisects on my end)

The problem arises because the default na_values for read_csv are an object array of strings, which np.isnan cannot handle: array(['', 'NULL', '#N/A', 'N/A', '1.#QNAN', 'nan', '#NA', '-1.#QNAN', '<NA>', '1.#IND', 'n/a', '-nan', '-1.#IND', '#N/A N/A', 'null', '-NaN', 'NaN', 'NA'], dtype=object)
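
A small sketch of why the default na_values matter here: `np.isnan` is only defined for numeric dtypes, so calling it on an object array raises the same kind of TypeError as in the traceback, whatever the array length. The array below is a shortened stand-in for the defaults, and the variable names are made up for illustration.

```python
import numpy as np

# Shortened, illustrative stand-in for the default na_values listed above.
na_values = np.array(["", "NULL", "nan", "NA"], dtype=object)

try:
    np.isnan(na_values)
    isnan_error = None
except TypeError as exc:
    # np.isnan has no implementation for object-dtype input.
    isnan_error = str(exc)

print(isnan_error)
```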

mgeplf · 2020-10-13T09:32:25Z

I agree, the code here looks suspicious:
https://github.com/pandas-dev/pandas/pull/36266/files#diff-c8f3ad29eaf121537b999e88e9117f3e3702d0b818a67516da25093fe2890ce8R442

jorisvandenbossche · 2020-10-13T14:43:57Z

Confirmed this is a regression compared to 1.0.x. Thanks for the report!

mujina93 · 2020-10-14T12:28:28Z

I can confirm I have the same problem, which arises as soon as I pass the threshold of 1M rows.

I only need to specify index_col to get the bug, though. Specifying dtypes is not needed.

Pandas 1.1.3

As a temporary workaround, I am reading without index_col and then setting the index afterwards. E.g.

df = pd.read_csv(filepath, nrows=1000001)
df = df.set_index(df.columns[0])  # for example, to use the first column as the index
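
A self-contained version of that workaround might look like the sketch below. The in-memory CSV and its column names (`id`, `value`) are hypothetical; the point is only that the index is set after reading instead of through `index_col`.

```python
import io

import pandas as pd

# Tiny in-memory CSV standing in for the real file (hypothetical data).
csv_text = "id,value\n1,10\n2,20\n3,30\n"

# Read without index_col, then promote the first column to the index.
df = pd.read_csv(io.StringIO(csv_text))
df = df.set_index(df.columns[0])

print(df)
```

Note that `set_index` returns a new DataFrame by default, so the result has to be assigned back.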

MahsaSeifikar · 2020-10-17T13:44:55Z

I have the same problem. Is there any solution?

alor · 2020-10-17T14:39:06Z

I have the same problem. any solution?

My solution was to downgrade to 1.1.2, which works.

simonjayhawkins · 2020-10-29T20:04:13Z

A more minimal example that does not involve read_csv:

>>> import numpy as np
>>> import pandas as pd
>>> pd.__version__
'1.1.3'
>>> ser = pd.Series([1, 2, np.nan] * 1_000_000)
>>> ser.isin({"foo", "bar"})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\simon\anaconda3\envs\pandas-1.1.3\lib\site-packages\pandas\core\series.py", line 4685, in isin
    result = algorithms.isin(self, values)
  File "C:\Users\simon\anaconda3\envs\pandas-1.1.3\lib\site-packages\pandas\core\algorithms.py", line 443, in isin
    if np.isnan(values).any():
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
>>>

>>> import numpy as np
>>> import pandas as pd
>>> pd.__version__
'1.0.5'
>>> ser = pd.Series([1, 2, np.nan] * 1_000_000)
>>> ser.isin({"foo", "bar"})
C:\Users\simon\anaconda3\envs\pandas-1.0.5\lib\site-packages\numpy\lib\arraysetops.py:580: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  mask |= (ar1 == a)
0          False
1          False
2          False
3          False
4          False
           ...
2999995    False
2999996    False
2999997    False
2999998    False
2999999    False
Length: 3000000, dtype: bool
>>>
>>>
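
For anyone stuck on an affected version, one possible stopgap is to call `isin` on slices that stay below the size at which the failure was reported. The sketch below assumes the error only triggers above roughly one million elements, as described in this thread. `isin_in_chunks` and its `chunk_size` default are hypothetical names chosen for illustration, not part of pandas.

```python
import numpy as np
import pandas as pd


def isin_in_chunks(ser, values, chunk_size=500_000):
    """Hypothetical helper: run Series.isin on slices of at most chunk_size rows."""
    parts = [
        ser.iloc[start:start + chunk_size].isin(values)
        for start in range(0, len(ser), chunk_size)
    ]
    # Each part keeps its original index labels, so concat restores the full Series.
    return pd.concat(parts)


ser = pd.Series([1, 2, np.nan] * 1_000_000)
result = isin_in_chunks(ser, {"foo", "bar"})
print(len(result), result.any())
```

This trades some speed for staying on the small-input code path; upgrading to a release that contains the fix is the simpler option once one is available.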

mgeplf added the Bug and Needs Triage labels Oct 13, 2020

jorisvandenbossche added the IO CSV and Regression labels and removed the Bug and Needs Triage labels Oct 13, 2020

jorisvandenbossche added this to the 1.1.4 milestone Oct 13, 2020

MrOlm mentioned this issue Oct 21, 2020

TypeError: ufunc 'isnan' MrOlm/inStrain#26

Closed

simonjayhawkins mentioned this issue Oct 29, 2020

RLS: 1.1.4 #37397

Closed

simonjayhawkins added the Algos label and removed the IO CSV label Oct 29, 2020

jorisvandenbossche mentioned this issue Oct 29, 2020

REGR: fix isin for large series with nan and mixed object dtype (causing regression in read_csv) #37499

Merged

brent-stone mentioned this issue Oct 29, 2020

TypeError: ufunc 'isnan' not supported for the input types... brent-stone/CAN_Reverse_Engineering#9

Open

simonjayhawkins closed this as completed in #37499 Oct 30, 2020

smguo mentioned this issue Apr 28, 2022

Master tests mehta-lab/microDL#149

Merged
