BUG: read_csv not applying dtype to index col when dtype is globally specified #45801

Open
2 of 3 tasks
gaow opened this issue Feb 3, 2022 · 3 comments
Labels
Bug IO CSV read_csv, to_csv Needs Discussion Requires discussion from core team before further action

Comments

@gaow

gaow commented Feb 3, 2022

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

# Unexpected behavior: the index values are still int, not str
import pandas as pd
from io import StringIO
data = "1,a\n2,b"
df = pd.read_csv(StringIO(data), index_col=0, dtype=str, header=None)
df.index

Issue Description

This was split off from #44632; see #9435 for more discussion. Version 1.4.0 fixes the case where dtype is specified for the index column explicitly:

df = pd.read_csv(StringIO(data), index_col=0, dtype={0:str}, header=None)

This works, but the same is not true when dtype is specified globally.

Expected Behavior

Specifying index_col=0, dtype=str requires all columns to be str, so I expect the first column to be read as str and then used as the index. With the current version of pandas, I am not sure what the alternative is for getting this behavior.
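One possible workaround, offered as a minimal sketch rather than a confirmed recommendation from the maintainers: leave out index_col so the global dtype covers every column, then promote the first column to the index with set_index. It reuses the data string from the reproducible example.

```python
from io import StringIO

import pandas as pd

data = "1,a\n2,b"

# Read every column as str without index_col, so the global dtype is applied
# to column 0 as well, then promote that column to the index afterwards.
df = pd.read_csv(StringIO(data), dtype=str, header=None).set_index(0)

# The index labels are now Python strings rather than integers.
print(list(df.index))
```

This costs an extra step, but it does not depend on how read_csv treats index_col together with a global dtype.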

Installed Versions

INSTALLED VERSIONS

commit : bb1f651
python : 3.8.5.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-39-generic
Version : #43-Ubuntu SMP Fri Jun 19 10:28:31 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.4.0
numpy : 1.19.5
pytz : 2020.5
dateutil : 2.8.1
pip : 20.2.4
setuptools : 50.3.1.post20201107
Cython : None
pytest : 6.2.2
hypothesis : None
sphinx : 4.3.1
blosc : None
feather : 0.4.1
xlsxwriter : None
lxml.etree : 4.6.2
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.19.0
pandas_datareader: None
bs4 : 4.9.3
bottleneck : None
fastparquet : None
fsspec : 0.8.5
gcsfs : None
matplotlib : 3.3.4
numba : None
numexpr : 2.7.2
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : 0.17.1
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.6.0
sqlalchemy : 1.3.22
tables : None
tabulate : 0.8.7
xarray : 0.16.2
xlrd : 2.0.1
xlwt : None
zstandard : 0.15.1

gaow added the Bug and Needs Triage labels Feb 3, 2022
phofl added the IO CSV and Needs Discussion labels and removed the Needs Triage label Feb 4, 2022
@phofl
Member

phofl commented Feb 4, 2022

As mentioned on the PR, my preference would be to keep this as is, but I am open to discussion.

@Marcos-C7

I ran into this problem because I assumed that a globally specified dtype really was global. I agree with gaow that global should mean global, unless the implementation creates more problems than it solves.

@DriesSchaumont
Member

DriesSchaumont commented Mar 14, 2022

This issue also confused me today. The docs seem to say:
"Use str or object together with suitable na_values settings to preserve and not interpret dtype"

But passing str or object as the dtype parameter in read_csv still lets pandas infer a dtype for the index. I am dealing with a fixed number of index columns and a variable number of data columns, and in that case I cannot find a way to stop pandas from inferring dtypes for at least some of the columns. If I specify only the index columns in a dict, the dtypes of the data columns are inferred; if I use str, the dtypes of the index columns are inferred. I tried using a defaultdict, but that does not seem to work either.
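For the case of fixed index columns and a variable number of data columns, a sketch of one way to avoid inference altogether is to read everything as str with no index_col and build the index afterwards. The sample data and the column names id and site are hypothetical, and keep_default_na=False is used here as one reading of the "suitable na_values settings" in the docs quote.

```python
from io import StringIO

import pandas as pd

# Hypothetical input: two index columns followed by any number of data columns.
data = "id,site,x,y\n001,7,1.50,NA\n002,8,2.25,3"
index_cols = ["id", "site"]  # assumed names of the fixed index columns

# dtype=str keeps every field as text; keep_default_na=False stops strings
# such as "NA" from being converted to missing values.
raw = pd.read_csv(StringIO(data), dtype=str, keep_default_na=False)

# Build the (multi-level) index only after parsing, so no dtype is inferred.
df = raw.set_index(index_cols)

print(list(df.index))
print(df["y"].tolist())
```

Leading zeros in id and the literal "NA" in y survive because nothing is interpreted at parse time; any conversion to numeric types can then be done explicitly per column.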
