IO Conflict Between index_col and dtype #25067

mbkupfer · 2019-02-01T00:40:27Z

Code Sample, a copy-pastable example if possible

I'm reading in an excel file using the code below. Assume that column 0 is my zip codes and column 2 is my adjustment factor that I'm mapping zips to.

# Your code here
 zip_adjustments_factors = pd.read_excel(fp, sheet_name='zip_mapping', 
    dtype={'zip': object}, index_col=0, usecols=[0,2])

In [1]: zip_adjustments_factors.index.dtype
Out[1]: dtype('int64')

Problem description

In my situation, I need zipcodes to be dtype of object since zipcodes can start with 0's, but the converter under-the-hood is converting these to int's and dropping the 0's.

This requires an extra step as a workaround:

zip_adjustments_factors = pd.read_excel(fp, sheet_name='zip_mapping', 
    dtype={'zip': object}, usecols=[0,2])
zip_adjustments_factors.set_index('zip')

In [1]: zip_adjustments_factors.set_index('zip').index.dtype
Out[1]: dtype('O')

But couldn't this be done within pd.read_excel by checking whether the index column parameter has a specified dtype parameter?

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.6.4.final.0
python-bits: 32
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.23.4
pytest: 3.7.1
pip: 18.1
setuptools: 40.5.0
Cython: None
numpy: 1.15.2
scipy: 1.2.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: 2.5.5
xlrd: 1.1.0
xlwt: None
xlsxwriter: 1.1.2
lxml: 4.1.1
bs4: 4.6.1
html5lib: 1.0.1
sqlalchemy: 1.2.12
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

The text was updated successfully, but these errors were encountered:

WillAyd · 2019-02-03T01:16:59Z

Can you provide a sample excel file for this?

mbkupfer · 2019-02-03T04:54:17Z

Of course. You can recreate with this:

zipcode-example.xlsx

WillAyd · 2019-04-24T16:32:15Z

The problem here is really the combination of dtype and index_col. As a workaround if you remove the latter can get the result you want:

df = pd.read_excel(fp, sheet_name='zip_mapping', dtype={'zip': object})
df = df.set_index('area_name')

Note that this seems to be a general IO issue not just excel:

pd.read_csv(io.StringIO("a,b\n01,02"), dtype=object, index_col=0)
    b
a
1  02

Investigation and PRs would certainly be welcome

WillAyd · 2019-04-24T16:34:27Z

Closing as a duplicate of #9435

WillAyd added IO Excel read_excel, to_excel Needs Info Clarification about behavior needed to assess issue labels Feb 3, 2019

WillAyd removed the Needs Info Clarification about behavior needed to assess issue label Apr 24, 2019

WillAyd added IO Data IO issues that don't fit into a more specific label and removed IO Excel read_excel, to_excel labels Apr 24, 2019

WillAyd changed the title ~~pd.read_excel's index_col ignores dtypes~~ IO Conflict Between index_col and dtype Apr 24, 2019

WillAyd closed this as completed Apr 24, 2019

WillAyd added the Duplicate Report Duplicate issue or pull request label Apr 24, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

IO Conflict Between index_col and dtype #25067

IO Conflict Between index_col and dtype #25067

mbkupfer commented Feb 1, 2019 •

edited

Loading

INSTALLED VERSIONS

WillAyd commented Feb 3, 2019

mbkupfer commented Feb 3, 2019

WillAyd commented Apr 24, 2019

WillAyd commented Apr 24, 2019

IO Conflict Between index_col and dtype #25067

IO Conflict Between index_col and dtype #25067

Comments

mbkupfer commented Feb 1, 2019 • edited Loading

Code Sample, a copy-pastable example if possible

Problem description

Output of pd.show_versions()

INSTALLED VERSIONS

WillAyd commented Feb 3, 2019

mbkupfer commented Feb 3, 2019

WillAyd commented Apr 24, 2019

WillAyd commented Apr 24, 2019

mbkupfer commented Feb 1, 2019 •

edited

Loading

Output of `pd.show_versions()`