Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IO Conflict Between index_col and dtype #25067

Closed
mbkupfer opened this issue Feb 1, 2019 · 4 comments
Closed

IO Conflict Between index_col and dtype #25067

mbkupfer opened this issue Feb 1, 2019 · 4 comments
Labels
Duplicate Report Duplicate issue or pull request IO Data IO issues that don't fit into a more specific label

Comments

@mbkupfer
Copy link
Contributor

mbkupfer commented Feb 1, 2019

Code Sample, a copy-pastable example if possible

I'm reading in an excel file using the code below. Assume that column 0 is my zip codes and column 2 is my adjustment factor that I'm mapping zips to.

# Your code here
 zip_adjustments_factors = pd.read_excel(fp, sheet_name='zip_mapping', 
    dtype={'zip': object}, index_col=0, usecols=[0,2])

In [1]: zip_adjustments_factors.index.dtype
Out[1]: dtype('int64')

Problem description

In my situation, I need zipcodes to be dtype of object since zipcodes can start with 0's, but the converter under-the-hood is converting these to int's and dropping the 0's.

This requires an extra step as a workaround:

zip_adjustments_factors = pd.read_excel(fp, sheet_name='zip_mapping', 
    dtype={'zip': object}, usecols=[0,2])
zip_adjustments_factors.set_index('zip')

In [1]: zip_adjustments_factors.set_index('zip').index.dtype
Out[1]: dtype('O')

But couldn't this be done within pd.read_excel by checking whether the index column parameter has a specified dtype parameter?

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.4.final.0
python-bits: 32
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.23.4
pytest: 3.7.1
pip: 18.1
setuptools: 40.5.0
Cython: None
numpy: 1.15.2
scipy: 1.2.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: 2.5.5
xlrd: 1.1.0
xlwt: None
xlsxwriter: 1.1.2
lxml: 4.1.1
bs4: 4.6.1
html5lib: 1.0.1
sqlalchemy: 1.2.12
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@WillAyd WillAyd added IO Excel read_excel, to_excel Needs Info Clarification about behavior needed to assess issue labels Feb 3, 2019
@WillAyd
Copy link
Member

WillAyd commented Feb 3, 2019

Can you provide a sample excel file for this?

@mbkupfer
Copy link
Contributor Author

mbkupfer commented Feb 3, 2019

Of course. You can recreate with this:

zipcode-example.xlsx

@WillAyd WillAyd removed the Needs Info Clarification about behavior needed to assess issue label Apr 24, 2019
@WillAyd
Copy link
Member

WillAyd commented Apr 24, 2019

The problem here is really the combination of dtype and index_col. As a workaround if you remove the latter can get the result you want:

df = pd.read_excel(fp, sheet_name='zip_mapping', dtype={'zip': object})
df = df.set_index('area_name')

Note that this seems to be a general IO issue not just excel:

pd.read_csv(io.StringIO("a,b\n01,02"), dtype=object, index_col=0)
    b
a
1  02

Investigation and PRs would certainly be welcome

@WillAyd WillAyd added IO Data IO issues that don't fit into a more specific label and removed IO Excel read_excel, to_excel labels Apr 24, 2019
@WillAyd WillAyd changed the title pd.read_excel's index_col ignores dtypes IO Conflict Between index_col and dtype Apr 24, 2019
@WillAyd
Copy link
Member

WillAyd commented Apr 24, 2019

Closing as a duplicate of #9435

@WillAyd WillAyd closed this as completed Apr 24, 2019
@WillAyd WillAyd added the Duplicate Report Duplicate issue or pull request label Apr 24, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Duplicate Report Duplicate issue or pull request IO Data IO issues that don't fit into a more specific label
Projects
None yet
Development

No branches or pull requests

2 participants