BUG: read_stata always uses 'utf8' #21244

Closed
adrian-castravete opened this issue May 29, 2018 · 25 comments · Fixed by #21279 or #21400
Labels
IO Stata read_stata, to_stata Unicode Unicode strings
Comments

@adrian-castravete

adrian-castravete commented May 29, 2018

Code Sample, a copy-pastable example if possible

import pandas
data = pandas.read_stata(file_with_latin1_encoding, chunksize=1048576)
for chunk in data:
    pass # do something with chunk (never reached)

This raises UnicodeDecodeError: 'utf8' codec can't decode byte 0x?? in position ?: invalid start byte.
OK, so the file isn't UTF-8. Although the StataReader doesn't document any Unicode support, I then tried opening it with a latin-1 encoding:

import pandas
data = pandas.read_stata(file_with_latin1_encoding, chunksize=1048576, encoding='latin-1')
for chunk in data:
    pass # do something with chunk (never reached)

This raises the same exception at exactly the same place (still utf-8).
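
The failure can be seen without a Stata file at all. The following is a minimal sketch, not pandas code: `try_decode` is a hypothetical helper, and the sample bytes are an assumption standing in for a latin-1 string read from the file.

```python
def try_decode(raw, encoding):
    """Return the decoded text, or None if the bytes are not valid in `encoding`."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


# "café" stored as latin-1: the é is the single byte 0xE9.
raw = "café".encode("latin-1")

print(try_decode(raw, "utf-8"))    # not valid UTF-8, so this is None
print(try_decode(raw, "latin-1"))  # decodes cleanly
```

Any latin-1 string with a non-ASCII character hits the same error when the reader decodes it as UTF-8, whatever `encoding` argument was passed.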

Problem description

This is a problem because it appears that read_stata doesn't honour the encoding argument.
I think this line introduced a bug. The StataReader doesn't manage any other type of data than ascii or latin-1.

Changing line 1338 of the pandas.io.stata module from:

        return s.decode('utf-8')

to:

        return s.decode('latin-1')

seemed to make everything work, and I could read the data from the given file.
Even better, changing it to the following:

        return s.decode(self._encoding or self._default_encoding)

also seems to have made it work.

I believe though, that if you want to make this work with Unicode too you'd have to add the following encodings to VALID_ENCODINGS: utf-8, utf8, iso10646.
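
As a rough illustration of the proposed change, here is a standalone sketch of a null-terminating decoder that honours a caller-supplied encoding. The function name and the `default` parameter are hypothetical; this is not the code in pandas.io.stata, only the idea behind `self._encoding or self._default_encoding`.

```python
def null_terminate(raw, encoding=None, default="latin-1"):
    """Cut the bytes at the first NUL and decode them.

    Uses `encoding` when given, otherwise falls back to `default`,
    mirroring the `self._encoding or self._default_encoding` idea.
    """
    text_bytes = raw.partition(b"\0")[0]
    return text_bytes.decode(encoding or default)


# Fixed-width Stata strings are NUL-padded; only the part before the NUL counts.
print(null_terminate(b"caf\xe9\x00\x00\x00"))                      # latin-1 by default
print(null_terminate("café".encode("utf-8") + b"\x00", "utf-8"))   # explicit encoding
```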

Expected Output

The file should be correctly read and parsed

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Linux
OS-release: 4.10.0-37-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: ro_RO.UTF-8
LANG: ro_RO.UTF-8
LOCALE: None.None

pandas: 0.24.0.dev0+41.gb2eec25
pytest: 3.2.3
pip: 9.0.3
setuptools: 36.6.0
Cython: 0.28.2
numpy: 1.13.3
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 5.1.0
sphinx: 1.6.3
patsy: None
dateutil: 2.7.3
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: 2.4.9
xlrd: 1.0.0
xlwt: 1.3.0
xlsxwriter: None
lxml: 3.8.0
bs4: None
html5lib: 0.999999999
sqlalchemy: 1.1.13
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

adrian-castravete pushed a commit to adrian-castravete/pandas that referenced this issue May 29, 2018
@toobaz toobaz added the IO Stata read_stata, to_stata label May 29, 2018
adrian-castravete pushed a commit to adrian-castravete/pandas that referenced this issue May 30, 2018
@jreback jreback added this to the 0.24.0 milestone May 30, 2018
@jreback jreback added the Unicode Unicode strings label May 30, 2018
@jreback jreback modified the milestones: 0.24.0, 0.23.1 Jun 5, 2018
@orthaeus

Perfectly solved the problem I was having, thank you.

@hudcap

hudcap commented Feb 24, 2019

I am still having issues with this. I'm using a format 118 Stata file, and I'm getting the same UnicodeDecodeError.
When I edit the stata.py file to use latin-1 as per @adrian-castravete, everything works.

@naranjja

naranjja commented Mar 1, 2019

Having the same issue just today. Changing line 1339 from site-packages/pandas/io/stata.py fixed it:

def _null_terminate(self, s):
    # have bytes not strings, so must decode
    s = s.partition(b"\0")[0]
    return s.decode('latin-1')  # instead of s.decode(self._encoding)

@hudcap

hudcap commented Apr 2, 2019

Can this bug please be reopened?

@jreback
Contributor

jreback commented Apr 2, 2019

If you have a self-contained example that reproduces this on master, please open a new issue.

@harmbuisman

Having the same issue just today. Changing line 1339 from site-packages/pandas/io/stata.py fixed it:

def _null_terminate(self, s):
    # have bytes not strings, so must decode
    s = s.partition(b"\0")[0]
    return s.decode('latin-1')  # instead of s.decode(self._encoding)

Thanks, this fixed my issue. Not sure why this issue is closed while the problem is still around, even though the issue doesn't contain the dataset to reproduce this. Problem description seems quite clear to me.

@jorisvandenbossche
Member

To all: this has been fixed by @bashtage with a fallback + warning in #25967
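
For readers wondering what "a fallback + warning" means in practice, the sketch below shows the general pattern under stated assumptions. `decode_with_fallback` is a hypothetical name, and the warning text and category are illustrative; the actual change lives in #25967 and may differ in detail.

```python
import warnings


def decode_with_fallback(raw, encoding="utf-8"):
    """Decode NUL-padded bytes, falling back to latin-1 with a warning."""
    text_bytes = raw.partition(b"\0")[0]
    try:
        return text_bytes.decode(encoding)
    except UnicodeDecodeError:
        # latin-1 maps every byte to a character, so this fallback cannot fail.
        warnings.warn(
            f"Could not decode string using {encoding}; falling back to latin-1.",
            UnicodeWarning,
        )
        return text_bytes.decode("latin-1")


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    value = decode_with_fallback(b"caf\xe9\x00\x00")

print(value, len(caught))
```

The trade-off of this design is that a mis-encoded file loads instead of failing, while the warning still tells the user that the file does not match its declared format.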

@bashtage
Contributor

bashtage commented Apr 5, 2019

@harmbuisman It was hard to produce a dataset that has this characteristic, since it can only be produced due to a bug in Stata. When Stata loads a latin-1 encoded 117 format file and saves it as 118, it incorrectly keeps the latin-1 encoding. This doesn't happen if a new file is created and then saved to 118 format.

@bashtage
Contributor

bashtage commented Apr 5, 2019

If this bug can be reproduced using master, please make sure you share a datafile (it could be a small extraction from a larger file, as long as the small extraction reproduces the issue), so that the structure of the file can be inspected.

@Larz60p

Larz60p commented May 19, 2019

The file 196slers1967to2016_20180908.dta has this problem.
It can be downloaded here: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/3WZFK9 Click download, and select original file format for 196slers1967to2016_20180908.tab (1st file)

@bashtage
Contributor

@Larz60p You should probably let Harvard know that their platform is not providing files that conform to the Stata dta file format spec.

@leolovethewayyoulie

Having the same issue just today. Changing line 1339 from site-packages/pandas/io/stata.py fixed it:

def _null_terminate(self, s):
    # have bytes not strings, so must decode
    s = s.partition(b"\0")[0]
    return s.decode('latin-1')  # instead of s.decode(self._encoding)

Hi, I am having the same issue. When I exported the Stata file to a csv file and read it with pd.read_csv("file csv", encoding = "latin-1"), it worked. But when I passed the same argument to pd.read_stata("file dta", encoding = "latin-1"), I got "Futurewarning encoding is...". Even when I tried your fixes (including the _null_terminate change), nothing changed.
Do you have any suggestions for me? Thank you!

@bashtage
Contributor

What version is the DTA file you are creating?

@leolovethewayyoulie

What version is the DTA file you are creating?

Stata 16
I checked this file and found that its encoding is "ISO-8859-1".
I have already exported the dta to csv, and reading the csv with that encoding worked.
But the problem with encoding in read_stata is: "C:\Users\USER\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: FutureWarning: the 'encoding' keyword is deprecated and will be removed in a future version. Please take steps to stop the use of 'encoding'
"""Entry point for launching an IPython kernel."
:(

@bashtage
Contributor

bashtage commented Mar 11, 2020 via email

@bashtage
Contributor

FWIW "ISO-8859-1" is latin-1.

@leolovethewayyoulie

Sure, but since the file is really large, I would rather send it by email. Can I have your email address? I will send my csv as well.
Thank you so much

@leolovethewayyoulie

leolovethewayyoulie commented Mar 11, 2020

FWIW "ISO-8859-1" is latin-1.

Yep, so what I'm trying to say is that the dta file is "latin-1" encoded, since the csv file exported from it can be read with encoding "ISO-8859-1". In other words, here is my situation:

  • a = pd.read_stata("E:\file.dta", encoding = "ISO-8859-1") --> doesn't work, result: "C:\Users\USER\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: FutureWarning: the 'encoding' keyword is deprecated and will be removed in a future version. Please take steps to stop the use of 'encoding'
    """Entry point for launching an IPython kernel."
  • b = pd.read_csv("E:\file(exported from file dta).csv", encoding="ISO-8859-1") --> works

@bashtage
Contributor

You could share with dropbox or google drive as well to kevin.k.sheppard@gmail.com

@leolovethewayyoulie

leolovethewayyoulie commented Mar 11, 2020

You could share with dropbox or google drive as well to kevin.k.sheppard@gmail.com

I have sent you my data through google drive
Thank you so much for your help!

@bashtage
Contributor

AFAICT pandas reads the file correctly. You get a warning that the file does not have the correct format. This warning is correct, since this is a Stata DTA 118 file, which must be UTF-8 encoded per Stata's dta documentation; however, it is latin-1 encoded. This happens when an older dta file is loaded into Stata and then saved in 118 format. If you think this should be fixed, you should contact Stata, since this is their bug.
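
To check which dta format a file claims to be (as opposed to which Stata release wrote it), one can look at its first bytes. This is a sketch under assumptions: that newer formats (117 and later) begin with the text `<stata_dta><header><release>NNN</release>` and that older formats store the version number in the first byte. `dta_release` and `expected_encoding` are hypothetical helpers, not pandas functions.

```python
import re

# Assumed opening of the XML-like header used by dta formats 117 and later.
_NEW_HEADER = re.compile(rb"<stata_dta><header><release>(\d{3})</release>")


def dta_release(head):
    """Return the dta format number from the first bytes of a file, or None."""
    match = _NEW_HEADER.match(head)
    if match:
        return int(match.group(1))
    # Assumption: older formats keep the version number in the first byte.
    return head[0] if head else None


def expected_encoding(release):
    """Encoding the thread says to expect: UTF-8 from 118 on, latin-1 before."""
    return "utf-8" if release >= 118 else "latin-1"


sample = b"<stata_dta><header><release>118</release><byteorder>LSF</byteorder>"
release = dta_release(sample)
print(release, expected_encoding(release))
```

For a real file, read the first 64 bytes or so in binary mode and pass them to `dta_release`. A file that reports 118 but only decodes as latin-1 is the mismatch described above.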

@bashtage
Contributor

Works in pandas 1.0.1.

@leolovethewayyoulie

Works in pandas 1.0.1.

Okie, I'll install pandas 1.0.1 to try
In the meantime, can you give me your command?
Thank you so much

@bashtage
Contributor

import pandas as pd
pd.read_stata("data.dta")

@leolovethewayyoulie

import pandas as pd
pd.read_stata("data.dta")

Haha, thank you so much dude,
since I install the newest version, it worked although it still has the warning but I guess it's alright @@
Thank you so much ❤️❤️❤️❤️❤️
