BUG: `read_stata` always uses 'utf8' #21244

adrian-castravete · 2018-05-29T11:39:28Z

Code Sample, a copy-pastable example if possible

import pandas
data = pandas.read_stata(file_with_latin1_encoding, chunksize=1048576)
for chunk in data:
    pass # do something with chunk (never reached)

This raises UnicodeDecodeError: 'utf8' codec can't decode byte 0x?? in position ?: invalid start byte.
OK, so the file isn't UTF-8. Even though the StataReader doesn't claim any Unicode support, I then tried opening it with a latin-1 encoding:

import pandas
data = pandas.read_stata(file_with_latin1_encoding, chunksize=1048576, encoding='latin-1')
for chunk in data:
    pass # do something with chunk (never reached)

This raises the same exception at exactly the same place (still utf-8).
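
For context, the error itself can be reproduced without any Stata file: bytes holding non-ASCII latin-1 text are usually not valid UTF-8. A minimal sketch, where the sample string and the helper `try_decode` are made up for illustration and are not part of pandas:

```python
# Hypothetical sample: "café" encoded as latin-1; the byte 0xE9 is "é".
raw = "café".encode("latin-1")


def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid for `encoding`."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# Fails: 0xE9 opens a multi-byte UTF-8 sequence that is never completed.
as_utf8 = try_decode(raw, "utf-8")
# Succeeds: latin-1 maps every possible byte to a code point.
as_latin1 = try_decode(raw, "latin-1")
```

Because latin-1 accepts every byte value, decoding with it never raises, which is why hard-coding it makes the error disappear even when it is not the right encoding.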

Problem description

This is a problem because it appears that read_stata doesn't honour the encoding argument.
I think this line introduced a bug. The StataReader doesn't manage any other type of data than ascii or latin-1.

Changing the line 1338 of the pandas.io.stata module:

        return s.decode('utf-8')

to:

        return s.decode('latin-1')

Seemed to make everything work and I could read the data from the given file.
Even better, changing it to the following:

        return s.decode(self._encoding or self._default_encoding)

also seems to have made it work.
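
The proposed change can be tried in isolation with a small stand-in class. This is only a sketch: `ReaderSketch` is hypothetical, and the value `"latin-1"` for `_default_encoding` is an assumption made for the example, not taken from the pandas source.

```python
class ReaderSketch:
    """Hypothetical stand-in for the reader; not the pandas StataReader."""

    # Assumed default for this sketch only.
    _default_encoding = "latin-1"

    def __init__(self, encoding=None):
        self._encoding = encoding

    def _null_terminate(self, s):
        # Keep only the bytes before the first NUL, then decode with the
        # requested encoding, or the default when none was given.
        s = s.partition(b"\0")[0]
        return s.decode(self._encoding or self._default_encoding)


padded_latin1 = "café".encode("latin-1") + b"\0\0\0"
padded_utf8 = "café".encode("utf-8") + b"\0"

default_result = ReaderSketch()._null_terminate(padded_latin1)
utf8_result = ReaderSketch(encoding="utf-8")._null_terminate(padded_utf8)
```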

I believe though, that if you want to make this work with Unicode too you'd have to add the following encodings to VALID_ENCODINGS: utf-8, utf8, iso10646.

Expected Output

The file should be correctly read and parsed

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Linux
OS-release: 4.10.0-37-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: ro_RO.UTF-8
LANG: ro_RO.UTF-8
LOCALE: None.None

pandas: 0.24.0.dev0+41.gb2eec25
pytest: 3.2.3
pip: 9.0.3
setuptools: 36.6.0
Cython: 0.28.2
numpy: 1.13.3
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 5.1.0
sphinx: 1.6.3
patsy: None
dateutil: 2.7.3
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: 2.4.9
xlrd: 1.0.0
xlwt: 1.3.0
xlsxwriter: None
lxml: 3.8.0
bs4: None
html5lib: 0.999999999
sqlalchemy: 1.1.13
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None


orthaeus · 2019-02-19T23:41:51Z

Perfectly solved the problem I was having, thank you.

hudcap · 2019-02-24T00:59:15Z

I am still having issues with this. I'm using a 118 Stata file, and I'm getting the same UnicodeDecodeError.
When I edit the stata.py file to use latin-1 as per @adrian-castravete, everything works.

naranjja · 2019-03-01T03:21:20Z

Having the same issue just today. Changing line 1339 from site-packages/pandas/io/stata.py fixed it:

def _null_terminate(self, s):
    # have bytes not strings, so must decode
    s = s.partition(b"\0")[0]
    return s.decode('latin-1')  # instead of s.decode(self._encoding)

hudcap · 2019-04-02T02:49:41Z

Can this bug please be reopened?

jreback · 2019-04-02T04:30:05Z

If you have a self-contained example that reproduces this on master, please open a new issue.

harmbuisman · 2019-04-05T14:17:13Z

Having the same issue just today. Changing line 1339 from site-packages/pandas/io/stata.py fixed it:
def _null_terminate(self, s):
    # have bytes not strings, so must decode
    s = s.partition(b"\0")[0]
    return s.decode('latin-1')  # instead of s.decode(self._encoding)

Thanks, this fixed my issue. Not sure why this issue is closed while the problem is still around, even though the issue doesn't contain the dataset to reproduce this. Problem description seems quite clear to me.

jorisvandenbossche · 2019-04-05T14:47:21Z

To all: this has been fixed by @bashtage with a fallback + warning in #25967
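
As a rough illustration of what a "fallback + warning" approach can look like, here is a standalone sketch. It is not the code from #25967; the function name `decode_with_fallback`, its parameters, and the warning text are assumptions for this example.

```python
import warnings


def decode_with_fallback(raw, encoding="utf-8", fallback="latin-1"):
    """Decode NUL-padded bytes, warning and falling back if `encoding` fails."""
    raw = raw.partition(b"\0")[0]
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        warnings.warn(
            f"Could not decode bytes as {encoding}; falling back to {fallback}.",
            UnicodeWarning,
        )
        return raw.decode(fallback)


# "Müller" in latin-1 contains 0xFC, which is not a valid UTF-8 start byte.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    text = decode_with_fallback("Müller".encode("latin-1") + b"\0\0")
```

With this shape the caller still gets the data back, and the warning signals that the file did not match its declared encoding.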

bashtage · 2019-04-05T15:09:05Z

@harmbuisman It was hard to produce a dataset that has this characteristic since it can only be produced due to a bug in Stata. Stata incorrectly writes latin-1 encoded 117 format files with latin-1 encoding when saving as 118. This doesn't happen if a new file is created and then saved to 118 format.
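
To check which format a given file claims to be, one can look at its first bytes. The sketch below assumes that 117 and later files start with the tag sequence `<stata_dta><header><release>NNN</release>` and that older formats store the version number in the first byte; `dta_release` and both sample headers are hypothetical.

```python
import re


def dta_release(header_bytes):
    """Best-effort guess of the dta format version from a file's first bytes."""
    match = re.match(rb"<stata_dta><header><release>(\d{3})</release>", header_bytes)
    if match:
        return int(match.group(1))
    # Assumption: pre-117 formats keep the version number in the first byte.
    return header_bytes[0] if header_bytes else None


# Made-up headers for illustration; a real one would come from
# open(path, "rb").read(64).
sample_118 = b"<stata_dta><header><release>118</release><byteorder>LSF</byteorder>"
sample_old = bytes([114, 2, 1, 0])

release_new = dta_release(sample_118)
release_old = dta_release(sample_old)
```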

bashtage · 2019-04-05T15:10:49Z

If this bug can be reproduced using master, please make sure you share a datafile (it could be a small extraction from a larger file, as long as the small extraction reproduced the issue), so that the structure of the file can be inspected.

Larz60p · 2019-05-19T08:59:36Z

The file 196slers1967to2016_20180908.dta has this problem.
It can be downloaded here: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/3WZFK9 Click download, and select original file format for 196slers1967to2016_20180908.tab (1st file)

bashtage · 2019-05-19T10:41:30Z

@Larz60p You should probably let Harvard know that their platform is not providing files that conform to the Stata dta file format spec.

leolovethewayyoulie · 2020-03-11T10:41:57Z

Having the same issue just today. Changing line 1339 from site-packages/pandas/io/stata.py fixed it:
def _null_terminate(self, s):
    # have bytes not strings, so must decode
    s = s.partition(b"\0")[0]
    return s.decode('latin-1')  # instead of s.decode(self._encoding)

Hi, I am having the same issue. When I exported the Stata file to CSV and called pd.read_csv("file csv", encoding="latin-1"), it worked. But when I passed the same argument to pd.read_stata("file dta", encoding="latin-1"), I got "FutureWarning: the 'encoding' keyword is..." instead. Even after trying the suggestions above (including the _null_terminate change), nothing changed.
Do you have any suggestions for me? Thank you!

bashtage · 2020-03-11T10:45:15Z

What version is the DTA file you are creating?

leolovethewayyoulie · 2020-03-11T21:00:40Z

What version is the DTA file you are creating?

Stata 16.
I looked into this version and found that its encoding is "ISO-8859-1".
I have already exported the dta to CSV, and reading that with the encoding argument worked.
But with encoding in read_stata I get:
C:\Users\USER\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: FutureWarning: the 'encoding' keyword is deprecated and will be removed in a future version. Please take steps to stop the use of 'encoding'
  """Entry point for launching an IPython kernel.
:(

bashtage · 2020-03-11T21:44:26Z

Can you share the dta file so I can take a look?

bashtage · 2020-03-11T22:03:20Z

FWIW "ISO-8859-1" is latin-1.
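
This can be checked from Python's standard library, since `codecs.lookup` resolves aliases to a single codec. A small sketch, with variable names chosen only for this example:

```python
import codecs

# Different spellings that should all resolve to the same codec.
names = ["ISO-8859-1", "latin-1", "latin1"]
codec_names = {codecs.lookup(name).name for name in names}
same_codec = len(codec_names) == 1

# Encoding with either spelling produces identical bytes.
roundtrip = "é".encode("ISO-8859-1") == "é".encode("latin-1")
```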

leolovethewayyoulie · 2020-03-11T22:04:44Z

Sure, but since the file is really large, I might send it by email. Can I have your email address? I will send my CSV as well.
Thank you so much

leolovethewayyoulie · 2020-03-11T22:09:57Z

FWIW "ISO-8859-1" is latin-1.

Yep, so what I'm trying to say is that the dta file is latin-1 encoded, since the CSV file exported from it can be read with encoding="ISO-8859-1". In other words, here is my situation:

a = pd.read_stata(r"E:\file.dta", encoding="ISO-8859-1") --> doesn't work, result:
C:\Users\USER\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: FutureWarning: the 'encoding' keyword is deprecated and will be removed in a future version. Please take steps to stop the use of 'encoding'
  """Entry point for launching an IPython kernel.
b = pd.read_csv(r"E:\file(exported from file dta).csv", encoding="ISO-8859-1") --> worked

bashtage · 2020-03-11T22:18:35Z

You could share with dropbox or google drive as well to kevin.k.sheppard@gmail.com

leolovethewayyoulie · 2020-03-11T22:33:54Z

You could share with dropbox or google drive as well to kevin.k.sheppard@gmail.com

I have sent you my data through google drive
Thank you so much for your help!

bashtage · 2020-03-11T22:59:26Z

AFAICT pandas reads the file correctly. You get a warning that the file does not have the correct format. This warning is correct, since this is a Stata DTA 118 file, which must be UTF-8 encoded per Stata's dta documentation. However, it is latin-1 encoded. This happens when an older dta file is loaded into Stata and then saved in 118 format. If you think this should be fixed, you should contact Stata since this is their bug.

bashtage · 2020-03-11T23:00:19Z

Works in pandas 1.0.1.

leolovethewayyoulie · 2020-03-11T23:09:35Z

Works in pandas 1.0.1.

Okie, I'll install pandas 1.0.1 to try
In the meantime, can you give me your command?
Thank you so much

bashtage · 2020-03-11T23:15:28Z

import pandas as pd
pd.read_stata("data.dta")

leolovethewayyoulie · 2020-03-11T23:17:18Z

import pandas as pd
pd.read_stata("data.dta")

Haha, thank you so much dude,
since I install the newest version, it worked although it still has the warning but I guess it's alright @@
Thank you so much ❤️❤️❤️❤️❤️

adrian-castravete pushed a commit to adrian-castravete/pandas that referenced this issue May 29, 2018

BUG: Fix handling of encoding for the StataReader pandas-dev#21244

a1e2975

adrian-castravete pushed a commit to adrian-castravete/pandas that referenced this issue May 29, 2018

BUG: Fix handling of encoding for the StataReader pandas-dev#21244

a65b8ef

adrian-castravete mentioned this issue May 29, 2018

BUG: Fix handling of encoding for the StataReader #21244 #21246

Closed

4 tasks

adrian-castravete pushed a commit to adrian-castravete/pandas that referenced this issue May 29, 2018

BUG: Fix handling of encoding for the StataReader pandas-dev#21244

7c022f9

toobaz added the IO Stata read_stata, to_stata label May 29, 2018

adrian-castravete pushed a commit to adrian-castravete/pandas that referenced this issue May 30, 2018

BUG: Fix handling of encoding for the StataReader pandas-dev#21244

a1efb5d

adrian-castravete pushed a commit to adrian-castravete/pandas that referenced this issue May 30, 2018

BUG: Fix handling of encoding for the StataReader pandas-dev#21244

adb0918

adrian-castravete pushed a commit to adrian-castravete/pandas that referenced this issue May 30, 2018

BUG: Fix handling of encoding for the StataReader pandas-dev#21244

b291a30

adrian-castravete pushed a commit to adrian-castravete/pandas that referenced this issue May 30, 2018

BUG: Fix handling of encoding for the StataReader pandas-dev#21244

2968c59

adrian-castravete pushed a commit to adrian-castravete/pandas that referenced this issue May 30, 2018

BUG: Fix handling of encoding for the StataReader pandas-dev#21244

cacb391

jreback added this to the 0.24.0 milestone May 30, 2018

jreback added the Unicode Unicode strings label May 30, 2018

adrian-castravete pushed a commit to adrian-castravete/pandas that referenced this issue May 30, 2018

BUG: Fix handling of encoding for the StataReader pandas-dev#21244

57c24f8

bashtage mentioned this issue May 31, 2018

BUG: Fix encoding for Stata format 118 files #21279

Merged

4 tasks

jreback modified the milestones: 0.24.0, 0.23.1 Jun 5, 2018

jorisvandenbossche closed this as completed in #21279 Jun 6, 2018

bashtage mentioned this issue Jun 9, 2018

MAINT: Deprecate encoding from stata reader/writer #21400

Merged

4 tasks

hudcap mentioned this issue Apr 2, 2019

UnicodeDecodeError for Stata file #25960

Closed

hansendx mentioned this issue Sep 24, 2019

German umlauts in labels are decoded incorrectly ddionrails/collect_stata#48

Closed
