UnicodeDecodeError with Latin-1 characters in Stata files #23736

Closed
yatharth opened this issue Nov 16, 2018 · 2 comments
Labels
Needs Info Clarification about behavior needed to assess issue

Comments

@yatharth

Steps to reproduce

df = pd.read_stata('buggy_file.dta')

Expected behaviour

Pandas reads the stata file just fine.

Actual behaviour

Pandas raises a UnicodeDecodeError, traceable back to the _decode method of StataReader (see Diagnosis).

Diagnosis

The error is caused by a “smart quote” character, which is encoded in Latin-1 in the Stata .dta file but is an invalid byte sequence when decoded as UTF-8.
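To illustrate the kind of failure described here, the following is a minimal sketch, not taken from the reporter's file. The sample bytes are an assumption: 0x92 is the byte that Windows-1252 uses for a right single "smart quote", and the exact byte in the original .dta file is not shown in this report.

```python
# Hypothetical sample: a null-padded, fixed-width string field such as a
# .dta file might contain. The byte 0x92 is an assumed stand-in for the
# "smart quote"; the real file's bytes are not shown in the report.
raw = b"don\x92t\x00\x00\x00"

# Same truncation step as in _decode: keep everything before the first NUL.
s = raw.partition(b"\0")[0]

# Decoding as UTF-8 fails because 0x92 cannot start a UTF-8 sequence.
try:
    s.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Latin-1 maps every byte to a code point, so this never raises.
latin1_text = s.decode("latin-1")

# Windows-1252 maps 0x92 to U+2019, the typographic apostrophe.
cp1252_text = s.decode("cp1252")

print(utf8_ok, repr(latin1_text), repr(cp1252_text))
```

Note that Latin-1 decoding succeeds but yields the control character U+0092, while Windows-1252 yields the visible quote, so which single-byte encoding is appropriate depends on how the file was written.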

The error originates in the StataReader class in io/stata.py:

    def _decode(self, s):
        s = s.partition(b"\0")[0]
        return s.decode('utf-8')

Instead of hard-coding 'utf-8', Pandas should use self._encoding or self._default_encoding, as other parts of the code do when reading from the input buffer/file. Making this change on my machine makes the issue go away.
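The proposed change can be sketched in isolation as follows. This is not the pandas implementation: ReaderSketch is a hypothetical stand-in class, only the _encoding attribute name comes from the report, and the 'latin-1' default is an assumption made for the example.

```python
class ReaderSketch:
    """Hypothetical stand-in for StataReader; models only string decoding."""

    def __init__(self, encoding="latin-1"):
        # Assumed default for illustration; the real reader picks its
        # encoding based on the file, which is not modeled here.
        self._encoding = encoding

    def _decode(self, s):
        # Drop the NUL padding, then decode with the reader's configured
        # encoding instead of a hard-coded 'utf-8'.
        s = s.partition(b"\0")[0]
        return s.decode(self._encoding)


reader = ReaderSketch()
decoded = reader._decode(b"caf\xe9\x00\x00")  # 0xE9 is "é" in Latin-1
print(decoded)
```

With the encoding stored on the instance, a reader constructed with 'utf-8' behaves as before for UTF-8 files, while a Latin-1 reader no longer fails on single high bytes.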

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.5.final.0
python-bits: 64
OS: Darwin
OS-release: 17.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.3
pytest: None
pip: 10.0.1
setuptools: 39.2.0
Cython: None
numpy: 1.13.1
scipy: 0.19.1
xarray: None
IPython: 6.1.0
sphinx: 1.6.3
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 0.9999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

@TomAugspurger
Contributor

Can you try on master? This may have been fixed by #21400.

@TomAugspurger added the "Needs Info" label (clarification about behavior needed to assess issue) on Nov 16, 2018
@yatharth
Author

Unable to reproduce 😞; closing!
