Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Fix handling of encoding for the StataReader #21244 #21246

Closed

Conversation

adrian-castravete
Copy link

@adrian-castravete adrian-castravete commented May 29, 2018

@codecov
Copy link

codecov bot commented May 30, 2018

Codecov Report

Merging #21246 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master   #21246   +/-   ##
=======================================
  Coverage   91.84%   91.84%           
=======================================
  Files         153      153           
  Lines       49538    49538           
=======================================
  Hits        45499    45499           
  Misses       4039     4039
Flag Coverage Δ
#multiple 90.24% <ø> (ø) ⬆️
#single 41.87% <ø> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c85ab08...57c24f8. Read the comment docs.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it possible include / and/or generate a utf-8 encoded file for testing?

@@ -146,7 +146,8 @@ MultiIndex
I/O
^^^

-
- :func:`pandas.read_stata` now honours the ``encoding`` parameter, and supports the 'utf-8'
encoding.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add the issue number

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding now.

@jreback
Copy link
Contributor

jreback commented May 30, 2018

@bashtage can you have a look

@bashtage
Copy link
Contributor

You need to add a small Stata produce dta file that can replicate the issue and that this or fixes. Then please add a test that reads this dta file. It needs to be produced by Stata and not pandas.

@bashtage
Copy link
Contributor

118 does support utf8. But this needs to be tested using a real Stata file with both fixed width utf8 strings and StrL utf8 characters, and numbers.

@adrian-castravete
Copy link
Author

adrian-castravete commented May 30, 2018

I see. I thought that https://github.com/pandas-dev/pandas/blob/master/pandas/tests/io/data/stata14_118.dta contains Unicode, since this test TestStata.test_read_dta18 from https://github.com/pandas-dev/pandas/blob/master/pandas/tests/io/test_stata.py has a check for that and it passes.
Though indeed I don't know if the generated file was made by Stata or not.

Unfortunately I don't own a copy of the software and it's neither in my field to work with stata files as a Scientist/Statistician. :)
My use case is for a converter for a simple view of the file.

@bashtage
Copy link
Contributor

bashtage commented May 30, 2018 via email

@adrian-castravete
Copy link
Author

adrian-castravete commented May 30, 2018

When I tried encoding the string with unicode and then decoding it with latin-1 the string was different, It showed two weird characters instead of the expected Ü.

As for the encoding argument: I agree if the specs say that 118 is always Unicode then it would make sense to add the necessary ifs where applicable. I've seen parts of the code where the encoding gets set to latin-1.

If the encoding is to be determined via the version number, then this argument should also be removed and everything else like documentation or tests should be changed to reflect this.

Copy link
Contributor

@bashtage bashtage left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall I think the encoding should be changd form settable to not settable. This will require a deprecation in StataReader. Users should not be encouraged to set encoding in read_stata since it isn't really settable (either latin-1 or utf-8)

@@ -37,7 +37,8 @@
from pandas.util._decorators import deprecate_kwarg

VALID_ENCODINGS = ('ascii', 'us-ascii', 'latin-1', 'latin_1', 'iso-8859-1',
'iso8859-1', '8859', 'cp819', 'latin', 'latin1', 'L1')
'iso8859-1', '8859', 'cp819', 'latin', 'latin1', 'L1',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are not in general valid. The set of valid depends on the reader. For < 118 it is

('ascii', 'us-ascii', 'latin-1', 'latin_1', 'iso-8859-1',
'iso8859-1', '8859', 'cp819', 'latin', 'latin1', 'L1')

for 118 it is 'utf-8', 'utf8'.

@@ -1335,7 +1336,7 @@ def _calcsize(self, fmt):

def _decode(self, s):
s = s.partition(b"\0")[0]
return s.decode('utf-8')
return s.decode(self._encoding or self._default_encoding)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should not be changed.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting... this is the line that was causing all the problems with my converter. So _decode should only be used in >= 118, right?

@@ -99,9 +99,9 @@ def setup_method(self, method):

self.stata_dates = os.path.join(self.dirpath, 'stata13_dates.dta')

def read_dta(self, file):
def read_dta(self, file, encoding='latin-1'):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should use None as encoding so it can be overridden by the reader depending on the dta version.

# Legacy default reader configuration
return read_stata(file, convert_dates=True)
return read_stata(file, convert_dates=True, encoding=encoding)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same.

@bashtage
Copy link
Contributor

I took a look at the dta spec and it is stricter than pandas enforces. dta < 118 claim to use ASCII only although Stata internally displays and works with latin-1. dta 118 is utf-8 only.

@adrian-castravete
Copy link
Author

I see. I will continue with the added suggestions.

@bashtage
Copy link
Contributor

bashtage commented Jun 6, 2018

@adrian-castravete Master just got updated with a fix for this. If you have a chance could you try it out with your dta file? It should always use the correct encoding (automatically) now. If it fails, we might need to look into your dta file.

@bashtage
Copy link
Contributor

I think this has been resolved in master.

@jreback
Copy link
Contributor

jreback commented Jun 12, 2018

deprecated encoding by #21400 (and bug is already fixed in master)

@jreback jreback closed this Jun 12, 2018
@jreback jreback added this to the No action milestone Jun 12, 2018
@eirki eirki mentioned this pull request Jun 2, 2022
4 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO Stata read_stata, to_stata Unicode Unicode strings
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: read_stata always uses 'utf8'
3 participants