Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

df.to_stata fails when a column of type object contains only None #23572

Closed
kylebarron opened this issue Nov 8, 2018 · 20 comments · Fixed by #23718
Closed

df.to_stata fails when a column of type object contains only None #23572

kylebarron opened this issue Nov 8, 2018 · 20 comments · Fixed by #23718
Labels
Error Reporting Incorrect or improved errors from pandas IO Stata read_stata, to_stata
Milestone

Comments

@kylebarron
Copy link
Contributor

Code Sample, a copy-pastable example if possible

import pandas as pd
df = pd.DataFrame({'a': ['a', None]})
df.to_stata('test.dta')
df = pd.DataFrame({'a': [None, 'a']})
df.to_stata('test.dta')
df = pd.DataFrame({'a': [None, None]})
df.to_stata('test.dta')
# ValueError: Writing general object arrays is not supported

Problem description

The df.to_stata() method writes columns containing None without error when there is at least one string value in the column, but fails if the column contains only None. It's unclear what data type to write a column of None as, so maybe that's why this isn't supported? I would propose that a column with values of only None be written as str1 with empty strings.

I came across this error because I read in a Parquet file with pd.read_parquet() and was unable to write the file to Stata format. In the Parquet schema, the column had type BYTE_ARRAY UTF8, but since the column had only missing values, it was read into Pandas as only None.

Expected Output

Stata file written to disk with missing values for the column with None.

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-38-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: 3.3.2
pip: 18.0
setuptools: 40.0.0
Cython: 0.28.3
numpy: 1.14.2
scipy: 1.0.0
pyarrow: 0.11.1
xarray: None
IPython: 6.5.0
sphinx: 1.6.6
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2017.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.1.2
openpyxl: 2.4.10
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.1
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.1.14
pymysql: None
psycopg2: 2.7.5 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: 0.1.5
pandas_gbq: None
pandas_datareader: None
@WillAyd
Copy link
Member

WillAyd commented Nov 9, 2018

Thanks for the report. Investigation and PRs are always welcome!

@WillAyd WillAyd added Bug IO Stata read_stata, to_stata labels Nov 9, 2018
@WillAyd WillAyd added this to the Contributions Welcome milestone Nov 9, 2018
@bashtage
Copy link
Contributor

bashtage commented Nov 10, 2018

What should it write? Empty strings, Nans (with with type, integers will the missing keystone value, or ['None', 'None']? This is an ambiguous case and the best solution is to raise with an exception. TBH the case with ['a',None] should also probably raise and so that columns containing only strings should be written as string columns.

@kylebarron
Copy link
Contributor Author

My point was that None is already being coerced to an empty string whenever there is at least one string value in the column. So I didn't think it would be too much of a stretch to coerce None to an empty string when there are no strings in the column.

In [1]: import pandas as pd
   ...: df = pd.DataFrame({'a': ['a', None]})
   ...: df.to_stata('test.dta')
   ...: pd.read_stata('test.dta')
   ...:

Out[1]:
   index  a
0      0  a
1      1

@WillAyd
Copy link
Member

WillAyd commented Nov 12, 2018

I'm not familiar with stata but based off of the examples I agree with @bashtage here - I think it should raise rather than silently coerce for you. Raising would be an explicit sign to the user that they may want to replace or fill values before export, especially since its not a lossless round trip

@jtkiley
Copy link
Contributor

jtkiley commented Nov 12, 2018

It seems like there are two sub-issues being discussed:

  1. What should happen to a column of None? I'm not sure, but Stata reads an empty csv column as a byte:

screen shot 2018-11-12 at 10 44 52 am

  1. What should happen to a column of strings containing one or more Nones? In this case, I like the coercing behavior (see StataWriter for version 117 fails on None in a string column long enough to be a Stata StrL. #23633). Part of that is that I have a bunch of code in projects that relies on it, though I also think intent is a little clearer when the rest of the column is strings. I have a lot of datasets with text columns in them that tends to be sparse (e.g., correction text for newspaper articles, where corrections are decently rare overall), and they have a lot of Nones. I don't recall whether my parsing code does that directly or whether they're populated as None (cf. '') in some other way.

I'm also somewhat split on warning about it. On the one hand, it would be clearer for users. On the other, if you have a lot of code that relies on this, you'll get some warning fatigue. Another option is adding a parameter, but I wonder how often coercing is actually the wrong behavior.

@bashtage
Copy link
Contributor

It is tempting to not change the current behavior vis-a-vis mixed string and None. This would require a deprecation cycle, and while more explicit isn't likely to be an issue for users who use Stata and know that a column with a string must be a string.

The correct behavior in a column with Nones and dtype 'object' is less well defined IMO. FWIW this is why it fails here:

inferred_dtype = infer_dtype(column.dropna())

This drops na-like values, including None and so with mixed this infers string but when all None it infers
empty, which is correct.

One issue that afects the ambiguity here is that both datetime and string arrays can be object and are supported. Converting all None to strings is making a choice about the column type from these two candidates.

@jtkiley
Copy link
Contributor

jtkiley commented Nov 12, 2018

That makes sense to me about the mixed None and strings.

The all None case does seem more difficult. For this particular issue, I wonder if the better change is changing pd.read_parquet() to make a fully missing column of BYTE_ARRAY UTF8 into empty strings. That doesn't solve this issue directly, but it would address one case that leads into all None columns, and it has a clue to intent. Note that those internals are probably beyond my skillset, so if this suggestion doesn't make sense, that would be why.

Beyond that, maybe the error text could be better. I'd like to know that (a) the more descriptive reason is that it's all None, and (b) the two easiest solutions are to (i) drop the column or (ii) fill one or more rows with either values or type-specific missingness (i.e. '' or NaN).

@kylebarron
Copy link
Contributor Author

@jtkiley I looked into the Parquet details more and it turns out that a string of length 0 and a null string are two different values. So they map exactly to None and '' objects in Pandas.

It appears to me that the core issue is that Stata doesn't have a missing value for strings. In Pandas, '' is a string of length 0, not a missing value; None is the missing value for strings. This explains why Series.isna() returns False for the former and True for the latter. For that reason, I think I now agree with @bashtage that a column of None should not be automatically coerced to an empty string, because that's akin to replace null values with values of length zero.

@kylebarron
Copy link
Contributor Author

I linked a PR with a proposed changed error message that

  • mentions the issue is that the object column is non-string
  • gives the name of the offending column

@jtkiley
Copy link
Contributor

jtkiley commented Nov 12, 2018

@kylebarron Good sleuthing. So, just to be clear, you're only talking about raising in the all None case, right? What you've found in parquet makes sense to me.

So, I think my (revised) preferences are:

  1. Continue to coerce in df.to_stata() if the column has at least one string (and resolve StataWriter for version 117 fails on None in a string column long enough to be a Stata StrL. #23633 to be consistent with that).
  2. Don't change how parquet is ingested to preserve the difference between a zero-length string and a null string.
  3. No strong opinion on writing out a column full of None with df.to_stata(). The argument to coerce is trying to honor the user's intent in writing out the dataset. The argument to raise is that the intent as to the column isn't clear enough to make a choice for them (in contrast to the mixed None and string case), and implicit and wrong is a bad combo. The argument to warn is that you honor the writing intent and avoid the possibility of implicit and wrong, though it's practically the same as raising (i.e. more syntax either to suppress warnings or modify data) unless you like seeing warnings. Those are all valid in my view, and it's a matter of making tradeoffs.

Also, to get another look at a known source of Nones in string columns, I looked and found out why my private text parsing package generates None in string columns. It uses a lot of regexes with optional named capture groups, and they have a value of None when they don't match. They end up in the per document data/metadata dicts, and a list of those dicts is fed to pd.DataFrame().

@kylebarron
Copy link
Contributor Author

So, just to be clear, you're only talking about raising in the all None case, right?

Above I was referencing the all None case.

No strong opinion on writing out a column full of None with df.to_stata().

I agree with @bashtage raising an error is the best solution.

@bashtage
Copy link
Contributor

Are there still any remaining fixes/enhancements related to this issue?

@jreback
Copy link
Contributor

jreback commented Nov 14, 2018

@bashtage does #23692 cover this?

@jtkiley
Copy link
Contributor

jtkiley commented Nov 14, 2018

@jreback #23692 is the mixed strl and None case, and this is the all None case. I think the eventual consensus here was raising in the all None case, and @kylebarron's PR will make the error message more helpful.

@kylebarron
Copy link
Contributor Author

The PR @jtkiley is referring to is #23646.

@bashtage
Copy link
Contributor

FWIW the change that allows strl columns to contain None in 117 (#23692) also allows all None columns if the convert_strl column is set. I think this is correct since the conversion here is explicit since the user has directed to_stata that this column is a string.

import pandas as pd

sample_data2 = [
    {'str1': None, 'number': 0},
    {'str1': None, 'number': 1}
]
data2 = pd.DataFrame(sample_data2)
data2.to_stata('./sample2_117.dta', version=117, convert_strl=['str1'])

If convert_strl is not set, an error is raised.

bashtage added a commit to bashtage/pandas that referenced this issue Nov 15, 2018
Improve the error message shown when an object array is empty

closes pandas-dev#23572
@gfyoung gfyoung added Error Reporting Incorrect or improved errors from pandas and removed Bug labels Nov 15, 2018
bashtage added a commit to bashtage/pandas that referenced this issue Nov 15, 2018
Improve the error message shown when an object array is empty

closes pandas-dev#23572
@jreback jreback removed this from the Contributions Welcome milestone Nov 17, 2018
@jreback jreback added this to the 0.24.0 milestone Nov 17, 2018
jreback pushed a commit that referenced this issue Nov 17, 2018
* ENH: Improve error message for empty object array

Improve the error message shown when an object array is empty

closes #23572

* TST: Add tests for all None

Test exception is hit when all values in an object column are None
Extend the test for strl conversion to ensure this case passes (as expected)
tm9k1 pushed a commit to tm9k1/pandas that referenced this issue Nov 19, 2018
* ENH: Improve error message for empty object array

Improve the error message shown when an object array is empty

closes pandas-dev#23572

* TST: Add tests for all None

Test exception is hit when all values in an object column are None
Extend the test for strl conversion to ensure this case passes (as expected)
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this issue Feb 28, 2019
* ENH: Improve error message for empty object array

Improve the error message shown when an object array is empty

closes pandas-dev#23572

* TST: Add tests for all None

Test exception is hit when all values in an object column are None
Extend the test for strl conversion to ensure this case passes (as expected)
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this issue Feb 28, 2019
* ENH: Improve error message for empty object array

Improve the error message shown when an object array is empty

closes pandas-dev#23572

* TST: Add tests for all None

Test exception is hit when all values in an object column are None
Extend the test for strl conversion to ensure this case passes (as expected)
@diego898
Copy link

diego898 commented Dec 8, 2019

I was trying to save the results of querying datasets across several years, where column names change, etc. I ended up adding the following "hack" to get it to automatically set convert_strl to any column that is empty, for it to go through:

finalDF.to_stata('myfile.dta', version=117, convert_strl = finalDF.columns[finalDF.isnull().all()].tolist())

I'm sure this isn't "safe", but it might be worth adding an option to enable this in a "safe way".

@bashtage
Copy link
Contributor

bashtage commented Dec 8, 2019

Safer would be

finalDF.to_stata('myfile.dta', version=117, convert_strl = finalDF.columns[finalDF.isnull().all() & finalDF.dtypes==object].tolist())

so that you don't try to write all missing value numeric columns.

@diego898
Copy link

diego898 commented Dec 8, 2019

Great point thanks @bashtage! Do you think it might be worth adding an option to to_stata so that something like this happens automatically?

@mgao6767
Copy link

Safer would be

finalDF.to_stata('myfile.dta', version=117, convert_strl = finalDF.columns[finalDF.isnull().all() & finalDF.dtypes==object].tolist())

so that you don't try to write all missing value numeric columns.

Just want to add that there should be brackets around finalDF.dtypes==object since & has a higher priority than ==.

finalDF.to_stata('myfile.dta', version=117, convert_strl = finalDF.columns[finalDF.isnull().all() & (finalDF.dtypes==object)].tolist())

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Error Reporting Incorrect or improved errors from pandas IO Stata read_stata, to_stata
Projects
None yet
8 participants