df.to_stata fails when a column of type object contains only None #23572

kylebarron · 2018-11-08T17:40:12Z

Code Sample, a copy-pastable example if possible

import pandas as pd
df = pd.DataFrame({'a': ['a', None]})
df.to_stata('test.dta')
df = pd.DataFrame({'a': [None, 'a']})
df.to_stata('test.dta')
df = pd.DataFrame({'a': [None, None]})
df.to_stata('test.dta')
# ValueError: Writing general object arrays is not supported

Problem description

The df.to_stata() method writes columns containing None without error when there is at least one string value in the column, but fails if the column contains only None. It's unclear what data type to write a column of None as, so maybe that's why this isn't supported? I would propose that a column with values of only None be written as str1 with empty strings.

I came across this error because I read in a Parquet file with pd.read_parquet() and was unable to write the file to Stata format. In the Parquet schema, the column had type BYTE_ARRAY UTF8, but since the column had only missing values, it was read into Pandas as only None.

Expected Output

Stata file written to disk with missing values for the column with None.

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-38-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: 3.3.2
pip: 18.0
setuptools: 40.0.0
Cython: 0.28.3
numpy: 1.14.2
scipy: 1.0.0
pyarrow: 0.11.1
xarray: None
IPython: 6.5.0
sphinx: 1.6.6
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2017.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.1.2
openpyxl: 2.4.10
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.1
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.1.14
pymysql: None
psycopg2: 2.7.5 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: 0.1.5
pandas_gbq: None
pandas_datareader: None

The text was updated successfully, but these errors were encountered:

WillAyd · 2018-11-09T19:00:16Z

Thanks for the report. Investigation and PRs are always welcome!

bashtage · 2018-11-10T15:27:02Z

What should it write? Empty strings, Nans (with with type, integers will the missing keystone value, or ['None', 'None']? This is an ambiguous case and the best solution is to raise with an exception. TBH the case with ['a',None] should also probably raise and so that columns containing only strings should be written as string columns.

kylebarron · 2018-11-10T16:40:50Z

My point was that None is already being coerced to an empty string whenever there is at least one string value in the column. So I didn't think it would be too much of a stretch to coerce None to an empty string when there are no strings in the column.

In [1]: import pandas as pd
   ...: df = pd.DataFrame({'a': ['a', None]})
   ...: df.to_stata('test.dta')
   ...: pd.read_stata('test.dta')
   ...:

Out[1]:
   index  a
0      0  a
1      1

WillAyd · 2018-11-12T06:03:18Z

I'm not familiar with stata but based off of the examples I agree with @bashtage here - I think it should raise rather than silently coerce for you. Raising would be an explicit sign to the user that they may want to replace or fill values before export, especially since its not a lossless round trip

jtkiley · 2018-11-12T16:58:44Z

It seems like there are two sub-issues being discussed:

What should happen to a column of None? I'm not sure, but Stata reads an empty csv column as a byte:

What should happen to a column of strings containing one or more Nones? In this case, I like the coercing behavior (see StataWriter for version 117 fails on None in a string column long enough to be a Stata StrL. #23633). Part of that is that I have a bunch of code in projects that relies on it, though I also think intent is a little clearer when the rest of the column is strings. I have a lot of datasets with text columns in them that tends to be sparse (e.g., correction text for newspaper articles, where corrections are decently rare overall), and they have a lot of Nones. I don't recall whether my parsing code does that directly or whether they're populated as None (cf. '') in some other way.

I'm also somewhat split on warning about it. On the one hand, it would be clearer for users. On the other, if you have a lot of code that relies on this, you'll get some warning fatigue. Another option is adding a parameter, but I wonder how often coercing is actually the wrong behavior.

bashtage · 2018-11-12T17:56:59Z

It is tempting to not change the current behavior vis-a-vis mixed string and None. This would require a deprecation cycle, and while more explicit isn't likely to be an issue for users who use Stata and know that a column with a string must be a string.

The correct behavior in a column with Nones and dtype 'object' is less well defined IMO. FWIW this is why it fails here:

inferred_dtype = infer_dtype(column.dropna())

This drops na-like values, including None and so with mixed this infers string but when all None it infers
empty, which is correct.

One issue that afects the ambiguity here is that both datetime and string arrays can be object and are supported. Converting all None to strings is making a choice about the column type from these two candidates.

jtkiley · 2018-11-12T18:42:36Z

That makes sense to me about the mixed None and strings.

The all None case does seem more difficult. For this particular issue, I wonder if the better change is changing pd.read_parquet() to make a fully missing column of BYTE_ARRAY UTF8 into empty strings. That doesn't solve this issue directly, but it would address one case that leads into all None columns, and it has a clue to intent. Note that those internals are probably beyond my skillset, so if this suggestion doesn't make sense, that would be why.

Beyond that, maybe the error text could be better. I'd like to know that (a) the more descriptive reason is that it's all None, and (b) the two easiest solutions are to (i) drop the column or (ii) fill one or more rows with either values or type-specific missingness (i.e. '' or NaN).

kylebarron · 2018-11-12T19:33:32Z

@jtkiley I looked into the Parquet details more and it turns out that a string of length 0 and a null string are two different values. So they map exactly to None and '' objects in Pandas.

It appears to me that the core issue is that Stata doesn't have a missing value for strings. In Pandas, '' is a string of length 0, not a missing value; None is the missing value for strings. This explains why Series.isna() returns False for the former and True for the latter. For that reason, I think I now agree with @bashtage that a column of None should not be automatically coerced to an empty string, because that's akin to replace null values with values of length zero.

kylebarron · 2018-11-12T19:55:03Z

I linked a PR with a proposed changed error message that

mentions the issue is that the object column is non-string
gives the name of the offending column

jtkiley · 2018-11-12T21:03:44Z

@kylebarron Good sleuthing. So, just to be clear, you're only talking about raising in the all None case, right? What you've found in parquet makes sense to me.

So, I think my (revised) preferences are:

Continue to coerce in df.to_stata() if the column has at least one string (and resolve StataWriter for version 117 fails on None in a string column long enough to be a Stata StrL. #23633 to be consistent with that).
Don't change how parquet is ingested to preserve the difference between a zero-length string and a null string.
No strong opinion on writing out a column full of None with df.to_stata(). The argument to coerce is trying to honor the user's intent in writing out the dataset. The argument to raise is that the intent as to the column isn't clear enough to make a choice for them (in contrast to the mixed None and string case), and implicit and wrong is a bad combo. The argument to warn is that you honor the writing intent and avoid the possibility of implicit and wrong, though it's practically the same as raising (i.e. more syntax either to suppress warnings or modify data) unless you like seeing warnings. Those are all valid in my view, and it's a matter of making tradeoffs.

Also, to get another look at a known source of Nones in string columns, I looked and found out why my private text parsing package generates None in string columns. It uses a lot of regexes with optional named capture groups, and they have a value of None when they don't match. They end up in the per document data/metadata dicts, and a list of those dicts is fed to pd.DataFrame().

kylebarron · 2018-11-12T21:32:18Z

So, just to be clear, you're only talking about raising in the all None case, right?

Above I was referencing the all None case.

No strong opinion on writing out a column full of None with df.to_stata().

I agree with @bashtage raising an error is the best solution.

bashtage · 2018-11-14T12:59:11Z

Are there still any remaining fixes/enhancements related to this issue?

jreback · 2018-11-14T13:51:34Z

@bashtage does #23692 cover this?

jtkiley · 2018-11-14T13:58:14Z

@jreback #23692 is the mixed strl and None case, and this is the all None case. I think the eventual consensus here was raising in the all None case, and @kylebarron's PR will make the error message more helpful.

kylebarron · 2018-11-14T14:17:44Z

The PR @jtkiley is referring to is #23646.

bashtage · 2018-11-15T08:57:57Z

FWIW the change that allows strl columns to contain None in 117 (#23692) also allows all None columns if the convert_strl column is set. I think this is correct since the conversion here is explicit since the user has directed to_stata that this column is a string.

import pandas as pd

sample_data2 = [
    {'str1': None, 'number': 0},
    {'str1': None, 'number': 1}
]
data2 = pd.DataFrame(sample_data2)
data2.to_stata('./sample2_117.dta', version=117, convert_strl=['str1'])

If convert_strl is not set, an error is raised.

Improve the error message shown when an object array is empty closes pandas-dev#23572

* ENH: Improve error message for empty object array Improve the error message shown when an object array is empty closes #23572 * TST: Add tests for all None Test exception is hit when all values in an object column are None Extend the test for strl conversion to ensure this case passes (as expected)

* ENH: Improve error message for empty object array Improve the error message shown when an object array is empty closes pandas-dev#23572 * TST: Add tests for all None Test exception is hit when all values in an object column are None Extend the test for strl conversion to ensure this case passes (as expected)

diego898 · 2019-12-08T17:32:54Z

I was trying to save the results of querying datasets across several years, where column names change, etc. I ended up adding the following "hack" to get it to automatically set convert_strl to any column that is empty, for it to go through:

finalDF.to_stata('myfile.dta', version=117, convert_strl = finalDF.columns[finalDF.isnull().all()].tolist())

I'm sure this isn't "safe", but it might be worth adding an option to enable this in a "safe way".

bashtage · 2019-12-08T20:00:10Z

Safer would be

finalDF.to_stata('myfile.dta', version=117, convert_strl = finalDF.columns[finalDF.isnull().all() & finalDF.dtypes==object].tolist())

so that you don't try to write all missing value numeric columns.

diego898 · 2019-12-08T20:37:44Z

Great point thanks @bashtage! Do you think it might be worth adding an option to to_stata so that something like this happens automatically?

mgao6767 · 2021-05-23T00:20:27Z

Safer would be
finalDF.to_stata('myfile.dta', version=117, convert_strl = finalDF.columns[finalDF.isnull().all() & finalDF.dtypes==object].tolist())
so that you don't try to write all missing value numeric columns.

Just want to add that there should be brackets around finalDF.dtypes==object since & has a higher priority than ==.

finalDF.to_stata('myfile.dta', version=117, convert_strl = finalDF.columns[finalDF.isnull().all() & (finalDF.dtypes==object)].tolist())

WillAyd added Bug IO Stata read_stata, to_stata labels Nov 9, 2018

WillAyd added this to the Contributions Welcome milestone Nov 9, 2018

kylebarron mentioned this issue Nov 12, 2018

StataWriter for version 117 fails on None in a string column long enough to be a Stata StrL. #23633

Closed

kylebarron pushed a commit to kylebarron/pandas that referenced this issue Nov 12, 2018

More helpful Stata object dtype error. (pandas-dev#23572)

b10f728

kylebarron mentioned this issue Nov 12, 2018

More helpful Stata object dtype error. (#23572) #23646

Closed

4 tasks

kylebarron closed this as completed Nov 12, 2018

kylebarron reopened this Nov 12, 2018

bashtage added a commit to bashtage/pandas that referenced this issue Nov 15, 2018

ENH: Improve error message for empty object array

3eb95fd

Improve the error message shown when an object array is empty closes pandas-dev#23572

bashtage mentioned this issue Nov 15, 2018

ENH: Improve error message for empty object array #23718

Merged

4 tasks

gfyoung added Error Reporting Incorrect or improved errors from pandas and removed Bug labels Nov 15, 2018

bashtage added a commit to bashtage/pandas that referenced this issue Nov 15, 2018

ENH: Improve error message for empty object array

904576a

Improve the error message shown when an object array is empty closes pandas-dev#23572

jreback removed this from the Contributions Welcome milestone Nov 17, 2018

jreback added this to the 0.24.0 milestone Nov 17, 2018

jreback closed this as completed in #23718 Nov 17, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

df.to_stata fails when a column of type object contains only None #23572

df.to_stata fails when a column of type object contains only None #23572

kylebarron commented Nov 8, 2018

WillAyd commented Nov 9, 2018

bashtage commented Nov 10, 2018 •

edited

Loading

kylebarron commented Nov 10, 2018

WillAyd commented Nov 12, 2018

jtkiley commented Nov 12, 2018

bashtage commented Nov 12, 2018

jtkiley commented Nov 12, 2018

kylebarron commented Nov 12, 2018

kylebarron commented Nov 12, 2018

jtkiley commented Nov 12, 2018

kylebarron commented Nov 12, 2018

bashtage commented Nov 14, 2018

jreback commented Nov 14, 2018

jtkiley commented Nov 14, 2018

kylebarron commented Nov 14, 2018

bashtage commented Nov 15, 2018

diego898 commented Dec 8, 2019

bashtage commented Dec 8, 2019

diego898 commented Dec 8, 2019

mgao6767 commented May 23, 2021

df.to_stata fails when a column of type object contains only None #23572

df.to_stata fails when a column of type object contains only None #23572

Comments

kylebarron commented Nov 8, 2018

Code Sample, a copy-pastable example if possible

Problem description

Expected Output

Output of pd.show_versions()

WillAyd commented Nov 9, 2018

bashtage commented Nov 10, 2018 • edited Loading

kylebarron commented Nov 10, 2018

WillAyd commented Nov 12, 2018

jtkiley commented Nov 12, 2018

bashtage commented Nov 12, 2018

jtkiley commented Nov 12, 2018

kylebarron commented Nov 12, 2018

kylebarron commented Nov 12, 2018

jtkiley commented Nov 12, 2018

kylebarron commented Nov 12, 2018

bashtage commented Nov 14, 2018

jreback commented Nov 14, 2018

jtkiley commented Nov 14, 2018

kylebarron commented Nov 14, 2018

bashtage commented Nov 15, 2018

diego898 commented Dec 8, 2019

bashtage commented Dec 8, 2019

diego898 commented Dec 8, 2019

mgao6767 commented May 23, 2021

Output of `pd.show_versions()`

bashtage commented Nov 10, 2018 •

edited

Loading