ENH: Allow export of mixed columns to Stata strl #23692

Merged
merged 1 commit into from
Nov 14, 2018

bashtage
Contributor

Enable export of large columns to Stata strls when the column
contains None as a null value

closes #23633

@codecov

codecov bot commented Nov 14, 2018

Codecov Report

Merging #23692 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master   #23692   +/-   ##
=======================================
  Coverage   92.24%   92.24%           
=======================================
  Files         161      161           
  Lines       51318    51318           
=======================================
  Hits        47339    47339           
  Misses       3979     3979
Flag Coverage Δ
#multiple 90.63% <ø> (ø) ⬆️
#single 42.31% <ø> (ø) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update a197837...75f9d80.

@jreback jreback added Enhancement IO Stata read_stata, to_stata labels Nov 14, 2018
@jreback jreback added this to the 0.24.0 milestone Nov 14, 2018
@jreback
Contributor

jreback commented Nov 14, 2018

Does this break idempotency? (I think the answer is no; this is round-trippable.)

@kylebarron
Contributor

I think yes, because Stata doesn't have a missing value for strings. When the Stata file is read back into pandas, the None values come back as ''.

This already happens with shorter strings in the Stata 114 writer. This PR allows the same to happen with strings longer than 2045 characters (the fixed-width limit of the 117 format) in the 117 writer.

>>> import pandas as pd
>>> df = pd.DataFrame({'a': ['abc', None]})
>>> df.to_stata('test.dta')
>>> pd.read_stata('test.dta')
   index    a
0      0  abc
1      1
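
The same conversion can be sketched for the 117 writer with a value long enough to be stored as a strL. This is a minimal illustration, not code from the PR: the column name `mixed`, the file name, and the use of `dtype=object` are assumptions, and it relies on format 117 storing fixed-width strings of at most 2045 characters.

```python
import os
import tempfile

import pandas as pd

# 3000 characters: too long for a fixed-width str in format 117,
# so the writer has to store this column as a strL.
long_text = "string" * 500

# dtype=object is an assumption made here so that None stays in the column as-is.
df = pd.DataFrame({"mixed": [long_text, None]}, dtype=object)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "mixed_strl.dta")  # hypothetical file name
    df.to_stata(path, version=117, write_index=False)
    reread = pd.read_stata(path)

# The None written above is read back as an empty string.
values = reread["mixed"].tolist()
print(repr(values[1]))
```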

'number': 1}
]

output = pd.DataFrame(output)
Contributor

is it worth having a test for other versions?

Contributor Author

Not relevant for the other version (114) which doesn't support strls.

@bashtage
Contributor Author

> Does this break idempotency? (I think the answer is no; this is round-trippable.)

Yes. Essentially you get None->'' conversion in mixed columns since there is no missing value for strings in Stata.

The only way to have idempotency would be to raise on mixed string columns so that None is not allowed in string columns. Then users would need to convert None to '' before saving, so that '' -> file -> ''.
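
A user who wants a stable round trip today can apply that conversion by hand before saving. The sketch below is one way to do it under stated assumptions: `write_and_reread` is a hypothetical helper, the column name `mixed` is illustrative, and `fillna("")` stands in for whatever normalization a project prefers.

```python
import os
import tempfile

import pandas as pd


def write_and_reread(frame):
    """Write frame to a temporary format-117 .dta file and read it back.

    Hypothetical helper for this example; "mixed" is forced to a strL.
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "roundtrip.dta")
        frame.to_stata(
            path, version=117, write_index=False, convert_strl=["mixed"]
        )
        return pd.read_stata(path)


raw = pd.DataFrame({"mixed": ["string" * 500, None]}, dtype=object)

# Replace None with '' up front, so the values written are the values read back.
clean = raw.fillna("")

before = clean["mixed"].tolist()
after = write_and_reread(clean)["mixed"].tolist()
print(before == after)
```

Comparing plain lists rather than whole frames keeps the check independent of which string dtype `read_stata` returns.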

@jreback jreback merged commit fcb8403 into pandas-dev:master Nov 14, 2018
@jreback
Contributor

jreback commented Nov 14, 2018

thanks @bashtage

thoo added a commit to thoo/pandas that referenced this pull request Nov 15, 2018
* upstream/master: (25 commits)
  DOC: Delete trailing blank lines in docstrings. (pandas-dev#23651)
  DOC: Change release and whatsnew (pandas-dev#21599)
  DOC: Fix format of the See Also descriptions (pandas-dev#23654)
  DOC: update pandas.core.groupby.DataFrameGroupBy.resample docstring. (pandas-dev#20374)
  ENH: Allow export of mixed columns to Stata strl (pandas-dev#23692)
  CLN: Remove unnecessary code (pandas-dev#23696)
  Pin flake8-rst version (pandas-dev#23699)
  Implement _most_ of the EA interface for DTA/TDA (pandas-dev#23643)
  CI: raise clone depth limit on CI
  BUG: Fix Series/DataFrame.rank(pct=True) with more than 2**24 rows (pandas-dev#23688)
  REF: Move Excel names parameter handling to CSV (pandas-dev#23690)
  DOC: Accessing files from a S3 bucket. (pandas-dev#23639)
  Fix errorbar visualization (pandas-dev#23674)
  DOC: Surface / doc mangle_dupe_cols in read_excel (pandas-dev#23678)
  DOC: Update is_sparse docstring (pandas-dev#19983)
  BUG: Fix read_excel w/parse_cols & empty dataset (pandas-dev#23661)
  Add to_flat_index method to MultiIndex (pandas-dev#22866)
  CLN: Move to_excel to generic.py (pandas-dev#23656)
  TST: IntervalTree.get_loc_interval should return platform int (pandas-dev#23660)
  CI: Allow to compile docs with ipython 7.11 pandas-dev#22990 (pandas-dev#23655)
  ...
tm9k1 pushed a commit to tm9k1/pandas that referenced this pull request Nov 19, 2018
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
@bashtage bashtage deleted the strl-none branch March 21, 2019 13:27
Successfully merging this pull request may close these issues.

StataWriter for version 117 fails on None in a string column long enough to be a Stata StrL.
4 participants