Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make DataFrame.to_string output full content by default #28052

Merged

Conversation

lshepard
Copy link

@lshepard lshepard commented Aug 21, 2019

I modeled this off of #24841. Some alternatives I considered:

  • Instead of setting the option_context here, we could wind the param into the depths of the formatter. I tried this, actually, and started finding a number of edge cases and bugs. I realized that the issue only occurs in a pretty narrow case - if the user is explicitly calling to_string - because most of the time, when representing a DataFrame, the user will want long strings truncated for readability. So I think the safest way is to do it at the top level without interfering with lower-level formatters.
  • Series.to_string() could arguably benefit from the same treatment, although that wasn't mentioned in the original issue (and I have never found the need to use it personally) so I didn't bring that in.

Here's an example on a real dataset showing long columns preserved in a text file produced by to_string():

Screen Shot 2019-08-20 at 10 21 06 PM

Additional manual testing:

  • Main use case- by default, no limits and ignores the display options, but can still override:
>>> print(df.to_string())
        A       B
0     NaN     NaN
1 -1.0000     foo
2 -2.1234   foooo
3  3.0000  fooooo
4  4.0000     bar
>>> with option_context('display.max_colwidth', 5):
...     print(df.to_string())
... 
        A       B
0     NaN     NaN
1 -1.0000     foo
2 -2.1234   foooo
3  3.0000  fooooo
4  4.0000     bar
>>> print(df.to_string(max_colwidth=5))
      A     B
0   NaN   NaN
1 -1...   foo
2 -2...  f...
3  3...  f...
4  4...   bar
  • The string representation of DataFrame does still use the display options (so it's only the explicit to_string that doesn't:
>>> with option_context('display.max_colwidth', 5):
...     print(str(df))
... 
      A     B
0   NaN   NaN
1 -1...   foo
2 -2...  f...
3  3...  f...
4  4...   bar
  • The new parameter validates for None and positive ints, but rejects anything else:
>>> print(df.to_string(max_colwidth=-5))
    ...
    raise ValueError(msg)
ValueError: Value must be a nonnegative integer or None

Luke Shepard added 5 commits August 20, 2019 20:29
starting to feel a bit uncomfortable about it. The max_colwidth is an
important feature for legibility in the vast majority of contexts -
and one expects the display config setting to work. It is only when invoked
at the highest level as to_string() that it should be unlimited.

So even though this is a temp commit, I'm about to unwind it I think and
try an approach at the top level only.:
… have

a quick override at the very top level, and everything else behaves based
on that one override.#
…width is 0 instead of a large number. So I set it to a large number (like the html diff) to preserve the justification behavior.
@pep8speaks
Copy link

pep8speaks commented Aug 21, 2019

Hello @lshepard! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-09-14 14:09:02 UTC

@@ -707,11 +708,14 @@ def to_string(
max_cols=None,
show_dimensions=False,
decimal=".",
max_colwidth=9999999,
Copy link
Author

@lshepard lshepard Aug 21, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wanted to note - one of the commenters on the issue asks: "With Pandas 0.25.0, setting display.max_colwidth to a large number stops the truncation but when trying to left justify columns with df.to_string(justify='left'), that same display setting somehow pads columns on the left so they are not left aligned. Is there any present way to prevent truncation and get left justified string columns when output to a terminal? I know a pull request is in process but I would like to do this now. Thanks."

I can accomplish this in my testing by setting max_colwidth=0, which switches the padding to left. It is weird, though, that passing justify="left" does not justify it correctly. Seems like maybe a separate bug or one that I could look further into.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand this cause, but I don't think we'll want a workaround like this.

Can you post the output with max_colwidth=0, and how it compares to this?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shows how the max_colwidth=0 strangely forces a left justification:

>>> print(df.to_string(max_colwidth=0))
        A       B
0 NaN      NaN   
1 -1.0000  foo   
2 -2.1234  foooo 
3  3.0000  fooooo
4  4.0000  bar   

Whereas the behavior in this PR preserves the right justification that is the current default:

>>> print(df.to_string(max_colwidth=99999))
        A       B
0     NaN     NaN
1 -1.0000     foo
2 -2.1234   foooo
3  3.0000  fooooo
4  4.0000     bar

And interestingly, passing justify='left' doesn't have an effect:

>>> print(df.to_string(justify='left'))
   A      B      
0     NaN     NaN
1 -1.0000     foo
2 -2.1234   foooo
3  3.0000  fooooo
4  4.0000     bar

This is because of these lines in _make_fixed_width:

def _make_fixed_width(
    ...
    max_len = max(adj.len(x) for x in strings)
    ...
    def just(x):
        if conf_max is not None:
            if (conf_max > 3) & (adj.len(x) > max_len):
                x = x[: max_len - 3] + "..."
        return x

    strings = [just(x) for x in strings]
    result = adj.justify(strings, max_len, mode=justify)
    return result

It checks if conf_max > 3 to apply the dot truncation ... so if it's <= 3 then that isn't called. So it's not just 0 but any of 0, 1, 2, or 3 that causes the justification to line up.

I can spend some more time to better understand why this is happening. I agree that we should not rely on some incidentals of the underlying implementation to determine whether to justify the text.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I figured out the issue. There are two lines where format_array is called and the justify parameter is not passed all the way through -- so in some places, the justification is being overridden.

Note that the bug where justification doesn't happen if conf_max < 3 already appears - so I think it can probably be pulled out as a separate PR.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Scratch all that. I re-read the docs and I see that I misinterpreted the justify param, as it only refers to the column headers, not the content. In that regard it is behaving correctly. So I think I'll leave the justification question out of this PR.

Luke Shepard added 2 commits August 20, 2019 22:19
…the expected values

though because the fixed width will be harder to read if the lines are split, and they
kind of have to be long to test the truncation...
Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

I think we'll need to deprecate the current behavior, rather than just changing the default. Users will need to be explicit about the max_colwidth for now I think.

@@ -707,11 +708,14 @@ def to_string(
max_cols=None,
show_dimensions=False,
decimal=".",
max_colwidth=9999999,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand this cause, but I don't think we'll want a workaround like this.

Can you post the output with max_colwidth=0, and how it compares to this?

@lshepard
Copy link
Author

lshepard commented Aug 21, 2019

I think we'll need to deprecate the current behavior, rather than just changing the default.
Users will need to be explicit about the max_colwidth for now I think.

To make sure I understand, what should happen when the user calls df.to_string() without parameters?

  • Limits columns to the display.max_colwidth config option (the current default)
  • No limit to column width (this PR)
  • Make max_colwidth a required parameter (so vanilla to_string() no longer works)
  • One of the above, but with a deprecation warning

Sounds like you're suggesting that we continue the current behavior, but with a deprecation warning? Should we only sound the deprecation warning if the data frame contains columns that would have otherwise been truncated? Seems like in most cases, the truncation won't be a difference in behavior and I would hate to make vanilla df.to_string() not work...

@lshepard
Copy link
Author

To update: I chose to punt the justify question as I think that's a separate issue that pre-existed and I may not understand the use case that well anyway.

I changed the behavior of max_colwidth so that it will use None to mean unlimited instead of the sentinel of 999999. I also changed the other two places from where I had copied the style (in the clipboard formatter & the html formatter).

I have not yet added a deprecation warning - happy to do so if you think desired behavior is that the max_colwidth param should be required in the future. That doesn't make as much sense to me.

@TomAugspurger
Copy link
Contributor

To make sure I understand, what should happen when the user calls df.to_string() without parameters?

That needs to be decided. It seems that in #28052 we decided that to_html not outputting the entire dataframe was a bug. Presumably we would say the same about this, but I'm not sure.

cc @simonjayhawkins

@simonjayhawkins
Copy link
Member

To make sure I understand, what should happen when the user calls df.to_string() without parameters?

That needs to be decided. It seems that in #28052 we decided that to_html not outputting the entire dataframe was a bug. Presumably we would say the same about this, but I'm not sure.

cc @simonjayhawkins

we should probably maintain consistency. although long html would probably wrap within a cell and that does not apply to to_string.

@lshepard
Copy link
Author

I agree- the original issue came from someone calling vanilla to_string and being surprised:

I am calling to_string() without any
parameters and it beautifully fixed-
formatted my dataframe apart from my
very wide filename column, that is being
truncated with "...". How can I avoid that?

I think no truncation by default is the most intuitive approach, and matches to_html behavior. Someone who is surprised can find the parameter to truncate pretty easily in the docs, and it’s unlikely to be a surprise the same way that truncation might be.

@simonjayhawkins simonjayhawkins added API Design Output-Formatting __repr__ of pandas objects, to_string labels Aug 25, 2019
Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, let's call using the options.display value a bug then.

This will need a release note in doc/source/whatsnew/v1.0.0.rst under bug fixes.

pandas/core/config_init.py Show resolved Hide resolved
Luke Shepard added 4 commits August 26, 2019 22:09
…dn't realize that this also allowed None until checking the docs, but it does so it's the perfect validator for our new parameter.
@lshepard
Copy link
Author

Thanks for the feedback! I added the whatsnew notice, swapped to the correct validator, and did some more manual testing to ensure it worked as expected.

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good overall. A few small requests.

pandas/core/frame.py Outdated Show resolved Hide resolved
doc/source/whatsnew/v1.0.0.rst Outdated Show resolved Hide resolved
@lshepard
Copy link
Author

Ready for review, thanks!

@TomAugspurger TomAugspurger added this to the 1.0 milestone Aug 30, 2019
Co-Authored-By: Tom Augspurger <TomAugspurger@users.noreply.github.com>
@TomAugspurger
Copy link
Contributor

@lshepard merge conflict in the release notes. Could you merge master & repush?

Luke Shepard added 2 commits September 5, 2019 07:57
Merge branch 'master' into issue9784-to-string-truncate-long-strings
Merge branch 'issue9784-to-string-truncate-long-strings' of github.com:lshepard/pandas into issue9784-to-string-truncate-long-strings
@lshepard
Copy link
Author

lshepard commented Sep 5, 2019

Huh, failed with “worker 'gw0' crashed while running 'pandas/tests/test_sorting.py::TestSafeSort::test_labels_out_of_bound[-1]'”. I don’t think that’s related but I’ll take a look.

@TomAugspurger
Copy link
Contributor

Could be a random failure. I'd start by repushing an empty commit.

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@lshepard
Copy link
Author

lshepard commented Sep 9, 2019

Thanks for all the detailed comments on the review!

@WillAyd
Copy link
Member

WillAyd commented Sep 13, 2019

@lshepard can you fix the merge conflict on the whatsnew? @TomAugspurger mind taking a look at this one?

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM too. Just need to fix the merge conflict.

@WillAyd WillAyd merged commit d92b46f into pandas-dev:master Sep 16, 2019
@WillAyd
Copy link
Member

WillAyd commented Sep 16, 2019

Nice PR - thanks a lot @lshepard

@rswgnu
Copy link

rswgnu commented Sep 16, 2019

Excited to try this out and see it resolved. I hope proper left justification can be resolved soon too. Thanks.

proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019
proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Output-Formatting __repr__ of pandas objects, to_string
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DataFrame.to_string truncates long strings
6 participants