Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve estimation table #379

Merged
merged 29 commits into from
Apr 10, 2023
Merged

Improve estimation table #379

merged 29 commits into from
Apr 10, 2023

Conversation

ChristianZimpelmann
Copy link
Member

@ChristianZimpelmann ChristianZimpelmann commented Aug 23, 2022

  • Allow to easily add a row of strings like "Controls": ["Yes", "No"]) to the footer before calling render_latex.
  • Do not round all cells which contain integer values. Before this PR, e.g. the number of observations were also rounded to two significant digits.

Closes #373.

@ChristianZimpelmann
Copy link
Member Author

ChristianZimpelmann commented Aug 23, 2022

Three open questions remain (for now):

  • as _center_align_integers_and_non_numeric_strings is currently set up, a string "Total 2.2 observations" in a cell, would be treated as a float (not center-aligned) and also the first part "Total " would be stripped of. I think we should adjust this, but wanted to discuss this first as I might miss a reason why you came up with this solution in the first place.

  • How should the interaction with add_trailing_zeros be? Currently implemented: In body, trailing zeros are added to integers. In footer, no trailing zeros are added. I don't have a strong opinion here, as I am not planning to use add_trailing_zeros=True.

  • I feel like the code block below can somehow be simplified, i.e. without looping over the columns. But didn't find a solution.

    for c in df_formatted:
        df_formatted.loc[position_of_integers[c], c] = df_raw.loc[
            position_of_integers[c], c
        ]

@codecov
Copy link

codecov bot commented Aug 23, 2022

Codecov Report

Merging #379 (a32f69c) into main (a770938) will increase coverage by 0.01%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #379      +/-   ##
==========================================
+ Coverage   93.24%   93.26%   +0.01%     
==========================================
  Files         247      247              
  Lines       17987    18000      +13     
==========================================
+ Hits        16772    16787      +15     
+ Misses       1215     1213       -2     
Impacted Files Coverage Δ
src/estimagic/differentiation/derivatives.py 96.42% <ø> (ø)
...imagic/optimization/internal_criterion_template.py 82.48% <ø> (ø)
src/estimagic/optimization/optimize.py 92.26% <ø> (ø)
src/estimagic/visualization/estimation_table.py 84.96% <100.00%> (+0.03%) ⬆️
tests/visualization/test_estimation_table.py 100.00% <100.00%> (ø)

... and 1 file with indirect coverage changes

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@hmgaudecker
Copy link
Member

Thanks! While you are at it, can you please turn off the performance warnings, which are irrelevant for tables:

╭────────────────────────────────────────────────────────────────── Warnings ───────────────────────────────────────────────────────────────────╮
│ task_reg_invest_on_groups_or_params.py::task_reg_invest_on_groups                                                                             │
│ task_reg_invest_on_groups_or_params.py::task_reg_invest_on_groups                                                                             │
│ task_reg_invest_on_groups_or_params.py::task_reg_invest_on_groups                                                                             │
│ task_reg_invest_on_groups_or_params.py::task_reg_invest_on_groups                                                                             │
│ task_reg_invest_on_groups_or_params.py::task_reg_invest_on_groups                                                                             │
│ ... in 15 more locations.                                                                                                                     │
│     /x/estimagic/estimagic/src/estimagic/visualization/estimation_table.py:296: PerformanceWarning: indexing past lexsort depth may    │
│     impact performance.                                                                                                                       │
│       ci_in_body = body.loc[("",)][body.columns[0]].str.contains(";").any()                                                                   │
│                                                                                                                                               │
│ ♥ https://pytask-dev.rtdf.io/en/stable/how_to_guides/capture_warnings.html                                                                    │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

@hmgaudecker
Copy link
Member

Thanks. Please make sure that when you call a test test_render_latex(), it actually calls the function render_latex. The test that is called this way should be test_center_align_integers_and_non_numeric_strings

While the bug we saw today seems fixed in that function, it has no effect on render_latex because the loop in lines 349-350 of estimation_table is flawed:

        for _, r in footer.iterrows():
            r = _center_align_integers_and_non_numeric_strings(r)

The (correct) result on the RHS is assigned to r, which just disappears in the next iteration.

I tried to fix with

        for idx, row in footer.iterrows():
            footer[idx] = _center_align_integers_and_non_numeric_strings(row)

but this gives an error

TypeError: incompatible index of inserted column with frame index

in my application.

@ChristianZimpelmann
Copy link
Member Author

Reminder: At some point, we should make sure that columns consisting of integer values only are right aligned (instead of center-aligned)

@ChristianZimpelmann
Copy link
Member Author

@janosg, I just merged main and updated the first comment in this PR.

The remaining comments are:

  • How should the interaction of treatment of integers with add_trailing_zeros be? Currently implemented: In body, trailing zeros are added to integers. In footer, no trailing zeros are added. I don't have a strong opinion here, as I am not planning to use add_trailing_zeros=True.

My suggestion: leave as is

  • Reminder: At some point, we should make sure that columns consisting of integer values only are right aligned (instead of center-aligned)

If I am not mistaken, this comments was brought forward by Hans-Martin. I suggest opening an issue and not implementing it here as thinking about comparisons over full columns will most likely need a deeper thought.

@hmgaudecker
Copy link
Member

How should the interaction of treatment of integers with add_trailing_zeros be? Currently implemented: In body, trailing zeros are added to integers. In footer, no trailing zeros are added. I don't have a strong opinion here, as I am not planning to use add_trailing_zeros=True.

My suggestion: leave as is

Given that the docstring reads:

add_trailing_zeros (bool): If True, format floats such that they have same number of digits after the decimal point. Default True.

I would be very surprised if (under the default!) suddenly all integers showed up as floats. Or do you mean by interaction "columns that contain both integers and floats" ? And by footer the number of observations or so? In that case I would not have a strong opinion, we don't really support tables containing all kinds of different numbers yet, anyhow.

Reminder: At some point, we should make sure that columns consisting of integer values only are right aligned (instead of center-aligned)

If I am not mistaken, this comments was brought forward by Hans-Martin. I suggest opening an issue and not implementing it here as thinking about comparisons over full columns will most likely need a deeper thought.

I guess I do not quite understand why we can easily center-align columns but not right-align them?

@ChristianZimpelmann
Copy link
Member Author

ChristianZimpelmann commented Mar 30, 2023

Given that the docstring reads:

add_trailing_zeros (bool): If True, format floats such that they have same number of digits after the decimal point. Default True.

I would be very surprised if (under the default!) suddenly all integers showed up as floats. Or do you mean by interaction "columns that contain both integers and floats" ? And by footer the number of observations or so? In that case I would not have a strong opinion, we don't really support tables containing all kinds of different numbers yet, anyhow.

No, I really mean that integers in any of the model outputs are shown as floats. (I guess usually those coefficients are just not integers). But this can easily be changed.

def _get_models_multiindex():
    df = pd.DataFrame(
        data={}, columns=["value", "ci_lower", "ci_upper", "p_value"]
    )
    df.loc[0] = [0.23123, 0, 1, 1]
    df.loc[1] = [1, 0, 1, 1]
    df.loc[2] = [2, 0, 1, 1]
    df.index = pd.MultiIndex.from_tuples(
        [("p_1", "v_1"), ("p_1", "v_2"), ("p_2", "v_2")]
    )
    
    info = {"n_obs": 400}
    mod1 = {"params": df, "info": info, "name": "m1"}
    mod2 = {"params": df, "info": info, "name": "m2"}
    models = [mod1, mod2]
    return models

models = _get_models_multiindex()
out = et.estimation_table(
        models, return_type="render_inputs", confidence_intervals=True
    )
out["body"]

leads to

m1 m2
('p_1', 'v_1') 0.231$^{ }$ 0.231$^{ }$
('p_1', '') (0.000;1.000) (0.000;1.000)
('p_1', 'v_2') 1.000$^{ }$ 1.000$^{ }$
('p_1', '') (0.000;1.000) (0.000;1.000)
('p_2', 'v_2') 2.000$^{ }$ 2.000$^{ }$
('p_2', '') (0.000;1.000) (0.000;1.000)

I guess I do not quite understand why we can easily center-align columns but not right-align them?

Have I misunderstood your comment that you would like to check the whole column whether it only consists of integers? This is something we do not do at the moment. We currently do not center-align whole columns, but only check and align single cells. Switching from center-align to right-align for those specific cells wouldn't be a problem. But I am not sure whether this really is what we want in columns in which we have floats in the main part of the body and then n_observations at the end.

@hmgaudecker
Copy link
Member

Given that the docstring reads:

add_trailing_zeros (bool): If True, format floats such that they have same number of digits after the decimal point. Default True.

I would be very surprised if (under the default!) suddenly all integers showed up as floats. Or do you mean by interaction "columns that contain both integers and floats" ? And by footer the number of observations or so? In that case I would not have a strong opinion, we don't really support tables containing all kinds of different numbers yet, anyhow.

No, I really mean that integers in any of the model outputs are shown as floats. (I guess usually those coefficients are just not integers). But this can easily be changed.

def _get_models_multiindex():
    df = pd.DataFrame(
        data={}, columns=["value", "ci_lower", "ci_upper", "p_value"]
    )
    df.loc[0] = [0.23123, 0, 1, 1]
    df.loc[1] = [1, 0, 1, 1]
    df.loc[2] = [2, 0, 1, 1]
    df.index = pd.MultiIndex.from_tuples(
        [("p_1", "v_1"), ("p_1", "v_2"), ("p_2", "v_2")]
    )
    
    info = {"n_obs": 400}
    mod1 = {"params": df, "info": info, "name": "m1"}
    mod2 = {"params": df, "info": info, "name": "m2"}
    models = [mod1, mod2]
    return models

models = _get_models_multiindex()
out = et.estimation_table(
        models, return_type="render_inputs", confidence_intervals=True
    )
out["body"]

leads to
m1 m2
('p_1', 'v_1') 0.231$^{ }$ 0.231$^{ }$
('p_1', '') (0.000;1.000) (0.000;1.000)
('p_1', 'v_2') 1.000$^{ }$ 1.000$^{ }$
('p_1', '') (0.000;1.000) (0.000;1.000)
('p_2', 'v_2') 2.000$^{ }$ 2.000$^{ }$
('p_2', '') (0.000;1.000) (0.000;1.000)

Ah, sorry, seems like I was mistaken in my assumptions.

When talking about integer cells, I was thinking about columns with (Pandas) dtype object, which may contain integer cells:

In [8]: x = pd.Series(["a", 1, 0.01])

In [9]: type(x[1])
Out[9]: int

Seems like you are thinking about the case where we have all-float columns, some of which may happen to look like integers. I agree these cases should not be treated in any special way.

@hmgaudecker
Copy link
Member

I guess I do not quite understand why we can easily center-align columns but not right-align them?

Have I misunderstood your comment that you would like to check the whole column whether it only consists of integers? This is something we do not do at the moment. We currently do not center-align whole columns, but only check and align single cells. Switching from center-align to right-align for those specific cells wouldn't be a problem. But I am not sure whether this really is what we want in columns in which we have floats in the main part of the body and then n_observations at the end.

Apologies, I had not looked at the code and did not realise that this is fully internal, i.e., it cannot be controlled by the user. This is something that should change eventually (and we may prepare for it in the code by making two functions _center_align_integers and _center_align_non_numeric_strings out of the one function we have), but fine to leave as is for the moment IMHO.

@ChristianZimpelmann
Copy link
Member Author

Perfect. I opened #451.

Janos, if you agree, I think we can merge.

@hmgaudecker
Copy link
Member

With "We may prepare for it in the code ..." I meant "make the preparations now" 😆

In the issue, that line will be misleading to future developers, I believe.

@ChristianZimpelmann
Copy link
Member Author

Alright, that makes sense. I deleted it in the issue.

I think it isn`t worth right now to split up the function before we know exactly which changes we are planning to make.

The failing tests seem to be unrelated to this PR.

@janosg
Copy link
Member

janosg commented Mar 31, 2023

Hi @ChristianZimpelmann, thanks a lot for finishing this up. A few questions:

"Total 2.2 observations" in a cell, would be treated as a float (not center-aligned) and also the first part "Total " would be stripped of.

Is this still the case? I don't think there ever was an intentional choice to have it and it is just a bug and we should fix it before merging.

Allow to easily add a row of strings like "Controls": ["Yes", "No"]) to the footer before calling render_latex

Is the workflow for this documented, ideally with an example? I think this is a very important feature and it should be documented before we merge.

Thanks! While you are at it, can you please turn off the performance warnings, which are irrelevant for tables

Is this done? Should not take long and should be done before we merge.

Thanks. Please make sure that when you call a test test_render_latex(), it actually calls the function render_latex. The test that is called this way should be test_center_align_integers_and_non_numeric_strings ... [see comment by HM]

Is the bug mentioned in this comment by HM fixed?

I had not looked at the code and did not realise that this is fully internal, i.e., it cannot be controlled by the user. This is something that should change eventually

While I agree that the default visualization of integer only columns should change (see #451), I am always hesitant to add yet another argument to give more control to the user. The whole philosophy of estimation_table is to give you something great by default and if you really need absolutely fine control, let you modify the generated DataFrame before it is rendered to latex or html.

Sorry that I ask for more changes, but I think we need to make sure that no bug that was uncovered during this PR slips through the review.

@janosg
Copy link
Member

janosg commented Mar 31, 2023

I think to really improve estimation_table it would be good to have a set of tables that need to look good out of the box. This could be a separate repo that uses pytask to create the tables and compiles them into one pdf / html document.

Do we have a volunteer to set this up / contribute examples? I think it should be a mix of real world examples and hand crafted examples that contain elements that do not look good in the current version.

Once we have that, I would be more motivated to spend a day on cleaning up the code and making sure everything looks nice.

@timmens could you look into the failing tests on Monday? I agree that they are not related to the current PR but it looks like they fall in your area of expertise.

@hmgaudecker
Copy link
Member

The whole philosophy of estimation_table is to give you something great by default and if you really need absolutely fine control, let you modify the generated DataFrame before it is rendered to latex or html.

I guess I tend to forget because so far, I have only used it in projects where core developers of this feature were involved and things ended up being fixed in PRs like this one :-)

Would also be great to document how that would work in practice. (not that I checked the website right now...)

@ChristianZimpelmann
Copy link
Member Author

Thanks!

"Total 2.2 observations" in a cell, would be treated as a float (not center-aligned) and also the first part "Total " would be stripped of.

Is this still the case? I don't think there ever was an intentional choice to have it and it is just a bug and we should fix it before merging.

No, this was fixed in this PR.

Allow to easily add a row of strings like "Controls": ["Yes", "No"]) to the footer before calling render_latex

Is the workflow for this documented, ideally with an example? I think this is a very important feature and it should be documented before we merge.

I am afraid it is not. Would you document it in a docstring or the Notebook?

Thanks! While you are at it, can you please turn off the performance warnings, which are irrelevant for tables

Is this done? Should not take long and should be done before we merge.

I don't see where it is turned off. But I don't get any performance warning when running pytest .\tests\visualization\test_estimation_table.py. Or when would you expect those warnings?

Thanks. Please make sure that when you call a test test_render_latex(), it actually calls the function render_latex. The test that is called this way should be test_center_align_integers_and_non_numeric_strings ... [see comment by HM]

Is the bug mentioned in this comment by HM fixed?

Yes!

I had not looked at the code and did not realise that this is fully internal, i.e., it cannot be controlled by the user. This is something that should change eventually

While I agree that the default visualization of integer only columns should change (see #451), I am always hesitant to add yet another argument to give more control to the user. The whole philosophy of estimation_table is to give you something great by default and if you really need absolutely fine control, let you modify the generated DataFrame before it is rendered to latex or html.

Sorry that I ask for more changes, but I think we need to make sure that no bug that was uncovered during this PR slips through the review.

Do we have a volunteer to set this up / contribute examples? I think it should be a mix of real world examples and hand crafted examples that contain elements that do not look good in the current version.

I could make a first draft of it next week.

@janosg
Copy link
Member

janosg commented Mar 31, 2023

The documentation should be here. Nobody reads the docstring.

The Performance warnings are probably only triggered when you have params with a MultiIndex. But it would not hurt to ignore these warnings preemptively for the entire estimation table function.

Thanks a lot for the PR and for setting up the repo!

@hmgaudecker I would say it's documented here but I am very open to any suggestions to improve this documentation.

@hmgaudecker
Copy link
Member

S+R on that page: fist->first, exmample

You can get almost limitless flexibility if you split the table generation into two steps. The fist generates a DataFrame which you can customize to your liking, the second renders that DataFrame in LaTeX or HTML.

I'd expand this with:

If you are interested in this feature, search for "render_inputs" below.

There should be some heading above the imports on that page, the way it is rendered it seems like they are somehow part of the introduction.

@review-notebook-app
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@ChristianZimpelmann
Copy link
Member Author

ChristianZimpelmann commented Apr 6, 2023

I now updated the how-to notebook and added in particular the feature which was added in this PR.

I also implemented all suggestions by @hmgaudecker (see below). So far, I haven't renamed the how-to though. I don't have a strong opinion or a good idea what could be more "catchy".

I will work on the repo of example tables later today.

S+R on that page: fist->first, exmample

You can get almost limitless flexibility if you split the table generation into two steps. The fist generates a DataFrame which you can customize to your liking, the second renders that DataFrame in LaTeX or HTML.

I'd expand this with:

If you are interested in this feature, search for "render_inputs" below.

There should be some heading above the imports on that page, the way it is rendered it seems like they are somehow part of the introduction.

show_index_names=show_index_names,
show_col_names=show_col_names,
escape_special_characters=escape_special_characters,
with warnings.catch_warnings():
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But it would not hurt to ignore these warnings preemptively for the entire estimation table function.

@janosg, is this the implementation you had in mind?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would have renamed the current estimation_table to _estimation_table and defined a new estimation_table that calls _estimation_table in this catch warnings context. Otherwise we get a lot of indentation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good idea! We should do that!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I implemented it using a decorator in the last commit.

Do you have any more suggestions? Otherwise, we can merge.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my side all good!

Copy link
Member

@janosg janosg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again, @ChristianZimpelmann! I think we can merge!

@ChristianZimpelmann
Copy link
Member Author

ChristianZimpelmann commented Apr 10, 2023

One test is still (or again) failing.. unrelated to changes in this PR as far as I understand.

Should I just ignore it?

@janosg
Copy link
Member

janosg commented Apr 10, 2023

We are working on these random failures. Yes, it's unrelated. I'll merge.

@janosg janosg merged commit 2dcfe51 into main Apr 10, 2023
@janosg janosg deleted the improve_estimation_table branch April 10, 2023 09:10
@timmens timmens mentioned this pull request Mar 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Two proposals for estimation_table
5 participants