Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH Add table to model cards #90

Merged
merged 13 commits into from
Aug 11, 2022

Conversation

BenjaminBossan
Copy link
Collaborator

@BenjaminBossan BenjaminBossan commented Aug 10, 2022

Solves #87

Description

This feature adds a new method to the Card class called add_table. It
adds a generic table to the end of the model card by creating a new
section (similar to plots).

Any generic table of type dict[str, list[Any]] is supported, as well as
DataFrames. However, the main use case is for adding CV results, as e.g.
created by GridSearchCV.

Implementation

I hope you can stomach the type-y implementation Adrin ;)

I wanted to make the extra sections more generic in PR #89 as a
preparation for this PR (thus avoiding a big PR with refactor + new
feature). However, the changes in #89 were not sufficient (I guess
that's the danger of splitting the work).

The problem was that the formatting of adding the plots in the save
method was very specific to plots, as a markdown link would be added.
This, of course, does not work for tables.

My solution is that _extra_sections now takes ExtraSection objects as
values instead of strings. These objects only need to implement the
format method (so technically, str would still be okay). So for a
plot, the PlotSection class has a format method that returns a markdown
link. And for tables, the TableSection class has a format method that
uses tabulate to return a table as string.

This approach should hopefully be flexible enough for us to later add
new section types if needed.

At first glance, the implementation might look overly complex. However,
I think it's better than having a bunch of if...else inside of save that
formats each possible type differently. But please LMK if you have a
better idea, I'm open to suggestions.

Tangential changes

So far, the plots were just added without an extra section to the bottom.
Now, there is an "# Additioinal Content" section and each plot and table
gets their own subsection.

Examples

Here is the hyperparameter table generated by plot_model_card.py:

Solves skops-dev#87

This feature adds a new method to the Card class called add_table. It
adds a generic table to the end of the model card by creating a new
section (similar to plots).

Any generic table of type dict[str, list[Any]] is supported, as well as
DataFrames. However, the main use case is for adding CV results, as e.g.
created by GridSearchCV.

Implementation

I hope you can stomach the type-y implementation Adrin ;)

I wanted to make the extra sections more generic in PR skops-dev#89 as a
preparation for this PR (thus avoiding a big PR with refactor + new
feature). However, the changes in skops-dev#89 were not sufficient (I guess
that's the danger of splitting the work).

The problem was that the formatting of adding the plots in the save
method was very specific to plots, as a markdown link would be added.
This, of course, does not work for tables.

My solution is that _extra_sections now takes ExtraSection objects as
values instead of strings. These objects only need to implement the
format method (so technically, str would still be okay). So for a
plot, the PlotSection class has a format method that returns a markdown
link. And for tables, the TableSection class has a format method that
uses tabulate to return a table as string.

This approach should hopefully be flexible enough for us to later add
new section types if needed.

At first glance, the implementation might look overly complex. However,
I think it's better than having a bunch of if...else inside of save that
formats each possible type differently. But please LMK if you have a
better idea, I'm open to suggestions.

Examples

Here is the hyperparameter table generated by plot_model_card.py:

https://huggingface.co/skops/hf_hub_example-b959cadc-ffb3-4f07-856b-2d05d2d8f8e0#hyperparameter-search-results
@BenjaminBossan
Copy link
Collaborator Author

BenjaminBossan commented Aug 10, 2022

@skops-dev/maintainers ready for review.

Regarding failing codecov: One instance is spurious, the other comes back to #53, which we haven't tackled yet.

(Note:

You may wonder why this is inside of _model_card.py instead of fixes.py:

if sys.version_info >= (3, 8):
    from typing import Protocol
else:
    from typing_extensions import Protocol

I actually tried to put it in fixes.py and it didn't work, resulting in a mypy error, so I had to put it in the module where it's used. If you have an idea what's going on, please let me know. What I tried but didn't help: remove # noqa, grouping and not grouping imports.)

@adrinjalali
Copy link
Member

Instead of an extra type, I'd just store the tables, and have a little function which takes the table and return what should be rendered in the md file. We don't need a protocol for adding tables to the model card.

@BenjaminBossan
Copy link
Collaborator Author

Instead of an extra type, I'd just store the tables, and have a little function which takes the table and return what should be rendered in the md file. We don't need a protocol for adding tables to the model card.

I assume you mean inside the save method. How would you know at that point what function to use to render the value? Sure, we could say str -> plot and dict -> table but I don't like that too much and it would probably not work with more types being added. IMO it also doesn't add much complexity -- you can directly see how different types of section are formatted and don't need to figure out which function is being called on them.

As to the protocol, it is not strictly needed, it's just to add a bit of extra type safety.

"""
for key, val in kwargs.items():
section = TableSection(table=val)
self._extra_sections[key] = section
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here we could add to self._extra_tables instead and then we know they're tables. Each method adds to their corresponding attribute.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to avoid having separate containers for each type of addition. One reason is that it makes the code more lengthy (we'll later have to loop over each one separately), the other is that we lose the information of which order the extra sections are stored (e.g. I add a plot, a table, then a plot, but at the end I get plot, plot, table).

Another design I wanted to avoid is formatting already inside the add_* method, to not lock us in into a specific format at that point.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay, so if the order is important (which I agree with), we should make it a list of named tuples instead of a dict (and relying on the dict preserving insertion order during iteration).

If we remove the protocol and not typing our private attribute, then I'm fine with the PR

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy to make those changes.

Regarding the dict to list of tuples proposal: From my understanding, as long as you only add to the dict, the order should be preserved, right? So the difference would mainly be if we want to later allow deletion, or if a user wants to have multiple sections with the same name?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's true in modern new python versions. I think of it as an implementation detail, and rather not rely on it really.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the dict change was made in Python 3.6, it was indeed so:

The order-preserving aspect of this new implementation is considered an implementation detail and should not be relied upon

(link)

With Python 3.7 (which is our min version), the order is officially part of the language:

the insertion-order preservation nature of dict objects has been declared to be an official part of the Python language spec.

(link)

So this wouldn't be my main concern here. I'd rather think about these implications:

  1. We can have duplicate section names in the extra sections
  2. Time complexity to retrieve items is now O(N) and there is no guarantee that only a single key matches (due to 1.)
  3. More control about order (e.g. when deleting and reinserting items)

Due to the nature of the extra sections, which can contain anything, I think the trade-off is okay, whereas template sections should remain a dict.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know all these details about python's dict, but not many people do, and a "dictionary" isn't necessarily ordered in other languages. I tend to avoid things which are somewhat common in different languages but slightly differ in python for some specific reason.

- _extra_sections is now a list of key, val tuples
- Remove the ExtraSection protocol
Copy link
Member

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like it now.

Things left:

  • user guide for the added functionality
  • changelog entry

@BenjaminBossan
Copy link
Collaborator Author

I'll work on that and let you know once it's ready for review.

@BenjaminBossan
Copy link
Collaborator Author

@adrinjalali This is now ready for review.

In the changelog, I have not yet added subsection for new features, bugfixes, etc. LMK if now is a good time.

Copy link
Member

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise LGTM!

docs/changes.rst Outdated Show resolved Hide resolved
examples/plot_model_card.py Outdated Show resolved Hide resolved
Comment on lines 383 to 398
def test_raise_error_if_no_cols(self):
msg = "Empty table added"
with pytest.raises(ValueError, match=msg):
TableSection(table={})

def test_raise_error_if_no_rows(self):
msg = "Empty table added"
with pytest.raises(ValueError, match=msg):
TableSection(table={"col": []})

def test_raise_error_df_empty(self):
pd = pytest.importorskip("pandas")
df = pd.DataFrame([])
msg = "Empty table added"
with pytest.raises(ValueError, match=msg):
TableSection(table=df)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these could be a single parameterized test.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Please review again.

BenjaminBossan and others added 3 commits August 11, 2022 13:42
Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com>
Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com>
@adrinjalali adrinjalali changed the title Add table to model cards ENH Add table to model cards Aug 11, 2022
@adrinjalali adrinjalali merged commit 5a8184e into skops-dev:main Aug 11, 2022
@BenjaminBossan BenjaminBossan deleted the add-table-to-model-cards branch August 11, 2022 13:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants