-
Notifications
You must be signed in to change notification settings - Fork 53
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH Add table to model cards #90
ENH Add table to model cards #90
Conversation
Solves skops-dev#87 This feature adds a new method to the Card class called add_table. It adds a generic table to the end of the model card by creating a new section (similar to plots). Any generic table of type dict[str, list[Any]] is supported, as well as DataFrames. However, the main use case is for adding CV results, as e.g. created by GridSearchCV. Implementation I hope you can stomach the type-y implementation Adrin ;) I wanted to make the extra sections more generic in PR skops-dev#89 as a preparation for this PR (thus avoiding a big PR with refactor + new feature). However, the changes in skops-dev#89 were not sufficient (I guess that's the danger of splitting the work). The problem was that the formatting of adding the plots in the save method was very specific to plots, as a markdown link would be added. This, of course, does not work for tables. My solution is that _extra_sections now takes ExtraSection objects as values instead of strings. These objects only need to implement the format method (so technically, str would still be okay). So for a plot, the PlotSection class has a format method that returns a markdown link. And for tables, the TableSection class has a format method that uses tabulate to return a table as string. This approach should hopefully be flexible enough for us to later add new section types if needed. At first glance, the implementation might look overly complex. However, I think it's better than having a bunch of if...else inside of save that formats each possible type differently. But please LMK if you have a better idea, I'm open to suggestions. Examples Here is the hyperparameter table generated by plot_model_card.py: https://huggingface.co/skops/hf_hub_example-b959cadc-ffb3-4f07-856b-2d05d2d8f8e0#hyperparameter-search-results
@skops-dev/maintainers ready for review. Regarding failing codecov: One instance is spurious, the other comes back to #53, which we haven't tackled yet. (Note: You may wonder why this is inside of if sys.version_info >= (3, 8):
from typing import Protocol
else:
from typing_extensions import Protocol I actually tried to put it in |
Instead of an extra type, I'd just store the tables, and have a little function which takes the table and return what should be rendered in the md file. We don't need a protocol for adding tables to the model card. |
I assume you mean inside the As to the protocol, it is not strictly needed, it's just to add a bit of extra type safety. |
skops/card/_model_card.py
Outdated
""" | ||
for key, val in kwargs.items(): | ||
section = TableSection(table=val) | ||
self._extra_sections[key] = section |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here we could add to self._extra_tables
instead and then we know they're tables. Each method adds to their corresponding attribute.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to avoid having separate containers for each type of addition. One reason is that it makes the code more lengthy (we'll later have to loop over each one separately), the other is that we lose the information of which order the extra sections are stored (e.g. I add a plot, a table, then a plot, but at the end I get plot, plot, table).
Another design I wanted to avoid is formatting already inside the add_*
method, to not lock us in into a specific format at that point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okay, so if the order is important (which I agree with), we should make it a list of named tuples instead of a dict (and relying on the dict preserving insertion order during iteration).
If we remove the protocol and not typing our private attribute, then I'm fine with the PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Happy to make those changes.
Regarding the dict to list of tuples proposal: From my understanding, as long as you only add to the dict, the order should be preserved, right? So the difference would mainly be if we want to later allow deletion, or if a user wants to have multiple sections with the same name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's true in modern new python versions. I think of it as an implementation detail, and rather not rely on it really.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When the dict change was made in Python 3.6, it was indeed so:
The order-preserving aspect of this new implementation is considered an implementation detail and should not be relied upon
(link)
With Python 3.7 (which is our min version), the order is officially part of the language:
the insertion-order preservation nature of dict objects has been declared to be an official part of the Python language spec.
(link)
So this wouldn't be my main concern here. I'd rather think about these implications:
- We can have duplicate section names in the extra sections
- Time complexity to retrieve items is now O(N) and there is no guarantee that only a single key matches (due to 1.)
- More control about order (e.g. when deleting and reinserting items)
Due to the nature of the extra sections, which can contain anything, I think the trade-off is okay, whereas template sections should remain a dict.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know all these details about python's dict, but not many people do, and a "dictionary" isn't necessarily ordered in other languages. I tend to avoid things which are somewhat common in different languages but slightly differ in python for some specific reason.
- _extra_sections is now a list of key, val tuples - Remove the ExtraSection protocol
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really like it now.
Things left:
- user guide for the added functionality
- changelog entry
I'll work on that and let you know once it's ready for review. |
@adrinjalali This is now ready for review. In the changelog, I have not yet added subsection for new features, bugfixes, etc. LMK if now is a good time. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Otherwise LGTM!
skops/card/tests/test_card.py
Outdated
def test_raise_error_if_no_cols(self): | ||
msg = "Empty table added" | ||
with pytest.raises(ValueError, match=msg): | ||
TableSection(table={}) | ||
|
||
def test_raise_error_if_no_rows(self): | ||
msg = "Empty table added" | ||
with pytest.raises(ValueError, match=msg): | ||
TableSection(table={"col": []}) | ||
|
||
def test_raise_error_df_empty(self): | ||
pd = pytest.importorskip("pandas") | ||
df = pd.DataFrame([]) | ||
msg = "Empty table added" | ||
with pytest.raises(ValueError, match=msg): | ||
TableSection(table=df) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these could be a single parameterized test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Please review again.
Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com>
Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com>
Solves #87
Description
This feature adds a new method to the
Card
class calledadd_table
. Itadds a generic table to the end of the model card by creating a new
section (similar to plots).
Any generic table of type
dict[str, list[Any]]
is supported, as well asDataFrame
s. However, the main use case is for adding CV results, as e.g.created by
GridSearchCV
.Implementation
I hope you can stomach the type-y implementation Adrin ;)
I wanted to make the extra sections more generic in PR #89 as a
preparation for this PR (thus avoiding a big PR with refactor + new
feature). However, the changes in #89 were not sufficient (I guess
that's the danger of splitting the work).
The problem was that the formatting of adding the plots in the
save
method was very specific to plots, as a markdown link would be added.
This, of course, does not work for tables.
My solution is that
_extra_sections
now takesExtraSection
objects asvalues instead of strings. These objects only need to implement the
format
method (so technically,str
would still be okay). So for aplot, the
PlotSection
class has aformat
method that returns a markdownlink. And for tables, the
TableSection
class has aformat
method thatuses
tabulate
to return a table as string.This approach should hopefully be flexible enough for us to later add
new section types if needed.
At first glance, the implementation might look overly complex. However,
I think it's better than having a bunch of
if...else
inside of save thatformats each possible type differently. But please LMK if you have a
better idea, I'm open to suggestions.
Tangential changes
So far, the plots were just added without an extra section to the bottom.
Now, there is an "# Additioinal Content" section and each plot and table
gets their own subsection.
Examples
Here is the hyperparameter table generated by
plot_model_card.py: