Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH Refactor Model Card #37

Merged
merged 64 commits into from
Jul 27, 2022
Merged

ENH Refactor Model Card #37

merged 64 commits into from
Jul 27, 2022

Conversation

merveenoyan
Copy link
Collaborator

Partially solves #29

@merveenoyan
Copy link
Collaborator Author

merveenoyan commented Jul 18, 2022

@adrinjalali @BenjaminBossan I cleaned up a little as we talked and updated example so you can see the usage.
Looks like this.

Note that this is a draft PR and I haven't implemented add_inspection and add methods yet.

@merveenoyan
Copy link
Collaborator Author

merveenoyan commented Jul 19, 2022

@adrinjalali this might be redundant/trivial but I wanted to confirm we previously talked about how we should enable people to create their own Displays both in #29 and the model card examination PR. I looked at source of Display objects and they're essentially matplotlib plots. I say add_inspection method doesn't limit to Display objects (which are again matplotlib plots) and we take matplotlib plots in general (which I thought of when I was writing the docstring for it). We can put two examples inside the plot_model_card one including a matplotlib plot and other including a Display.

@BenjaminBossan
Copy link
Collaborator

First of all, thanks for working on this and taking into account the feedback we gave. From my POV, this is going in the right direction.

I don't have a full blown review, but some more high level comments:

In the example, it is still required for the user to create a CardData object from model_cards. My personal preference would be that we don't require that. Instead, could we not have special fields on the Card class like tags, which we would use to populate the CardData when the user exports their model card? That way, they don't need to bother with that.

Probably, there is an argument to be made that this card data constitutes meta data and should thus be treated differently. In that case, we could have a special method add_metadata. WDYT?

The save method may need some cleanup.

  • A lot of the code is concerned with determining the template path, I think that should be factored out.
  • The method mutates self.template_sections, we should avoid this kind of side-effect if possible.
  • Do we want to write it in a more generic way? E.g. have an argument like format='huggingface' and call a specific method when for each format (only HF at the start).

Regarding the code in extract_estimator_config, I would propose that we use tabulate instead of concatenating strings to create tables, which becomes unwieldy very quickly. Of course, we should avoid adding dependencies if possible but tabulate has been very dependable for me in the past and is a very thin library.

@merveenoyan
Copy link
Collaborator Author

merveenoyan commented Jul 19, 2022

@BenjaminBossan I know about tabulate but if this is the only place we tabulate things I'm not in favor of adding an additional dependency.
I don't see why template_sections being mutated is a problem. The saving is where we pass the template_sections dictionary so it should be there.
I can refactor template_path related code, sure. Again I'm not nitpicking atm really given it's not open to review yet.
For CardData part, sure we can treat it like any other section and let people set things using add. 🙂
something like below should work:

metadata_keys = ["language", "license", "tags", "datasets", "metrics", "model-index"]
card_data_keys = {}
for key in metadata_keys:
    card_data_keys[key] = self.template_sections.pop(key)
card_data = CardData(**card_data_keys)

Then the rest can be passed to rendering directly.

For this comment
Do we want to write it in a more generic way? E.g. have an argument like format='huggingface' and call a specific method when for each format (only HF at the start). I like the idea.

@merveenoyan
Copy link
Collaborator Author

merveenoyan commented Jul 19, 2022

I wouldn't like to limit the user from adding as many plots as they wish, does something like below sound ok to you??

# append plot_name at the end of the template 
template = open(self.template_sections["template_path"], 'a')
for section in self.template_sections:
    if section not in template:
        template.write("\n\n")
        template.write(f"{section}\n")
        template.write("{{ " + f"{section}"+ " }}\n")
template.close()

which simply adds plot_name and a slot for it in template.
@adrinjalali @BenjaminBossan

@BenjaminBossan
Copy link
Collaborator

I know about tabulate but if this is the only place we tabulate things I'm not in favor of adding an additional dependency.

Okay, let's see if there will be other places we need it. If it's just this single one, I can see your argument.

I don't see why template_sections being mutated is a problem. The saving is where we pass the template_sections dictionary so it should be there.

In general, I would try to avoid (non-obvious) side effects if possible, since they make it harder to reason about the code. E.g. it can easily happen that calling the same method twice leads to different outcomes.

I wouldn't like to limit the user from adding as many plots as they wish

Yes, we shouldn't limit the user. As for the specific implementation, I can't really judge, I would need to better understand the involved code first.

Copy link
Member

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this API looks a lot better. We can move to documentation and tests now IMO.

I left a few comments, but nothing major.

As for tabulate, I'd agree with @BenjaminBossan that it'd be useful, but I don't mind not adding it in this PR. I think we'd need more tabulation when we move to adding confusion matrix as a matrix as opposed to a plot and other tabular data we'd like to add in the future. We can add the dependency and refactor when those things happen.

skops/card/_model_card.py Outdated Show resolved Hide resolved
skops/card/_model_card.py Outdated Show resolved Hide resolved
skops/card/_model_card.py Outdated Show resolved Hide resolved
skops/card/_model_card.py Outdated Show resolved Hide resolved
@merveenoyan merveenoyan marked this pull request as ready for review July 20, 2022 10:21
@merveenoyan
Copy link
Collaborator Author

Model card created with this implementation is here @adrinjalali

Copy link
Member

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking pretty good, but my previous review still stands, please have a look at them as well :)

examples/plot_model_card.py Outdated Show resolved Hide resolved
skops/card/_model_card.py Outdated Show resolved Hide resolved
skops/card/_model_card.py Outdated Show resolved Hide resolved
skops/card/_model_card.py Outdated Show resolved Hide resolved
@adrinjalali adrinjalali added this to the 0.1 milestone Jul 20, 2022
merveenoyan and others added 2 commits July 20, 2022 13:15
Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com>
@merveenoyan
Copy link
Collaborator Author

Model card with the latest version: https://huggingface.co/merve/sklearn-9b93776d-fb95-4b81-b267-fa378476bf15

I'll add tests and documentation now @adrinjalali

@merveenoyan
Copy link
Collaborator Author

merveenoyan commented Jul 20, 2022

@adrinjalali updated the examples, updated tests and added more tests. The hub_utils tests aren't passing though.

@merveenoyan
Copy link
Collaborator Author

merveenoyan commented Jul 20, 2022

WARNING: /Users/mervenoyan/Desktop/skops/skops/examples/plot_hf_hub.py failed to execute correctly: Traceback (most recent call last):
  File "/Users/mervenoyan/Desktop/skops/skops/examples/plot_hf_hub.py", line 100, in <module>
    token = os.environ["HF_HUB_TOKEN"]
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/os.py", line 679, in __getitem__
    raise KeyError(key) from None
KeyError: 'HF_HUB_TOKEN'

I get above failing while I try to get a local build of docs.
On my local:

skops/card/tests/test_card.py::test_save_model_card PASSED
skops/card/tests/test_card.py::test_hyperparameter_table PASSED
skops/card/tests/test_card.py::test_plot_model PASSED
skops/card/tests/test_card.py::test_add PASSED
skops/card/tests/test_card.py::test_add_inspection PASSED
skops/hub_utils/tests/test_hf_hub.py::test_validate_folder PASSED
skops/hub_utils/tests/test_hf_hub.py::test_create_config PASSED
skops/hub_utils/tests/test_hf_hub.py::test_init PASSED
skops/hub_utils/tests/test_hf_hub.py::test_push_download[True] FAILED
skops/hub_utils/tests/test_hf_hub.py::test_push_download[False] FAILED
skops/hub_utils/tests/test_hf_hub.py::test_get_config PASSED

@merveenoyan
Copy link
Collaborator Author

merveenoyan commented Jul 20, 2022

Every time user adds something to template the default template file (if it's used) itself changes, maybe it's better if I add a temporary template when user saves the template with added goods.

@adrinjalali
Copy link
Member

You can add your token to the local run by something like:

HF_HUB_TOKEN=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 make doc

Every time user adds something to template the default template file (if it's used) itself changes, maybe it's better if I add a temporary template when user saves the template with added goods.

Yeah the default template shouldn't change.

@osanseviero
Copy link
Contributor

Is that an actual token? 👀 I would invalidate it from the Hub if it is

@adrinjalali
Copy link
Member

Is that an actual token? eyes I would invalidate it from the Hub if it is

That token is already hard coded in hub_utils/tests/common.py. Related: #47

@merveenoyan
Copy link
Collaborator Author

eb834d4d-824d-4d74-a4b5-1913cfad4403_text

examples/plot_hf_hub.py Show resolved Hide resolved
skops/card/_model_card.py Outdated Show resolved Hide resolved
skops/card/_model_card.py Show resolved Hide resolved
skops/card/_model_card.py Outdated Show resolved Hide resolved
Copy link
Member

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some final nits, then I'm happy to merge.

examples/plot_model_card.py Outdated Show resolved Hide resolved
examples/plot_model_card.py Outdated Show resolved Hide resolved
skops/card/_model_card.py Show resolved Hide resolved
skops/card/_model_card.py Show resolved Hide resolved
@osanseviero
Copy link
Contributor

osanseviero commented Jul 26, 2022

I see many open discussions. Would be great if you could marked as resolved the ones that you verified are fixed or that don't require any additional action in this PR please 🙏 Great work! 🔥

@merveenoyan
Copy link
Collaborator Author

merveenoyan commented Jul 26, 2022

@adrinjalali codecov on windows fails for some reason. it wasn't failing before.

@adrinjalali
Copy link
Member

@merveenoyan seems like an issue on codecov side, re-running or waiting a few hours seems to be the suggested action lol

Copy link
Member

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

examples/plot_model_card.py Outdated Show resolved Hide resolved
Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com>
@adrinjalali
Copy link
Member

adrinjalali commented Jul 26, 2022

Since this is quite significant, I rather not merge with one approval. Let's wait for another one.

Copy link
Collaborator

@BenjaminBossan BenjaminBossan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're almost good to go.

The only real issue I had was still with the save method implementation. Please take a look at my suggestions.

# read original template
with open(Path(root[0]) / "card" / "default_template.md") as f:
default_template = f.read()
f.seek(0)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need the seek operation? It doesn't seem to do anything. I deleted it and the test still passed.

self._figure_paths[plot_name] = plot_path
return self

def save(self, path):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a few issues with this function, please let me know if you agree:

  1. It mutates self.template_sections

As a user, I would be surprised to find that the save method actually mutates the object itself. At the moment, that might not cause a bug, but this type of hidden mutation has the tendency to come back to bite you. Therefore, I would avoid it except if there is no good reason.

To achieve this, simply declare a copy at the beginning of the method, such as template_sections = self.template_sections.copy(). After that, each time you use self.template_sections in this method, use template_sections instead.

It would be even safer to use copy.deepcopy but here we're creating a whole duplicate, I'm not sure if that can be too expensive in some circumstances (doesn't seem so but I'm not sure).

If you agree with this change, here is a unit test that would fail without the change but passes with the change:

import copy
...
def test_template_sections_not_mutated_by_save(destination_path, model_card):
    # TODO
    template_sections_before = copy.deepcopy(model_card.template_sections)
    model_card.save(Path(destination_path) / "README.md")
    template_sections_after = copy.deepcopy(model_card.template_sections)
    assert template_sections_before == template_sections_after
  1. Remove if self._figure_paths: branch

I saw that there is some code duplication that we can avoid. It is not strictly necessary to check if self._figure_paths and then do an else. The reason it can be avoided is because the for plot in self._figure_paths already is a no-op if self._figure_paths is empty, so the outcome is the same. Thus this whole code:

        if self._figure_paths:
            with tempfile.TemporaryDirectory() as tmpdirname:
                shutil.copyfile(
                    self.template_sections["template_path"],
                    f"{tmpdirname}/temporary_template.md",
                )
                #  create a temporary template with the additional plots
                self.template_sections[
                    "template_path"
                ] = f"{tmpdirname}/temporary_template.md"
                # add plots at the end of the template
                with open(self.template_sections["template_path"], "a") as template:
                    for plot in self._figure_paths:
                        template.write(
                            f"\n\n{plot}\n"
                            + f"![{plot}]({self._figure_paths[plot]})\n\n"
                        )
                card = ModelCard.from_template(
                    card_data=card_data,
                    hyperparameter_table=self.hyperparameter_table,
                    model_plot=self.model_plot,
                    **self.template_sections,
                )

        else:
            card = ModelCard.from_template(
                card_data=card_data,
                hyperparameter_table=self.hyperparameter_table,
                model_plot=self.model_plot,
                **self.template_sections,
            )

can be simplified to:

        with tempfile.TemporaryDirectory() as tmpdirname:
            shutil.copyfile(
                template_sections["template_path"],
                f"{tmpdirname}/temporary_template.md",
            )
            #  create a temporary template with the additional plots
            template_sections["template_path"] = f"{tmpdirname}/temporary_template.md"
            # add plots at the end of the template
            with open(template_sections["template_path"], "a") as template:
                for plot in self._figure_paths:
                    template.write(
                        f"\n\n{plot}\n"
                        + f"![{plot}]({self._figure_paths[plot]})\n\n"
                    )

            card = ModelCard.from_template(
                card_data=card_data,
                hyperparameter_table=self.hyperparameter_table,
                model_plot=self.model_plot,
                **template_sections,
            )

There is, however, a small disadvantage to the proposed change: It will result in the temporary copy of the template file to be made even if there is no plot. IMHO this is a small price to pay for the code simplification but feel free to disagree.

  1. Breaking down the method

This should probably be done in another PR but I wanted to write it down or else I'll forget:

The save method does too many different things at the same time and should be broken down. E.g. there could be one sub-method to get the card_data and another one to get the card.

@merveenoyan
Copy link
Collaborator Author

@BenjaminBossan addressed your comments, used deepcopy.

merveenoyan and others added 2 commits July 27, 2022 13:40
Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com>
Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com>
Copy link
Collaborator

@BenjaminBossan BenjaminBossan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, a big change, happy to see it ready.

@BenjaminBossan BenjaminBossan merged commit d8f2afe into skops-dev:main Jul 27, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants