HFModelPusher component proposal #174

Merged
merged 9 commits into from
Aug 30, 2022
74 changes: 74 additions & 0 deletions proposals/20220823-huggingface_model_pusher.md
@@ -0,0 +1,74 @@
#### SIG TFX-Addons
# Project Proposal

---

**Your name:** Chansung Park

**Your email:** deep.diver.csp@gmail.com

**Your company/organization:** Individual ([ML GDE](https://developers.google.com/community/experts/directory/profile/profile-chansung-park))

**Project name:** HuggingFace Model Pusher

## Project Description
HuggingFace Model Pusher (`HFModelPusher`) pushes a blessed model to the [HuggingFace Model Hub](https://huggingface.co/models).

## Project Category
Component

## Project Use-Case(s)
The HuggingFace Model Hub hosts [Git-LFS](https://git-lfs.github.com) enabled repositories, which can be public or private. Supported models hosted on the Hub can be loaded and used directly with the APIs provided by the [transformers](https://huggingface.co/docs/transformers/index) package. However, the Hub is not limited to those models; it can host arbitrary types of models too.

The HuggingFace Model Hub also makes it easy to manage model versions, especially for those familiar with Git.

## Project Implementation
HFModelPusher is a class-based TFX component, and it inherits from the standard TFX `Pusher` component.
Member

it may not necessarily need to inherit either btw

Contributor Author

Right, i agree

Contributor

It does not need to. But having it inherited from Pusher will be beneficial. I guess that's what @deep-diver meant.


It takes the following inputs:
Contributor

It might be nice to include a README and/or a model_card_metadata config as inputs for additional documentation and discoverability.

Contributor Author

@codesue

Thank you for the suggestion! @sayakpaul and I had the same thought.

Specifically, it would be great to upload a model card generated by Evaluator and model-card-toolkit. However, porting model-card-toolkit into TFX Addons is still an ongoing project, so I thought we could upgrade HFModelPusher once that work is completed.

By the way, your suggestion on the model_card_metadata config sounds good too! But it has a lot of information to fill in, so it would be inappropriate for a TFX component to fill it in automatically. Do you have any idea how to make this easier for users?

Member

Similar to how MCT fills in information, it may be good to fill in some of this automatically by adding Statistics and similar artifacts as inputs to the component. See the existing ModelCardGenerator component for reference -> https://github.com/tensorflow/model-card-toolkit/blob/master/model_card_toolkit/tfx/executor.py

Contributor Author

so, taking outputs from Evaluator and StatisticsGen optionally.

  • if both Artifacts exist, create a ModelCardToolkit
  • then fill some information from it

Contributor

Doubt.

Even if the Evaluator and StatisticsGen output artifacts do not exist we'll create a model card with general info, right?

Contributor Author

I think the modelcards package is useful for this purpose.

Contributor Author

@sayakpaul @codesue @casassg

Or it would be easier to just put the HTML contents generated by MCT into the markdown model card in the HuggingFace Model Repo. WDYT?

Contributor

I think we would want to avoid the code for turning an HTML page into a separate markdown file for the Hugging Face Hub README. IMO, we could develop separate utilities for this purpose:

  • Have a template for the model card (TFX and HF have their own formats, it seems, but there's overlap in information).
  • Have a utility that generates an HTML page from the populated template text
  • Have a utility that generates a markdown file from the populated template text

Contributor Author
@deep-diver, Aug 24, 2022

Right.

Since this component could become overcomplicated if we include a Model Card generation feature, I thought of a few possible solutions:

  1. Create another custom TFX evaluation component for the HuggingFace Model Card, or an HF Model Card Generator
  2. Or just run the evaluation inside this component (HFModelPusher) over the test dataset and fill the evaluation results into the Model Card
  3. Or leverage the existing standard TFX Evaluator component

Contributor

Model Card Toolkit can also create markdown or any arbitrary file types since you can have custom templates and filenames, so it wouldn't require converting HTML to markdown. Anyway, I'm in favor of your and @sayakpaul's original idea to upgrade HFModelPusher to include a model card later in order to prevent increasing scope and to avoid adding dependencies and unknowns.

```
HFModelPusher(
    username: str,
    huggingface_access_token: str,
    repo_name: Optional[str] = None,
    model: Optional[types.Channel] = None,
    model_blessing: Optional[types.Channel] = None,
)
```
- `username` : the HuggingFace username (can be an individual user or an organization)
- `huggingface_access_token` : the access token of the HuggingFace user
- `repo_name` : the name of the repository to push the current version of the model to. If omitted, it defaults to the TFX pipeline name
- `model` : the model artifact from an upstream TFX component such as `Trainer`
- `model_blessing` : the blessing artifact from an upstream TFX component such as `Evaluator`

It produces the following outputs:
- `pushed` : integer flag denoting whether the model was pushed. It is set to 0 when the input model is not blessed, and to 1 when the model is successfully pushed
- `pushed_version` : string value indicating the current model version. It is derived from the `time.time()` function in Python's standard `time` module
- `repo_id` : ID of the repository the model is pushed to. It follows the format `f"{username}/{repo_name}"`
- `branch` : name of the branch the model is pushed to. It is automatically set to the same value as `pushed_version`
- `commit_id` : ID of the resulting commit in the commit history (the branch name alone could be sufficient to retrieve a given version of the model)
- `repo_url` : repository URL, something like `f"https://huggingface.co/{repo_id}/tree/{branch}"`
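
As a rough illustration of how these output values relate to each other, the sketch below derives them from `username` and `repo_name`. The function name `build_push_outputs` and its `now` parameter are hypothetical and not part of the proposal. Truncating the timestamp to whole seconds is an assumption, as is the `/tree/` segment in the URL, which follows how the Hub addresses a branch in the browser.

```python
import time
from typing import Dict, Optional


def build_push_outputs(username: str, repo_name: str,
                       now: Optional[float] = None) -> Dict[str, str]:
    """Derive the version-related outputs (hypothetical helper, for illustration)."""
    # pushed_version is based on time.time(); truncating to whole seconds
    # is an assumption made here to keep the branch name short.
    timestamp = time.time() if now is None else now
    pushed_version = str(int(timestamp))

    repo_id = f"{username}/{repo_name}"
    branch = pushed_version  # the branch name mirrors pushed_version
    # Assumed URL layout for browsing a branch on the Hub.
    repo_url = f"https://huggingface.co/{repo_id}/tree/{branch}"

    return {
        "pushed_version": pushed_version,
        "repo_id": repo_id,
        "branch": branch,
        "repo_url": repo_url,
    }
```

Passing a fixed `now` makes the result reproducible, which may be convenient when unit testing the component.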

The behaviour of the component:
1. Pushes the model when the `model` is blessed, or when the `model_blessing` parameter is set to `None`. This behaviour is inherited from the standard `Pusher` component
2. Creates a HuggingFace Hub `Repository` object using the `huggingface-hub` package. If the repository already exists, it is cloned
3. Checks out a new branch named after `pushed_version`. Since the model is pushed for experimental purposes, it is useful to track each version in a separate branch (when the model is ready to be made public, one can manually merge the right version (branch) into the main branch)
4. Copies all the model-related files produced by the upstream component, such as `Trainer`, into a temporary directory on the local file system. These files may be stored in a GCS bucket, so the `tf.io.gfile` module is a good choice because it handles files in a location-agnostic manner (GCS or local)
5. Adds and commits the current state
6. Pushes the commit to the remote HuggingFace Model Repository
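
Steps 1 and 4 above could look roughly like the following sketch. It is not the proposed implementation: `should_push` and `stage_model_files` are hypothetical names, the blessing is reduced to a plain `Optional[bool]` instead of a TFX artifact, and `shutil` stands in for `tf.io.gfile`, so this version only handles local paths.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Optional


def should_push(blessed: Optional[bool]) -> bool:
    """Step 1: push when blessed, or when no blessing was supplied (None)."""
    return blessed is None or blessed


def stage_model_files(model_dir: str) -> Path:
    """Step 4: copy every model file into a fresh temporary directory.

    Uses shutil in place of tf.io.gfile, so only local paths work here.
    """
    staging_dir = Path(tempfile.mkdtemp(prefix="hf_pusher_"))
    # dirs_exist_ok (Python 3.8+) lets copytree write into the directory
    # that mkdtemp has already created.
    shutil.copytree(model_dir, staging_dir, dirs_exist_ok=True)
    return staging_dir
```

Staging into a temporary directory keeps the Git working tree separate from the pipeline's artifact store, which matters when the artifacts live in remote storage.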


## Project Dependencies
- [tfx](https://pypi.org/project/tfx/)
- [huggingface-hub](https://pypi.org/project/huggingface-hub/)

## Project Team
- Chansung Park, @deep-diver, deep.diver.csp@gmail.com
- Sayak Paul, @sayakpaul, spsayakpaul@gmail.com

# Note
Please be aware of the processes and requirements which are outlined here:

* [SIG-TFX-Addons](https://github.com/tensorflow/tfx-addons)
* [Contributing Guidelines](https://github.com/tensorflow/tfx-addons/blob/main/CONTRIBUTING.md)
* [TensorFlow Code of Conduct](https://github.com/tensorflow/tfx-addons/blob/main/CODE_OF_CONDUCT.md)