
Add datasets to model card #1015

Open
wants to merge 8 commits into main

Conversation

@ittailup ittailup commented Dec 29, 2023

In the spirit of #1004 and #1005, this automatically adds a "datasets" object to the model card through the "dataset_tags" attribute.

The reason we use dataset_tags and not dataset comes from here

Datasets are excluded from the model card if they are local directories, detected using the same method as in https://github.com/OpenAccess-AI-Collective/axolotl/blob/dec66d7c53a2de6cf74911faf9c1ad1f7f0fff14/src/axolotl/utils/data.py#L244

Running @hamelsmu's tiny-mistral config from here, we get the following model card:

---
library_name: peft
tags:
- generated_from_trainer
datasets:
- mhenrichsen/alpaca_2k_test
base_model: openaccess-ai-collective/tiny-mistral
model-index:
- name: temp_dir
  results: []
---

-trainer.create_model_card(model_name=cfg.output_dir.lstrip("./"))
+trainer.create_model_card(
+    model_name=cfg.output_dir.lstrip("./"),
+    dataset_tags=list(map(lambda d: d["path"], cfg["datasets"])),
+)
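The `dataset_tags` extraction in the diff above can be sketched on its own, with `cfg` modeled as a plain dict shaped like axolotl's YAML config (a minimal stand-in, not the PR's actual code):

```python
# Hypothetical sketch: pull dataset paths out of an axolotl-style config
# dict so they can be passed as dataset_tags to create_model_card.
def dataset_tags_from_cfg(cfg):
    # Each entry under "datasets" has a "path" key, which is either a
    # local directory or a Hugging Face Hub dataset id.
    return [d["path"] for d in cfg.get("datasets", [])]

cfg = {"datasets": [{"path": "mhenrichsen/alpaca_2k_test", "type": "alpaca"}]}
tags = dataset_tags_from_cfg(cfg)
```

With the config above, `tags` is `["mhenrichsen/alpaca_2k_test"]`, which matches the `datasets:` entry in the generated model card.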
Collaborator

I like this, I do think we should try to figure out if a path is a local path or hf dataset though.
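One lightweight way to make that distinction, assuming local datasets exist as paths on disk (a sketch along the lines of the check the follow-up commit describes, not the PR's diff):

```python
import os

def is_local_dataset(path: str) -> bool:
    # Treat anything that exists on disk as a local dataset;
    # everything else is assumed to be a Hugging Face Hub dataset id.
    return os.path.isdir(path) or os.path.isfile(path)

def hub_dataset_tags(paths):
    # Keep only entries that should be listed in the model card metadata.
    return [p for p in paths if not is_local_dataset(p)]
```

This only filters out paths that exist locally; it does not verify that the remaining ids actually resolve on the Hub.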

Contributor Author


checking if local dir now, should be good. I tried to also check if we were saving the tokenized dataset to hfhub too, but when I enabled push_dataset_to_hub I would get

huggingface_hub.utils._validators.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': 'ittailup/alpaca-test/942a2dcbc346823803788fd18b4bb1bf'. Use `repo_type` argument if needed.

will report as separate issue.

@winglian

I'm happy to fix the linting tomorrow for you.

@hamelsmu

@winglian I fixed the lint (my fault) in #1014

@hamelsmu commented Dec 29, 2023

Actually, I'll fix the lint here too, it's a small commit... never mind, I can't push to this PR.

@osanseviero left a comment


Very cool!

@@ -181,7 +181,12 @@ def terminate_handler(_, __, model):
     model.save_pretrained(cfg.output_dir, safe_serialization=safe_serialization)

     if not cfg.hub_model_id:
-        trainer.create_model_card(model_name=cfg.output_dir.lstrip("./"))
+        trainer.create_model_card(


Very cool! Given you're adding datasets, here you can potentially also add finetuned_from. This is equivalent to base_model in the configuration from axolotl and will link the fine-tuned models to their base models

See https://huggingface.co/docs/hub/model-cards#specifying-a-base-model


Actually this might already be specified 🔥

Contributor Author


Yes! I added it on my first try, but once I saw the output and code I noticed it was already getting picked up (you can see it in hamel's raw model card print above).

I am wondering what else could get pushed onto the model card metadata as a separate function. Perhaps tags could get auto-added from specific settings? I've created the obvious one (evals) at #1020. The other one mentioned on the issue is CO2 usage, but that could be a little more difficult.

Collaborator


@osanseviero should we use something like hf_api.list_models(filter=...) to verify that the base_model specified isn't a locally downloaded model? or is there a simpler lightweight check?


You can go with huggingface/huggingface_hub#36 (comment) (so catching an error upon doing model_info) - that would likely be the most precise
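The precise check is the one linked above: call `model_info` on the candidate repo id and treat an error as "not on the Hub". As a zero-network pre-filter, one can first reject strings that don't even match the repo-id shape, as in the `HFValidationError` pasted earlier (a hypothetical helper, not part of this PR):

```python
import re

# Hub repo ids have the form "repo_name" or "namespace/repo_name";
# path-like strings with extra segments (or leading "/") fail this shape.
_REPO_ID_RE = re.compile(r"^[\w.-]+(/[\w.-]+)?$")

def looks_like_hub_id(name: str) -> bool:
    # Cheap shape check only; a passing name may still not exist on the
    # Hub, so model_info remains the authoritative verification.
    return bool(_REPO_ID_RE.match(name))
```

This would have caught the `'ittailup/alpaca-test/942a2dcbc346823803788fd18b4bb1bf'` case before any API call, since it has three path segments.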

@winglian left a comment


🙌

@winglian left a comment


So I just realized this only handles the case where the model card is created when no hub_model_id is defined and nothing gets pushed to HF. We also need to add the dataset tags for the case where the model gets automatically pushed to the hub and the model card gets created in that step.
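One way to keep the two branches in sync would be to build the keyword arguments once and pass them to both calls (a hypothetical refactor, not this PR's diff; `cfg` is modeled here as a plain dict, while axolotl's real config object differs):

```python
# Hypothetical helper: build the create_model_card kwargs in one place so
# the local-save branch and the push-to-hub branch pass identical metadata.
def model_card_kwargs(cfg):
    return {
        "model_name": cfg["output_dir"].lstrip("./"),
        "dataset_tags": [d["path"] for d in cfg["datasets"]],
    }

kwargs = model_card_kwargs(
    {"output_dir": "./temp_dir", "datasets": [{"path": "mhenrichsen/alpaca_2k_test"}]}
)
```

Both `trainer.create_model_card(**kwargs)` and `trainer.push_to_hub(**kwargs)` could then receive the same metadata, since transformers' `Trainer.push_to_hub` forwards extra keyword arguments to `create_model_card`.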

@winglian

@ittailup do you want me to help you with making the changes?

4 participants