
Add datasets to model card #1015

Open
wants to merge 8 commits into main

Conversation

@ittailup ittailup commented Dec 29, 2023

In the spirit of #1004 and #1005, this automatically adds a "datasets" object to the model card through the "dataset_tags" attribute.

The reason we use dataset_tags and not dataset comes from here

Datasets are excluded from the model card if they are local directories, detected using the same method as in https://github.com/OpenAccess-AI-Collective/axolotl/blob/dec66d7c53a2de6cf74911faf9c1ad1f7f0fff14/src/axolotl/utils/data.py#L244

Running @hamelsmu's tiny-mistral config from here, we get the following model card:

---
library_name: peft
tags:
- generated_from_trainer
datasets:
- mhenrichsen/alpaca_2k_test
base_model: openaccess-ai-collective/tiny-mistral
model-index:
- name: temp_dir
  results: []
---

-trainer.create_model_card(model_name=cfg.output_dir.lstrip("./"))
+trainer.create_model_card(
+    model_name=cfg.output_dir.lstrip("./"),
+    dataset_tags=list(map(lambda d: d["path"], cfg["datasets"])),
+)
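The `dataset_tags` extraction in the diff above can be sketched on its own, with `cfg` modeled as a plain dict shaped like axolotl's YAML config (a minimal stand-in, not the PR's actual code):

```python
# Hypothetical sketch: pull dataset paths out of an axolotl-style config
# dict so they can be passed as dataset_tags to create_model_card.
def dataset_tags_from_cfg(cfg):
    # Each entry under "datasets" has a "path" key, which is either a
    # local directory or a Hugging Face Hub dataset id.
    return [d["path"] for d in cfg.get("datasets", [])]

cfg = {"datasets": [{"path": "mhenrichsen/alpaca_2k_test", "type": "alpaca"}]}
tags = dataset_tags_from_cfg(cfg)
```

With the config above, `tags` is `["mhenrichsen/alpaca_2k_test"]`, which matches the `datasets:` entry in the generated model card.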
Collaborator

I like this, I do think we should try to figure out if a path is a local path or hf dataset though.
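One lightweight way to make that distinction, assuming local datasets exist as paths on disk (a sketch along the lines of the check the follow-up commit describes, not the PR's diff):

```python
import os

def is_local_dataset(path: str) -> bool:
    # Treat anything that exists on disk as a local dataset;
    # everything else is assumed to be a Hugging Face Hub dataset id.
    return os.path.isdir(path) or os.path.isfile(path)

def hub_dataset_tags(paths):
    # Keep only entries that should be listed in the model card metadata.
    return [p for p in paths if not is_local_dataset(p)]
```

This only filters out paths that exist locally; it does not verify that the remaining ids actually resolve on the Hub.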

Contributor Author


checking if local dir now, should be good. I tried to also check if we were saving the tokenized dataset to hfhub too, but when I enabled push_dataset_to_hub I would get

huggingface_hub.utils._validators.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': 'ittailup/alpaca-test/942a2dcbc346823803788fd18b4bb1bf'. Use `repo_type` argument if needed.

will report as separate issue.

@winglian

I'm happy to fix the linting tomorrow for you.

@hamelsmu

@winglian I fixed the lint (my fault) in #1014

@hamelsmu commented Dec 29, 2023

Actually, I'll fix the lint here too, it's a small commit... never mind, I can't push to this PR.

@osanseviero left a comment


Very cool!

@@ -181,7 +181,12 @@ def terminate_handler(_, __, model):
     model.save_pretrained(cfg.output_dir, safe_serialization=safe_serialization)

     if not cfg.hub_model_id:
-        trainer.create_model_card(model_name=cfg.output_dir.lstrip("./"))
+        trainer.create_model_card(


Very cool! Given you're adding datasets, here you can potentially also add finetuned_from. This is equivalent to base_model in the configuration from axolotl and will link the fine-tuned models to their base models

See https://huggingface.co/docs/hub/model-cards#specifying-a-base-model


Actually this might already be specified 🔥

Contributor Author


Yes! I added it on my first try, but once I saw the output and code I noticed it was already getting picked up (you can see it in hamel's raw model card print above).

I am wondering what else could get pushed onto the model card metadata as a separate function. Perhaps tags could get auto-added from specific settings? I've created the obvious one (evals) at #1020. The other one mentioned on the issue is CO2 usage, but that could be a little more difficult.

Collaborator


@osanseviero should we use something like hf_api.list_models(filter=...) to verify that the base_model specified isn't a locally downloaded model? or is there a simpler lightweight check?


You can go with huggingface/huggingface_hub#36 (comment) (so catching an error upon doing model_info) - that would likely be the most precise
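The precise check is the one linked above: call `model_info` on the candidate repo id and treat an error as "not on the Hub". As a zero-network pre-filter, one can first reject strings that don't even match the repo-id shape, as in the `HFValidationError` pasted earlier (a hypothetical helper, not part of this PR):

```python
import re

# Hub repo ids have the form "repo_name" or "namespace/repo_name";
# path-like strings with extra segments (or leading "/") fail this shape.
_REPO_ID_RE = re.compile(r"^[\w.-]+(/[\w.-]+)?$")

def looks_like_hub_id(name: str) -> bool:
    # Cheap shape check only; a passing name may still not exist on the
    # Hub, so model_info remains the authoritative verification.
    return bool(_REPO_ID_RE.match(name))
```

This would have caught the `'ittailup/alpaca-test/942a2dcbc346823803788fd18b4bb1bf'` case before any API call, since it has three path segments.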

@winglian left a comment


🙌

@winglian left a comment


So I just realized this only handles the case where the model card is created when no hub_model_id is defined and nothing gets pushed to HF. We also need to add the dataset tags for the case where the model gets automatically pushed to the hub and the model card gets created in that step.
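One way to keep the two branches in sync would be to build the keyword arguments once and pass them to both calls (a hypothetical refactor, not this PR's diff; `cfg` is modeled here as a plain dict, while axolotl's real config object differs):

```python
# Hypothetical helper: build the create_model_card kwargs in one place so
# the local-save branch and the push-to-hub branch pass identical metadata.
def model_card_kwargs(cfg):
    return {
        "model_name": cfg["output_dir"].lstrip("./"),
        "dataset_tags": [d["path"] for d in cfg["datasets"]],
    }

kwargs = model_card_kwargs(
    {"output_dir": "./temp_dir", "datasets": [{"path": "mhenrichsen/alpaca_2k_test"}]}
)
```

Both `trainer.create_model_card(**kwargs)` and `trainer.push_to_hub(**kwargs)` could then receive the same metadata, since transformers' `Trainer.push_to_hub` forwards extra keyword arguments to `create_model_card`.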

@winglian

@ittailup do you want me to help you with making the changes?

4 participants