-
-
Notifications
You must be signed in to change notification settings - Fork 780
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add datasets to model card #1015
base: main
Are you sure you want to change the base?
Conversation
src/axolotl/train.py
Outdated
trainer.create_model_card(model_name=cfg.output_dir.lstrip("./")) | ||
trainer.create_model_card( | ||
model_name=cfg.output_dir.lstrip("./"), | ||
dataset_tags=list(map(lambda d: d["path"], cfg["datasets"])), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this, I do think we should try to figure out if a path is a local path or hf dataset though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
checking if local dir now, should be good. I tried to also check if we were saving the tokenized dataset to hfhub too, but when I enabled push_dataset_to_hub
I would get
huggingface_hub.utils._validators.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': 'ittailup/alpaca-test/942a2dcbc346823803788fd18b4bb1bf'. Use `repo_type` argument if needed.
will report as separate issue.
I'm happy to fix the linti g tomorrow for you. |
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very cool!
@@ -181,7 +181,12 @@ def terminate_handler(_, __, model): | |||
model.save_pretrained(cfg.output_dir, safe_serialization=safe_serialization) | |||
|
|||
if not cfg.hub_model_id: | |||
trainer.create_model_card(model_name=cfg.output_dir.lstrip("./")) | |||
trainer.create_model_card( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very cool! Given you're adding datasets, here you can potentially also add finetuned_from
. This is equivalent to base_model
in the configuration from axolotl
and will link the fine-tuned models to their base models
See https://huggingface.co/docs/hub/model-cards#specifying-a-base-model
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually this might already be specified 🔥
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes! I added it in my first try but once I saw the output and code I noticed it was already getting picked up (you can see it in hamel's example of the model card raw print.
I am wondering what other things could get pushed on to the model card metadata as a separate function. Perhaps tags could get auto-added from specific settings? I've created the obvious one (evals) at #1020 . The other one mentioned on the issue is CO2 usage, but could be a little bit more difficult.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@osanseviero should we use something like hf_api.list_models(filter=...)
to verify that the base_model specified isn't a locally downloaded model? or is there a simpler lightweight check?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can go with huggingface/huggingface_hub#36 (comment) (so catching an error upon doing model_info
) - that would likely be the most precise
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🙌
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I just realized this only handles the case where it creates a model card when no hub_model_id is defined and it doesn't get pushed to hf. We need to add the dataset tags here as well for the case when the model gets automatically pushed to the hub and the model card gets created in that step.
@ittailup do you want me to help you with making the changes? |
In the spirit of #1004 and #1005, this automatically adds a "datasets" object to the model card through the "dataset_tags" attribute.
The reason we use dataset_tags and not dataset comes from here
Datasets are from model card if they are local directories, using the same method as in https://github.com/OpenAccess-AI-Collective/axolotl/blob/dec66d7c53a2de6cf74911faf9c1ad1f7f0fff14/src/axolotl/utils/data.py#L244
Running @hamelsmu tiny-mistral config from here we get the following model card: