
Enable gauntlet training #501

Merged: 79 commits from enable_gauntlet_training into main, Aug 29, 2023
Conversation

bmosaicml (Contributor) commented Aug 2, 2023

This PR enables us to run the gauntlet during training. It also lets you restrict eval to a subset of batches, in order to speed up ICL evaluation during training.

It also changes eval to pull metrics from the trainer `State` rather than from an `InMemoryLogger`.
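A minimal sketch (not the actual llm-foundry code) of the change described above: computing gauntlet scores by reading metrics off the trainer state rather than replaying a logger. The classes below are stand-ins; in Composer the real object is `state.eval_metrics`, a dict mapping evaluator names to `{metric_name: Metric}`, and the metric/key names here are illustrative assumptions.

```python
# Sketch: gather ICL scores from a state object instead of an InMemoryLogger.
# FakeMetric/FakeState are stand-ins for torchmetrics Metric and Composer State.
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class FakeMetric:
    value: float

    def compute(self) -> float:  # mirrors torchmetrics' Metric.compute()
        return self.value


@dataclass
class FakeState:
    # evaluator name -> {metric name -> metric object}
    eval_metrics: Dict[str, Dict[str, FakeMetric]] = field(default_factory=dict)


def gather_gauntlet_scores(state: FakeState) -> Dict[str, float]:
    """Flatten per-evaluator metrics into one {task/metric: score} dict."""
    scores = {}
    for evaluator_name, metrics in state.eval_metrics.items():
        for metric_name, metric in metrics.items():
            scores[f'{evaluator_name}/{metric_name}'] = metric.compute()
    return scores


state = FakeState(eval_metrics={
    'icl/jeopardy': {'InContextLearningAccuracy': FakeMetric(0.42)},
    'icl/lambada': {'InContextLearningAccuracy': FakeMetric(0.61)},
})
print(gather_gauntlet_scores(state))
```

The advantage over the logger-based approach is that the callback needs no extra logging machinery: the metrics already live on the state after each eval pass.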

Eval test run: all-eval-F2oe1D
Training test run: test-1b-C5QG5k test-1b-c7sqCh

https://wandb.ai/mosaic-ml/gauntlet/runs/rrbn0qbb?workspace=user-jemdohmann


Eval results still good:

| model_name               |   average |   world_knowledge |   commonsense_reasoning |   language_understanding |   symbolic_problem_solving |   reading_comprehension |
|:-------------------------|----------:|------------------:|------------------------:|-------------------------:|---------------------------:|------------------------:|
| mosaicml/mpt-7b-instruct |  0.354255 |          0.398764 |                0.415097 |                 0.371509 |                   0.171216 |                0.414691 |

hanlint (Collaborator) left a comment:

Thanks for putting this PR together, left a few comments:

  • the new yaml files have unexplained changes (batch size, fsdp config), and also have several hardcoded internal OCI bucket names and cluster names. Please remove the hardcoded values prior to pushing PRs.
  • can we avoid silently adding in an InMemoryLogger, purely for the purposes of passing around data? Seems like there's a deeper design issue if we have to resort to this approach.

eracah (Contributor) left a comment:
Might be worth using `state` to get the metrics instead of the `InMemoryLogger`.

bmosaicml (Contributor, Author) commented Aug 28, 2023

> The tests that you have added are really slow, nearly doubling the time it takes CI to run:
>
> 196.68s call     tests/test_training.py::test_train_gauntlet
> 161.70s call     tests/test_training.py::test_train[cpu]
> 120.18s call     tests/test_model_gauntlet.py::test_gauntlet_callback[True-False]
> 120.00s call     tests/test_model_gauntlet.py::test_gauntlet_callback[False-True]
> 119.20s call     tests/test_model_gauntlet.py::test_gauntlet_callback[True-True]
> 118.90s call     tests/test_model_gauntlet.py::test_gauntlet_callback[False-False]
>
> Can you please reduce the intensity of the tests? Probably run fewer eval tasks, or subset the batches?

I don't think test_train_gauntlet can be sped up any further; it's now about the same speed as test_train. Should I just cut it, or make it GPU-only? Alternatively, we could cut test_train, since it is merely a subset of the functionality of test_train_gauntlet.
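The "subset the batches" suggestion above is the same knob the PR adds for speeding up in-training ICL eval: cap each evaluation dataloader at N batches instead of iterating it fully. Composer's Trainer exposes a similar option (`eval_subset_num_batches`); the sketch below is an illustrative stand-in, not the actual implementation.

```python
# Sketch: average a per-batch score over only the first N batches.
# A negative subset_num_batches (the assumed default) means "use all batches".
from itertools import islice


def evaluate(dataloader, score_batch, subset_num_batches=-1):
    """Average score_batch(batch) over the dataloader, optionally truncated."""
    batches = (dataloader if subset_num_batches < 0
               else islice(dataloader, subset_num_batches))
    total, count = 0.0, 0
    for batch in batches:
        total += score_batch(batch)
        count += 1
    return total / max(count, 1)


# Toy "dataloader": 100 batches, each just an int scored by identity.
full = evaluate(range(100), score_batch=float)                         # 49.5
fast = evaluate(range(100), score_batch=float, subset_num_batches=10)  # 4.5
print(full, fast)
```

For tests, a small subset keeps the code path exercised end to end while cutting the runtime roughly in proportion to the number of batches skipped.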

dakinggg (Collaborator) left a comment:

See my remaining comments about backwards compatibility, and a couple of code changes that I think came from a bad merge.

bmosaicml merged commit aabdb3c into main on Aug 29, 2023; 8 checks passed.

dakinggg deleted the enable_gauntlet_training branch on October 11, 2023.