
Enable gauntlet training #501

Merged: 79 commits from enable_gauntlet_training into main, Aug 29, 2023
Conversation

bmosaicml (Contributor) commented Aug 2, 2023

This PR enables us to run the gauntlet during training. It also lets you restrict eval to a subset of batches, in order to speed up ICL evaluation during training.

It also changes eval to pull metrics from the trainer `State` rather than from an `InMemoryLogger`.
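A minimal sketch (not the actual llm-foundry code) of the change described above: computing gauntlet scores by reading metrics off the trainer state rather than replaying a logger. The classes below are stand-ins; in Composer the real object is `state.eval_metrics`, a dict mapping evaluator names to `{metric_name: Metric}`, and the metric/key names here are illustrative assumptions.

```python
# Sketch: gather ICL scores from a state object instead of an InMemoryLogger.
# FakeMetric/FakeState are stand-ins for torchmetrics Metric and Composer State.
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class FakeMetric:
    value: float

    def compute(self) -> float:  # mirrors torchmetrics' Metric.compute()
        return self.value


@dataclass
class FakeState:
    # evaluator name -> {metric name -> metric object}
    eval_metrics: Dict[str, Dict[str, FakeMetric]] = field(default_factory=dict)


def gather_gauntlet_scores(state: FakeState) -> Dict[str, float]:
    """Flatten per-evaluator metrics into one {task/metric: score} dict."""
    scores = {}
    for evaluator_name, metrics in state.eval_metrics.items():
        for metric_name, metric in metrics.items():
            scores[f'{evaluator_name}/{metric_name}'] = metric.compute()
    return scores


state = FakeState(eval_metrics={
    'icl/jeopardy': {'InContextLearningAccuracy': FakeMetric(0.42)},
    'icl/lambada': {'InContextLearningAccuracy': FakeMetric(0.61)},
})
print(gather_gauntlet_scores(state))
```

The advantage over the logger-based approach is that the callback needs no extra logging machinery: the metrics already live on the state after each eval pass.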

Eval test run: all-eval-F2oe1D
Training test run: test-1b-C5QG5k test-1b-c7sqCh

https://wandb.ai/mosaic-ml/gauntlet/runs/rrbn0qbb?workspace=user-jemdohmann


Eval results still good:

| model_name               |   average |   world_knowledge |   commonsense_reasoning |   language_understanding |   symbolic_problem_solving |   reading_comprehension |
|:-------------------------|----------:|------------------:|------------------------:|-------------------------:|---------------------------:|------------------------:|
| mosaicml/mpt-7b-instruct |  0.354255 |          0.398764 |                0.415097 |                 0.371509 |                   0.171216 |                0.414691 |

hanlint (Collaborator) left a comment:

Thanks for putting this PR together, left a few comments:

  • the new yaml files have unexplained changes (batch size, fsdp config), and also have several hardcoded internal OCI bucket names and cluster names. Please remove the hardcoded values prior to pushing PRs.
  • can we avoid silently adding in an InMemoryLogger, purely for the purposes of passing around data? Seems like there's a deeper design issue if we have to resort to this approach.

eracah (Contributor) left a comment:
Might be worth using `state` to get the metrics instead of the `InMemoryLogger`.

bmosaicml (Contributor, Author) commented Aug 28, 2023

> The tests that you have added are really slow, nearly doubling the time it takes CI to run:
>
> 196.68s call     tests/test_training.py::test_train_gauntlet
> 161.70s call     tests/test_training.py::test_train[cpu]
> 120.18s call     tests/test_model_gauntlet.py::test_gauntlet_callback[True-False]
> 120.00s call     tests/test_model_gauntlet.py::test_gauntlet_callback[False-True]
> 119.20s call     tests/test_model_gauntlet.py::test_gauntlet_callback[True-True]
> 118.90s call     tests/test_model_gauntlet.py::test_gauntlet_callback[False-False]
>
> Can you please reduce the intensity of the tests? Probably run fewer eval tasks, or subset the batches?

I don't think test_train_gauntlet can be sped up any further; it's now about the same speed as test_train. Should I just cut it, or make it GPU-only? Alternatively, we could cut test_train, since it is merely a subset of the functionality of test_train_gauntlet.
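The "subset the batches" suggestion above is the same knob the PR adds for speeding up in-training ICL eval: cap each evaluation dataloader at N batches instead of iterating it fully. Composer's Trainer exposes a similar option (`eval_subset_num_batches`); the sketch below is an illustrative stand-in, not the actual implementation.

```python
# Sketch: average a per-batch score over only the first N batches.
# A negative subset_num_batches (the assumed default) means "use all batches".
from itertools import islice


def evaluate(dataloader, score_batch, subset_num_batches=-1):
    """Average score_batch(batch) over the dataloader, optionally truncated."""
    batches = (dataloader if subset_num_batches < 0
               else islice(dataloader, subset_num_batches))
    total, count = 0.0, 0
    for batch in batches:
        total += score_batch(batch)
        count += 1
    return total / max(count, 1)


# Toy "dataloader": 100 batches, each just an int scored by identity.
full = evaluate(range(100), score_batch=float)                         # 49.5
fast = evaluate(range(100), score_batch=float, subset_num_batches=10)  # 4.5
print(full, fast)
```

For tests, a small subset keeps the code path exercised end to end while cutting the runtime roughly in proportion to the number of batches skipped.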

dakinggg (Collaborator) left a comment:

See my remaining comments about backwards compatibility, and a couple of code changes that I think came from a bad merge.

bmosaicml merged commit aabdb3c into main on Aug 29, 2023; 8 checks passed.

dakinggg deleted the enable_gauntlet_training branch on October 11, 2023.