Output eval logging (batch level) #2977

Merged

@maxisawesome maxisawesome commented Feb 8, 2024

What does this PR do?

Log eval outputs after each batch using `logger.log_table`. This is an alternative to the design that logs at the end of eval (linked in the original description).

REQUIRES LLM-FOUNDRY BRANCH: mosaicml/llm-foundry#961

Eval only run: wandb
run name: test-batch-logging-kuIoME
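
A rough sketch of the batch-level pattern, for illustration only. `TableLogger` and `log_batch_outputs` are hypothetical stand-ins and not the classes touched by this PR; the column names, the table naming scheme, and the exact-match check are assumptions as well. The only idea taken from the description is that one table is logged per eval batch through a `log_table`-style call.

```python
from typing import Any, Dict, List, Optional


class TableLogger:
    """Hypothetical stand-in for a logger that exposes a log_table method."""

    def __init__(self) -> None:
        self.tables: Dict[str, Dict[str, Any]] = {}

    def log_table(
        self,
        columns: List[str],
        rows: List[List[Any]],
        name: str,
        step: Optional[int] = None,
    ) -> None:
        # A real logger would forward the table to a tracking backend;
        # this stand-in only stores it in memory.
        self.tables[name] = {"columns": columns, "rows": rows, "step": step}


def log_batch_outputs(
    logger: TableLogger,
    metric_name: str,
    batch_idx: int,
    prompts: List[str],
    outputs: List[str],
    labels: List[str],
) -> List[List[Any]]:
    """Log one table for a single eval batch (assumed layout, not the PR's)."""
    columns = ["prompt", "output", "label", "correct"]
    rows = [
        [p, o, l, o.strip() == l.strip()]
        for p, o, l in zip(prompts, outputs, labels)
    ]
    # Putting the step in the table name keeps batches from overwriting each other.
    logger.log_table(columns, rows, name=f"{metric_name}_step_{batch_idx}", step=batch_idx)
    return rows


logger = TableLogger()
rows = log_batch_outputs(
    logger,
    metric_name="qa_accuracy",
    batch_idx=0,
    prompts=["2+2=", "Capital of France?"],
    outputs=["4", "Lyon"],
    labels=["4", "Paris"],
)
```

Logging per batch means partial results are visible while eval is still running, at the cost of producing many small tables instead of one.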

bmosaicml and others added 30 commits September 12, 2023 12:03
composer/loggers/in_memory_logger.py (outdated, resolved)
composer/metrics/nlp.py (outdated, resolved)
composer/metrics/nlp.py (outdated, resolved)
composer/metrics/nlp.py (outdated, resolved)
composer/metrics/nlp.py (resolved)
composer/trainer/trainer.py (resolved)
maxisawesome and others added 2 commits March 4, 2024 16:23
Co-authored-by: Mihir Patel <mihir.v.patel7@gmail.com>
Co-authored-by: Mihir Patel <mihir.v.patel7@gmail.com>
@eracah eracah left a comment

LGTM. I mostly looked closely at the callbacks and loggers part

eracah commented Mar 5, 2024

Also I just want to say great work! This is a herculean PR requiring deep, bespoke knowledge while juggling several different parts of the composer codebase. Not an easy one to wrangle and seems like you managed to make it work!

@mvpatel2000 mvpatel2000 self-requested a review March 5, 2024 19:51
@mvpatel2000 mvpatel2000 left a comment

I'm holding until a few things are resolved, but helping you land this is my top priority :)

@maxisawesome maxisawesome merged commit 594eaef into mosaicml:dev Mar 9, 2024
14 checks passed
@maxisawesome maxisawesome deleted the error_logging_callback_in_batch branch March 9, 2024 00:54
@maxisawesome maxisawesome mentioned this pull request Apr 1, 2024
7 tasks
j316chuck added a commit that referenced this pull request May 16, 2024
* prelim commit

* fix max answer lengths for cot

* add output logger

* create eval output logger

* fix pyright; git push

* change dist reduce fx

* change dist reduce fx

* fix pyright

* Add nightly docker image (#2452)

Add pytorch nightly and CUDA 12.1 support for composer docker images

What issue(s) does this change relate to?
Related to https://mosaicml.atlassian.net/browse/GRT-2305

Tests
docker image: mosaicml/ci-staging:72744756-794c-4390-94db-72c212dd5e00 (cuda 12.1, pytorch 2.1.0)

mcli connect temp-test-ZAVxMh
Python 3.10.12 (main, Jun  7 2023, 12:45:35) [GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> print(torch.version)
<module 'torch.version' from '/usr/lib/python3/dist-packages/torch/version.py'>
>>> print(torch.__version__)
2.1.0.dev20230623+cu121
>>> print(torch.version.cuda)
12.1
Integration Test
@mvpatel2000 has validated that this trains on initial mpt-2 experiments and speeds up training by +7-8% from 0.25 MFU to 0.27 MFU

* Fix local eval (#2465)

* fix autoresume with slashed directory

* Revert "fix autoresume with slashed directory"

This reverts commit 3dfb5f5.

revert

* fix

* fix precommit

* Update in_context_learning_evaluation.py

* Update in_context_learning_evaluation.py

* Update in_context_learning_evaluation.py

* add tests

* Add torch 2.1.0 args for github release-docker workflow

* Log system metrics on each event (#2412)


Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>
Co-authored-by: Evan Racah <evan@mosaicml.com>
Co-authored-by: eracah <ejracah@gmail.com>

* Fix torch 2.1.0 docker tag (#2472)

* Upstream Generate Callback  (#2449)

Upstreams and generalizes the callback that logs generations to wandb from foundry to composer.

* Upgrade torch nightly docker image for 0.18.3 NCCL version  (#2476)

Upgrade torch docker nightly version to 08-23-23 so that we get nccl version 0.18.3 which was merged on 08-18-23.

* Test pytorch 2.1.0 docker images on ci/cd (#2469)

Test pytorch 2.1.0 docker images on ci/cd #2469

* Fix huggingface tokenizer loading for slow tokenizers (#2483)

* Deprecate Fused LayerNorm (#2475)

Will be removed in v0.18.

* Transformers upgrade (#2489)

* Update RTD build config with build.os (#2490)

* Update RTD build config with build.os
* Remove python.version

---------

Co-authored-by: Bandish Shah <bandish.shah@databricks.com>

* Upgrade torch docker version and github workflow tests (#2488)

* upgrade node version (#2492)

# What does this PR do?
Security vulnerability in `semver` seen due to node. This PR upgrades the node version to bump up semver from 7.5.1 to 7.5.2

# Tests
Action Run: https://github.com/mosaicml/composer/actions/runs/6017539089
Correct version of semver seen after upgrade: 
```
#14 [pytorch_stage  7/24] RUN npm list -g semver --depth=1
#14 2.223 /usr/lib
#14 2.223 `-- npm@9.8.0
#14 2.223   `-- semver@7.5.2
#14 2.223 
#14 DONE 2.4s
```

* Gating tying modules w/ FSDP for torch 2.0 (#2467)

* Gating tying modules w/ FSDP

* Changing weight tying filtering to be less aggressive

* precommit formatting

* Removing min_params (#2494)

* Removing min_params

* formatting?

* removing overlap with another commit

* Fix torchmetrics backwards compatibility issue (#2468)

* add fix

* fix tests

* qwf

* dsfg

* add key

* remove short

* add map test

* remove comment

* filter warning

* simplify wrapping

* checkdown

* fix torchmetrics

* 300

* fix tests

* remove metric

* cleanup

* bug fixes

* fix lint

* fix lint

* fix test

* lint

* remove cuda

* fix tests

* fix ignore

* fix loading

* fix test

* save ckpt

---------

Co-authored-by: Mihir Patel <mihir.v.patel7@gmail.com>
Co-authored-by: Daniel King <43149077+dakinggg@users.noreply.github.com>
Co-authored-by: Your Name <you@example.com>

* Adding some fixes to FSDP tests (#2495)

* Adding some fixes to FSDP tests

* Add filter warnings

* fail count (#2496)

* Remove PR curve metrics from backward compatibility test and skip torch 1.13 (#2497)

* filter warning (#2500)

* bump version (#2498)

* Skip metrics in state dict (#2501)

* skip metrics in state dict

* fix unit tests

* Add peak memory stats (#2504)

* add peak memory stats

* fix tests

* fix sharded ckpt (#2505)

* Bump gitpython from 3.1.31 to 3.1.34 (#2509)

Bumps [gitpython](https://github.com/gitpython-developers/GitPython) from 3.1.31 to 3.1.34.
- [Release notes](https://github.com/gitpython-developers/GitPython/releases)
- [Changelog](https://github.com/gitpython-developers/GitPython/blob/main/CHANGES)
- [Commits](gitpython-developers/GitPython@3.1.31...3.1.34)

---
updated-dependencies:
- dependency-name: gitpython
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Annotate `torch_prof_remote_file_name` as Optional (#2512)

The `torch_prof_remote_file_name` argument of `Profiler` is passed
as the `remote_file_name` argument of `TorchProfiler`, which supports
passing `None` to disable uploading trace files. Prior to this
commit, passing `None` to `Profiler` to do this whilst using a
static type checker led to a type error.
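
For illustration, a minimal sketch of the annotation change described above. `Profiler` and `TorchProfiler` here are simplified, hypothetical stand-ins that share only the argument names with the real classes; the default file name and the `uploads_traces` property are assumptions made for the example.

```python
from typing import Optional


class TorchProfiler:
    """Simplified stand-in: a remote_file_name of None disables trace upload."""

    def __init__(self, remote_file_name: Optional[str] = None) -> None:
        self.remote_file_name = remote_file_name

    @property
    def uploads_traces(self) -> bool:
        # Hypothetical helper, used here only to make the behaviour observable.
        return self.remote_file_name is not None


class Profiler:
    """Simplified stand-in that forwards the argument to TorchProfiler."""

    def __init__(
        self,
        # Annotating as Optional[str] rather than str lets callers pass None
        # without a static type checker reporting an error.
        # The default value below is an assumption for this sketch.
        torch_prof_remote_file_name: Optional[str] = "traces/rank{rank}.json",
    ) -> None:
        self.torch_profiler = TorchProfiler(remote_file_name=torch_prof_remote_file_name)


disabled = Profiler(torch_prof_remote_file_name=None)
enabled = Profiler()
```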

* fix: when there is no train_metrics, do not checkpoint (#2502)

* Remove metric saving (#2514)

* no metric save

* fix docs

* checkdown

* fix tests

* filter warning

* move to device

* fix device gpu

* Update composer/core/state.py

Co-authored-by: Daniel King <43149077+dakinggg@users.noreply.github.com>

---------

Co-authored-by: Daniel King <43149077+dakinggg@users.noreply.github.com>

* Fix daily tests by removing gpu marker (#2515)

* Refactor mosaic_fsdp.py (#2506)

* Refactor mosaic_fsdp.py

* Format file

* Rename monkey patch function

* Fix import path

* Format files

* Fix version

* fix pr (#2517)

* Add custom sharding to ChunkShardingSpec (#2507)

* Refactor mosaic_fsdp.py

* Format file

* Rename monkey patch function

* Fix import path

* Format files

* Fix version

* Fix import path

* Monkey patch ChunkShardingSpec to dynamically detect sharding dim

* Format file

* Add non divisible functionality to ChunkShardingSpec

* Format file

* Format file

* Update nightly docker image to torch nightly 09-03-23 (#2518)

* Update pre-commit in setup.py (#2522)

* Add FSDP custom wrap with torch 2.1 (#2460)

* add torch2

* add code

* tag more changes

* Update composer/trainer/mosaic_fsdp.py

Co-authored-by: Vitaliy Chiley <6439018+vchiley@users.noreply.github.com>

* monkeypatch init

* raise pins

* add print

* more logs

* change if statements

* remove imports

* remove imports

* fix init

* fix versioning

* add hybrid shard

* checkdown

* revert hsdp

* add peak memory stats

* lint

* imports

* Update composer/trainer/mosaic_fsdp.py

Co-authored-by: Daniel King <43149077+dakinggg@users.noreply.github.com>

* fix wrap

* fix gate

* lint

* test

* change thresh

* import typing

* fix checks

* nuke pyright

* typo

* Update composer/trainer/mosaic_fsdp.py

Co-authored-by: Brian <23239305+b-chu@users.noreply.github.com>

* Update composer/trainer/mosaic_fsdp.py

Co-authored-by: Brian <23239305+b-chu@users.noreply.github.com>

* Update composer/trainer/mosaic_fsdp_utils.py

Co-authored-by: Brian <23239305+b-chu@users.noreply.github.com>

* resolve comments

* add comments

* add comments

---------

Co-authored-by: Vitaliy Chiley <6439018+vchiley@users.noreply.github.com>
Co-authored-by: Daniel King <43149077+dakinggg@users.noreply.github.com>
Co-authored-by: Brian <23239305+b-chu@users.noreply.github.com>

* Fix GCSObjectStore bug where hmac keys auth doesn't work (#2519)

* prelim commit

* add output logger

* create eval output logger

* change dist reduce fx

* Bump gitpython from 3.1.34 to 3.1.35 (#2525)

Bumps [gitpython](https://github.com/gitpython-developers/GitPython) from 3.1.34 to 3.1.35.
- [Release notes](https://github.com/gitpython-developers/GitPython/releases)
- [Changelog](https://github.com/gitpython-developers/GitPython/blob/main/CHANGES)
- [Commits](gitpython-developers/GitPython@3.1.34...3.1.35)

---
updated-dependencies:
- dependency-name: gitpython
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump pytest from 7.4.0 to 7.4.2 (#2523)

Bumps [pytest](https://github.com/pytest-dev/pytest) from 7.4.0 to 7.4.2.
- [Release notes](https://github.com/pytest-dev/pytest/releases)
- [Changelog](https://github.com/pytest-dev/pytest/blob/main/CHANGELOG.rst)
- [Commits](pytest-dev/pytest@7.4.0...7.4.2)

---
updated-dependencies:
- dependency-name: pytest
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Upgrade to mlflow version 2.5.0 (#2528)

* disable cifar daily (#2527)

* mosaicml logger robustness improvements (#2530)

* Fix metrics keys sort in DecoupledAdamW for OptimizerMonitor FSDP metric aggregation (#2531)

* Fix github actions for GCS integration testing (#2532)

* fix github actions

* make gpu test

* change dist reduce fx

* fix pyright

* Fix GCS tests (#2535)

* add PR tests

* fix test

* remove pr daily

* remove pr daily

* finish error logging cb

* fix

* add import to init

* add import to init

* add import to init

* add file writing

* add file writing

* add file writing

* add file writing

* add file writing

* move tensors to cpu

* remove tensors

* remove tensors

* remove tensors

* add prompt to qa

* add prompt to qa

* add prompt to qa

* add prompt to qa

* add prompt to qa

* add prompt to qa

* add prompt to qa

* add prompt to qa

* add prompt to qa

* add prompt to qa

* add prompt to qa

* add prompt to qa

* add prompt to qa

* add prompt to qa

* try debugging dist sync issue

* nit

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* fix syncing of non tensor state

* added gpu test

* fix error

* finish testing callback

* fix all errors

* test commit

* roll back test commit

* remove ranks

* re-tesT

* add custom gen kwargs and stopping on eos token

* modify test

* modify test

* finish

* finish

* finish

* finish

* finish pr

* implement early stop

* add tesT

* merge

* fix

* finish

* finish

* fix bug

* finish

* bug fix

* add keys

* add correcT

* modify sync

* diff split

* fix typo

* edit condition

* broken wip

* design demonstration commit

* simplify pr

* further simplify

* wip

* add comments

* add other icl metrics

* wip

* change dict method, add more stuff to logging

* fix typos, change some comments

* decode tensors, fix wrong dict key

* fix mc

* 1 to 0 lol

* wip linting

* adjust to step logging

* adjust logging names

* add mlflow, rm batch keys

* add comments, check for dict in huggingface model update_metric

* add user specified logging

* move metric_name duplication to update_metric

* wip fix testing

* fix input shape error

* rm init

* rm eval_after_all

* step=None

* step=state.timestamp.batch.value

* update name to include step

* linting, wip on test

* fix test

* pyright wip

* add non-batch warning

* pyright

* debug

* rm this commit that wasn't the right branch

* log at the end of training

* rm silly wandb table logging

* add run_name

* add docstring

* add debug logging

* more logging

* rm info logging

* improve comments

* Update composer/callbacks/eval_output_logging_callback.py

Co-authored-by: Evan Racah <ejracah@gmail.com>

* rm logging bool

* fix logging for schema tasks

* fix schema / mc tasks

* yapf

* rm reshape

* fix tests

* cleanup test

* pyright

* pyright

* docstring

* pyright

* update tests

* rm attention mask requirement

* Update composer/metrics/nlp.py

Co-authored-by: Mihir Patel <mihir.v.patel7@gmail.com>

* Update composer/metrics/nlp.py

Co-authored-by: Mihir Patel <mihir.v.patel7@gmail.com>

* rm todo

* lint

* lint

* lint

* more lint

---------

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Jeremy Dohmann <jeremy@mosaicml.com>
Co-authored-by: Jeremy D <115047575+bmosaicml@users.noreply.github.com>
Co-authored-by: Charles Tang <j316chuck@users.noreply.github.com>
Co-authored-by: Rishab Parthasarathy <56666587+rishab-partha@users.noreply.github.com>
Co-authored-by: Prithvi Kannan <46332835+prithvikannan@users.noreply.github.com>
Co-authored-by: Evan Racah <evan@mosaicml.com>
Co-authored-by: eracah <ejracah@gmail.com>
Co-authored-by: Irene Dea <14367635+irenedea@users.noreply.github.com>
Co-authored-by: Daniel King <43149077+dakinggg@users.noreply.github.com>
Co-authored-by: nik-mosaic <101217697+nik-mosaic@users.noreply.github.com>
Co-authored-by: bandish-shah <86627118+bandish-shah@users.noreply.github.com>
Co-authored-by: Bandish Shah <bandish.shah@databricks.com>
Co-authored-by: bcui19 <bcui8377@gmail.com>
Co-authored-by: Mihir Patel <mihir.v.patel7@gmail.com>
Co-authored-by: Your Name <you@example.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Scott Stevenson <scott@stevenson.io>
Co-authored-by: furkanbiten <furkanbiten@gmail.com>
Co-authored-by: Brian <23239305+b-chu@users.noreply.github.com>
Co-authored-by: Vitaliy Chiley <6439018+vchiley@users.noreply.github.com>
Co-authored-by: Nicholas Garcia <nicholasgcgarcia@gmail.com>
Co-authored-by: Mikhail Kolesov <30723114+m1kol@users.noreply.github.com>
Co-authored-by: root <jecdohmann@gmail.com>
Co-authored-by: Tessa Barton <tbarton16@gmail.com>