-
Notifications
You must be signed in to change notification settings - Fork 413
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor mosaic_fsdp.py #2506
Merged
Merged
Refactor mosaic_fsdp.py #2506
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
mvpatel2000
approved these changes
Sep 7, 2023
bmosaicml
pushed a commit
to bmosaicml/composer
that referenced
this pull request
Sep 14, 2023
* Refactor mosaic_fsdp.py * Format file * Rename monkey patch function * Fix import path * Format files * Fix version
maxisawesome
added a commit
that referenced
this pull request
Mar 9, 2024
* prelim commit * fix max answer lengths for cot * add output logger * create eval output logger * fix pyright; git push * change dist reduce fx * change dist reduce fx * fix pyright * Add nightly docker image (#2452) Add pytorch nightly and CUDA 12.1 support for composer docker images What issue(s) does this change relate to? Related to https://mosaicml.atlassian.net/browse/GRT-2305 Tests docker image: mosaicml/ci-staging:72744756-794c-4390-94db-72c212dd5e00 (cuda 12.1, pytorch 2.1.0) mcli connect temp-test-ZAVxMh Python 3.10.12 (main, Jun 7 2023, 12:45:35) [GCC 9.4.0] on linux Type "help", "copyright", "credits" or "license" for more information. >>> import torch >>> print(torch.version) <module 'torch.version' from '/usr/lib/python3/dist-packages/torch/version.py'> >>> print(torch.__version__) 2.1.0.dev20230623+cu121 >>> print(torch.version.cuda) 12.1 Integration Test @mvpatel2000 has validated that this trains on initial mpt-2 experiments and speeds up training by +7-8% from 0.25 MFU to 0.27 MFU * Fix local eval (#2465) * fix autoresume with slashed directory * Revert "fix autoresume with slashed directory" This reverts commit 3dfb5f5. revert * fix * fix precommit * Update in_context_learning_evaluation.py * Update in_context_learning_evaluation.py * Update in_context_learning_evaluation.py * add tests * Add torch 2.1.0 args for github release-docker workflow * Log system metrics on each event (#2412) Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com> Co-authored-by: Evan Racah <evan@mosaicml.com> Co-authored-by: eracah <ejracah@gmail.com> * Fix torch 2.1.0 docker tag (#2472) * Upstream Generate Callback (#2449) Upstreams and generalizes the callback that logs generations to wandb from foundry to composer. * Upgrade torch nightly docker image for 0.18.3 NCCL version (#2476) Upgrade torch docker nightly version to 08-23-23 so that we get nccl version 0.18.3 which was merged on 08-18-23. * Test pytorch 2.1.0 docker images on ci/cd (#2469) Test pytorch 2.1.0 docker images on ci/cd #2469 * Fix huggingface tokenizer loading for slow tokenizers (#2483) * Deprecate Fused LayerNorm (#2475) Will be removed in v0.18. * Transformers upgrade (#2489) * Update RTD build config with build.os (#2490) * Update RTD build config with build.os * Remove python.version --------- Co-authored-by: Bandish Shah <bandish.shah@databricks.com> * Upgrade torch docker version and github workflow tests (#2488) * upgrade node version (#2492) # What does this PR do? Security vulnerability in `semver` seen due to node. This PR upgrades the node version to bump up semver from 7.5.1 to 7.5.2 # Tests Action Run: https://github.com/mosaicml/composer/actions/runs/6017539089 Correct version of semver seen after upgrade: ``` #14 [pytorch_stage 7/24] RUN npm list -g semver --depth=1 #14 2.223 /usr/lib #14 2.223 `-- npm@9.8.0 #14 2.223 `-- semver@7.5.2 #14 2.223 #14 DONE 2.4s ``` * Gating tying modules w/ FSDP for torch 2.0 (#2467) * Gating tying modules w/ FSDP * Changing weight tying filtering to be less aggressive * precommit formatting * Removing min_params (#2494) * Removing min_params * formatting? * removing overlap with another commit * Fix torchmetrics backwards compatibility issue (#2468) * add fix * fix tests * qwf * dsfg * add key * remove short * add map test * remove comment * filter warning * simplify wrapping * checkdown * fix torchmetrics * 300 * fix tests * remove metric * cleanup * bug fixes * fix lint * fix lint * fix test * lint * remove cuda * fix tests * fix ignore * fix loading * fix test * save ckpt --------- Co-authored-by: Mihir Patel <mihir.v.patel7@gmail.com> Co-authored-by: Daniel King <43149077+dakinggg@users.noreply.github.com> Co-authored-by: Your Name <you@example.com> * Adding some fixes to FSDP tests (#2495) * Adding some fixes to FSDP tests * Add filter warnings * fail count (#2496) * Remove PR curve metrics from backward compatibility test and skip torch 1.13 (#2497) * filter warning (#2500) * bump version (#2498) * Skip metrics in state dict (#2501) * skip metrics in state dict * fix unit tests * Add peak memory stats (#2504) * add peak memory stats * fix tests * fix sharded ckpt (#2505) * Bump gitpython from 3.1.31 to 3.1.34 (#2509) Bumps [gitpython](https://github.com/gitpython-developers/GitPython) from 3.1.31 to 3.1.34. - [Release notes](https://github.com/gitpython-developers/GitPython/releases) - [Changelog](https://github.com/gitpython-developers/GitPython/blob/main/CHANGES) - [Commits](gitpython-developers/GitPython@3.1.31...3.1.34) --- updated-dependencies: - dependency-name: gitpython dependency-type: direct:development update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Annotate `torch_prof_remote_file_name` as Optional (#2512) The `torch_prof_remote_file_name` argument of `Profiler` is passed as the `remote_file_name` argument of `TorchProfiler`, which supports passing `None` to disable uploading trace files. Prior to this commit, passing `None` to `Profiler` to do this whilst using a static type checker led to a type error. * fix: when there is no train_metrics, do not checkpoint (#2502) * Remove metric saving (#2514) * no metric save * fix docs * checkdown * fix tests * filter warning * move to device * fix device gpu * Update composer/core/state.py Co-authored-by: Daniel King <43149077+dakinggg@users.noreply.github.com> --------- Co-authored-by: Daniel King <43149077+dakinggg@users.noreply.github.com> * Fix daily tests by removing gpu marker (#2515) * Refactor mosaic_fsdp.py (#2506) * Refactor mosaic_fsdp.py * Format file * Rename monkey patch function * Fix import path * Format files * Fix version * fix pr (#2517) * Add custom sharding to ChunkShardingSpec (#2507) * Refactor mosaic_fsdp.py * Format file * Rename monkey patch function * Fix import path * Format files * Fix version * Fix import path * Monkey patch ChunkShardingSpec to dynamically detect sharding dim * Format file * Add non divisible functionality to ChunkShardingSpec * Format file * Format file * Update nightly docker image to torch nightly 09-03-23 (#2518) * Update pre-commit in setup.py (#2522) * Add FSDP custom wrap with torch 2.1 (#2460) * add torch2 * add code * tag more changes * Update composer/trainer/mosaic_fsdp.py Co-authored-by: Vitaliy Chiley <6439018+vchiley@users.noreply.github.com> * monkeypatch init * raise pins * add print * more logs * change if statements * remove imports * remove imports * fix init * fix versioning * add hybrid shard * checkdown * revert hsdp * add peak memory stats * lint * imports * Update composer/trainer/mosaic_fsdp.py Co-authored-by: Daniel King <43149077+dakinggg@users.noreply.github.com> * fix wrap * fix gate * lint * test * change thresh * import typing * fix checks * nuke pyright * typo * Update composer/trainer/mosaic_fsdp.py Co-authored-by: Brian <23239305+b-chu@users.noreply.github.com> * Update composer/trainer/mosaic_fsdp.py Co-authored-by: Brian <23239305+b-chu@users.noreply.github.com> * Update composer/trainer/mosaic_fsdp_utils.py Co-authored-by: Brian <23239305+b-chu@users.noreply.github.com> * resolve comments * add comments * add comments --------- Co-authored-by: Vitaliy Chiley <6439018+vchiley@users.noreply.github.com> Co-authored-by: Daniel King <43149077+dakinggg@users.noreply.github.com> Co-authored-by: Brian <23239305+b-chu@users.noreply.github.com> * Fix GCSObjectStore bug where hmac keys auth doesn't work (#2519) * prelim commit * add output logger * create eval output logger * change dist reduce fx * Bump gitpython from 3.1.34 to 3.1.35 (#2525) Bumps [gitpython](https://github.com/gitpython-developers/GitPython) from 3.1.34 to 3.1.35. - [Release notes](https://github.com/gitpython-developers/GitPython/releases) - [Changelog](https://github.com/gitpython-developers/GitPython/blob/main/CHANGES) - [Commits](gitpython-developers/GitPython@3.1.34...3.1.35) --- updated-dependencies: - dependency-name: gitpython dependency-type: direct:development update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Bump pytest from 7.4.0 to 7.4.2 (#2523) Bumps [pytest](https://github.com/pytest-dev/pytest) from 7.4.0 to 7.4.2. - [Release notes](https://github.com/pytest-dev/pytest/releases) - [Changelog](https://github.com/pytest-dev/pytest/blob/main/CHANGELOG.rst) - [Commits](pytest-dev/pytest@7.4.0...7.4.2) --- updated-dependencies: - dependency-name: pytest dependency-type: direct:development update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Upgrade to mlflow version 2.5.0 (#2528) * disable cifar daily (#2527) * mosaicml logger robustness improvements (#2530) * Fix metrics keys sort in DecoupledAdamW for OptimizerMonitor FSDP metric agreggation (#2531) * Fix github actions for GCS integration testing (#2532) * fix github actions * make gpu test * change dist reduce fx * fix pyright * Fix GCS tests (#2535) * add PR tests * fix test * remove pr daily * remove pr daily * finish error logging cb * fix * add import to init * add import to init * add import to init * add file writing * add file writing * add file writing * add file writing * add file writing * move tensors to cpu * remove tensors * remove tensors * remove tensors * add prompt to qa * add prompt to qa * add prompt to qa * add prompt to qa * add prompt to qa * add prompt to qa * add prompt to qa * add prompt to qa * add prompt to qa * add prompt to qa * add prompt to qa * add prompt to qa * add prompt to qa * add prompt to qa * try debugging dist sync issue * nit * debugging * debugging * debugging * debugging * debugging * debugging * debugging * debugging * debugging * debugging * debugging * debugging * debugging * debugging * debugging * fix syncing of non tensor state * added gpu test * fix error * finish testing callback * fix all errors * test commit * roll back test commit * remove ranks * re-tesT * add custome gen kwargs and stopping on eos token * modify test * modify test * finish * finish * finish * finish * finish pr * implement early stop * add tesT * merge * fix * finish * finish * fix bug * finish * bug fix * add keys * add correcT * modify sync * diff split * fix typo * edit condition * broken wip * design demonstration commit * simplify pr * further simplify * wip * add comments * add other icl metrics * wip * change dict method, add more stuff to logging * fix typos, change some comments * decode tensors, fix wrong dict key * fix mc * 1 to 0 lol * wip linting * adjust to step logging * adjust logging names * add mflow, rm batch keys * add comments, check for dict in huggingface model update_metric * add user specified logging * move metric_name duplication to update_metric * wip fix testing * fix input shape error * rm init * rm eval_after_all * step=None * step=state.timestamp.batch.value * update name to include step * linting, wip on test * fix test * pyright wip * add non-batch warning * pyright * debug * rm this commit that wasn't the right branch * log at the end of training * rm silly wandb table logging * add run_name * add docstring * add debug logging * more logging * rm info logging * improve comments * Update composer/callbacks/eval_output_logging_callback.py Co-authored-by: Evan Racah <ejracah@gmail.com> * rm logging bool * fix logging for schema tasks * fix schema / mc tasks * yapf * rm reshape * fix tests * cleanup test * pyright * pyright * docstring * pyright * update tests * rm attention mask requirement * Update composer/metrics/nlp.py Co-authored-by: Mihir Patel <mihir.v.patel7@gmail.com> * Update composer/metrics/nlp.py Co-authored-by: Mihir Patel <mihir.v.patel7@gmail.com> * rm todo * lint * lint * lint * more lint --------- Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com> Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: Jeremy Dohmann <jeremy@mosaicml.com> Co-authored-by: Jeremy D <115047575+bmosaicml@users.noreply.github.com> Co-authored-by: Charles Tang <j316chuck@users.noreply.github.com> Co-authored-by: Rishab Parthasarathy <56666587+rishab-partha@users.noreply.github.com> Co-authored-by: Prithvi Kannan <46332835+prithvikannan@users.noreply.github.com> Co-authored-by: Evan Racah <evan@mosaicml.com> Co-authored-by: eracah <ejracah@gmail.com> Co-authored-by: Irene Dea <14367635+irenedea@users.noreply.github.com> Co-authored-by: Daniel King <43149077+dakinggg@users.noreply.github.com> Co-authored-by: nik-mosaic <101217697+nik-mosaic@users.noreply.github.com> Co-authored-by: bandish-shah <86627118+bandish-shah@users.noreply.github.com> Co-authored-by: Bandish Shah <bandish.shah@databricks.com> Co-authored-by: bcui19 <bcui8377@gmail.com> Co-authored-by: Mihir Patel <mihir.v.patel7@gmail.com> Co-authored-by: Your Name <you@example.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Scott Stevenson <scott@stevenson.io> Co-authored-by: furkanbiten <furkanbiten@gmail.com> Co-authored-by: Brian <23239305+b-chu@users.noreply.github.com> Co-authored-by: Vitaliy Chiley <6439018+vchiley@users.noreply.github.com> Co-authored-by: Nicholas Garcia <nicholasgcgarcia@gmail.com> Co-authored-by: Mikhail Kolesov <30723114+m1kol@users.noreply.github.com> Co-authored-by: root <jecdohmann@gmail.com> Co-authored-by: Tessa Barton <tbarton16@gmail.com>
j316chuck
added a commit
that referenced
this pull request
May 16, 2024
* prelim commit * fix max answer lengths for cot * add output logger * create eval output logger * fix pyright; git push * change dist reduce fx * change dist reduce fx * fix pyright * Add nightly docker image (#2452) Add pytorch nightly and CUDA 12.1 support for composer docker images What issue(s) does this change relate to? Related to https://mosaicml.atlassian.net/browse/GRT-2305 Tests docker image: mosaicml/ci-staging:72744756-794c-4390-94db-72c212dd5e00 (cuda 12.1, pytorch 2.1.0) mcli connect temp-test-ZAVxMh Python 3.10.12 (main, Jun 7 2023, 12:45:35) [GCC 9.4.0] on linux Type "help", "copyright", "credits" or "license" for more information. >>> import torch >>> print(torch.version) <module 'torch.version' from '/usr/lib/python3/dist-packages/torch/version.py'> >>> print(torch.__version__) 2.1.0.dev20230623+cu121 >>> print(torch.version.cuda) 12.1 Integration Test @mvpatel2000 has validated that this trains on initial mpt-2 experiments and speeds up training by +7-8% from 0.25 MFU to 0.27 MFU * Fix local eval (#2465) * fix autoresume with slashed directory * Revert "fix autoresume with slashed directory" This reverts commit 3dfb5f5. revert * fix * fix precommit * Update in_context_learning_evaluation.py * Update in_context_learning_evaluation.py * Update in_context_learning_evaluation.py * add tests * Add torch 2.1.0 args for github release-docker workflow * Log system metrics on each event (#2412) Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com> Co-authored-by: Evan Racah <evan@mosaicml.com> Co-authored-by: eracah <ejracah@gmail.com> * Fix torch 2.1.0 docker tag (#2472) * Upstream Generate Callback (#2449) Upstreams and generalizes the callback that logs generations to wandb from foundry to composer. * Upgrade torch nightly docker image for 0.18.3 NCCL version (#2476) Upgrade torch docker nightly version to 08-23-23 so that we get nccl version 0.18.3 which was merged on 08-18-23. * Test pytorch 2.1.0 docker images on ci/cd (#2469) Test pytorch 2.1.0 docker images on ci/cd #2469 * Fix huggingface tokenizer loading for slow tokenizers (#2483) * Deprecate Fused LayerNorm (#2475) Will be removed in v0.18. * Transformers upgrade (#2489) * Update RTD build config with build.os (#2490) * Update RTD build config with build.os * Remove python.version --------- Co-authored-by: Bandish Shah <bandish.shah@databricks.com> * Upgrade torch docker version and github workflow tests (#2488) * upgrade node version (#2492) # What does this PR do? Security vulnerability in `semver` seen due to node. This PR upgrades the node version to bump up semver from 7.5.1 to 7.5.2 # Tests Action Run: https://github.com/mosaicml/composer/actions/runs/6017539089 Correct version of semver seen after upgrade: ``` #14 [pytorch_stage 7/24] RUN npm list -g semver --depth=1 #14 2.223 /usr/lib #14 2.223 `-- npm@9.8.0 #14 2.223 `-- semver@7.5.2 #14 2.223 #14 DONE 2.4s ``` * Gating tying modules w/ FSDP for torch 2.0 (#2467) * Gating tying modules w/ FSDP * Changing weight tying filtering to be less aggressive * precommit formatting * Removing min_params (#2494) * Removing min_params * formatting? * removing overlap with another commit * Fix torchmetrics backwards compatibility issue (#2468) * add fix * fix tests * qwf * dsfg * add key * remove short * add map test * remove comment * filter warning * simplify wrapping * checkdown * fix torchmetrics * 300 * fix tests * remove metric * cleanup * bug fixes * fix lint * fix lint * fix test * lint * remove cuda * fix tests * fix ignore * fix loading * fix test * save ckpt --------- Co-authored-by: Mihir Patel <mihir.v.patel7@gmail.com> Co-authored-by: Daniel King <43149077+dakinggg@users.noreply.github.com> Co-authored-by: Your Name <you@example.com> * Adding some fixes to FSDP tests (#2495) * Adding some fixes to FSDP tests * Add filter warnings * fail count (#2496) * Remove PR curve metrics from backward compatibility test and skip torch 1.13 (#2497) * filter warning (#2500) * bump version (#2498) * Skip metrics in state dict (#2501) * skip metrics in state dict * fix unit tests * Add peak memory stats (#2504) * add peak memory stats * fix tests * fix sharded ckpt (#2505) * Bump gitpython from 3.1.31 to 3.1.34 (#2509) Bumps [gitpython](https://github.com/gitpython-developers/GitPython) from 3.1.31 to 3.1.34. - [Release notes](https://github.com/gitpython-developers/GitPython/releases) - [Changelog](https://github.com/gitpython-developers/GitPython/blob/main/CHANGES) - [Commits](gitpython-developers/GitPython@3.1.31...3.1.34) --- updated-dependencies: - dependency-name: gitpython dependency-type: direct:development update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Annotate `torch_prof_remote_file_name` as Optional (#2512) The `torch_prof_remote_file_name` argument of `Profiler` is passed as the `remote_file_name` argument of `TorchProfiler`, which supports passing `None` to disable uploading trace files. Prior to this commit, passing `None` to `Profiler` to do this whilst using a static type checker led to a type error. * fix: when there is no train_metrics, do not checkpoint (#2502) * Remove metric saving (#2514) * no metric save * fix docs * checkdown * fix tests * filter warning * move to device * fix device gpu * Update composer/core/state.py Co-authored-by: Daniel King <43149077+dakinggg@users.noreply.github.com> --------- Co-authored-by: Daniel King <43149077+dakinggg@users.noreply.github.com> * Fix daily tests by removing gpu marker (#2515) * Refactor mosaic_fsdp.py (#2506) * Refactor mosaic_fsdp.py * Format file * Rename monkey patch function * Fix import path * Format files * Fix version * fix pr (#2517) * Add custom sharding to ChunkShardingSpec (#2507) * Refactor mosaic_fsdp.py * Format file * Rename monkey patch function * Fix import path * Format files * Fix version * Fix import path * Monkey patch ChunkShardingSpec to dynamically detect sharding dim * Format file * Add non divisible functionality to ChunkShardingSpec * Format file * Format file * Update nightly docker image to torch nightly 09-03-23 (#2518) * Update pre-commit in setup.py (#2522) * Add FSDP custom wrap with torch 2.1 (#2460) * add torch2 * add code * tag more changes * Update composer/trainer/mosaic_fsdp.py Co-authored-by: Vitaliy Chiley <6439018+vchiley@users.noreply.github.com> * monkeypatch init * raise pins * add print * more logs * change if statements * remove imports * remove imports * fix init * fix versioning * add hybrid shard * checkdown * revert hsdp * add peak memory stats * lint * imports * Update composer/trainer/mosaic_fsdp.py Co-authored-by: Daniel King <43149077+dakinggg@users.noreply.github.com> * fix wrap * fix gate * lint * test * change thresh * import typing * fix checks * nuke pyright * typo * Update composer/trainer/mosaic_fsdp.py Co-authored-by: Brian <23239305+b-chu@users.noreply.github.com> * Update composer/trainer/mosaic_fsdp.py Co-authored-by: Brian <23239305+b-chu@users.noreply.github.com> * Update composer/trainer/mosaic_fsdp_utils.py Co-authored-by: Brian <23239305+b-chu@users.noreply.github.com> * resolve comments * add comments * add comments --------- Co-authored-by: Vitaliy Chiley <6439018+vchiley@users.noreply.github.com> Co-authored-by: Daniel King <43149077+dakinggg@users.noreply.github.com> Co-authored-by: Brian <23239305+b-chu@users.noreply.github.com> * Fix GCSObjectStore bug where hmac keys auth doesn't work (#2519) * prelim commit * add output logger * create eval output logger * change dist reduce fx * Bump gitpython from 3.1.34 to 3.1.35 (#2525) Bumps [gitpython](https://github.com/gitpython-developers/GitPython) from 3.1.34 to 3.1.35. - [Release notes](https://github.com/gitpython-developers/GitPython/releases) - [Changelog](https://github.com/gitpython-developers/GitPython/blob/main/CHANGES) - [Commits](gitpython-developers/GitPython@3.1.34...3.1.35) --- updated-dependencies: - dependency-name: gitpython dependency-type: direct:development update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Bump pytest from 7.4.0 to 7.4.2 (#2523) Bumps [pytest](https://github.com/pytest-dev/pytest) from 7.4.0 to 7.4.2. - [Release notes](https://github.com/pytest-dev/pytest/releases) - [Changelog](https://github.com/pytest-dev/pytest/blob/main/CHANGELOG.rst) - [Commits](pytest-dev/pytest@7.4.0...7.4.2) --- updated-dependencies: - dependency-name: pytest dependency-type: direct:development update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Upgrade to mlflow version 2.5.0 (#2528) * disable cifar daily (#2527) * mosaicml logger robustness improvements (#2530) * Fix metrics keys sort in DecoupledAdamW for OptimizerMonitor FSDP metric agreggation (#2531) * Fix github actions for GCS integration testing (#2532) * fix github actions * make gpu test * change dist reduce fx * fix pyright * Fix GCS tests (#2535) * add PR tests * fix test * remove pr daily * remove pr daily * finish error logging cb * fix * add import to init * add import to init * add import to init * add file writing * add file writing * add file writing * add file writing * add file writing * move tensors to cpu * remove tensors * remove tensors * remove tensors * add prompt to qa * add prompt to qa * add prompt to qa * add prompt to qa * add prompt to qa * add prompt to qa * add prompt to qa * add prompt to qa * add prompt to qa * add prompt to qa * add prompt to qa * add prompt to qa * add prompt to qa * add prompt to qa * try debugging dist sync issue * nit * debugging * debugging * debugging * debugging * debugging * debugging * debugging * debugging * debugging * debugging * debugging * debugging * debugging * debugging * debugging * fix syncing of non tensor state * added gpu test * fix error * finish testing callback * fix all errors * test commit * roll back test commit * remove ranks * re-tesT * add custome gen kwargs and stopping on eos token * modify test * modify test * finish * finish * finish * finish * finish pr * implement early stop * add tesT * merge * fix * finish * finish * fix bug * finish * bug fix * add keys * add correcT * modify sync * diff split * fix typo * edit condition * broken wip * design demonstration commit * simplify pr * further simplify * wip * add comments * add other icl metrics * wip * change dict method, add more stuff to logging * fix typos, change some comments * decode tensors, fix wrong dict key * fix mc * 1 to 0 lol * wip linting * adjust to step logging * adjust logging names * add mflow, rm batch keys * add comments, check for dict in huggingface model update_metric * add user specified logging * move metric_name duplication to update_metric * wip fix testing * fix input shape error * rm init * rm eval_after_all * step=None * step=state.timestamp.batch.value * update name to include step * linting, wip on test * fix test * pyright wip * add non-batch warning * pyright * debug * rm this commit that wasn't the right branch * log at the end of training * rm silly wandb table logging * add run_name * add docstring * add debug logging * more logging * rm info logging * improve comments * Update composer/callbacks/eval_output_logging_callback.py Co-authored-by: Evan Racah <ejracah@gmail.com> * rm logging bool * fix logging for schema tasks * fix schema / mc tasks * yapf * rm reshape * fix tests * cleanup test * pyright * pyright * docstring * pyright * update tests * rm attention mask requirement * Update composer/metrics/nlp.py Co-authored-by: Mihir Patel <mihir.v.patel7@gmail.com> * Update composer/metrics/nlp.py Co-authored-by: Mihir Patel <mihir.v.patel7@gmail.com> * rm todo * lint * lint * lint * more lint --------- Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com> Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: Jeremy Dohmann <jeremy@mosaicml.com> Co-authored-by: Jeremy D <115047575+bmosaicml@users.noreply.github.com> Co-authored-by: Charles Tang <j316chuck@users.noreply.github.com> Co-authored-by: Rishab Parthasarathy <56666587+rishab-partha@users.noreply.github.com> Co-authored-by: Prithvi Kannan <46332835+prithvikannan@users.noreply.github.com> Co-authored-by: Evan Racah <evan@mosaicml.com> Co-authored-by: eracah <ejracah@gmail.com> Co-authored-by: Irene Dea <14367635+irenedea@users.noreply.github.com> Co-authored-by: Daniel King <43149077+dakinggg@users.noreply.github.com> Co-authored-by: nik-mosaic <101217697+nik-mosaic@users.noreply.github.com> Co-authored-by: bandish-shah <86627118+bandish-shah@users.noreply.github.com> Co-authored-by: Bandish Shah <bandish.shah@databricks.com> Co-authored-by: bcui19 <bcui8377@gmail.com> Co-authored-by: Mihir Patel <mihir.v.patel7@gmail.com> Co-authored-by: Your Name <you@example.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Scott Stevenson <scott@stevenson.io> Co-authored-by: furkanbiten <furkanbiten@gmail.com> Co-authored-by: Brian <23239305+b-chu@users.noreply.github.com> Co-authored-by: Vitaliy Chiley <6439018+vchiley@users.noreply.github.com> Co-authored-by: Nicholas Garcia <nicholasgcgarcia@gmail.com> Co-authored-by: Mikhail Kolesov <30723114+m1kol@users.noreply.github.com> Co-authored-by: root <jecdohmann@gmail.com> Co-authored-by: Tessa Barton <tbarton16@gmail.com>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
What does this PR do?
Separate the monkey patch helper functions and the actual monkey patching to make the code more readable.
What issue(s) does this change relate to?
Before submitting
pre-commit
on your change? (see thepre-commit
section of prerequisites)