
Add profiler support in llm foundry #678

Merged: 17 commits, Oct 18, 2023

@j316chuck j316chuck commented Oct 17, 2023

Description

Add profiler support for llm foundry

Along for the ride:

Adding a YAML to support training MPT models in CPU mode. This is useful because you don't have to spin up an interactive session, wait for it to download, set it up, and fetch your data, and it lets you test CPU-only features quickly on a small model. Only con is no GPU 😛

Tests

composer train/train.py \
  train/yamls/pretrain/mpt-small-cpu.yaml \
  data_local=my-copy-c4 \
  train_loader.dataset.split=train_small \
  eval_loader.dataset.split=val_small \
  max_duration=10ba \
  eval_interval=0 \
  save_folder=mpt-125m

Produces the Chrome traces: composer_traces/ep0-ba6-rank0.json. Example:

 {"ph": "X", "cat": "python_function", "name": "queue.py(213): _put", "pid": 0, "tid": 1981795, "ts": 1697505580241206, "dur": 0, "args": {"Ev Idx": 2178228, "Python id": 1145107, "Python parent id": 1145104}},
{"ph": "X", "cat": "python_function", "name": "<built-in method append of collections.deque object at 0x2c5582890>", "pid": 0, "tid": 1981795, "ts": 1697505580241206, "dur": 0, "args": {"Ev Idx": 2178229, "Python id": 1145108, "Python parent id": 1145107}},
{"ph": "X", "cat": "python_function", "name": "threading.py(359): notify", "pid": 0, "tid": 1981795, "ts": 1697505580241207, "dur": 0, "args": {"Ev Idx": 2178230, "Python id": 1145109, "Python parent id": 1145104}},
{"ph": "X", "cat": "python_function", "name": "threading.py(279): _is_owned", "pid": 0, "tid": 1981795, "ts": 1697505580241207, "dur": 0, "args": {"Ev Idx": 2178231, "Python id": 1145110, "Python parent id": 1145109}},
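The events in this excerpt carry "Python id" and "Python parent id" fields, which look like parent pointers for the Python call stack. A minimal sketch of rebuilding the call tree from them, assuming that interpretation holds (the helper names `build_children` and `render` are hypothetical, not part of Composer or LLM Foundry):

```python
from collections import defaultdict

# Events copied from the excerpt above, trimmed to the fields used here.
events = [
    {"name": "queue.py(213): _put",
     "args": {"Python id": 1145107, "Python parent id": 1145104}},
    {"name": "<built-in method append of collections.deque object at 0x2c5582890>",
     "args": {"Python id": 1145108, "Python parent id": 1145107}},
    {"name": "threading.py(359): notify",
     "args": {"Python id": 1145109, "Python parent id": 1145104}},
    {"name": "threading.py(279): _is_owned",
     "args": {"Python id": 1145110, "Python parent id": 1145109}},
]


def build_children(events):
    """Group events by parent id (assumes 'Python parent id' is a parent pointer)."""
    children = defaultdict(list)
    for ev in events:
        args = ev.get("args", {})
        if "Python id" in args:
            children[args.get("Python parent id")].append(ev)
    return children


def render(children, parent_id, depth=0):
    """Return indented event names for the subtree rooted under parent_id."""
    lines = []
    for ev in children.get(parent_id, []):
        lines.append("  " * depth + ev["name"])
        lines.extend(render(children, ev["args"]["Python id"], depth + 1))
    return lines


children = build_children(events)
known_ids = {ev["args"]["Python id"] for ev in events}
# Roots are parent ids whose own event is not present in this excerpt.
roots = [pid for pid in children if pid not in known_ids]

tree_lines = []
for root in roots:
    tree_lines.extend(render(children, root))
print("\n".join(tree_lines))
```

On a full trace the recursion could get deep, so an iterative walk may be a safer choice there.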

Produces the PyTorch traces: torch_traces/rank0.6.pt.trace.json. Example:

  {
    "ph": "X", "cat": "python_function", "name": "/Users/chuck.tang/composer/composer/utils/string_enum.py(69): __eq__", "pid": 54200, "tid": 1981795,
    "ts": 1697505581229305, "dur": 0,
    "args": {
      "Ev Idx": 2233416, "Python id": 1200295, "Python parent id": 1200286
    }
  },
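Both trace files use the Chrome Trace Event Format, so they can also be summarized offline without a viewer. A rough sketch, assuming each file is valid JSON and is either a bare array of events or an object with a "traceEvents" key (the function names `load_events` and `total_duration_by_name` are hypothetical helpers, not part of this PR):

```python
import json
from collections import Counter


def load_events(path):
    """Return the event list from a Chrome trace file.

    Handles the two layouts of the Trace Event Format: a bare JSON array
    of events, or a JSON object holding the array under "traceEvents".
    """
    with open(path) as f:
        data = json.load(f)
    return data["traceEvents"] if isinstance(data, dict) else data


def total_duration_by_name(events, category=None):
    """Sum "dur" (microseconds) per event name over complete ("X") events."""
    totals = Counter()
    for ev in events:
        if ev.get("ph") != "X":
            continue  # skip metadata, instant, and other non-duration events
        if category is not None and ev.get("cat") != category:
            continue
        totals[ev.get("name", "<unnamed>")] += ev.get("dur", 0)
    return totals
```

For example, `total_duration_by_name(load_events("torch_traces/rank0.6.pt.trace.json"), category="python_function").most_common(10)` would list the ten Python functions with the largest summed duration. Nested calls are counted in both parent and child, so the totals are inclusive times.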

Useful for profiling memory and time usage

[Video attachment: Screen.Recording.2023-10-16.at.8.59.10.PM.mov]

Perfetto View:
[Image: Screenshot 2023-10-16 at 9 35 07 PM]

S3:

loggers:
  s3: {bucket_uri: s3://mosaicml-internal-checkpoints-shared/ }

aws s3 cp --recursive s3://mosaicml-internal-checkpoints-shared/chuck/mpt_causal_lm_cpu/traces/

Full training run:
mpt-7b-gpu-8-chinchilla-light-profile-ynjNZ2, mpt-7b-gpu-8-chinchilla-full-profile-uJSCOF, mpt-7b-gpu-8-chinchilla-none-profile-Cwm3GA

@j316chuck j316chuck enabled auto-merge (squash) October 17, 2023 19:24
@j316chuck j316chuck merged commit 92bd673 into main Oct 18, 2023
12 checks passed
@dakinggg dakinggg deleted the chuck/add_profiler_flags branch November 17, 2023 06:08

3 participants