Fix TE HF checkpoint saving #1280

Merged
merged 13 commits into main from chuck/te_hf_ckpt on Jun 18, 2024

Conversation

@j316chuck (Contributor) commented on Jun 13, 2024

Description

Fixes the HF checkpointer callback for TransformerEngine FP8 saving. This PR ensures we serialize the io.BytesIO extra_state tensors as regular tensors in save_pretrained so the code does not error.
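A minimal sketch of the idea (the helper name and the `_extra_state` key filter are illustrative, not the exact PR diff): TE modules stash their extra state as io.BytesIO buffers, which save_pretrained cannot serialize, so those values are copied into plain uint8 tensors first.

```python
import io

import torch


def convert_te_extra_state(state_dict: dict) -> dict:
    """Replace io.BytesIO extra_state values with plain uint8 tensors."""
    for key, value in state_dict.items():
        if key.endswith('_extra_state') and isinstance(value, io.BytesIO):
            value.seek(0)
            # A bytearray gives torch.frombuffer a writable buffer, avoiding
            # the "non-writable tensor" UserWarning seen in the test log below.
            state_dict[key] = torch.frombuffer(
                bytearray(value.read()), dtype=torch.uint8)
    return state_dict
```

Applied to the model's state dict before save_pretrained, this leaves only serializable tensors in the checkpoint.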

Tests

  • Added a unit test; skipped on A100 GPUs ✔️
  • Added a unit test; manually ran it on an H100 GPU ✅
tests/a_scripts/inference/test_convert_composer_to_hf.py::test_huggingface_conversion_callback[1ba-1ba-1ba-1-1-amp_fp8-full-mpt-True-None]
  /usr/lib/python3/dist-packages/transformer_engine/pytorch/module/base.py:394: UserWarning: The given buffer is not writable, and PyTorch does not support non-writable tensors. This means you can write to the underlying (supposedly non-writable) buffer using the tensor. You may want to copy the buffer to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at ../torch/csrc/utils/tensor_new.cpp:1524.)
    state_serialized = torch.frombuffer(pickle.dumps(state), dtype=torch.uint8)

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
===================================================================================== 25 passed, 11 skipped, 1621 deselected, 266 warnings in 92.95s (0:01:32) ======================================================================================
Waiting up to 30 seconds for all training processes to terminate. Press Ctrl-C to exit immediately.
  • Before: failed-hf-checkpointer-fp8-llama3-8b-metamath-4ep-KOTaOP 🔴
  • After: success-hf-checkpointer-fp8-llama3-8b-metamath-4ep-yxNFTK
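For context, the UserWarning in the pytest log above is torch.frombuffer complaining about the read-only bytes object that pickle.dumps returns. A standalone illustration of the pattern (not the TE source) and of how the pickled state round-trips through a tensor:

```python
import pickle

import torch

state = {'scale_fwd': 1.0}  # stand-in for TE's fp8 extra state

# pickle.dumps returns read-only bytes; copying into a bytearray makes the
# buffer writable, so torch.frombuffer emits no UserWarning.
serialized = torch.frombuffer(bytearray(pickle.dumps(state)), dtype=torch.uint8)

# Round-trip: the original object is recoverable from the saved tensor.
restored = pickle.loads(serialized.numpy().tobytes())
assert restored == state
```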

Issues

Closes https://databricks.atlassian.net/browse/RGENAI-255

@j316chuck j316chuck requested a review from a team as a code owner June 13, 2024 21:12
@j316chuck j316chuck changed the title Add fix for TE HF Ckpt Fix TE HF checkpoint saving Jun 13, 2024
@mvpatel2000 (Collaborator) left a comment

Can I load a TE ckpt into a non-TE model?

LGTM, but also @dakinggg wdyt?

@j316chuck (Contributor, Author) commented on Jun 14, 2024

@mvpatel2000 loading from fp8 and training with bf16 seems to work; see the test run example here: torch-231-bf16-load-from-fp8-bR8NzC.

Curious what the use case is in which you would do that though?

@dakinggg (Collaborator) left a comment

Will review fully once CI passes.

@mvpatel2000 (Collaborator) left a comment

LGTM, but same comment on waiting for CI/CD to pass.

@j316chuck j316chuck requested a review from dakinggg June 18, 2024 03:45
@j316chuck j316chuck merged commit c23be4a into main Jun 18, 2024
10 of 11 checks passed
@dakinggg dakinggg deleted the chuck/te_hf_ckpt branch August 6, 2024 18:41