add support for s3 checkpoint #220

Merged
9 commits merged into from
Sep 9, 2024

Contributor

@eliebak eliebak commented Aug 22, 2024

Goal

Allow uploading checkpoints to S3 during training, and resuming from a checkpoint stored on S3.

How

Add an S3UploadArgs class, defined like this in the config:

s3_upload:
  remove_after_upload: true
  s5cmd_concurrency: 5
  s5cmd_numworkers: 16
  s5cmd_path: path_to_s5cmd 
  upload_s3_path: path_to_s3
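
For readers who want to see how these keys could map onto a config class, here is a minimal sketch. It is not the code from this PR: the field names come from the config above, but the types, the optional defaults, and the `s3://` prefix check in `__post_init__` are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class S3UploadArgs:
    """Sketch of the upload settings; types and defaults are assumed."""

    upload_s3_path: str  # destination prefix, assumed to look like "s3://bucket/prefix"
    remove_after_upload: bool  # delete the local checkpoint once it is uploaded
    s5cmd_numworkers: Optional[int] = None  # size of the s5cmd worker pool
    s5cmd_concurrency: Optional[int] = None  # parts transferred concurrently per file
    s5cmd_path: Optional[str] = None  # path to the s5cmd binary

    def __post_init__(self):
        # Hypothetical validation, not confirmed by the PR.
        if not self.upload_s3_path.startswith("s3://"):
            raise ValueError(f"expected an s3:// path, got {self.upload_s3_path!r}")


# Example values mirroring the config above (bucket name is made up).
args = S3UploadArgs(
    upload_s3_path="s3://my-bucket/checkpoints",
    remove_after_upload=True,
    s5cmd_numworkers=16,
    s5cmd_concurrency=5,
)
```

Failing early on a malformed destination is a design choice of this sketch; the actual class may validate differently or not at all.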

Add support for an S3 path in resume_checkpoint_path: the checkpoint is copied (with s5cmd) to checkpoints_path. This is done in the pre_init phase by the parse_ckpt_path(config=self.config, parallel_context=self.parallel_context) function.
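
The copy step described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the body of parse_ckpt_path: `is_s3_path` and `build_s5cmd_copy_command` are hypothetical helper names, and the sketch only builds the s5cmd command line (using s5cmd's global `--numworkers` option and the `cp` command's `--concurrency` flag) without running it.

```python
from typing import List, Optional


def is_s3_path(path: str) -> bool:
    """Return True if the path points at S3 rather than the local filesystem."""
    return path.startswith("s3://")


def build_s5cmd_copy_command(
    s3_path: str,
    local_dir: str,
    s5cmd_path: str = "s5cmd",
    numworkers: Optional[int] = None,
    concurrency: Optional[int] = None,
) -> List[str]:
    """Build an s5cmd command that copies everything under s3_path to local_dir."""
    if not is_s3_path(s3_path):
        raise ValueError(f"not an S3 path: {s3_path!r}")
    cmd = [s5cmd_path]
    if numworkers is not None:
        # Global s5cmd option, placed before the subcommand.
        cmd += ["--numworkers", str(numworkers)]
    cmd.append("cp")
    if concurrency is not None:
        # Option of the cp subcommand, placed after it.
        cmd += ["--concurrency", str(concurrency)]
    # The trailing wildcard selects every object under the prefix.
    cmd += [s3_path.rstrip("/") + "/*", local_dir.rstrip("/") + "/"]
    return cmd


# Example with made-up paths.
cmd = build_s5cmd_copy_command(
    "s3://my-bucket/run/10", "checkpoints/10", numworkers=16, concurrency=5
)
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. In a multi-process training job the copy would presumably need to be coordinated across ranks, which this sketch does not attempt.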

@eliebak eliebak changed the title [WIP] add support for s3 checkpoint: need a fix on check_path_is_local [WIP] add support for s3 checkpoint Aug 22, 2024
@eliebak eliebak changed the title [WIP] add support for s3 checkpoint add support for s3 checkpoint Aug 23, 2024
@eliebak eliebak marked this pull request as draft August 30, 2024 03:43
@3outeille 3outeille self-assigned this Aug 30, 2024
@eliebak eliebak marked this pull request as ready for review September 3, 2024 06:25
if after != orig_vocab_size:
    print("i'm in")
Member

cleanup comment

@@ -3,6 +3,7 @@
import os
import random
import socket
import re
Member

@3outeille 3outeille Sep 9, 2024

remove as it seems to be unused

Member

3outeille commented Sep 9, 2024

Please run pre-commit so that it automatically fixes everything related to coding style:

pip install pre-commit
pre-commit install

In any case, good job!

@xrsrke xrsrke closed this pull request by merging all changes into huggingface:main in 38d64fb Sep 9, 2024
3 participants