Add convert_dataset_hf to CLI #1348

Merged
merged 35 commits into from
Jul 16, 2024

Conversation

KuuCi (Contributor) commented Jul 11, 2024

This PR lets users call llmfoundry convert_dataset_hf {ARGS} while matching the behavior of the existing convert_dataset_hf script. The motivation is DLE, where we want the CLI to be much more intuitive in the Docker images.

KuuCi (Contributor, Author) commented Jul 11, 2024

mcli logs test-data-hf-ynEM4L runs:
python data_prep/convert_dataset_hf.py --dataset c4 --data_subset en --out_root my-copy-c4 --splits train_small val_small --concat_tokens 2048 --tokenizer EleutherAI/gpt-neox-20b --eos_text '<|endoftext|>' --compression zstd

mcli logs test-data-hf-253z0P runs:
llmfoundry convert_dataset_hf --dataset c4 --data_subset en --out_root my-copy-c4 --splits train_small,val_small --concat_tokens 2048 --tokenizer EleutherAI/gpt-neox-20b --eos_text '<|endoftext|>' --compression zstd
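
The two invocations above differ mainly in how --splits is passed: the script takes space-separated values, while the CLI takes a single comma-separated string. A minimal sketch of how the two forms could resolve to the same list, assuming an argparse-style parser with nargs="+" on the script side. Both parsers and the parse_splits helper are hypothetical stand-ins for illustration, not llmfoundry's actual code:

```python
import argparse


def parse_splits(value: str) -> list[str]:
    """Split a comma-separated --splits value into a list of split names."""
    return [name.strip() for name in value.split(",") if name.strip()]


def script_style_parser() -> argparse.ArgumentParser:
    # Hypothetical stand-in for the script: splits given as separate tokens.
    parser = argparse.ArgumentParser()
    parser.add_argument("--splits", nargs="+", required=True)
    return parser


def cli_style_parser() -> argparse.ArgumentParser:
    # Hypothetical stand-in for the CLI: splits given as one comma-separated string.
    parser = argparse.ArgumentParser()
    parser.add_argument("--splits", type=parse_splits, required=True)
    return parser


script_args = script_style_parser().parse_args(["--splits", "train_small", "val_small"])
cli_args = cli_style_parser().parse_args(["--splits", "train_small,val_small"])

# Both styles should hand the same list of splits to the conversion code.
assert script_args.splits == cli_args.splits
```

Normalizing at the parsing boundary means the conversion logic itself only ever sees a list of split names, whichever entry point was used.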

@KuuCi KuuCi marked this pull request as ready for review July 11, 2024 23:19
@KuuCi KuuCi requested a review from a team as a code owner July 11, 2024 23:19
@KuuCi KuuCi requested review from irenedea and b-chu July 11, 2024 23:24
@KuuCi KuuCi requested a review from snarayan21 July 12, 2024 21:44
irenedea (Contributor) left a comment

Requesting changes around typer usage and naming

llmfoundry/cli/cli.py: 7 review threads (outdated, resolved)

KuuCi (Contributor, Author) commented Jul 13, 2024

mcli logs test-data-txt-MYSMm1 runs:
python convert_dataset_hf.py \
  --dataset c4 --data_subset en \
  --out_root my-copy-c4 --splits train_small val_small \
  --concat_tokens 2048 --tokenizer EleutherAI/gpt-neox-20b --eos_text '<|endoftext|>' \
  --compression zstd

mcli logs hf-cli-vdGO9f runs:
llmfoundry convert_dataset_hf \
  --dataset c4 --data-subset en \
  --out-root my-copy-c4 --splits train_small,val_small \
  --concat-tokens 2048 --tokenizer EleutherAI/gpt-neox-20b --eos-text '<|endoftext|>' \
  --compression zstd

@KuuCi KuuCi requested a review from irenedea July 13, 2024 03:52
@KuuCi KuuCi requested review from b-chu and dakinggg July 15, 2024 17:46
@KuuCi KuuCi marked this pull request as draft July 16, 2024 00:53

KuuCi (Contributor, Author) commented Jul 16, 2024

mcli logs hf-cli-j8u5rB runs:
llmfoundry data_prep convert_dataset_hf \
  --dataset c4 --data-subset en \
  --out-root my-copy-c4 --splits train_small,val_small \
  --concat-tokens 2048 --tokenizer EleutherAI/gpt-neox-20b --eos-text '<|endoftext|>' \
  --compression zstd

mcli logs hf-orig-TGBbOl runs:
python convert_dataset_hf.py \
  --dataset c4 --data_subset en \
  --out_root my-copy-c4 --splits train_small val_small \
  --concat_tokens 2048 --tokenizer EleutherAI/gpt-neox-20b --eos_text '<|endoftext|>' \
  --compression zstd
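
One way to check that two runs like these produce the same dataset is to compare their output directories file by file. A minimal sketch, assuming both runs write deterministic files under their own output roots. The helper names are hypothetical, and this is not necessarily how the comparison in this PR was done:

```python
import hashlib
from pathlib import Path


def dir_digest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to the SHA-256 of its bytes."""
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            key = path.relative_to(root).as_posix()
            digests[key] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def same_outputs(script_out: Path, cli_out: Path) -> bool:
    """True when both directories hold the same files with identical contents."""
    return dir_digest(script_out) == dir_digest(cli_out)
```

For example, same_outputs(Path("script-run/my-copy-c4"), Path("cli-run/my-copy-c4")) would compare two local copies of the outputs. Those two paths are placeholders. A byte-level check is strict: it fails if any file embeds a timestamp or other run-specific value, in which case a comparison of decoded samples would be the fallback.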

@KuuCi KuuCi marked this pull request as ready for review July 16, 2024 21:36
llmfoundry/cli/cli.py: 1 review thread (outdated, resolved)
@KuuCi KuuCi merged commit e7bf8db into main Jul 16, 2024
9 checks passed
@dakinggg dakinggg deleted the dataprep-convert_dataset_hf-cli branch August 6, 2024 18:41
4 participants