Add convert_text_to_mds to CLI #1352

Merged
KuuCi merged 23 commits into main from dataprep-convert_text_to_mds-cli on Jul 18, 2024

KuuCi (Contributor) commented Jul 12, 2024

This PR lets users call `llmfoundry convert_text_to_mds {ARGS}` while keeping behavior consistent with the existing convert_text_to_mds script. The motivation is DLE, where we want the CLI in the Docker images to be much more intuitive.
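A minimal sketch of the pattern described above, not the PR's actual code: the conversion logic lives in one importable function, and both the standalone script and a CLI command delegate to it, so the two entry points cannot drift apart. It uses the standard-library `argparse` only to stay self-contained. The function body, the `build_parser` and `main` helpers, and the returned dictionary are hypothetical placeholders; only the flag names come from the commands in this thread.

```python
import argparse


def convert_text_to_mds(output_folder, input_folder, concat_tokens, tokenizer, compression=None):
    """Hypothetical stand-in for the shared conversion routine.

    A real implementation would tokenize the input text and write MDS shards.
    This placeholder only returns the resolved settings so the sketch runs.
    """
    return {
        "output_folder": output_folder,
        "input_folder": input_folder,
        "concat_tokens": concat_tokens,
        "tokenizer": tokenizer,
        "compression": compression,
    }


def build_parser(prog):
    """Build one parser that any entry point can reuse."""
    parser = argparse.ArgumentParser(prog=prog)
    parser.add_argument("--output_folder", required=True)
    parser.add_argument("--input_folder", required=True)
    parser.add_argument("--concat_tokens", type=int, required=True)
    parser.add_argument("--tokenizer", required=True)
    parser.add_argument("--compression", default=None)
    return parser


def main(argv=None, prog="convert_text_to_mds.py"):
    """Parse arguments and delegate to the shared function."""
    args = build_parser(prog).parse_args(argv)
    return convert_text_to_mds(**vars(args))
```

Because the script and the CLI would both call `main` (or the shared function directly), a change to the conversion logic only has to be made in one place.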

KuuCi (Contributor, Author) commented Jul 12, 2024

test-data-txt-LJgtNP runs:

    llmfoundry convert_text_to_mds \
      --output_folder my-copy-shakespeare \
      --input_folder shakespeare \
      --concat_tokens 2048 --tokenizer EleutherAI/gpt-neox-20b \
      --compression zstd

test-data-txt-du60oE runs:

    python data_prep/convert_text_to_mds.py \
      --output_folder my-copy-shakespeare \
      --input_folder shakespeare \
      --concat_tokens 2048 --tokenizer EleutherAI/gpt-neox-20b \
      --compression zstd

image: mosaicml/llm-foundry:2.3.1_cu121-latest
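The thread does not say how the outputs of the two runs were compared. One possible check, sketched here under the assumption that both entry points should produce byte-identical files, is to hash every file in each output folder and report any path that is missing or differs. The helper names `hash_tree` and `diff_trees` are hypothetical.

```python
import hashlib
from pathlib import Path


def hash_tree(root):
    """Map each file's path (relative to root) to the SHA-256 of its bytes."""
    root = Path(root)
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            relative = path.relative_to(root).as_posix()
            digests[relative] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def diff_trees(left, right):
    """Return relative paths that are missing on one side or differ in content."""
    a, b = hash_tree(left), hash_tree(right)
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))
```

An empty list from `diff_trees` would mean the two folders match file for file. If the output format embeds run-specific metadata, a byte-level comparison like this would be too strict and a format-aware comparison would be needed instead.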

KuuCi marked this pull request as ready for review July 12, 2024 02:10
KuuCi requested a review from a team as a code owner July 12, 2024 02:10
KuuCi requested review from irenedea and b-chu July 12, 2024 02:10
KuuCi (Contributor, Author) commented Jul 13, 2024

txt-cli-GVwCPT runs:

    llmfoundry convert_text_to_mds \
      --output-folder my-copy-shakespeare \
      --input-folder shakespeare \
      --concat-tokens 2048 --tokenizer EleutherAI/gpt-neox-20b \
      --compression zstd

txt-orig-Zagm3I runs:

    python convert_text_to_mds.py \
      --output_folder my-copy-shakespeare \
      --input_folder shakespeare \
      --concat_tokens 2048 --tokenizer EleutherAI/gpt-neox-20b \
      --compression zstd

[image]

KuuCi requested a review from dakinggg July 15, 2024 17:57
KuuCi marked this pull request as draft July 16, 2024 00:54
KuuCi (Contributor, Author) commented Jul 17, 2024

txt-cli-Y2HXxb runs:

    llmfoundry data_prep convert_text_to_mds \
      --output-folder my-copy-shakespeare \
      --input-folder shakespeare \
      --concat-tokens 2048 --tokenizer EleutherAI/gpt-neox-20b \
      --compression zstd

txt-orig-tFXNED runs:

    python convert_text_to_mds.py \
      --output_folder my-copy-shakespeare \
      --input_folder shakespeare \
      --concat_tokens 2048 --tokenizer EleutherAI/gpt-neox-20b \
      --compression zstd
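In this round the CLI command is nested under a `data_prep` group and takes hyphenated flags, while the original script keeps underscores. The sketch below shows how such a nested command could be laid out with the standard-library `argparse`. It is an illustration only: the actual CLI may be built with a different library, and `build_cli` is a hypothetical helper.

```python
import argparse


def build_cli():
    """Build a parser shaped like `llmfoundry data_prep convert_text_to_mds ...`."""
    cli = argparse.ArgumentParser(prog="llmfoundry")
    groups = cli.add_subparsers(dest="group", required=True)

    # First level: the command group.
    data_prep = groups.add_parser("data_prep")
    commands = data_prep.add_subparsers(dest="command", required=True)

    # Second level: the command itself, with hyphenated flags.
    # argparse stores `--output-folder` under the attribute `output_folder`.
    convert = commands.add_parser("convert_text_to_mds")
    convert.add_argument("--output-folder", required=True)
    convert.add_argument("--input-folder", required=True)
    convert.add_argument("--concat-tokens", type=int, required=True)
    convert.add_argument("--tokenizer", required=True)
    convert.add_argument("--compression", default=None)
    return cli
```

Since the hyphenated flags resolve to underscore attribute names, the parsed values can be passed to a function whose parameters match the original script's underscore-style arguments.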

[image]

KuuCi marked this pull request as ready for review July 17, 2024 21:21
llmfoundry/command_utils/__init__.py (outdated, resolved)
llmfoundry/command_utils/__init__.py (outdated, resolved)
scripts/data_prep/convert_text_to_mds.py (outdated, resolved)
KuuCi requested a review from dakinggg July 18, 2024 00:31
KuuCi enabled auto-merge (squash) July 18, 2024 00:37
KuuCi merged commit 59b9c2a into main Jul 18, 2024
9 checks passed
dakinggg deleted the dataprep-convert_text_to_mds-cli branch August 6, 2024 18:40

2 participants