Add convert_dataset_json to CLI #1349

Merged
merged 19 commits into from
Jul 18, 2024

Conversation

@KuuCi KuuCi commented Jul 11, 2024

This PR lets users call llmfoundry convert_dataset_json {ARGS} while producing the same results as the existing convert_dataset_json script. The motivation is DLE, where we want the CLI in the Docker images to be much more intuitive.
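As a rough illustration of the idea (not the code in this PR), a subcommand can stay consistent with a standalone script by having both call one shared function. The sketch below uses the standard library's argparse as a stand-in. The function body, the subset of flags, and the returned dict are hypothetical and exist only to make the example runnable.

```python
import argparse
from typing import Optional, Sequence


def convert_dataset_json(
    path: str,
    out_root: str,
    split: str = "train",
    concat_tokens: Optional[int] = None,
) -> dict:
    """Hypothetical stand-in for the shared conversion entry point."""
    # A real implementation would tokenize the input and write shards.
    # This placeholder only echoes its arguments back.
    return {
        "path": path,
        "out_root": out_root,
        "split": split,
        "concat_tokens": concat_tokens,
    }


def build_parser() -> argparse.ArgumentParser:
    # Top-level program with one subcommand named after the script.
    parser = argparse.ArgumentParser(prog="llmfoundry")
    sub = parser.add_subparsers(dest="command", required=True)
    cmd = sub.add_parser("convert_dataset_json")
    cmd.add_argument("--path", required=True)
    cmd.add_argument("--out_root", required=True)
    cmd.add_argument("--split", default="train")
    cmd.add_argument("--concat_tokens", type=int, default=None)
    return parser


def main(argv: Optional[Sequence[str]] = None) -> dict:
    # The CLI only parses arguments, then delegates to the shared function,
    # so the script and the subcommand cannot drift apart.
    args = build_parser().parse_args(argv)
    return convert_dataset_json(
        args.path, args.out_root, args.split, args.concat_tokens
    )
```

Because the parsing layer is thin, any behavioral difference between the two entry points would have to come from argument handling, which is the part worth testing.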

KuuCi commented Jul 12, 2024

mcli logs test-data-json-6Ti6eD runs:
python data_prep/convert_dataset_json.py \
  --path data_oogabooga.jsonl \
  --out_root my-copy-dolly --split train \
  --concat_tokens 2048 --tokenizer EleutherAI/gpt-neox-20b --eos_text '<|endoftext|>' \
  --compression zstd

mcli logs test-data-json-NnIuip runs:
llmfoundry convert_dataset_json \
  --path data_oogabooga.jsonl \
  --out_root my-copy-dolly --split train \
  --concat_tokens 2048 --tokenizer EleutherAI/gpt-neox-20b --eos_text '<|endoftext|>' \
  --compression zstd

@KuuCi KuuCi marked this pull request as ready for review July 12, 2024 01:49
@KuuCi KuuCi requested a review from a team as a code owner July 12, 2024 01:49
@KuuCi KuuCi requested review from irenedea and b-chu July 12, 2024 02:11
KuuCi commented Jul 13, 2024

mcli logs json-orig-00agCS runs:
python convert_dataset_json.py \
  --path ./example_data/arxiv.jsonl \
  --out_root my-copy-arxiv --split train \
  --concat_tokens 2048 --tokenizer EleutherAI/gpt-neox-20b --eos_text '<|endoftext|>' \
  --compression zstd

mcli logs json-cli-ekVbHZ runs:
cd llm-foundry/scripts/data_prep
llmfoundry convert_dataset_json \
  --path ./example_data/arxiv.jsonl \
  --out-root my-copy-arxiv --split train \
  --concat-tokens 2048 --tokenizer EleutherAI/gpt-neox-20b --eos-text '<|endoftext|>' \
  --compression zstd

@KuuCi KuuCi requested a review from dakinggg July 15, 2024 17:57
@KuuCi KuuCi marked this pull request as draft July 16, 2024 00:54
@KuuCi KuuCi marked this pull request as ready for review July 16, 2024 23:15
KuuCi commented Jul 16, 2024

Test cases showing it works:

json-orig-q58J0w runs:
python convert_dataset_json.py \
  --path ./example_data/arxiv.jsonl \
  --out_root my-copy-arxiv --split train \
  --concat_tokens 2048 --tokenizer EleutherAI/gpt-neox-20b --eos_text '<|endoftext|>' \
  --compression zstd

json-cli-k2wpwr runs:
llmfoundry data_prep convert_dataset_json \
  --path ./example_data/arxiv.jsonl \
  --out-root my-copy-arxiv --split train \
  --concat-tokens 2048 --tokenizer EleutherAI/gpt-neox-20b --eos-text '<|endoftext|>' \
  --compression zstd

@KuuCi KuuCi enabled auto-merge (squash) July 18, 2024 00:14
@KuuCi KuuCi merged commit 6f87962 into main Jul 18, 2024
9 checks passed
@dakinggg dakinggg deleted the dataprep-convert_dataset_json-cli branch August 6, 2024 18:40