
Adding more token encoding types #1254

Merged
merged 16 commits into from
Jun 6, 2024
Conversation

snarayan21
Contributor

@snarayan21 snarayan21 commented Jun 5, 2024

Depending on the vocab size, users can encode their token IDs using various int formats. Previously, we only allowed int64, which supports an absurdly large vocab size. Enabling tokens to be encoded/decoded as uint32 or uint16, for example, lets people save space on their datasets, since the max vocab sizes supported would be ~4.3 billion with uint32 or ~65k with uint16. This has been added to both the text and finetuning datasets.
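As a rough illustration (not code from the PR), the per-token storage cost and the largest representable token ID for each of these dtypes can be checked with numpy:

```python
import numpy as np

# Bytes per token and largest representable token ID for each encoding.
# int64 was previously the only supported option.
for dtype in (np.uint16, np.uint32, np.int64):
    dt = np.dtype(dtype)
    info = np.iinfo(dt)
    print(f'{dt.name}: {dt.itemsize} bytes/token, max token ID {info.max:,}')
```

Switching a dataset from int64 to uint16 cuts raw token storage by 4x (8 bytes/token down to 2), so long as every token ID fits below 65,536.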

This PR also lets users specify their MDS dataset columns using ndarray types, enabling samples to be automatically encoded/decoded. This was already present for the finetuning dataset, so the functionality has been added to the generic text dataset as well. Accordingly, I've changed the default value in our MDS conversion scripts to use ndarray:uint32 instead of bytes and made the relevant changes to get this working.
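For context, a minimal numpy sketch (not from the PR; token values are made up) of why typed ndarray columns help: with a raw bytes column, the reader has to decode the buffer itself and must know the exact dtype used at write time, while an ndarray column carries the dtype with the sample.

```python
import numpy as np

# With a `bytes` token column, the dtype is implicit in the raw buffer.
tokens = np.array([15496, 995, 11, 290], dtype=np.uint32)  # made-up token IDs
raw = tokens.tobytes()  # 4 tokens * 4 bytes = 16 bytes

# A correct decode requires matching the write-time dtype exactly.
assert np.array_equal(np.frombuffer(raw, dtype=np.uint32), tokens)

# Decoding the same bytes as int64 silently yields 2 garbage "tokens".
wrong = np.frombuffer(raw, dtype=np.int64)
assert wrong.shape == (2,) and not np.array_equal(wrong, tokens[:2])
```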

Added unit tests checking that this works for the text and finetuning datasets and that an error is thrown for incompatible encoding types.
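A hypothetical sketch of what such a validation check can look like (the function name and the supported set here are illustrative, not the actual llm-foundry identifiers):

```python
import numpy as np

# Illustrative validation: accept only integer encodings wide enough for
# token IDs; reject anything else with a clear error.
SUPPORTED_ENCODINGS = {'int64', 'int32', 'uint32', 'uint16'}

def resolve_token_encoding(encoding: str) -> np.dtype:
    if encoding not in SUPPORTED_ENCODINGS:
        raise ValueError(f'Unsupported token encoding: {encoding}')
    return np.dtype(encoding)

assert resolve_token_encoding('uint16') == np.uint16
try:
    resolve_token_encoding('float32')
    raise AssertionError('expected ValueError for float32')
except ValueError:
    pass  # incompatible encodings are rejected as intended
```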

Moved a util function applicable to both the text and finetuning dataloaders to a common location for import; it had previously been written twice with the same functionality.

Ran the following scripts successfully: convert_dataset_hf.py, convert_dataset_json.py, convert_finetuning_dataset.py, convert_text_to_mds.py. Updated data prep readme to have instructions for convert_text_to_mds.py.

Using the shakespeare text file here, models trained with/without this branch have deterministic loss curves. One set of runs was with global batch size 32, the other with global batch size 256. See wandb project here.

Foundry regression tests are partially borked right now because of a small bug that's getting addressed in the release branch, but the tests that did run all succeeded. See here.

@snarayan21 snarayan21 requested a review from a team as a code owner June 5, 2024 17:36
Contributor

@codestar12 codestar12 left a comment


Looks good other than the default value

Member

@XiaohanZhangCMU XiaohanZhangCMU left a comment


LGTM. Suggest keeping the previous default for now.

Contributor

@codestar12 codestar12 left a comment


looks good

llmfoundry/data/data.py Show resolved Hide resolved
llmfoundry/data/finetuning/tasks.py Outdated Show resolved Hide resolved
llmfoundry/data/finetuning/tasks.py Outdated Show resolved Hide resolved
llmfoundry/data/text_data.py Outdated Show resolved Hide resolved
llmfoundry/data/text_data.py Outdated Show resolved Hide resolved
tests/data/test_data_encodings.py Outdated Show resolved Hide resolved
tests/data/test_data_encodings.py Show resolved Hide resolved
@snarayan21 snarayan21 requested a review from dakinggg June 6, 2024 02:55
Collaborator

@dakinggg dakinggg left a comment


lgtm, I think it would be good to manually test at least one of the scripts? maybe do a training run using the new version of convert_text_to_mds and make sure it still trains? and run the regression test suite to make sure existing datasets didn't get broken.

llmfoundry/data/finetuning/tasks.py Outdated Show resolved Hide resolved
llmfoundry/data/text_data.py Outdated Show resolved Hide resolved
llmfoundry/data/text_data.py Show resolved Hide resolved
scripts/data_prep/convert_dataset_json.py Outdated Show resolved Hide resolved
tests/a_scripts/data_prep/test_convert_text_to_mds.py Outdated Show resolved Hide resolved
@snarayan21 snarayan21 requested a review from dakinggg June 6, 2024 18:51
Collaborator

@dakinggg dakinggg left a comment


Thanks @snarayan21! And thanks for leaving the documentation better than you found it :)

@snarayan21 snarayan21 merged commit 42c2d9a into main Jun 6, 2024
9 checks passed
KuuCi pushed a commit that referenced this pull request Jun 7, 2024
* add more token encoding types

* add more token encoding types

* add tests

* add tests

* ft support, tests

* linting is shortening my lifespan

* linting is shortening my lifespan

* long tensor

* long tensor

* long tensor

* feedbacc

* import

* import

---------

Co-authored-by: Daniel King <43149077+dakinggg@users.noreply.github.com>
(cherry picked from commit 42c2d9a)
4 participants