Preprocess dataset size fix #1131

winglian · 2024-01-17T00:54:27Z

Description

This PR fixes the excessive disk usage when preprocessing datasets, most of which occurs during the tokenization map step. Keeping this step entirely in memory eliminates the additional disk requirement, which could be as much as 10x. Additionally, running -m axolotl.cli.preprocess would sometimes rely on the cache; since preprocessing is an explicit step, the expectation is that it ignores the cache in most cases.
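A minimal sketch of the in-memory map idea, assuming the dataset is a Hugging Face datasets.Dataset (or anything with a compatible map method). The helper name tokenize_in_memory is hypothetical and is not taken from this PR; keep_in_memory is an existing parameter of datasets.Dataset.map.

```python
from typing import Any, Callable, Dict


def tokenize_in_memory(
    dataset: Any,
    tokenize_fn: Callable[[Dict[str, Any]], Dict[str, Any]],
) -> Any:
    """Hypothetical helper: run the tokenization map without an on-disk cache file."""
    # keep_in_memory=True asks `datasets` to hold the mapped result in RAM
    # instead of writing it to a cache file on disk.
    return dataset.map(tokenize_fn, keep_in_memory=True)
```

With a real dataset this would be called as tokenize_in_memory(ds, fn). The trade-off is that the tokenized data has to fit in RAM during the map step.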

winglian · 2024-01-17T00:56:37Z

Also, when running training, make sure dataset_prepared_path is set.

NanoCode012

Does this mean that cli.preprocess now loads from cache?

NanoCode012 · 2024-01-17T03:27:10Z

> Also, when running training, make sure dataset_prepared_path is set.

Do you mean when loading from cache, or do we need it all the time now?

winglian · 2024-01-17T14:58:15Z

> Does this mean that cli.preprocess now loads from cache?

The opposite: preprocess always skips the cache and overwrites it.

winglian · 2024-01-17T14:59:23Z

> > Also, when running training, make sure dataset_prepared_path is set.
>
> Do you mean when loading from cache, or do we need it all the time now?

It's more of a reminder to make sure you set dataset_prepared_path for training when you've preprocessed the data.
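The cache behavior discussed above can be sketched with the standard library alone. This is an illustration under stated assumptions, not axolotl's actual loader: the function load_or_prepare, the is_preprocess flag, and the tokenized.json file name are all hypothetical. Only the dataset_prepared_path name comes from the thread.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, List, Optional


def load_or_prepare(
    rows: List[str],
    tokenize: Callable[[str], List[int]],
    dataset_prepared_path: Optional[str],
    is_preprocess: bool,
) -> List[List[int]]:
    """Hypothetical helper: preprocess overwrites the cache, training reuses it."""
    cache_file = (
        Path(dataset_prepared_path) / "tokenized.json" if dataset_prepared_path else None
    )

    # Training run: reuse the prepared data when it exists.
    if cache_file is not None and cache_file.exists() and not is_preprocess:
        return json.loads(cache_file.read_text())

    # Preprocess run (or cache miss): tokenize entirely in memory.
    tokenized = [tokenize(row) for row in rows]

    # Write the result, overwriting any previously prepared data.
    if cache_file is not None:
        cache_file.parent.mkdir(parents=True, exist_ok=True)
        cache_file.write_text(json.dumps(tokenized))
    return tokenized


with tempfile.TemporaryDirectory() as tmp:
    rows = ["ab", "c"]
    # First preprocess run populates the prepared path.
    first = load_or_prepare(rows, lambda s: [ord(ch) for ch in s], tmp, is_preprocess=True)
    # A second preprocess run skips the cache and overwrites it.
    second = load_or_prepare(rows, lambda s: [len(s)], tmp, is_preprocess=True)
    # A training run pointed at the same path loads the prepared data.
    trained = load_or_prepare(rows, lambda s: [], tmp, is_preprocess=False)
```

If dataset_prepared_path is left unset in this sketch, the training call has nothing to load and tokenizes again, which is why the reminder matters after an explicit preprocess run.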

winglian added 3 commits January 16, 2024 19:42

overwrite cache on preprocess step

feab595

don't cache the TokenizedPromptDataset at all

d3caa18

load_from_cache_file no longer needed

0e60e0c

winglian requested review from casper-hansen and NanoCode012 January 17, 2024 00:54

NanoCode012 reviewed Jan 17, 2024

NanoCode012 approved these changes Jan 17, 2024

winglian merged commit 7570446 into main Jan 17, 2024
7 checks passed

winglian deleted the preprocess-dataset-size-fix branch January 17, 2024 16:02
