Fix pretraining with iterable/streaming Dataset #556

jphme · 2023-09-11T19:33:47Z

Due to various changes, pretraining_dataset no longer worked. This PR should fix it (I am using it without problems with a streaming dataset; it works for both local and remote datasets).

NanoCode012 · 2023-09-12T14:12:54Z

What would happen if max_steps is set to a really large number? Would it default to only running through the entire dataset? I'm a bit curious how to determine this value.

jphme · 2023-09-12T14:42:04Z

> What would happen if max_steps is set to a really large number? Would it default to only running through the entire dataset? I'm a bit curious how to determine this value.

No, it would run max_steps (and determine the number of epochs accordingly), see https://github.com/huggingface/transformers/blob/6acc27eea853885270dba5313181443d43e31f2c/src/transformers/trainer.py#L1605 .

For IterableDatasets, max_steps is necessary (see https://github.com/huggingface/transformers/blob/6acc27eea853885270dba5313181443d43e31f2c/src/transformers/trainer.py#L574 ).
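To make the behaviour described above concrete, here is a minimal sketch in plain Python. It is a simplified illustration and not the actual transformers Trainer code: the function names `resolve_schedule` and `estimate_max_steps` and all of their parameters are hypothetical. The first function mirrors the rule that an explicit max_steps wins and the epoch count is derived from it, and that a dataset without a length needs max_steps. The second shows one possible way to pick a value from a token budget, assuming every sample is packed to a fixed sequence length.

```python
import math
from typing import Optional, Tuple


def resolve_schedule(
    max_steps: int,
    num_train_epochs: float,
    steps_per_epoch: Optional[int],
) -> Tuple[int, Optional[int]]:
    """Return (total_steps, epochs) for a training run.

    steps_per_epoch is None for an iterable/streaming dataset,
    which has no length. All names here are hypothetical.
    """
    if steps_per_epoch is None:
        # Without a length, the step count cannot be derived from epochs.
        if max_steps <= 0:
            raise ValueError("max_steps must be set for a dataset without a length")
        return max_steps, None
    if max_steps > 0:
        # An explicit max_steps takes priority; epochs follow from it.
        return max_steps, math.ceil(max_steps / steps_per_epoch)
    # Otherwise the step count follows from the requested epochs.
    return math.ceil(num_train_epochs * steps_per_epoch), math.ceil(num_train_epochs)


def estimate_max_steps(
    total_tokens: int,
    sequence_len: int,
    micro_batch_size: int,
    gradient_accumulation_steps: int,
    num_devices: int = 1,
) -> int:
    """Estimate max_steps from a token budget (assumes fixed-length packed samples)."""
    tokens_per_step = (
        sequence_len * micro_batch_size * gradient_accumulation_steps * num_devices
    )
    return max(1, total_tokens // tokens_per_step)
```

For example, under these assumptions `estimate_max_steps(1_000_000_000, 2048, 4, 8)` gives a step count for roughly one pass over a one-billion-token corpus. A much larger max_steps would not stop at the end of the data by itself; what happens once a streaming dataset is exhausted depends on the trainer.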

kostum123 · 2023-09-12T20:19:22Z

Does this PR fix fine-tuning with a raw corpus?
dataset type:
completion: raw corpus
{"text": "..."}

src/axolotl/utils/data.py

winglian

lgtm. thanks!

* return without packing prep/len
* fix remove columns
* fix encode arguments
* add error when max steps not set
* fix test

---------

Co-authored-by: Jan Philipp Harries <jphme@users.noreply.github.com>

jphme added 5 commits September 11, 2023 08:36

return without packing prep/len

b7b55ce

fix remove columns

c92fee4

fix encode arguments

9628dce

add error when max steps not set

906de90

fix test

3ae4178

winglian reviewed Sep 13, 2023

src/axolotl/utils/data.py

winglian approved these changes Sep 13, 2023

winglian merged commit 2f586d1 into axolotl-ai-cloud:main Sep 13, 2023
3 checks passed
