Fix pretraining with iterable/streaming Dataset #556

Merged: 5 commits, Sep 13, 2023

jphme
Contributor

@jphme jphme commented Sep 11, 2023

Due to various changes, pretraining_dataset no longer worked; this PR should fix it. I am using it without problems with a streaming dataset, and it works for both local and remote datasets.

@NanoCode012
Collaborator

What would happen if max_steps is set to a really large number? Would it default to only running through the entire dataset? I'm a bit curious how to determine this value.

@jphme
Contributor Author

jphme commented Sep 12, 2023

> What would happen if max_steps is set to a really large number? Would it default to only running through the entire dataset? I'm a bit curious how to determine this value.

No, it would run for exactly max_steps steps (and derive the number of epochs from that), see https://github.com/huggingface/transformers/blob/6acc27eea853885270dba5313181443d43e31f2c/src/transformers/trainer.py#L1605 .

For IterableDatasets, setting max_steps is required, because the dataset has no known length from which the number of steps could be computed (see https://github.com/huggingface/transformers/blob/6acc27eea853885270dba5313181443d43e31f2c/src/transformers/trainer.py#L574 ).
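To make the two points above concrete, here is a rough, simplified sketch of the step/epoch logic. It is not the actual Trainer code; the helper names has_length and plan_training, their parameters, and the exact rounding are assumptions made for illustration only.

```python
import math
import sys


def has_length(dataset) -> bool:
    """Return True if len(dataset) works (hypothetical helper for this sketch)."""
    try:
        return len(dataset) is not None
    except TypeError:
        # Iterable/streaming datasets typically do not implement __len__.
        return False


def plan_training(dataset, batch_size, max_steps=-1, num_train_epochs=3):
    """Return (total_steps, epochs) for a run; a simplified illustration only."""
    if not has_length(dataset):
        # No length means steps per epoch cannot be computed,
        # so an explicit step budget is the only stopping criterion.
        if max_steps <= 0:
            raise ValueError("dataset has no length, so max_steps must be set")
        # Epochs are effectively unbounded; training stops at max_steps.
        return max_steps, sys.maxsize

    steps_per_epoch = max(math.ceil(len(dataset) / batch_size), 1)
    if max_steps > 0:
        # max_steps wins: run that many steps, looping over the data as needed.
        return max_steps, math.ceil(max_steps / steps_per_epoch)
    return steps_per_epoch * num_train_epochs, num_train_epochs


# A sized dataset with a very large max_steps: many passes over the data.
print(plan_training(list(range(100)), batch_size=10, max_steps=1000))

# A streaming-style iterator without __len__: max_steps is mandatory.
print(plan_training(iter(range(100)), batch_size=10, max_steps=50))
```

So with a sized dataset, a very large max_steps simply means more passes over the data rather than stopping at the end of the dataset; with a streaming dataset there is no "end" known up front, so the step budget has to be chosen by the user.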

@kostum123

Does this PR fix fine-tuning with a raw corpus?
dataset type:
completion: raw corpus
{"text": "..."}

Collaborator

@winglian winglian left a comment

lgtm. thanks!

@winglian winglian merged commit 2f586d1 into axolotl-ai-cloud:main Sep 13, 2023
3 checks passed
mkeoliya pushed a commit to mkeoliya/axolotl that referenced this pull request Dec 15, 2023
* return without packing prep/len

* fix remove columns

* fix encode arguments

* add error when max steps not set

* fix test

---------

Co-authored-by: Jan Philipp Harries <jphme@users.noreply.github.com>
4 participants