Clarity on training data for each of the codegen versions #76

Open
sundar7D0 opened this issue Aug 5, 2023 · 0 comments
In the "lessons learnt from codegen2" paper, it's discussed that data mix of pile and thestarcoder data is a better choice to undertake if enough compute is available, but it's not clear if codegen2 or codegen2.5 (base models not instruct models) were trained with natural language data like ThePile etc. Is there any small model <=7B which is trained on both ThePile and TheStarCoder data?
