LLaMA version in MiniLLM #218
The distilled models and the experiments in our paper are based on LLaMA-1.
Would it be easy to use this repo to apply KD to LLaMA-2?
Yes. We have implemented model parallelism and SFT for LLaMA-2. The KD scripts are easy to adapt from LLaMA-1.
It seems that there is no need to modify the source code to adapt to LLaMA-2; simply changing the script is enough?
Exactly.
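For reference, a minimal sketch of what "just changing the script" can amount to, assuming Hugging Face-style checkpoints; the paths below are hypothetical and are not the repo's actual script arguments:

```python
# Sketch: LLaMA-1 and LLaMA-2 load through the same causal-LM class in
# Hugging Face Transformers, so the distillation code itself stays unchanged;
# only the checkpoint/tokenizer paths passed to the script differ.
# All paths are hypothetical placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

TEACHER_PATH = "/path/to/llama2-13b-sft"   # was e.g. /path/to/llama1-13b-sft
STUDENT_PATH = "/path/to/llama2-7b-sft"    # was e.g. /path/to/llama1-7b-sft

teacher = AutoModelForCausalLM.from_pretrained(TEACHER_PATH, torch_dtype="auto")
student = AutoModelForCausalLM.from_pretrained(STUDENT_PATH, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(STUDENT_PATH)

# The KD loop (teacher forward, student forward, distillation loss) is the
# same for both model families.
```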
Hello Yuxian, I'm a little curious about the MiniLLM process and the dataset it uses, and want to check my understanding. I have two questions.
The first one is also used for
In my understanding, should these two DPT stand for different meanings? The first
Sorry for the two possibly redundant questions, and I really appreciate your responses. Thanks!
Thanks, that makes sense! BTW, I used LoRA to SFT the student and teacher models (the step to get the initialization models, later used in MiniLLM), because of the GPU-count constraint for full-parameter SFT. I still train for 10 epochs; will using LoRA for the SFT hugely affect the performance of MiniLLM? I have just finished the LoRA SFT of LLaMA2-1.1B, and in
Hello Yuxian, would you mind also sharing the link to the dataset of the RoBERTa corpus processed for LLaMA-2? I'm currently using LLaMA-1's processed RoBERTa data for training MiniLLM on LLaMA-2, and I saw there is only a little difference between LLaMA-1's and LLaMA-2's tokenizers. I think the influence is small, but it would be better if you could share the LLaMA-2 version.
We haven't tried using LoRA for MiniLLM. I guess it would not affect the performance much. Choosing the final model based on the RougeL score is fine.
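For context, a minimal sketch of the kind of LoRA SFT setup being discussed, using the PEFT library; the model path, rank, and target modules are illustrative assumptions, not the repo's actual configuration:

```python
# Sketch: LoRA-based SFT for obtaining the student/teacher initializations
# when full-parameter fine-tuning does not fit on the available GPUs.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("/path/to/llama2-1.1b", torch_dtype="auto")

lora_cfg = LoraConfig(
    r=16,                                                      # low-rank dimension (assumed)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],   # attention projections (assumed)
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the adapter weights are trainable

# ... run the usual SFT loop / Trainer on the instruction data ...

# Before using the checkpoint as a MiniLLM initialization, the adapters can be
# merged back into the base weights so downstream code sees a plain model:
model = model.merge_and_unload()
model.save_pretrained("/path/to/llama2-1.1b-sft-lora-merged")
```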
It would take some time for us to get the RoBERTa dataset ready. We construct it simply by merging those sub-datasets and tokenizing them. Since the dataset is used for regularization and only a small subset of the data is actually used in training (less than an epoch), small differences in how the sub-datasets are merged will not make a great difference.
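To illustrate the "merge and tokenize" construction described above, here is a hedged sketch with the `datasets` library; the sub-dataset file names and the tokenizer path are placeholders, not the authors' actual data sources:

```python
# Sketch: build the regularization (pretraining) corpus by concatenating
# sub-datasets and tokenizing them with the LLaMA-2 tokenizer.
# File and model paths are placeholders.
from datasets import load_dataset, concatenate_datasets
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("/path/to/llama2-7b")

sub_datasets = [
    load_dataset("json", data_files="subset_a.jsonl", split="train"),
    load_dataset("json", data_files="subset_b.jsonl", split="train"),
]
corpus = concatenate_datasets(sub_datasets)

def tokenize(batch):
    return tokenizer(batch["text"])

tokenized = corpus.map(tokenize, batched=True, remove_columns=corpus.column_names)
tokenized.save_to_disk("roberta_corpus_llama2_tokenized")
```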
Is it LLaMA-1 or LLaMA-2? Thanks!