
[P0] Fixing LoReFT rotation layer hot loading problem (#114) #123

Merged
merged 2 commits on Jul 25, 2024

Conversation

frankaging
Collaborator

Description:

As reported in #114, LoReFT experiments on GLUE do not reproduce the published results; in fact, the evaluation numbers are far off. We suspect that loading the model from the best checkpoint is not working, based on two observations:

  1. when running a standalone eval script, the evaluation results are stable and match the expected numbers;
  2. when running eval after loading the best checkpoint (either manually or through the HF trainer), the evaluation results do not match.

Thanks to @m-dev12, the issue appears to be that the loaded rotation layer weights are incorrect. The rotation layer is saved as a low-rank matrix to save disk space; when loading it back, we overwrite the corresponding columns of the full rotation weight matrix. It turns out this overwrite silently fails to take effect.
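
For intuition, here is a minimal, hypothetical reproduction of the suspected failure mode (not the actual pyreft code), assuming the rotation layer is wrapped in torch's orthogonal parametrization: the weight of a parametrized module is recomputed from the underlying parameters on every access, so an in-place column overwrite lands on a temporary tensor and is silently discarded.

import torch
from torch.nn.utils.parametrizations import orthogonal

# Hypothetical stand-in for a low-rank rotation layer wrapped in an
# orthogonal parametrization (names and shapes are illustrative).
linear = orthogonal(torch.nn.Linear(8, 8, bias=False))

saved_cols = torch.randn(8, 2)  # stand-in for the saved low-rank rotation

with torch.no_grad():
    # This writes into a freshly computed tensor, not the stored parameters.
    linear.weight[:, :2] = saved_cols

# The next access recomputes the weight, so the overwrite has vanished.
print(torch.allclose(linear.weight[:, :2], saved_cols))  # False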

To resolve this, we modify the loading function inside the intervention so that the weights are properly restored.
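
By contrast, assignments that go through the parametrization itself do persist. A minimal sketch of that general pattern, under the same hypothetical setup as above (not necessarily the exact code in this PR):

import torch
from torch.nn.utils.parametrizations import orthogonal

linear = orthogonal(torch.nn.Linear(8, 8, bias=False))

# An orthogonal target to load; a real checkpoint would supply this.
target, _ = torch.linalg.qr(torch.randn(8, 8))

with torch.no_grad():
    # Attribute assignment routes through the parametrization's
    # right_inverse, which updates the underlying parameters so the
    # new value survives later accesses.
    linear.weight = target

print(torch.allclose(linear.weight, target, atol=1e-5))  # True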

@frankaging
Collaborator Author

frankaging commented Jul 25, 2024

This change also fixes another dtype issue with the GLUE trainer (the label dtype is not correct; see the sketch after the test logs below) and a minor issue with a package dependency. Test logs:

Command:

python train.py -task glue -train_dataset stsb -model FacebookAI/roberta-base -seed 42 -l all -r 1 -p f3 -e 5 -lr 8e-3 -type LoreftIntervention -gradient_accumulation_steps 1 -batch_size 32 -eval_batch_size 32 -test_split validation -max_length 256 --metric_for_best_model pearson --dropout 0.00 --weight_decay 0.0000 --warmup_ratio 0.06 --logging_steps 20 --allow_cls_grad

Before:

100%|██████████| 24/24 [00:00<00:00, 40.43it/s]
{'validation_pearson': 0.14262418203758725, 'validation_spearmanr': 0.13899118144358605, 'validation_combined_score': 0.14080768174058667}
Training results can be found in ./official_results/roberta-base.glue.stsb.validation.20240724161727112780

After:

100%|██████████| 24/24 [00:00<00:00, 30.21it/s]
{'validation_pearson': 0.8391224830717346, 'validation_spearmanr': 0.8385444933269972, 'validation_combined_score': 0.838833488199366}
Training results can be found in ./official_results/roberta-base.glue.stsb.validation.20240724183608588062
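
On the label-dtype fix mentioned above: STS-B is a regression task, and PyTorch's MSE loss requires float labels, so integer-typed labels fail outright. A hypothetical sketch of the kind of cast involved (the exact trainer change may differ):

import torch

labels = torch.tensor([2, 4, 5])        # labels loaded as int64
preds = torch.tensor([2.1, 3.9, 4.8])

# Without the cast, mse_loss raises "Found dtype Long but expected Float".
loss = torch.nn.functional.mse_loss(preds, labels.to(torch.float32))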

@frankaging
Collaborator Author

Reproducing one of the paper's results with STS-B to further validate this change:

Running command:

$ python train.py -task glue -train_dataset stsb -model FacebookAI/roberta-base -seed 45 -l all -r 1 -p f3 -e 60 -lr 6e-4 -type LoreftIntervention -gradient_accumulation_steps 1 -batch_size 32 -eval_batch_size 32 -test_split validation -max_length 256 --metric_for_best_model pearson --dropout 0.05 --weight_decay 0.0000 --warmup_ratio 0.03 --logging_steps 20 --allow_cls_grad

Results:

{'loss': 0.277, 'grad_norm': 3.5031702518463135, 'learning_rate': 1.145475372279496e-06, 'epoch': 59.89}
{'loss': 0.2966, 'grad_norm': 6.497849464416504, 'learning_rate': 0.0, 'epoch': 60.0}
100%|██████████| 24/24 [00:00<00:00, 38.08it/s]
{'eval_pearson': 0.9022589082956544, 'eval_spearmanr': 0.9016221832172824, 'eval_combined_score': 0.9019405457564684, 'epoch': 60.0}
100%|██████████| 10800/10800 [10:14<00:00, 19.58it/s]
Directory './official_results/roberta-base.glue.stsb.validation.20240724203908668458/checkpoint-10800/intervenable_model' created successfully.
Loading best model from ./official_results/roberta-base.glue.stsb.validation.20240724203908668458/checkpoint-8280 (score: 0.9030545788928331).
{'train_runtime': 614.9703, 'train_samples_per_second': 560.905, 'train_steps_per_second': 17.562, 'train_loss': 0.48447934751157407, 'epoch': 60.0}
100%|██████████| 10800/10800 [10:14<00:00, 17.56it/s]
{'n_params': 18444}
100%|██████████| 24/24 [00:00<00:00, 37.89it/s]
{'validation_pearson': 0.9017251334987769, 'validation_spearmanr': 0.8985514125829985, 'validation_combined_score': 0.9001382730408878}
Training results can be found in ./official_results/roberta-base.glue.stsb.validation.20240724203908668458
