Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Fix] label_list not being set for NLP token classification training if distillation teacher and student labels do not match #1414

Merged
merged 4 commits into from
Mar 4, 2023

Conversation

markurtz
Copy link
Member

@markurtz markurtz commented Mar 2, 2023

Fix for: https://app.asana.com/0/1204070232568744/1204101088568500/f

Testing:
Rerunning the procedure described here and ensuring convergence: https://colab.research.google.com/drive/1WuWJMYY-_S-JP711bLYSRxBbcb66X1ft

Rerunning the following example and ensuring there isn't a crash:

sparseml.transformers.train.token_classification \
  --model_name_or_path zoo:nlp/masked_language_modeling/obert-base/pytorch/huggingface/wikipedia_bookcorpus/pruned90-none \
  --recipe zoo:nlp/token_classification/obert-base/pytorch/huggingface/conll2003/pruned90_quant-none \
  --distill_teacher zoo:nlp/token_classification/obert-base/pytorch/huggingface/conll2003/base-none \
  --dataset_name conll2003 \
  --output_dir sparse_bert-token_classification_conll2003 \
  --per_device_train_batch_size 32 --per_device_eval_batch_size 32 --preprocessing_num_workers 6 \
  --do_train --do_eval --evaluation_strategy epoch --fp16 --seed 29204  \
  --save_strategy epoch --save_total_limit 1

…ing if distillation teacher and student labels do not match
dbogunowicz
dbogunowicz previously approved these changes Mar 3, 2023
Copy link
Contributor

@dbogunowicz dbogunowicz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

…cher if teacher is a string; prioritizing teacher labels to student labels if teacher labels are string and student's int
Copy link
Member

@bfineran bfineran left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM pending summary of further testing

Copy link
Member

@bfineran bfineran left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blocking land due to some issues raised during testing

@bfineran
Copy link
Member

bfineran commented Mar 3, 2023

fixes added to ensure plaintext string labels are preserved on fix - the only time pre 1.4.1 behavior is overwritten is when the original bug condition is met

Copy link
Member

@bfineran bfineran left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

verified convergence and string label names are preserved for sparsification training runs with:

  • conll from HF datasets (w/ distillation)
  • conll from local (w/ distillation)
  • wnut from HF datasets (no distillation)
  • wnut from local (no distillation)

@bfineran bfineran changed the title [Fix] Fix label_list not being set for NLP token classification training if distillation teacher and student labels do not match [Fix] label_list not being set for NLP token classification training if distillation teacher and student labels do not match Mar 3, 2023
@bfineran bfineran merged commit 1097e65 into main Mar 4, 2023
@bfineran bfineran deleted the nlp-token-fix branch March 4, 2023 05:40
bfineran added a commit that referenced this pull request Mar 4, 2023
…if distillation teacher and student labels do not match (#1414)

* [Fix] Fix label_list not being set for NLP token classification training if distillation teacher and student labels do not match

* Added two fixes: omitting the labels/indices matching for student/teacher if teacher is a string; prioritizing teacher labels to student labels if teacher labels are string and student's int

* revert previous int label patch - allow int labels to let given dataset be the source of truth

* only override label_list when teacher and student labels sets are equal

---------

Co-authored-by: Damian <damian@neuralmagic.com>
Co-authored-by: Benjamin <ben@neuralmagic.com>
bfineran added a commit that referenced this pull request Mar 4, 2023
…if distillation teacher and student labels do not match (#1414)

* [Fix] Fix label_list not being set for NLP token classification training if distillation teacher and student labels do not match

* Added two fixes: omitting the labels/indices matching for student/teacher if teacher is a string; prioritizing teacher labels to student labels if teacher labels are string and student's int

* revert previous int label patch - allow int labels to let given dataset be the source of truth

* only override label_list when teacher and student labels sets are equal

---------

Co-authored-by: Damian <damian@neuralmagic.com>
Co-authored-by: Benjamin <ben@neuralmagic.com>
bfineran added a commit that referenced this pull request Mar 4, 2023
…if distillation teacher and student labels do not match (#1414) (#1416)

* [Fix] Fix label_list not being set for NLP token classification training if distillation teacher and student labels do not match

* Added two fixes: omitting the labels/indices matching for student/teacher if teacher is a string; prioritizing teacher labels to student labels if teacher labels are string and student's int

* revert previous int label patch - allow int labels to let given dataset be the source of truth

* only override label_list when teacher and student labels sets are equal

---------

Co-authored-by: Mark Kurtz <mark.kurtz@neuralmagic.com>
Co-authored-by: Damian <damian@neuralmagic.com>
bfineran added a commit that referenced this pull request Mar 4, 2023
…if distillation teacher and student labels do not match (#1414) (#1415)

* [Fix] Fix label_list not being set for NLP token classification training if distillation teacher and student labels do not match

* Added two fixes: omitting the labels/indices matching for student/teacher if teacher is a string; prioritizing teacher labels to student labels if teacher labels are string and student's int

* revert previous int label patch - allow int labels to let given dataset be the source of truth

* only override label_list when teacher and student labels sets are equal

---------

Co-authored-by: Mark Kurtz <mark.kurtz@neuralmagic.com>
Co-authored-by: Damian <damian@neuralmagic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants