Use accuracy as default metric for text classification Evaluator #128
Conversation
```diff
-from datasets import Dataset, load_dataset, load_metric
-from transformers import AutoTokenizer, BertForSequenceClassification, pipeline
+from datasets import load_dataset
+from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline
```
I took the liberty of cleaning this up a bit to follow the conventions in `transformers` (use auto classes over model-specific ones where possible).
```diff
 class TestEvaluator(TestCase):
     def setUp(self):
-        self.data = Dataset.from_dict(load_dataset("imdb")["test"][:2])
+        self.data = load_dataset("imdb", split="test[:2]")
```
I took the liberty of using something more `datasets`-like here.
Makes sense! Looks good to me, I'd wait for @lvwerra to weigh in as he was suggesting F1 as the default metric here
LGTM! 🚀
This PR replaces the current default metric in the text classification `Evaluator` from `f1` to `accuracy`.

The reason for doing so is that the `Evaluator` fails when the dataset is multiclass, because the default average in `f1` is `binary`. Although there are well-known limitations with using accuracy as the only metric, I think it's better to have defaults that "just work".

Remark: it probably would make sense to expand the unit tests to test the evaluator on binary / multiclass / multilabel datasets. I'm happy to implement that if you think that's useful.