Add NIST metric #250
Conversation
This looks great already! My main comment is around input format: I'd prefer to input untokenized strings, similar to BLEU. We can either use the same tokenizer as is used there or a whitespace tokenizer. The motivation is that this way we use the same format for both metrics (as well as many other NLP metrics), which will allow us to easily combine them. What do you think?
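For reference, this is the untokenized-string input format that evaluate's BLEU metric uses (example sentences are made up):

```python
import evaluate

# BLEU in evaluate takes raw strings and tokenizes them internally:
bleu = evaluate.load("bleu")
results = bleu.compute(
    predictions=["hello there general kenobi"],
    references=[["hello there general kenobi", "hello there !"]],
)
print(results["bleu"])
```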
Thanks for having a look! I'd rather stick as close to the original paper as possible and not start mix-and-matching tokenizers. There are probably only minor differences, but relying on a standard implementation for each metric makes things easier to keep track of, imo. But you are right that having the same input format would be useful! What about using the nltk NIST tokenizer? The paper mentions (p. 138) that all text is lower-cased before calculating the metric, so lowercase should be False. In the future, I think it'd be nice to also support token-level input, in addition to sentence-level. Note that these tokenizers have their own options, both in init and when calling their tokenization methods (see the sketch below).
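For illustration, NLTK's NISTTokenizer takes its options at call time rather than in `__init__` (a minimal sketch; the tokenizer may first require `nltk.download("perluniprops")`):

```python
from nltk.tokenize.nist import NISTTokenizer

tokenizer = NISTTokenizer()
# Options go to the tokenization method, not to the constructor:
tokens = tokenizer.tokenize(
    "This is an example sentence, tokenized the NIST way.",
    lowercase=False,     # whether to lower-case before tokenizing
    western_lang=True,   # apply the western-language punctuation regexes
)
print(tokens)
```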
Sounds good. Like in BLEU we can use it as the default but let the user experiment with other tokenizers should it be appropriate (e.g. for langs that require special tokenization).
Hey @lvwerra. I had some time to work on this. I can't seem to figure out why the tests are giving errors though. My input is a string for predictions and a list for the references. This is also visible in my features. But pytest still throws errors. Can you see what I'm missing here?
@lvwerra Tests run fine locally for NIST specifically, so I guess I have incorporated the test suite incorrectly.
Hi @BramVanroy, thanks for working on this and sorry for getting back so late. I think the issue is still the `tokenizers_kwargs`; passing explicit keyword arguments instead should work better.
Hi @lvwerra, sorry for taking so long to get back to this. I've swapped out the `tokenizers_kwargs` for explicit kwargs. Hope that helps.
The Windows errors seem unrelated to this PR.
Also only minor comments :)
metrics/nist_mt/nist_mt.py (Outdated)

```python
@evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
class Nist_mt(evaluate.Metric):
```
I am a bit uneasy with that class name :) Can we make it
```diff
-class Nist_mt(evaluate.Metric):
+class NistMt(evaluate.Metric):
```
or something similar? Class names are not supposed to have underscores.
Sure! I don't like it either, not sure how that slipped in. My bad. I think you mentioned before that we should not just call it NIST because there are other NIST metrics out there, so I guess NistMt is the best option?
metrics/nist_mt/nist_mt.py (Outdated)

```python
def _compute(self, predictions, references, n: int = 5, lowercase=False, western_lang=True):
    tokenizer = NISTTokenizer()
    if isinstance(predictions, str) and isinstance(references[0], str):  # sentence nist_mt
```
I think predictions are always a list so this should be
```diff
-if isinstance(predictions, str) and isinstance(references[0], str):  # sentence nist_mt
+if isinstance(predictions[0], str) and isinstance(references[0], str):  # sentence nist_mt
```
no?
No, I do not think so, because for NIST you can have multiple references, so references has one dimension more than predictions.
I wanted to account for both single sentences and batches of sentences. But I think you are right: `_compute` always works on batches, right? So I guess I can just remove these if-clauses. And I think I should then also be able to remove the first feature in `self.features`. Those features only indicate the type of a batch, right? Not single instances?
I think the if/else is fine, but it should be `predictions[0]` in both cases, whereas for refs it is `references[0]` and `references[0][0]` for one or more references. `self.features` shows the type of one element of the batch.
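To make the indexing concrete, here is a small illustration of the batch shapes under discussion (example sentences are made up):

```python
# One prediction string and a list of reference strings per sample:
predictions = ["the cat sat on the mat"]
references = [
    ["a cat sat on the mat", "the cat was sitting on the mat"],
]

assert isinstance(predictions[0], str)    # one element of the predictions batch
assert isinstance(references[0], list)    # all references for one sample
assert isinstance(references[0][0], str)  # a single reference string
```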
I've updated the metric, assuming that a given sample is always a prediction (`str`) and reference (`Sequence[str]`). That should take away some confusion, I believe.
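Under that format, calling the metric would look roughly like this (a sketch; it assumes the metric loads under the name `nist_mt`, inferred from this PR's `metrics/nist_mt/` directory):

```python
import evaluate

# Hypothetical load name, inferred from the metric directory in this PR:
nist = evaluate.load("nist_mt")
predictions = ["the cat sat on the mat"]
references = [["a cat sat on the mat", "the cat was sitting on the mat"]]
print(nist.compute(predictions=predictions, references=references))
```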
```diff
@@ -105,6 +105,7 @@
 TESTS_REQUIRE = [
     # test dependencies
     "absl-py",
+    "nltk",  # for NIST and probably others
```
Something must have already installed it since the tests worked until now, but ok to have it explicitly :)
NIST is a somewhat older but well-known metric for MT that is similar to BLEU. I'd like to add it to the base arsenal of `evaluate`. Core work is done. Still need to write README, examples, and test cases.
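For context, a minimal sketch of the NLTK primitives such a metric would wrap (`nltk.translate.nist_score` is NLTK's NIST implementation; example sentences are made up):

```python
from nltk.tokenize.nist import NISTTokenizer
from nltk.translate.nist_score import corpus_nist, sentence_nist

tokenizer = NISTTokenizer()

hypothesis = tokenizer.tokenize("the cat sat on the mat")
references = [tokenizer.tokenize("a cat was sitting on the mat")]

# Sentence-level NIST, using up to 5-grams (nltk's default):
print(sentence_nist(references=references, hypothesis=hypothesis, n=5))

# Corpus-level NIST: one list of references per hypothesis.
print(corpus_nist([references], [hypothesis], n=5))
```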