
Update comet #443

Merged: 2 commits into huggingface:main on Jun 24, 2023

Conversation

ricardorei (Contributor)

The COMET metric was recently updated to v2.0, and the predict interface now returns a single object instead of two floats.

This pull request adds a simple check of the installed package version and adjusts the behaviour accordingly.

We also updated the metric README, which was slightly outdated. We recently released improved metrics that we developed for the WMT22 Metrics shared task. For versions above 2.0 the default model is wmt22-comet-da instead of wmt20-comet-da. This new model performs better across language pairs and domains while being more interpretable. You can check the blogpost here.
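The version-gated behaviour described above could be sketched roughly as follows. This is an illustrative sketch, not the actual diff of this pull request: the helper names (`comet_major_version`, `extract_scores`) are invented for illustration, and the assumed package name is `unbabel-comet`.

```python
# Sketch of branching on the installed COMET version (illustrative only;
# helper names are not from this PR's actual diff).
from importlib.metadata import version, PackageNotFoundError


def comet_major_version(default="1.1.3"):
    """Return the major version of the installed unbabel-comet package."""
    try:
        ver = version("unbabel-comet")
    except PackageNotFoundError:
        ver = default  # fall back to a pre-2.0 version string
    return int(ver.split(".")[0])


def extract_scores(prediction, major):
    """Normalize model output across COMET versions (assumed shapes).

    v2.x: predict() returns a single object exposing per-segment scores
          and a system-level score.
    v1.x: predict() returns two values (segment scores, system score).
    """
    if major >= 2:
        return prediction.scores, prediction.system_score
    seg_scores, sys_score = prediction
    return seg_scores, sys_score
```

A caller would then compute once and unpack with `extract_scores(output, comet_major_version())`, keeping the rest of the metric code version-agnostic.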

@manueldeprada (Contributor)

I am not a maintainer, but I have just tested this, and it works as advertised.

Thanks, @ricardorei, I hope this gets merged soon!

@BramVanroy (Contributor)

@lvwerra Friendly ping. Can this be merged (and maybe even with a pip upgrade)? Thanks!

@joao-alves97 (commented May 3, 2023)

This is not working as expected. I ran the example:

import evaluate  # assuming the Hugging Face evaluate library

comet_metric = evaluate.load("comet")
source = ["Dem Feuer konnte Einhalt geboten werden", "Schulen und Kindergärten wurden eröffnet."]
hypothesis = ["The fire could be stopped", "Schools and kindergartens were open"]
reference = ["They were able to control the fire.", "Schools and kindergartens opened"]
comet_score = comet_metric.compute(predictions=hypothesis, references=reference, sources=source)

I expected to see a COMET score, but the output was:

>>> comet_score 
{'mean_score': 'system_score', 'scores': 'scores'}

This output clearly differs from what I expected. @ricardorei can you help? Thanks

@BramVanroy (Contributor)

> This is not working as expected. I ran the example [...] I expected to see a COMET score, but the output was {'mean_score': 'system_score', 'scores': 'scores'}. This output clearly differs from what I expected. @ricardorei can you help? Thanks

That is the expected output: mean_score is the COMET score (the mean of all per-sentence scores).
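The relationship between the two keys can be sketched as below. The score values are made-up illustrative numbers, not real model output; only the dictionary shape (`mean_score`, `scores`) comes from the thread.

```python
# Minimal sketch of consuming the dict returned by comet_metric.compute().
# The numbers are invented for illustration, not real COMET output.
from statistics import mean

comet_score = {
    "mean_score": 0.8155,      # corpus-level COMET score
    "scores": [0.839, 0.792],  # one score per (source, hypothesis, reference) triple
}

corpus_score = comet_score["mean_score"]  # use this as "the" COMET score
per_sentence = comet_score["scores"]      # per-segment scores for analysis

# mean_score is just the mean of the per-sentence scores.
assert abs(corpus_score - mean(per_sentence)) < 1e-3
```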

@ricardorei (Contributor, Author)

@BramVanroy @joao-alves97 which comet version are you using? I was not able to replicate what you are referring to.

I tested unbabel-comet==1.1.3, 2.0.0, and 2.0.1.

@BramVanroy (Contributor)

@ricardorei I think what @joao-alves97 is saying is that the output is a dictionary with the keys mean_score and scores, but they expected the output to be a float. I clarified that that is not how it works, and that mean_score is the aggregated corpus-level score they should use.

@lvwerra lvwerra merged commit af3c305 into huggingface:main Jun 24, 2023