Behavior before this PR:
if length_penalty = none
AVG SCORE = total log-probs of all sentences / total number of words ===> ("average per-word log prob")
PPL = exp(-AVG SCORE) ===> ("average per-word perplexity", even if not really relevant)
if length_penalty = avg (or wu)
AVG SCORE and PPL are completely wrong, since the scores were normalized twice: once by the length penalty and again by the total number of words.
Behavior after this PR:
if length_penalty = none
SCORE = total log-probs of all sentences / number of sentences ===> (an average score per sentence; again not 100% precise, since we would need to report scores at the sentence level and then average them)
PPL = exp(-SCORE) ===> approximation of a per-sentence perplexity
if length_penalty = avg (or wu)
SCORE = total normalized log-prob (per token) / number of sentences ===> average normalized log prob
PPL = exp(-SCORE) ===> approximate "average per-word perplexity", similar to the length_penalty = none case before this PR.
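
To make the difference concrete, here is a minimal sketch of the two reporting schemes on a toy batch. The function names (report_before, report_after) and the sample values are hypothetical and are not taken from the code in this PR. The sketch also assumes perplexity is computed as exp of the negated average log-prob, and that the avg length penalty simply divides each sentence's log-prob by its length.

```python
import math

def report_before(sentence_scores, sentence_lengths):
    """Pre-PR reporting (hypothetical helper): divide by the total word count."""
    avg_score = sum(sentence_scores) / sum(sentence_lengths)
    return avg_score, math.exp(-avg_score)

def report_after(sentence_scores):
    """Post-PR reporting (hypothetical helper): divide by the number of sentences."""
    score = sum(sentence_scores) / len(sentence_scores)
    return score, math.exp(-score)

# Toy batch: two sentences with raw total log-probs and token counts.
logprobs = [-4.0, -6.0]
lengths = [4, 6]

# Assumed 'avg' length penalty: each score is already divided by its length.
normalized = [lp / n for lp, n in zip(logprobs, lengths)]

# Before, length_penalty = none: average per-word log prob.
before_none = report_before(logprobs, lengths)

# Before, length_penalty = avg: scores that are already per-token get divided
# by the word count again, which is the double normalization described above.
before_avg = report_before(normalized, lengths)

# After, length_penalty = none: average score per sentence.
after_none = report_after(logprobs)

# After, length_penalty = avg: average normalized log prob per sentence.
after_avg = report_after(normalized)
```

On this toy batch, after_avg matches before_none (an average log prob of -1.0 per word), while before_avg shrinks to -0.2 because of the second division. The match is exact here only because both sentences have the same per-token log prob; in general the two are approximations of each other.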