Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add language to word timestamps for Whisper #31572

Merged

Conversation

robinderat
Copy link
Contributor

@robinderat robinderat commented Jun 24, 2024

Add language to word timestamps for Whisper

Fixes 29520

This fix enables whisper to return word-level timestamps and the predicted language at the same time. Before, enabling word-level timestamps would discard the language prediction.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@kamilakesbi
@sanchit-gandhi

_collate_word_timestamps uses the return_language flag to determine whether the language of the chunk should be added to the word's information
added missing comma
@kamilakesbi
Copy link
Contributor

Hi @robinderat, thanks for working on this!

Could you please add a test in test_tokenization_whisper.py to verify that we can indeed return both word-level timestamps and the predicted language at the same time?

test that the pipeline can return both the language and timestamp
@kamilakesbi
Copy link
Contributor

LGTM! thanks for iterating on this :)

cc @amyeroberts for final review!

Copy link
Contributor

@sanchit-gandhi sanchit-gandhi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution @robinderat! 🤗

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@AvivSham
Copy link
Contributor

@amyeroberts gentle reminder 😄

Copy link
Collaborator

@amyeroberts amyeroberts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the ping - just some questions to better understand the diff

@@ -1197,12 +1197,16 @@ def _find_longest_common_sequence(sequences, token_timestamp_sequences=None):
return total_sequence, []


def _collate_word_timestamps(tokenizer, tokens, token_timestamps, language):
def _collate_word_timestamps(tokenizer, tokens, token_timestamps, language, return_language):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is return_language always a bool, or could it be None too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can be a bool or None. However, the behavior for return_language=None is equal to that of return_language=False

)
data = load_dataset("openslr/librispeech_asr", "clean", split="test", streaming=True, trust_remote_code=True)
sample = next(iter(data))
pipe.model.config.forced_decoder_ids = pipe.tokenizer.get_decoder_prompt_ids(language="en", task="transcribe")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this modification necessary for people to be able to use return_language with the pipeline?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I based my test on an existing one that had this modification in it. However, it does not seem to affect the tests, so I have now removed them.

Specifying return_language=True in the pipeline is all that is required for this to work

Removed model configurations that do not influence test results
Removed model configurations that do not influence test results
Copy link
Collaborator

@amyeroberts amyeroberts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding and iterating on this!

@amyeroberts amyeroberts merged commit b31d595 into huggingface:main Jul 17, 2024
18 checks passed
@robinderat robinderat deleted the whisper-language-with-word-timestamps branch July 18, 2024 07:21
amyeroberts pushed a commit to amyeroberts/transformers that referenced this pull request Jul 19, 2024
* add language to words

_collate_word_timestamps uses the return_language flag to determine whether the language of the chunk should be added to the word's information

* ran style checks

added missing comma

* add new language test

test that the pipeline can return both the language and timestamp

* remove model configuration in test

Removed model configurations that do not influence test results

* remove model configuration in test

Removed model configurations that do not influence test results
MHRDYN7 pushed a commit to MHRDYN7/transformers that referenced this pull request Jul 23, 2024
* add language to words

_collate_word_timestamps uses the return_language flag to determine whether the language of the chunk should be added to the word's information

* ran style checks

added missing comma

* add new language test

test that the pipeline can return both the language and timestamp

* remove model configuration in test

Removed model configurations that do not influence test results

* remove model configuration in test

Removed model configurations that do not influence test results
zucchini-nlp pushed a commit to zucchini-nlp/transformers that referenced this pull request Jul 24, 2024
* add language to words

_collate_word_timestamps uses the return_language flag to determine whether the language of the chunk should be added to the word's information

* ran style checks

added missing comma

* add new language test

test that the pipeline can return both the language and timestamp

* remove model configuration in test

Removed model configurations that do not influence test results

* remove model configuration in test

Removed model configurations that do not influence test results
itazap pushed a commit that referenced this pull request Jul 25, 2024
* add language to words

_collate_word_timestamps uses the return_language flag to determine whether the language of the chunk should be added to the word's information

* ran style checks

added missing comma

* add new language test

test that the pipeline can return both the language and timestamp

* remove model configuration in test

Removed model configurations that do not influence test results

* remove model configuration in test

Removed model configurations that do not influence test results
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Whisper language is not returned when using word timestamps
6 participants