Investigate speeding up the movie review model #775

Open
loostrum opened this issue May 29, 2024 · 1 comment

Comments

@loostrum
Member

In PR #773 some changes were made to the movie reviews model runner. We apparently had two versions:

  • One that runs the input sentences through the tokenizer and model one-by-one
  • One that runs all inputs through the tokenizer, then all through the model at once.

The second version is faster but fails on some inputs containing special characters, so we're using the first for now. It would be good if we could fix the second version and use that instead.
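To make the two versions concrete, here is a minimal sketch. It is not the actual DIANNA runner: `toy_tokenize` and `toy_model` are hypothetical stand-ins, and the only property they share with the real model is the assumption that a batch must be rectangular (all rows the same length).

```python
def toy_tokenize(sentence):
    """Hypothetical tokenizer: split on whitespace only."""
    return sentence.split()


def toy_model(batch):
    """Hypothetical model: one score per row of a rectangular batch.

    Raises ValueError when the rows differ in length, mimicking the crash
    described above. The score itself (mean token length) is arbitrary.
    """
    lengths = {len(row) for row in batch}
    if len(lengths) > 1:
        raise ValueError(f"ragged batch, row lengths: {sorted(lengths)}")
    return [sum(len(tok) for tok in row) / len(row) for row in batch]


def run_one_by_one(sentences):
    """Version 1: tokenize and run the model separately for each sentence."""
    return [toy_model([toy_tokenize(s)])[0] for s in sentences]


def run_batched(sentences):
    """Version 2: tokenize everything, then run the model once on the batch."""
    return toy_model([toy_tokenize(s) for s in sentences])
```

With sentences that tokenize to the same length, both functions return the same scores. As soon as the lengths differ, only `run_one_by_one` works.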

The errors occur because all inputs (after masking and tokenization) passed to the model in one go need to have the same length. Some combinations of special characters result in a different number of tokens and hence a crash. The differing number of tokens comes from the tokenizer; it's best illustrated by running the tokenizer on a few (masked) inputs containing special characters. I'm not sure we can completely fix it at the tokenizer level.
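A toy illustration of the effect, under stated assumptions: the regex tokenizer below is a hypothetical stand-in (not the tokenizer the model actually uses), `UNKWORDZ` is assumed as the mask token, and the masked strings are made up. It only shows how masking part of a special-character sequence can change the token count.

```python
import re

MASK = "UNKWORDZ"  # assumed mask token, for illustration only


def toy_tokenize(text):
    """Hypothetical tokenizer: words, or runs of punctuation, are one token each."""
    return re.findall(r"\w+|[^\w\s]+", text)


# Made-up masked variants of the sentence "a great movie :-)"
masked_inputs = [
    "a great movie :-)",          # nothing masked
    f"a {MASK} movie :-)",        # a word masked: token count unchanged
    f"a great movie : {MASK} )",  # part of the emoticon masked: it splits up
]

token_counts = [len(toy_tokenize(text)) for text in masked_inputs]
print(token_counts)
```

Here the third variant yields more tokens than the other two, so the three inputs could not be stacked into one rectangular batch without extra handling.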

One solution might be to pad the inputs to all match the length of the longest one. There is a special padding token that can be used for this. The question is how this affects the output. I.e., is running the model sentence-by-sentence the same as running all sentences in one go but with added padding? This is to be investigated before it is implemented in DIANNA. A similar padding function was used during model training, see here and here.
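A sketch of how the padding idea and the follow-up check could look. Everything here is an assumption for illustration: the `<pad>` token name, right-padding, and the two toy scoring functions, which stand in for the real model. The point is only that whether padding changes the output depends on how the model treats the padding token, which is what would need to be verified.

```python
import math

PAD = "<pad>"  # assumed name of the padding token


def pad_batch(token_lists, pad_token=PAD):
    """Right-pad every token list to the length of the longest one."""
    max_len = max(len(tokens) for tokens in token_lists)
    return [tokens + [pad_token] * (max_len - len(tokens)) for tokens in token_lists]


def toy_score(tokens):
    """Hypothetical model that treats padding like any other token."""
    return sum(len(tok) for tok in tokens) / len(tokens)


def toy_score_ignoring_pad(tokens, pad_token=PAD):
    """Hypothetical model that skips padding tokens before scoring."""
    real = [tok for tok in tokens if tok != pad_token]
    return sum(len(tok) for tok in real) / len(real)


def padding_changes_output(score_fn, token_lists, tol=1e-6):
    """Compare per-sentence scores without padding against scores with padding."""
    unpadded = [score_fn(tokens) for tokens in token_lists]
    padded = [score_fn(tokens) for tokens in pad_batch(token_lists)]
    return any(not math.isclose(u, p, abs_tol=tol) for u, p in zip(unpadded, padded))


token_lists = [
    ["a", "great", "movie", ":-)"],
    ["a", "great", "movie", ":", "UNKWORDZ", ")"],
]
print(padding_changes_output(toy_score, token_lists))
print(padding_changes_output(toy_score_ignoring_pad, token_lists))
```

The same comparison could be run against the real model on a set of masked inputs: if the padded batch and the sentence-by-sentence outputs agree within a tolerance, padding would be safe to use.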

Note that the model runner is implemented in several locations: the lime text tutorial, the rise text tutorial, _movie_model.py in the dashboard, and tests/utils.py.

@elboyran
Contributor

I can confirm that dianna can now handle special characters, but the text tutorials run very slowly.


2 participants