
json2token performance #266

Open
benjaminfh opened this issue Nov 1, 2023 · 1 comment · May be fixed by #268

@benjaminfh commented Nov 1, 2023

Hi team, thanks for all the effort on this model; it looks super promising and powerful and is a great contribution. For context, I've been using this as a learning exercise to get to grips with the latest model architectures (so this issue might have some built-in naiveté!).

I've prepped a dataset for my specific use case that I'd like to train Donut on, with close to 10k images to throw at it if I wish. At this scale I noticed that tokenizing the gt_parse data is noticeably slow: it would cost me about 2 hours of GPU time just to do the tokenizing. I traced it back to this line in utils.py:

`self.gt_token_sequences.append(`

It calls a fairly expensive tokenizer method in transformers/tokenization_utils_base (https://github.com/huggingface/transformers/blob/c9e72f55b2dc4b9be4edb986dce0552582b328f2/src/transformers/tokenization_utils_base.py#L873): roughly 50 ms per call, and in my case it gets called about 20 times per row of training data, i.e. over a second per row. 99.999% of the time the method reports that 0 special tokens were added; they were of course already added by a previous row of data, but it still takes ~50 ms to establish that.
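
To make the overhead concrete, here's a minimal timing sketch. The checkpoint is a stand-in of my choosing (`xlm-roberta-base`), not necessarily the tokenizer Donut loads, but it shows the pattern: a repeat call returns 0 newly added tokens yet still pays the full cost of the method.

```python
import time

from transformers import AutoTokenizer

# Stand-in tokenizer for illustration only; Donut's actual tokenizer may differ.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

# First call genuinely adds the tokens to the vocab.
tokenizer.add_special_tokens({"additional_special_tokens": ["<s_item>", "</s_item>"]})

# Repeat call with the same tokens: returns 0, but the work is still done.
start = time.perf_counter()
added = tokenizer.add_special_tokens({"additional_special_tokens": ["<s_item>", "</s_item>"]})
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"newly added: {added}, call took {elapsed_ms:.1f} ms")
```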

Since special tokens are not modified by the tokenizer (as I understand it), why do we need to call into the tokenizer just to discover that a given token has already been added? Genuine question; I'm conscious there may be nuance here I'm missing. Could we instead maintain, in donut/utils, a set of special tokens that have already been tokenized and added, drop those from future calls to the tokenizer, and, crucially, skip the call altogether when there are no new special tokens for it to consider? Something like the sketch below.
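
A minimal sketch of the idea, assuming the cache lives in donut/utils (the function and set names below are hypothetical, not existing Donut code):

```python
# Hypothetical module-level cache of special tokens we have already registered.
_already_added = set()


def add_special_tokens_cached(tokenizer, tokens):
    """Call tokenizer.add_special_tokens only for tokens we haven't seen before."""
    new_tokens = [t for t in tokens if t not in _already_added]
    if not new_tokens:
        return 0  # nothing new: skip the expensive tokenizer call entirely
    added = tokenizer.add_special_tokens(
        {"additional_special_tokens": sorted(new_tokens)}
    )
    _already_added.update(new_tokens)
    return added
```

The first row of training data would still pay the cost once per new `<s_key>`/`</s_key>` pair, but every later row with the same keys would skip the tokenizer call altogether.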

Happy to create a PR if this sounds sensible and I'm not missing something special that the tokenizer call is doing for `<token_x>` the nth time over :)

Cheers!

@benjaminfh (Author) commented Nov 1, 2023

I think this should do the trick: https://github.com/clovaai/donut/pull/268/files
In fact, I think there may have been a bug all along: for each piece of training data read in, the JSON keys were (rightly) added as decoder special tokens, but a few layers down the tokenizer's add_special_tokens method has a replace flag that defaults to True, which effectively drops any special tokens added in a previous iteration.

With that flag fixed, we can then check (successfully!) whether the current iteration's special tokens are already in the vocab and skip the tokenizer call if so. That saves potentially hours of calls to the tokenizer's method, all of which were in vain anyway because their result would soon be overwritten :) Roughly, the change looks like the sketch below.
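
For anyone who doesn't want to open the diff, here is a simplified sketch of the shape of the change rather than the exact code in #268. It assumes the installed transformers version exposes the replace_additional_special_tokens keyword, and the helper name and the embedding resize step are illustrative:

```python
def add_special_tokens_once(tokenizer, model, list_of_tokens):
    """Register new special tokens, skipping ones the tokenizer already knows."""
    vocab = tokenizer.get_vocab()  # includes tokens added in earlier iterations
    new_tokens = sorted(set(t for t in list_of_tokens if t not in vocab))
    if not new_tokens:
        return 0  # nothing new for this row: skip the expensive call entirely

    # replace_additional_special_tokens=False keeps the special tokens registered by
    # earlier iterations instead of overwriting them (the default, True, drops them).
    newly_added = tokenizer.add_special_tokens(
        {"additional_special_tokens": new_tokens},
        replace_additional_special_tokens=False,
    )
    if newly_added > 0:
        model.resize_token_embeddings(len(tokenizer))
    return newly_added
```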

Ben

benjaminfh linked a pull request Nov 14, 2023 that will close this issue