A question about tokenizer file #2

hideinhat · 2024-04-11T02:22:18Z

Hello, I am trying to fine-tune TrOCR to recognize LaTeX math expressions on my custom dataset.
Do I need to train a custom tokenizer for LaTeX?
Or is the pretrained tokenizer included with TrOCR good enough?
Any pointers or help would be greatly appreciated.
Thank you!

win5923 · 2024-04-11T03:04:03Z

Yes, my results using the pre-trained tokenizer were not satisfactory. I believe it's necessary to train a dedicated tokenizer based on mathematical symbols. Perhaps one already exists.

from transformers import TrOCRProcessor
processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
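
For anyone wondering what a tokenizer "based on mathematical symbols" could look like, here is a minimal sketch in plain Python. It is not part of TrOCR or of any library: the regex, the special tokens, and the helper names (`tokenize_latex`, `build_vocab`, `encode`) are all assumptions made for illustration, and the two-line corpus stands in for the labels of a real dataset. The idea is to keep each LaTeX command such as `\frac` as a single token instead of letting a general-purpose subword tokenizer split it into pieces.

```python
import re
from collections import Counter

# Hypothetical token pattern: a LaTeX command (\frac, \alpha), an escaped
# symbol (\{, \\), or any single non-whitespace character.
LATEX_TOKEN = re.compile(r"\\[A-Za-z]+|\\.|\S")

# Assumed special tokens; a real setup would match the decoder's conventions.
SPECIAL_TOKENS = ["<pad>", "<s>", "</s>", "<unk>"]


def tokenize_latex(expr):
    """Split a LaTeX string into command- and symbol-level tokens."""
    return LATEX_TOKEN.findall(expr)


def build_vocab(corpus, min_freq=1):
    """Map each token seen at least min_freq times to an integer id."""
    counts = Counter(tok for expr in corpus for tok in tokenize_latex(expr))
    tokens = sorted(t for t, c in counts.items() if c >= min_freq)
    return {tok: i for i, tok in enumerate(SPECIAL_TOKENS + tokens)}


def encode(expr, vocab):
    """Convert a LaTeX string to ids, wrapped in <s> ... </s>."""
    unk = vocab["<unk>"]
    ids = [vocab.get(tok, unk) for tok in tokenize_latex(expr)]
    return [vocab["<s>"]] + ids + [vocab["</s>"]]


# Toy corpus standing in for the labels of a custom dataset.
corpus = [r"\frac{a}{b} + \alpha^2", r"\sqrt{x_1} \leq \beta"]
vocab = build_vocab(corpus)
print(tokenize_latex(corpus[0]))
print(encode(r"\frac{a}{b}", vocab))
```

Note that a vocabulary built this way has a different size than the pretrained one, so the decoder's embedding and output layers would have to be resized and fine-tuned to match before it could be used with TrOCR.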

hideinhat · 2024-04-11T04:27:32Z

Thank you for the reply. I found one at https://huggingface.co/witiko/mathberta. I will give it a try.
