A question about tokenizer file #2

Open
hideinhat opened this issue Apr 11, 2024 · 2 comments

Comments

@hideinhat

Hello, I am trying to fine-tune TrOCR to recognize LaTeX math expressions on my custom dataset.
Do I need to train a custom tokenizer for the LaTeX format?
Or is the pretrained tokenizer included with TrOCR good enough?
Any pointers or help would be greatly appreciated.
Thank you!

@win5923
Owner

win5923 commented Apr 11, 2024

Yes. My results with the pre-trained tokenizer were not satisfactory, so I believe it's necessary to train a dedicated tokenizer based on mathematical symbols. Perhaps one already exists.

from transformers import TrOCRProcessor

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
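
To make the idea of a tokenizer "based on mathematical symbols" concrete, here is a minimal sketch of LaTeX-aware tokenization in plain Python. It is only an illustration under stated assumptions, not what this repository does: the regex, the special tokens, and the names `tokenize`, `build_vocab`, and `encode` are all hypothetical. The point is that a command such as `\frac` stays a single token instead of being split into subword pieces. Wiring such a vocabulary into TrOCR is not covered here.

```python
import re
from collections import Counter

# Assumption: a LaTeX command (\frac, \alpha) or an escaped character (\{, \\)
# is one token; every other non-space character is its own token.
LATEX_TOKEN = re.compile(r"\\[A-Za-z]+|\\.|\S")

# Hypothetical special tokens; a real setup would match the decoder's config.
SPECIAL_TOKENS = ["<pad>", "<s>", "</s>", "<unk>"]


def tokenize(formula):
    """Split a LaTeX string into command-level and character-level tokens."""
    return LATEX_TOKEN.findall(formula)


def build_vocab(formulas, min_freq=1):
    """Map each token seen at least min_freq times to an integer id."""
    counts = Counter(tok for f in formulas for tok in tokenize(f))
    tokens = sorted(t for t, c in counts.items() if c >= min_freq)
    return {tok: i for i, tok in enumerate(SPECIAL_TOKENS + tokens)}


def encode(formula, vocab):
    """Convert a formula to ids, wrapping it in <s> ... </s>."""
    unk = vocab["<unk>"]
    ids = [vocab.get(tok, unk) for tok in tokenize(formula)]
    return [vocab["<s>"]] + ids + [vocab["</s>"]]
```

With this scheme, multi-digit numbers are split into single digits and unseen symbols fall back to `<unk>`, so the vocabulary should be built from a corpus that covers the symbols in your dataset.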

@hideinhat
Author

Thank you for the reply. I found one at https://huggingface.co/witiko/mathberta. I will try it.
