How to extract complete text from the document? #292
Comments
You can run inference on the base model, which has not been fine-tuned to any JSON schema, to get an OCR prediction just like in the pre-training task. Here is a code snippet that should get you started:
from transformers import DonutProcessor, VisionEncoderDecoderModel
from PIL import Image
model_path = "donut-base"
processor = DonutProcessor.from_pretrained(model_path)
model = VisionEncoderDecoderModel.from_pretrained(model_path)
# scale = 1 # optionally change input image scale
# h, w = 1263 // scale, 893 // scale
# processor.image_processor.size = {"height": h, "width": w}
# model.config.encoder.image_size = [h, w]
max_new_tokens = 1024 # increase to get more text
image = Image.open("../test-document.png").convert("RGB")
task_prompt = "<s_iitcdip>" # Prompt of pretraining, can be reused for OCR
decoder_input_ids = processor.tokenizer(
    task_prompt, add_special_tokens=False, return_tensors="pt"
).input_ids
pixel_values = processor(image, return_tensors="pt").pixel_values
predict_ids = model.generate(
    pixel_values,
    decoder_input_ids=decoder_input_ids,
    max_new_tokens=max_new_tokens,
    use_cache=True,
    bad_words_ids=[[0, 1, 2, 3, 57522]],
    eos_token_id=2,
    pad_token_id=1,
)
predict_seq = processor.tokenizer.batch_decode(predict_ids)
print(predict_seq)
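The decoded sequence printed by the snippet usually still contains the task prompt and special tokens. A minimal sketch of a cleanup step follows, assuming the end-of-sequence and padding tokens decode to `</s>` and `<pad>` (check `processor.tokenizer.eos_token` and `processor.tokenizer.pad_token` for your checkpoint). The helper name `clean_ocr_sequence` is hypothetical and not part of Donut or transformers. Note that it also collapses all whitespace into single spaces.

```python
import re


def clean_ocr_sequence(sequence, task_prompt="<s_iitcdip>",
                       eos_token="</s>", pad_token="<pad>"):
    """Strip the task prompt and special tokens from one decoded sequence.

    The default token strings are assumptions; pass the values from your
    own tokenizer if they differ.
    """
    # Remove end-of-sequence and padding markers wherever they appear.
    text = sequence.replace(eos_token, "").replace(pad_token, "")
    # Remove the task prompt that was fed to the decoder.
    text = text.replace(task_prompt, "")
    # Collapse runs of whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    # Made-up example string, not real model output.
    raw = "<s_iitcdip> Invoice No. 12345  Total: 42.00</s><pad><pad>"
    print(clean_ocr_sequence(raw))
```

With the snippet above, this could be applied as `[clean_ocr_sequence(s) for s in predict_seq]`, since `batch_decode` returns one string per image.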
@felixvor thanks for sharing the code, it works. There is one small problem: some of my documents contain handwritten text, and the model is not able to pick it up. Here's my code:
And the output looks as follows:
When I process the same document using the fine-tuned version
Is there any way I can parse the complete text, including the handwritten text, using the base model, or get the complete text output from the fine-tuned model?
In my opinion, the strength of Donut is not its OCR generation but the possibility of fine-tuning it on specific tasks. At the moment I can't think of a straightforward way to use the QA model for OCR generation. Maybe it could work somehow, but I don't think it would be efficient.
I am trying to extract the complete text from a document while preserving the basic order and structure of the text as it appears in the input document. Is there a way to achieve this using Donut?
All the examples and documentation that I came across are related to either JSON extraction or DocVQA. Any help is appreciated.