Convert Document AI Object to Preserve Layout Text? #159

Open
raad-altaie opened this issue Aug 30, 2023 · 6 comments
Labels
type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design.

Comments

@raad-altaie
raad-altaie commented Aug 30, 2023

Is your feature request related to a problem? Please describe.

I've been using Google Document AI for text extraction from scanned documents, and it's been working well in terms of extracting text. However, I'm facing an issue when it comes to preserving the layout of the text.

In AWS Textract, there's a tool called "pretty print" that helps maintain the layout of extracted text. Tesseract, on the other hand, allows for preserving interword spaces using the config='-c preserve_interword_spaces=1' option, which does much the same thing.
I really wish "python-documentai-toolbox" could support such output.

Describe the solution you'd like

documentai object => preserved layout text

Describe alternatives you've considered

Extracting text using the pdftotext library seemed like a viable option, but surprisingly, "python-documentai-toolbox" doesn't offer support for PDF output, which is rather baffling.

@meredithslota meredithslota added the type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. label Sep 7, 2023
@holtskinner
Member

holtskinner commented Sep 8, 2023

Can you provide more information on what you mean by "preserving the layout of the text"?

Do you want all of the text to be printed to the screen or a TXT file in the same general locations as the source document?

An example of an input document and the output text would be useful.

This will likely be difficult to implement since the layout information extracted from Document AI is using Bounding Boxes with X, Y coordinates (which doesn't apply cleanly to TXT files.)

Document AI by design doesn't fill in the Document.text field with extra spaces/tabs to signify where the text sits on the page.

It could be possible to use the Document.Page.Block field to identify blocks of text and place them generally in the same order, but again this isn't going to be very exact since Coordinates don't have a 1-1 relationship in text files.
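To illustrate the approximation described above, here is a minimal sketch, not part of python-documentai-toolbox, that places text onto a fixed character grid from normalized coordinates. The input format (a list of `(text, x, y)` tuples holding the normalized top-left corner of each line's bounding box) and the function name `render_layout` are assumptions made for illustration; in practice the values would have to be read from the bounding polygons of the Document AI response.

```python
from typing import List, Tuple

# Hypothetical input: (text, x, y), where x and y are the normalized (0..1)
# top-left corner of a line's bounding box.
Line = Tuple[str, float, float]


def render_layout(lines: List[Line], cols: int = 80, rows: int = 40) -> str:
    """Place each line of text on a cols x rows character grid."""
    grid = [[" "] * cols for _ in range(rows)]
    # Process top-to-bottom, then left-to-right, so later text overwrites
    # earlier text when two boxes map to the same cells.
    for text, x, y in sorted(lines, key=lambda item: (item[2], item[1])):
        row = min(int(y * rows), rows - 1)
        col = min(int(x * cols), cols - 1)
        for offset, ch in enumerate(text):
            if col + offset >= cols:
                break  # Truncate text that runs past the right edge.
            grid[row][col + offset] = ch
    out = ["".join(r).rstrip() for r in grid]
    # Drop trailing empty rows.
    while out and not out[-1]:
        out.pop()
    return "\n".join(out)


sample = [
    ("Some text here", 0.5, 0.05),
    ("Some to the left", 0.0, 0.15),
    ("left", 0.0, 0.25),
    ("right", 0.5, 0.25),
]
print(render_layout(sample, cols=40, rows=10))
```

The grid size controls the trade-off: a coarse grid makes nearby lines collide, while a fine grid produces many blank rows. This is why the result is only "generally in the same order" rather than an exact reproduction.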

@raad-altaie
Author

@holtskinner thank you for your response!
What I am looking for is something like the example below.

[Image: input]

and the output I am getting is as follows:

Someto the left
Someto the left

Some in the middle
Some in the middle

Some with some tab
Some with some tab

Some with some space between them
Some with some space between them

Sometext here
Sometext here

this much
this much

How do I get an output string with the same structure as in the image?

i.e. as follows:

 										         Some text here
 										         Some text here

Some to the left
Some to the left

 					Some in the middle
 					Some in the middle

 		Some with some tab
 		Some with some tab

Some with some space between them						this much
Some with some space between them						this much
  • Also, do you have an example of how I can use Document.Page.Block to restructure the document? (I'll give it a try.)

@think-diff
we want to do the same thing here!

@ThreeHAN
ThreeHAN commented Dec 5, 2023

At the very least, ensuring there are spaces between words in the text output from Document AI would be of great assistance. Sometimes, when words are in different entities but next to each other, the Document AI text blob shows them as "twowords" as opposed to "two words". Having a helper function ensure the spaces are there would reduce custom post-processing for us.
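A helper of the kind requested here could look roughly like the sketch below. It is not an existing toolbox function: the name `join_segments` and the input format (the full text plus a list of `(start, end)` character offsets, one per word or entity) are assumptions for illustration. It inserts a single space only where two adjacent segments are not already separated by whitespace.

```python
from typing import List, Tuple


def join_segments(text: str, segments: List[Tuple[int, int]]) -> str:
    """Join text segments, adding a space between segments that touch."""
    parts: List[str] = []
    for start, end in sorted(segments):
        piece = text[start:end]
        if not piece:
            continue  # Skip empty segments.
        # Add a space only if neither side already supplies whitespace.
        if parts and not parts[-1][-1].isspace() and not piece[0].isspace():
            parts.append(" ")
        parts.append(piece)
    return "".join(parts)


print(join_segments("twowords here", [(0, 3), (3, 8), (8, 13)]))
```

One limitation to keep in mind: if punctuation arrives as its own segment, this sketch would also put a space in front of it, so real use would likely need an exception for punctuation.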

@nonlocalStream
+1, I want the same thing. Currently I'm using the PyMuPDF CLI to achieve this: python -m fitz gettext (see https://pymupdf.readthedocs.io/en/latest/recipes-text.html#how-to-extract-text-in-natural-reading-order).

I wish the same were available for the Document AI generic OCR output. I think the underlying mechanism should be similar: basically reconstructing the layout from the bounding box information (https://github.com/pymupdf/PyMuPDF/blob/c0ae13746155e9bb5c11ab7e9a42c2e73758422e/src/__main__.py#L802).

@zkalson
zkalson commented Apr 18, 2024

Hey all, I was able to get this mostly working! Here's a rough overview of the process for Python:
- For each page in a document, create a reportlab Canvas object.
- Create a text layer on the Canvas object and write the text onto it, using the bounding box data.
- Save the PDF and use poppler or pypdf to extract the text layer into a layout-preserved .txt file.
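The second step above depends on converting coordinates between two systems. The sketch below is an illustration, not the code used in this comment: it assumes the bounding box coordinates are normalized (0..1) with the origin at the top-left of the page, and that the target is the default PDF coordinate system used by a reportlab Canvas (points, origin at the bottom-left). The function name `to_pdf_point` is hypothetical.

```python
from typing import Tuple


def to_pdf_point(
    x_norm: float,
    y_norm: float,
    page_width_pt: float,
    page_height_pt: float,
) -> Tuple[float, float]:
    """Map a normalized top-left-origin point to PDF points (bottom-left origin)."""
    x_pt = x_norm * page_width_pt
    # Flip the vertical axis: y grows downward in the normalized system
    # and upward in PDF space.
    y_pt = (1.0 - y_norm) * page_height_pt
    return (x_pt, y_pt)


# US Letter is 612 x 792 points.
print(to_pdf_point(0.5, 0.25, 612, 792))
```

When drawing text, the y value passed to the canvas is the text baseline, so using the bottom edge of a word's bounding box is a reasonable approximation.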

The one issue I'm still stuck on is handling documents that GCP has preprocessed; see my issue here.

If someone is able to help me use the transforms field, I'm happy to invest some time tidying up my code and making a PR with the feature!

Attached is an example input and output.
Input-SampleDocumentAITextLayout.pdf
Output-SampleDocumentAITextLayout.txt

7 participants