partition_pdf() correctly detects the table, but the element text does not include the numbers in the table. These numbers are detected by the process_file_with_pdfminer() function.
This information disappears after the clean_pdfminer_inner_elements() and clean_pdfminer_duplicate_image_elements() functions run.
Element text:
'Affectations to houses and public buildings Quantity Houses affected Houses flooded Houses destroyed Other buildings affected Collapsed walls'
To Reproduce
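from unstructured.partition.pdf import partition_pdf

pdf_path = "18466.pdf"  # assumed: the PDF attached to this issue, saved locally

elements = partition_pdf(
    filename=pdf_path,
    include_page_breaks=True,
    strategy="hi_res",
    extract_image_block_types=["Image", "Table"],
    model_name="yolox",
    infer_table_structure=False,
    extract_image_block_to_payload=True,
)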
Expected behavior
Output the table numbers in the unstructured element text.
This PR splits `pdfminer` elements (groups of text chunks) into smaller
bounding boxes (text lines). Splitting prevents loss of information from
the object detection model and makes removal of duplicated `pdfminer`
text more effective. This PR also addresses #3430.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
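To illustrate why line-level bounding boxes can help, here is a minimal, self-contained sketch. It is not the code from the PR or from `unstructured`: the `Box` class, the helper names, the 0.9 threshold, and all coordinates and numbers are hypothetical and chosen only for the example. It shows a grouped text box that straddles a table region, so the group as a whole fails a containment check, while its individual lines can each be matched to the table.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box with optional text (hypothetical type)."""
    x1: float
    y1: float
    x2: float
    y2: float
    text: str = ""

    @property
    def area(self) -> float:
        return max(0.0, self.x2 - self.x1) * max(0.0, self.y2 - self.y1)


def inside_ratio(inner: Box, outer: Box) -> float:
    """Fraction of `inner`'s area that lies within `outer`."""
    w = min(inner.x2, outer.x2) - max(inner.x1, outer.x1)
    h = min(inner.y2, outer.y2) - max(inner.y1, outer.y1)
    if w <= 0 or h <= 0 or inner.area == 0:
        return 0.0
    return (w * h) / inner.area


def union(boxes: List[Box]) -> Box:
    """Smallest box covering all `boxes`, with their texts joined."""
    return Box(
        min(b.x1 for b in boxes),
        min(b.y1 for b in boxes),
        max(b.x2 for b in boxes),
        max(b.y2 for b in boxes),
        " ".join(b.text for b in boxes),
    )


def lines_inside(lines: List[Box], region: Box, threshold: float = 0.9) -> List[str]:
    """Texts of the lines that lie (almost) entirely within `region`."""
    return [ln.text for ln in lines if inside_ratio(ln, region) >= threshold]


# Made-up layout: a detected table region and three extracted text lines.
table = Box(0, 0, 100, 50)
lines = [
    Box(10, 10, 90, 20, "Houses affected 120"),   # inside the table
    Box(10, 25, 90, 35, "Houses flooded 45"),     # inside the table
    Box(10, 60, 90, 70, "Source: field survey"),  # below the table
]

# Treated as one group, the text only partly overlaps the table ...
group = union(lines)
group_ratio = inside_ratio(group, table)

# ... but line by line, the rows with numbers are clearly inside it.
table_lines = lines_inside(lines, table)
```

With coarse groups, a single containment decision applies to all of the group's text at once. With lines, each row can be kept or treated as a duplicate on its own.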
18466.pdf