Refactor: Explore replacing PikePDF with PyMuPDF for efficiency #252

holtskinner · 2024-02-05T20:10:55Z

Explore switching the PDF splitter from PikePDF to PyMuPDF.

See whether efficiency or code readability improves.

https://pymupdf.readthedocs.io/en/latest/about.html

holtskinner · 2024-02-06T17:06:55Z

Running this code for testing:

import timeit

from google.cloud.documentai_toolbox import document

document_json_path = "documentai_SampleDocuments_PROCUREMENT_DOCUMENT_SPLIT_PROCESSOR_pretrained-procurement-splitter-v1.2-2022-08-19_output.json"
document_pdf_path = "documentai_SampleDocuments_PROCUREMENT_DOCUMENT_SPLIT_PROCESSOR_procurement_multi_document.pdf"

doc = document.Document.from_document_path(document_json_path)

output_path = "/"

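# Time 10 calls of the existing PikePDF-based splitter.
# Note: timeit returns the total for all 10 calls, not a per-call average.
pikepdf_time = timeit.timeit(
    lambda: doc.split_pdf(document_pdf_path, output_path), number=10
)

# Time 10 calls of the PyMuPDF-based variant under test.
mupdf_time = timeit.timeit(
    lambda: doc.split_pdf_mupdf(document_pdf_path, output_path), number=10
)

print(f"PikePDF Time: {pikepdf_time} seconds")
print(f"PyMuPDF Time: {mupdf_time} seconds")

# A positive difference means the PyMuPDF variant was faster.
print(f"difference is {pikepdf_time - mupdf_time} seconds")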

Got this result:

python pymupdf_test.py
PikePDF Time: 0.0616944160001367 seconds
PyMuPDF Time: 0.06151633399986167 seconds
difference is 0.00017808200027502608 seconds
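
To put these totals in perspective, here is a minimal sketch that converts them into per-call times and a relative difference. It assumes each total covers 10 calls, as set by `number=10` in the script above. The helper name `summarize` is hypothetical and not part of the toolbox.

```python
def summarize(total_a: float, total_b: float, runs: int) -> dict:
    """Convert two timeit totals into per-call times and a relative gap.

    Hypothetical helper for illustration only.
    """
    return {
        "per_call_a": total_a / runs,
        "per_call_b": total_b / runs,
        # Gap expressed as a fraction of the first total.
        "relative": (total_a - total_b) / total_a,
    }


# Totals reported above, each covering 10 calls.
pikepdf_total = 0.0616944160001367
mupdf_total = 0.06151633399986167

stats = summarize(pikepdf_total, mupdf_total, runs=10)

print(f"PikePDF per call: {stats['per_call_a'] * 1000:.3f} ms")
print(f"PyMuPDF per call: {stats['per_call_b'] * 1000:.3f} ms")
print(f"Relative difference: {stats['relative']:.2%}")
```

Both libraries come out at roughly 6 ms per call, with a gap of well under 1%. With only 10 calls measured once, a gap that small may be within run-to-run noise, so repeating the measurement (for example with `timeit.repeat`) would be needed before drawing a conclusion.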
