Digital-HTC-Architecture

Streamlit UI

This is a simple program for OCR and Machine Translation of PDF documents.

The backend uses ImageMagick for image processing, Tesseract for OCR, and DeepL for translation. The frontend is powered by Streamlit.
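To give a feel for what the first two backend stages do, here is a minimal sketch of the same tools chained by hand on the command line. It is not the repository's code: the function name `ocr_pdf` is made up for illustration, and it assumes ImageMagick (with Ghostscript, for reading PDFs) and Tesseract are already installed.

```shell
# Sketch only: not the repository's pipeline, just the same tools run by hand.
# Usage: ocr_pdf FILE.pdf [LANG]   (LANG is a Tesseract language code, e.g. eng or fra)
ocr_pdf() {
  local pdf="$1" lang="${2:-eng}" workdir page
  if [ ! -f "$pdf" ]; then
    echo "ocr_pdf: no such file: $pdf" >&2
    return 1
  fi
  workdir="$(mktemp -d)" || return 1
  # Rasterize every page at 300 DPI (ImageMagick 7 also accepts `magick` here).
  if ! convert -density 300 "$pdf" "$workdir/page-%03d.png"; then
    rm -rf "$workdir"
    return 1
  fi
  # Passing `stdout` as the output base makes Tesseract print the recognized text.
  for page in "$workdir"/page-*.png; do
    tesseract "$page" stdout -l "$lang"
  done
  rm -rf "$workdir"
}
```

For example, `ocr_pdf scan.pdf fra > scan.txt` would write the recognized French text to `scan.txt`, which could then be sent for translation.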

Example

Screenshots (not reproduced here): the original document, the OCR text in the original language, and the translated text.

Instructions

DeepL is used for translation, and an authentication key is needed to access their API. You can get a key for free, but registration is required.
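If you want to confirm that a key works before starting the app, one option is to query DeepL's usage endpoint with `curl`. The sketch below is an assumption-laden helper, not part of this repo: the function names and the `DEEPL_AUTH_KEY` variable are hypothetical, and it relies on DeepL's convention that free-plan keys end in `:fx` and are served from `api-free.deepl.com`.

```shell
# Sketch only: these helpers and the DEEPL_AUTH_KEY variable are hypothetical,
# and are not something the app itself reads.

# Pick the API host from the key: free-plan keys end in ":fx".
deepl_base_url() {
  case "$1" in
    *:fx) echo "https://api-free.deepl.com" ;;
    *)    echo "https://api.deepl.com" ;;
  esac
}

# Ask DeepL for the account's usage; fails (non-zero exit) if the key is rejected.
check_deepl_key() {
  local key="$1"
  curl -sS --fail \
    -H "Authorization: DeepL-Auth-Key $key" \
    "$(deepl_base_url "$key")/v2/usage"
}
```

Running `check_deepl_key "$DEEPL_AUTH_KEY"` with a valid key should print a small JSON document with the character count and limit for the account.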

macOS

Strictly speaking, you need macOS Catalina (10.15) or higher with a 64-bit processor. However, I have confirmed that the program also works on macOS Mojave (10.14).

To start, one option is to clone this repo. Open a terminal and run:

git clone https://github.com/christopher-w-murphy/Digital-HTC-Architecture.git

Alternatively, one may download the code by clicking the green Code button and then Download ZIP.

In either case, move into the repo directory and run the installer script:

cd Digital-HTC-Architecture/
bash macos_installer.sh

While still in the Digital-HTC-Architecture directory, start the program by running the following command in a terminal:

bash ocr_and_machine_translation.sh

You can now view the Streamlit app in your browser. To stop the app, press Control+C in the terminal.

Docker

Docker users can pull the image from Docker Hub:

docker pull murphycw/digital-htc-architecture

Run the image as a container to start the program:

docker run -d -p 8501:8501 murphycw/digital-htc-architecture
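Because `-d` runs the container in the background, Control+C will not stop it. With port 8501 published as above, the app should be reachable at http://localhost:8501. The following is a small sketch of start and stop helpers; the container name `htc-ocr` and the function names are made up for illustration.

```shell
# Sketch only: the container name and helper functions are hypothetical.
IMAGE="murphycw/digital-htc-architecture"
CONTAINER_NAME="htc-ocr"

# Start the app in the background, publishing Streamlit's port 8501.
start_app() {
  docker run -d --name "$CONTAINER_NAME" -p 8501:8501 "$IMAGE"
}

# Stop the app and remove the stopped container so the name can be reused.
stop_app() {
  docker stop "$CONTAINER_NAME" && docker rm "$CONTAINER_NAME"
}
```

Call `start_app`, use the app in the browser, then call `stop_app` when finished.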

Note that the Docker image ships Tesseract v4 rather than v5, and can only OCR English and French documents.

Windows

For Windows, Programming Historian has instructions for setting up an OCR and translation program.