Releases · among/fusus

Cornelis provided the seminal ideas, organized the project and procured the funding.
Cornelis and Dirk discussed every problem and issue along the way in Slack.

The main results are (with their location in this repo in brackets):

the fusus code: OCR pipeline and PDF text extraction (fusus)
example data (examples) (attached as example.zip)
output data: Lakhnawi TF, TSV, HTML, PDF; Affifi TF, TSV, HTML, PDF (ur) (attached as Lakhnawi.zip and Affifi.zip)
documentation: Readme, docstrings in the fusus code, extra markdown files (fusus/docs) (the built site is attached to this release as site.zip)
notebooks (notebooks) - view them on nbviewer


24 Dec 11:31

dirkroorda

v0.2

62c9db3

Fusus-Lakhnawi converted

The PDF with the Lakhnawi edition of the Fusus has been converted to plain Unicode.
Only the original text is retained. Footnotes and page numbers have been removed.

From there I made some exports to HTML and PDF, which are attached.
These are for informational purposes only.

Later we plan to produce a Text-Fabric version of this text, which will include the exact positions of all words in the original PDF.
From there you can get a plain text easily.

There might still be some rough edges.


07 Dec 16:19

dirkroorda

v0.1

49df867

Pipeline works

The pipeline from scanned images via cleaning to OCR works.
A few hundred pages have been done.
There is still a lot of tweaking to do.

The OCR results are delivered as tab-separated files, with position and confidence information at the word and character levels.
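
As a rough illustration of how such tab-separated output could be consumed, here is a minimal Python sketch that parses rows and lists the low-confidence words for review. The column names (page, left, top, right, bottom, confidence, text), the sample values, and the threshold are all assumptions made for this example; they are not the actual layout of the fusus output files.

```python
import csv
import io

# Hypothetical sample: the column names and values below are assumptions,
# not the actual layout of the fusus OCR output files.
SAMPLE = (
    "page\tleft\ttop\tright\tbottom\tconfidence\ttext\n"
    "1\t100\t50\t180\t90\t95\tword1\n"
    "1\t190\t50\t260\t90\t42\tword2\n"
    "2\t100\t50\t170\t90\t88\tword3\n"
)

NUMERIC_FIELDS = ("page", "left", "top", "right", "bottom", "confidence")


def read_rows(handle):
    """Parse tab-separated rows into dicts, converting numeric fields to int."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for row in reader:
        for key in NUMERIC_FIELDS:
            row[key] = int(row[key])
        rows.append(row)
    return rows


def low_confidence(rows, threshold):
    """Return the rows whose confidence is below the threshold."""
    return [row for row in rows if row["confidence"] < threshold]


rows = read_rows(io.StringIO(SAMPLE))
suspects = low_confidence(rows, 50)  # the threshold 50 is an arbitrary choice
for row in suspects:
    print(row["page"], row["text"], row["confidence"])
```

To read a real file instead of the inline sample, pass an open file handle (opened with `encoding="utf-8"` and `newline=""`) to `read_rows` and adjust the field names to match the file's header.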


