Releases: among/fusus

Moved the main tf files

11 Apr 17:47
v0.8

Moved the TF files.

Aligned Lakhnawi and Afifi

16 Nov 13:17

Data versions for fususl and fususa have been bumped to 0.7.
There is now also fusus data (i.e. aligned and merged fususa and fususl) in version 0.7.

New Lakhnawi version 0.6

02 Nov 14:12

Lakhnawi TF generation:
The numbers of the proper bezels were not correct.
This has been fixed, and a new data version has been created.

Delivered

15 Feb 09:55

This release marks the handing over of this repository from Dirk to Cornelis as main contributor.

So far, Dirk has written most of the code, although all of the work is the result of a close cooperation between Cornelis and Dirk.

Cornelis provided the seminal ideas, organized the project and procured the funding.
Cornelis and Dirk discussed every problem and issue along the way in Slack.

The main results are listed below, with their location in this repo in parentheses:

  • the fusus code: OCR pipeline and PDF text extraction (fusus)
  • example data (examples) (attached as example.zip)
  • output data: Lakhnawi TF, TSV, HTML, PDF; Afifi TF, TSV, HTML, PDF (ur) (attached as Lakhnawi.zip and Affifi.zip)
  • documentation: Readme, doc-strings in the fusus code, extra markdown files (fusus/docs) (the built site is attached to this release as site.zip)
  • notebooks (notebooks) - view them on nb-viewer

Fusus-Lakhnawi converted

24 Dec 11:31

The PDF with the Lakhnawi edition of the Fusus has been converted to plain Unicode.
Only the original text is retained. Footnotes and page numbers have been removed.

From there I made some exports to HTML and PDF, which are attached.
These are for informational purposes only.

Later we plan to produce a Text-Fabric version of this text, which will include the exact positions of all words in the original PDF.
From that version, a plain text can be derived easily.

There might still be some rough edges.

Pipeline works

07 Dec 16:19

The pipeline from scanned images via cleaning to OCR works.
A few hundred pages have been done.
There is still a lot of tweaking to do.

The OCR results are delivered as tab-separated files, with position and confidence information, at
both word and character level.
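
As a rough illustration of how such a tab-separated file could be inspected, here is a minimal Python sketch that lists the words whose OCR confidence falls below a threshold. The column names (page, line, left, top, right, bottom, confidence, text), the sample rows and the helper low_confidence_rows are all assumptions made for this example; the actual files delivered by the pipeline may be laid out differently.

```python
import csv
import io

# Hypothetical sample data: the column names and values are assumptions
# for illustration, not the actual layout of the fusus OCR output.
SAMPLE_TSV = (
    "page\tline\tleft\ttop\tright\tbottom\tconfidence\ttext\n"
    "1\t1\t10\t20\t50\t40\t95\tfoo\n"
    "1\t1\t55\t20\t90\t40\t42\tbar\n"
    "1\t2\t10\t45\t60\t65\t88\tbaz\n"
)


def low_confidence_rows(handle, threshold):
    """Return the rows whose confidence is strictly below the threshold."""
    # QUOTE_NONE treats quote characters in the OCR text as ordinary characters.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [row for row in reader if float(row["confidence"]) < threshold]


rows = low_confidence_rows(io.StringIO(SAMPLE_TSV), 80)
for row in rows:
    print(row["page"], row["line"], row["text"], row["confidence"])
```

To apply the same filter to a real file, open it with encoding="utf-8" and newline="" and pass the file handle in place of the io.StringIO object, after adjusting the column names to match the file's header.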