Compatibility with tesseract 4 #273

Closed
deepio opened this issue Apr 6, 2019 · 23 comments

Comments

@deepio

deepio commented Apr 6, 2019

  • Audiveris: 5.1.0:6780b1f91
  • OCR Engine: Tesseract OCR, version 3.04.01

When will we see support for Tesseract 4.0?

@maximumspatium
Contributor

maximumspatium commented Apr 7, 2019

When will we see support for Tesseract 4.0?

I'm currently working on supporting Tesseract 4.0. Unfortunately, an upgrade attempt has revealed unforeseen problems:

  1. an old "won't fix" bug has slipped into the 4.x branch, causing the Java Virtual Machine to crash when accessing libtesseract. The problem is that the Tesseract maintainers stubbornly refuse to fix this issue, repeatedly saying "that is not our bug", and thus break 3rd-party software. Fortunately, Samuel Audet from the Javacpp project has recently fixed it, see tesseract 4.0.0-1.4.4 crashes on Mac OS bytedeco/javacpp-presets#694

  2. Tesseract public API seems to have undergone some changes so the way Audiveris communicates with the engine doesn't work anymore. This needs to be troubleshot and worked around.

Moreover, Tesseract's full page mode has proven to perform rather poorly on text recognition in the presence of musical symbols. Lyrics and chords in particular are often affected because their layout and grammar are uncommon. This is something we cannot work around easily. For the time being, Audiveris lets Tesseract perform one-shot text detection and recognition, relying on algorithms we have no control over. This needs to be reworked to allow multistage recognition/rejection using different parameters, see #44.

maximumspatium changed the title from "Compatibility with tesseract" to "Compatibility with tesseract 4" on Apr 22, 2019

@maximumspatium
Contributor

maximumspatium commented May 11, 2019

Update:

  1. (JVM crash)

fixed

  2. Tesseract public API seems to have undergone some changes so the way Audiveris communicates with the engine doesn't work anymore.

Audiveris relies on the information returned by LTRResultIterator::WordFontAttributes, which includes font properties (bold, italic, etc.) as well as font size. All of this is required by the Audiveris UI for displaying recognized text over the original picture.

Tesseract 4 has been redesigned in such a way that the font information except character size isn't available anymore, see tesseract-ocr/tesseract#1074

Support for font attributes is feasible but isn't available yet. According to the principal Tesseract developer, Ray Smith, this is one more reason for delaying deprecation of the v3 engine.

Many people recommend sticking with the old engine instead of switching to the recent one. The reality is a bit different:

  • most package managers have already switched to Tesseract 4; installing the older version is somewhat difficult and requires compiling from source
  • Tesseract 4 shows improved accuracy and faster recognition for musical scores than its predecessor

I'm currently redesigning Tesseract-related classes to support the new engine. Results will be reported shortly...

@maximumspatium
Contributor

maximumspatium commented May 25, 2019

After spending several days analyzing Tesseract 4's output via TessAPI, I found several serious problems preventing further adoption of the LSTM engine for our OMR task.
The biggest roadblock is that Tesseract 4 sometimes reports random characters with the bounding box set to the whole page. This is a known issue reported by several people and still unfixed, see tesseract-ocr/tesseract#1192

I therefore decided to wait for the Tesseract team to fix all bounding box related issues first. Audiveris will stick to Tesseract 3.x for now.
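
Until that is fixed upstream, a caller could at least screen out such results before using them. The following is only a minimal sketch, assuming the word boxes are available as plain pixel rectangles; the class PageBoxFilter, its Box type and the 0.9 area threshold are hypothetical and not part of Audiveris or Tesseract.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: drops OCR boxes that span (almost) the whole page. */
final class PageBoxFilter {

    /** Plain pixel rectangle, standing in for whatever the OCR layer returns. */
    static final class Box {
        final int x, y, width, height;

        Box(int x, int y, int width, int height) {
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }
    }

    /** Assumed threshold: a box covering at least 90% of the page area is treated as bogus. */
    static final double MAX_AREA_RATIO = 0.9;

    static boolean coversWholePage(Box box, int pageWidth, int pageHeight) {
        double boxArea = (double) box.width * box.height;
        double pageArea = (double) pageWidth * pageHeight;
        return pageArea > 0 && boxArea / pageArea >= MAX_AREA_RATIO;
    }

    /** Returns only the boxes that do not look like whole-page artifacts. */
    static List<Box> plausibleBoxes(List<Box> boxes, int pageWidth, int pageHeight) {
        List<Box> kept = new ArrayList<>();
        for (Box box : boxes) {
            if (!coversWholePage(box, pageWidth, pageHeight)) {
                kept.add(box);
            }
        }
        return kept;
    }
}
```

This only hides the symptom: the characters attached to a rejected box would still have to be re-recognized or discarded.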

@maximumspatium
Contributor

I just tested Tesseract 4 in the legacy engine mode (OEM_TESSERACT_ONLY). It seems to work as expected. The updated code was pushed to the tess4 feature branch.

Please test it and give me feedback.
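
For cross-checking outside Audiveris, the same legacy engine can be selected on the tesseract command line with --oem 0, the CLI counterpart of OEM_TESSERACT_ONLY. The sketch below only assembles such a command. The class name and file names are hypothetical, and it assumes Tesseract 4.1 or later (for the alto output config) with traineddata that still contains the legacy model.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: builds a tesseract command line for the legacy engine. */
final class LegacyTesseractCommand {

    /**
     * @param imagePath  input image (hypothetical file name in the example below)
     * @param outputBase output base name; the alto config writes outputBase.xml
     * @param language   traineddata language code, e.g. "eng"
     * @param psm        page segmentation mode, e.g. 3 (default) or 11 (sparse text)
     */
    static List<String> build(String imagePath, String outputBase, String language, int psm) {
        List<String> cmd = new ArrayList<>();
        cmd.add("tesseract");
        cmd.add(imagePath);
        cmd.add(outputBase);
        cmd.add("-l");
        cmd.add(language);
        cmd.add("--oem");
        cmd.add("0"); // 0 = legacy engine only
        cmd.add("--psm");
        cmd.add(Integer.toString(psm));
        cmd.add("alto"); // ALTO XML output with word positions
        return cmd;
    }

    public static void main(String[] args) {
        List<String> cmd = build("score.png", "score", "eng", 3);
        System.out.println(String.join(" ", cmd));
        // To actually run it: new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}
```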

@deepio
Author

deepio commented Jul 3, 2019

I can confirm that it does not crash and it produces musicxml files, but the musicxml files are almost completely empty for the handful of tiff files I tried. This is the full musicxml file output.

<?xml version="1.0" ?>
<sheet last-persistent-id="0" number="1">
  <glyph-index></glyph-index>
</sheet>

@deepio
Author

deepio commented Apr 18, 2020

Will close this issue and create a new one for the new issue.

@stweil
Contributor

stweil commented Jan 9, 2022

I just tested Tesseract 4 in the legacy engine mode (OEM_TESSERACT_ONLY). It seems to work as expected.

That's what I have expected, too. Tesseract 4 and even the latest Tesseract 5.0.1 are still compatible with Tesseract 3 in legacy mode. Why was the update abandoned? I noticed that there exist pre-built jar files which can be used for 4.0.0, but I could not find jar files for newer releases.

@stweil
Contributor

stweil commented Jan 9, 2022

Now I could at least build with Tesseract 4.1.1 (based on your tess4 branch). See https://github.com/stweil/audiveris/tree/tess4.

@hbitteur
Contributor

hbitteur commented Jan 9, 2022

@stweil
While investigating a new classifier in mid 2020 (see head-classifier branch), I used Tesseract 4.1.0 to be able to build the whole software set.

But I did not really use OCR at that time; I was focusing on a new attempt at head recognition via a patch classifier. This work is on pause right now. It should get resurrected some day, but that's another story.
The purpose of my remark is to call your attention to the fact that the software can be built, but that says nothing about the quality of OCR recognition (vs the 3.x Tesseract engine) when applied to sparse textual elements as found on a music score.

If you could spend some time to evaluate the actual OCR results (of 4.x, and perhaps 5.x as you mentioned), we would all benefit from such experience.

@stweil
Contributor

stweil commented Jan 9, 2022

In theory, Tesseract 4 and 5 in legacy mode should produce the same results as Tesseract 3 because all of them use the same OCR engine (and the same kind of models), so the quality would be identical. Tesseract 5 would still be faster, include a lot of bug fixes and support more platforms (ARM, Apple M1, ...).

I have much experience with Tesseract, so I can help on that side. And I have no experience with Audiveris.

@maximumspatium
Contributor

maximumspatium commented Jan 9, 2022

@hbitteur Stefan asked why the tess4 branch has not been merged into master/development.
As of now, Audiveris still uses the ancient Tesseract 3.04, see

ext.tessVersion = '3.04.01'

To my understanding, nothing prevents us from switching to the newer Tesseract 4.1 or even 5.x as long as they run in the legacy engine mode. This will require changes available in the tess4 branch because the underlying API for accessing the new OCR engine was updated several years ago.

@maximumspatium
Contributor

@stweil

I noticed that there exist pre-built jar files which can be used for 4.0.0, but I could not find jar files for newer releases.

Audiveris doesn't use pre-built binaries. It uses the javacpp-presets wrapper for accessing Tesseract. The recent javacpp-presets release supports Tesseract 5.0 by default.

maximumspatium reopened this on Jan 9, 2022

@hbitteur
Contributor

hbitteur commented Jan 9, 2022

@maximumspatium
Yes, let's try to move to 5.x in legacy mode

@maximumspatium
Contributor

I'll go ahead and switch to javacpp-presets 1.5.6 then.

@stweil
Contributor

stweil commented Jan 9, 2022

I just tried Audiveris with 4.1.1, and that seems to work fine. The modifications from your tess4 branch were sufficient (I only rebased those changes to the latest code in https://github.com/stweil/audiveris/tree/tess4).

@maximumspatium
Contributor

@stweil I finally switched Audiveris to Tesseract 4.1.1, see ce97610

I also tried Audiveris with Tesseract 5.0.1. Unfortunately, libtesseract crashes the JVM on my macOS 10.13, probably because the binaries were compiled for macOS 10.15. I need to rebuild Javacpp-presets for my system to be able to test the recent OCR engine.

@maximumspatium
Contributor

I have much experience with Tesseract, so I can help on that side

@stweil We're experiencing issues with Tesseract sometimes reporting unreliable symbol positions when running in the full page mode.
Original image:
[image: Original text]

Recognition result:
[image: Recognized Text with wrong letter positions]

Selecting the area and letting Tesseract recognize it again usually produces better results:
[image: Fixed Text]

It looks like a bug in the Tesseract API I never managed to catch.
Any idea how to fix that?
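
One way such cases might be detected automatically, so that the second pass does not depend on a manual selection: check whether every symbol box lies inside the box of its word, and schedule the word's area for re-recognition when it does not. This is only a sketch of that heuristic. It assumes the wrong positions show up as symbols falling outside their word box, which is not confirmed here, and the class, its Box type and the tolerance parameter are hypothetical.

```java
import java.util.List;

/** Hypothetical heuristic: flags words whose symbol boxes do not fit the word box. */
final class SymbolPositionCheck {

    /** Plain pixel rectangle for a word or a symbol. */
    static final class Box {
        final int x, y, width, height;

        Box(int x, int y, int width, int height) {
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }

        /** True if other lies inside this box, allowing a margin of tolerance pixels. */
        boolean contains(Box other, int tolerance) {
            return other.x >= x - tolerance
                    && other.y >= y - tolerance
                    && other.x + other.width <= x + width + tolerance
                    && other.y + other.height <= y + height + tolerance;
        }
    }

    /** Returns true if at least one symbol falls outside the word box. */
    static boolean needsSecondPass(Box word, List<Box> symbols, int tolerance) {
        for (Box symbol : symbols) {
            if (!word.contains(symbol, tolerance)) {
                return true;
            }
        }
        return false;
    }
}
```

A word flagged this way would then be cropped and passed to Tesseract again as its own region, as done by hand above.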

@stweil
Contributor

stweil commented Apr 22, 2022

Do you get those wrong positions also when the same page is processed by the tesseract executable?

@maximumspatium
Contributor

maximumspatium commented Apr 23, 2022

Do you get those wrong positions also when the same page is processed by the tesseract executable?

The tesseract executable reports correct symbol positions in the XML output.

I tried two different page segmentation modes and got similar results:

PSM=3, Tesseract's default, also used by Audiveris:

<TextLine ID="line_3" HPOS="139" VPOS="392" WIDTH="536" HEIGHT="39">
   <String ID="string_10" HPOS="139" VPOS="392" WIDTH="226" HEIGHT="39" WC="0.84" CONTENT="Arrangement"/><SP WIDTH="14" VPOS="392" HPOS="365"/>
   <String ID="string_11" HPOS="379" VPOS="402" WIDTH="4" HEIGHT="21" WC="0.89" CONTENT=":"/><SP WIDTH="16" VPOS="402" HPOS="383"/>
   <String ID="string_12" HPOS="399" VPOS="392" WIDTH="94" HEIGHT="31" WC="0.83" CONTENT="Alain"/><SP WIDTH="12" VPOS="392" HPOS="493"/>
   <String ID="string_13" HPOS="505" VPOS="393" WIDTH="170" HEIGHT="30" WC="0.89" CONTENT="BRUNET"/>
</TextLine>

PSM=11, i.e. "find as much text as possible":

<TextLine ID="line_2" HPOS="139" VPOS="392" WIDTH="536" HEIGHT="39">
   <String ID="string_8" HPOS="139" VPOS="392" WIDTH="226" HEIGHT="39" WC="0.82" CONTENT="Arrangement"/><SP WIDTH="14" VPOS="392" HPOS="365"/>
   <String ID="string_9" HPOS="379" VPOS="402" WIDTH="4" HEIGHT="21" WC="0.89" CONTENT=":"/><SP WIDTH="16" VPOS="402" HPOS="383"/>
   <String ID="string_10" HPOS="399" VPOS="392" WIDTH="94" HEIGHT="31" WC="0.83" CONTENT="Alain"/><SP WIDTH="12" VPOS="392" HPOS="493"/>
   <String ID="string_11" HPOS="505" VPOS="393" WIDTH="170" HEIGHT="30" WC="0.89" CONTENT="BRUNET"/>
</TextLine>

I assume a bug somewhere in the public API.
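
To compare these positions against what the API reports, the word boxes can be pulled out of the XML programmatically. Below is a minimal sketch using the JDK's DOM parser. It assumes the String elements carry integer HPOS/VPOS/WIDTH/HEIGHT attributes as in the excerpts above, and the class AltoWordBoxes is hypothetical, not part of Audiveris.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical helper: extracts word boxes from ALTO-style XML. */
final class AltoWordBoxes {

    /** One recognized word with its pixel position. */
    static final class Word {
        final String content;
        final int x, y, width, height;

        Word(String content, int x, int y, int width, int height) {
            this.content = content;
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }

        @Override
        public String toString() {
            return content + " @ (" + x + "," + y + ") " + width + "x" + height;
        }
    }

    /** Collects every String element, in document order. */
    static List<Word> parse(String altoXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(altoXml.getBytes(StandardCharsets.UTF_8)));
            NodeList strings = doc.getElementsByTagName("String");
            List<Word> words = new ArrayList<>();
            for (int i = 0; i < strings.getLength(); i++) {
                Element e = (Element) strings.item(i);
                words.add(new Word(
                        e.getAttribute("CONTENT"),
                        Integer.parseInt(e.getAttribute("HPOS")),
                        Integer.parseInt(e.getAttribute("VPOS")),
                        Integer.parseInt(e.getAttribute("WIDTH")),
                        Integer.parseInt(e.getAttribute("HEIGHT"))));
            }
            return words;
        } catch (Exception ex) {
            throw new IllegalArgumentException("Cannot parse ALTO XML", ex);
        }
    }
}
```

Each Word can then be printed or compared side by side with the box the API returned for the same text.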

[image: tessimg]

@stweil
Contributor

stweil commented Apr 24, 2022

I finally switched Audiveris to Tesseract 4.1.1, see ce97610

I just wanted to try the new code, but it looks like the Javacpp-presets are unavailable for M1 MacOS.

@maximumspatium
Contributor

maximumspatium commented Apr 24, 2022

I just wanted to try the new code, but it looks like the Javacpp-presets are unavailable for M1 MacOS.

@stweil That's true. Apparently, it's very easy to adapt Javacpp-presets to an unsupported architecture. Each preset includes a build script that compiles both the native library and its JNI bridge. If you can compile Tesseract on your M1, you will be able to compile its Java bindings. Unfortunately, I can't do it because I don't own an M1 Mac :)
Anyway, I have compiled Javacpp-presets from source several times in the past. It was pretty easy.

FYI: bytedeco/javacpp-presets#1069

@maximumspatium
Contributor

Let's move our discussion regarding OCR issues to #575.

@saudet

saudet commented May 9, 2022

FYI, macosx-arm64 builds are now available, see issue bytedeco/javacpp-presets#814.
Please give it a try with the snapshots: http://bytedeco.org/builds/
