Question: Splitting runons #17

pirolen · 2023-02-08T13:18:22Z

Hi, I wonder if there is a way to have analiticcl generate variants that involve a whitespace: i.e. in case of runon errors, suggesting the split form.

Suppose that 'holygrail' is actually a runon error after OCR, then I would like to be able to return a suggestion of 'holy grail'.

Is there a way to do it?

The other way round it works, i.e. for erroneous splits the concatenated forms are retrieved, e.g.

bitter sweet bittersweet 0.7604166666666666 bittersweets 0.6666666666666667

proycon · 2023-02-08T21:25:21Z

Yes, it should be possible to let analiticcl generate variants involving whitespace. It simply entails having such bigrams *explicitly* in your input lexicon (the lexicon need not be constrained to single words). There's also a possibility if you use search mode, where you can load a language model, though I'm not entirely sure how that would play out in such cases. It might still need an expanded lexicon. There may be room for improvement in this area.
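
As a rough illustration of the "bigrams in the lexicon" suggestion, the sketch below counts adjacent token pairs in a tokenized text and formats them as lexicon lines. It assumes a tab-separated lexicon with the entry text in the first column and a frequency in the second; the function name `bigram_lexicon_lines` and the `min_count` threshold are hypothetical, not part of analiticcl.

```python
from collections import Counter


def bigram_lexicon_lines(tokens, min_count=1):
    """Return 'first second<TAB>count' lines for adjacent token pairs.

    Assumption: the lexicon is TSV with text in column 1 and frequency in column 2.
    """
    counts = Counter(zip(tokens, tokens[1:]))
    lines = []
    for (first, second), count in sorted(counts.items()):
        if count >= min_count:
            lines.append(f"{first} {second}\t{count}")
    return lines


# Toy corpus; in practice the tokens would come from a clean reference text.
tokens = "the holy grail and the holy grail".split()
lines = bigram_lexicon_lines(tokens, min_count=2)
print("\n".join(lines))
```

The resulting lines could be appended to the single-word lexicon, so that a run-on such as 'holygrail' has a two-word entry to match against. A frequency threshold keeps the lexicon from growing with every accidental word pair.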

pirolen · 2023-02-09T08:53:02Z

Thanks!
I now simply used
analiticcl search --alphabet simple.alphabet.tsv --lexicon eng.aspell.lexicon --lm-order 3

(accepting standard input; enter text to search for variants, output may be delayed until end of input, enter an empty line to force output earlier)
holygrail

holygrail       0:9

Please don't hesitate to suggest meaningful parameter usage for my case.

My primary problem is that analiticcl does not generate anagrams from the alphabet file and lexicon I sent you (historical Slavonic) so I cannot seem to use it in any mode :-(

pirolen · 2023-02-20T17:29:55Z

Short update: I also tried to run analiticcl in a Colab notebook.
model.build() prints no output there at all to stdout, unlike in the tutorial.
Calling it from the command line on Ubuntu prints

Computing anagram values for all items in the lexicon...
 - Found 99999 instances
Adding all instances to the index...
 - Found 1 anagrams
Creating sorted secondary index...
Sorting secondary index...
 - Found 1 anagrams of length 1
Constructing Language Model...
 - No language model provided

And no matter what word I query with find_variants, the result is always the same 2 lines, returning the first two items in the lexicon file :-o

{'text': 'и', 'score': 1.0, 'dist_score': 1.0, 'freq_score': 1.0, 'lexicons': ['/home/pirol/quanti/devel/analiti/01_jan_car_lexfile_plain.tsv']}
{'text': 'не', 'score': 1.0, 'dist_score': 1.0, 'freq_score': 1.0, 'lexicons': ['/home/pirol/quanti/devel/analiti/01_jan_car_lexfile_plain.tsv']}

The files are UTF-8.
I am going to see if I can convert them to Unicode Normal Form C and whether that makes a difference.
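
A minimal sketch of the NFC conversion mentioned above, using Python's standard `unicodedata` module. The helper names `to_nfc` and `normalize_file` are hypothetical, and whether normalization affects the result in this particular case is not established in the thread.

```python
import unicodedata


def to_nfc(text):
    """Return text in Unicode Normalization Form C (canonical composition)."""
    return unicodedata.normalize("NFC", text)


def normalize_file(src_path, dst_path):
    """Rewrite a UTF-8 file in NFC, line by line (hypothetical helper)."""
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(to_nfc(line))


# A base letter followed by a combining mark becomes one code point in NFC.
decomposed = "\u0438\u0306"  # CYRILLIC SMALL LETTER I + COMBINING BREVE
composed = to_nfc(decomposed)
print(len(decomposed), len(composed))
```

Applying the same normalization to the alphabet file, the lexicon, and the input text keeps visually identical strings identical at the code point level.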

I think by now I tested installations by both cargo and pip.

proycon · 2023-02-21T09:46:41Z

Can you share the notebook? (along with all input files). Then I can check if I can see what's happening.

pirolen · 2023-02-21T10:21:38Z

Thanks very much! I have sent an invitation to your email address.

proycon · 2023-02-21T13:04:39Z

Got it, something's going wrong with the anagram computation based on the alphabet file. I'm investigating...

proycon · 2023-02-21T16:21:23Z

There was a serious bug in the multibyte handling that came to light thanks to your example. I'm doing a new analiticcl release tonight (v0.4.5) that will fix this.
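
The thread does not describe the bug itself, so the following is only a general illustration of why multibyte text needs care: in UTF-8, the number of bytes and the number of characters differ for Cyrillic, so any logic that mixes up the two can mis-handle such input.

```python
word = "жизни"

# Number of Unicode code points (what a reader would call "characters").
char_count = len(word)

# Number of bytes once encoded as UTF-8; each of these Cyrillic letters takes two.
byte_count = len(word.encode("utf-8"))

print(char_count, byte_count)
```

For plain ASCII the two counts coincide, which is why this class of problem often goes unnoticed until non-Latin data is used.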

proycon · 2023-02-21T16:38:46Z

Released now! (both on crates.io and pypi)

proycon · 2023-02-21T16:45:59Z

Example output from your test in the new situation:

$ analiticcl query --alphabet alphabet.true.tsv --lexicon 01_jan_car_lexfile_plain.lexicon.tsv      
...
Querying the model...
(accepting standard input; enter input to match, one per line, output may be delayed until end of input due to parallellisation)
жизни
жизни   жизни   1               жиѕни   1               жиꙁни   1               жизнї   1               ѡжизни  0.775           жизнїю  0.775           жиꙁни∙  0.775           жизныи  0.75            изни    0.725

proycon · 2023-02-21T16:51:22Z

I also added a testinput mode which you can use to check if a particular input is covered by your alphabet:

$ analiticcl testinput --alphabet alphabet.true.tsv --lexicon 01_jan_car_lexfile_plain.lexicon.tsv
ꙗванi
OK: ꙗванi       9710701 [4, 23, 5, 28, 3]
blah
UNKNOWN: blah   50308609        [37, 37, 5, 37]

(the highest number in the array (37) corresponds to an unknown character, all the non-Cyrillic ones in this case).
This may help in improving coverage of your alphabet.
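
To make the idea behind the index array concrete, here is a small sketch of an alphabet coverage check. It is not analiticcl's implementation: it assumes each alphabet line lists tab-separated single characters that are treated as equivalent, and that unknown characters get the index one past the last alphabet entry. The names `parse_alphabet`, `char_indices`, and `is_covered` are hypothetical.

```python
def parse_alphabet(lines):
    """Each non-empty line is one alphabet entry: tab-separated equivalent characters."""
    return [set(line.rstrip("\n").split("\t")) for line in lines if line.rstrip("\n")]


def char_indices(text, alphabet):
    """Map each character to the index of its alphabet entry.

    Characters not found in any entry get len(alphabet), the 'unknown' index.
    """
    lookup = {ch: i for i, group in enumerate(alphabet) for ch in group}
    unknown = len(alphabet)
    return [lookup.get(ch, unknown) for ch in text]


def is_covered(text, alphabet):
    """True if every character of text belongs to some alphabet entry."""
    return len(alphabet) not in char_indices(text, alphabet)


# Toy three-entry alphabet; the first entry groups lower and upper case.
alphabet = parse_alphabet(["а\tА\n", "б\n", "в\n"])
print(char_indices("ваб", alphabet), is_covered("ваб", alphabet))
print(char_indices("вx", alphabet), is_covered("вx", alphabet))
```

Running such a check over the whole lexicon would list the characters that still need an alphabet entry.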

pirolen · 2023-02-21T16:56:31Z

Fantastic, thank you so much! I'm excited to test it asap!

pirolen · 2023-02-21T21:18:46Z

Awesome, both the module from pypi and the CLI version now work fine! I am going to explore the different modes.

pirolen · 2023-03-24T21:40:28Z

Not sure if this is of interest, but if I run the same CLI command with testinput with the same files now, and copy-paste ꙗванi from this issue into the CLI, although all of its letters are in my alphabet file, it does not get accepted :-(

UNKNOWN: ꙗванi 111299573 [4, 23, 35, 28, 3]

Also, if I copy-paste some tokens from the lexicon file opened in my VS Code editor into the VS Code terminal, I get surprises:

ихъ
UNKNOWN: ихъ 202193 [35, 16, 8]
хъ
OK: хъ 1357 [16, 8]
UNKNOWN: и 149 [35]

But 'и' is in the alphabet file, it can be searched for and is found.
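
The thread does not establish the cause of this behaviour. One way to narrow it down is to compare the exact code points of the pasted text with those of the alphabet file entry, since look-alike letters or invisible characters picked up during copy-paste are not visible in a terminal. A minimal sketch with a hypothetical helper `describe`:

```python
import unicodedata


def describe(text):
    """Return (code point, Unicode name) pairs for every character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]


# Print one line per character of the pasted token.
for code_point, name in describe("ихъ"):
    print(code_point, name)
```

If the pasted token and the alphabet entry yield different code point lists, the mismatch is in the input rather than in the matching.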

proycon added a commit (3323391) that referenced this issue on Feb 21, 2023:

Fixed important bug in anahashing and normalizing to alphabet for multibyte characters #17. Also added a 'testinput' mode and made alphabet debugging more verbose
