Question: Splitting runons #17

Open
pirolen opened this issue Feb 8, 2023 · 13 comments

@pirolen

pirolen commented Feb 8, 2023

Hi, I wonder whether analiticcl can generate variants that contain whitespace, i.e. suggest the split form in case of runon errors.

Suppose that 'holygrail' is actually a runon error after OCR; then I would like to get 'holy grail' back as a suggestion.

Is there a way to do it?

The other way round it works, i.e. for erroneous splits the concatenated forms are retrieved, e.g.

bitter sweet bittersweet 0.7604166666666666 bittersweets 0.6666666666666667
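
For illustration, the kind of split being asked about can be sketched in plain Python, independent of analiticcl. This is not how analiticcl works internally; `split_runon` and the toy lexicon are hypothetical names used only for this example, and the sketch handles exact two-word splits only (no spelling variation, no ranking).

```python
def split_runon(token, lexicon):
    """Return every two-word split of `token` whose halves are both in `lexicon`."""
    candidates = []
    # Try each internal split point: token[:i] + " " + token[i:]
    for i in range(1, len(token)):
        left, right = token[:i], token[i:]
        if left in lexicon and right in lexicon:
            candidates.append(f"{left} {right}")
    return candidates


# Hypothetical toy lexicon, for illustration only.
toy_lexicon = {"holy", "grail", "bitter", "sweet", "bittersweet"}

print(split_runon("holygrail", toy_lexicon))
```

A real solution would also have to allow for OCR errors inside each half and rank the candidates, which is where a language model would come in.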

@proycon
Owner

proycon commented Feb 8, 2023 via email

proycon added the question label (Further information is requested) on Feb 8, 2023
proycon self-assigned this on Feb 8, 2023
@pirolen
Author

pirolen commented Feb 9, 2023

Thanks!
I now simply used
analiticcl search --alphabet simple.alphabet.tsv --lexicon eng.aspell.lexicon --lm-order 3

(accepting standard input; enter text to search for variants, output may be delayed until end of input, enter an empty line to force output earlier)
holygrail

holygrail       0:9

Please don't hesitate to suggest meaningful parameter usage for my case.

My primary problem is that analiticcl does not generate anagrams from the alphabet file and lexicon I sent you (historical Slavonic) so I cannot seem to use it in any mode :-(

@pirolen
Author

pirolen commented Feb 20, 2023

> My primary problem is that analiticcl does not generate anagrams from the alphabet file and lexicon I sent you (historical Slavonic) so I cannot seem to use it in any mode :-(

Short update: I also tried to run analiticcl in a Colab notebook.
model.build() prints no output there at all to stdout, unlike in the tutorial.
Calling it from the command line on Ubuntu prints

Computing anagram values for all items in the lexicon...
 - Found 99999 instances
Adding all instances to the index...
 - Found 1 anagrams
Creating sorted secondary index...
Sorting secondary index...
 - Found 1 anagrams of length 1
Constructing Language Model...
 - No language model provided

And no matter what word I query with find_variants, the result is always the same 2 lines, returning the first two items in the lexicon file :-o

{'text': 'и', 'score': 1.0, 'dist_score': 1.0, 'freq_score': 1.0, 'lexicons': ['/home/pirol/quanti/devel/analiti/01_jan_car_lexfile_plain.tsv']}
{'text': 'не', 'score': 1.0, 'dist_score': 1.0, 'freq_score': 1.0, 'lexicons': ['/home/pirol/quanti/devel/analiti/01_jan_car_lexfile_plain.tsv']}

The files are UTF-8.
I am going to see if I can convert them to Unicode Normal Form C and whether that makes a difference.

By now I think I have tested installations via both cargo and pip.
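
A minimal sketch of the NFC conversion mentioned above, using only Python's standard `unicodedata` module. The helper names `to_nfc` and `normalize_file` are hypothetical, and whether normalization affects the behaviour described here is not established by this thread.

```python
import unicodedata


def to_nfc(text):
    """Return `text` in Unicode Normalization Form C (canonical composition)."""
    return unicodedata.normalize("NFC", text)


def normalize_file(src_path, dst_path):
    """Rewrite a UTF-8 file in NFC form (paths are supplied by the caller)."""
    with open(src_path, encoding="utf-8") as src:
        content = src.read()
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(to_nfc(content))


# Example: Cyrillic и followed by a combining breve composes to a single й.
decomposed = "\u0438\u0306"
composed = to_nfc(decomposed)
print([f"U+{ord(c):04X}" for c in decomposed])
print([f"U+{ord(c):04X}" for c in composed])
```

If the alphabet file and the lexicon are normalized differently, visually identical strings can differ at the code point level, so it makes sense to normalize both files the same way.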

@proycon
Owner

proycon commented Feb 21, 2023

> Short update: I also tried to run analiticcl in a Colab notebook.

Can you share the notebook? (along with all input files). Then I can check if I can see what's happening.

@pirolen
Author

pirolen commented Feb 21, 2023

Thanks very much! I have sent an invitation to your email address.

@proycon
Owner

proycon commented Feb 21, 2023

Got it, something's going wrong with the anagram computation based on the alphabet file. I'm investigating...

proycon added a commit that referenced this issue Feb 21, 2023
…tibyte characters #17

Also added a 'testinput' mode and made alphabet debugging more verbose
@proycon
Owner

proycon commented Feb 21, 2023

There was a serious bug in the multibyte handling that came to light thanks to your example. I'm doing a new analiticcl release tonight (v0.4.5) that will fix this.

@proycon
Owner

proycon commented Feb 21, 2023

Released now! (both on crates.io and pypi)

@proycon
Owner

proycon commented Feb 21, 2023

Example output from your test in the new situation:

$ analiticcl query --alphabet alphabet.true.tsv --lexicon 01_jan_car_lexfile_plain.lexicon.tsv      
...
Querying the model...
(accepting standard input; enter input to match, one per line, output may be delayed until end of input due to parallellisation)
жизни
жизни   жизни   1               жиѕни   1               жиꙁни   1               жизнї   1               ѡжизни  0.775           жизнїю  0.775           жиꙁни∙  0.775           жизныи  0.75            изни    0.725

@proycon
Owner

proycon commented Feb 21, 2023

I also added a testinput mode which you can use to check if a particular input is covered by your alphabet:

$ analiticcl testinput --alphabet alphabet.true.tsv --lexicon 01_jan_car_lexfile_plain.lexicon.tsv
ꙗванi
OK: ꙗванi       9710701 [4, 23, 5, 28, 3]
blah
UNKNOWN: blah   50308609        [37, 37, 5, 37]

(The highest number in the array (37) corresponds to an unknown character, which in this case means all the non-Cyrillic ones.)
This may help in improving coverage of your alphabet.
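
The number printed next to the array in the OK line can be reproduced with a short sketch, on the assumption that the anagram value is the product of primes, with alphabet index i mapped to the i-th prime (0-based). The function names are hypothetical. The mapping reproduces the value for ꙗванi above; it is not checked here against how unknown characters are encoded.

```python
from math import prod


def first_primes(n):
    """Return the first n primes by trial division (fine for small n)."""
    primes = []
    candidate = 2
    while len(primes) < n:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


def anagram_value(indices):
    """Multiply the primes assigned to each alphabet index (assumed mapping)."""
    primes = first_primes(max(indices) + 1)
    return prod(primes[i] for i in indices)


# Indices from the "OK: ꙗванi" line above.
print(anagram_value([4, 23, 5, 28, 3]))  # 9710701
```

Because multiplication is commutative, any reordering of the same characters yields the same value, which is what makes such a number usable as an anagram key.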

@pirolen
Author

pirolen commented Feb 21, 2023

Fantastic, thank you so much! I'm excited to test it asap!

@pirolen
Author

pirolen commented Feb 21, 2023

Awesome, both the module from pypi and the CLI version now work fine! I am going to explore the different modes.

@pirolen
Author

pirolen commented Mar 24, 2023

#17 (comment)

Not sure if this is of interest, but if I run the same CLI command with testinput with the same files now, and copy-paste ꙗванi from this issue into the CLI, although all of its letters are in my alphabet file, it does not get accepted :-(

UNKNOWN: ꙗванi 111299573 [4, 23, 35, 28, 3]

Also, if I copy-paste some tokens from the lexicon file, opened in my VS Code editor, into the VS Code terminal, I get surprises:

ихъ
UNKNOWN: ихъ 202193 [35, 16, 8]
хъ
OK: хъ 1357 [16, 8]
UNKNOWN: и 149 [35]

But 'и' is in the alphabet file, it can be searched for and is found.
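
One way to check what actually reaches the tool after copy-pasting is to list the code points of the pasted string and compare them with the entries in the alphabet file. The sketch below uses only Python's standard library; `describe` is a hypothetical helper name. It does not establish the cause of the UNKNOWN results above; it only helps to rule out look-alike characters or differing normalization.

```python
import unicodedata


def describe(text):
    """Return (character, code point, Unicode name) for each character in `text`."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


# Cyrillic и and Latin i are different characters with different code points.
for row in describe("иi"):
    print(row)
```

Running the same function on the corresponding line of the alphabet file shows whether both sides contain identical code points.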
