Implement search mode (error detection in running text), aka context-aware correction #2
Current implementation

This proves to be quite a difficult problem, and the implementation I now have still delivers poor quality. Let me describe my current implementation for future reference (i.e. so I don't forget it myself).

I had hoped for a solution inspired by phrase-based statistical machine translation, where I combine a variant model and a language model and attempt to optimize their joint probability. All possible segmentations of an input sentence are represented in a finite state transducer, such as the following for the simple input sentence "het is een huys", which should correct to "het is een huis":

The labels are in the form

The top n best routes are extracted from the FST, and then the language model comes in: each complete sequence is scored according to a simple bigram language model (with +1 smoothing). For each sequence we then obtain a logprob from the LM, along with a variant logprob (the sum of the variant scores of the path that was followed).

The heart of the matter is that these two scores 'compete' with one another and their combination needs to be optimized. But their units are not directly comparable, so to remedy this I attempt a kind of normalisation where the top LM score gets cast to 1 and the top variant sum also gets cast to 1. The worst scores get cast to 0, so everything ends up in the range 1.0 (best) to 0.0 (worst). This may still be too arbitrary, but it's a start. The final score is a simple weighted linear combination (the LM weight and the variant model weight are parameters that by default are both set to 1).

Ranked sequences for this example:
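The normalisation and weighted combination described above can be sketched roughly as follows. This is an illustrative Python sketch, not analiticcl's actual (Rust) implementation; the function names are hypothetical:

```python
def normalize(scores):
    """Min-max normalise so the best score maps to 1.0 and the worst to 0.0."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [1.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def rank_sequences(lm_logprobs, variant_logprobs, lm_weight=1.0, variant_weight=1.0):
    """Rank candidate sequences by a weighted linear combination of the
    normalised LM logprobs and normalised variant logprob sums.
    Returns sequence indices, best first."""
    lm_norm = normalize(lm_logprobs)
    var_norm = normalize(variant_logprobs)
    combined = [
        (lm_weight * l + variant_weight * v) / (lm_weight + variant_weight)
        for l, v in zip(lm_norm, var_norm)
    ]
    return sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)
```

With both weights at 1 (the defaults mentioned above) this reduces to a plain average of the two normalised scores.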
Problems

Issues I've encountered thus far with this approach are:
The actual issue here seems to be that the LM is steering us away from the right answer in this case, because it really likes the bigram "antwoord wat". A stronger LM (trigrams?) may be an idea here. But the point I wanted to raise is about how to make these comparisons in each of the triangles, and whether it's fair to compare the variant model scores of unigrams and n-grams as we do here.
…s input (let it fall back to unigrams) #2
I've been experimenting with a revised scoring mechanism today, to establish a better common ground for comparing the variant scores of unigrams and higher-order n-grams (point 3 from my previous comment). The following FST demonstrates this experiment:

The variant scores in this FST consist of a base component that is proportional to the number of tokens spanned (e.g. 1 for unigrams, 2 for bigrams, etc.). The values in the FST can be interpreted as costs that are to be minimized (shortest path). The fractional part behind the comma encodes the variant score (0.0 to 1.0) as it was, but inversely (1.0 - variant score).

This means that if we have two unigrams with a perfect variant score (0 + 0), we get base score 1 + 1 for the combination. If these same two unigrams can be covered by a bigram, and this bigram also has a perfect variant score (0) and a base score 2, then we see that the scores are identical and we have established some common ground for comparison. In that ideal case we'd have a tie and only the LM could break it. In reality this rarely happens, and one route or the other tends to be favoured. I do seem to get somewhat better results with this approach, but it feels fairly ad hoc/contrived.

There's also still the question of how to strike a balance between the language model and the variant model (point 1). I made some improvements in that regard too: I now use perplexity instead of the LM logprob, and map this perplexity to a range of 1.0 (best) to ~0.0, without forcing an actual lower bound like I did before (the worst perplexity found no longer gets mapped to 0, nor is the worst variant score mapped to 0 any longer). The actual score computation that combines both the LM and the variant model is a weighted geometric mean.
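The revised arc cost and the weighted geometric mean described above could be sketched as follows. Again, this is a hypothetical Python rendering for illustration, not analiticcl's actual code:

```python
def arc_cost(tokens_spanned, variant_score):
    """Cost of one FST arc: an integer base equal to the number of tokens
    spanned (1 for unigrams, 2 for bigrams, ...), plus the inverted
    variant score (1.0 - score) as the fractional part.
    Lower cost is better (shortest path)."""
    return tokens_spanned + (1.0 - variant_score)

def combine(lm_score, variant_score, lm_weight=1.0, variant_weight=1.0):
    """Weighted geometric mean of the (already normalised, 0..1) language
    model score and variant model score."""
    total = lm_weight + variant_weight
    return (lm_score ** lm_weight * variant_score ** variant_weight) ** (1.0 / total)
```

Note that under this scheme two perfect unigrams (cost 1 + 1) and one perfect bigram covering the same tokens (cost 2) tie exactly, which is the common ground the comment describes. A property of the geometric mean (unlike the earlier linear combination) is that a near-zero score in either model drags the combined score toward zero, so neither model can fully compensate for the other.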
…age, modifying the variant scores #2
At the end of July, I also implemented an alternative approach (enabled using

Both approaches are currently available in the implementation, for testing.
Search mode is the error detection stage of analiticcl, where running text is passed as input and analiticcl attempts to identify and correct the parts of the text that require correction, regardless of where in the text they occur or whether they concern single words or larger n-grams (i.e. run-on and split errors should be supported). This is the stage where context plays an important role (unlike in analiticcl's simpler query mode).
A first version of a search mode has been implemented as follows:

- …the variant scores (no context considered yet)
- …model score and variant model score, return the best sequence
However, this still requires extensive testing and tweaking, as the current results are still sub-optimal.
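The "simple bigram language model (with +1 smoothing)" used to score the extracted sequences could look something like the following. This is an illustrative Python sketch under stated assumptions (sentence-boundary tokens, Laplace smoothing over the observed vocabulary), not analiticcl's actual Rust implementation:

```python
import math
from collections import Counter

class BigramLM:
    """Bigram language model with add-one (Laplace) smoothing."""

    def __init__(self, corpus_sentences):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for sentence in corpus_sentences:
            tokens = ["<s>"] + sentence.split() + ["</s>"]
            self.unigrams.update(tokens)
            self.bigrams.update(zip(tokens, tokens[1:]))
        self.vocab_size = len(self.unigrams)

    def logprob(self, sentence):
        """Log probability of a sentence as the sum of smoothed bigram logprobs."""
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        lp = 0.0
        for prev, cur in zip(tokens, tokens[1:]):
            # add-one smoothing: (count(prev, cur) + 1) / (count(prev) + V)
            lp += math.log((self.bigrams[(prev, cur)] + 1)
                           / (self.unigrams[prev] + self.vocab_size))
        return lp
```

On a toy corpus containing "het is een huis", such a model assigns a higher logprob to the corrected sentence than to the variant with the unseen "huys", which is exactly the signal the search mode relies on to rank candidate sequences.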