
Loading the confusables file #19

Open
pirolen opened this issue Apr 26, 2023 · 8 comments
Labels
question Further information is requested

Comments

@pirolen

pirolen commented Apr 26, 2023

I wonder if this is the right way to load the confusables file:

m = build_variant_model(alphabet_file, weightsconfig=ws1)
m.read_confusablelist(confusables_file)

It would be brilliant to have an example of how the confusables list impacts the ranking of error candidates, and how the confusables penalty or promotion works (i.e. what we gain by these).

Especially: what would happen in analiticcl (apart from the semantic heterogeneity) if confusables were listed in the alphabet file? Many thanks!

@proycon proycon added the question Further information is requested label May 2, 2023
@proycon proycon self-assigned this May 2, 2023
@proycon
Owner

proycon commented May 2, 2023

I wonder if this is the right way to load the confusables file

Yes, it is.

It would be brilliant to have an example of how the confusables list impacts the ranking of error candidates, and how the confusables penalty or promotion works (i.e. what we gain by these).

Good and valid questions indeed. First, I should perhaps say that I don't think this confusable weighting functionality has really been used in practice yet, so there's no proper evaluation or anything. Though I implemented it, we never used it in the Golden Agents projects for which analiticcl was developed. I can, of course, explain how it is implemented:

After all variants are scored in the regular way using the distance metrics and possibly frequency information (a log-linear combination of various components), an extra rescoring is performed if a confusable list is provided. This rescoring is meant to give slight bonuses or penalties to the scores whenever certain confusables occur (with a certain confusable weight). In the documentation I write about this:

Weights greater than 1.0 are being given preference in the score weighting, weights smaller than 1.0 imply a penalty.
When multiple confusable patterns match, the product of their weights is taken. The final weight is applied to the
whole candidate score, so weights should be values fairly close to 1.0 in order not to introduce too large
bonuses/penalties.

It is a bit hard to predict how this plays out in actual use cases; the challenge is always in tweaking the weights so there is a balance between the confusable weights and the weights in the main score function (of which these are not a part, but are applied after the fact to that score as a whole). The only way to find out is to experiment with it.
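To make the mechanism above concrete, here is a minimal sketch (not analiticcl's actual code, and with illustrative scores and a simplified pattern check): each candidate's regular score is multiplied by the product of the weights of all confusable patterns that match it, and the candidates are then re-ranked.

```python
# Toy sketch of confusable rescoring: base scores come from the regular
# scoring (distance metrics, frequency); matching confusable weights are
# multiplied onto them afterwards. Not analiticcl's real implementation.

def rescore(candidates, confusables):
    """candidates: list of (variant, score); confusables: list of (matches_fn, weight)."""
    rescored = []
    for variant, score in candidates:
        weight = 1.0
        for matches, w in confusables:
            if matches(variant):
                weight *= w  # multiple matching patterns: weights multiply
        rescored.append((variant, score * weight))
    # re-rank by the adjusted score
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)

# Illustration: a pattern giving a slight bonus to variants containing "жд"
confusables = [(lambda v: "жд" in v, 1.1)]
ranking = rescore([("таже", 0.90), ("тажде", 0.88)], confusables)
```

In this toy example the 1.1 bonus lifts "тажде" (0.88 × 1.1 ≈ 0.97) above "таже" (0.90), showing how even a weight close to 1.0 can reorder candidates with similar base scores.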

There is one relevant option which is not properly documented yet: the --early-confusables parameter. When set, it causes analiticcl to rescore variants using the confusable list before pruning variants on things like score thresholds and maximum candidate counts. The default is to first prune the variant list and only then apply the confusable weighting, as that is more performant (far fewer candidates to consider), but the other way round would of course be better for accuracy!
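The accuracy difference can be illustrated with a toy example (again, not analiticcl's code; scores, the weight, and the cut-off are made up): a candidate whose confusable bonus would lift it into the top ranks can be lost if pruning happens first.

```python
# Toy illustration of prune-then-rescore (default) vs rescore-then-prune
# (--early-confusables). Candidate "c" only survives in the early variant.

def apply_weight(cands, bonus_for, weight):
    """Give one candidate a confusable bonus; all others keep weight 1.0."""
    return [(v, s * (weight if v == bonus_for else 1.0)) for v, s in cands]

def prune(cands, k):
    """Keep only the top-k candidates by score."""
    return sorted(cands, key=lambda pair: pair[1], reverse=True)[:k]

cands = [("a", 0.90), ("b", 0.85), ("c", 0.80)]

# Default: prune first, then weight -- "c" was already discarded.
late = apply_weight(prune(cands, 2), "c", 1.2)

# --early-confusables: weight first, then prune -- "c" (0.80 * 1.2 = 0.96) now ranks first.
early = prune(apply_weight(cands, "c", 1.2), 2)
```

The trade-off is exactly as described: the early variant considers every candidate (slower), the late variant only the survivors (faster, but bonuses can no longer rescue pruned candidates).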

Especially: what would happen in analiticcl (apart from the semantic heterogeneity) if confusables were listed in the alphabet file? Many thanks!

The confusable lists and weights are a more refined mechanism and can express various things that the alphabet can't (like context information and variable weights), but they do introduce an extra level of complexity. The alphabet file is much cruder, but if your confusables are unambiguous enough to fit in there, then that might indeed be the preferred option. If it would only cause more ambiguity, though, then it's probably not a good idea.

@pirolen
Author

pirolen commented Apr 21, 2024

There is one relevant option which is not properly documented yet: the --early-confusables parameter which, when set, causes analiticcl to rescore variants using the confusable list before pruning variants on things like score thresholds and max candidates.

I was trying to find this parameter for the Python objects but so far without success. Is it available?

@proycon
Owner

proycon commented Apr 22, 2024

Good point, I think it's not propagated to the Python binding yet. I'll add it.

proycon added a commit that referenced this issue Apr 22, 2024
@proycon
Owner

proycon commented Apr 22, 2024

This should now be fixed in v0.4.6, call model.set_confusables_before_pruning() to enable the parameter.

@pirolen
Author

pirolen commented Apr 24, 2024

Thanks!

I'd like to use this parameter to achieve e.g. the following.
Suppose we know a number of historical sound change patterns, e.g. жд --> ж.

So then, if using this method, a query 'тажде' should return 'таже' with a high-ish score (or vice versa), I suppose?

How should this be represented in the confusables file, e.g. similar as below?
=[aж]-[д]=[е] 1.1

or likely without the preceding (and trailing) context, which is not generic enough?
I am not sure about the score in the second column either.

(sorry for the multiple edits)

@pirolen
Author

pirolen commented Apr 24, 2024

... and is there a way to make patterns behave symmetrically, so they apply to the counterpart cases as well?
I.e. to cover the 'vice versa' above, e.g. to get the pattern to edit таже into тажде, or do I need to specify that separately, as an addition instead of the deletion?

@pirolen
Author

pirolen commented Apr 24, 2024

I think I have figured it out; e.g. this works well:
=[ж]-[д] 1.5
=[ж]+[д] 1.5

and the score depends on how the other scores are set, I guess. But 1.5 seems to return the desired lexemes well for my use case.

Thanks a lot for the implementation!

@proycon
Owner

proycon commented Apr 24, 2024

Great, I see you already figured it out! That is indeed the proper syntax, and you indeed need both patterns explicitly. It will give a higher score to variants that had жд and lost the д, and to variants that have ж and gain a д, relative to the weighting of the other variants that do not exhibit such a pattern. Finding the proper score is always a bit of trial and error; 1.5 might even be a bit on the large side, as weights are best kept fairly close to 1.0 in order not to have too big an influence.
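The "close to 1.0" advice can be made concrete with some simple arithmetic (the base score of 0.80 is purely illustrative): since the weights of multiple matching patterns multiply, a weight of 1.5 compounds quickly, while a milder weight stays a gentle nudge.

```python
# Illustrative arithmetic only: how confusable weights compound on a candidate score.
base = 0.80                     # a candidate's score from the main scoring function
one_strong = base * 1.5         # one pattern match at weight 1.5
two_strong = base * 1.5 * 1.5   # two matches: weights multiply (2.25x overall)
two_mild = base * 1.1 * 1.1     # two matches at a milder 1.1 (~1.21x overall)
```

A single 1.5 match already pushes this candidate's score from 0.80 to 1.20, easily swamping the ranking produced by the main score function; two 1.1 matches move it only to about 0.97.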
