
Loading the confusables file #19

Open
pirolen opened this issue Apr 26, 2023 · 8 comments
Labels
question Further information is requested

Comments

@pirolen

pirolen commented Apr 26, 2023

I wonder if this is the right way to load the confusables file:

m = build_variant_model(alphabet_file, weightsconfig=ws1)
m.read_confusablelist(confusables_file)

It would be brilliant to have an example of how the confusables list impacts the ranking of error candidates, and how the confusables penalty or promotion works (i.e. what we gain by these).

Especially: what would happen in analiticcl (apart from the semantic heterogeneity) if confusables were listed in the alphabet file? Many thanks!

@proycon proycon added the question Further information is requested label May 2, 2023
@proycon proycon self-assigned this May 2, 2023
@proycon
Owner

proycon commented May 2, 2023

I wonder if this is the right way to load the confusables file

Yes, it is.

It would be brilliant to have an example of how the confusables list impacts the ranking of error candidates, and how the confusables penalty or promotion works (i.e. what we gain by these).

Good and valid questions indeed. First, I should perhaps say that I don't think this confusable weighting functionality has really been used in practice yet, so there's no proper evaluation or anything. Though I implemented it, we never used it in the Golden Agents projects for which analiticcl was developed. I can, of course, explain how it is implemented:

After all variants are scored in the regular way using the distance metrics and possibly frequency information (a log-linear combination of various components), an extra rescoring is performed if a confusable list is provided. This rescoring is meant to give slight bonuses or penalties to the scores whenever certain confusables occur (with a certain confusable weight). In the documentation I write about this:

Weights greater than 1.0 are being given preference in the score weighting, weights smaller than 1.0 imply a penalty.
When multiple confusable patterns match, the product of their weights is taken. The final weight is applied to the
whole candidate score, so weights should be values fairly close to 1.0 in order not to introduce too large
bonuses/penalties.

It is a bit hard to predict how this plays out in actual use cases; the challenge is always in tweaking the weights so there is a balance between the confusable weights and the weights in the main score function (of which these are not a part, but are applied after the fact to that score as a whole). The only way to find out is to experiment with it.
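To make the mechanism above concrete, here is a minimal sketch (not analiticcl's actual code, and with illustrative scores and a simplified pattern check): each candidate's regular score is multiplied by the product of the weights of all confusable patterns that match it, and the candidates are then re-ranked.

```python
# Toy sketch of confusable rescoring: base scores come from the regular
# scoring (distance metrics, frequency); matching confusable weights are
# multiplied onto them afterwards. Not analiticcl's real implementation.

def rescore(candidates, confusables):
    """candidates: list of (variant, score); confusables: list of (matches_fn, weight)."""
    rescored = []
    for variant, score in candidates:
        weight = 1.0
        for matches, w in confusables:
            if matches(variant):
                weight *= w  # multiple matching patterns: weights multiply
        rescored.append((variant, score * weight))
    # re-rank by the adjusted score
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)

# Illustration: a pattern giving a slight bonus to variants containing "жд"
confusables = [(lambda v: "жд" in v, 1.1)]
ranking = rescore([("таже", 0.90), ("тажде", 0.88)], confusables)
```

In this toy example the 1.1 bonus lifts "тажде" (0.88 × 1.1 ≈ 0.97) above "таже" (0.90), showing how even a weight close to 1.0 can reorder candidates with similar base scores.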

There is one relevant option which is not properly documented yet: the --early-confusables parameter. When set, it causes analiticcl to rescore variants using the confusable list before pruning variants on things like score thresholds and maximum candidate counts. The default is to first prune the variant list and only then apply the confusable weighting, as that is more performant (far fewer candidates to consider), but the other way round would of course be better for accuracy!
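The accuracy difference can be illustrated with a toy example (again, not analiticcl's code; scores, the weight, and the cut-off are made up): a candidate whose confusable bonus would lift it into the top ranks can be lost if pruning happens first.

```python
# Toy illustration of prune-then-rescore (default) vs rescore-then-prune
# (--early-confusables). Candidate "c" only survives in the early variant.

def apply_weight(cands, bonus_for, weight):
    """Give one candidate a confusable bonus; all others keep weight 1.0."""
    return [(v, s * (weight if v == bonus_for else 1.0)) for v, s in cands]

def prune(cands, k):
    """Keep only the top-k candidates by score."""
    return sorted(cands, key=lambda pair: pair[1], reverse=True)[:k]

cands = [("a", 0.90), ("b", 0.85), ("c", 0.80)]

# Default: prune first, then weight -- "c" was already discarded.
late = apply_weight(prune(cands, 2), "c", 1.2)

# --early-confusables: weight first, then prune -- "c" (0.80 * 1.2 = 0.96) now ranks first.
early = prune(apply_weight(cands, "c", 1.2), 2)
```

The trade-off is exactly as described: the early variant considers every candidate (slower), the late variant only the survivors (faster, but bonuses can no longer rescue pruned candidates).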

Especially: what would happen in analiticcl (apart from the semantic heterogeneity) if confusables were listed in the alphabet file? Many thanks!

The confusable lists and weights are a more refined mechanism and can express various things that the alphabet can't (like context information and variable weights), but they do introduce an extra level of complexity. The alphabet file is much cruder, but if your confusables are unambiguous enough to fit in there, then that might indeed be the preferred option. If it would only cause more ambiguity, though, then it's probably not a good idea.

@pirolen
Author

pirolen commented Apr 21, 2024

There is one relevant option which is not properly documented yet: the --early-confusables parameter which, when set, causes analiticcl to rescore variants using the confusable list before pruning variants on things like score thresholds and max candidates.

I was trying to find this parameter for the Python objects but so far without success. Is it available?

@proycon
Owner

proycon commented Apr 22, 2024

Good point, I think it's not propagated to the Python binding yet. I'll add it.

proycon added a commit that referenced this issue Apr 22, 2024
@proycon
Owner

proycon commented Apr 22, 2024

This should now be fixed in v0.4.6, call model.set_confusables_before_pruning() to enable the parameter.

@pirolen
Author

pirolen commented Apr 24, 2024

Thanks!

I'd like to use this parameter to achieve e.g. the following.
Suppose we know a number of historical sound change patterns, e.g. жд --> ж.

So then, if using this method, a query 'тажде' should return 'таже' with a high-ish score (or vice versa), I suppose?

How should this be represented in the confusables file, e.g. similar as below?
=[aж]-[д]=[е] 1.1

or likely without the preceding (and trailing) context, which is not generic enough?
I am not sure about the score in the second column either.

(sorry for the multiple edits)

@pirolen
Author

pirolen commented Apr 24, 2024

... and is there a way to make patterns behave symmetrically, so they apply to the counterpart cases as well?
I.e. to cover the 'vice versa' above, e.g. to get the pattern to edit таже into тажде, or do I need to specify that separately, as an addition instead of the deletion?

@pirolen
Author

pirolen commented Apr 24, 2024

I think I have figured it out; e.g. this works well:
=[ж]-[д] 1.5
=[ж]+[д] 1.5

and the score depends on how the other scores are set, I guess. But 1.5 seems to return the desired lexemes well for my use case.

Thanks a lot for the implementation!

@proycon
Owner

proycon commented Apr 24, 2024

Great, I see you already figured it out! That is indeed the proper syntax, and you indeed need both patterns explicitly. It will give a higher score to variants that had жд and lost the д, and to variants that have ж and gain a д, relative to the weighting of the other variants that do not exhibit such a pattern. Finding the proper score is always a bit of trial and error; 1.5 might even be a bit on the large side, as weights are best kept fairly close to 1.0 in order not to have too big an influence.
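The "close to 1.0" advice can be made concrete with some simple arithmetic (the base score of 0.80 is purely illustrative): since the weights of multiple matching patterns multiply, a weight of 1.5 compounds quickly, while a milder weight stays a gentle nudge.

```python
# Illustrative arithmetic only: how confusable weights compound on a candidate score.
base = 0.80                     # a candidate's score from the main scoring function
one_strong = base * 1.5         # one pattern match at weight 1.5
two_strong = base * 1.5 * 1.5   # two matches: weights multiply (2.25x overall)
two_mild = base * 1.1 * 1.1     # two matches at a milder 1.1 (~1.21x overall)
```

A single 1.5 match already pushes this candidate's score from 0.80 to 1.20, easily swamping the ranking produced by the main score function; two 1.1 matches move it only to about 0.97.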
