tokenization of "..." should be one token #31
Comments
The problem of an abbreviation ending in a full stop followed by an ellipsis appears quite often in all languages. At least in Finnish typography and grammar rules, three full stops is the correct form, and we have decided to tokenise it so that the surface form of the abbreviation keeps one full stop and the ellipsis has two full stops on the surface but three (or the ellipsis symbol) on the analysis level. I.e. I propose a giella-shared/all_langs/punctuation.lexc entry like:
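Purely as an illustration of the split described in this comment, and not the lexc entry itself, here is a minimal Python sketch. The function name `split_abbr_ellipsis` and the abbreviation set are hypothetical. It assumes each abbreviation is listed with its final full stop, and returns (surface, analysis) pairs.

```python
def split_abbr_ellipsis(token, abbreviations):
    """Split an abbreviation followed by an ellipsis into two tokens.

    Each returned pair is (surface form, analysis-level form). The
    abbreviation keeps one full stop; the ellipsis has two full stops
    on the surface but three on the analysis level.
    """
    for abbr in abbreviations:
        # abbr already ends in ".", so abbr + ".." is e.g. "jna..."
        if token == abbr + "..":
            return [(abbr, abbr), ("..", "...")]
    # Anything else is passed through unchanged as a single token.
    return [(token, token)]


# Hypothetical abbreviation list, for illustration only.
ABBREVIATIONS = {"jna.", "o.s."}

result = split_abbr_ellipsis("jna...", ABBREVIATIONS)
```

Here `result` pairs the surface string `jna.` with itself and the surface string `..` with the analysis-level `...`. A real implementation would live in the lexc/tokeniser sources rather than in Python.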
@flammie I think we already have a single-token analysis of the three full stops (the ones that are closer to each other), so this should probably be the same one. And if it is an error, which I guess depends on the norm, then the error tag should be similar to other error tags, e.g. Err/CLB or so. Also:
This is the analysis I get:
This is easily fixed by adding an Err/something reading to full stops after abbreviations. Then both full stops will be read as part of the initial token and be given an analysis (as CLB or not, depending on the context, but in both cases as Err/xxx).
Found another one with three full stops:
We get the following analysis:
We should fix this before the new release. I have seen a number of cases of it.
I changed the dot lexicon so that one and three full stops work the same, i.e.:
Status today:
{
"errs": [
[
"čohken",
10,
16,
"real-čohkken",
"\"čohken\" orru leamen čállinmeattáhus",
[
"čohkken"
],
"Čállinmeattáhus dán oktavuođas"
],
[
"o.s..",
46,
51,
"typo",
"Ii leat sátnelisttus",
[
"o.s."
],
"Čállinmeattáhus"
]
],
"text": "Materiála čohken ja girjji čállin Davvi Girji o.s.."
}
I.e. one out of three fixed.
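To make it easier to see which spans the checker flags in the status output, here is a small sketch that slices the text with the reported offsets. It assumes the second and third fields of each `errs` entry are start and end character offsets into `text`, which is consistent with the values shown. The helper name `flagged_spans` is hypothetical.

```python
# The status output from this comment, as a Python literal.
status = {
    "errs": [
        ["čohken", 10, 16, "real-čohkken",
         "\"čohken\" orru leamen čállinmeattáhus",
         ["čohkken"], "Čállinmeattáhus dán oktavuođas"],
        ["o.s..", 46, 51, "typo", "Ii leat sátnelisttus",
         ["o.s."], "Čállinmeattáhus"],
    ],
    "text": "Materiála čohken ja girjji čállin Davvi Girji o.s..",
}


def flagged_spans(status):
    """Return (flagged text, error tag, suggestions) for each error."""
    spans = []
    for err in status["errs"]:
        start, end = err[1], err[2]  # assumed: character offsets
        spans.append((status["text"][start:end], err[3], err[5]))
    return spans


for span, tag, suggestions in flagged_spans(status):
    print(f"{span!r} tagged {tag}, suggestions: {suggestions}")
```

The second entry shows that the abbreviation together with both full stops is still flagged as a single `typo` span.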
The present analysis looks like this:
Produced with this command:
echo "Boarrásut materiálii gullet gielladutkiid čohken teavsttat, ođđasut materiálas maid dán \
artihkkalis ovdanbuvttán, leat mánáidgirjjit, divttat, aviisačállosat jna..." | \
./tools/grammarcheckers/modes/smegramrelease.mode
Is this good enough, @lynnda-hill, or should the three stops be a separate token?
In the following sentence we have the expression "and so on ...". However, the three dots are tokenized as one period plus two separate periods, which causes a problem for the grammar checker. We should at least allow for the option of tokenizing all three dots as one token (which could then maybe be disambiguated in mwe-dis.cg3):
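As a rough sketch of the single-token option only, here is a toy regex tokenizer in Python that tries three consecutive full stops before any other punctuation. It is not the project's tokenizer (which is not written in Python); the pattern and the function name `tokenize` are assumptions made for illustration.

```python
import re

# Alternatives are tried left to right, so "..." is matched as a whole
# before the single-punctuation alternative can split it up.
TOKEN_RE = re.compile(r"\.\.\.|\w+|[^\w\s]")


def tokenize(text):
    """Split text into word and punctuation tokens, keeping '...' whole."""
    return TOKEN_RE.findall(text)


tokens = tokenize("and so on ...")
# tokens == ['and', 'so', 'on', '...']
```

A single full stop still comes out as its own token, so `tokenize("and so on.")` ends in `'.'`. Whether the one-token reading is kept would then be up to a later disambiguation step.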
Example:
pipeline:
Output: