tokenization of "..." should be one token #31

Open
lynnda-hill opened this issue Oct 18, 2021 · 9 comments
Assignees
Labels
bug Something isn't working

Comments

@lynnda-hill
Contributor

lynnda-hill commented Oct 18, 2021

In the following sentence we have the expression "and so on ...". However, the three dots are tokenized as one period followed by a separate two-period token, which causes a problem for the grammar checker. We should at least allow for the option of tokenizing all three dots as one token (which could then perhaps be disambiguated in mwe-dis.cg3).

Example:

Boarrásut materiálii gullet gielladutkiid čohken teavsttat, ođđasut materiálas maid dán artihkkalis ovdanbuvttán, leat mánáidgirjjit, divttat, aviisačállosat jna...

pipeline:

... | tools/grammarcheckers/modes/trace-smegramrelease-dev.mode | less

Output:

"<jna>"
        "jna" Adv ABBR Gram/IAbbr <W:0.0> @<ADVL MAP:23205 #21->21
        "jna" Adv ABBR Gram/IAbbr Attr <W:0.0> @<ADVL MAP:23205 #21->21
"<.>"
        "." CLB <W:0.0> <NoSpaceAfterPunctMark> &no-space-after-punct-mark #22->22 ID:22 R:RIGHT:24 ADD:9737:no-space-after-punct ADD:9737:no-space-after-punct
no-space-after-punct-mark
        "." CLB <W:0.0> <NoSpaceAfterPunctMark> ". .."S &no-space-after-punct-mark &SUGGESTWF #22->22 ID:22 R:RIGHT:24 ADD:9737:no-space-after-punct COPY:9754:no-space-after-punct-sugg
no-space-after-punct-mark
;       "jna" Adv ABBR Gram/IAbbr <W:0.0> REMOVE:2969

"<..>"
        ".." CLB <W:0.0> &no-space-after-punct-mark &LINK #1->1 ID:24 ADD:9718:double-space-before-link ADDRELATION($2):9719:double-space-before-rel ADDRELATION(LEFT):9720:double-space-before-rel ADD:9747:no-space-after-punct-link ADD:9718:double-space-before-link ADD:9747:no-space-after-punct-link
no-space-after-punct-mark
        ".." CLB <W:0.0> &double-space-before #1->1 ID:24 ADD:9718:double-space-before-link ADD:9747:no-space-after-punct-link ADD:9718:double-space-before-link ADD:9747:no-space-after-punct-link
double-space-before
        ".." CLB <W:0.0> &LINK #1->1 ID:24 ADD:9718:double-space-before-link ADDRELATION($2):9719:double-space-before-rel ADDRELATION(LEFT):9720:double-space-before-rel ADD:9747:no-space-after-punct-link ADDRELATION(RIGHT):9752:no-space-after-punct-rel ADD:9718:double-space-before-link ADD:9747:no-space-after-punct-link
@lynnda-hill lynnda-hill added the bug Something isn't working label Oct 18, 2021
@flammie
Contributor

flammie commented Oct 19, 2021

The problem of an abbreviation ending in a full stop followed by an ellipsis appears quite often in all languages. At least in Finnish typography and grammar rules, three full stops is the correct form, and we have decided to tokenise it so that the surface form of the abbreviation keeps one full stop and the ellipsis has two full stops on the surface but three (or the ellipsis symbol) on the analysis level. I.e. I propose a giella-shared/all_langs/punctuation.lexc entry like ..:... CLBcont ; (or CLBerrorth). But I haven't gotten that to work nicely with the other CG rules: the reading should be removed unless the token at position -1 ends in a full stop or so. Maybe another approach would be better...
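
To make the proposed split concrete, here is a minimal Python sketch of the tokenisation described above. It is only an illustration and not the lexc/CG implementation: the `Token` type, the `split_abbr_ellipsis` function and the `ABBREVIATIONS` set are hypothetical names, and the abbreviation list is an assumption.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    surface: str   # characters as they appear in the text
    analysis: str  # form used on the analysis side


# Hypothetical abbreviation list, written with the final full stop.
ABBREVIATIONS = {"jne.", "jna.", "o.s."}


def split_abbr_ellipsis(word: str) -> List[Token]:
    """Split 'abbr...' into 'abbr.' plus an ellipsis with surface '..'."""
    if not word.endswith("..."):
        return [Token(word, word)]
    stem = word[:-2]  # candidate abbreviation keeping exactly one full stop
    if stem in ABBREVIATIONS:
        # The ellipsis has two stops on the surface, three on the analysis side.
        return [Token(stem, stem), Token("..", "...")]
    head = word[:-3]
    if head:
        # Not a known abbreviation: plain word plus a full three-dot ellipsis.
        return [Token(head, head), Token("...", "...")]
    return [Token("...", "...")]
```

With this sketch, `split_abbr_ellipsis("jna...")` gives the abbreviation `jna.` followed by an ellipsis token whose surface is `..` and whose analysis form is `...`, while a word that is not in the list keeps all three stops in the ellipsis token.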

@lynnda-hill
Contributor Author

lynnda-hill commented Oct 20, 2021

@flammie I think we already have a single-token analysis of the three full stops (the ones that are closer to each other), so this should probably be the same one. And if it is an error, which I guess depends on the norm, then the error tag should be similar to other error tags, e.g. Err/CLB or so.

Also:
I found another related example with two full stops where one of them should be tokenized as part of the abbreviation and the other one as a sentence boundary, and maybe there should be only one full stop anyway. But for that to be recognized we should probably distinguish between them.

Materiála čohken ja girjji čállin Davvi Girji o.s..

This is the analysis I get:

"<Davvi Girji>"
        "Davvi Girji" N Prop Sem/Org Sg Nom <W:0.0> @N< MAP:22355:r112 #6->6 SUBSTITUTE:9967
;       "Davvi Girji" MWE N Prop Sem/Org Attr <W:0.0> REMOVE:16806:r1912
: 
"<o.s>"
        "o.s" N <NomGenSg> ABBR Gram/IAbbr Sg Acc <W:0.0> @<OBJ SUBSTITUTE:3534 SELECT:17130:r2021 MAP:23862:IfNoTransV> #7->7
;       "o.s" N <NomGenSg> ABBR Gram/IAbbr Attr <W:0.0> SUBSTITUTE:3534 SELECT:17130:r2021
;       "o.s" N <NomGenSg> ABBR Gram/IAbbr Sg Gen <W:0.0> SUBSTITUTE:3534 SELECT:17130:r2021
;       "o.s" N <NomGenSg> ABBR Gram/IAbbr Sg Nom <W:0.0> SUBSTITUTE:3534 SELECT:17130:r2021
"<.>"
        "." CLB <W:0.0> <NoSpaceAfterPunctMark> &no-space-after-punct-mark #8->8 ID:8 R:RIGHT:10 ADD:9739:no-space-after-punct ADD:9739:no-space-after-punct
no-space-after-punct-mark
        "." CLB <W:0.0> <NoSpaceAfterPunctMark> ". ."S &no-space-after-punct-mark &SUGGESTWF #8->8 ID:8 R:RIGHT:10 ADD:9739:no-space-after-punct COPY:9756:no-space-after-punct-sugg
no-space-after-punct-mark
;       "o.s" N ABBR Gram/IAbbr Attr <W:0.0> REMOVE:2969
;       "o.s" N ABBR Gram/IAbbr Sg Acc <W:0.0> REMOVE:2969
;       "o.s" N ABBR Gram/IAbbr Sg Gen <W:0.0> REMOVE:2969
;       "o.s" N ABBR Gram/IAbbr Sg Nom <W:0.0> REMOVE:2969

"<.>"
        "." CLB <W:0.0> <NoSpaceAfterPunctMark> &LINK #1->1 ID:10 ADD:9739:no-space-after-punct ADD:9749:no-space-after-punct-link ADDRELATION(RIGHT):9754:no-space-after-punct-rel ADD:9739:no-space-after-punct ADD:9749:no-space-after-punct-link
        "." CLB <W:0.0> <NoSpaceAfterPunctMark> &no-space-after-punct-mark #1->1 ID:10 ADD:9739:no-space-after-punct ADD:9749:no-space-after-punct-link ADD:9739:no-space-after-punct ADD:9749:no-space-after-punct-link
no-space-after-punct-mark

@snomos
Member

snomos commented Oct 20, 2021

Also: I found another related example with two full stops where one of them should be tokenized as part of the abbreviation and the other one as a sentence boundary, and maybe there should be only one full stop anyway. But for that to be recognized we should probably distinguish between them.

Materiála čohken ja girjji čállin Davvi Girji o.s..

This is easily fixed by adding an Err/something reading to full stops after abbreviations. Then both full stops will be read as part of the initial token and be given an analysis (as either CLB or not, depending on the context, but in both cases as an Err/xxx).
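
As a rough illustration of that idea (not the actual lexicon change), the Python sketch below keeps a doubled full stop inside the abbreviation token and returns two readings, both carrying an error tag. The function name `readings_for`, the `ABBREVIATIONS` set and the tag lists are assumptions; `Err/xxx` is the placeholder tag from the comment above.

```python
from typing import List, Tuple

# Hypothetical abbreviation list, written with the final full stop.
ABBREVIATIONS = {"o.s.", "jna."}


def readings_for(token: str) -> List[Tuple[str, List[str]]]:
    """Return (lemma, tags) readings for a token; empty list if unknown."""
    if token.endswith("..") and token[:-1] in ABBREVIATIONS:
        lemma = token[:-1]  # the abbreviation with a single full stop
        return [
            (lemma, ["ABBR", "Err/xxx", "CLB"]),  # extra stop also ends the clause
            (lemma, ["ABBR", "Err/xxx"]),         # extra stop is only an error
        ]
    if token in ABBREVIATIONS:
        return [(token, ["ABBR"])]
    return []
```

Both readings for `o.s..` keep the two full stops in one token, so a later disambiguation step could choose between the clause-boundary reading and the plain error reading.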

snomos added a commit that referenced this issue Oct 20, 2021
@lynnda-hill
Contributor Author

lynnda-hill commented Nov 18, 2021

Found another one with three full stops:

Oskujođieheaddjit plánejedje su jávkadit dan sivas go sii gáđaštedje su, muhto sis váilo ákkat áššáskuhttit su...

@lynnda-hill
Contributor Author

lynnda-hill commented Nov 18, 2021

We get the following analysis:

"<su>"
        "son" Pron Sem/Hum Pers Sg3 Gen <W:0.0> @<ADVL SUBSTITUTE:3530 MAP:23287:r520 #17->17
        "son" Pron Sem/Hum Pers Sg3 Acc <W:0.0> @<OBJ SUBSTITUTE:3530 MAP:23875:IfNoTransV> #17->17
;       "su" Adv ABBR Gram/NumNoAbbr <W:0.0> REMOVE:3659
"<..>"
        "." CLB Err/Orth <W:0.0> <NoSpaceAfterPunctMark> &no-space-after-punct-mark #18->18 ID:18 R:RIGHT:19 ADD:9837:no-space-after-punct ADD:9837:no-space-after-punct
no-space-after-punct-mark
        "." CLB Err/Orth <W:0.0> <NoSpaceAfterPunctMark> ".. ."S &no-space-after-punct-mark &SUGGESTWF #18->18 ID:18 R:RIGHT:19 ADD:9837:no-space-after-punct COPY:9854:no-space-after-punct-sugg
no-space-after-punct-mark
        ".." CLB <W:0.0> <NoSpaceAfterPunctMark> &no-space-after-punct-mark #18->18 ID:18 R:RIGHT:19 ADD:9837:no-space-after-punct ADD:9837:no-space-after-punct
no-space-after-punct-mark
        ".." CLB <W:0.0> <NoSpaceAfterPunctMark> ".. ."S &no-space-after-punct-mark &SUGGESTWF #18->18 ID:18 R:RIGHT:19 ADD:9837:no-space-after-punct COPY:9854:no-space-after-punct-sugg
no-space-after-punct-mark
;       "su" Adv ABBR Gram/NumNoAbbr Err/Orth <W:0.0> REMOVE:2782:du
"<.>"
        "." CLB <W:0.0> &no-space-after-punct-mark #19->19 ID:19 ADD:9847:no-space-after-punct-link ADD:9847:no-space-after-punct-link
no-space-after-punct-mark
        "." CLB <W:0.0> &LINK #19->19 ID:19 ADD:9847:no-space-after-punct-link ADDRELATION(RIGHT):9852:no-space-after-punct-rel ADD:9847:no-space-after-punct-link

@lynnda-hill
Contributor Author

We should fix it before the new release. I have seen a number of cases of it.

@flammie
Contributor

flammie commented Dec 7, 2021

I changed the dot lexicon so that one and three full stops work the same, i.e.:

"<ákkat>"
	"ágga" N Sem/Prod-cogn Pl Nom <W:0.0> SELECT:18692:r2368 #15->15
;	"ágga" N Sem/Prod-cogn Sg Acc PxSg2 <W:0.0> SELECT:18692:r2368
;	"ágga" N Sem/Prod-cogn Sg Gen PxSg2 <W:0.0> SELECT:18692:r2368
;	"ággi" N Sem/Dummytag Sg Acc PxSg2 Err/Spellrelax <W:0.0> REMOVE:2099
;	"ággi" N Sem/Dummytag Sg Gen PxSg2 Err/Spellrelax <W:0.0> REMOVE:2099
: 
"<áššáskuhttit>"
	"áššáskuhttit" Ex/V TV Der/NomAg N Sem/Hum Pl Nom <W:0.0> @<SUBJ MAP:23785 #16->16 SUBSTITUTE:3691
	"áššáskuhttit" <mv> V <TH-Acc-Any><RS-Loc-Any> <PA-Acc-Hum> <BE-Acc-Any><TH-AktioLoc> TV Ind Prs Pl1 <W:0.0> SUBSTITUTE:3090 SUBSTITUTE:3639 SUBSTITUTE:4242 @FS-<ADVL MAP:16636:r406 #16->16 SUBSTITUTE:3979:SubV=mv SUBSTITUTE:4017:SubV=FS-<ADVLmv
	"áššáskuhttit" <mv> V <TH-Acc-Any><RS-Loc-Any> <PA-Acc-Hum> <BE-Acc-Any><TH-AktioLoc> TV Inf <W:0.0> SUBSTITUTE:3090 SUBSTITUTE:3639 SUBSTITUTE:4242 @-FMAINV MAP:16438:-FMAINVInf #16->16 SUBSTITUTE:3979:SubV=mv
	"áššáskuvvat" Ex/V Ex/IV Der/h <mv> V TV Ind Prs Pl1 <W:0.0> @FS-<ADVL MAP:16636:r406 #16->16 SUBSTITUTE:3979:SubV=mv SUBSTITUTE:4017:SubV=FS-<ADVLmv
	"áššáskuvvat" Ex/V Ex/IV Der/h <mv> V TV Inf <W:0.0> @-FMAINV MAP:16438:-FMAINVInf #16->16 SUBSTITUTE:3979:SubV=mv
;	"áššáskuhttit" V <TH-Acc-Any><RS-Loc-Any> <PA-Acc-Hum> <BE-Acc-Any><TH-AktioLoc> TV Imprt Pl2 <W:0.0> SUBSTITUTE:3090 SUBSTITUTE:3639 SUBSTITUTE:4242 REMOVE:6200:r973
;	"áššáskuvvat" Ex/V Ex/IV Der/h V TV Imprt Pl2 <W:0.0> REMOVE:6200:r973
: 
"<su...>"
	"su" Adv ABBR Gram/NumNoAbbr <W:0.0> <LastCohort> @<ADVL MAP:23239 #17->17
;	"..." CLB <W:0.0> "<...>" <LastCohort>
;		"su" Adv ABBR Gram/NumNoAbbr <W:0.0> "<su>" <LastCohort> REMOVE:3107:longest-match
;	"..." CLB <W:0.0> "<...>" <LastCohort>
;		"son" Pron Pers Sg3 Gen <W:0.0> "<su>" <LastCohort> REMOVE:3107:longest-match
;	"..." CLB <W:0.0> "<...>" <LastCohort>
;		"son" Pron Pers Sg3 Acc <W:0.0> "<su>" <LastCohort> REMOVE:3107:longest-match
:\n

@snomos
Member

snomos commented Oct 11, 2022

Status today:

  • echo "leat mánáidgirjjit, divttat, aviisačállosat jna..." | modes/trace-smegramrelease.mode:
"<jna...>"
        "jna" Adv ABBR Gram/IAbbr <W:0.0> <LastCohort> @<ADVL MAP:23268
;       "..." CLB <W:0.0> "<...>" <LastCohort>
;               "jna" Adv ABBR Gram/IAbbr <W:0.0> "<jna>" <LastCohort> REMOVE:3116:longest-match
;       "..." CLB <W:0.0> "<...>" <LastCohort>
;               "jna" Adv ABBR Gram/IAbbr Attr <W:0.0> "<jna>" <LastCohort> REMOVE:3116:longest-match
  • echo "Materiála čohken ja girjji čállin Davvi Girji o.s.." | divvun-checker -a se.zcheck | jq .:
{
  "errs": [
    [
      "čohken",
      10,
      16,
      "real-čohkken",
      "\"čohken\" orru leamen čállinmeattáhus",
      [
        "čohkken"
      ],
      "Čállinmeattáhus dán oktavuođas"
    ],
    [
      "o.s..",
      46,
      51,
      "typo",
      "Ii leat sátnelisttus",
      [
        "o.s."
      ],
      "Čállinmeattáhus"
    ]
  ],
  "text": "Materiála čohken ja girjji čállin Davvi Girji o.s.."
}
  • echo "sis váilo ákkat áššáskuhttit su..." | modes/trace-smegramrelease.mode:
"<su...>"
        "su" Adv ABBR Gram/NumNoAbbr <W:0.0> <LastCohort> @<ADVL MAP:23268
;       "..." CLB <W:0.0> "<...>" <LastCohort>
;               "su" Adv ABBR Gram/NumNoAbbr <W:0.0> "<su>" <LastCohort> REMOVE:3116:longest-match
;       "..." CLB <W:0.0> "<...>" <LastCohort>
;               "son" Pron Pers Sg3 Gen <W:0.0> "<su>" <LastCohort> REMOVE:3116:longest-match
;       "..." CLB <W:0.0> "<...>" <LastCohort>
;               "son" Pron Pers Sg3 Acc <W:0.0> "<su>" <LastCohort> REMOVE:3116:longest-match
:\n

I.e. one out of three is fixed.
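
For anyone who wants to inspect the divvun-checker JSON shown above programmatically, here is a small Python sketch. The field names in `CheckerError` are my own reading of the seven positions in each `errs` entry and are not taken from any documentation; the sample is reduced to the second error from the output above.

```python
import json
from typing import List, NamedTuple


class CheckerError(NamedTuple):
    # Field names are assumptions based on the output above.
    form: str
    start: int
    end: int
    error_id: str
    message: str
    suggestions: List[str]
    title: str


def parse_errors(raw: str) -> List[CheckerError]:
    """Turn the positional 'errs' arrays into named tuples."""
    doc = json.loads(raw)
    return [CheckerError(*entry) for entry in doc["errs"]]


def offsets_match(raw: str) -> bool:
    """Check that text[start:end] reproduces the flagged form for every error."""
    text = json.loads(raw)["text"]
    return all(text[e.start:e.end] == e.form for e in parse_errors(raw))


# Sample rebuilt from the checker output above (second error only).
sample = json.dumps({
    "errs": [
        ["o.s..", 46, 51, "typo", "Ii leat sátnelisttus", ["o.s."],
         "Čállinmeattáhus"],
    ],
    "text": "Materiála čohken ja girjji čállin Davvi Girji o.s..",
})

for err in parse_errors(sample):
    print(err.form, "->", err.suggestions)
```

In this output the offsets appear to be character positions with an exclusive end, which is what `offsets_match` checks.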

@snomos
Member

snomos commented Nov 16, 2023

The present analysis looks like this:

"<aviisačállosat>"
        "aviisačálus" N Sem/Txt Pl Nom <W:0.0> <cohort-with-dynamic-compound> @APP-N< #20->20
: 
"<jna...>"
        "jna" Adv ABBR Gram/IAbbr <W:0.0> <LastCohort> @<ADVL #21->21
:\n

Produced with this command:

echo "Boarrásut materiálii gullet gielladutkiid čohken teavsttat, ođđasut materiálas maid dán \
artihkkalis ovdanbuvttán, leat mánáidgirjjit, divttat, aviisačállosat jna..." | \
./tools/grammarcheckers/modes/smegramrelease.mode

Is this good enough, @lynnda-hill, or should the three stops be a separate token?
