tokenization of "..." should be one token #31

lynnda-hill · 2021-10-18T10:05:32Z

In the following sentence we have the expression "and so on ...". But the three dots are tokenized as one full stop followed by a separate token of two full stops, which causes a problem for the grammar checker. We should at least allow for the option of tokenizing all three dots as one token (which could maybe then be disambiguated in mwe-dis.cg3).

Example:

Boarrásut materiálii gullet gielladutkiid čohken teavsttat, ođđasut materiálas maid dán artihkkalis ovdanbuvttán, leat mánáidgirjjit, divttat, aviisačállosat jna...

pipeline:

... | tools/grammarcheckers/modes/trace-smegramrelease-dev.mode | less

Output:

"<jna>"
        "jna" Adv ABBR Gram/IAbbr <W:0.0> @<ADVL MAP:23205 #21->21
        "jna" Adv ABBR Gram/IAbbr Attr <W:0.0> @<ADVL MAP:23205 #21->21
"<.>"
        "." CLB <W:0.0> <NoSpaceAfterPunctMark> &no-space-after-punct-mark #22->22 ID:22 R:RIGHT:24 ADD:9737:no-space-after-punct ADD:9737:no-space-after-punct
no-space-after-punct-mark
        "." CLB <W:0.0> <NoSpaceAfterPunctMark> ". .."S &no-space-after-punct-mark &SUGGESTWF #22->22 ID:22 R:RIGHT:24 ADD:9737:no-space-after-punct COPY:9754:no-space-after-punct-sugg
no-space-after-punct-mark
;       "jna" Adv ABBR Gram/IAbbr <W:0.0> REMOVE:2969

"<..>"
        ".." CLB <W:0.0> &no-space-after-punct-mark &LINK #1->1 ID:24 ADD:9718:double-space-before-link ADDRELATION($2):9719:double-space-before-rel ADDRELATION(LEFT):9720:double-space-before-rel ADD:9747:no-space-after-punct-link ADD:9718:double-space-before-link ADD:9747:no-space-after-punct-link
no-space-after-punct-mark
        ".." CLB <W:0.0> &double-space-before #1->1 ID:24 ADD:9718:double-space-before-link ADD:9747:no-space-after-punct-link ADD:9718:double-space-before-link ADD:9747:no-space-after-punct-link
double-space-before
        ".." CLB <W:0.0> &LINK #1->1 ID:24 ADD:9718:double-space-before-link ADDRELATION($2):9719:double-space-before-rel ADDRELATION(LEFT):9720:double-space-before-rel ADD:9747:no-space-after-punct-link ADDRELATION(RIGHT):9752:no-space-after-punct-rel ADD:9718:double-space-before-link ADD:9747:no-space-after-punct-link


flammie · 2021-10-19T03:57:18Z

The combination of an abbreviation ending in a full stop plus an ellipsis comes up quite often in all languages. At least in Finnish typography and grammar rules, three full stops is the correct form, and we have decided to tokenise it so that the surface form of the abbreviation keeps one full stop, while the ellipsis has two full stops on the surface but three (or the ellipsis symbol) on the analysis level. I.e. I propose a giella-shared/all_langs/punctuation.lexc entry like ..:... CLBcont ; (or CLBerrorth). But I haven't gotten that to work nicely with the rest of the CG: the reading should be removed unless -1 ends in . or so. Maybe another approach would be better...
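
A minimal sketch of the tokenisation convention described in this comment, written in Python purely for illustration. It is not the lexc/CG implementation; the function name `tokenise_final_dots` and the abbreviation list are hypothetical.

```python
from typing import List, Tuple

# Hypothetical abbreviation list, used only for this illustration.
ABBREVIATIONS = {"jna.", "o.s."}


def tokenise_final_dots(word: str) -> List[Tuple[str, str]]:
    """Split a word ending in exactly three full stops into (surface, analysis) pairs."""
    if not word.endswith("...") or word.endswith("...."):
        return [(word, word)]
    stem = word[:-3]
    if not stem:
        # A bare ellipsis is a single token.
        return [("...", "...")]
    if stem + "." in ABBREVIATIONS:
        # The abbreviation keeps one full stop on the surface; the ellipsis
        # keeps the remaining two but is analysed as a three-dot ellipsis.
        return [(stem + ".", stem + "."), ("..", "...")]
    # No abbreviation involved: the three full stops form one ordinary token.
    return [(stem, stem), ("...", "...")]


print(tokenise_final_dots("jna..."))
```

With the hypothetical list above, "jna..." is split into the abbreviation "jna." and an ellipsis whose surface form is ".." and whose analysis is "...", while a word that is not a listed abbreviation is followed by a plain "..." token.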

lynnda-hill · 2021-10-20T13:23:50Z

@flammie I think we already have another single-token analysis of the three full stops (the ones that are closer to each other), so this should probably be the same one. And if it is an error, which I guess depends on the norm, then the error tag should be similar to other error tags, e.g. Err/CLB or so.

Also:
I found another related example with two full stops where one of them should be tokenized as part of the abbreviation and the other one as a sentence boundary, and maybe there should be only one full stop anyway. But for that to be recognized we should probably distinguish between them.

Materiála čohken ja girjji čállin Davvi Girji o.s..

This is the analysis I get:

"<Davvi Girji>"
        "Davvi Girji" N Prop Sem/Org Sg Nom <W:0.0> @N< MAP:22355:r112 #6->6 SUBSTITUTE:9967
;       "Davvi Girji" MWE N Prop Sem/Org Attr <W:0.0> REMOVE:16806:r1912
: 
"<o.s>"
        "o.s" N <NomGenSg> ABBR Gram/IAbbr Sg Acc <W:0.0> @<OBJ SUBSTITUTE:3534 SELECT:17130:r2021 MAP:23862:IfNoTransV> #7->7
;       "o.s" N <NomGenSg> ABBR Gram/IAbbr Attr <W:0.0> SUBSTITUTE:3534 SELECT:17130:r2021
;       "o.s" N <NomGenSg> ABBR Gram/IAbbr Sg Gen <W:0.0> SUBSTITUTE:3534 SELECT:17130:r2021
;       "o.s" N <NomGenSg> ABBR Gram/IAbbr Sg Nom <W:0.0> SUBSTITUTE:3534 SELECT:17130:r2021
"<.>"
        "." CLB <W:0.0> <NoSpaceAfterPunctMark> &no-space-after-punct-mark #8->8 ID:8 R:RIGHT:10 ADD:9739:no-space-after-punct ADD:9739:no-space-after-punct
no-space-after-punct-mark
        "." CLB <W:0.0> <NoSpaceAfterPunctMark> ". ."S &no-space-after-punct-mark &SUGGESTWF #8->8 ID:8 R:RIGHT:10 ADD:9739:no-space-after-punct COPY:9756:no-space-after-punct-sugg
no-space-after-punct-mark
;       "o.s" N ABBR Gram/IAbbr Attr <W:0.0> REMOVE:2969
;       "o.s" N ABBR Gram/IAbbr Sg Acc <W:0.0> REMOVE:2969
;       "o.s" N ABBR Gram/IAbbr Sg Gen <W:0.0> REMOVE:2969
;       "o.s" N ABBR Gram/IAbbr Sg Nom <W:0.0> REMOVE:2969

"<.>"
        "." CLB <W:0.0> <NoSpaceAfterPunctMark> &LINK #1->1 ID:10 ADD:9739:no-space-after-punct ADD:9749:no-space-after-punct-link ADDRELATION(RIGHT):9754:no-space-after-punct-rel ADD:9739:no-space-after-punct ADD:9749:no-space-after-punct-link
        "." CLB <W:0.0> <NoSpaceAfterPunctMark> &no-space-after-punct-mark #1->1 ID:10 ADD:9739:no-space-after-punct ADD:9749:no-space-after-punct-link ADD:9739:no-space-after-punct ADD:9749:no-space-after-punct-link
no-space-after-punct-mark

snomos · 2021-10-20T15:54:58Z

> Also: I found another related example with two full stops where one of them should be tokenized as part of the abbreviation and the other one as a sentence boundary, and maybe there should be only one full stop anyway. But for that to be recognized we should probably distinguish between them.
>
> Materiála čohken ja girjji čállin Davvi Girji o.s..

This is easily fixed by adding an Err/something reading to full stops after abbreviations. Then both full stops will be read as part of the initial token and be given an analysis (either as CLB or not, depending on context, but in both cases as an Err/xxx).


lynnda-hill · 2021-11-18T10:05:26Z

Found another one with three full stops:

Oskujođieheaddjit plánejedje su jávkadit dan sivas go sii gáđaštedje su, muhto sis váilo ákkat áššáskuhttit su...

lynnda-hill · 2021-11-18T10:06:31Z

We get the following analysis:

"<su>"
        "son" Pron Sem/Hum Pers Sg3 Gen <W:0.0> @<ADVL SUBSTITUTE:3530 MAP:23287:r520 #17->17
        "son" Pron Sem/Hum Pers Sg3 Acc <W:0.0> @<OBJ SUBSTITUTE:3530 MAP:23875:IfNoTransV> #17->17
;       "su" Adv ABBR Gram/NumNoAbbr <W:0.0> REMOVE:3659
"<..>"
        "." CLB Err/Orth <W:0.0> <NoSpaceAfterPunctMark> &no-space-after-punct-mark #18->18 ID:18 R:RIGHT:19 ADD:9837:no-space-after-punct ADD:9837:no-space-after-punct
no-space-after-punct-mark
        "." CLB Err/Orth <W:0.0> <NoSpaceAfterPunctMark> ".. ."S &no-space-after-punct-mark &SUGGESTWF #18->18 ID:18 R:RIGHT:19 ADD:9837:no-space-after-punct COPY:9854:no-space-after-punct-sugg
no-space-after-punct-mark
        ".." CLB <W:0.0> <NoSpaceAfterPunctMark> &no-space-after-punct-mark #18->18 ID:18 R:RIGHT:19 ADD:9837:no-space-after-punct ADD:9837:no-space-after-punct
no-space-after-punct-mark
        ".." CLB <W:0.0> <NoSpaceAfterPunctMark> ".. ."S &no-space-after-punct-mark &SUGGESTWF #18->18 ID:18 R:RIGHT:19 ADD:9837:no-space-after-punct COPY:9854:no-space-after-punct-sugg
no-space-after-punct-mark
;       "su" Adv ABBR Gram/NumNoAbbr Err/Orth <W:0.0> REMOVE:2782:du
"<.>"
        "." CLB <W:0.0> &no-space-after-punct-mark #19->19 ID:19 ADD:9847:no-space-after-punct-link ADD:9847:no-space-after-punct-link
no-space-after-punct-mark
        "." CLB <W:0.0> &LINK #19->19 ID:19 ADD:9847:no-space-after-punct-link ADDRELATION(RIGHT):9852:no-space-after-punct-rel ADD:9847:no-space-after-punct-link

lynnda-hill · 2021-11-29T13:37:50Z

We should fix it before the new release. I have seen a number of cases of it.

flammie · 2021-12-07T09:03:48Z

I changed the dot lexicon so one and three full-stops work the same, i.e.:

"<ákkat>"
	"ágga" N Sem/Prod-cogn Pl Nom <W:0.0> SELECT:18692:r2368 #15->15
;	"ágga" N Sem/Prod-cogn Sg Acc PxSg2 <W:0.0> SELECT:18692:r2368
;	"ágga" N Sem/Prod-cogn Sg Gen PxSg2 <W:0.0> SELECT:18692:r2368
;	"ággi" N Sem/Dummytag Sg Acc PxSg2 Err/Spellrelax <W:0.0> REMOVE:2099
;	"ággi" N Sem/Dummytag Sg Gen PxSg2 Err/Spellrelax <W:0.0> REMOVE:2099
: 
"<áššáskuhttit>"
	"áššáskuhttit" Ex/V TV Der/NomAg N Sem/Hum Pl Nom <W:0.0> @<SUBJ MAP:23785 #16->16 SUBSTITUTE:3691
	"áššáskuhttit" <mv> V <TH-Acc-Any><RS-Loc-Any> <PA-Acc-Hum> <BE-Acc-Any><TH-AktioLoc> TV Ind Prs Pl1 <W:0.0> SUBSTITUTE:3090 SUBSTITUTE:3639 SUBSTITUTE:4242 @FS-<ADVL MAP:16636:r406 #16->16 SUBSTITUTE:3979:SubV=mv SUBSTITUTE:4017:SubV=FS-<ADVLmv
	"áššáskuhttit" <mv> V <TH-Acc-Any><RS-Loc-Any> <PA-Acc-Hum> <BE-Acc-Any><TH-AktioLoc> TV Inf <W:0.0> SUBSTITUTE:3090 SUBSTITUTE:3639 SUBSTITUTE:4242 @-FMAINV MAP:16438:-FMAINVInf #16->16 SUBSTITUTE:3979:SubV=mv
	"áššáskuvvat" Ex/V Ex/IV Der/h <mv> V TV Ind Prs Pl1 <W:0.0> @FS-<ADVL MAP:16636:r406 #16->16 SUBSTITUTE:3979:SubV=mv SUBSTITUTE:4017:SubV=FS-<ADVLmv
	"áššáskuvvat" Ex/V Ex/IV Der/h <mv> V TV Inf <W:0.0> @-FMAINV MAP:16438:-FMAINVInf #16->16 SUBSTITUTE:3979:SubV=mv
;	"áššáskuhttit" V <TH-Acc-Any><RS-Loc-Any> <PA-Acc-Hum> <BE-Acc-Any><TH-AktioLoc> TV Imprt Pl2 <W:0.0> SUBSTITUTE:3090 SUBSTITUTE:3639 SUBSTITUTE:4242 REMOVE:6200:r973
;	"áššáskuvvat" Ex/V Ex/IV Der/h V TV Imprt Pl2 <W:0.0> REMOVE:6200:r973
: 
"<su...>"
	"su" Adv ABBR Gram/NumNoAbbr <W:0.0> <LastCohort> @<ADVL MAP:23239 #17->17
;	"..." CLB <W:0.0> "<...>" <LastCohort>
;		"su" Adv ABBR Gram/NumNoAbbr <W:0.0> "<su>" <LastCohort> REMOVE:3107:longest-match
;	"..." CLB <W:0.0> "<...>" <LastCohort>
;		"son" Pron Pers Sg3 Gen <W:0.0> "<su>" <LastCohort> REMOVE:3107:longest-match
;	"..." CLB <W:0.0> "<...>" <LastCohort>
;		"son" Pron Pers Sg3 Acc <W:0.0> "<su>" <LastCohort> REMOVE:3107:longest-match
:\n

snomos · 2022-10-11T20:33:30Z

Status today:

echo "leat mánáidgirjjit, divttat, aviisačállosat jna..." | modes/trace-smegramrelease.mode:

"<jna...>"
        "jna" Adv ABBR Gram/IAbbr <W:0.0> <LastCohort> @<ADVL MAP:23268
;       "..." CLB <W:0.0> "<...>" <LastCohort>
;               "jna" Adv ABBR Gram/IAbbr <W:0.0> "<jna>" <LastCohort> REMOVE:3116:longest-match
;       "..." CLB <W:0.0> "<...>" <LastCohort>
;               "jna" Adv ABBR Gram/IAbbr Attr <W:0.0> "<jna>" <LastCohort> REMOVE:3116:longest-match

echo "Materiála čohken ja girjji čállin Davvi Girji o.s.." | divvun-checker -a se.zcheck | jq .:

{
  "errs": [
    [
      "čohken",
      10,
      16,
      "real-čohkken",
      "\"čohken\" orru leamen čállinmeattáhus",
      [
        "čohkken"
      ],
      "Čállinmeattáhus dán oktavuođas"
    ],
    [
      "o.s..",
      46,
      51,
      "typo",
      "Ii leat sátnelisttus",
      [
        "o.s."
      ],
      "Čállinmeattáhus"
    ]
  ],
  "text": "Materiála čohken ja girjji čállin Davvi Girji o.s.."
}
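
To make the JSON above easier to read, here is a small Python sketch that unpacks each entry of "errs" into named fields. The field names are assumptions inferred from the values shown here (word form, start and end offsets, error id, message, suggestions, title), not taken from the divvun-checker documentation.

```python
import json
from typing import List, NamedTuple


class CheckerError(NamedTuple):
    # Field names are guesses based on the output in this thread.
    form: str
    begin: int
    end: int
    error_id: str
    message: str
    suggestions: List[str]
    title: str


def parse_errors(raw: str) -> List[CheckerError]:
    """Turn the positional "errs" arrays into named tuples."""
    data = json.loads(raw)
    return [CheckerError(*entry) for entry in data["errs"]]


# The second error from the output above, re-serialised for the example.
SAMPLE = json.dumps({
    "errs": [
        ["o.s..", 46, 51, "typo", "Ii leat sátnelisttus", ["o.s."], "Čállinmeattáhus"],
    ],
    "text": "Materiála čohken ja girjji čállin Davvi Girji o.s..",
})

for err in parse_errors(SAMPLE):
    print(err.form, err.begin, err.end, err.error_id, err.suggestions)
```

In this sample the offsets 46 and 51 match the position of "o.s.." in "text" when counted in characters, so the double full stop is reported as a typo with "o.s." as the suggested replacement.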

echo "sis váilo ákkat áššáskuhttit su..." | modes/trace-smegramrelease.mode:

"<su...>"
        "su" Adv ABBR Gram/NumNoAbbr <W:0.0> <LastCohort> @<ADVL MAP:23268
;       "..." CLB <W:0.0> "<...>" <LastCohort>
;               "su" Adv ABBR Gram/NumNoAbbr <W:0.0> "<su>" <LastCohort> REMOVE:3116:longest-match
;       "..." CLB <W:0.0> "<...>" <LastCohort>
;               "son" Pron Pers Sg3 Gen <W:0.0> "<su>" <LastCohort> REMOVE:3116:longest-match
;       "..." CLB <W:0.0> "<...>" <LastCohort>
;               "son" Pron Pers Sg3 Acc <W:0.0> "<su>" <LastCohort> REMOVE:3116:longest-match
:\n

I.e. one out of three is fixed.

snomos · 2023-11-16T09:35:01Z

The present analysis looks like this:

"<aviisačállosat>"
        "aviisačálus" N Sem/Txt Pl Nom <W:0.0> <cohort-with-dynamic-compound> @APP-N< #20->20
: 
"<jna...>"
        "jna" Adv ABBR Gram/IAbbr <W:0.0> <LastCohort> @<ADVL #21->21
:\n

Produced with this command:

echo "Boarrásut materiálii gullet gielladutkiid čohken teavsttat, ođđasut materiálas maid dán \
artihkkalis ovdanbuvttán, leat mánáidgirjjit, divttat, aviisačállosat jna..." | \
./tools/grammarcheckers/modes/smegramrelease.mode

Is this good enough, @lynnda-hill , or should the three stops be a separate token?

lynnda-hill added the bug label Oct 18, 2021

lynnda-hill assigned snomos, flammie and duomdaamaendra Oct 18, 2021

snomos added a commit that referenced this issue Oct 20, 2021

Add Err/Orth variants for cases of abbreviations followed by two full stops. As discussed in #31.

418edb8
