Description of the problem:
I am using Purepos as part of emtsv tok,morph,pos pipelines.
For both frequentative and causative/factitive verbs, it seems that when emMorph has no verb stem that already contains the frequentative and/or causative morpheme, Purepos always omits that morpheme from the lemma and puts its tag ([_Freq/V] or [_Caus/V]) in xpostag. For example, in the sentences below, the stem falogat[/V] does not appear in the emMorph analysis of falogatta, only the morpheme combination fal[/V] + ogat[_Freq/V], so its lemma becomes fal. Similarly, there is no táncoltat[/V] for táncoltat, so the lemma is táncol.
András kocsis szinte vidáman <<falogatta>>[lemma: fal; xpostag: [/V][_Freq/V][Pst.Def.3Sg]] a húst.
Öltöztetjük, ringatgatjuk, <<táncoltatjuk>>[lemma: táncol; xpostag: [/V][_Caus/V][Prs.Def.1Pl]], altatgatjuk.
On the other hand, when the causative or frequentative verb is (apparently) included in emMorph's dictionary, emMorph's output includes analyses such as meg[/Prev] + törülget[/V] for the form megtörülgette, or forgat[/V] for the form forgatja. When Purepos happens to select these analyses, its output looks like this:
Ahogy erre <<megtörülgette>>[lemma: megtörülget; xpostag: [/V][Pst.Def.3Sg]] a szemét, a két gyermek is sírva fakadt.
Meglássátok, hogy az én fakezű apám hogy <<forgatja>>[lemma: forgat; xpostag: [/V][Prs.Def.3Sg]] a kardot!
There are two problems with this:
Omitting the frequentative or causative morpheme from the lemma, as in the first case, sometimes results in a lemma that is clearly incorrect because it does not exist in Hungarian at all. This happens relatively often for verbs with preverbs:
Dobó ismét kisütötte a maga ágyúit, s ismét <<felforgatta>>[lemma: felforog; xpostag: [/V][_Caus/V][Pst.Def.3Sg]]
a kasokat, s ágyúkat, de a feldőlt kasok mögött új kasok emelkedtek, s azok mellett új ágyúk.
A fiú <<megrángatta>>[lemma: megráng; xpostag: [/V][_Caus/V][Pst.Def.3Sg]] a kantárt, s a szürke megindult,
ki az erdőből: vonta, vitte magával a török lovat is.
Végre <<megzörgette>>[lemma: megzörög; xpostag: [/V][_Caus/V][Pst.Def.3Sg]] a vasajtót is.
Much more often the assigned non-derived lemma does exist, but the derivation changes its meaning significantly, so I believe the lemmatisation is still clearly incorrect:
A képeket <<elégették>>[lemma: elég; xpostag: [/V][_Caus/V][Pst.Def.3Pl]].
Mindaddig ezt a rengeteg hadat <<etetni>>[lemma: eszik; xpostag: [/V][_Caus/V][Inf]] kell.
Dobó <<lehányatta>>[lemma: lehány; xpostag: [/V][_Caus/V][Pst.Def.3Sg]] az istáló tetejét is.
A minisztráló fiu <<csenget>>[lemma: cseng; xpostag: [/V][_Caus/V][Prs.NDef.3Sg]].
In some cases the selected analysis is wrong anyway. If the assigned lemma contained the derivational morpheme and the xpostag did not contain the causative tag, which in my opinion is what the output should look like, this mistake would not be visible:
Kissé <<megzavargatjátok>>[lemma: megzavarog; xpostag: [/V][_Caus/V][Prs.Def.2Pl]] őket.
És csak a szakálasokból pukkantott olykor közibük, hogy a munkájukat
<<zavargassa>>[lemma: zavarog; xpostag: [/V][_Caus/V][Sbjv.Def.3Sg]].
When the emMorph analysis contains both the "unanalysed" relative stem of the form (e.g. beszélget[/V]) and the same stem split up into its base and the derivational morpheme (e.g. beszél[/V] + get[_Freq/V]), Purepos chooses between the lemmas beszél and beszélget in a way that looks completely random from the user's point of view (the decision is surely deterministic internally, but that doesn't help). It is therefore impossible to know in advance whether forms of a causative or frequentative verb will be lemmatized to the stem that includes the derivational morpheme or to the base of the derivation. As a result, different forms of the very same verb, or even the exact same form in different contexts, sometimes end up with different lemmas:
beszélget/beszél:
Aztán mindenféle hadi ügyekről, törökről, németről
<<beszélgettek>>[lemma: beszélget; xpostag: [/V][Prs.NDef.2Pl]] az asztalnál.
- Hová sietnél? - csodálkozott reá, - hiszen még nem is
- <<beszélgettünk>>[lemma: beszél; xpostag: [/V][_Freq/V][Pst.NDef.1Pl]].
megtörölget/megtöröl:
S <<megtörölgette>>[lemma: megtörölget; xpostag: [/V][Pst.Def.3Sg]] izzadt arcát a kendőjében.
Az aranygombjait <<megtörölgette>>[lemma: megtöröl; xpostag: [/V][_Freq/V][Pst.Def.3Sg]] szarvasbőrrel.
forgat/forog:
Most hát értsétek meg: csak a legislegkiválóbbakat akartam megdícsérni, akik az életüket a hazáért a bizonyos halál
veszedelmében <<forgatták>>[lemma: forgat; xpostag: [/V][Pst.Def.3Pl]].
Egyik-másik könyvet meg is <<forgatta>>[lemma: forog; xpostag: [/V][_Caus/V][Pst.Def.3Sg]], hogy képesek-e?
öltöztet/öltözik:
Megmosdatta, <<felöltöztette>>[lemma: felöltözik; xpostag: [/V][_Caus/V][Pst.Def.3Sg]].
A gyermek megint sírva fakadt: - Meded, meded! (jaj-jaj) Vasné letérdelt és szótlanul
<<öltöztette>>[lemma: öltözik; xpostag: [/V][_Caus/V][Pst.Def.3Sg]] a gyereket.
- Férfi ruhába <<öltöztették>>[lemma: öltöztet; xpostag: [/V][Pst.Def.3Pl]].
Note that in the causative case this is not about disambiguating between past tense and causative forms (e.g. ejtette can be a form of either ejt or ejtet). In the above examples the causative analysis is chosen correctly; it is just that the tagger's behavior is inconsistent.
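The inconsistency described above can be detected mechanically in tagged output. Below is a minimal sketch, assuming the tagger output has already been read into (form, lemma, xpostag) tuples; the function name `inconsistent_lemmas` and the input layout are made up for illustration and are not part of emtsv or Purepos.

```python
from collections import defaultdict


def inconsistent_lemmas(tokens):
    """Return the forms that were assigned more than one lemma.

    tokens: iterable of (form, lemma, xpostag) tuples (hypothetical layout).
    The result maps each lowercased form to the set of its lemmas.
    """
    lemmas_by_form = defaultdict(set)
    for form, lemma, _xpostag in tokens:
        lemmas_by_form[form.lower()].add(lemma)
    # Keep only the forms whose lemma is not unique.
    return {form: lemmas for form, lemmas in lemmas_by_form.items() if len(lemmas) > 1}


# Tokens copied from the examples in this report.
tokens = [
    ("megtörölgette", "megtörölget", "[/V][Pst.Def.3Sg]"),
    ("megtörölgette", "megtöröl", "[/V][_Freq/V][Pst.Def.3Sg]"),
    ("forgatja", "forgat", "[/V][Prs.Def.3Sg]"),
]
print(inconsistent_lemmas(tokens))
```

This only catches identical forms with different lemmas (such as the two occurrences of megtörölgette). Different forms of the same verb, like forgatták and forgatta, would need to be grouped by some other key.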
Suggested solution:
I would suggest that a sensible solution for both the incorrect lemmas and the inconsistent, apparently random behavior of the tagger would be to always assign the derived lemma to frequentative and causative/factitive tokens. For the sake of consistency, the tags of the causative/factitive and frequentative morphemes should then not appear in the xpostag at all, e.g. <<forgatta>>[lemma: forgat; xpostag: [/V][Pst.Def.3Sg]] and <<megtörölgette>>[lemma: megtörölget; xpostag: [/V][Pst.Def.3Sg]].
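To make the proposal concrete, here is a rough post-processing sketch of the folding I have in mind. It assumes each analysis is available as a list of (lexical form, tag, surface form) triples. That layout, the function name `fold_derivation`, and the hand-written segmentations below are illustrative assumptions, not actual emMorph or Purepos output or API. The proper fix probably belongs in the tagger or the lexicon rather than in a script like this.

```python
import re

# Derivational tags look like "[_Freq/V]": an underscore-prefixed name,
# a slash, then the category of the derived word.
DERIV_TAG = re.compile(r"^\[_[^/\]]+/([^\]]+)\]$")


def fold_derivation(morphs):
    """Fold derivational morphemes into the lemma.

    morphs: list of (lexical, tag, surface) triples in surface order
    (a hypothetical layout). Returns (lemma, xpostag), or None when the
    analysis contains no derivational morpheme.
    """
    deriv_positions = [i for i, (_, tag, _) in enumerate(morphs) if DERIV_TAG.match(tag)]
    if not deriv_positions:
        return None
    last = deriv_positions[-1]
    lexical, tag, _ = morphs[last]
    # Surface forms before the last derivation keep stem alternations
    # (forog -> forg-); the last derivational suffix uses its lexical form
    # so that changes caused by the following inflection stay out of the lemma.
    lemma = "".join(surface for _, _, surface in morphs[:last]) + lexical
    category = "[/" + DERIV_TAG.match(tag).group(1) + "]"
    xpostag = category + "".join(t for _, t, _ in morphs[last + 1:])
    return lemma, xpostag


# Hand-written segmentations for illustration only.
falogatta = [("fal", "[/V]", "fal"), ("ogat", "[_Freq/V]", "ogat"),
             ("ta", "[Pst.Def.3Sg]", "ta")]
felforgatta = [("fel", "[/Prev]", "fel"), ("forog", "[/V]", "forg"),
               ("at", "[_Caus/V]", "at"), ("ta", "[Pst.Def.3Sg]", "ta")]

print(fold_derivation(falogatta))    # ('falogat', '[/V][Pst.Def.3Sg]')
print(fold_derivation(felforgatta))  # ('felforgat', '[/V][Pst.Def.3Sg]')
```

With this folding, felforgatta gets the lemma felforgat instead of the non-existent felforog, and the [_Caus/V] tag no longer appears in the xpostag.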