Inconsistent and arguably incorrect lemmas for frequentative and causative/factitive verbs #16

Open
gpetho opened this issue Feb 4, 2022 · 0 comments


Description of the problem:
I am using Purepos as part of emtsv tok,morph,pos pipelines.

For both frequentative and causative/factitive verbs, it seems that when emMorph doesn't find a verb stem that already contains the frequentative and/or causative morpheme, Purepos always omits that morpheme from the lemma and puts the corresponding [_Freq/V] or [_Caus/V] tag in xpostag. For example, in the sentences below, the emMorph analysis of falogatta does not contain the stem falogat[/V], only the morpheme combination fal[/V] + ogat[_Freq/V], so its lemma becomes fal. Similarly for táncoltat: there is no táncoltat[/V], so the lemma is táncol.

András kocsis szinte vidáman <<falogatta>>[lemma: fal; xpostag: [/V][_Freq/V][Pst.Def.3Sg]] a húst. 
Öltöztetjük, ringatgatjuk, <<táncoltatjuk>>[lemma: táncol; xpostag: [/V][_Caus/V][Prs.Def.1Pl]], altatgatjuk. 

On the other hand, when the causative or frequentative verb is (apparently) included in emMorph's dictionary, emMorph's output includes analyses such as meg[/Prev] + törülget[/V] for the form megtörülgette, or forgat[/V] for the form forgatja. When Purepos happens to select these analyses, its output looks like this:

Ahogy erre <<megtörülgette>>[lemma: megtörülget; xpostag: [/V][Pst.Def.3Sg]] a szemét, a két gyermek is sírva fakadt. 
Meglássátok, hogy az én fakezű apám hogy <<forgatja>>[lemma: forgat; xpostag: [/V][Prs.Def.3Sg]] a kardot! 

There are two problems with this:

  1. Omitting the frequentative or causative morpheme from the lemma, like in the first case, sometimes results in a lemma that seems clearly incorrect in the sense that it doesn't even exist in Hungarian. This happens relatively often for verbs with preverbs:
Dobó ismét kisütötte a maga ágyúit, s ismét <<felforgatta>>[lemma: felforog; xpostag: [/V][_Caus/V][Pst.Def.3Sg]]
a kasokat, s ágyúkat, de a feldőlt kasok mögött új kasok emelkedtek, s azok mellett új ágyúk.
A fiú <<megrángatta>>[lemma: megráng; xpostag: [/V][_Caus/V][Pst.Def.3Sg]] a kantárt, s a szürke megindult,
ki az erdőből: vonta, vitte magával a török lovat is.
Végre <<megzörgette>>[lemma: megzörög; xpostag: [/V][_Caus/V][Pst.Def.3Sg]] a vasajtót is. 

Much more often the assigned non-derived lemma does exist, but the derivation changes its meaning significantly, so I believe the lemmatisation is still clearly incorrect:

A képeket <<elégették>>[lemma: elég; xpostag: [/V][_Caus/V][Pst.Def.3Pl]]. 
Mindaddig ezt a rengeteg hadat <<etetni>>[lemma: eszik; xpostag: [/V][_Caus/V][Inf]] kell. 
Dobó <<lehányatta>>[lemma: lehány; xpostag: [/V][_Caus/V][Pst.Def.3Sg]] az istáló tetejét is. 
A minisztráló fiu <<csenget>>[lemma: cseng; xpostag: [/V][_Caus/V][Prs.NDef.3Sg]]. 

In some cases the selected analysis is wrong anyway. However, if the assigned lemma contained the derivational morpheme and the xpostag did not contain the causative tag (which is what the output should look like in my opinion), this mistake would not be visible:

Kissé <<megzavargatjátok>>[lemma: megzavarog; xpostag: [/V][_Caus/V][Prs.Def.2Pl]] őket.
És csak a szakálasokból pukkantott olykor közibük, hogy a munkájukat
    <<zavargassa>>[lemma: zavarog; xpostag: [/V][_Caus/V][Sbjv.Def.3Sg]]. 
  2. When the emMorph analysis contains both the "unanalysed" relative stem of the form (e.g. beszélget[/V]) and the same stem split into its base and the derivational morpheme (e.g. beszél[/V] + get[_Freq/V]), Purepos chooses between the lemmas beszél and beszélget in a way that looks completely random from the user's point of view (I'm sure the decision is deterministic internally, but that doesn't help). It is therefore impossible to know in advance whether forms of a causative or frequentative verb will be lemmatized to the stem that includes the derivational morpheme or to the base of the derivation. As a result, different forms of the very same verb, or even the exact same form in different contexts, sometimes end up with different lemmas:
beszélget/beszél:
Aztán mindenféle hadi ügyekről, törökről, németről
    <<beszélgettek>>[lemma: beszélget; xpostag: [/V][Prs.NDef.2Pl]] az asztalnál. 
- Hová sietnél? - csodálkozott reá, - hiszen még nem is
    - <<beszélgettünk>>[lemma: beszél; xpostag: [/V][_Freq/V][Pst.NDef.1Pl]]. 

megtörölget/megtöröl:
S <<megtörölgette>>[lemma: megtörölget; xpostag: [/V][Pst.Def.3Sg]] izzadt arcát a kendőjében. 
Az aranygombjait <<megtörölgette>>[lemma: megtöröl; xpostag: [/V][_Freq/V][Pst.Def.3Sg]] szarvasbőrrel. 

forgat/forog:
Most hát értsétek meg: csak a legislegkiválóbbakat akartam megdícsérni, akik az életüket a hazáért a bizonyos halál
    veszedelmében <<forgatták>>[lemma: forgat; xpostag: [/V][Pst.Def.3Pl]]. 
Egyik-másik könyvet meg is <<forgatta>>[lemma: forog; xpostag: [/V][_Caus/V][Pst.Def.3Sg]], hogy képesek-e? 

öltöztet/öltözik:
Megmosdatta, <<felöltöztette>>[lemma: felöltözik; xpostag: [/V][_Caus/V][Pst.Def.3Sg]].
A gyermek megint sírva fakadt: - Meded, meded! (jaj-jaj) Vasné letérdelt és szótlanul
    <<öltöztette>>[lemma: öltözik; xpostag: [/V][_Caus/V][Pst.Def.3Sg]] a gyereket. 
- Férfi ruhába <<öltöztették>>[lemma: öltöztet; xpostag: [/V][Pst.Def.3Pl]]. 

Note that in the causative case this is not a matter of disambiguating between past tense and causative forms (e.g. ejtette can be a form of either ejt or ejtet). In the above examples the causative analysis is chosen correctly; it is just that the tagger's behavior is inconsistent.
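
The inconsistency in problem 2 can be surfaced mechanically. The following is only a minimal sketch, not part of emtsv or Purepos: it assumes the tagger output has already been collected into (form, lemma, xpostag) triples, and the function name `inconsistent_forms` and the sample list are hypothetical. It reports identical surface forms that received more than one lemma.

```python
from collections import defaultdict

# Hypothetical sample of tagger output, copied from the examples above:
# (surface form, assigned lemma, assigned xpostag).
TOKENS = [
    ("megtörölgette", "megtörölget", "[/V][Pst.Def.3Sg]"),
    ("megtörölgette", "megtöröl", "[/V][_Freq/V][Pst.Def.3Sg]"),
    ("forgatták", "forgat", "[/V][Pst.Def.3Pl]"),
    ("forgatta", "forog", "[/V][_Caus/V][Pst.Def.3Sg]"),
    ("falogatta", "fal", "[/V][_Freq/V][Pst.Def.3Sg]"),
]


def inconsistent_forms(tokens):
    """Return {form: sorted lemmas} for forms that were given more than one lemma."""
    lemmas_by_form = defaultdict(set)
    for form, lemma, _xpostag in tokens:
        # Lowercase so that sentence-initial capitalisation does not hide a clash.
        lemmas_by_form[form.lower()].add(lemma)
    return {
        form: sorted(lemmas)
        for form, lemmas in lemmas_by_form.items()
        if len(lemmas) > 1
    }


if __name__ == "__main__":
    for form, lemmas in inconsistent_forms(TOKENS).items():
        print(form, "->", ", ".join(lemmas))
```

This only catches clashes on identical forms (such as megtörölgette); different inflected forms of the same verb (forgatták vs. forgatta) would need to be grouped by the derived stem first.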

Suggested solution:
I would suggest that a sensible solution for both the incorrect lemmas and the inconsistent, apparently random behavior of the tagger would be to always assign the derived lemma to frequentative and causative/factitive tokens. Likewise, for the sake of consistency, the tags of the causative/factitive and frequentative morphemes should not appear in the xpostag at all, e.g. <<forgatta>>[lemma: forgat; xpostag: [/V][Pst.Def.3Sg]] and <<megtörölgette>>[lemma: megtörölget; xpostag: [/V][Pst.Def.3Sg]].
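
To make the proposal concrete, here is a rough sketch of the normalization as a post-processing step. It is an illustration under stated assumptions, not how Purepos or emMorph work internally: the input shape (a list of (surface morph, tag) pairs for the selected analysis), the segmentations in the example, and the function name `fold_derivation` are all hypothetical. It folds everything up to the last [_Freq/V] or [_Caus/V] morpheme into the lemma and keeps only the inflectional tags after [/V].

```python
# Derivational tags that the proposal would fold into the lemma.
DERIVATIONAL_TAGS = {"[_Freq/V]", "[_Caus/V]"}


def fold_derivation(morphs):
    """Fold frequentative/causative morphemes into the lemma.

    morphs: list of (surface_morph, tag) pairs for one analysis
            (hypothetical input shape, e.g. [("forg", "[/V]"), ("at", "[_Caus/V]"), ...]).
    Returns (lemma, xpostag), or None if the analysis has no derivational tag.
    """
    # Index of the last derivational morpheme, if any.
    last = max(
        (i for i, (_surface, tag) in enumerate(morphs) if tag in DERIVATIONAL_TAGS),
        default=None,
    )
    if last is None:
        return None
    # Lemma: preverb + base + derivational suffixes, joined from their surface forms.
    lemma = "".join(surface for surface, _tag in morphs[: last + 1])
    # xpostag: both derivations yield verbs, so start from [/V] and keep the inflection.
    xpostag = "[/V]" + "".join(tag for _surface, tag in morphs[last + 1 :])
    return lemma, xpostag


if __name__ == "__main__":
    analysis = [
        ("meg", "[/Prev]"),
        ("töröl", "[/V]"),
        ("get", "[_Freq/V]"),
        ("te", "[Pst.Def.3Sg]"),
    ]
    print(fold_derivation(analysis))
```

Joining surface morphs is a simplification: where the derivational suffix itself surfaces in a changed shape before the inflection, the concatenation would not give the citation form, so a real fix would more likely have to build the lemma inside the analyzer or the tagger.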
