Dependency: dates as OBJ #32

Open
DavidNemeskey opened this issue Nov 19, 2021 · 4 comments
Labels
bug Something isn't working

Comments

@DavidNemeskey
Collaborator

When running the legacy dependency parser on the sentence "A spanyol nagydíj harmadik szabadedzését május 21-én, szombaton délelőtt tartották" ("The third free practice session of the Spanish Grand Prix was held on May 21, Saturday morning"), 21-én gets the OBJ relation even though it is not in the accusative case and is obviously not an object. This seems to be a common error with dates.

@DavidNemeskey DavidNemeskey added the bug Something isn't working label Nov 19, 2021
@DavidNemeskey
Collaborator Author

Actually, we are looking at two errors here:

  1. A noun in a non-accusative case is marked as the OBJ of a finite (active) verb.
  2. The POS tag of 21-én is [/N][Nom], which points to an additional error in either emMorph or emToken.
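
The first error can be caught mechanically. Below is a minimal sketch of such a check, assuming the parser output has already been read into dictionaries keyed by column name. The function name `find_suspicious_objects` and the sample rows are hypothetical; only the OBJ label on 21-én comes from this report, the other `deprel` values are illustrative. It is a heuristic, not a full rule, since an object is not always overtly accusative.

```python
def find_suspicious_objects(rows):
    """Return forms of nouns labelled OBJ whose emMorph tag lacks [Acc]."""
    suspicious = []
    for row in rows:
        is_noun = row["xpostag"].startswith("[/N]")
        is_obj = row["deprel"].upper() == "OBJ"
        is_accusative = "[Acc]" in row["xpostag"]
        if is_noun and is_obj and not is_accusative:
            suspicious.append(row["form"])
    return suspicious


# Hypothetical rows: only the OBJ on "21-én" is taken from the report,
# the other deprel values are made up for the illustration.
rows = [
    {"form": "szabadedzését", "xpostag": "[/N][Poss.3Sg][Acc]", "deprel": "OBJ"},
    {"form": "21-én", "xpostag": "[/N][Nom]", "deprel": "OBJ"},
    {"form": "szombaton", "xpostag": "[/N][Supe]", "deprel": "OBL"},
]
print(find_suspicious_objects(rows))
```

Run over a larger parsed sample, a check like this would give a rough idea of how often the parser produces this kind of OBJ.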

@dlazesz
Collaborator

dlazesz commented Nov 24, 2021

The output of PurePOS:

form	lemma	xpostag
A	a	[/Det|Art.Def]
spanyol	spanyol	[/Adj][Nom]
nagydíj	nagydíj	[/N][Nom]
harmadik	három	[/Num][_Ord/Adj][Nom]
szabadedzését	szabadedzés	[/N][Poss.3Sg][Acc]
május	május	[/N][Nom]
21-én	21-én	[/N][Nom]
,	,	[Punct]
szombaton	szombat	[/N][Supe]
délelőtt	délelőtt	[/N][_Tmp_Loc/Adv]
tartották	tart	[/V][Pst.Def.3Pl]
.	.	[Punct]

The output of emMorph:

{
  "21-én": [
    {
      "lemma": "21-én",
      "tag": "[/N][Nom]"
    },
    {
      "lemma": "21",
      "tag": "[/Num|Digit][Poss.3Sg][Supe]"
    },
    {
      "lemma": "21",
      "tag": "[/Num|Digit][AnP][Supe]"
    },
    {
      "lemma": "21",
      "tag": "[/Num|Digit][_OrdDate/N][Supe]"
    }
  ]
}

The relevant part of the Szeged corpus (each line prefixed with its frequency):

      4 1-én    1-én    [/N][Nom]
     20 4-én    4-én    [/N][Nom]
     11 5-én    5-én    [/N][Nom]
     11 7-én    7-én    [/N][Nom]
     13 9-én    9-én    [/N][Nom]
     11 10-én   10-én   [/N][Nom]
     10 11-én   11-én   [/N][Nom]
     13 12-én   12-én   [/N][Nom]
      6 14-én   14-én   [/N][Nom]
     12 15-én   15-én   [/N][Nom]
      8 17-én   17-én   [/N][Nom]
      6 19-én   19-én   [/N][Nom]
     12 21-én   21-én   [/N][Nom]
     13 22-én   22-én   [/N][Nom]
     12 24-én   24-én   [/N][Nom]
     14 25-én   25-én   [/N][Nom]
     16 27-én   27-én   [/N][Nom]
     10 29-én   29-én   [/N][Nom]
     22 31-én   31-én   [/N][Nom]

This is another systematic error in the training corpus, which has nothing to do with the modules.

This will not be fixed until new training data materializes.
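
For reference, a frequency list like the one above can be reproduced with a few lines of Python. This is only a sketch: it assumes the corpus is available as tab-separated `form`, `lemma`, `xpostag` lines, and the function name `count_date_nominatives` and the sample lines are hypothetical. The pattern covers only the `-én` forms listed above.

```python
import re
from collections import Counter

# Matches day-of-month forms such as "21-én" (only the -én suffix, as in the list above).
DATE_FORM = re.compile(r"^\d+-én$")


def count_date_nominatives(lines):
    """Count date-like forms that are tagged [/N][Nom]."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip empty lines and malformed rows
        form, _lemma, tag = fields[:3]
        if DATE_FORM.match(form) and tag == "[/N][Nom]":
            counts[form] += 1
    return counts


# Hypothetical sample in the assumed three-column layout.
sample = [
    "május\tmájus\t[/N][Nom]",
    "21-én\t21-én\t[/N][Nom]",
    "szombaton\tszombat\t[/N][Supe]",
    "",
    "21-én\t21-én\t[/N][Nom]",
]
for form, freq in count_date_nominatives(sample).most_common():
    print(freq, form)
```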

@DavidNemeskey
Collaborator Author

Well, this still doesn't explain the first problem (assigning OBJ to 21-én). As for the "new training data", this is definitely something we could fix in it ourselves.
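
A rough sketch of what such a fix could look like on a single row. Note the assumptions: the target tag is the `_OrdDate` analysis from the emMorph output above, which nobody in this thread has confirmed as the intended gold tag, and `retag_date` is a hypothetical helper, not part of any module.

```python
import re

# Captures the number in day-of-month forms such as "21-én".
DATE_FORM = re.compile(r"^(\d+)-én$")

# Assumption: the date reading offered by emMorph is the tag we want.
ASSUMED_TAG = "[/Num|Digit][_OrdDate/N][Supe]"


def retag_date(form, lemma, tag, new_tag=ASSUMED_TAG):
    """Rewrite lemma and tag of date forms mis-tagged as [/N][Nom]."""
    match = DATE_FORM.match(form)
    if match and tag == "[/N][Nom]":
        # The lemma becomes the bare number, as in the emMorph analyses.
        return form, match.group(1), new_tag
    return form, lemma, tag


print(retag_date("21-én", "21-én", "[/N][Nom]"))
print(retag_date("május", "május", "[/N][Nom]"))
```

Rows that do not match the pattern, or that already carry another tag, are returned unchanged, so the rewrite could be run over the whole corpus.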

@dlazesz
Collaborator

dlazesz commented Nov 25, 2021

I don't have access to the corpus version that emDep was trained on, but the problem is fixed in the public UD Szeged corpus, and simply switching the dependency parser to Stanza seems to solve the original problem:

form	lemma	xpostag	upostag	feats	id	deprel	head
A	a	[/Det|Art.Def]	DET	Definite=Def|PronType=Art	1	det	3
spanyol	spanyol	[/Adj][Nom]	ADJ	Case=Nom|Degree=Pos|Number=Sing	2	amod:att	3
nagydíj	nagydíj	[/N][Nom]	NOUN	Case=Nom|Number=Sing	3	nmod:att	5
harmadik	három	[/Num][_Ord/Adj][Nom]	ADJ	Case=Nom|Number=Sing|NumType=Ord	4	amod:att	5
szabadedzését	szabadedzés	[/N][Poss.3Sg][Acc]	NOUN	Case=Acc|Number=Sing|Number[psor]=Sing|Person[psor]=3	5	obj	11
május	május	[/N][Nom]	NOUN	Case=Nom|Number=Sing	6	nmod:att	7
21-én	21-én	[/N][Nom]	NOUN	Case=Nom|Number=Sing	7	obl	11
,	,	[Punct]	PUNCT	_	8	punct	11
szombaton	szombat	[/N][Supe]	NOUN	Case=Sup|Number=Sing	9	nmod:obl	11
délelőtt	délelőtt	[/N][_Tmp_Loc/Adv]	ADV	_	10	advmod:tlocy	11
tartották	tart	[/V][Pst.Def.3Pl]	VERB	Definite=Def|Mood=Ind|Number=Plur|Person=3|Tense=Past|VerbForm=Fin|Voice=Act	11	root	0
.	.	[Punct]	PUNCT	_	12	punct	11

There are several new corpora in the making; IMO we should focus on them instead.
