lemma with # sign in Finnish language #70

Open
mrgransky opened this issue Apr 4, 2023 · 2 comments

mrgransky commented Apr 4, 2023

Given the following code snippet:

import json
from trankit import Pipeline

p = Pipeline('auto', embedding='xlm-roberta-large')

doc = '''Naton päämajassa Brysselissä järjestettiin iltapäivällä Suomen virallinen liittymisseremonia.'''

tokens = p(doc, is_sent=True)
print(json.dumps(tokens, indent=2, ensure_ascii=False))

For some reason, I get a # in some of the lemmas, as seen in this sample output:

{
  "text": "Naton päämajassa Brysselissä järjestettiin iltapäivällä Suomen virallinen liittymisseremonia.",
  "tokens": [
    {
      "id": 1,
      "text": "Naton",
      "upos": "PROPN",
      "xpos": "N",
      "feats": "Case=Gen|Number=Sing",
      "head": 2,
      "deprel": "nmod:poss",
      "span": [
        0,
        5
      ],
      "lemma": "Nato"
    },
    {
      "id": 2,
      "text": "päämajassa",
      "upos": "NOUN",
      "xpos": "N",
      "feats": "Case=Ine|Number=Sing",
      "head": 4,
      "deprel": "obl",
      "span": [
        6,
        16
      ],
      "lemma": "pää#maja"  <<<<<<<<<<<<<<<<<<<<< HERE <<<<<<<<<<<<<<<<<<<<<
    },
    {
      "id": 3,
      "text": "Brysselissä",
      "upos": "PROPN",
      "xpos": "N",
      "feats": "Case=Ine|Number=Sing",
      "head": 2,
      "deprel": "appos",
      "span": [
        17,
        28
      ],
      "lemma": "Bryssel"
    },
    {
      "id": 4,
      "text": "järjestettiin",
      "upos": "VERB",
      "xpos": "V",
      "feats": "Mood=Ind|Tense=Past|VerbForm=Fin|Voice=Pass",
      "head": 0,
      "deprel": "root",
      "span": [
        29,
        42
      ],
      "lemma": "järjestää"
    },
    {
      "id": 5,
      "text": "iltapäivällä",
      "upos": "NOUN",
      "xpos": "N",
      "feats": "Case=Ade|Number=Sing",
      "head": 4,
      "deprel": "obl",
      "span": [
        43,
        55
      ],
      "lemma": "ilta#päivä" <<<<<<<<<<<<<<<<<<<<< HERE <<<<<<<<<<<<<<<<<<<<<
    },
    {
      "id": 6,
      "text": "Suomen",
      "upos": "PROPN",
      "xpos": "N",
      "feats": "Case=Gen|Number=Sing",
      "head": 8,
      "deprel": "nmod:poss",
      "span": [
        56,
        62
      ],
      "lemma": "Suomi"
    },
    {
      "id": 7,
      "text": "virallinen",
      "upos": "ADJ",
      "xpos": "A",
      "feats": "Case=Nom|Degree=Pos|Derivation=Llinen|Number=Sing",
      "head": 8,
      "deprel": "amod",
      "span": [
        63,
        73
      ],
      "lemma": "virallinen"
    },
    {
      "id": 8,
      "text": "liittymisseremonia",
      "upos": "NOUN",
      "xpos": "N",
      "feats": "Case=Nom|Number=Sing",
      "head": 4,
      "deprel": "obj",
      "span": [
        74,
        92
      ],
      "lemma": "liittyä#seremoni" <<<<<<<<<<<<<<<<<<<<< HERE <<<<<<<<<<<<<<<<<<<<<
    },
    {
      "id": 9,
      "text": ".",
      "upos": "PUNCT",
      "xpos": "Punct",
      "head": 4,
      "deprel": "punct",
      "span": [
        92,
        93
      ],
      "lemma": "."
    }
  ],
  "lang": "finnish"
}

I tried it both in Colab and in a terminal, but got the same results!

What am I doing wrong?

PS: I do not get the same error on the demo website:
[image]

Cheers,

@OttoTarkka

This is not an error: the component words of a compound word (Finnish: yhdyssana) are separated by the '#' sign by design.
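If the separator is unwanted downstream, it can be removed in a post-processing step. The following is only a minimal sketch, assuming the output has the same shape as the JSON in the issue (a dict with a "tokens" list whose items carry a "lemma" string). The helper names `split_compound`, `join_compound`, and `strip_compound_markers` are hypothetical and not part of trankit. Note that joining the parts does not always give a dictionary form, since each component is lemmatized on its own (for example "liittyä#seremoni" above).

```python
from copy import deepcopy

# Separator observed in the lemmas of the sample output above.
COMPOUND_SEP = "#"


def split_compound(lemma):
    """Return the component lemmas of a compound lemma such as 'pää#maja'."""
    # A lemma that consists only of the separator (e.g. a literal '#' token)
    # is returned unchanged instead of being split into nothing.
    if COMPOUND_SEP not in lemma or lemma.strip(COMPOUND_SEP) == "":
        return [lemma]
    return [part for part in lemma.split(COMPOUND_SEP) if part]


def join_compound(lemma):
    """Return the lemma with the compound separators removed."""
    return "".join(split_compound(lemma))


def strip_compound_markers(sentence):
    """Return a copy of a sentence dict with '#' removed from token lemmas.

    Hypothetical helper: it only looks at the top-level "tokens" list,
    which is the structure shown in this issue.
    """
    cleaned = deepcopy(sentence)
    for token in cleaned.get("tokens", []):
        if "lemma" in token:
            token["lemma"] = join_compound(token["lemma"])
    return cleaned


if __name__ == "__main__":
    sample = {
        "tokens": [
            {"text": "Naton", "lemma": "Nato"},
            {"text": "päämajassa", "lemma": "pää#maja"},
            {"text": "iltapäivällä", "lemma": "ilta#päivä"},
        ]
    }
    for token in strip_compound_markers(sample)["tokens"]:
        print(token["text"], token["lemma"])
```

Keeping `split_compound` separate from `join_compound` leaves the component boundaries available for cases where they are useful, such as indexing the parts of a compound individually.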

@mrgransky

But this only occurs when the standard TDT package is used;
FTB does not lead to the same issue.
