Incorrect Spanish verb decomposition #1395

Open
busdriverbuddha opened this issue Jun 13, 2024 · 10 comments

@busdriverbuddha

Describe the bug
The token "decírselo" is incorrectly decomposed into the words "decar", "se", "lo". "Decar" is not a word in Spanish. It should be "decír".

To Reproduce

import stanza
nlp = stanza.Pipeline("es", processors="mwt,tokenize")
doc = nlp('Decírselo.')
print(", ".join(w.text for w in doc.sentences[0].tokens[0].words))

This yields the output

Decar, se, lo

Expected behavior
The expected output is

Decír, se, lo

Environment (please complete the following information):

  • Ubuntu 20.04
  • Python 3.10.14
  • Stanza version: 1.8.2

Additional context
None at the moment.

@AngledLuffa
Collaborator

Thanks, that's a useful observation. We can add it to the training data for MWT, and I'll have a new model ready probably Monday or so.

If you find others between now and then, please let us know and I'll add those as well.

@busdriverbuddha
Author

busdriverbuddha commented Jun 15, 2024

I had similar issues with "decírmelo" (decír+me+lo) and "dárselo" (dar+se+lo, without the accent).

Perhaps it would be worthwhile to add a variety of verbs with similar constructions to the training data.

EDIT: Please disregard this comment as it is incorrect.

@AngledLuffa
Collaborator

AngledLuffa commented Jun 15, 2024

I'm trying to figure out why the tokenized dar would not have an accent when decír does.

Generally speaking, the GSD treebank we base the Spanish models on doesn't have accents on any of the uses of decir. The unique factor here is that the original training data has no instances of decir plus two clitics, and generally words with the accent get tokenized so that they still have the accent. For example,

# sent_id = es-train-003-s271
# text = Jacob, desempleado por una discusión que tuvo con Bretton James, y sabiendo que Winnie está esperando un hijo suyo, decide persuadir a Winnie de liberar el fideicomiso, para depositárselo a Gordon Gekko quien le ha prometido usarlos para consolidar una fortuna para Winnie y él.
33-35   depositárselo   _       _       _       _       _       _       _       _
33      depositár       depositar       VERB    _       VerbForm=Inf    28      advcl   _       _
34      se      él      PRON    _       Case=Acc,Dat|Person=3|PrepCase=Npr|PronType=Prs|Reflex=Yes      33      expl:pv _       _
35      lo      él      PRON    _       Case=Acc|Gender=Masc|Number=Sing|Person=3|PrepCase=Npr|PronType=Prs     33      obj     _       _

So my thinking is that dárselo also splits with the accent on dar, even though that isn't the standard way of writing dar when it has no clitics.
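To check how a treebank splits such tokens, the MWT range lines can be read straight from the CoNLL-U text. The following is a minimal sketch, not code from Stanza: `read_mwt_splits` and `SAMPLE` are hypothetical names, and the sample reproduces the depositárselo lines above with tab-separated columns, as a CoNLL-U file would have them.

```python
# Sample built from the GSD sentence quoted above (tab-separated, as in a .conllu file).
SAMPLE = "\n".join([
    "# sent_id = es-train-003-s271",
    "33-35\tdepositárselo\t_\t_\t_\t_\t_\t_\t_\t_",
    "33\tdepositár\tdepositar\tVERB\t_\tVerbForm=Inf\t28\tadvcl\t_\t_",
    "34\tse\tél\tPRON\t_\tCase=Acc,Dat|Person=3|PrepCase=Npr|PronType=Prs|Reflex=Yes\t33\texpl:pv\t_\t_",
    "35\tlo\tél\tPRON\t_\tCase=Acc|Gender=Masc|Number=Sing|Person=3|PrepCase=Npr|PronType=Prs\t33\tobj\t_\t_",
])


def read_mwt_splits(conllu_text):
    """Return [(surface_form, [word_form, ...])] for every multi-word token."""
    forms = {}   # word id -> FORM column
    ranges = []  # (start id, end id, surface form) for each "M-N" line
    for line in conllu_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        word_id, form = cols[0], cols[1]
        if "-" in word_id:
            start, end = word_id.split("-")
            ranges.append((int(start), int(end), form))
        elif word_id.isdigit():  # ignores empty nodes such as "8.1"
            forms[int(word_id)] = form
    return [
        (surface, [forms[i] for i in range(start, end + 1)])
        for start, end, surface in ranges
    ]


for surface, parts in read_mwt_splits(SAMPLE):
    # If the parts concatenate back to the surface form, the accent was kept on the verb.
    print(surface, parts, "".join(parts) == surface)
```

For this sentence the parts concatenate back to the surface form, which is the accent-preserving convention described here.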

@busdriverbuddha
Author

busdriverbuddha commented Jun 15, 2024

@AngledLuffa You're correct, I'm sorry. Decírselo would indeed be decomposed as decir+se+lo, without the accent, as you mentioned.

The only actual error made by Stanza, then, is the first one I pointed out: decírselo as decar-se-lo which should be decir-se-lo (without the accent, as you pointed out).

@AngledLuffa
Collaborator

AngledLuffa commented Jun 15, 2024 via email

@busdriverbuddha
Author

busdriverbuddha commented Jun 16, 2024

Well, that's certainly a valid line of investigation, and I wish I could contribute further, but unfortunately I don't actually speak Spanish. I use the Stanza constituency tree parser as part of a larger application that requires a constituency tree whose leaves are the actual tokens, not the words. The application therefore reconstructs the tree, merging each MWT's words into a single leaf, which is how I caught the initial discrepancy in the first place.
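The application's tree reconstruction is not shown in this thread. As a rough illustration of the leaf-merging step only, here is a minimal sketch under stated assumptions: `collapse_words_to_tokens` is a hypothetical helper, and its input is a plain list of `(token_text, words)` pairs, which with Stanza could presumably be built from each token's `text` and the `text` of its `words`.

```python
def collapse_words_to_tokens(tokens):
    """Collapse syntactic words back into surface-token leaves.

    tokens: list of (surface_text, [word_text, ...]) pairs in sentence order.
    Returns (leaves, word_to_leaf), where word_to_leaf[i] is the index of the
    leaf that absorbs the i-th word of the sentence.
    """
    leaves = []
    word_to_leaf = []
    for surface, words in tokens:
        leaf_index = len(leaves)
        leaves.append(surface)
        # Every word of a multi-word token points at the same merged leaf.
        word_to_leaf.extend([leaf_index] * len(words))
    return leaves, word_to_leaf


# Hypothetical input for the sentence from the bug report.
sentence = [("Decírselo", ["Decir", "se", "lo"]), (".", ["."])]
leaves, word_to_leaf = collapse_words_to_tokens(sentence)
print(leaves)        # ['Decírselo', '.']
print(word_to_leaf)  # [0, 0, 0, 1]
```

Because the leaves come from the surface text, a wrong split such as decar+se+lo only shows up when the words are compared against the merged token.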

@AngledLuffa
Collaborator

Hey, ran into an issue or two with the Spanish GSD dataset. Once we get that cleaned up I'll retry the models tomorrow. Guess it's time to put on my UD annotator hat...

@AngledLuffa
Collaborator

Gah, I apologize for how long this is taking. So in the one treebank I was looking at, GSD, the tokens keep the accents after splitting. The same is true in PUD. In AnCora, though, which we were treating as the default, your observation that the accents disappear is correct.

Do you have a preference? I don't care too much either way.

UniversalDependencies/UD_Spanish-AnCora#9

@busdriverbuddha
Author

busdriverbuddha commented Jun 19, 2024 via email

@AngledLuffa
Collaborator

Got a bunch of data improvements from the UD team. I added the examples above and recreated all of the Spanish models as combined models with both AnCora and GSD. There's a bit of a performance hit when POS tagging the AnCora dev set, which we can investigate. Otherwise, it seems to be working. I pushed those models as the new defaults for Spanish.


2 participants