Incorrect Spanish verb decomposition #1395

busdriverbuddha · 2024-06-13T15:14:45Z

Describe the bug
The token "decírselo" is incorrectly decomposed into the words "decar", "se", "lo". "Decar" is not a word in Spanish. It should be "decír".

To Reproduce

import stanza

# Tokenize, then expand multi-word tokens (MWT) into their component words.
nlp = stanza.Pipeline("es", processors="tokenize,mwt")
doc = nlp("Decírselo.")
print(", ".join(w.text for w in doc.sentences[0].tokens[0].words))

This yields the output

Decar, se, lo

Expected behavior
The expected output is

Decír, se, lo

Environment (please complete the following information):

Ubuntu 20.04
Python 3.10.14
Stanza version: 1.8.2

Additional context
None at the moment.

AngledLuffa · 2024-06-14T08:14:05Z

Thanks, that's a useful observation. We can add it to the training data for MWT, and I'll have a new model ready probably Monday or so.

If you find others between now and then, please let us know and I'll add those as well.
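
One way to hunt for further cases is to check that the words of each multi-word token, joined back together, still spell the token once acute accents are ignored. The sketch below works on plain `(token, words)` pairs rather than Stanza objects. The names `fold` and `find_suspicious_splits` are hypothetical helpers for illustration, not part of Stanza.

```python
import unicodedata

ACUTE = "\u0301"  # combining acute accent


def fold(text):
    """Lowercase and drop acute accents, keeping ñ and ü intact."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return unicodedata.normalize("NFC", decomposed.replace(ACUTE, ""))


def find_suspicious_splits(pairs):
    """Return the (token, words) pairs whose words do not re-spell the token."""
    return [(tok, words) for tok, words in pairs
            if fold("".join(words)) != fold(tok)]


# Example pairs taken from the splits discussed in this thread.
pairs = [
    ("Decírselo", ["Decar", "se", "lo"]),
    ("depositárselo", ["depositár", "se", "lo"]),
    ("dárselo", ["dar", "se", "lo"]),
]
suspicious = find_suspicious_splits(pairs)
print(suspicious)
```

With a real pipeline, the pairs could be built from `token.text` and `[w.text for w in token.words]` for every token that has more than one word. Contractions such as "del" (de + el) would also be flagged, so the result is a list to review rather than a list of errors.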

busdriverbuddha · 2024-06-15T01:38:49Z

~~I had similar issues with "decírmelo" (decír+me+lo) and "dárselo" (dar+se+lo, without the accent).~~

~~Perhaps it would be interesting to add to the training data a variety of verbs with similar construction~~

EDIT: Please disregard this comment as it is incorrect.

AngledLuffa · 2024-06-15T08:35:39Z

I'm trying to figure out - why would the tokenized dar not have an accent, but decír does?

Generally speaking, the GSD treebank we base the Spanish models on doesn't have accents on any of the uses of decir. The unique factor here is that there are no instances of decir plus two clitics in the original training data, and words with an accent generally get tokenized so that they keep the accent. For example,

# sent_id = es-train-003-s271
# text = Jacob, desempleado por una discusión que tuvo con Bretton James, y sabiendo que Winnie está esperando un hijo suyo, decide persuadir a Winnie de liberar el fideicomiso, para depositárselo a Gordon Gekko quien le ha prometido usarlos para consolidar una fortuna para Winnie y él.
33-35   depositárselo   _       _       _       _       _       _       _       _
33      depositár       depositar       VERB    _       VerbForm=Inf    28      advcl   _       _
34      se      él      PRON    _       Case=Acc,Dat|Person=3|PrepCase=Npr|PronType=Prs|Reflex=Yes      33      expl:pv _       _
35      lo      él      PRON    _       Case=Acc|Gender=Masc|Number=Sing|Person=3|PrepCase=Npr|PronType=Prs     33      obj     _       _

So my thinking is that dárselo should also split with the accent kept on dár, even though that isn't the standard spelling of dar when it has no clitics.
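
To see how a treebank spells the pieces of each multi-word token, the range lines (such as `33-35`) can be paired with the word lines they cover. This is a minimal sketch that only reads the ID and FORM columns and assumes FORM contains no whitespace. `mwt_surface_forms` is a hypothetical helper name, and the sample rows are the ones quoted above, abridged to their first three columns.

```python
def mwt_surface_forms(conllu_lines):
    """Map each multi-word token to the FORM of every word in its ID range."""
    ranges, forms = [], {}
    for line in conllu_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split()
        word_id, form = cols[0], cols[1]
        if "-" in word_id:
            # Range line, e.g. "33-35": the surface token before splitting.
            start, end = map(int, word_id.split("-"))
            ranges.append((form, start, end))
        elif word_id.isdigit():
            forms[int(word_id)] = form
    return {tok: [forms[i] for i in range(start, end + 1)]
            for tok, start, end in ranges}


sample = [
    "# sent_id = es-train-003-s271",
    "33-35\tdepositárselo\t_",
    "33\tdepositár\tdepositar",
    "34\tse\tél",
    "35\tlo\tél",
]
splits = mwt_surface_forms(sample)
print(splits)
```

Here the verb piece keeps its accent in the FORM column (`depositár`), while the LEMMA column has the unaccented `depositar`.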

busdriverbuddha · 2024-06-15T13:00:20Z

@AngledLuffa You're correct, I'm sorry. Decírselo would indeed be decomposed as decir+se+lo, without the accent, as you mentioned.

The only actual error made by Stanza, then, is the first one I pointed out: decírselo as decar-se-lo which should be decir-se-lo (without the accent, as you pointed out).

AngledLuffa · 2024-06-15T22:21:11Z

Actually, it seems the standard in GSD is to keep the text intact but remove the accents in the lemma. Some part of me wonders if that means we can deterministically split all the words aside from a few known exceptions. We found that for English there are no exceptions at all, so we split all words into the raw text. We could do the same thing for Spanish, with a list of exceptions.
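
A rough sketch of what deterministic splitting with an exception list might look like for Spanish verb + clitic tokens follows. It is not Stanza's implementation. The clitic inventory, the `split_clitics` name, and the `exceptions` parameter are all assumptions made for illustration, and the function assumes some earlier step has already decided that the token is a multi-word token.

```python
# Spanish enclitic pronouns, longest first so "los" is tried before "lo" and "os".
CLITICS = ("los", "las", "les", "nos", "me", "te", "se", "lo", "la", "le", "os")


def split_clitics(token, max_clitics=2, exceptions=None):
    """Peel up to max_clitics enclitics off the end of a verb+clitic token.

    The raw text is kept intact, so any accent stays on the verb piece.
    """
    if exceptions and token.lower() in exceptions:
        return list(exceptions[token.lower()])
    stem, clitics = token, []
    while len(clitics) < max_clitics:
        for clitic in CLITICS:
            # Require at least two characters of verb stem to remain.
            if stem.lower().endswith(clitic) and len(stem) >= len(clitic) + 2:
                cut = len(stem) - len(clitic)
                clitics.insert(0, stem[cut:])
                stem = stem[:cut]
                break
        else:
            break  # no clitic matched, stop peeling
    return [stem] + clitics


print(split_clitics("Decírselo"))
print(split_clitics("dárselo"))
```

Because the rule is purely suffix-based, an ordinary word such as "solo" would be cut into "so" + "lo" if it ever reached this function, which is why the exception list and the upstream decision about which tokens to split both matter.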

busdriverbuddha · 2024-06-16T01:27:28Z

Well, that's certainly a valid line of investigation, and I wish I could contribute further, but unfortunately I don't actually speak Spanish. I use the Stanza constituency parser as part of a larger application that requires a constituency tree whose leaves are the actual tokens, not the words. The application therefore reconstructs the tree, merging each MWT into a single leaf, which is how I caught the discrepancy in the first place.

AngledLuffa · 2024-06-18T07:18:48Z

Hey, ran into an issue or two with the Spanish GSD dataset. Once we get that cleaned up I'll retry the models tomorrow. Guess it's time to put on my UD annotator hat...

AngledLuffa · 2024-06-19T06:50:57Z

Gah, I apologize for how long this is taking. So in the one treebank I was looking at, GSD, the tokens keep the accents after splitting. The same is true in PUD. In AnCora, though, which we were treating as the default, your observation that the accents disappear is correct.

Do you have a preference? I don't care too much either way.

UniversalDependencies/UD_Spanish-AnCora#9

busdriverbuddha · 2024-06-19T12:27:05Z

Hi! I have no preference either, and there's also no urgency on my end - I've written a patch here to just ignore the spellings of the words and use the full token when merging, so the problem is solved as far as I'm concerned.
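
The patch itself is not shown in this thread. As an illustration of the idea only, the sketch below builds one leaf per token from the token's own text and never looks at the spelling of the split words, so a bad split such as "Decar" cannot leak into the leaves. `token_leaves` is a hypothetical name, and the input is a plain list of `(token_text, word_texts)` pairs rather than Stanza objects.

```python
def token_leaves(tokens):
    """Return one leaf per token plus a map from word position to leaf index.

    Only the number of words per token is used, never their spelling.
    """
    leaves, word_to_leaf = [], []
    for leaf_idx, (text, words) in enumerate(tokens):
        leaves.append(text)
        word_to_leaf.extend([leaf_idx] * len(words))
    return leaves, word_to_leaf


# The sentence from the original report, with the faulty split.
tokens = [("Decírselo", ["Decar", "se", "lo"]), (".", ["."])]
leaves, word_to_leaf = token_leaves(tokens)
print(leaves)
print(word_to_leaf)
```

The `word_to_leaf` map is what lets word-level parser output be re-attached to token-level leaves.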

AngledLuffa · 2024-06-24T03:36:56Z

Got a bunch of data improvements from the UD team. I added the examples above and recreated all of the Spanish models as combined models trained on both AnCora and GSD. There's a bit of a performance hit when POS tagging the AnCora dev set, which we can investigate. Otherwise, it seems to be working. I pushed those models as the new defaults for Spanish.

busdriverbuddha added the bug label Jun 13, 2024