Tokenization standard difference with GSD #9

AngledLuffa · 2024-06-19T06:48:19Z

I found a tokenization difference between the Spanish datasets which makes them somewhat incompatible. If the clitics cause a word to acquire an accent, in GSD the pieces keep the accent, whereas in AnCora the pieces do not. In PUD the accents also remain. Would be great to unify them, perhaps by changing AnCora to keep the accents:

GSD

# sent_id = es-train-003-s271
# text = Jacob, desempleado por una discusión que tuvo con Bretton James, y sabiendo que Winnie está esperando un hijo suyo, decide persuadir a Winnie de liberar el fideicomiso, para depositárselo a Gordon Gekko quien le ha prometido usarlos para consolidar una fortuna para Winnie y él.
33-35   depositárselo   _       _       _       _       _       _       _       _
33      depositár       depositar       VERB    _       VerbForm=Inf    28      advcl   _       _
34      se      él      PRON    _       Case=Acc,Dat|Person=3|PrepCase=Npr|PronType=Prs|Reflex=Yes      33      expl:pv _       _
35      lo      él      PRON    _       Case=Acc|Gender=Masc|Number=Sing|Person=3|PrepCase=Npr|PronType=Prs     33      obj     _       _

AnCora

# sent_id = 3LB-CAST-a12-3-s7
# text = Luego se deciden a afeitárselo y entonces se dan cuenta de cuál es su verdadera carencia la carencia de bigote.
# orig_file_sentence 007#17
5-7     afeitárselo     _       _       _       _       _       _       _       _
5       afeitar afeitar VERB    vmn0000 VerbForm=Inf    3       xcomp   3:xcomp ArgTem=arg1:tem
6       se      él      PRON    _       Case=Dat|Person=3|PrepCase=Npr|PronType=Prs     5       obl:arg 5:obl:arg       _
7       lo      él      PRON    _       Case=Acc|Gender=Masc|Number=Sing|Person=3|PrepCase=Npr|PronType=Prs     5       obj     5:obj   _

PUD

# newdoc id = n01040
# sent_id = n01040028
# text = Estudiantes como Rai han estado reuniéndose con consejeros en el colegio para hablar de lo que pasó, pero esta dice que el mayor consuelo lo obtiene viendo a sus amigos.
# text_en = Students like Rai have been meeting with counsellors at the school to talk about what happened, but she said the biggest comfort has come from seeing her friends.
6-7     reuniéndose     _       _       _       _       _       _       _       _
6       reuniéndo       reunir  VERB    VBG     VerbForm=Ger    0       root    _       _
7       se      él      PRON    SE      Case=Acc,Dat|Person=3|PrepCase=Npr|PronType=Prs|Reflex=Yes      6       compound:prt    _       _


AngledLuffa · 2024-06-19T06:50:45Z

... although I suggested adding the accents to AnCora, removing them from GSD and PUD would be just as satisfactory.

dan-zeman · 2024-06-19T08:28:17Z

The rule in Spanish (and generally in the original UD guidelines, until English went its own way :-)) is that the pieces have the FORM that they would have if they occurred as orthographic words. Therefore, the preferable solution is to remove the accents from GSD and PUD.
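A rough sketch of what that rule means for the verb piece, in Python. The function name `free_form` and the exception for infinitives such as reír and oír (which carry the accent even as free words) are assumptions made for this illustration, not something taken from the treebank conversion scripts.

```python
import re
import unicodedata

# Assumption: infinitives ending in a/e/o + "ír" (reír, oír) are accented
# even as free words, so their accent is not clitic-induced and is kept.
_NATIVE_ACCENT = re.compile(r"[aeo]ír$")
# An accented vowel right before the infinitive or gerund ending.
_CLITIC_ACCENT = re.compile(r"([áéí])(r|ndo)$")
_PLAIN = {"á": "a", "é": "e", "í": "i"}


def free_form(verb_piece: str) -> str:
    """Return the verb piece as it would be spelled as a standalone word."""
    piece = unicodedata.normalize("NFC", verb_piece)
    if _NATIVE_ACCENT.search(piece):
        return piece
    return _CLITIC_ACCENT.sub(lambda m: _PLAIN[m.group(1)] + m.group(2), piece)
```

Under these assumptions, `free_form("depositár")` gives `depositar` and `free_form("reuniéndo")` gives `reuniendo`, while `free_form("reír")` is left unchanged.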

AngledLuffa · 2024-06-19T15:19:23Z

> Therefore, the preferable solution is to remove the accents from GSD and PUD.

Is that a viable solution, or would that be a weird modification to someone else's treebank?

amir-zeldes · 2024-06-19T16:11:14Z

> The rule in Spanish (and generally in the original UD guidelines, until English went its own way :-)) is that the pieces have the FORM that they would have if they occurred as orthographic words.

I don't think English was the first (or the last) to prefer MWTs with tokens that sum up to the whole. For example, UD_Arabic-PADT is older and keeps clitic pronouns in their orthographic form inside MWTs (ه does not become هُوَ). We copied that in UD_Hebrew-IAHLTwiki, and UD_Coptic is the same.

I think if there is a way to make MWT sub-tokenization concatenative, it is by far preferable, since it makes tokenization much easier in that language. I don't really see the value in removing the accents given that we also have lemmatization. It just means you now have to use a seq2seq tokenizer and have more chances of errors. Of course in some cases it's inevitable, since you don't have a viable segmentation (e.g. Portuguese "a" can contain two tokens, so that's not solvable using concatenative MWTs), but if it's fairly easy to do I would absolutely prefer that in any language I work with.
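To make "concatenative" concrete, here is a minimal Python sketch; the helper name `is_concatenative` is hypothetical. It only checks whether the word forms, joined without a separator, spell the surface token.

```python
def is_concatenative(surface: str, pieces: list[str]) -> bool:
    """True when the pieces, joined with no separator, spell the surface token."""
    return "".join(pieces) == surface


# GSD-style split from this issue: the accent stays on the verb piece.
gsd_ok = is_concatenative("depositárselo", ["depositár", "se", "lo"])
# AnCora-style split from this issue: the accent is dropped from the verb piece.
ancora_ok = is_concatenative("afeitárselo", ["afeitar", "se", "lo"])
```

With the examples quoted at the top of the issue, the GSD split passes this check and the AnCora split does not, which is the trade-off being discussed.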

Disclaimer: I don't really work much with UD Spanish and there may be really important existing tools or other corpora that favor stripping the accents, in which case that needs to be considered of course.

AngledLuffa · 2024-06-19T16:25:52Z

> I think if there is a way to make MWT sub-tokenization concatenative, it is by far preferable, since it makes tokenization much easier in that language.

Agreed, which is why I posted this issue here instead of GSD & PUD, since AnCora is the one which would change under that scheme. Although for Spanish, tokens such as del would be hard to make concatenative.

(Even in English we have some examples that are a bit of a stretch, such as gonna.)

dan-zeman · 2024-06-19T17:59:13Z

> I don't think English was the first (or the last) to prefer MWTs with tokens that sum up to the whole. For example, UD_Arabic-PADT is older and keeps clitic pronouns in their orthographic form inside MWTs (ه does not become هُوَ).

True. In the case of PADT, it is legacy tokenization from the pre-UD version of PADT. Whenever an orthographic word was split into multiple nodes in the original PADT, it was reflected as a multiword token in UD, without trying to revisit the rules and possibly adjust them in the UD spirit. Partly for time/capacity reasons, partly simply because the person doing the conversion (me) did not possess the necessary knowledge to even spot the issue. (Besides, I am not sure that هُوَ would be my preferred solution when I think of it now. I am still no expert on Arabic but it seems to me that this form is nominative while the required form would be accusative. It is possible that the expected form (paradigm slot) never occurs as a free form in the language; in such cases, taking a substring of the surface token is probably the only option.)

dan-zeman · 2024-06-19T18:09:52Z

Note for myself: COSER sides with AnCora:

# sent_id = astu-480
# text = Lo que no pasó este año por aquí, preguntándome por la capilla yo creo que fueron pa la playa todos pero tenía que enseñales desde aquí por donde tenían que ir pa ya pero creo que taba así to l día.
10-11	preguntándome	_	_	_	_	_	_	_	_
10	preguntando	preguntar	VERB	_	VerbForm=Ger	4	advcl	_	_
11	me	yo	PRON	pc1cs000	Case=Dat|Number=Sing|Person=1|PrepCase=Npr|PronType=Prs	10	expl:pv	_	_
12	por	por	ADP	sps00	_	14	case	_	_
13	la	el	DET	da0fs0	Definite=Def|Gender=Fem|Number=Sing|PronType=Art	14	det	_	_
14	capilla	capilla	NOUN	ncfs000	Gender=Fem|Number=Sing	10	obl	_	_

AngledLuffa · 2024-06-19T18:19:23Z

> Note for myself: COSER sides with AnCora

Fair point. I had only checked the three I mentioned.

Would it be valid to rewrite the forms to match one or the other standard? I do agree with @amir-zeldes that keeping accents is preferable (bearing in mind my opinion is an engineering opinion, not a linguistic opinion) but my only really strong desire is to see them be unified somehow.


amir-zeldes · 2024-06-19T18:49:14Z

> I am not sure that هُوَ would be my preferred solution when I think of it now. I am still no expert on Arabic but it seems to me that this form is nominative while the required form would be accusative

This is a tricky position, because some environments are "MWT only", like you say. But arguably this is true even of the paradigm examples such as Romance article fusion: if we require a masculine article such that it is governed by "à" then it can only be "u" (which historically is true, it's just L-vocalization).

So the question is one of granularity - how specific we take the required form's environment to be. If we don't want specific prepositions to be an environment and we say it's "accusative" or deprel=obj, we could also think that the canonical Standard Arabic independent accusative pronoun is إياه and start putting that into every MWT with a clitic object. To be clear, I don't think this is the right thing to do at all. Ultimately, this kind of analysis seems to import many more assumptions and much more complexity into the treebank, whereas leaving the ه as is is both easy and better for engineering reasons.

dan-zeman · 2024-06-19T18:49:21Z

Fixed UD_Spanish-PUD in UniversalDependencies/UD_Spanish-PUD@36178ff. It turns out it was already mostly in line with AnCora. Using

[áéí](r|ndo)(me|te|se|l[eoa]s?|nos|os){1,2}\t

I found 17 gerunds with clitics. Out of them, 16 were already good and only reuniéndo had to be fixed.
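For anyone who wants to repeat this kind of search, a minimal Python sketch follows. The pattern is the one quoted above; the function name `find_clitic_mwts` and the restriction to multiword-token range lines (IDs such as `6-7`) are assumptions of this sketch and not necessarily how the original search was run.

```python
import re

# Pattern quoted in the comment above; the trailing tab anchors the match
# at the end of a tab-separated CoNLL-U column.
CLITIC_MWT = re.compile(r"[áéí](r|ndo)(me|te|se|l[eoa]s?|nos|os){1,2}\t")


def find_clitic_mwts(conllu_text: str) -> list[str]:
    """Collect surface forms of multiword tokens that match the pattern."""
    hits = []
    for line in conllu_text.splitlines():
        cols = line.split("\t")
        # Only look at 10-column range lines such as "6-7"; skip comments and words.
        if len(cols) != 10 or "-" not in cols[0]:
            continue
        if CLITIC_MWT.search(cols[1] + "\t"):
            hits.append(cols[1])
    return hits
```

Each hit is a surface form whose word lines can then be inspected to see whether the verb piece still carries the accent.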

AngledLuffa · 2024-06-19T18:53:51Z

That's kind of funny, that I was manually searching and the first one I came across was the one which was different from all the others.

So it sounds like changing GSD would be the more accepted solution? Is that something we can do? Is that something I would need to do?

dan-zeman · 2024-06-19T19:13:57Z

I am looking into it. There are 235 instances in GSD. Many of them are already in line with AnCora, too; some are not. But 15 of them have not even been segmented.

AngledLuffa · 2024-06-19T19:17:49Z

Thank you! LMK if you want me to take on any part of it


dan-zeman · 2024-06-20T14:55:03Z

Done. In the end the "many are already in line" claim held for dev and test data, while almost all instances in train had to be fixed.

AngledLuffa · 2024-06-20T23:04:20Z

Thank you, this will greatly improve the interoperability of the treebanks.

AngledLuffa mentioned this issue Jun 19, 2024

Incorrect Spanish verb decomposition stanfordnlp/stanza#1395

Closed

dan-zeman added a commit to UniversalDependencies/UD_Spanish-PUD that referenced this issue Jun 19, 2024

Fixed accent error in reconstructed surface form.

36178ff


dan-zeman added a commit to UniversalDependencies/UD_Spanish-GSD that referenced this issue Jun 19, 2024

Fixed gerunds with clitics in development data.

9a75494


dan-zeman added a commit to UniversalDependencies/UD_Spanish-GSD that referenced this issue Jun 19, 2024

Fixed gerunds with clitics in test data.

9a02a6f


dan-zeman added a commit to UniversalDependencies/UD_Spanish-GSD that referenced this issue Jun 19, 2024

Fixed gerunds with clitics in training data (only those properly segmented).

7c744c5

dan-zeman added a commit to UniversalDependencies/UD_Spanish-GSD that referenced this issue Jun 20, 2024

Fixed segmentation of gerunds with clitics.

8f0bfb3


dan-zeman closed this as completed Jun 20, 2024

dan-zeman mentioned this issue Jun 20, 2024

VERB+PRON compounds UniversalDependencies/UD_Spanish-GSD#5

Closed
