Tokenization standard difference with GSD #9

Closed
AngledLuffa opened this issue Jun 19, 2024 · 15 comments

Comments

@AngledLuffa

I found a tokenization difference between the Spanish datasets which makes them somewhat incompatible. If the clitics cause a word to acquire an accent, in GSD the pieces keep the accent, whereas in AnCora the pieces do not. In PUD the accents also remain. Would be great to unify them, perhaps by changing AnCora to keep the accents:

GSD

# sent_id = es-train-003-s271
# text = Jacob, desempleado por una discusión que tuvo con Bretton James, y sabiendo que Winnie está esperando un hijo suyo, decide persuadir a Winnie de liberar el fideicomiso, para depositárselo a Gordon Gekko quien le ha prometido usarlos para consolidar una fortuna para Winnie y él.
33-35   depositárselo   _       _       _       _       _       _       _       _
33      depositár       depositar       VERB    _       VerbForm=Inf    28      advcl   _       _
34      se      él      PRON    _       Case=Acc,Dat|Person=3|PrepCase=Npr|PronType=Prs|Reflex=Yes      33      expl:pv _       _
35      lo      él      PRON    _       Case=Acc|Gender=Masc|Number=Sing|Person=3|PrepCase=Npr|PronType=Prs     33      obj     _       _

AnCora

# sent_id = 3LB-CAST-a12-3-s7
# text = Luego se deciden a afeitárselo y entonces se dan cuenta de cuál es su verdadera carencia la carencia de bigote.
# orig_file_sentence 007#17
5-7     afeitárselo     _       _       _       _       _       _       _       _
5       afeitar afeitar VERB    vmn0000 VerbForm=Inf    3       xcomp   3:xcomp ArgTem=arg1:tem
6       se      él      PRON    _       Case=Dat|Person=3|PrepCase=Npr|PronType=Prs     5       obl:arg 5:obl:arg       _
7       lo      él      PRON    _       Case=Acc|Gender=Masc|Number=Sing|Person=3|PrepCase=Npr|PronType=Prs     5       obj     5:obj   _

PUD

# newdoc id = n01040
# sent_id = n01040028
# text = Estudiantes como Rai han estado reuniéndose con consejeros en el colegio para hablar de lo que pasó, pero esta dice que el mayor consuelo lo obtiene viendo a sus amigos.
# text_en = Students like Rai have been meeting with counsellors at the school to talk about what happened, but she said the biggest comfort has come from seeing her friends.
6-7     reuniéndose     _       _       _       _       _       _       _       _
6       reuniéndo       reunir  VERB    VBG     VerbForm=Ger    0       root    _       _
7       se      él      PRON    SE      Case=Acc,Dat|Person=3|PrepCase=Npr|PronType=Prs|Reflex=Yes      6       compound:prt    _       _
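
The difference between the three excerpts can be checked mechanically: an MWT is concatenative when joining the FORMs of its syntactic words reproduces the surface token. A minimal Python sketch follows; the function name `is_concatenative` is hypothetical, and the forms are copied from the excerpts above.

```python
import unicodedata


def is_concatenative(surface, parts):
    """Return True if the word FORMs join back into the surface token."""
    # Normalize to NFC so precomposed and combining accents compare equal.
    def nfc(s):
        return unicodedata.normalize("NFC", s)

    return nfc("".join(parts)) == nfc(surface)


# (treebank, surface token, FORMs of its syntactic words) from the excerpts above
examples = [
    ("GSD", "depositárselo", ["depositár", "se", "lo"]),
    ("AnCora", "afeitárselo", ["afeitar", "se", "lo"]),
    ("PUD", "reuniéndose", ["reuniéndo", "se"]),
]

results = {name: is_concatenative(surface, parts) for name, surface, parts in examples}
print(results)
```

With these three inputs, only the AnCora entry fails the check, because its host form drops the accent.
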
@AngledLuffa
Author

... although I mention adding the accents for AnCora, removing them from GSD and PUD would be just as satisfactory

@dan-zeman
Member

The rule in Spanish (and generally in the original UD guidelines, until English went its own way :-)) is that the pieces have the FORM that they would have if they occurred as orthographic words. Therefore, the preferable solution is to remove the accents from GSD and PUD.
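
As a rough illustration of that rule, the sketch below removes the accent that a host verb only carries because clitics were attached. It is a heuristic under stated assumptions, not the conversion actually used for any treebank: the name `free_form` is hypothetical, input is assumed to be lowercase and NFC-normalized, and only infinitives and gerunds are handled (imperatives are not).

```python
import re

ACUTE = str.maketrans("áéí", "aei")

# Host forms that are accented only because clitics were attached:
# gerunds in -ándo / -éndo and infinitives in -ár / -ér / -ír.
# Infinitives such as "reír" or "oír" are accented even as free words
# (í directly after a, e, o), so the lookbehind leaves them alone.
GERUND = re.compile(r"[áé]ndo$")
INFINITIVE = re.compile(r"(?:[áé]r|(?<![aeo])ír)$")


def free_form(host):
    """Heuristic: the FORM the host would have as an orthographic word."""
    if GERUND.search(host) or INFINITIVE.search(host):
        # De-accent only the ending; the rest of the word is kept as is.
        cut = -4 if host.endswith("ndo") else -2
        return host[:cut] + host[cut:].translate(ACUTE)
    return host


for host in ["depositár", "reuniéndo", "afeitar", "reír"]:
    print(host, "->", free_form(host))
```
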

@AngledLuffa
Author

Therefore, the preferable solution is to remove the accents from GSD and PUD.

Is that a viable solution, or would that be a weird modification to someone else's treebank?

@amir-zeldes

The rule in Spanish (and generally in the original UD guidelines, until English went its own way :-)) is that the pieces have the FORM that they would have if they occurred as orthographic words.

I don't think English was the first (or the last) to prefer MWTs with tokens that sum up to the whole. For example, UD_Arabic-PADT is older and keeps clitic pronouns in their orthographic form inside MWTs (ه does not become هُوَ). We copied that in UD_Hebrew-IAHLTwiki, and UD_Coptic is the same.

I think if there is a way to make MWT sub-tokenization concatenative, it is by far preferable, since it makes tokenization much easier in that language. I don't really see the value in removing the accents given that we also have lemmatization. It just means you now have to use a seq2seq tokenizer and have more chances of errors. Of course in some cases it's inevitable, since you don't have a viable segmentation (e.g. Portuguese "a" can contain two tokens, so that's not solvable using concatenative MWTs), but if it's fairly easy to do I would absolutely prefer that in any language I work with.
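
To make the engineering point concrete: with concatenative MWTs, a tokenizer only has to predict cut positions inside the surface token, whereas a non-concatenative analysis has to generate new strings. The sketch below is illustrative only; `split_at` and `offsets_for` are hypothetical helper names, and the forms come from the GSD and AnCora excerpts earlier in this thread.

```python
def split_at(surface, offsets):
    """Cut a surface token at the given character offsets."""
    bounds = [0, *offsets, len(surface)]
    return [surface[a:b] for a, b in zip(bounds, bounds[1:])]


def offsets_for(surface, parts):
    """Return the cut offsets if the parts are concatenative, else None."""
    if "".join(parts) != surface:
        return None  # the parts cannot be expressed as substrings of the surface
    offsets, pos = [], 0
    for part in parts[:-1]:
        pos += len(part)
        offsets.append(pos)
    return offsets


gsd = offsets_for("depositárselo", ["depositár", "se", "lo"])
ancora = offsets_for("afeitárselo", ["afeitar", "se", "lo"])
print(gsd, ancora)
```

The GSD-style analysis reduces to a list of offsets that `split_at` can replay, while the AnCora-style analysis yields `None` and needs a separate step that rewrites the host form.
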

Disclaimer: I don't really work much with UD Spanish and there may be really important existing tools or other corpora that favor stripping the accents, in which case that needs to be considered of course.

@AngledLuffa
Author

I think if there is a way to make MWT sub-tokenization concatenative, it is by far preferable, since it makes tokenization much easier in that language.

Agreed, which is why I posted this issue here instead of GSD & PUD, since AnCora is the one which would change under that scheme. Although for Spanish, tokens such as del would be hard to make concatenative.

(Even in English we have some examples that are a bit of a stretch, such as gonna.)

@dan-zeman
Member

dan-zeman commented Jun 19, 2024

I don't think English was the first (or the last) to prefer MWTs with tokens that sum up to the whole. For example, UD_Arabic-PADT is older and keeps clitic pronouns in their orthographic form inside MWTs (ه does not become هُوَ).

True. In the case of PADT, it is legacy tokenization from the pre-UD version of PADT. Whenever an orthographic word was split into multiple nodes in the original PADT, it was reflected as a multiword token in UD, without trying to revisit the rules and possibly adjust them in the UD spirit. Partly for time/capacity reasons, partly simply because the person doing the conversion (me) did not possess the necessary knowledge to even spot the issue. (Besides, I am not sure that هُوَ would be my preferred solution when I think of it now. I am still no expert on Arabic but it seems to me that this form is nominative while the required form would be accusative. It is possible that the expected form (paradigm slot) never occurs as a free form in the language; in such cases, taking a substring of the surface token is probably the only option.)

@dan-zeman
Member

Note for myself: COSER sides with AnCora:

# sent_id = astu-480
# text = Lo que no pasó este año por aquí, preguntándome por la capilla yo creo que fueron pa la playa todos pero tenía que enseñales desde aquí por donde tenían que ir pa ya pero creo que taba así to l día.
10-11	preguntándome	_	_	_	_	_	_	_	_
10	preguntando	preguntar	VERB	_	VerbForm=Ger	4	advcl	_	_
11	me	yo	PRON	pc1cs000	Case=Dat|Number=Sing|Person=1|PrepCase=Npr|PronType=Prs	10	expl:pv	_	_
12	por	por	ADP	sps00	_	14	case	_	_
13	la	el	DET	da0fs0	Definite=Def|Gender=Fem|Number=Sing|PronType=Art	14	det	_	_
14	capilla	capilla	NOUN	ncfs000	Gender=Fem|Number=Sing	10	obl	_	_

@AngledLuffa
Author

Note for myself: COSER sides with AnCora

Fair point. I had only checked the three I mentioned.

Would it be valid to rewrite the forms to match one or the other standard? I do agree with @amir-zeldes that keeping accents is preferable (bearing in mind my opinion is an engineering opinion, not a linguistic opinion) but my only really strong desire is to see them be unified somehow.

dan-zeman added a commit to UniversalDependencies/UD_Spanish-PUD that referenced this issue Jun 19, 2024
@amir-zeldes

I am not sure that هُوَ would be my preferred solution when I think of it now. I am still no expert on Arabic but it seems to me that this form is nominative while the required form would be accusative

This is a tricky position, because some environments are "MWT only", like you say. But arguably this is true even of the paradigm examples such as Romance article fusion: if we require a masculine article such that it is governed by "à" then it can only be "u" (which historically is true, it's just L-vocalization).

So the question is one of granularity - how specific we take the required form's environment to be. If we don't want specific prepositions to be an environment and we say it's "accusative" or deprel=obj, we could also think that the canonical Standard Arabic independent accusative pronoun is إياه and start putting that into every MWT with a clitic object. To be clear, I don't think this is the right thing to do at all. Ultimately, this kind of analysis seems to import many more assumptions and much more complexity into the treebank, whereas leaving the ه as is is both easy and better for engineering reasons.

@dan-zeman
Member

Fixed UD_Spanish-PUD in UniversalDependencies/UD_Spanish-PUD@36178ff. It turns out it was already mostly in line with AnCora. Using

[áéí](r|ndo)(me|te|se|l[eoa]s?|nos|os){1,2}\t

I found 17 gerunds with clitics. Out of them, 16 were already good and only reuniéndo had to be fixed.
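
One possible way to run that pattern over a CoNLL-U file in Python is sketched below. This is an assumption about usage, not the procedure actually followed for the fix: `find_accented_hosts` is a hypothetical name, and the sample rows are trimmed to ID, FORM and LEMMA from the PUD and COSER excerpts in this thread.

```python
import re

# The pattern quoted above: accented vowel, infinitive or gerund ending,
# one or two clitics, then the TAB that ends the FORM column.
PATTERN = re.compile(r"[áéí](r|ndo)(me|te|se|l[eoa]s?|nos|os){1,2}\t")


def find_accented_hosts(conllu_text):
    """Yield (surface, host_form) for matching MWTs whose host keeps the accent."""
    lines = conllu_text.splitlines()
    for i, line in enumerate(lines):
        cols = line.split("\t")
        # MWT range lines have an ID such as "6-7"; skip comments and other lines.
        if line.startswith("#") or len(cols) < 2 or "-" not in cols[0]:
            continue
        if not PATTERN.search(cols[1] + "\t") or i + 1 >= len(lines):
            continue
        host = lines[i + 1].split("\t")[1]  # first syntactic word of the MWT
        if re.search(r"[áéí](r|ndo)$", host):
            yield cols[1], host


rows = [
    ("6-7", "reuniéndose", "_"), ("6", "reuniéndo", "reunir"), ("7", "se", "él"),
    ("10-11", "preguntándome", "_"), ("10", "preguntando", "preguntar"), ("11", "me", "yo"),
]
sample = "\n".join("\t".join(row) for row in rows)
hits = list(find_accented_hosts(sample))
print(hits)
```
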

@AngledLuffa
Author

That's kind of funny, that I was manually searching and the first one I came across was the one which was different from all the others.

So it sounds like changing GSD would be the more accepted solution? Is that something we can do? Is that something I would need to do?

@dan-zeman
Member

I am looking into it. There are 235 instances in GSD. Many of them are already in line with AnCora, too; some are not. But 15 of them have not even been segmented.
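
The three situations mentioned here (already in line, accent kept on the host, not segmented at all) could be tallied along these lines. This is only a sketch: `classify` and the status labels are hypothetical, the clitic pattern is the one quoted earlier in the thread with `$` in place of the trailing TAB because it is applied to the FORM alone, and the last sample row is an invented unsegmented token for illustration, not taken from GSD.

```python
import re
from collections import Counter

# Same pattern as quoted earlier, anchored with $ because only FORM is tested.
PATTERN = re.compile(r"[áéí](r|ndo)(me|te|se|l[eoa]s?|nos|os){1,2}$")
ACCENTED_HOST = re.compile(r"[áéí](r|ndo)$")


def classify(rows):
    """Count candidate tokens by status; rows are (ID, FORM) pairs in file order."""
    counts = Counter()
    for i, (tok_id, form) in enumerate(rows):
        if not PATTERN.search(form):
            continue
        if "-" not in tok_id:
            counts["unsegmented"] += 1  # verb plus clitics left as one word
        elif i + 1 < len(rows) and ACCENTED_HOST.search(rows[i + 1][1]):
            counts["accent kept"] += 1  # host like "depositár"
        else:
            counts["in line with AnCora"] += 1
    return counts


rows = [
    ("33-35", "depositárselo"), ("33", "depositár"), ("34", "se"), ("35", "lo"),
    ("5-7", "afeitárselo"), ("5", "afeitar"), ("6", "se"), ("7", "lo"),
    ("12", "llevárselo"),  # hypothetical unsegmented token
]
counts = classify(rows)
print(dict(counts))
```

Note that a host which is accented even as a free word (for example "reír") would be counted under "accent kept" by this simple check, so any real run would need a manual review of the hits.
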

@AngledLuffa
Author

AngledLuffa commented Jun 19, 2024 via email

dan-zeman added a commit to UniversalDependencies/UD_Spanish-GSD that referenced this issue Jun 19, 2024
dan-zeman added a commit to UniversalDependencies/UD_Spanish-GSD that referenced this issue Jun 19, 2024
dan-zeman added a commit to UniversalDependencies/UD_Spanish-GSD that referenced this issue Jun 19, 2024
dan-zeman added a commit to UniversalDependencies/UD_Spanish-GSD that referenced this issue Jun 20, 2024
@dan-zeman
Member

Done. In the end the "many are already in line" claim held for dev and test data, while almost all instances in train had to be fixed.

@AngledLuffa
Author

Thank you, this will greatly improve the interoperability of the treebanks.
