_ea_ and _eo_ CONJ+[DET,ADP] wrong contractions #5

ceramisch · 2018-03-24T20:05:40Z

Strangely, the conjunction e (and) appears contracted with the following token o or a, producing eo and ea.

This would be correct in Arabic, but it is never done in Portuguese: e does not contract with any other word.

The fix for o can be automated, but a requires manual intervention because it is ambiguous between the determiner a (the feminine definite article "the", in most cases) and the preposition a ("to", rarer).
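A minimal sketch of how such cases could be located in a CoNLL-U file. It assumes the wrong contractions surface as tokens whose FORM is literally eo or ea (case-insensitive); the function name `find_suspect_tokens` and the demo sentence are hypothetical and not taken from the treebank or its scripts.

```python
def find_suspect_tokens(conllu_text, suspects=("eo", "ea")):
    """Return (sent_id, token_id, form) for each token whose FORM is a suspect.

    Assumption: the bad contractions appear as a single token with FORM
    "eo" or "ea" (compared case-insensitively).
    """
    hits = []
    sent_id = None
    for line in conllu_text.splitlines():
        if line.startswith("# sent_id"):
            sent_id = line.split("=", 1)[1].strip()
            continue
        if not line or line.startswith("#"):
            continue  # blank line (sentence boundary) or other comment
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed CoNLL-U token line
        if cols[1].lower() in suspects:
            hits.append((sent_id, cols[0], cols[1]))
    return hits


# Invented demo data, only to show the call.
sample = "\n".join([
    "# sent_id = demo-1",
    "\t".join(["1", "eo", "eo", "CCONJ", "_", "_", "0", "root", "_", "_"]),
    "\t".join(["2", "menino", "menino", "NOUN", "_", "_", "1", "dep", "_", "_"]),
    "",
])
print(find_suspect_tokens(sample))
```

A count over the whole corpus would then be the sum of `len(find_suspect_tokens(...))` over all files.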

ceramisch · 2018-03-24T20:06:45Z

Complement: 887 cases in the whole corpus

dan-zeman · 2018-03-29T07:49:05Z

This seems to be related to / duplicate of #1 and UniversalDependencies/docs#294 by @pedrobalage. (But that pull request has not been merged, so yes, we still need a fix.)

pedrobalage · 2018-03-29T07:53:37Z

Sorry guys, I didn't have time to follow up this issue. Maybe @ceramisch can help.

ceramisch · 2018-03-30T09:53:02Z

I think that the fix cannot be done fully automatically (but I can try to take care of it after release 2.2):

- First, we need to manually re-annotate the UPOS of o and a as PRON or DET.
- Then, we need to assign their dependencies:
  - if UPOS=DET, we can automatically set DEPREL=det, with the next noun to the right as HEAD;
  - if UPOS=PRON, we can (probably) automatically set DEPREL=obj, but the target HEAD has to be set manually.
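The dependency step above could be sketched roughly as follows. This is only an illustration under stated assumptions: tokens are plain dicts with `id`, `upos`, `head` and `deprel` keys (a hypothetical representation, not the API of any tool mentioned in this thread), and the UPOS has already been corrected by hand.

```python
def attach_article(tokens, idx, noun_tags=("NOUN",)):
    """Assign HEAD/DEPREL to the token at position idx.

    Returns True if the token was fully resolved automatically,
    False if it still needs manual work.
    """
    tok = tokens[idx]
    if tok["upos"] == "DET":
        # Attach to the next noun to the right, if there is one.
        for cand in tokens[idx + 1:]:
            if cand["upos"] in noun_tags:
                tok["head"] = cand["id"]
                tok["deprel"] = "det"
                return True
        return False  # no noun found: leave for manual annotation
    if tok["upos"] == "PRON":
        # DEPREL can be guessed, but the HEAD must be chosen by a human.
        tok["deprel"] = "obj"
        return False
    return False


# Invented demo sentence fragment.
sentence = [
    {"id": 1, "form": "e", "upos": "CCONJ", "head": None, "deprel": None},
    {"id": 2, "form": "o", "upos": "DET", "head": None, "deprel": None},
    {"id": 3, "form": "menino", "upos": "NOUN", "head": None, "deprel": None},
]
resolved = attach_article(sentence, 1)
print(resolved, sentence[1]["head"], sentence[1]["deprel"])
```

Tokens for which the function returns False are the ones that would still have to be flagged for manual revision.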

Then comes my question: any recommendation of an annotation tool capable of dealing with UD2.0 directly, to make this task easier?

martinpopel · 2018-03-30T15:05:22Z

> any recommendation of an annotation tool capable of dealing with UD2.0 directly, to make this task easier?

For manual annotation, see http://universaldependencies.org/tools.html#third-party-tools (but not all the tools listed there support full UD 2.0, including enhanced deps).
For automatic edits, there is, for example, Udapi (I am the main author), which was used for detecting Portuguese multi-word tokens in the PT-PUD treebank, among other things.

amir-zeldes · 2018-03-31T08:27:30Z

For automatic edits using simple declarative rules (no coding needed) I can also recommend DepEdit, which we use to convert UD_English-GUM:

https://corpling.uis.georgetown.edu/depedit/

arademaker · 2018-03-31T12:50:34Z

@ceramisch we are using our library cl-conllu and the Emacs mode we developed, both listed in the UD tools.

I just checked, and this corpus has many validation issues. Are you planning to fix them for release 2.2? I can’t promise to help in the next 2 days.

dan-zeman · 2018-03-31T16:30:11Z

I think it will be better to leave this corpus out of the shared task. The participants will have enough on their plate even without it, and its inconsistency with Bosque worries me. However, it can still be in the full release 2.2 after the shared task. Then the deadline is June 15.

ceramisch · 2018-04-03T20:21:21Z

> I just checked that this corpus has many validation issues. Are you planning to solve these issues for this release 2.2? I can’t promise to help in the next 2 days.

No (as you probably realized). I would be happy to help solve this annoying "eo" issue, but I can only work on it after the rush of the PARSEME shared task is over (i.e. mid-May).

arademaker · 2020-05-01T19:53:37Z

The work on this issue was done by @ceramisch; I only fixed the commits to avoid unnecessary copies of files in the repository.

The Python code from @ceramisch marks all tokens that need manual revision with XXXXX and logs those tokens in a log file. I moved all the logs into a single file, scripts/issue-5-py.log. My Lisp code, scripts/issue-5.lisp, removes the XXXXX from the MISC fields, counting them for each file. Unfortunately, the numbers differ.
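For illustration, the remove-and-count step could look roughly like this in Python. This is not the actual scripts/issue-5.lisp; it assumes the marker is stored as a separate `|`-delimited item in the MISC column, which the thread does not confirm, and `strip_marker` is a hypothetical name.

```python
def strip_marker(conllu_text, marker="XXXXX"):
    """Remove `marker` from MISC fields; return (cleaned_text, count).

    Assumption: the marker is its own "|"-separated item in column 10.
    """
    count = 0
    out = []
    for line in conllu_text.split("\n"):
        cols = line.split("\t")
        if len(cols) == 10 and not line.startswith("#"):
            items = [m for m in cols[9].split("|") if m != "_"]
            if marker in items:
                count += 1
                items = [m for m in items if m != marker]
                # An empty MISC field is written as "_" in CoNLL-U.
                cols[9] = "|".join(items) if items else "_"
                line = "\t".join(cols)
        out.append(line)
    return "\n".join(out), count


# Invented demo line.
demo = "\t".join(["1", "a", "o", "DET", "_", "_", "2", "det", "_", "XXXXX"])
cleaned, n = strip_marker(demo)
print(n, cleaned.split("\t")[9])
```

Comparing the per-file counts from such a pass with the number of entries in the log file would show where the two diverge.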

ceramisch · 2020-05-05T15:55:24Z

The numbers probably differ because I had manually removed the XXXXX from the cases I had already checked manually. I will try to finish the manual verification for the next release so that this issue can be definitely closed. Thanks for sorting things out with the duplications, @arademaker and sorry for the mess: I think now I understand better the idea of branches for UD treebank development :-)

arademaker · 2020-05-05T15:57:54Z

I will create a workbench branch in the next few days and split the files into smaller files of only 10 sentences each, which will make them easier to edit. So please contact me before making any changes to the files. In any case, we are now in the freeze period for the UD 2.6 release, so no changes to dev and master are allowed.
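A rough sketch of the splitting step, assuming standard CoNLL-U input where sentences are separated by a single blank line. The function name `split_sentences` is hypothetical; this is not the code actually used for the workbench branch.

```python
def split_sentences(conllu_text, chunk_size=10):
    """Split CoNLL-U text into chunks of at most `chunk_size` sentences.

    Each returned chunk ends with a blank line, as CoNLL-U requires
    after every sentence.
    """
    sentences = [s for s in conllu_text.strip().split("\n\n") if s.strip()]
    chunks = []
    for start in range(0, len(sentences), chunk_size):
        group = sentences[start:start + chunk_size]
        chunks.append("\n\n".join(group) + "\n\n")
    return chunks


# Invented demo input: 25 one-token sentences.
demo = "".join(
    "# sent_id = s%d\n1\tx\tx\tX\t_\t_\t0\troot\t_\t_\n\n" % i
    for i in range(25)
)
print([chunk.count("# sent_id") for chunk in split_sentences(demo)])
```

Each chunk could then be written to its own file, for example under the documents/ folder.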

ceramisch · 2020-05-05T16:01:06Z

OK, no worries.

arademaker · 2020-05-28T05:10:25Z

@ceramisch branch workbench created and folder documents/ created.

arademaker · 2021-04-20T00:09:07Z

see also #9

arademaker · 2021-04-20T00:15:57Z

This issue is solved: I didn't find any remaining cases of ea or eo.

dan-zeman added the bug label Mar 29, 2018

arademaker mentioned this issue Mar 31, 2018

validation error: L3 Syntax leaf-mark-case #7

Closed

dan-zeman mentioned this issue Apr 23, 2018

Tokenization issue in UD_Portuguese-BR UniversalDependencies/docs#294

Closed

arademaker mentioned this issue May 28, 2018

rules for batch changes LR-POR/cl-conllu#50

Open

arademaker assigned ceramisch Apr 27, 2020

arademaker added this to the UD release 2.6 milestone Apr 27, 2020

arademaker added a commit that referenced this issue May 1, 2020

solves issue #5

45e770a

arademaker added a commit that referenced this issue May 1, 2020

related to #5, removed sentences that I have checked

d8d8e4c

arademaker added a commit that referenced this issue May 1, 2020

related to #5

2938fd3

arademaker mentioned this issue May 3, 2020

release 2.6 #16

Closed

arademaker closed this as completed Apr 20, 2021

AngledLuffa mentioned this issue Apr 4, 2023

Tokenization of clitics #10

Closed
