_ea_ and _eo_ CONJ+[DET,ADP] wrong contractions #5

Closed
ceramisch opened this issue Mar 24, 2018 · 16 comments

@ceramisch
Contributor

Strangely, the conjunction e ("and") appears contracted with the following token o or a, producing eo and ea.

This would be correct in Arabic, but it is never done in Portuguese: e does not contract with any other word.

Correcting the problem for o can be automatic, but a requires manual intervention because it is ambiguous between the determiner a (the feminine definite article "the", most cases) and the preposition a ("to", rarer).
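
To make the pattern concrete, here is a minimal sketch of how such cases could be located. It assumes (this is not confirmed in the thread) that the wrong contractions are encoded as CoNLL-U multiword token ranges with the surface form eo or ea, whose first syntactic word is e tagged CONJ or CCONJ. The function name, the helper `row`, and the toy sentences are hypothetical and only serve as illustration.

```python
SUSPECT_FORMS = {"eo", "ea"}       # surface forms reported in this issue
CONJ_TAGS = {"CONJ", "CCONJ"}      # UD v1 and UD v2 names for the conjunction tag


def find_e_contractions(conllu_text):
    """Return (sentence_no, id_range, form, needs_manual_check) tuples.

    Assumption: each wrong contraction is a multiword token range such as
    "1-2" whose first word is the conjunction "e".
    """
    hits = []
    sentences = [s for s in conllu_text.split("\n\n") if s.strip()]
    for sent_no, sentence in enumerate(sentences, start=1):
        rows = [line.split("\t") for line in sentence.splitlines()
                if line and not line.startswith("#")]
        rows = [r for r in rows if len(r) == 10]   # ignore malformed lines
        by_id = {r[0]: r for r in rows}
        for r in rows:
            tok_id, form = r[0], r[1]
            if "-" not in tok_id or form.lower() not in SUSPECT_FORMS:
                continue
            first_id, last_id = tok_id.split("-")
            first, last = by_id.get(first_id), by_id.get(last_id)
            if first is None or last is None:
                continue
            if first[1].lower() == "e" and first[3] in CONJ_TAGS:
                # "a" is ambiguous (determiner vs. preposition), so flag it.
                hits.append((sent_no, tok_id, form, last[1].lower() == "a"))
    return hits


def row(*cols):
    """Join column values into one tab-separated CoNLL-U line."""
    return "\t".join(cols)


# Toy data invented for this sketch, not taken from the corpus.
SAMPLE = "\n".join([
    "# text = eo gato",
    row("1-2", "eo", *["_"] * 8),
    row("1", "e", "e", "CCONJ", "_", "_", "3", "cc", "_", "_"),
    row("2", "o", "o", "DET", "_", "_", "3", "det", "_", "_"),
    row("3", "gato", "gato", "NOUN", "_", "_", "0", "root", "_", "_"),
    "",
    "# text = ea casa",
    row("1-2", "ea", *["_"] * 8),
    row("1", "e", "e", "CCONJ", "_", "_", "3", "cc", "_", "_"),
    row("2", "a", "a", "DET", "_", "_", "3", "det", "_", "_"),
    row("3", "casa", "casa", "NOUN", "_", "_", "0", "root", "_", "_"),
    "",
    "",
])

for hit in find_e_contractions(SAMPLE):
    print(hit)
```

The last element of each tuple only records whether the second word is a, which is the case described above as needing manual intervention; it does not attempt to resolve the ambiguity.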

@ceramisch
Contributor Author

Addendum: there are 887 cases in the whole corpus.

@dan-zeman
Member

This seems to be related to, or a duplicate of, #1 and UniversalDependencies/docs#294 by @pedrobalage. (That pull request has not been merged, though, so yes, we still need a fix.)

dan-zeman added the bug label Mar 29, 2018
@pedrobalage

Sorry guys, I didn't have time to follow up on this issue. Maybe @ceramisch can help.

@ceramisch
Contributor Author

ceramisch commented Mar 30, 2018

I think that the fix cannot be done fully automatically (but I can try to take care of it after release 2.2):

  • we need to manually re-annotate the UPOS of o and a as PRON or DET
  • then, we need to assign their dependencies:
    • we can automatically choose DEPREL=det, with the next noun to the right as HEAD, if UPOS=DET,
    • we can (probably) automatically choose DEPREL=obj if UPOS=PRON, but the target HEAD should be set manually
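
The plan above could be sketched roughly as follows. This is only an illustration under stated assumptions: the sentence is given as a list of 10-column CoNLL-U rows, the UPOS of the o/a token has already been corrected by hand, and the function name `attach_split_token` and the return labels are hypothetical.

```python
def attach_split_token(rows, idx):
    """Set DEPREL (and HEAD where possible) for the o/a token at rows[idx].

    rows: list of 10-column CoNLL-U rows (lists of str) for one sentence.
    Returns "auto" if both HEAD and DEPREL were set automatically,
    otherwise "manual" to signal that an annotator must finish the job.
    """
    tok = rows[idx]
    upos = tok[3]
    if upos == "DET":
        # Attach to the next noun to the right, skipping range/empty-node rows.
        for cand in rows[idx + 1:]:
            if cand[0].isdigit() and cand[3] == "NOUN":
                tok[6] = cand[0]   # HEAD
                tok[7] = "det"     # DEPREL
                return "auto"
        return "manual"            # no noun found to the right
    if upos == "PRON":
        tok[7] = "obj"             # DEPREL guessed; HEAD left untouched
        return "manual"
    return "manual"                # any other UPOS is out of scope here


# Toy sentences invented for this sketch.
det_case = [
    ["1", "e", "e", "CCONJ", "_", "_", "3", "cc", "_", "_"],
    ["2", "o", "o", "DET", "_", "_", "0", "_", "_", "_"],
    ["3", "gato", "gato", "NOUN", "_", "_", "0", "root", "_", "_"],
]
pron_case = [
    ["1", "e", "e", "CCONJ", "_", "_", "3", "cc", "_", "_"],
    ["2", "o", "o", "PRON", "_", "_", "1", "dep", "_", "_"],
    ["3", "viu", "ver", "VERB", "_", "_", "0", "root", "_", "_"],
]

print(attach_split_token(det_case, 1), det_case[1][6:8])
print(attach_split_token(pron_case, 1), pron_case[1][6:8])
```

Returning "manual" for the PRON case mirrors the point that the target HEAD should be set by hand even when DEPREL=obj is a safe guess.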

Then comes my question: any recommendation of an annotation tool capable of dealing with UD2.0 directly, to make this task easier?

@martinpopel
Member

> any recommendation of an annotation tool capable of dealing with UD2.0 directly, to make this task easier?

For manual annotation, see http://universaldependencies.org/tools.html#third-party-tools (but not all the tools listed there support full UD 2.0 including enhanced deps).
For automatic edits, there is, for example, Udapi (I am the main author), which was used, among other things, for detecting Portuguese multi-word tokens in the PT-PUD treebank.

@amir-zeldes

For automatic edits using simple declarative rules (no coding needed) I can also recommend DepEdit, which we use to convert UD_English-GUM:

https://corpling.uis.georgetown.edu/depedit/

@arademaker
Collaborator

@ceramisch, we are using our library cl-conllu and the Emacs mode we developed; both are listed among the UD tools.

I just checked, and this corpus has many validation issues. Are you planning to solve these issues for release 2.2? I can't promise to help in the next 2 days.

@dan-zeman
Member

I think it will be better to leave this corpus out of the shared task. The participants will have enough on their plate even without it, and its inconsistency with Bosque worries me. However, it can still be in the full release 2.2 after the shared task. Then the deadline is June 15.

@ceramisch
Contributor Author

> I just checked, and this corpus has many validation issues. Are you planning to solve these issues for release 2.2? I can't promise to help in the next 2 days.

No (as you probably realized). I would be happy to help solve this annoying "eo" issue, but I can only work on it after the rush of the PARSEME shared task is over (i.e. mid-May).

@arademaker
Collaborator

arademaker commented May 1, 2020

The work on this issue was done by @ceramisch; I only fixed the commits to avoid unnecessary copies of files in the repository.

The Python code from @ceramisch marks all tokens that need manual revision with XXXXX and logs those tokens in a log file. I moved all logs into a single file, scripts/issue-5-py.log. My Lisp code, scripts/issue-5.lisp, removes the XXXXX from the MISC fields, counting them for each file. Unfortunately, the numbers differ.
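
For readers without the repository at hand, the removal-and-count step might look roughly like the Python sketch below. It is not the actual scripts/issue-5.lisp code; it assumes that the marker is stored as a standalone XXXXX item in the MISC column, separated from other items by "|", which the thread does not spell out. The function name `strip_markers` is hypothetical.

```python
MARKER = "XXXXX"   # revision marker mentioned in this issue


def strip_markers(conllu_text):
    """Remove MARKER from MISC fields; return (cleaned_text, marked_tokens).

    Assumption: the marker is a separate "|"-delimited item in column 10.
    """
    marked = 0
    out = []
    for line in conllu_text.split("\n"):
        cols = line.split("\t")
        if len(cols) == 10 and not line.startswith("#"):
            items = [] if cols[9] == "_" else cols[9].split("|")
            kept = [item for item in items if item != MARKER]
            if len(kept) != len(items):
                marked += 1
                cols[9] = "|".join(kept) or "_"   # empty MISC becomes "_"
                line = "\t".join(cols)
        out.append(line)
    return "\n".join(out), marked


# Toy rows invented for this sketch.
text = "\n".join([
    "# text = e o gato",
    "\t".join(["1", "e", "e", "CCONJ", "_", "_", "3", "cc", "_", "_"]),
    "\t".join(["2", "o", "o", "DET", "_", "_", "3", "det", "_", "XXXXX"]),
    "\t".join(["3", "gato", "gato", "NOUN", "_", "_", "0", "root", "_",
               "SpaceAfter=No|XXXXX"]),
    "",
])

cleaned, count = strip_markers(text)
print(count)
```

A per-file count obtained this way can then be compared with the number of entries logged for the same file, which is where the mismatch mentioned above would show up.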

arademaker added a commit that referenced this issue May 1, 2020
arademaker mentioned this issue May 3, 2020
@ceramisch
Contributor Author

The numbers probably differ because I had manually removed the XXXXX from the cases I had already checked by hand. I will try to finish the manual verification for the next release so that this issue can be closed for good. Thanks for sorting things out with the duplications, @arademaker, and sorry for the mess: I think I now better understand the idea of branches for UD treebank development :-)

@arademaker
Collaborator

I will create a workbench branch in the next few days and split the corpus files into smaller files of only 10 sentences each, which will make them easier to edit. So please contact me before making any change to the files. Anyway, we are now in the freeze period of the UD 2.6 release, so no changes to dev and master are allowed.
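
As a rough illustration of the splitting step, the sketch below cuts a CoNLL-U text into chunks of at most 10 sentences. It is not the tooling actually used for the workbench branch; the function name `split_conllu` is hypothetical, and it assumes sentences are separated by exactly one blank line, as the CoNLL-U format requires.

```python
def split_conllu(conllu_text, size=10):
    """Split CoNLL-U text into chunks of at most `size` sentences.

    Each chunk is itself CoNLL-U text: every sentence, including the last
    one, is followed by a blank line.
    """
    sentences = [s.strip("\n") for s in conllu_text.split("\n\n") if s.strip()]
    return ["\n\n".join(sentences[i:i + size]) + "\n\n"
            for i in range(0, len(sentences), size)]


# Toy corpus of 23 one-token sentences, invented for this sketch.
corpus = "".join(
    "# sent_id = {0}\n1\tw{0}\tw{0}\tNOUN\t_\t_\t0\troot\t_\t_\n\n".format(i)
    for i in range(1, 24)
)

chunks = split_conllu(corpus)
print([chunk.count("# sent_id") for chunk in chunks])
```

Each chunk can then be written to its own file; concatenating the chunks in order gives back the original text, so the split files can be merged again after editing.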

@ceramisch
Contributor Author

OK, no worries.

@arademaker
Collaborator

@ceramisch, the workbench branch and the documents/ folder have been created.

@arademaker
Collaborator

see also #9

@arademaker
Collaborator

This issue was solved; I didn't find any cases of ea or eo.
