Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a constituency model for Icelandic #1389

Open
wants to merge 1 commit into
base: dev
Choose a base branch
from

Conversation

ingunnjk
Copy link

BEFORE YOU START: please make sure your pull request is against the dev branch.
We cannot accept pull requests against the main branch.
See our contributing guide for details.

Description

I want to add an Icelandic constituency model for Icelandic. I have made changes to prepare_con_dataset.py and added the script convert_icepahc.py in stanza/utils/datasets/constituency. I have also added an Icelandic BERT model to stanza/resources/default_packages.py.

Fixes Issues

A list of issues/bugs with # references. (e.g., #123)

Unit test coverage

Are there unit tests in place to make sure your code is functioning correctly?
(see here for a simple example)

Known breaking changes/behaviors

None.

@AngledLuffa
Copy link
Collaborator

Hey, thanks for sending this. Very helpful.

Are you familiar with squashing? If so, would you be up for squashing commit #1 with #2 and #3 with #4? If not, I can do it on my end.

In your change comments, you talk about how there is a version which is simplified already. Is it necessary to use that for the parser? If yes, did you use a script to do those simplifications?

…tuency parser

Create convert_icepahc.py

Update default_packages.py

Added a BERT model for Icelandic

Update default_packages.py
@ingunnjk
Copy link
Author

Hi!

I have now squashed the commits!

I tried using the original IcePaHC file for the parser but that didn't seem to work. I also removed some things that shouldn't be a part of the parsing like empty nodes (traces, comments etc.) and some of it was by code and some by hand so there is unfortunately not a script for that, I'm sorry.

Also one other thing, you added * to the list of functional tags to cut off since * is used as a separator in the treebank file that is used for the training. I would actually prefer if it is not cut off! I removed the ending of most constituents, like ADJPOC, but I would like to keep some of them that I think would be useful, for example IPMAT. I trained a model where * was cut off and then also when it is not cut off and the difference in scoring is only like 0,25% (from 90,63% to 90,38%). If it would then be possible to change the * to - in the output from the parser, that would be greatly appreciated (or change * to - beforehand and not cutting - off for the Icelandic parser?).

@AngledLuffa
Copy link
Collaborator

I also removed some things that shouldn't be a part of the parsing like empty nodes (traces, comments etc.) and some of it was by code and some by hand so there is unfortunately not a script for that, I'm sorry.

The tree reading code automatically removes empty nodes. Perhaps it would be possible to recreate the conversion by hand. I'm not too opposed to using editing files, but it does look like they update this corpus fairly often:

https://github.com/antonkarl/icecorpus

which suggests that being able to automatically convert their work to the internal Stanza format would be useful.

Also one other thing, you added * to the list of functional tags to cut off since * is used as a separator in the treebank file that is used for the training

I see how that could be useful, in that English also has some tags such as NP-TMP which could be kept instead of discarded. I'll have to think about how to do that. Probably pass in the dataset as a parameter to read_treebank

@ingunnjk
Copy link
Author

Okay! So would I have to do (or rather update) the conversion script in order to make the Icelandic constituency model a part of Stanza or could that come as an update later on?

@AngledLuffa
Copy link
Collaborator

In my ideal world, it would show up all at once :) If that doesn't work with you, we can also start from your converted version of the treebank. If you can send us the partially finished script, I can also help as needed, since you're already donating a bunch of work to us.

@AngledLuffa
Copy link
Collaborator

Any room for me to help with the conversion script?

@ingunnjk
Copy link
Author

Sorry for the late answer, I haven't been able to work on this for the past weeks. In the conversion script (since the tree reading code automatically removes empty nodes) it is mostly a matter of removing lemmas (so (NS-D löndum-land) becomes (NS-D löndum)) and combining word parts that start/end with $ (in most cases a noun and its determiner), for example should (NS-N tungur$-tunga) (D-N $nar-hinn) become (NS-N tungurnar) and (RP upp$-upp) (VBPI $heldur-halda) should become (VBPI uppheldur). I don't know exactly when I'll be able to continue working on this (hopefully next week) but if you have the time you could for sure help move it along.

@AngledLuffa
Copy link
Collaborator

AngledLuffa commented May 17, 2024 via email

@AngledLuffa
Copy link
Collaborator

I checked in a script which uses Tsurgeon to convert a single tree:

6b468b9

If you have a list of cases where $ combines two words into one word, I can expand that to cover the rest of the treebank. Obviously it's easy enough to combine neighboring $ words, but I'm not sure which ones take precedence in cases other than the two you've mentioned so far.

I put a brief introduction using this example on the github.io

https://stanfordnlp.github.io/stanza/tsurgeon.html

It doesn't really say anything other than the example I just checked in, though.

PS sorry for reusing your script's name and therefore making it slightly more annoying to do this PR

@AngledLuffa
Copy link
Collaborator

Any thoughts on the tregex / tsurgeon solution for this issue?

@AngledLuffa
Copy link
Collaborator

@ingunnjk any thoughts on how best to proceed with the conversion of the Icelandic dataset? If you have a set of heuristics on how to combine the $ nodes, I can turn that into a sequence of Tsurgeon expressions which hopefully handles the conversion from the raw dataset

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants