Add a constituency model for Icelandic #1389

ingunnjk · 2024-04-24T15:59:39Z

BEFORE YOU START: please make sure your pull request is against the dev branch.
We cannot accept pull requests against the main branch.
See our contributing guide for details.

Description

I want to add an Icelandic constituency model for Icelandic. I have made changes to prepare_con_dataset.py and added the script convert_icepahc.py in stanza/utils/datasets/constituency. I have also added an Icelandic BERT model to stanza/resources/default_packages.py.

Fixes Issues

A list of issues/bugs with # references. (e.g., #123)

Unit test coverage

Are there unit tests in place to make sure your code is functioning correctly?
(see here for a simple example)

Known breaking changes/behaviors

None.

AngledLuffa · 2024-04-26T03:47:51Z

Hey, thanks for sending this. Very helpful.

Are you familiar with squashing? If so, would you be up for squashing commit #1 with #2 and #3 with #4? If not, I can do it on my end.

In your change comments, you talk about how there is a version which is simplified already. Is it necessary to use that for the parser? If yes, did you use a script to do those simplifications?

…tuency parser Create convert_icepahc.py Update default_packages.py Added a BERT model for Icelandic Update default_packages.py

ingunnjk · 2024-04-26T11:02:07Z

Hi!

I have now squashed the commits!

I tried using the original IcePaHC file for the parser but that didn't seem to work. I also removed some things that shouldn't be a part of the parsing like empty nodes (traces, comments etc.) and some of it was by code and some by hand so there is unfortunately not a script for that, I'm sorry.

Also one other thing, you added * to the list of functional tags to cut off since * is used as a separator in the treebank file that is used for the training. I would actually prefer if it is not cut off! I removed the ending of most constituents, like ADJPOC, but I would like to keep some of them that I think would be useful, for example IPMAT. I trained a model where * was cut off and then also when it is not cut off and the difference in scoring is only like 0,25% (from 90,63% to 90,38%). If it would then be possible to change the * to - in the output from the parser, that would be greatly appreciated (or change * to - beforehand and not cutting - off for the Icelandic parser?).

AngledLuffa · 2024-04-26T20:24:21Z

I also removed some things that shouldn't be a part of the parsing like empty nodes (traces, comments etc.) and some of it was by code and some by hand so there is unfortunately not a script for that, I'm sorry.

The tree reading code automatically removes empty nodes. Perhaps it would be possible to recreate the conversion by hand. I'm not too opposed to using editing files, but it does look like they update this corpus fairly often:

https://github.com/antonkarl/icecorpus

which suggests that being able to automatically convert their work to the internal Stanza format would be useful.

Also one other thing, you added * to the list of functional tags to cut off since * is used as a separator in the treebank file that is used for the training

I see how that could be useful, in that English also has some tags such as NP-TMP which could be kept instead of discarded. I'll have to think about how to do that. Probably pass in the dataset as a parameter to read_treebank

ingunnjk · 2024-04-30T15:37:33Z

Okay! So would I have to do (or rather update) the conversion script in order to make the Icelandic constituency model a part of Stanza or could that come as an update later on?

AngledLuffa · 2024-04-30T18:53:29Z

In my ideal world, it would show up all at once :) If that doesn't work with you, we can also start from your converted version of the treebank. If you can send us the partially finished script, I can also help as needed, since you're already donating a bunch of work to us.

AngledLuffa · 2024-05-09T20:57:41Z

Any room for me to help with the conversion script?

ingunnjk · 2024-05-17T08:33:35Z

Sorry for the late answer, I haven't been able to work on this for the past weeks. In the conversion script (since the tree reading code automatically removes empty nodes) it is mostly a matter of removing lemmas (so (NS-D löndum-land) becomes (NS-D löndum)) and combining word parts that start/end with $ (in most cases a noun and its determiner), for example should (NS-N tungur$-tunga) (D-N $nar-hinn) become (NS-N tungurnar) and (RP upp$-upp) (VBPI $heldur-halda) should become (VBPI uppheldur). I don't know exactly when I'll be able to continue working on this (hopefully next week) but if you have the time you could for sure help move it along.

AngledLuffa · 2024-05-17T15:12:58Z

It's a little surprising to me that we would want to combine nodes that have determiners with their nouns. Then again, I do see in the UD version that there's a token uppheldur and no situation where the two tokens are split in pieces. Either way, these two operations should be pretty straightforward to do with the Tsurgeon interface https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/trees/tregex/tsurgeon/Tsurgeon.html ... yuck, it seems I haven't written up how to use tsurgeon anywhere. I'll do that today and write up a brief example of the lemma removal. The node combination might be a little trickier, but if you tell me which tag to choose when combining two tags, that should be possible as well

…

On Fri, May 17, 2024 at 1:33 AM Ingunn Jóhanna Kristjánsdóttir < ***@***.***> wrote: Sorry for the late answer, I haven't been able to work on this for the past weeks. In the conversion script (since the tree reading code automatically removes empty nodes) it is mostly a matter of removing lemmas (so *(NS-D löndum-land)* becomes *(NS-D löndum)*) and combining word parts that start/end with $ (in most cases a noun and its determiner), for example should *(NS-N tungur$-tunga) (D-N $nar-hinn)* become *(NS-N tungurnar)* and *(RP upp$-upp) (VBPI $heldur-halda)* should become *(VBPI uppheldur)*. I don't know exactly when I'll be able to continue working on this (hopefully next week) but if you have the time you could for sure help move it along. — Reply to this email directly, view it on GitHub <#1389 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AA2AYWJZBIAOOG6ZG3HNKITZCW6HLAVCNFSM6AAAAABGXGSW26VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCMJXGAZTAOBYGQ> . You are receiving this because you commented.Message ID: ***@***.***>

AngledLuffa · 2024-05-18T08:05:15Z

I checked in a script which uses Tsurgeon to convert a single tree:

6b468b9

If you have a list of cases where $ combines two words into one word, I can expand that to cover the rest of the treebank. Obviously it's easy enough to combine neighboring $ words, but I'm not sure which ones take precedence in cases other than the two you've mentioned so far.

I put a brief introduction using this example on the github.io

https://stanfordnlp.github.io/stanza/tsurgeon.html

It doesn't really say anything other than the example I just checked in, though.

PS sorry for reusing your script's name and therefore making it slightly more annoying to do this PR

AngledLuffa · 2024-06-22T22:45:57Z

Any thoughts on the tregex / tsurgeon solution for this issue?

AngledLuffa · 2024-07-22T19:22:59Z

@ingunnjk any thoughts on how best to proceed with the conversion of the Icelandic dataset? If you have a set of heuristics on how to combine the $ nodes, I can turn that into a sequence of Tsurgeon expressions which hopefully handles the conversion from the raw dataset

Update prepare_con_dataset.py for the training of an Icelandic consti…

77d753d

…tuency parser Create convert_icepahc.py Update default_packages.py Added a BERT model for Icelandic Update default_packages.py

ingunnjk force-pushed the patch-1 branch from 9e03f8e to 77d753d Compare April 26, 2024 10:34

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add a constituency model for Icelandic #1389

Add a constituency model for Icelandic #1389

ingunnjk commented Apr 24, 2024

AngledLuffa commented Apr 26, 2024

ingunnjk commented Apr 26, 2024

AngledLuffa commented Apr 26, 2024

ingunnjk commented Apr 30, 2024

AngledLuffa commented Apr 30, 2024

AngledLuffa commented May 9, 2024

ingunnjk commented May 17, 2024

AngledLuffa commented May 17, 2024 via email

AngledLuffa commented May 18, 2024

AngledLuffa commented Jun 22, 2024

AngledLuffa commented Jul 22, 2024

Add a constituency model for Icelandic #1389

Are you sure you want to change the base?

Add a constituency model for Icelandic #1389

Conversation

ingunnjk commented Apr 24, 2024

Description

Fixes Issues

Unit test coverage

Known breaking changes/behaviors

AngledLuffa commented Apr 26, 2024

ingunnjk commented Apr 26, 2024

AngledLuffa commented Apr 26, 2024

ingunnjk commented Apr 30, 2024

AngledLuffa commented Apr 30, 2024

AngledLuffa commented May 9, 2024

ingunnjk commented May 17, 2024

AngledLuffa commented May 17, 2024 via email

AngledLuffa commented May 18, 2024

AngledLuffa commented Jun 22, 2024

AngledLuffa commented Jul 22, 2024