Update MultiCoNer #3006

helpmefindaname · 2022-12-03T01:10:17Z

This PR adds the following features:

MultiCoNer can now download the dataset, including the test set, from the S3 open data registry.
When loading a ColumnCorpus, comment lines can now add metadata, e.g. # id 2c8f5b49-4df5-44b5-8c2a-bdc340010ea3 domain=de-lowner adds a metadata entry domain with the value de-lowner. In general, any tab-separated <key>=<value> pattern in a comment is parsed.
MultiCoNerV2 is added, but its data must be downloaded manually, as used to be the case for MultiCoNer.
Loading datasets via a ColumnCorpus no longer warns about creating an empty sentence for every single example.
When loading a ColumnCorpus, the last token of each sentence now correctly has whitespace_after=0 instead of possibly 1.
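
To illustrate the last point, here is a minimal, self-contained sketch of why the final token should carry whitespace_after=0. ToyToken, build_tokens and to_text are hypothetical names invented for this example and are not Flair's classes or functions; the sketch only assumes that a token stores its text plus a count of trailing spaces.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyToken:
    # Hypothetical stand-in for a token; not Flair's Token class.
    text: str
    whitespace_after: int = 1  # number of spaces that follow the token


def build_tokens(words: List[str]) -> List[ToyToken]:
    """Create tokens and mark the last one as having no trailing whitespace."""
    tokens = [ToyToken(word) for word in words]
    if tokens:
        tokens[-1].whitespace_after = 0
    return tokens


def to_text(tokens: List[ToyToken]) -> str:
    """Rebuild the sentence text from tokens and their whitespace counts."""
    return "".join(t.text + " " * t.whitespace_after for t in tokens)


tokens = build_tokens(["Berlin", "is", "nice"])
text = to_text(tokens)
```

With the last token set to 0, the reconstructed text has no trailing space; if it were left at 1, the text would end in a stray space.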

alanakbik · 2022-12-03T02:45:46Z

flair/datasets/sequence_labeling.py

+            tokens.append(token)
+
+        sentence: Sentence = Sentence(text=tokens)


This looks much cleaner than creating an empty sentence and adding tokens. I wonder if this means we can retire the add_token method entirely in a separate PR?

helpmefindaname · 2022-12-03

Looking at the usages of add_token that remain after this PR:

Tests, which can be rewritten.

UniversalDependenciesDataset, which works the same way ColumnDataset did before.

The Sentence constructor, which I would prefer to keep.

We can surely make the add_token method private.

helpmefindaname · 2022-12-03

I added a proposal to this PR.

alanakbik · 2022-12-03T02:46:48Z

flair/datasets/sequence_labeling.py

+            for comment_row in comment.split("\t"):
+                if "=" in comment_row:
+                    key, value = comment_row.split("=", 2)
+                    sentence.add_metadata(key, value)
+


Thanks for adding this, we really needed this feature in ColumnDataset :)
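
For readers who want to try the idea outside Flair, here is a rough, self-contained sketch of the comment parsing discussed above. The function name parse_comment_metadata and the returned dict are assumptions made for illustration; in the PR the values go to sentence.add_metadata instead. The sketch also uses a maxsplit of 1, which differs from the quoted diff: with a maxsplit of 2, a field containing two = signs splits into three parts, and unpacking those into key and value raises a ValueError.

```python
from typing import Dict


def parse_comment_metadata(comment: str) -> Dict[str, str]:
    """Collect tab-separated <key>=<value> fields from a comment line.

    Hypothetical helper for illustration; not part of Flair.
    """
    metadata: Dict[str, str] = {}
    # Drop the leading "#" marker and surrounding whitespace.
    body = comment.lstrip("#").strip()
    for field in body.split("\t"):
        if "=" in field:
            # maxsplit=1 keeps any further "=" inside the value.
            key, value = field.split("=", 1)
            metadata[key.strip()] = value.strip()
    return metadata


example = "# id 2c8f5b49-4df5-44b5-8c2a-bdc340010ea3\tdomain=de-lowner"
parsed = parse_comment_metadata(example)
```

Fields without an = sign, such as the id part of the example, are skipped rather than treated as errors.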

alanakbik · 2022-12-06T01:46:05Z

@helpmefindaname thanks for adding this and for the add_token modifications!

helpmefindaname added 5 commits December 3, 2022 01:51

use aws download for NER_MULTI_CONER and support NER_MULTI_CONER_V2

5df6699

parse comments in columncorpus to possible metadata

29f18f6

add warning if MultiCoNerV2 dataset is not downloaded manually

6799a4f

fix typing

7e7dcbe

fix text columncorpus last token has no whitespace after

884dbc3

alanakbik reviewed Dec 3, 2022

helpmefindaname added 2 commits December 3, 2022 23:25

make "add_token" method private

382475b

remove unused import and format code

250697d

alanakbik merged commit 4f45c91 into flairNLP:master Dec 6, 2022

helpmefindaname deleted the multiconer_dataset_update branch December 6, 2022 10:00

dobbersc mentioned this pull request Dec 11, 2022

ValueError When Loading TACRED #3018

Closed
