Update MultiCoNer #3006

Merged

helpmefindaname
Collaborator

@helpmefindaname helpmefindaname commented Dec 3, 2022

This PR adds the following features:

  • MultiCoNer can now download the dataset from the S3 opendata registry. This also includes the test set.
  • When loading a ColumnCorpus, comment lines can carry metadata. For example, # id 2c8f5b49-4df5-44b5-8c2a-bdc340010ea3 domain=de-lowner adds a metadata entry domain with the value de-lowner. In general, any tab-separated <key>=<value> pattern in a comment is picked up.
  • MultiCoNerV2 is added, but its data has to be added manually, as used to be the case for MultiCoNer.
  • Loading datasets via a ColumnCorpus no longer warns about creating an empty sentence for every single example.
  • When loading a ColumnCorpus, the last token of each sentence now correctly has whitespace_after=0 instead of possibly 1.
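
The last point can be illustrated with a small sketch. It is not flair's implementation: ToyToken, build_tokens and to_text are hypothetical names, and the only assumption carried over from the description is that whitespace_after counts the spaces following a token and should be 0 for the final token of a sentence.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyToken:
    # Hypothetical stand-in for a token; this is not flair's Token class.
    text: str
    whitespace_after: int = 1


def build_tokens(words: List[str]) -> List[ToyToken]:
    """Create tokens separated by one space, with no space after the last one."""
    tokens = [ToyToken(word) for word in words]
    if tokens:
        # Nothing follows the final token, so it must not claim a trailing space.
        tokens[-1].whitespace_after = 0
    return tokens


def to_text(tokens: List[ToyToken]) -> str:
    """Rebuild the surface text from the tokens and their whitespace counts."""
    return "".join(t.text + " " * t.whitespace_after for t in tokens)
```

With the last token fixed to 0, rebuilding the text from the tokens yields no trailing space, which is the behaviour the bullet describes.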

Comment on lines +649 to +651
tokens.append(token)

sentence: Sentence = Sentence(text=tokens)
Collaborator

This looks much cleaner than creating an empty sentence and adding tokens. I wonder if this means we can retire the add_token method entirely in a separate PR?

Collaborator Author

After this PR, add_token is still used in the following places:

  • Tests, which can be rewritten.
  • UniversalDependenciesDataset, which works the same way ColumnDataset did before.
  • The Sentence constructor, which I would prefer to keep.

We can certainly make the add_token method private.

Collaborator Author

I added a proposal to this PR

Comment on lines +694 to +698
for comment_row in comment.split("\t"):
    if "=" in comment_row:
        key, value = comment_row.split("=", 2)
        sentence.add_metadata(key, value)

Collaborator

Thanks for adding this, we really needed this feature in ColumnDataset :)
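
For readers who want to try the comment-metadata pattern outside flair, here is a minimal standalone sketch. The function name parse_comment_metadata is hypothetical, the result is returned as a plain dict instead of being attached to a Sentence, and surrounding whitespace is stripped, which the reviewed lines do not do. It also passes a maxsplit of 1 rather than 2: in Python, "a=b=c".split("=", 2) returns three parts, so unpacking into key, value would raise a ValueError for values that contain two "=" characters.

```python
from typing import Dict


def parse_comment_metadata(comment: str) -> Dict[str, str]:
    """Collect tab-separated <key>=<value> pairs from a comment line."""
    metadata: Dict[str, str] = {}
    for field in comment.split("\t"):
        if "=" in field:
            # maxsplit=1 keeps any further "=" characters inside the value.
            key, value = field.split("=", 1)
            metadata[key.strip()] = value.strip()
    return metadata
```

Applied to the example from the PR description (assuming the id part and the domain part are separated by a tab), only domain=de-lowner contains "=", so the id field is skipped and a single domain entry is produced.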

@alanakbik
Collaborator

@helpmefindaname thanks for adding this and for the add_token modifications!

@alanakbik alanakbik merged commit 4f45c91 into flairNLP:master Dec 6, 2022
@helpmefindaname helpmefindaname deleted the multiconer_dataset_update branch December 6, 2022 10:00