feat: ✨ initial implementation of JsonlCorpora and Datasets #2653

Merged

AnotherStranger
Contributor

I created an initial implementation for JSONL datasets using doccano's JSONL format.
I tried to replicate the behavior of the existing ColumnCorpus.
I'd love to get feedback on the implementation.
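
For readers unfamiliar with the format, here is a minimal sketch of reading such a file. It is not the code from this PR. It assumes each line is a JSON object with a `text` field and a `label` list of `[start, end, label]` character-offset triples; the field names and the `parse_jsonl` helper are assumptions made for illustration.

```python
import json


def parse_jsonl(lines, text_key="text", label_key="label"):
    """Yield (text, spans) pairs from an iterable of JSONL lines.

    text_key and label_key are assumed field names, not confirmed by the PR.
    Blank lines are skipped; a record without labels yields an empty span list.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        spans = [
            (int(start), int(end), str(label))
            for start, end, label in record.get(label_key, [])
        ]
        yield record[text_key], spans


# Two labelled spans given as character offsets, followed by a blank line.
sample = [
    '{"text": "Alan works at Flair", "label": [[0, 4, "PER"], [14, 19, "ORG"]]}',
    "",
]
records = list(parse_jsonl(sample))
```

Character offsets still have to be mapped to token indices before tagging, and that step depends on the tokenizer in use.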

Closes #2605

Collaborator

@tadejmagajna tadejmagajna left a comment

Looks great! I like how comprehensive the tests are.

The only question I have is why some parts of the code seem to be hardcoded for the ner label_type only. It would be nice to have support for any type of tagging.

rev: stable
hooks:
- id: black
language_version: python3.6
- repo: https://github.com/pycqa/isort
rev: stable
rev: 5.10.1
hooks:
- id: isort
name: isort (python)
Collaborator

It would be nice to have a newline at the end of the file so GitHub doesn't complain.

Suggested change
name: isort (python)
name: isort (python)

Comment on lines 5 to 6
- id: black
language_version: python3.6
Collaborator

Bad indentation. Note that the changes to this file are already proposed in #2651.

Suggested change
- id: black
language_version: python3.6
- id: black
language_version: python3.6

# Add IOB tags
prefix = "B"
for token in sentence[start_idx : end_idx + 1]:
    token.add_label("ner", f"{prefix}-{label}")
Collaborator

Why does this code assume that label_type is always ner? What if you are trying to do PoS tagging or any other type of labelling?

Contributor Author

Oh, my bad!
I forgot to parameterize this. I will change it ASAP.

Contributor Author

I added label_type as a parameter with the default value ner.
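
A minimal sketch of what that parameterization could look like, independent of the library's classes. The `iob_tags` helper and its return shape are hypothetical and written only for illustration; the inclusive `end_idx` follows the slice in the reviewed snippet, and switching the prefix to `I` after the first token is the usual IOB convention, assumed here rather than taken from the diff.

```python
def iob_tags(num_tokens, spans, label_type="ner"):
    """Return one {label_type: tag} dict per token (hypothetical helper).

    spans holds (start_idx, end_idx, label) token indices, end inclusive.
    Tokens outside every span get the tag "O".
    """
    tags = [{label_type: "O"} for _ in range(num_tokens)]
    for start_idx, end_idx, label in spans:
        prefix = "B"  # first token of a span
        for i in range(start_idx, end_idx + 1):
            tags[i] = {label_type: f"{prefix}-{label}"}
            prefix = "I"  # assumed IOB convention for the remaining tokens
    return tags


# Default label type, matching the default value mentioned above.
ner_tags = iob_tags(4, [(0, 1, "PER")])
# Same spans stored under a different, caller-chosen label type.
chunk_tags = iob_tags(4, [(0, 1, "NP")], label_type="chunk")
```

Keeping `ner` as the default preserves the earlier behaviour for existing callers while letting other tagging tasks pass their own label type.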

@alanakbik alanakbik merged commit 6de1268 into flairNLP:master Mar 15, 2022
@alanakbik
Collaborator

@AnotherStranger thanks a lot for adding this!

Development

Successfully merging this pull request may close these issues.

✨ Add Jsonl corpus support
3 participants