Jakarta in Indonesia #7

Open
AngledLuffa opened this issue Dec 15, 2022 · 8 comments

Comments

@AngledLuffa
Contributor

Should phrases like "Jakarta in Indonesia" be annotated as one entity or two?

CoNLL has "Old Trafford in Manchester" as two entities, but our standard would normally treat "Jakarta, Indonesia" as one.

@AngledLuffa
Contributor Author

Similar:

Querétaro       B-Location
,       I-Location
in      I-Location
Mexico  I-Location

paraguay_mercopress_7.txt.tsv

@AngledLuffa
Contributor Author

Wagga   B-Location
Wagga   I-Location
,       O
in      O
southern        O
New     B-Location
South   I-Location
Wales   I-Location

@AngledLuffa
Contributor Author

Birmingham, Alabama, in the United States - where to draw the line, or is it one entity?

@AngledLuffa
Contributor Author

Pusad   B-Location
,       O
in      O
the     O
Yavatmal        B-Location
district        O
of      O
Maharashtra     B-Location

@AngledLuffa
Contributor Author

Where "in" really becomes a problem is when it merges multiple tags, so that the larger Location span overlaps an Organization span:

University      B-Organization
of      I-Organization
the     I-Organization
Witwatersrand   I-Organization  B-Location
in      O       I-Location
Johannesburg    O       I-Location
,       O       I-Location
South   O       I-Location
Africa  O       I-Location

or

Dublin  I-Organization  B-Location
in      O       I-Location
Ireland O       I-Location
University      B-Organization
of      I-Organization
Wroclaw I-Organization  B-Location
in      O
Poland  B-Location

@SecroLoL
Collaborator

Thanks for these. I think we need to label these consistently. Maybe we can say that if only a comma separates them, it should be one entity, and if there are any other words between them, it should be two? What are your thoughts?

@AngledLuffa
Contributor Author

I think as long as we're consistent, we're fine, but the example where two labels overlap because the "in" makes for a larger LOC span is rather problematic:

#7 (comment)
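
The overlap problem can be checked mechanically. Below is a minimal sketch, assuming each annotation layer is a plain list of BIO tags aligned to the same tokens. The helper names `bio_spans` and `crossing_spans` are hypothetical and not part of any existing tool in this project. The tag lists are copied from the "University of the Witwatersrand in Johannesburg, South Africa" example above.

```python
def bio_spans(tags):
    """Return (start, end, label) spans from a BIO tag list; end is exclusive."""
    spans = []
    start, label = None, None
    # A trailing "O" sentinel closes any span still open at the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and label is not None and tag[2:] == label
        if start is not None and not continues:
            spans.append((start, i, label))
            start, label = None, None
        # Start a span on B-, or on a stray I- that has nothing to continue.
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, label = i, tag[2:]
    return spans


def crossing_spans(layer_a, layer_b):
    """Find span pairs that partially overlap (neither contains the other)."""
    crossings = []
    for a in bio_spans(layer_a):
        for b in bio_spans(layer_b):
            overlap = a[0] < b[1] and b[0] < a[1]
            nested = (a[0] <= b[0] and b[1] <= a[1]) or (b[0] <= a[0] and a[1] <= b[1])
            if overlap and not nested:
                crossings.append((a, b))
    return crossings


# Tokens: University of the Witwatersrand in Johannesburg , South Africa
org_layer = ["B-Organization", "I-Organization", "I-Organization", "I-Organization",
             "O", "O", "O", "O", "O"]
loc_layer = ["O", "O", "O", "B-Location", "I-Location", "I-Location",
             "I-Location", "I-Location", "I-Location"]

print(crossing_spans(org_layer, loc_layer))
# [((0, 4, 'Organization'), (3, 9, 'Location'))]
```

A flat BIO file cannot hold both spans at once, since "Witwatersrand" would need two tags. That is why the merged LOC reading is the problematic one.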

@SecroLoL
Collaborator

Correct. I think the move here is that I'll edit every occurrence I can find so that entities connected only by commas form one entity span, while entities separated by anything else (e.g. "in") stay as separate spans.
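
The comma rule could be applied roughly as follows. This is only a sketch under stated assumptions, not the edit procedure actually used for the dataset: it assumes parallel token and BIO tag lists, and it merges only same-type spans separated by exactly one comma tagged `O`. The function name `merge_comma_spans` is hypothetical.

```python
def merge_comma_spans(tokens, tags):
    """Merge same-type entity spans separated by a single comma into one span."""
    merged = list(tags)
    for i in range(1, len(tokens) - 1):
        # Only a comma that is currently outside any entity can join two spans.
        if tokens[i] != "," or merged[i] != "O":
            continue
        prev_tag, next_tag = merged[i - 1], merged[i + 1]
        # The comma must directly follow an entity and directly precede a new one.
        if prev_tag == "O" or not next_tag.startswith("B-"):
            continue
        prev_type = prev_tag.split("-", 1)[1]
        next_type = next_tag.split("-", 1)[1]
        if prev_type == next_type:
            merged[i] = "I-" + prev_type
            merged[i + 1] = "I-" + next_type
    return merged


tokens = ["Birmingham", ",", "Alabama", ",", "in", "the", "United", "States"]
tags = ["B-Location", "O", "B-Location", "O", "O", "O", "B-Location", "I-Location"]

print(merge_comma_spans(tokens, tags))
# ['B-Location', 'I-Location', 'I-Location', 'O', 'O', 'O', 'B-Location', 'I-Location']
```

Under this rule "Birmingham, Alabama" becomes one span, while "the United States" stays separate because "in the" intervenes. Spans such as "Querétaro, in Mexico" that are already tagged as one entity would need the reverse edit (splitting), which this sketch does not attempt.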
