tokenization issues: ' followed by s, m, t, etc #1

Open
AngledLuffa opened this issue Dec 11, 2022 · 13 comments

Comments

@AngledLuffa
Contributor

"it's" gets tokenized into three tokens: it, ', s

that should be fixed

same with 'm 't etc

@SecroLoL
Collaborator

Are you saying that it should be tokenized as it, 's?

@AngledLuffa
Contributor Author

yes, those should be it, 's and i, 'm, etc

@AngledLuffa
Contributor Author

lmk if you need or want some assistance scripting changes like that

@SecroLoL
Collaborator

I think I've got this, thanks! Will let you know if I need help though

@SecroLoL
Collaborator

SecroLoL commented Aug 9, 2024

How about cases where a noun is followed by 's? Are these annotated properly?
Example:

The	O
opposition	O
's	O
poor	O
election	O
results	O

@SecroLoL
Collaborator

SecroLoL commented Aug 9, 2024

Here's what I'm seeing when inspecting some processed data:

national	O
truth	O
-	O
telling	O
process	O
would	O
have	O
on	O
Australia	B-Location
,	O
it	O
's	O
remarkable	O
.	O

"	O
One	O
of	O
the	O
things	O
that	O
we	O
're	O
thinking	O
about	O
I	O
'm	O
a	O
non	O
-	O
conformist	O
politician	O
.	O
I	O
'm	O
a	O
revolutionary	O
,	O
'	O
'	O
Bouteflika	B-Person
told	O
The	B-Organization
Associated	I-Organization
Press	I-Organization

Can't find the cases you're talking about. Was that perhaps only for the raw annotated data?

@AngledLuffa
Contributor Author

the possessive 's and the contraction 'm are correct

when i was going through the data myself, i'd occasionally fix them when i came across such errors

cd processed_annotated
grep "^s  O$" * | less        # that's a tab character between s and O

af_afrol_16.txt.tsv:s   O
af_afrol_18.txt.tsv:s   O
af_allaf_15.txt.tsv:s   O
af_allaf_24.txt.tsv:s   O
af_allaf_24.txt.tsv:s   O
af_ips_10.txt.tsv:s     O
af_ips_10.txt.tsv:s     O
af_ips_10.txt.tsv:s     O
etc etc
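
For anyone who would rather get line numbers than raw grep output, the same search can be sketched in Python. This is only an illustration: the function names `find_stray_tokens` and `scan_directory` are made up here, and the sketch assumes each row is `token<TAB>label` in UTF-8 `.tsv` files, as in the listing above.

```python
from pathlib import Path


def find_stray_tokens(lines, token="s", label="O"):
    """Return 1-based line numbers whose content is exactly token<TAB>label."""
    target = f"{token}\t{label}"
    return [
        number
        for number, line in enumerate(lines, start=1)
        if line.rstrip("\r\n") == target
    ]


def scan_directory(directory, token="s", label="O"):
    """Map each .tsv file name in directory to the line numbers of stray tokens."""
    hits = {}
    for path in sorted(Path(directory).glob("*.tsv")):
        with open(path, encoding="utf-8") as handle:
            found = find_stray_tokens(handle, token, label)
        if found:
            hits[path.name] = found
    return hits
```

Calling `scan_directory("processed_annotated")` would then give one entry per affected file, which makes it easier to jump to the offending rows.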

@AngledLuffa
Contributor Author

i'm fairly certain most of those can be cleaned up via a script...

just look for s on a line by itself, especially after a ' or a curly apostrophe, check that the labels are the same, combine the rows

again, i can take that on ... maybe i should just go ahead and do that
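
A minimal sketch of that kind of cleanup, under some assumptions: rows are already parsed into `(token, label)` pairs, the suffix list is a guess at what "etc" covers, and `merge_apostrophe_rows` is a hypothetical name, not a script from this repo. It merges only when both labels are identical and leaves everything else untouched. It does not try to re-split negations such as don / ' / t into do / n't; those would need separate handling.

```python
# Straight and curly apostrophes that may appear on a row by themselves.
APOSTROPHES = {"'", "\u2019"}
# Assumed list of suffixes; the thread names s, m, t and d explicitly.
SUFFIXES = {"s", "m", "t", "d", "re", "ve", "ll"}


def merge_apostrophe_rows(rows):
    """Combine an apostrophe row with the following suffix row when labels match.

    rows is a list of (token, label) tuples; a new list is returned.
    """
    merged = []
    i = 0
    while i < len(rows):
        token, label = rows[i]
        if token in APOSTROPHES and i + 1 < len(rows):
            next_token, next_label = rows[i + 1]
            if next_token in SUFFIXES and next_label == label:
                # Two rows become one, e.g. ("'", "O") + ("s", "O") -> ("'s", "O").
                merged.append((token + next_token, label))
                i += 2
                continue
        merged.append((token, label))
        i += 1
    return merged
```

Rows whose labels differ are kept as they are, so they can be reviewed by hand instead of being silently merged.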

@SecroLoL
Collaborator

If you could, that would be great. If you have time, of course.

@AngledLuffa
Contributor Author

I'm about half done with checking incorrect ', but am uncovering a whole bunch of other random tokenization errors in the process.

’ (the fancy apostrophe) on a line by itself, followed by s, d, t, etc

Ms .

Jr .

' ' and backticks or curly apostrophes

. . . instead of as a single token

46-41 or other scores / votes

U . S .

and in one file, cuba_diariodecuba_5.txt.tsv, César got cut off many times. I suspect there will be other words like that which need to be cleaned up
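
Some of the patterns listed above can be flagged mechanically before fixing them. The following is a rough sketch, not the script actually used: `flag_suspicious` is a hypothetical helper that takes the token column of a file as a list, and the title set beyond Ms and Jr is an assumption. Scores like 46-41 and truncated words like César are not covered, since the thread does not show how they were split.

```python
# Ms and Jr come from the list above; the other titles are assumed additions.
TITLES = {"Ms", "Mr", "Mrs", "Dr", "Jr", "Sr"}


def flag_suspicious(tokens):
    """Return (index, description) pairs for likely tokenization errors."""
    flags = []
    for i in range(len(tokens)):
        if tokens[i:i + 3] == [".", ".", "."]:
            # An ellipsis spread over three rows instead of a single token.
            flags.append((i, "split ellipsis"))
        elif tokens[i:i + 4] == ["U", ".", "S", "."]:
            flags.append((i, "split U.S."))
        elif tokens[i] in TITLES and tokens[i + 1:i + 2] == ["."]:
            # A title separated from its period, e.g. "Ms" then ".".
            flags.append((i, "split title"))
    return flags
```

This only reports positions. Merging the rows still needs the same label check as for the apostrophe cases.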

@AngledLuffa
Contributor Author

alright, i have taken on the ’ tokenizations and the ' tokenizations

the others are still TODO

@AngledLuffa
Contributor Author

US, titles, and ellipses are now cleared up. Would still like to look for decade+s

@AngledLuffa
Contributor Author

did the decades as well

maybe still need to look for ' ' on two separate lines
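
A sketch of how both remaining checks could be scripted, with loud assumptions: "decade+s" is read here as a decade number such as 1990 or 90 followed by a separate s row, and the ' ' case as two single-quote rows in a row (as in the sample above, before Bouteflika). `flag_decades_and_quotes` is a hypothetical helper, not existing code.

```python
import re

# Two- or four-digit numbers ending in 0, e.g. "90" or "1990" (assumed decade shape).
DECADE = re.compile(r"^(?:\d{2})?\d0$")
APOSTROPHES = {"'", "\u2019"}


def flag_decades_and_quotes(tokens):
    """Return (index, description) pairs for split decades and split '' quotes."""
    flags = []
    for i in range(len(tokens) - 1):
        current, following = tokens[i], tokens[i + 1]
        if DECADE.match(current) and following == "s":
            flags.append((i, "split decade"))
        elif current in APOSTROPHES and following in APOSTROPHES:
            flags.append((i, "split double quote"))
    return flags
```

As with the other checks, this only points at candidate rows. Whether two quote rows should become one token is still a manual call.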
