
Apparent segmentation bug when defining user defined symbols #217

Closed
howlinghuffy opened this issue Oct 22, 2018 · 6 comments


howlinghuffy commented Oct 22, 2018

I'm not sure if this is a bug or by design, but I am experiencing some weird segmentation behaviour when using --user_defined_symbols to train sentencepiece.

It seems that sentencepiece does not take these symbols into account when training, and instead actually converts them to spaces for the segmentation process. This results in some subword tokens being included in the generated vocabulary, despite never actually appearing in the training text.

My input data is annotated with @ characters throughout to represent preprocessing information. I want to define @ as a user defined symbol.

For example, many of my training sentences are in a format similar to the following:

This is a sample @218@sentence.
This is @242@another sample @218@sentence.
This is a @218@third sample sentence.
This is a @218@fourth sample sentence.

I train with the following command:

spm_train --input=input.txt --model_prefix=m --vocab_size=4000 --user_defined_symbols=@

However, the vocabulary generated is not what I would expect.

Expected Vocab

<unk>	0
@	0
▁	-1.94486
218	-2.94693
s	-3.29286
▁the	-3.76292
.	-4.09046
,	-4.14452
▁a	-4.18573
▁of	-4.3175
▁and	-4.55276
242	-4.63683
▁in	-4.67848
▁t	-4.71242
▁T	-4.79053
and so on...

Actual Vocab

<unk>	0
@	0
▁	-1.94486
▁218	-2.94693
s	-3.29286
▁the	-3.76292
.	-4.09046
,	-4.14452
▁a	-4.18573
▁of	-4.3175
▁and	-4.55276
▁242	-4.63683
▁in	-4.67848
▁t	-4.71242
▁T	-4.79053
and so on...

As you can see, two of the top-ranking tokens in the actual vocab are ▁218 and ▁242, despite the numbers 218 and 242 never being preceded by a space in the training data. Intuitively, I would expect the tokens 218 and 242 (without a preceding space) to rank high in the vocabulary instead.

Encoding still separates the user_defined_symbols as expected, but unfortunately the vocabulary contains useless tokens (▁218 and ▁242) and is missing desirable tokens (218 and 242), meaning that they are suboptimally encoded as 21 8 and 2 42 respectively.

Expected Encoding

▁This ▁is ▁a ▁sample ▁ @ 218 @ s ent ence .
▁This ▁is ▁ @ 242 @ an other ▁sample ▁ @ 218 @ s ent ence .
▁This ▁is ▁a ▁ @ 218 @ th ir d ▁sample ▁sentence .
▁This ▁is ▁a ▁ @ 218 @ f our th ▁sample ▁sentence .

Actual Encoding

▁This ▁is ▁a ▁sample ▁ @ 21 8 @ s ent ence .
▁This ▁is ▁ @ 2 42 @ an other ▁sample ▁ @ 21 8 @ s ent ence .
▁This ▁is ▁a ▁ @ 21 8 @ th ir d ▁sample ▁sentence .
▁This ▁is ▁a ▁ @ 21 8 @ f our th ▁sample ▁sentence .

Is this expected behaviour? And if not, is there an easy fix for it?

Thanks in advance!

@howlinghuffy howlinghuffy changed the title Apparent segmentation bug when using identity normalization Apparent segmentation bug when defining user defined symbols Oct 22, 2018

taku910 commented Oct 23, 2018

Seems this is expected behavior. When '@' is specified as a user defined symbol, '@' is always treated as one piece, meaning that no piece containing '@' inside it will be extracted. Probably the name 'symbol' is misleading. The intention is user-defined-piece.

It would be tricky to perform the expected encoding, but one solution would be
--user_defined_symbols=@218@,@242@

Then, @218@ and @242@ are encoded as one piece.


howlinghuffy commented Oct 23, 2018

Thanks for the quick response @taku910.

That doesn't quite answer the issue I'm having - I expect the '@' to be treated as a single piece, and I don't expect any other piece to contain an '@' symbol.

What is confusing me is the fact that ▁218 and ▁242 are extracted as pieces (note that these pieces have a space symbol at their beginning), despite the numbers 218 and 242 never occurring in the data with a space directly before them. These numbers only ever occur in the data with an @ symbol on either side of them. As such, I would expect the numbers 218 and 242 to be extracted as pieces, but without a leading ▁ character.

Do you know why the extracted pieces 218 and 242 would include a space character before them?


taku910 commented Nov 8, 2018

OK, I got it and reproduced this bug in my environment.

During training, user-defined symbols are simply replaced with ' ', so

This is a sample @218@sentence. is treated as This is a sample 218 sentence.

This causes the bug. I would like to fix it, but I'm not sure how easy that will be. Anyway, thank you for the report.
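The replacement described above can be sketched in plain Python (an illustration of the behaviour only, not SentencePiece's actual C++ implementation):

```python
# Illustrative sketch (not SentencePiece's real code): during training,
# every user-defined symbol is blanked out with a space before pieces
# are learned from the text.
def replace_user_symbols_with_space(line, symbols):
    for sym in symbols:
        line = line.replace(sym, " ")
    return line

print(replace_user_symbols_with_space("This is a sample @218@sentence.", ["@"]))
```

After this substitution, 218 is preceded by whitespace in every training sentence, which would explain why the learned piece carries a leading ▁.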

@howlinghuffy

Thanks for tracking it down @taku910, that makes sense.

Instead of replacing the symbols with ' ', would it perhaps be better to use the user-defined symbols to split the sentence into multiple sentences for training?

This would mean that This is a sample @218@sentence is treated as 3 sentences:

This is a sample
218
sentence

Would that work? Unfortunately I am not well-versed in C, so I may not be too helpful in contributing a patch for that modification.
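The splitting idea above could be sketched like this (a hypothetical helper for illustration, not part of SentencePiece):

```python
def split_on_user_symbols(line, symbols):
    # Treat every occurrence of a user-defined symbol as a hard
    # sentence boundary, so no learned piece can span across it.
    parts = [line]
    for sym in symbols:
        parts = [chunk for part in parts for chunk in part.split(sym)]
    return [part for part in parts if part]

print(split_on_user_symbols("This is a sample @218@sentence", ["@"]))
```

Each resulting fragment would then be fed to the trainer as its own sentence, so 218 never appears adjacent to surrounding text.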


taku910 commented Nov 9, 2018

Thank you for the suggestion. Actually, '\t' is reserved as a piece boundary marker in sentencepiece.
Just replacing user defined symbols with '\t' works.

#237
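A rough sketch of that fix, assuming '\t' acts as the piece boundary marker during training (illustration only; the real change lives in the linked PR):

```python
def replace_user_symbols_with_tab(line, symbols):
    # Substituting '\t' (the reserved piece-boundary marker) instead of
    # ' ' keeps the neighbouring text from gaining a phantom space.
    for sym in symbols:
        line = line.replace(sym, "\t")
    return line

print(replace_user_symbols_with_tab("This is a sample @218@sentence.", ["@"]))
```

Splitting the result on '\t' yields the same fragments as the multi-sentence approach, but without changing how sentences are counted.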


taku910 commented Nov 11, 2018

Let me close this bug. Please reopen it if you find any issues.
