
Apparent segmentation bug when defining user defined symbols #217

Closed
howlinghuffy opened this issue Oct 22, 2018 · 6 comments


howlinghuffy commented Oct 22, 2018

I'm not sure if this is a bug or by design, but I am experiencing some weird segmentation behaviour when using --user_defined_symbols to train sentencepiece.

It seems that sentencepiece does not take these symbols into account when training, and instead actually converts them to spaces for the segmentation process. This results in some subword tokens being included in the generated vocabulary, despite never actually appearing in the training text.

My input data is annotated with @ characters throughout to represent preprocessing information. I want to define @ as a user defined symbol.

For example, many of my training sentences are in a format similar to the following:

This is a sample @218@sentence.
This is @242@another sample @218@sentence.
This is a @218@third sample sentence.
This is a @218@fourth sample sentence.

I train with the following command:

spm_train --input=input.txt --model_prefix=m --vocab_size=4000 --user_defined_symbols=@

However, the vocabulary generated is not what I would expect.

Expected Vocab

<unk>	0
@	0
▁	-1.94486
218	-2.94693
s	-3.29286
▁the	-3.76292
.	-4.09046
,	-4.14452
▁a	-4.18573
▁of	-4.3175
▁and	-4.55276
242	-4.63683
▁in	-4.67848
▁t	-4.71242
▁T	-4.79053
and so on...

Actual Vocab

<unk>	0
@	0
▁	-1.94486
▁218	-2.94693
s	-3.29286
▁the	-3.76292
.	-4.09046
,	-4.14452
▁a	-4.18573
▁of	-4.3175
▁and	-4.55276
▁242	-4.63683
▁in	-4.67848
▁t	-4.71242
▁T	-4.79053
and so on...

As you can see, two of the top-ranking tokens in the actual vocab are ▁218 and ▁242, despite the numbers 218 and 242 never being preceded by a space in the training data. Intuitively, I would expect the tokens 218 and 242 (without a preceding space) to rank high in the vocabulary instead.

Encoding still separates the user_defined_symbols as expected, but unfortunately the vocabulary contains useless tokens (▁218 and ▁242) and is missing desirable tokens (218 and 242), meaning that they are suboptimally encoded as 21 8 and 2 42 respectively.

Expected Encoding

▁This ▁is ▁a ▁sample ▁ @ 218 @ s ent ence .
▁This ▁is ▁ @ 242 @ an other ▁sample ▁ @ 218 @ s ent ence .
▁This ▁is ▁a ▁ @ 218 @ th ir d ▁sample ▁sentence .
▁This ▁is ▁a ▁ @ 218 @ f our th ▁sample ▁sentence .

Actual Encoding

▁This ▁is ▁a ▁sample ▁ @ 21 8 @ s ent ence .
▁This ▁is ▁ @ 2 42 @ an other ▁sample ▁ @ 21 8 @ s ent ence .
▁This ▁is ▁a ▁ @ 21 8 @ th ir d ▁sample ▁sentence .
▁This ▁is ▁a ▁ @ 21 8 @ f our th ▁sample ▁sentence .

Is this expected behaviour? And if not, is there an easy fix for it?

Thanks in advance!

@howlinghuffy howlinghuffy changed the title Apparent segmentation bug when using identity normalization Apparent segmentation bug when defining user defined symbols Oct 22, 2018

taku910 commented Oct 23, 2018

Seems this is expected behavior. When '@' is specified as a user defined symbol, '@' is always treated as one piece, meaning that no piece containing '@' inside it will be extracted. Probably the name 'symbol' is misleading. The intention is user-defined-piece.

It would be tricky to perform the expected encoding, but one solution would be
--user_defined_symbols=@218@,@242@

Then, @218@ and @242@ are encoded as one piece.


howlinghuffy commented Oct 23, 2018

Thanks for the quick response @taku910.

That doesn't quite answer the issue I'm having - I expect the '@' to be treated as a single piece, and I don't expect any other piece to contain an '@' symbol.

What is confusing me is the fact that ▁218 and ▁242 are extracted as pieces (note that these pieces have a space symbol at their beginning), despite the numbers 218 and 242 never occurring in the data with a space directly before them. These numbers only ever occur in the data with an @ symbol on either side of them. As such, I would expect the numbers 218 and 242 to be extracted as pieces, but without a leading ▁ character.

Do you know why the extracted pieces 218 and 242 would include a space character before them?


taku910 commented Nov 8, 2018

OK, I got it and reproduced this bug in my environment.

During training, user-defined symbols are simply replaced with ' ', so

This is a sample @218@sentence. is treated as This is a sample 218 sentence.

This causes the bug. I would like to fix it, but I'm not sure how easy that will be. Anyway, thank you for the report.
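The replacement described above can be sketched in plain Python (an illustration of the behaviour only, not SentencePiece's actual C++ implementation):

```python
# Illustrative sketch (not SentencePiece's real code): during training,
# every user-defined symbol is blanked out with a space before pieces
# are learned from the text.
def replace_user_symbols_with_space(line, symbols):
    for sym in symbols:
        line = line.replace(sym, " ")
    return line

print(replace_user_symbols_with_space("This is a sample @218@sentence.", ["@"]))
```

After this substitution, 218 is preceded by whitespace in every training sentence, which would explain why the learned piece carries a leading ▁.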

@howlinghuffy

Thanks for tracking it down @taku910, that makes sense.

Instead of replacing the symbols with ' ', would it perhaps be better to use the user-defined symbols to split the sentence into multiple sentences for training?

This would mean that This is a sample @218@sentence is treated as 3 sentences:

This is a sample
218
sentence

Would that work? Unfortunately I am not well-versed in C, so I may not be too helpful in contributing a patch for that modification.
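The splitting idea above could be sketched like this (a hypothetical helper for illustration, not part of SentencePiece):

```python
def split_on_user_symbols(line, symbols):
    # Treat every occurrence of a user-defined symbol as a hard
    # sentence boundary, so no learned piece can span across it.
    parts = [line]
    for sym in symbols:
        parts = [chunk for part in parts for chunk in part.split(sym)]
    return [part for part in parts if part]

print(split_on_user_symbols("This is a sample @218@sentence", ["@"]))
```

Each resulting fragment would then be fed to the trainer as its own sentence, so 218 never appears adjacent to surrounding text.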


taku910 commented Nov 9, 2018

Thank you for the suggestion. Actually, '\t' is reserved as a piece boundary marker in sentencepiece.
Just replacing user defined symbols with '\t' works.

#237
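A rough sketch of that fix, assuming '\t' acts as the piece boundary marker during training (illustration only; the real change lives in the linked PR):

```python
def replace_user_symbols_with_tab(line, symbols):
    # Substituting '\t' (the reserved piece-boundary marker) instead of
    # ' ' keeps the neighbouring text from gaining a phantom space.
    for sym in symbols:
        line = line.replace(sym, "\t")
    return line

print(replace_user_symbols_with_tab("This is a sample @218@sentence.", ["@"]))
```

Splitting the result on '\t' yields the same fragments as the multi-sentence approach, but without changing how sentences are counted.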


taku910 commented Nov 11, 2018

Let me close this bug. Please reopen it if you find any issues.
