Bug in BPE algorithm #318

xbelonogov · 2019-04-17T12:15:59Z

I think the BPE algorithm is not working properly. This code snippet reproduces the bug.

import sentencepiece as spm

vocab_size = 9
model_prefix = 'model'
train_data_file = 'corpus.txt'
text = "bc a aaa"

# Write the tiny training corpus to disk.
with open(train_data_file, 'w') as fout:
    fout.write(text)

# Train a BPE model on the corpus.
spm.SentencePieceTrainer.Train(f'--input={train_data_file} --model_prefix={model_prefix} --vocab_size={vocab_size} --model_type=bpe')

# Load the model and print every piece in the vocabulary.
sp = spm.SentencePieceProcessor()
sp.Load(model_prefix + '.model')
for i in range(vocab_size):
    s = sp.IdToPiece(i)
    # chr(9601) is the whitespace marker U+2581; show it as '_' for readability.
    s = s.replace(chr(9601), '_')
    print(f'i: {i} piece: {s}')

The input for BPE is "bc a aaa".
I got the following output:

i: 0 piece: <unk>
i: 1 piece: <s>
i: 2 piece: </s>
i: 3 piece: _a
i: 4 piece: bc
i: 5 piece: a
i: 6 piece: _
i: 7 piece: b
i: 8 piece: c

The first merged pair should be _ + a = _a, and that is correct (i: 3 piece: _a).
After that, the second pair should be a + a = aa, but the algorithm produced the pair bc instead (i: 4 piece: bc).
Using the debug output, I found that at the second iteration here, symbol->freq for aa is 0 and symbol->freq for bc is 1.

This happens because the logic in this if statement is incorrect. You can't simply remove these positions.
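To make the expected counts concrete, here is a minimal toy sketch of BPE pair counting on the same corpus. It is not SentencePiece's implementation: it assumes each whitespace-separated word gets a leading ▁, that pairs never span word boundaries, and that overlapping repeats (such as "aa" inside "aaa") are counted once. The helper names count_pairs and merge are made up for this illustration.

```python
from collections import Counter

def count_pairs(words):
    """Count adjacent symbol pairs per word, skipping overlapping repeats."""
    counts = Counter()
    for symbols in words:
        last_counted = {}  # pair -> index where it was last counted in this word
        for i in range(len(symbols) - 1):
            pair = (symbols[i], symbols[i + 1])
            if last_counted.get(pair) == i - 1:
                continue  # overlaps the occurrence counted one position earlier
            counts[pair] += 1
            last_counted[pair] = i
    return counts

def merge(words, pair):
    """Replace every non-overlapping occurrence of `pair` with the joined symbol."""
    merged_words = []
    for symbols in words:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged_words.append(out)
    return merged_words

# Assumption: one leading "▁" per word, then one symbol per character.
words = [list("▁" + w) for w in "bc a aaa".split()]

before = count_pairs(words)
after = count_pairs(merge(words, ("▁", "a")))

print("before:", dict(before))
print("after: ", dict(after))
```

Under these assumptions, ▁a has count 2 before the merge, and aa still has count 1 after ▁a is merged, rather than 0.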


taku910 · 2019-04-24T16:26:05Z

Thank you for the report.

Do you think we can fix this issue just by removing the condition in the if statement?
Let me run a large test to make sure it brings no side effects. (On a real corpus, I believe it will not make a huge difference.)

tombosc · 2020-04-22T17:12:13Z

@xbelonogov, I agree that after the first merge, the count for "aa" should be 1. Indeed, if we write "X=_a", the corpus becomes "_bcXXaa". Then both "bc" and "aa" have a count of 1; in fact, all the bigrams occur only once. So why would you prefer "aa" to be merged instead of "bc", given that they should have the same count? Thanks.
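The tie described here can be checked with a short sketch. It hardcodes the symbol sequences after the first merge, assuming ▁a was merged and that pairs are counted within whitespace-separated words only. How SentencePiece breaks such ties is not stated in this thread, so the sketch only lists the tied candidates.

```python
from collections import Counter

# Words after the first merge (assumption of this sketch: "▁a" is already one
# symbol, and pairs never span a whitespace boundary).
words = [["▁", "b", "c"], ["▁a"], ["▁a", "a", "a"]]

counts = Counter()
for symbols in words:
    # Adjacent pairs within one word; a single-symbol word yields no pairs.
    counts.update(zip(symbols, symbols[1:]))

best = max(counts.values())
tied = sorted(left + right for (left, right), n in counts.items() if n == best)

print("highest count:", best)
print("tied candidates:", tied)
```

With a correct count of 1 for aa, every remaining pair is tied, so choosing bc or aa next comes down to the tie-breaking rule rather than to frequency.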

xiefangqi · 2021-12-07T12:09:41Z

I see that the v0.1.96 output is still "bc", not "aa".
I added some logging to watch the process.
The first iteration:
symbol: ▁a freq: 2
symbol: aa freq: 1
symbol: ▁b freq: 1
symbol: bc freq: 1

▁a is put into final_pieces_, and the freq of "aa" is reset to 0.

the second iteration:
symbol: aa freq: 0
symbol: ▁b freq: 1
symbol: bc freq: 1
symbol: ▁aa freq: 1

sentencepiece then puts "bc" into final_pieces_.

Is there a problem with this process?

taku910 · 2023-04-24T07:27:21Z

Sorry for the late response. This bug is going to be fixed in the next release.

taku910 · 2023-05-02T04:25:33Z

Fixed in https://github.com/google/sentencepiece/releases/tag/v0.1.99

taku910 added the bug label Jan 10, 2021

taku910 closed this as completed May 2, 2023
