
"▁" character can be separated when using BPE-dropout #67

Open
TIXFeniks opened this issue Apr 4, 2020 · 11 comments

@TIXFeniks

When using BPE-dropout, the word-boundary character ('▁') can be separated from the first character of the word. This scenario is untested and could be harmful to model training.

Steps to reproduce:

>>> bpe.encode(["hello"], output_type=yttm.OutputType.SUBWORD, dropout_prob=1.0)
[['▁', 'h', 'e', 'l', 'l', 'o']]

(This also happens with dropout_prob < 1; I used 1 just to make it reproducible.)
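For completeness, a self-contained reproduction might look like the sketch below (the corpus path and vocabulary size are illustrative placeholders; yttm.BPE.train, yttm.BPE, and encode are the library's public API):

import youtokentome as yttm

# Train a BPE model on any plain-text corpus ("train.txt" and the
# vocab size here are placeholders).
yttm.BPE.train(data="train.txt", vocab_size=2000, model="model.yttm")
bpe = yttm.BPE(model="model.yttm")

# dropout_prob=1.0 drops every merge, so the encoding deterministically
# falls back to single characters, including a stranded '▁'.
print(bpe.encode(["hello"], output_type=yttm.OutputType.SUBWORD, dropout_prob=1.0))
# [['▁', 'h', 'e', 'l', 'l', 'o']]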

Perhaps this behavior could be controlled by a flag to always merge '▁' and the next token?

@tnq177

tnq177 commented Apr 9, 2020

second this.

@xbelonogov
Contributor

xbelonogov commented Apr 9, 2020

Hi.
This behaviour isn't related to BPE-dropout.

If the characters '▁' and 'h' were not merged, then they did not occur together often enough. That means the algorithm instead combined more frequent, and most likely more useful, pairs of characters.

Could you describe in more detail why these symbols should be merged with higher priority?

@tnq177

tnq177 commented Apr 9, 2020

@xbelonogov I think @TIXFeniks refers to the special word-boundary token '▁', not the underscore '_'.

@xbelonogov
Contributor

Yes, I also meant this special token '▁'. (Edited the previous comment)

@tnq177

tnq177 commented Apr 10, 2020

@xbelonogov I think '▁' should not be a token on its own but should always be attached to another token to indicate that it is a subword, no?

@xbelonogov
Contributor

It is not obvious to me.
In practice, for a reasonably large vocabulary, the special token '▁' is almost always merged with the first symbol.

@tnq177

tnq177 commented Apr 12, 2020

I'm not 100% clear on how BPE is implemented in YTTM, but let's take subword-nmt as an example. In subword-nmt, the word-separator character (usually the space " ") is not considered part of the training data when learning BPE; it only looks at pairs of symbols within a word. If a word is split into BPE subwords, it appends a special token "@@" to the end of each non-final subword. For example, "hello" --> "he@@ llo". When applying BPE-dropout, this could become something like "h@@ e@@ ll@@ o". The special token "@@" can never be a token on its own; it is always appended to another true token to indicate how to merge the subwords back together later.

In this case, '▁' should behave similarly. If '▁' can be a separate token, an NMT model could mistakenly learn to generate it, and even after merging subwords there are stray spaces left inside words. For example, my Slovak-English model generates this sentence: When I was 11 , I remember one mor ning w ak ing up the j oy ful sound s in my house .
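To make the failure mode concrete, here is a minimal sketch (not from the thread; function names are illustrative) of the two detokenization conventions, and of how a standalone '▁' emitted by a model turns into a spurious space:

def detok_subword_nmt(tokens):
    # "@@" suffix means "the word continues": drop it and glue on the next token.
    return " ".join(tokens).replace("@@ ", "")

def detok_yttm(tokens):
    # "▁" prefix means "a new word starts here": turn it into a space.
    return "".join(tokens).replace("▁", " ").strip()

print(detok_subword_nmt(["he@@", "llo"]))  # hello
print(detok_yttm(["▁he", "llo"]))          # hello

# A standalone '▁' emitted mid-word injects a spurious space,
# e.g. "mor ning" instead of "morning":
print(detok_yttm(["▁mor", "▁", "ning"]))   # mor ning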

@xbelonogov
Contributor

YTTM is very similar to subword-nmt.
The difference is the following:

  • In YTTM you can specify the exact number of tokens in the output vocabulary.
  • In Subword-nmt you specify the number of joining operations.

Subword-nmt creates two tokens for each character of the alphabet: the original one and one with @@ at the end. That can cause a problem if you are working with a language with a large alphabet, like Chinese: the final vocabulary may be too large.

The way word splitting works is equivalent:

  • In Subword-nmt there are two types of suffixes, [empty] and @@. The first means the end of a word, the second means a continuation.
  • In YTTM there are two types of prefixes, ▁ and [empty]. The first means the beginning of a word, the second means a continuation.
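The equivalence of the two markings can be shown with a small conversion sketch (a hypothetical helper, assuming well-formed subword-nmt output):

def suffix_to_prefix(tokens):
    # ['he@@', 'llo'] -> ['▁he', 'llo']
    out, word_start = [], True
    for t in tokens:
        continues = t.endswith("@@")
        piece = t[:-2] if continues else t
        out.append("▁" + piece if word_start else piece)
        word_start = not continues  # next token starts a new word iff no @@ here
    return out

print(suffix_to_prefix(["he@@", "llo", "wor@@", "ld"]))
# ['▁he', 'llo', '▁wor', 'ld']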

Regarding your problem: can you check how often this special token occurs alone in your training data? I just checked on an English dataset; it occurs on average once in 50 sentences, so I think this should not affect performance.
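One way to run this check yourself is a sketch like the following (the model and data paths are placeholders):

import youtokentome as yttm

bpe = yttm.BPE(model="model.yttm")
with open("train.txt") as f:
    sentences = [line.rstrip("\n") for line in f]

encoded = bpe.encode(sentences, output_type=yttm.OutputType.SUBWORD)
standalone = sum(tokens.count("▁") for tokens in encoded)
print(f"standalone '▁': {standalone} occurrences in {len(sentences)} sentences")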

You can also explore a YTTM model with the following command:

yttm vocab --model model.yttm --verbose

It's easy to see that all tokens like ▁t, ▁h, etc. exist.

@TIXFeniks
Author

TIXFeniks commented Apr 28, 2020

> Regarding your problem: can you check how often this special token occurs alone in your training data? I just checked on an English dataset; it occurs on average once in 50 sentences, so I think this should not affect performance.

Yes, being separated from the word is indeed rare when not using BPE-dropout, but once BPE-dropout is introduced this happens very often (once every couple of sentences in my real case, or once per word in the example I provided above with a dropout probability of 1).

The solution could be an option to disable dropout for these particular merges ('▁' with something) and not for the others.
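In the meantime, one user-side workaround (not a YTTM feature, just a sketch) is to re-attach any stranded '▁' to the following token after encoding with dropout:

def remerge_boundary(tokens):
    out, i = [], 0
    while i < len(tokens):
        if tokens[i] == "▁" and i + 1 < len(tokens):
            out.append("▁" + tokens[i + 1])  # glue '▁' back onto the next piece
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

print(remerge_boundary(["▁", "h", "e", "ll", "o"]))
# ['▁h', 'e', 'll', 'o']

Since tokens like ▁h exist in the vocabulary (as noted above), the re-merged pieces can still be mapped back to ids with bpe.subword_to_id.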

@TIXFeniks
Author

@xbelonogov, what do you think about my suggestion from the previous message?

@xbelonogov
Contributor

Hi, @TIXFeniks.
Your suggestion looks reasonable, but I don't want to add one more option for disabling this type of split. Every new option makes the interface more cumbersome and decreases usability. I am okay with doing this by default.

I asked Ivan Provilkov, but he isn't sure that this improves performance. If you have experiments that prove its effectiveness, I will change the default behaviour.
