
"▁" character can be separated when using BPE-dropout #67

Open
TIXFeniks opened this issue Apr 4, 2020 · 11 comments

@TIXFeniks

When using BPE-dropout, the word-boundary character ('▁') can be separated from the first character of the word. This scenario is untested and could be harmful to model training.

Steps to reproduce:

>>> bpe.encode(["hello"], output_type=yttm.OutputType.SUBWORD, dropout_prob=1.0)
[['▁', 'h', 'e', 'l', 'l', 'o']]

(This also happens with dropout_prob < 1; I used 1 just to make it reproducible.)
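For completeness, a self-contained reproduction might look like the sketch below (the corpus path and vocabulary size are illustrative placeholders; yttm.BPE.train, yttm.BPE, and encode are the library's public API):

import youtokentome as yttm

# Train a BPE model on any plain-text corpus ("train.txt" and the
# vocab size here are placeholders).
yttm.BPE.train(data="train.txt", vocab_size=2000, model="model.yttm")
bpe = yttm.BPE(model="model.yttm")

# dropout_prob=1.0 drops every merge, so the encoding deterministically
# falls back to single characters, including a stranded '▁'.
print(bpe.encode(["hello"], output_type=yttm.OutputType.SUBWORD, dropout_prob=1.0))
# [['▁', 'h', 'e', 'l', 'l', 'o']]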

Perhaps this behavior could be controlled by a flag to always merge '▁' and the next token?

@tnq177

tnq177 commented Apr 9, 2020

second this.

@xbelonogov
Contributor

xbelonogov commented Apr 9, 2020

Hi.
This behaviour isn't related to BPE-dropout.

If the characters '▁' and 'h' were not merged, then they did not occur together often enough. That means the algorithm instead combined more frequent, and most likely more useful, pairs of characters.

Could you describe in more detail why these symbols should be merged with higher priority?

@tnq177

tnq177 commented Apr 9, 2020

@xbelonogov I think @TIXFeniks refers to the special word-boundary token '▁', not the underscore '_'.

@xbelonogov
Contributor

Yes, I also meant this special token '▁'. (Edited the previous comment)

@tnq177

tnq177 commented Apr 10, 2020

@xbelonogov I think '▁' should not be a token on its own but should always be attached to another token to indicate that it is a subword, no?

@xbelonogov
Contributor

It is not obvious to me.
In practice, for a reasonably large vocabulary, the special token '▁' is almost always merged with the first symbol.

@tnq177

tnq177 commented Apr 12, 2020

I'm not 100% clear on how BPE is implemented in YTTM, but let's take subword-nmt as an example. In subword-nmt, the word-separator character (usually the space " ") is not considered part of the training data when learning BPE; it only looks at pairs of symbols within a word. If a word is split into BPE subwords, it appends a special token "@@" to the end of each non-final subword. For example, "hello" --> "he@@ llo". When applying BPE-dropout, this could become something like "h@@ e@@ ll@@ o". The special token "@@" can never be a token on its own; it is always appended to another true token to indicate how to merge the subwords back together later.

In this case, '▁' should behave similarly. If '▁' can be a separate token, an NMT model could mistakenly learn to generate it, and even after merging subwords there are stray spaces left inside words. For example, my Slovak-English model generates this sentence: When I was 11 , I remember one mor ning w ak ing up the j oy ful sound s in my house .
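To make the failure mode concrete, here is a minimal sketch (not from the thread; function names are illustrative) of the two detokenization conventions, and of how a standalone '▁' emitted by a model turns into a spurious space:

def detok_subword_nmt(tokens):
    # "@@" suffix means "the word continues": drop it and glue on the next token.
    return " ".join(tokens).replace("@@ ", "")

def detok_yttm(tokens):
    # "▁" prefix means "a new word starts here": turn it into a space.
    return "".join(tokens).replace("▁", " ").strip()

print(detok_subword_nmt(["he@@", "llo"]))  # hello
print(detok_yttm(["▁he", "llo"]))          # hello

# A standalone '▁' emitted mid-word injects a spurious space,
# e.g. "mor ning" instead of "morning":
print(detok_yttm(["▁mor", "▁", "ning"]))   # mor ning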

@xbelonogov
Contributor

YTTM is very similar to subword-nmt.
The difference is the following:

  • In YTTM you can specify the exact number of tokens in the output vocabulary.
  • In Subword-nmt you specify the number of joining operations.

Subword-nmt creates two tokens for each character of the alphabet: the original one and one with @@ at the end. That can cause a problem if you are working with a language with a large alphabet, like Chinese: the final vocabulary may be too large.

The way word splitting works is equivalent:

  • In Subword-nmt there are two types of suffixes, [empty] and @@. The first means the end of a word, the second means a continuation.
  • In YTTM there are two types of prefixes, ▁ and [empty]. The first means the beginning of a word, the second means a continuation.
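The equivalence of the two markings can be shown with a small conversion sketch (a hypothetical helper, assuming well-formed subword-nmt output):

def suffix_to_prefix(tokens):
    # ['he@@', 'llo'] -> ['▁he', 'llo']
    out, word_start = [], True
    for t in tokens:
        continues = t.endswith("@@")
        piece = t[:-2] if continues else t
        out.append("▁" + piece if word_start else piece)
        word_start = not continues  # next token starts a new word iff no @@ here
    return out

print(suffix_to_prefix(["he@@", "llo", "wor@@", "ld"]))
# ['▁he', 'llo', '▁wor', 'ld']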

Regarding your problem: can you check how often this special token occurs alone in your training data? I just checked on an English dataset; it occurs on average once in 50 sentences, so I think this should not affect performance.
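One way to run this check yourself is a sketch like the following (the model and data paths are placeholders):

import youtokentome as yttm

bpe = yttm.BPE(model="model.yttm")
with open("train.txt") as f:
    sentences = [line.rstrip("\n") for line in f]

encoded = bpe.encode(sentences, output_type=yttm.OutputType.SUBWORD)
standalone = sum(tokens.count("▁") for tokens in encoded)
print(f"standalone '▁': {standalone} occurrences in {len(sentences)} sentences")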

You can also explore a YTTM model with the following command:

yttm vocab --model model.yttm --verbose

It's easy to see that all tokens like ▁t, ▁h, etc. exist.

@TIXFeniks
Author

TIXFeniks commented Apr 28, 2020

> Regarding your problem: can you check how often this special token occurs alone in your training data? I just checked on an English dataset; it occurs on average once in 50 sentences, so I think this should not affect performance.

Yes, being separated from the word is indeed rare when not using BPE-dropout, but once BPE-dropout is introduced this happens very often (once every couple of sentences in my real case, or once per word in the example I provided above with a dropout probability of 1).

The solution could be an option to disable dropout for these particular merges ('▁' with something) and not for the others.
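In the meantime, one user-side workaround (not a YTTM feature, just a sketch) is to re-attach any stranded '▁' to the following token after encoding with dropout:

def remerge_boundary(tokens):
    out, i = [], 0
    while i < len(tokens):
        if tokens[i] == "▁" and i + 1 < len(tokens):
            out.append("▁" + tokens[i + 1])  # glue '▁' back onto the next piece
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

print(remerge_boundary(["▁", "h", "e", "ll", "o"]))
# ['▁h', 'e', 'll', 'o']

Since tokens like ▁h exist in the vocabulary (as noted above), the re-merged pieces can still be mapped back to ids with bpe.subword_to_id.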

@TIXFeniks
Author

@xbelonogov, what do you think about my suggestion from the previous message?

@xbelonogov
Contributor

Hi, @TIXFeniks.
Your suggestion looks reasonable, but I don't want to add one more option for disabling this type of split. Every new option makes the interface more cumbersome and decreases usability. I am okay with doing this by default.

I asked Ivan Provilkov, but he isn't sure that this improves performance. If you have experiments that prove its effectiveness, I will change the default behaviour.
