space/none mode potential issue with case_markup #176

Open
Zenglinxiao opened this issue Oct 9, 2020 · 42 comments

Comments

@Zenglinxiao

When using case_markup in space/none mode, unexpected behavior happens:

>>> pyonmttok.Tokenizer("none", case_markup=True).tokenize("你好世界,这是一个Test。") 
... (['⦅mrk_case_modifier_C⦆', '你好世界,这是一个test。'], None)
>>> pyonmttok.Tokenizer("none", case_markup=True).detokenize(['⦅mrk_case_modifier_C⦆', '你好世界,这是一个test。'])
... '你好世界,这是一个test。'

As you can see, .detokenize cannot rebuild the original text. The same behavior exists for space.

The conservative and aggressive modes do not suffer from this issue, but their result is not consistent with the output without case_markup, as they split the text to insert the markup placeholder.

>>> pyonmttok.Tokenizer("conservative").tokenize("你好世界,这是一个Test。")
... (['你好世界', ',', '这是一个Test', '。'], None)
>>> pyonmttok.Tokenizer("conservative", case_markup=True).tokenize("你好世界,这是一个Test。") 
... (['你好世界', ',', '这是一个', '⦅mrk_case_modifier_C⦆', 'test', '。'], None)
@guillaumekln
Collaborator

Case markups are not really supported for the "none" and "space" tokenization modes.

case_markup enables segment_case to avoid tokens with mixed casing, but the "space" and "none" modes are not allowed to split in the middle of tokens: "space" mode can only split on spaces and "none" mode does not split at all.

Should we just raise an error in this case? Or could you describe what you expected when using case_markup with the space/none modes?

@Zenglinxiao
Author

I want to use the case_markup feature with sentencepiece in order to reduce the vocabulary duplication caused by casing, but I'm not sure what the best practice is.
In the sp_model_path section, you mention using none for sentencepiece, which is how I ran into this issue.
I'm just wondering whether none is a must when using sentencepiece.

Correct me if I'm wrong:
With the original sentencepiece, each sentence is handled by replacing whitespace with the spacer and then doing the segmentation.
With Tokenizer(sentencepiece model) under none, each sentence is handled by splitting the placeholders off from the sentence and feeding the rest to sentencepiece for segmentation. But if a placeholder is in the middle of the sentence, the original sentence is split into two parts that are fed to sentencepiece separately. In the end, even with none, the input to sentencepiece is still a list of tokens (I personally prefer "phrases" in this case) rather than a single sentence, so the behavior is not the same as spm_encode.

I then tried the following experiments:

EX1: use `pyonmttok.Tokenizer(...)` -> `.tokenize_file(corpus_file)` -> train sentencepiece model on this pretokenized corpus
EX2: Initialize `pyonmttok.Tokenizer(...)` as pretokenizer -> `learner(tokenizer, **other_opts)` -> ingest corpus_file -> learn model

These two approaches give a different model & vocab, which I think is caused by the way the Tokenizer ingests the file: models learned with the Tokenizer use "tokens" ("phrases") rather than "sentences".
This ingest_tokens idea apparently won't cause issues with BPE, but for sentencepiece, which expects a sentence and does not assume language-dependent logic (space as a natural word delimiter is language-dependent, I think), it may not guarantee the same result as the original sentencepiece.
So, do you have any idea or recommendation on how to correctly use the Tokenizer when working with sentencepiece?

@guillaumekln
Collaborator

guillaumekln commented Oct 9, 2020

Your understanding is correct.

The behavior is the same as spm_encode as long as you don't use placeholders. When you use placeholders, you are using a feature that is specific to the Tokenizer and that SentencePiece has no concept of. At this point we expect users to only use the Tokenizer, for training and applying SentencePiece models.

So the recommendation would simply be: do not use SentencePiece scripts directly.

The use case sounds reasonable. A similar issue came up in https://forum.opennmt.net/t/problem-in-tokenize-and-detokenize-during-translation/3954. The difficulty is that we need to lowercase the phrase before SentencePiece so that different casings result in the same segmentation. We would need to add some code to find the original casing after applying SentencePiece. I will look into it.
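
A rough conceptual sketch of that last step (not the Tokenizer's implementation, just an illustration): given the SentencePiece pieces of the lowercased phrase, the original casing can be mapped back piece by piece, assuming lowercasing did not change the string length.

def restore_case(pieces, original):
    # Re-apply the casing of `original` to SentencePiece pieces of `original.lower()`.
    # Assumes the pieces, minus their leading spacers, concatenate back to `original.lower()`.
    restored = []
    pos = 0
    for piece in pieces:
        core = piece.lstrip("▁")  # drop the SentencePiece spacer prefix
        prefix = piece[: len(piece) - len(core)]
        restored.append(prefix + original[pos : pos + len(core)])
        pos += len(core)
    return restored

# Example: pieces of "test" mapped back onto "Test" give ['▁Te', 'st'].
print(restore_case(["▁te", "st"], "Test"))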

Alternatively, you can try using mode "conservative" or "aggressive". SentencePiece will be used as a subtokenizer like BPE.

@panosk
Contributor

panosk commented Dec 13, 2020

I'm also trying to make sentencepiece work with case_markup. I got it working somehow by adding the Tokenizer's case placeholders as user_defined_symbols in sentencepiece. I still get a few <unk>s that I don't get when using sentencepiece in none mode, but with this hack they are reduced a lot. Now the question is: should I lowercase the corpus in order to train the sentencepiece model and get a vocabulary with all subwords lowercased?

@panosk
Contributor

panosk commented Dec 14, 2020

OK, this actually seems to work. I lowercased the corpus, created a sentencepiece model and vocab with onmt-build-vocab with the case placeholders as user_defined_symbols, and trained a test model for 5k steps on the raw training files. There are only very few unks, which mostly occur at case splits (which makes sense) and in uppercase tokens (which I can't figure out).

Maybe this could be handled in code, so that when sentencepiece is used with a mode other than none and with case_markup, the case_markup placeholders are predefined as user_defined_symbols.

@guillaumekln , could you please clarify what happens and in what order when creating a sentencepiece model and vocab with onmt-build-vocab with mode aggressive and case_markup?

@guillaumekln
Collaborator

guillaumekln commented Dec 14, 2020

could you please clarify what happens and in what order when creating a sentencepiece model and vocab with onmt-build-vocab with mode aggressive and case_markup?

Actually this is not possible with onmt-build-vocab from OpenNMT-tf. It always applies a none tokenization before training the SentencePiece model. Looks like we need to add some errors when trying to configure a custom tokenization (or add support for it).

To set a different tokenization, you could use the SentencePieceLearner directly. Here's what happens when you use a tokenizer with aggressive and case_markup:

  1. we apply the aggressive tokenization
  2. we lowercase the tokens
  3. we feed each lowercased token as a "sentence" to SentencePiece

The insertion of case markup tokens does not happen in this learning phase. They are added during tokenization after applying the SentencePiece model.
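
For reference, here is a minimal sketch of that SentencePieceLearner setup (the file names, vocabulary size, and exact tokenizer options are only placeholders):

import pyonmttok

# Pretokenization applied before the SentencePiece training (step 1 above).
tokenizer = pyonmttok.Tokenizer(
    "aggressive",
    case_markup=True,
    soft_case_regions=True,
    joiner_annotate=True,
)

# The learner ingests the raw corpus through this tokenizer, so each
# pretokenized, lowercased token becomes one SentencePiece "sentence" (steps 2-3).
learner = pyonmttok.SentencePieceLearner(
    tokenizer=tokenizer,
    vocab_size=32000,
    character_coverage=1.0,
)
learner.ingest_file("train.txt")

# Returns a Tokenizer that chains the pretokenization and the trained model.
sp_tokenizer = learner.learn("sp.model")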

@panosk
Contributor

panosk commented Dec 14, 2020

Actually this is not possible with onmt-build-vocab from OpenNMT-tf. It always applies a none tokenization before training the SentencePiece model. Looks like we need to add some errors when trying to configure a custom tokenization (or add support for it).

I see, so essentially it's like running the spm_train directly and onmt-build-vocab just takes care of converting the vocabulary.

To set a different tokenization, you could use the SentencePieceLearner directly.

My attempt was to avoid a separate preprocessing step and have everything ready with onmt-build-vocab -> train, but this seems necessary.

The insertion of case markup tokens does not happen in this learning phase. They are added during tokenization after applying the SentencePiece model.

Yes, and if the SentencePiece model already contains the case markup tokens as user defined symbols, then sentencepiece ignores them when it decodes, so the case can be restored correctly and the translated text seems (mostly) fine. But some inconsistencies remain, due to case splitting that creates tokens/subwords unseen by sentencepiece.

@guillaumekln
Collaborator

My attempt was to avoid a separate preprocessing step and have everything ready with onmt-build-vocab -> train, but this seems necessary.

Yes. I added support for pre-tokenization in the PR linked above.

But some inconsistencies remain, due to case splitting that creates tokens/subwords unseen by sentencepiece.

When using aggressive and case_markup, case splitting is applied as part of the aggressive tokenization and before SentencePiece. So there should not be unseen tokens in this case.

@panosk
Contributor

panosk commented Dec 14, 2020

That's fantastic, thanks!

@panosk
Contributor

panosk commented Dec 29, 2020

I thought I should leave some feedback on this:

  • I get lots of unks, all of them after punctuation marks (parentheses, quotes, etc.). I inspected a bit and noticed that OpenNMTTokenizer does not add the space marker in front of such symbols. How could we eliminate this inconsistency with the way SentencePiece handles punctuation marks?
  • When SentencePiece is used after pre-tokenization, there is a catch: the number of lines fed to SentencePiece does not correspond to the actual corpus lines but to single tokens, and this makes the sentencepiece_trainer explode and crash with a bad malloc before creating the suffix array, even though there's still RAM available. After I put a limit of 100M sentences (which should actually be single tokens), I was able to train the model without issues. I suspect I can push this limit to 150-200M.

@guillaumekln
Collaborator

I get lots of unks, all of them after punctuation marks (parentheses, quotes, etc.). I inspected a bit and noticed that OpenNMTTokenizer does not add the space marker in front of such symbols. How could we eliminate this inconsistency with the way SentencePiece handles punctuation marks?

You generated the vocabulary with onmt-build-vocab from OpenNMT-tf, right? When using SentencePiece with pre-tokenization, the output tokens are actually not meant to be compatible with the vocabulary generated by SentencePiece. We should fix the script to rebuild the vocabulary in this case.

@panosk
Contributor

panosk commented Jan 4, 2021

Yes, the vocab is built with onmt-build-vocab. I just noticed the related PR in OpenNMT-tf repo, thanks!

@panosk
Contributor

panosk commented Jan 28, 2021

Some more feedback: I updated pyonmttok and OpenNMT-tf and tried to build a new vocab with sentencepiece and case_markup. The sp model and the vocab are built, but the user-defined symbols are not included in the vocabulary, even though they are recognized and mentioned by sentencepiece when training starts.
Also, now the only option accepted by onmt-build-vocab for building a sentencepiece model is none. This means we lose some of the goodies aggressive offers, but at least we should be able to use case_markup, right?

@guillaumekln
Collaborator

To summarize what was done in the latest update, there are now 2 modes when generating the SentencePiece vocabulary:

When no pretokenizer is set:

  1. Start SentencePiece training on raw data
  2. Convert the SentencePiece vocabulary to OpenNMT-tf format

When a pretokenizer is set:

  1. Tokenize the training data with the pretokenization
  2. Start SentencePiece training where each line is a single token (SentencePiece is trained as a subtokenizer)
  3. Tokenize the training data with the SentencePiece model
  4. Extract the N most frequent tokens
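
For illustration, a rough Python sketch of steps 3 and 4 (assuming the SentencePiece model was saved as sp.model and reusing the same pretokenization options; all names here are placeholders):

import collections
import pyonmttok

# Step 3: tokenize the training data with the pretokenization + SentencePiece model.
tokenizer = pyonmttok.Tokenizer(
    "aggressive",
    case_markup=True,
    joiner_annotate=True,
    sp_model_path="sp.model",
)
tokenizer.tokenize_file("train.txt", "train.txt.tok", num_threads=4)

# Step 4: keep the N most frequent tokens of the tokenized data as the vocabulary.
counter = collections.Counter()
with open("train.txt.tok", encoding="utf-8") as f:
    for line in f:
        counter.update(line.split())

with open("train.vocab", "w", encoding="utf-8") as f:
    for token, _ in counter.most_common(50000):
        f.write(token + "\n")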

The sp model and the vocab are built, but the user-defined symbols are not included in the vocabulary, even though they are recognized and mentioned by sentencepiece when training starts.

Are the user-defined symbols in the training data? As said above, the training data is retokenized with SentencePiece so the symbols should appear in the tokenized data to be included in the vocabulary.

Also, now the only option accepted by onmt-build-vocab for building a sentencepiece model is none. This means we lose some of the goodies aggressive offers, but at least we should be able to use case_markup, right?

You should still be able to use another tokenization mode such as aggressive. Is there an error or bug?

@panosk
Contributor

panosk commented Jan 28, 2021

I should get a better grasp of it, so I could use your help. First here is the command:

onmt-build-vocab --tokenizer_config ../../../Tokenization/lower_tokenization.yml --size 32000 --sentencepiece user_defined_symbols="⦅D01⦆,⦅D02⦆,⦅D03⦆,⦅D04⦆,⦅D05⦆,⦅mrk_case_modifier_C⦆,⦅mrk_case_modifier_L⦆,⦅mrk_case_modifier_U⦆,⦅mrk_case_modifier_M⦆,⦅mrk_case_modifier_N⦆,⦅mrk_begin_case_region_C⦆,⦅mrk_begin_case_region_L⦆,⦅mrk_begin_case_region_U⦆,⦅mrk_begin_case_region_M⦆,⦅mrk_begin_case_region_N⦆,⦅mrk_end_case_region_C⦆,⦅mrk_end_case_region_L⦆,⦅mrk_end_case_region_U⦆,⦅mrk_end_case_region_M⦆,⦅mrk_end_case_region_N⦆" character_coverage=1 input_sentence_size=10000000 num_threads=16 --size_multiple 8 --save_vocab vocab/base corpus.combined

Here is my lower_tokenization.yml:

type: OpenNMTTokenizer
params:
  mode: none
  case_markup: true
  spacer_annotate: true
  soft_case_region: true
  preserve_placeholders: true
  preserve_segmented_tokens: true
  #segment_case: true
  #segment_numbers: true

So, with this configuration, I think I'm using "Mode 1" and all options are ignored; the sp model and vocab are built, but the user-defined symbols are not added to the vocab, which confuses me. These symbols are not included in the corpus, but this is not a problem when using sentencepiece directly to create a model and vocab --it adds the user-defined symbols even when not present in the training corpus.

When I change mode to anything else in my config (aggressive, conservative, etc), onmt-build-vocab refuses to run and throws this:

   tokenizer = tokenizers.make_tokenizer(args.tokenizer_config)
  File "/home/panos/venv36/lib/python3.6/site-packages/opennmt/tokenizers/tokenizer.py", line 322, in make_tokenizer
    tokenizer = tokenizer_class(**tokenizer_params)
  File "/home/panos/venv36/lib/python3.6/site-packages/opennmt/tokenizers/opennmt_tokenizer.py", line 23, in __init__
    self._tokenizer = pyonmttok.Tokenizer(**kwargs)
TypeError: __init__(): incompatible constructor arguments. The following argument types are supported:
    1. pyonmttok._ext.Tokenizer(tokenizer: pyonmttok._ext.Tokenizer)
    2. pyonmttok._ext.Tokenizer(mode: str, *, bpe_model_path: str = '', bpe_vocab_path: str = '', bpe_vocab_threshold: int = 50, bpe_dropout: float = 0, vocabulary_path: str = '', vocabulary_threshold: int = 0, sp_model_path: str = '', sp_nbest_size: int = 0, sp_alpha: float = 0.1, joiner: str = '', joiner_annotate: bool = False, joiner_new: bool = False, spacer_annotate: bool = False, spacer_new: bool = False, case_feature: bool = False, case_markup: bool = False, soft_case_regions: bool = False, no_substitution: bool = False, preserve_placeholders: bool = False, preserve_segmented_tokens: bool = False, segment_case: bool = False, segment_numbers: bool = False, segment_alphabet_change: bool = False, support_prior_joiners: bool = False, segment_alphabet: object = None)

Invoked with: kwargs: mode='aggresive', case_markup=True, spacer_annotate=True, soft_case_region=True, preserve_placeholders=True, preserve_segmented_tokens=True

If I get it correctly, "Mode 2" requires using any mode other than none. So, how would you advise training sentencepiece with onmt-build-vocab in order to get case_markup and all the other nice things from aggressive, if possible?

Thanks for your patience and your help.

@guillaumekln
Collaborator

So, with this configuration, I think I'm using "Mode 1"

Sorry for the confusion but when I said "When a pretokenizer is set", it's whenever the option --tokenizer_config is set. It's easier to explain this way. So this configuration should trigger "Mode 2".

When I change mode to anything else in my config (aggressive, conservative, etc), onmt-build-vocab refuses to run and throws this:

There is a typo in your config: it should be soft_case_regions not soft_case_region.

@dmar1n

dmar1n commented Jan 29, 2021

I'm following this thread with a lot of interest, many thanks @guillaumekln and @panosk.

So, if I understand correctly, it should be possible to pretokenise raw data using the aggressive mode, then create SP vocabs from that pretokenised data, then use the converted vocabs to segment text for training and inference with the OpenNMT tokeniser. I also understand this can be done manually or via the script.

However, I suppose that for the aggressive mode to work as expected when tokenising/detokenising, one should apply joiner annotation; otherwise, I see many possible ambiguity cases when detokenising. On the other hand, if a SP model is used, the tokens are generated with the spacer annotation by default, which is incompatible with the joiner annotation according to the doc.

Am I right? Or does applying the aggressive mode not need joiner annotation at all, and is it therefore fully compatible with using SP vocab models? Otherwise, could this be solved by applying different parameters when pretokenising for vocab creation and when pretokenising for training/inference?

@panosk
Contributor

panosk commented Jan 29, 2021

Hi @dmar1n ,
You can use the option spacer_annotate, in which case the annotation symbol is the same spacer used by sentencepiece.

@guillaumekln ,
Apologies for the naive typo; indeed, now I can use aggressive to build the sentencepiece model and vocab. However, the user-defined symbols are not included, as they have 0 frequency. Maybe a condition could be added when extracting the N most frequent tokens to keep entries with 0 frequency, as these tokens will only be meta-tokens. Then again, why is that extra step needed? I mean, doesn't the vocab created by sentencepiece already contain the most frequent tokens?

@guillaumekln
Collaborator

guillaumekln commented Jan 29, 2021

@dmar1n
Joiner and spacer annotation is a postprocessing step, so it can work with any tokenization mode:

$ echo "Hello World!" | cli/tokenize --mode aggressive --joiner_annotate
Hello World ■!
$ echo "Hello World!" | cli/tokenize --mode aggressive --spacer_annotate
Hello ▁World !

$ echo "Hello World!" | cli/tokenize --mode none --sp_model_path ~/data/wmt_ende/wmtende.model 
▁H ello ▁World !
$ echo "Hello World!" | cli/tokenize --mode none --sp_model_path ~/data/wmt_ende/wmtende.model --joiner_annotate
H ■ello World ■!

On the other hand, if a SP model is used, the tokens are generated with the spacer annotation by default, which is incompatible with the joiner annotation according to the doc.

When you use SentencePiece via the OpenNMT Tokenizer, the spacers are removed internally and converted into metadata so that we can later decide if we want to inject joiners or spacers.

From the user perspective, using a pretokenization with SentencePiece should be the same as using a pretokenization with BPE.

@panosk

Then again, why is that extra step needed? I mean, doesn't the vocab created by sentencepiece already contain the most frequent tokens?

This extra step is needed because the internal SentencePiece vocabulary is invalid when using a pretokenization. The basic example is when you want to use joiner annotation with SentencePiece: the SentencePiece internal vocabulary will contain spacers, but the external vocabulary should include joiners. This is why we need to get the vocabulary from the training data, and not from the SentencePiece internal representation.

But I'm not sure I understand the use case of user-defined symbols with 0 frequency. If they are not in the tokenized training data, why should they appear in the vocabulary?

@panosk
Contributor

panosk commented Jan 29, 2021

Thanks for the explanations @guillaumekln , I see.

But I'm not sure I understand the use case of user-defined symbols with 0 frequency. If they are not in the tokenized training data, why should they appear in the vocabulary?

I'm adding these symbols later for training the NMT model and for inference, at least that was the case when I was using sentencepiece directly --I may have to adapt it now, no big deal.
Anyway, I'll run a few iterations with the resulting sp model and vocab and see how it goes.

@panosk
Contributor

panosk commented Jan 29, 2021

After a few tests, I can confirm that the user-defined symbols must be included in the vocab. Apart from any custom symbols (which can be included in the corpus for training the sp model), the major problem is with the case markup symbols, which cannot be included in the training corpus beforehand but should be in the vocab anyway; otherwise casing doesn't work and there are countless <unk>s in their place.

Just to make sure that I'm not doing anything wrong on my part: after creating the sp model and vocab, I used the same tokenization .yml config for the actual NMT training, with the extra option sp_model_path: /path_to_sp.model

@dmar1n

dmar1n commented Feb 1, 2021

To complete @panosk's comments, I have also run some tests with the same idea (applying aggressive mode with case markup as pretok and SentencePiece as vocab model).

I first tried manually by building the SentencePiece model on pretokenised text (which already included special symbols). This sort of worked (no errors), but I had the same problem as @panosk: the predictions had many unks, presumably related to the aggressive tokenisation.

With the script, I managed to reduce the amount of unks a lot, but there are still some in the evaluation predictions. This does not seem to impact the quality too much, but I cannot explain where these unks come from, since the validation data should be fully covered by the vocab.

On the other hand, I wonder if this is somehow an inevitable side effect of using pretokenised data with the aggressive mode, and then maybe the replace_unk would help.

Concretely, I'm creating the vocabs with the script and the following --tokenizer_config:

type: OpenNMTTokenizer
params:
  case_markup: true
  joiner_annotate: false
  mode: aggressive
  segment_alphabet_change: true
  segment_case: true
  segment_numbers: true
  spacer_annotate: true
  support_prior_joiners: false

@panosk

the major problem is with the case markup symbols which cannot be included in the training corpus beforehand

When you tokenise the data for training, do you pretokenise using the OpenNMT tokeniser? This should add the case markup symbols to the training data. At least, this worked for me.

@guillaumekln
Collaborator

The case markup symbols should be included in the vocabulary. I just tried building the following dummy vocabulary to make sure it works:

$ echo "Hello world!" > tmp.txt
$ onmt-build-vocab --sentencepiece --size 12 --tokenizer_config '{"mode": "aggressive", "case_markup": true, "joiner_annotate": true}' --save_vocab output tmp.txt
$ cat output.vocab 
<blank>
<s>
</s>
■l
■o
⦅mrk_case_modifier_C⦆
h
■e
w
■r
■d
■!

@dmar1n

dmar1n commented Feb 1, 2021

I confirm the case markup tokens are included in the vocabulary. These are the first lines of my target vocab:

<blank>
<s>
</s>
⦅mrk_case_modifier_C⦆
▁de
,
▁la
'
.
▁l
’
▁et
▁les
▁des
▁à
⦅mrk_begin_case_region_U⦆
⦅mrk_end_case_region_U⦆

And indeed, the predictions include the symbols.

Here is an example of prediction with unk:

  • Target prediction: ⦅PH⦆ ⦅mrk_case_modifier_C⦆ <unk> ▁dossier :
  • Validation reference: ⦅PH⦆ dossier ▁d ’ enquête :
  • Validation source: ⦅PH⦆ ⦅mrk_case_modifier_C⦆ inquiry ▁file :

In this sentence, ⦅PH⦆ is a custom symbol correctly predicted.

@panosk
Contributor

panosk commented Feb 1, 2021

Well... I was using a lowercased version of my corpus with onmt-build-vocab (facepalm). This explains the absence of the case-markup symbols from the vocab, but it still doesn't explain the plethora of <unk>s, as @dmar1n notices too. Once I realized I had been using the lowercased version of my corpus, I was almost certain a new test would show much more promising results, but I was surprised to see that the amount of <unk>s and the model performance were not affected by much (at least for the first few thousand steps). As a comparison, using a vanilla sentencepiece model and vocab (which is just converted to the proper format with onmt-build-vocab) gives 0 <unk>s even at the very first evaluation step. Now the number of sentences containing at least 1 unk accounts for ~20% of the total number of predictions.

I also wonder if the increased amount of <unk>s is the price we have to pay for getting case handling.

@dmar1n

dmar1n commented Feb 2, 2021

But this is really strange, because <unk>s don't make sense in validation data, which is necessarily covered by the vocab. And moreover, the <unk>s seem to appear instead of normal words/tokens. With SentencePiece/BPE, the only <unk>s possible should be very rare characters not covered by the vocab.


I'm editing this post, as the example I gave was not exact. Here is a real case:

  • Source:⦅PH⦆ ⦅mrk_case_modifier_C⦆ create ▁a ▁structured ▁interview
  • Hyp: ⦅PH⦆ ⦅mrk_case_modifier_C⦆ <unk> ▁un ▁entretien ▁structuré
  • Ref: ⦅PH⦆ ⦅mrk_case_modifier_C⦆ créer ▁un ▁entretien ▁structuré

The source vocab has create and ▁create
The target vocab has créer and ▁créer

@guillaumekln
Collaborator

When training the SentencePiece model, do you set the input_sentence_size option?

With SentencePiece/BPE, the only <unk>s possible should be very rare characters not covered by the vocab.

That's only true for plain SentencePiece. When using a pretokenization with either SentencePiece or BPE, <unk>s are possible depending on the data distribution when generating the vocabulary.

I'm just not sure why the <unk> frequency is so high. In particular, I don't see how the example above can happen if all expected tokens are in the vocabulary.

I understand the initial goal of this issue is to train case insensitive SentencePiece models. We might need to think of a different approach that does not involve a full pretokenization.

@dmar1n

dmar1n commented Feb 3, 2021

When training the SentencePiece model, do you set the input_sentence_size option?

Yes, but with a value in the order of millions. Apart from that, the data is monolingual, of good quality, and deduplicated.

In particular I don't see how the example above can happen if all expected tokens are in the vocabulary.

Actually, the example had the unk at 5k steps, but it corrected itself in a subsequent prediction. In general, I noticed that the number of unks is reduced as the training goes on. However, sentences with one or more unks still remain even after a significant number of steps (at 17k steps, I counted 209 sentences with unk out of 2k validation lines with BLEU scores already plateauing).

After a number of tests, I can confirm what @panosk pointed out: the issue seems to be linked to a non-alphabetic character preceding the token, such as apostrophes, parentheses, etc.

To give you another more representative example (at 17k steps):

  • Source: ▁children ▁( unaccompanied ▁or ▁with ▁their ▁families )
  • Hyp: ▁les ▁enfants ▁( <unk> ▁ou ▁avec ▁leur ▁famille )
  • Ref: ▁les ▁enfants ▁( seuls ▁ou ▁accompagnés ▁de ▁leur ▁famille )

In this case, ▁unaccompanied and ▁seuls are in the vocabs, but not their variants without the spacer.

@guillaumekln
Collaborator

Yes, but with a value in the order of millions.

Just to note that when using a pretokenization, input_sentence_size corresponds to a number of words, since the SentencePiece model is trained at the word-level and not the sentence-level.

After a number of tests, I can confirm what @panosk pointed out: the issue seems to be linked to a non-alphabetic character preceding the token, such as apostrophes, parentheses, etc.

Maybe using joiner_annotate instead could improve the situation?

@panosk
Contributor

panosk commented Feb 3, 2021

Just to note that when using a pretokenization, input_sentence_size corresponds to a number of words, since the SentencePiece model is trained at the word-level and not the sentence-level.

You are right, but I was careful with that. So while with a normal sentencepiece training at the sentence level I set a limit of 10M sentences, with pretokenization I set a limit of 300M (tokens), which should be enough --at least that's a safe high limit for 64GB of RAM.

Maybe using joiner_annotate instead could improve the situation?

That's a good idea, I'll try it asap!

@dmar1n

dmar1n commented Feb 3, 2021

Thanks a lot for the hints, @guillaumekln. I was indeed using a value of 10M. I will remove that argument and limit the initial corpus beforehand to 10M lines.

Regarding the joiner annotation, this was my initial idea when I first intervened in the thread. Unfortunately, when using joiner annotation, I got some incompatibility error with SentencePiece models. I will try again, though.

@dmar1n

dmar1n commented Feb 3, 2021

Here are some updates. I tried with the joiner annotation. The vocabs are correctly created (there are the expected joiners and no spacers). But when I tokenise the training data, I get the following error:

ValueError: SentencePiece vocabulary restriction requires the tokenization to use "spacer_annotate" (same as spm_encode)

If I then change the config to have spacer annotation (using the vocabs correctly created with the joiners), I get extremely segmented data, which is normal given that the vocab does not have any spacer.

@guillaumekln
Collaborator

I see.

At this point, why not use BPE? Since managing case with SentencePiece currently requires a pretokenization (this could be improved in the future), it seems there is little benefit over BPE. From experience, the following BPE tokenization should work well in many cases:

pyonmttok.Tokenizer(
    "aggressive",
    bpe_model_path=...,
    vocabulary_path=...,
    joiner_annotate=True,
    case_markup=True,
    soft_case_regions=True,
    preserve_placeholders=True,
    preserve_segmented_tokens=True,
    segment_case=True,
    segment_numbers=True,
    segment_alphabet_change=True,
)

@dmar1n

dmar1n commented Feb 3, 2021

Thanks for the config sample! I see there are options in that configuration that I was not specifying in my tests.

And for clarification, I have been using BPE as a subword model via SentencePiece all the time. I referred to SentencePiece just as the library used to subtokenise, which I configure via the option --sentencepiece model_type=bpe.


Update: I think I understand better now. So, the simplest way to proceed would be to create a BPE model, or a BPE-based tokeniser using the Python wrapper, with the required OpenNMT tokeniser options. This should indeed simplify the process a lot. I will try this approach and let you know. Many thanks again for your help!

@guillaumekln
Collaborator

Yes, I meant using the BPE implementation in the Tokenizer. The BPE training is not integrated in onmt-build-vocab, but it should be fairly easy to use the Python API to train the model, apply it on the training data, and then build the vocabulary.

@panosk
Contributor

panosk commented Feb 4, 2021

@guillaumekln , I know this gets a bit off topic, but could you please verify the steps below for using BPE? I've been using sentencepiece since forever and all my code is adapted to it, but I really need case handling, so I'll test BPE extensively.

  • Tokenize source and target corpus with OpenNMTTokenizer using aggressive mode and all the case and other options I want
  • Use subword-nmt learn-joint-bpe-and-vocab with both training files
  • Use onmt-build-vocab --from_vocab bpe-vocab.{src,tgt} --save_vocab onmt-vocab.{src,tgt} --size_multiple 8 //I keep the vocab sizes as resulted from subword-nmt
  • Replace @@ with the joiner symbol in the vocabs
  • Add the BPE model and the converted vocabularies in the tokenization .yml files
  • Train?

Thanks in advance!

@guillaumekln
Collaborator

guillaumekln commented Feb 4, 2021

I recommend training the BPE model with the Tokenizer directly. It will take care of many details and ensure consistency. Here's a basic workflow:

import pyonmttok

tokenizer = pyonmttok.Tokenizer(
    "aggressive",
    joiner_annotate=True,
    case_markup=True,
    soft_case_regions=True,
    preserve_placeholders=True,
    preserve_segmented_tokens=True,
    segment_case=True,
    segment_numbers=True,
    segment_alphabet_change=True,
)

learner = pyonmttok.BPELearner(tokenizer=tokenizer, symbols=32000)
learner.ingest_file("train.txt")

tokenizer = learner.learn("bpe.model")
tokenizer.tokenize_file("train.txt", "train.txt.tok", num_threads=4)

Then build the vocabulary from train.txt.tok:

onmt-build-vocab --save_vocab bpe.vocab train.txt.tok

(Note: symbols=32000 is the number of BPE merge operations, and not the vocabulary size. There will probably be more unique tokens in the tokenized data.)

Finally you can either train directly on train.txt.tok without configuring the tokenization .yml files, or re-tokenize train.txt using the BPE model and vocabulary restriction (the vocabulary_path argument).
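
If you go with the second option, a short sketch of that re-tokenization with vocabulary restriction could look like this (paths are placeholders, and the options mirror the tokenizer used to train the BPE model above):

import pyonmttok

# Re-tokenize with the trained BPE model, restricting merges so that the output
# tokens are covered by the generated vocabulary (vocabulary_path).
tokenizer = pyonmttok.Tokenizer(
    "aggressive",
    bpe_model_path="bpe.model",
    vocabulary_path="bpe.vocab",
    joiner_annotate=True,
    case_markup=True,
    soft_case_regions=True,
    preserve_placeholders=True,
    preserve_segmented_tokens=True,
    segment_case=True,
    segment_numbers=True,
    segment_alphabet_change=True,
)
tokenizer.tokenize_file("train.txt", "train.txt.tok", num_threads=4)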


Let's try not to diverge too much from the initial issue. For further discussion about BPE, please consider opening a topic on the forum.

@panosk
Contributor

panosk commented Feb 4, 2021

Thanks a lot!

Let's try not to diverge too much from the initial issue. For further discussion about BPE, please consider opening a topic on the forum.

Absolutely!

@dmar1n

dmar1n commented Feb 5, 2021

I followed the suggested approach to build the vocabs and tokenise the training data. Up to here, everything works like a charm. After 15k training steps, though, there are still many <unk>s, but now with a different pattern: the <unk>s appear between digits or special symbols. After some analysis, it seems that these <unk>s correspond to joiners that end up between those characters. Here is an example:

  • Source: (■ 2 ■ 0 ■ 1 ■ 9 ■)
  • Hyp: (■ 2 <unk> 0 <unk> 1 <unk> 9 ■)

As you can see, each parenthesis has its joiner attached, while the numbers have spaces around them; unfortunately, everything indicates that these orphaned joiners are systematically replaced with <unk>s in the predictions. Interestingly enough, this does not happen with alphabetic or punctuation tokens.

I replicated the proposed settings/workflow line by line, but maybe I missed an important option here? Otherwise, it shouldn't be difficult to fix this issue in a postprocessing step, but I guess it would be better to find the root cause first. I will look at it and let you know if I find anything relevant.

@panosk
Contributor

panosk commented Feb 5, 2021

Hi @dmar1n ,

If you followed the steps for using BPE directly in the tokenizer with no sentencepiece involvement, I can confirm that it works like a charm and I get 0 <unk>s right from the start, so maybe you missed something.

As @guillaumekln noted, we are getting off track from the initial issue, so feel free to post your last comment in the forum and we can continue there.

@dmar1n

dmar1n commented Feb 5, 2021

Thanks, @panosk, it's good to know that it works for you. I confirm I followed the exact workflow and options suggested. Also note that the issue remains the same for me; that is, not being able to use case markup in any configuration with subword tokenisation. Anyway, I will give it another try and post the issue in the forum, if still unresolved. Thanks both for your help!


Just a quick update. The suggested solution did work eventually. I think it was a problem of the versions installed. With the latest versions, it works great. Thanks again!

@guillaumekln
Collaborator

To get back to the initial issue and request: case_markup with "true" SentencePiece would definitely be useful. But I still have not found a good solution that ticks all the boxes:

  • the SentencePiece model is case insensitive ("hello", "Hello", "HELLO" have the same segmentation)
  • the vocabulary generated by SentencePiece is compatible with the Tokenizer output (meaning the Tokenizer output tokens are exactly the same as the SentencePiece tokens)
  • mixed-casing tokens are always split on case changes ("WiFi" cannot be segmented into "wifi", "w ifi", or "wif i")

So I'm not sure it is possible to effectively implement this outside of SentencePiece. If you have any ideas, please let me know.
