NaN unigram model score error with sentencepiece 0.1.98 #851

Closed
lucaslingle opened this issue Apr 14, 2023 · 3 comments
@lucaslingle commented Apr 14, 2023
On a clean Ubuntu machine with sentencepiece 0.1.98 installed via pip, I am getting NaN scores when training a unigram model.

For example, the following script fails, but it worked with version 0.1.97.

import tempfile
import tensorflow_datasets as tfds
import sentencepiece as spm

def dump_chars_to_tempfile(ds, maxchars):
    char_count = 0
    with tempfile.NamedTemporaryFile(delete=False, prefix="/tmp/ds_chars") as outfp:
        for document_chars in ds:
            if char_count >= maxchars:
                break
            outfp.write(document_chars + b" ")
            char_count += len(document_chars)
        return outfp.name, char_count

chardump_ds = tfds.load("wiki40b/en:1.3.0", split="train").map(lambda r: r["text"]).as_numpy_iterator()
fname, _ = dump_chars_to_tempfile(ds=chardump_ds, maxchars=int(1e8))

temp_fp = tempfile.NamedTemporaryFile(delete=False, prefix="/tmp/sp_tmp")
spm.SentencePieceTrainer.Train(
    input=fname,
    vocab_size=32000,
    character_coverage=1.0,
    model_prefix=temp_fp.name,
    model_type="unigram",
    user_defined_symbols=[],
    pad_id=0,
    bos_id=-1,  # disable bos id
    eos_id=1,
    unk_id=2,
)

The stack trace is:

sentencepiece_trainer.cc(77) LOG(INFO) Starts training with : 
trainer_spec {
  input: /tmp/ds_charsrpp7ukvr
  input_format: 
  model_prefix: /tmp/sp_tmphrtwan9z
  model_type: UNIGRAM
  vocab_size: 32000
  self_test_sample_size: 0
  character_coverage: 1
  input_sentence_size: 0
  shuffle_input_sentence: 1
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  max_sentence_length: 4192
  num_threads: 16
  num_sub_iterations: 2
  max_sentencepiece_length: 16
  split_by_unicode_script: 1
  split_by_number: 1
  split_by_whitespace: 1
  split_digits: 0
  pretokenization_delimiter: 
  treat_whitespace_as_suffix: 0
  allow_whitespace_only_pieces: 0
  required_chars: 
  byte_fallback: 0
  vocabulary_output_piece_score: 1
  train_extremely_large_corpus: 0
  hard_vocab_limit: 1
  use_all_vocab: 0
  unk_id: 2
  bos_id: -1
  eos_id: 1
  pad_id: 0
  unk_piece: <unk>
  bos_piece: <s>
  eos_piece: </s>
  pad_piece: <pad>
  unk_surface:  ⁇ 
  enable_differential_privacy: 0
  differential_privacy_noise_level: 0
  differential_privacy_clipping_threshold: 0
}
normalizer_spec {
  name: nmt_nfkc
  add_dummy_prefix: 1
  remove_extra_whitespaces: 1
  escape_whitespaces: 1
  normalization_rule_tsv: 
}
denormalizer_spec {}
trainer_interface.cc(351) LOG(INFO) SentenceIterator is not specified. Using MultiFileSentenceIterator.
trainer_interface.cc(183) LOG(INFO) Loading corpus: /tmp/ds_charsrpp7ukvr
trainer_interface.cc(378) LOG(WARNING) Found too long line (4536 > 4192).
trainer_interface.cc(380) LOG(WARNING) Too long lines are skipped in the training.
trainer_interface.cc(381) LOG(WARNING) The maximum length can be changed with --max_sentence_length=<size> flag.
trainer_interface.cc(407) LOG(INFO) Loaded all 425807 sentences
trainer_interface.cc(414) LOG(INFO) Skipped 1935 too long sentences.
trainer_interface.cc(423) LOG(INFO) Adding meta_piece: <pad>
trainer_interface.cc(423) LOG(INFO) Adding meta_piece: </s>
trainer_interface.cc(423) LOG(INFO) Adding meta_piece: <unk>
trainer_interface.cc(428) LOG(INFO) Normalizing sentences...
trainer_interface.cc(537) LOG(INFO) all chars count=88479119
trainer_interface.cc(548) LOG(INFO) Done: 100% characters are covered.
trainer_interface.cc(558) LOG(INFO) Alphabet size=3623
trainer_interface.cc(559) LOG(INFO) Final character coverage=1
trainer_interface.cc(591) LOG(INFO) Done! preprocessed 425801 sentences.
unigram_model_trainer.cc(247) LOG(INFO) Making suffix array...
unigram_model_trainer.cc(251) LOG(INFO) Extracting frequent sub strings... node_num=43616444
unigram_model_trainer.cc(301) LOG(INFO) Initialized 577125 seed sentencepieces
unigram_model_trainer.cc(150) [!std::isnan(score)] 
Program terminated with an unrecoverable error.

I thought the developers would want to know. I will use version 0.1.97 in the meantime. Thank you!

@taku910 (Collaborator) commented Apr 15, 2023
Thank you. It seems that the seed vocabulary has an extremely large score. This is a critical bug, so we will fix it soon. Thank you for the report.
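For readers hitting the same `[!std::isnan(score)]` check: the trainer aborts when a seed piece's score is NaN. A rough Python sketch, not the actual C++ implementation, of how an extremely large (overflowed) seed frequency can turn a log-probability score into NaN:

```python
import math

def seed_scores(freqs):
    # Unigram seed scores are log-probabilities:
    # score = log(freq) - log(sum of all freqs).
    total = sum(freqs)
    return [math.log(f) - math.log(total) for f in freqs]

ok = seed_scores([3.0, 1.0])             # finite log-probs, sums to prob 1
bad = seed_scores([float("inf"), 1.0])   # overflowed count: inf - inf -> NaN
print(any(math.isnan(s) for s in bad))   # True
```

If one frequency overflows to infinity, the normalizer is infinite too, and `inf - inf` yields NaN for that piece, which is exactly what the `!std::isnan(score)` assertion guards against.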

@taku910 taku910 self-assigned this Apr 20, 2023
@taku910 taku910 added the bug label Apr 20, 2023
@taku910 (Collaborator) commented May 2, 2023

@ngan-nt commented Jul 7, 2023

Hi, I still have this problem even after installing version 0.1.99. Thank you!

trainer_interface.cc(522) LOG(INFO) Found null character. The corpus must be encoded in utf-8.                                                                                           
trainer_interface.cc(537) LOG(INFO) all chars count=2194048217
trainer_interface.cc(548) LOG(INFO) Done: 99.99% characters are covered.
trainer_interface.cc(558) LOG(INFO) Alphabet size=3063
trainer_interface.cc(559) LOG(INFO) Final character coverage=0.9999
trainer_interface.cc(591) LOG(INFO) Done! preprocessed 967719 sentences.
unigram_model_trainer.cc(222) LOG(INFO) Making suffix array...
unigram_model_trainer.cc(226) LOG(INFO) Extracting frequent sub strings... node_num=1237217680                                                                                           
unigram_model_trainer.cc(274) LOG(INFO) Initialized 1003063 seed sentencepieces
trainer_interface.cc(597) LOG(INFO) Tokenizing input sentences with whitespace: 967719
trainer_interface.cc(608) LOG(INFO) Done! 40765528
unigram_model_trainer.cc(564) LOG(INFO) Using 40765528 sentences for EM training
unigram_model_trainer.cc(580) LOG(INFO) EM sub_iter=0 size=702716 obj=59.4182 num_tokens=479275856 num_tokens/piece=682.034                                                              
unigram_model_trainer.cc(580) LOG(INFO) EM sub_iter=1 size=530593 obj=80.5757 num_tokens=480490361 num_tokens/piece=905.572                                                              
unigram_model_trainer.cc(580) LOG(INFO) EM sub_iter=0 size=363670 obj=58.2552 num_tokens=502565215 num_tokens/piece=1381.93                                                              
unigram_model_trainer.cc(580) LOG(INFO) EM sub_iter=1 size=357052 obj=62.4634 num_tokens=519818158 num_tokens/piece=1455.86                                                              
unigram_model_trainer.cc(580) LOG(INFO) EM sub_iter=0 size=267764 obj=58.906 num_tokens=510585662 num_tokens/piece=1906.85                                                               
unigram_model_trainer.cc(125) [!std::isnan(score)]
Program terminated with an unrecoverable error.
