error in LM pretraining #62

Open
blazejdolicki opened this issue Mar 25, 2020 · 1 comment

blazejdolicki commented Mar 25, 2020

What I did:

  • Checked out the pretrain-lm branch because it has clear instructions on how to pretrain an LM (Example how to pretrain lm + introduction of config_name #57).
  • Installed required packages.
  • Executed bash prepare_wiki.sh de
  • Executed python -W ignore -m multifit new multifit_paper_version replace_ --name my_lm - train_ --pretrain-dataset data/wiki/de-100
  • Received the following traceback:
    python -W ignore -m multifit new multifit_paper_version replace_ --name my_lm - train_ --pretrain-dataset data/wiki/de-100
    Setting LM weights seed seed to 0
    Running tokenization: 'lm-notst' ...
    Wiki text was split to 1 articles
    Wiki text was split to 1 articles
    Wiki text was split to 1 articles
    Traceback (most recent call last):
    File "/home/ubuntu/anaconda3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
    File "/home/ubuntu/anaconda3/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
    File "/home/ubuntu/multifit/multifit/__main__.py", line 16, in <module>
    fire.Fire(Experiment())
    File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/fire/core.py", line 138, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
    File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/fire/core.py", line 468, in _Fire
    target=component.__name__)
    File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/fire/core.py", line 672, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
    File "/home/ubuntu/multifit/multifit/training.py", line 587, in train_
    self.pretrain_lm.train_(pretrain_dataset)
    File "/home/ubuntu/multifit/multifit/training.py", line 275, in train_
    learn = self.get_learner(data_lm=dataset.load_lm_databunch(bs=self.bs, bptt=self.bptt, limit=self.limit))
    File "/home/ubuntu/multifit/multifit/datasets/dataset.py", line 208, in load_lm_databunch
    limit=limit)
    File "/home/ubuntu/multifit/multifit/datasets/dataset.py", line 258, in load_n_cache_databunch
    databunch = self.databunch_from_df(bunch_class, train_df, valid_df, **args)
    File "/home/ubuntu/multifit/multifit/datasets/dataset.py", line 271, in databunch_from_df
    **args)
    File "/home/ubuntu/multifit/fastai_contrib/text_data.py", line 147, in make_data_bunch_from_df
    TextList.from_df(valid_df, path, cols=text_cols, processor=processor))
    File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/fastai/data_block.py", line 434, in __init__
    if not self.train.ignore_empty and len(self.train.items) == 0:
    TypeError: len() of unsized object

From initial debugging, train.items is an ndarray with shape (), which is why len() fails on it. Printing it shows the article text in German. I suspect the log line "Wiki text was split to 1 articles" points to the problem: the wiki text should probably be split into more than one article. So something may be going wrong in read_wiki_articles() in dataset.py. That is my best guess, but I don't know where to go from here.
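
For anyone trying to reproduce the symptom outside of multifit, here is a minimal sketch of how NumPy behaves when a single string ends up wrapped in an array. It only illustrates the shape () / len() behaviour described above; whether this is what actually happens inside fastai or read_wiki_articles() is an assumption, and the sample string is made up.

```python
import numpy as np

# Assumption for illustration: a single text is wrapped directly in an array.
# NumPy then produces a 0-dimensional array, i.e. shape ().
items = np.array("ein einzelner Artikel")
print(items.shape)  # ()

# len() is undefined for 0-d arrays and raises the same TypeError
# that appears at the end of the traceback.
try:
    len(items)
    message = None
except TypeError as err:
    message = str(err)
print(message)  # len() of unsized object

# np.atleast_1d promotes the 0-d array to a 1-d array with one element,
# so len() works again.
fixed = np.atleast_1d(items)
print(fixed.shape, len(fixed))  # (1,) 1
```

This does not fix the underlying question of why only one article is produced; it just shows why a single-item result can surface as "len() of unsized object".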

@blazejdolicki

My package versions differ slightly from those in requirements.txt; sacremoses may be related:
fire 0.3.0
sacremoses 0.0.38
sentencepiece 0.1.85
fastai 1.0.47
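
A small sketch for collecting these versions programmatically, so they can be compared against requirements.txt by hand. The helper name installed_versions is hypothetical, and importlib.metadata needs Python 3.8 or newer, so on the Python 3.6 environment shown in the traceback `pip freeze` would be the simpler option.

```python
from importlib import metadata  # standard library, Python 3.8+


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in the current environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    packages = ["fire", "sacremoses", "sentencepiece", "fastai"]
    for name, version in installed_versions(packages).items():
        print(name, version)
```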
