Specifying a validation set #66

FOX111 · 2020-04-16T11:15:09Z

I'm training a language model similar to the one shown here: https://github.com/n-waves/multifit/blob/master/notebooks/CLS-JA.ipynb

When I run cls_dataset.load_clas_databunch(bs=exp.finetune_lm.bs).show_batch()
I get this output:

Running tokenization: 'lm-notst' ...
Validation set not found using 10% of trn
Data lm-notst, trn: 26925, val: 2991
Size of vocabulary: 15000
First 20 words in vocab: ['xxunk', 'xxpad', 'xxbos', 'xxfld', 'xxmaj', 'xxup', 'xxrep', 'xxwrep', '', '▁', '▁,', '▁.', '▁в', 'а', 'и', 'е', '▁и', 'й', '▁на', 'х']
Running tokenization: 'cls' ...
Data cls, trn: 26925, val: 2991
Running tokenization: 'tst' ...
/home/explorer/miniconda3/envs/fast/lib/python3.6/site-packages/fastai/data_block.py:537: UserWarning: You are labelling your items with CategoryList.
Your valid set contained the following unknown labels, the corresponding items have been discarded.
201, 119, 192, 162, 168...
if getattr(ds, 'warn', False): warn(ds.warn)
Data tst, trn: 2991, val: 7448

I assume this is caused by labels being poorly represented in the validation set that was inferred automatically. Is there a way to pass a validation set explicitly?


Qe42 · 2020-06-22T15:00:10Z

Name your files train.csv, dev.csv, test.csv and unsup.csv, or see the from_df options.
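One way to produce such files is sketched below: a per-label split that keeps at least one example of every label in the training part, so the dev set cannot contain a label the training set has never seen. This is only a sketch under assumptions. The helper names `stratified_split` and `write_split` are hypothetical and not part of MultiFiT, and the header-less CSV layout with the label in the first column is assumed, so check it against what your dataset loader expects.

```python
import csv
import random
from collections import defaultdict
from pathlib import Path


def stratified_split(rows, label_index=0, valid_frac=0.1, seed=42):
    """Split rows per label so every label in dev also occurs in train."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_index]].append(row)

    rng = random.Random(seed)
    train, dev = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        # Always leave at least one example of each label in train.
        n_dev = min(int(len(group) * valid_frac), len(group) - 1)
        dev.extend(group[:n_dev])
        train.extend(group[n_dev:])
    return train, dev


def write_split(rows, out_dir, **kwargs):
    """Write train.csv and dev.csv into out_dir (assumed header-less layout)."""
    train, dev = stratified_split(rows, **kwargs)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for name, part in (("train.csv", train), ("dev.csv", dev)):
        with open(out_dir / name, "w", newline="", encoding="utf-8") as f:
            csv.writer(f).writerows(part)
    return train, dev
```

Labels with very few examples end up entirely in train.csv. A test.csv that contains labels missing from train.csv would still trigger the same warning, so it may be worth checking the test labels separately.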
