Model cannot converge #1

theoqian · 2021-10-02T05:56:01Z

I tried to train a mask_align model with the default config in the repo (changing only the data paths) on the DE-EN training data from https://github.com/lilt/alignment-scripts. At some training steps the losses are NaN, and by the end of training the loss has increased from about 7 to about 70.

epoch = 5, step = 49980, loss: nan, f_loss: nan, b_loss: nan, agree_loss: nan, entropy_loss: nan (0.246 sec)
epoch = 5, step = 49990, loss: 64.210, f_loss: 67.750, b_loss: 60.188, agree_loss: 0.000, entropy_loss: 0.241 (0.507 sec)
epoch = 5, step = 50000, loss: 69.115, f_loss: 72.500, b_loss: 65.312, agree_loss: 0.000, entropy_loss: 0.240 (0.652 sec)

carboncoo · 2021-10-03T06:26:46Z

Hi, this is most likely caused by sentence pairs of length 1 in the training data. Our masking strategy cannot handle such pairs, so we filter them out. You can use thualign/scripts/remove_single.py to filter the corpus and then try training again.
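
For anyone who wants to see what such a filter amounts to, here is a minimal sketch in Python. It is not the contents of thualign/scripts/remove_single.py, which may work differently. It assumes that "length 1" means a side with a single whitespace-separated token, that a pair is dropped if either side is that short, and that the corpus is stored as two line-aligned plain-text files. The function names and the `min_tokens` parameter are made up for illustration.

```python
def filter_single_token_pairs(src_lines, tgt_lines, min_tokens=2):
    """Keep only pairs where both sides have at least `min_tokens` tokens.

    Assumption: tokens are whitespace-separated, and a pair is dropped
    if either the source or the target side is too short.
    """
    kept_src, kept_tgt = [], []
    for src, tgt in zip(src_lines, tgt_lines):
        if len(src.split()) >= min_tokens and len(tgt.split()) >= min_tokens:
            kept_src.append(src)
            kept_tgt.append(tgt)
    return kept_src, kept_tgt


def filter_files(src_path, tgt_path, out_src_path, out_tgt_path):
    """Filter two line-aligned files and return the number of pairs removed."""
    with open(src_path, encoding="utf-8") as fs, open(tgt_path, encoding="utf-8") as ft:
        src_lines = [line.rstrip("\n") for line in fs]
        tgt_lines = [line.rstrip("\n") for line in ft]

    # A length mismatch means the files are not line-aligned, so stop early.
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target files have different line counts")

    kept_src, kept_tgt = filter_single_token_pairs(src_lines, tgt_lines)

    with open(out_src_path, "w", encoding="utf-8") as fs:
        fs.writelines(line + "\n" for line in kept_src)
    with open(out_tgt_path, "w", encoding="utf-8") as ft:
        ft.writelines(line + "\n" for line in kept_tgt)

    return len(src_lines) - len(kept_src)
```

Both sides are filtered together so that the output files stay line-aligned. Checking the returned count before retraining shows how many pairs were affected.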
