Training does not start #17

acDante · 2019-01-12T17:32:33Z

Hi, Luheng:
Thanks for your great work! I ran into some strange behavior during training. I used the following command to start training your model:
python python/train.py --config=./config/srl_config.json --model=./output --train=./sample_data/sentences_with_gold.txt --dev=./sample_data/sentences_with_gold.txt --task=srl

And I got these outputs in the terminal:

/scratch/users/duxi/miniconda3/envs/deep_srl/lib/python2.7/site-packages/theano/gpuarray/dnn.py:135: UserWarning: Your cuDNN version is more recent than Theano. If you encounter problems, try updating Theano or downgrading cuDNN to version 5.1.
warnings.warn("Your cuDNN version is more recent than "
Using cuDNN version 6021 on context None
Mapped name None to device cuda: GeForce GTX TITAN X (0000:04:00.0)
Task: srl
Embedding size=100
Extracting features
Extraced 19 words and 9 tags
Max training sentence length: 9
Max development sentence length: 9
Warning: not using official gold predicates. Not for formal evaluation.
Dev data has 1 batches.
Data loading duration was 0:00:14.
[WARNING] Log directory ./output is not empty, previous checkpoints might be overwritten
Preparation duration was 0:00:00.
Using 2 feature types, projected output dim=200.
('lstm_0_rdrop', 0.1, True)
<neural_srl.theano.layer.HighwayLSTMLayer object at 0x7fe5782bab50>
('lstm_1_rdrop', 0.1, True)
<neural_srl.theano.layer.HighwayLSTMLayer object at 0x7fe5782620d0>
('lstm_2_rdrop', 0.1, True)
<neural_srl.theano.layer.HighwayLSTMLayer object at 0x7fe56bf82c10>
('lstm_3_rdrop', 0.1, True)
<neural_srl.theano.layer.HighwayLSTMLayer object at 0x7fe570087f90>
('lstm_4_rdrop', 0.1, True)
<neural_srl.theano.layer.HighwayLSTMLayer object at 0x7fe5781912d0>
('lstm_5_rdrop', 0.1, True)
<neural_srl.theano.layer.HighwayLSTMLayer object at 0x7fe570090f90>
('lstm_6_rdrop', 0.1, True)
<neural_srl.theano.layer.HighwayLSTMLayer object at 0x7fe57809b590>
('lstm_7_rdrop', 0.1, True)
<neural_srl.theano.layer.HighwayLSTMLayer object at 0x7fe578203f90>
embedding_0 embedding_0 [ 19 100]
embedding_1 embedding_1 [ 2 100]
lstm_0_W lstm_0_W [ 200 1800]
lstm_0_U lstm_0_U [ 300 1500]
lstm_0_b lstm_0_b [1800]
lstm_1_W lstm_1_W [ 300 1800]
lstm_1_U lstm_1_U [ 300 1500]
lstm_1_b lstm_1_b [1800]
lstm_2_W lstm_2_W [ 300 1800]
lstm_2_U lstm_2_U [ 300 1500]
lstm_2_b lstm_2_b [1800]
lstm_3_W lstm_3_W [ 300 1800]
lstm_3_U lstm_3_U [ 300 1500]
lstm_3_b lstm_3_b [1800]
lstm_4_W lstm_4_W [ 300 1800]
lstm_4_U lstm_4_U [ 300 1500]
lstm_4_b lstm_4_b [1800]
lstm_5_W lstm_5_W [ 300 1800]
lstm_5_U lstm_5_U [ 300 1500]
lstm_5_b lstm_5_b [1800]
lstm_6_W lstm_6_W [ 300 1800]
lstm_6_U lstm_6_U [ 300 1500]
lstm_6_b lstm_6_b [1800]
lstm_7_W lstm_7_W [ 300 1800]
lstm_7_U lstm_7_U [ 300 1500]
lstm_7_b lstm_7_b [1800]
softmax_W softmax_W [300 9]
softmax_b softmax_b [9]

After this output, nothing more was printed to the terminal, and the file "./output/checkpoints.tsv" stayed empty even after training had been running for a long time. Training does not seem to make any progress at all. I am not sure whether this is a GPU-specific issue: I am using cuda8.0 + cudnn8.0, and here is my Theano configuration file:

[global]
device = cuda
floatX = float64
mode = FAST_RUN

[cuda]
root=/usr/local/cuda-8.0/

[dnn]
enable=True
include_path=/usr/local/cuda-8.0/include
library_path=/usr/local/cuda-8.0/lib64

[lib]
cnmem = 0.8

[nvcc]
fastmath = True

[gcc]
cxxflags=-Wno-narrowing

Could you give me any ideas about the potential cause?
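To tell "very slow" apart from "stalled", the symptom above (an empty "./output/checkpoints.tsv" long after launch) can be checked with a small watcher run in a second terminal. This is only a sketch: the helper name `wait_for_output` and the timeout values are made up for illustration, and it uses Python 3 (`time.monotonic`), so it would run outside the Python 2.7 `deep_srl` environment.

```python
import os
import time


def wait_for_output(path, timeout_s, poll_s=1.0):
    """Return True once `path` exists and is non-empty, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        # A checkpoint log that has received at least one row has size > 0.
        if os.path.exists(path) and os.path.getsize(path) > 0:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)


if __name__ == "__main__":
    # Path taken from the command in this post; the 10-minute limit is arbitrary.
    ok = wait_for_output("./output/checkpoints.tsv", timeout_s=600, poll_s=5.0)
    print("checkpoint written" if ok else "still empty after timeout")
```

A False result only shows that nothing was written within the chosen window; it does not say whether the process is compiling, training slowly, or hung.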


Huijun-Cui · 2019-03-30T05:11:57Z

Hi, did you solve this problem? I have also encountered it.

acDante · 2019-03-30T12:55:05Z

I guess this problem is caused by an incompatible GPU configuration, but I did not manage to solve it. I would suggest taking a look at allennlp; this model is also implemented there.
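For anyone who wants to test the configuration guess, the sketch below parses the Theano config quoted in the original post and lists settings that may deserve a second look when `device = cuda` is used. It is not a confirmed diagnosis of this issue. The function name `check_theanorc` and the wording of the notes are illustrative, and the notes reflect my understanding of the Theano configuration docs: GPU use is normally paired with `floatX = float32`, and `lib.cnmem` and the `[nvcc]` options belong to the older `device = gpu` backend. The script uses Python 3's `configparser`, so it runs outside the Python 2.7 environment.

```python
import configparser

# The configuration quoted in the original post, reproduced as-is.
THEANORC = """\
[global]
device = cuda
floatX = float64
mode = FAST_RUN

[cuda]
root=/usr/local/cuda-8.0/

[dnn]
enable=True
include_path=/usr/local/cuda-8.0/include
library_path=/usr/local/cuda-8.0/lib64

[lib]
cnmem = 0.8

[nvcc]
fastmath = True

[gcc]
cxxflags=-Wno-narrowing
"""


def check_theanorc(text):
    """Return notes about settings worth re-checking (hypothetical helper)."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    device = parser.get("global", "device", fallback="cpu")
    floatx = parser.get("global", "floatX", fallback="float64")
    # device = cuda / cuda0 / ... selects the gpuarray backend.
    uses_gpuarray = device.startswith("cuda")

    notes = []
    if uses_gpuarray and floatx == "float64":
        # Assumption: double precision is much slower on GeForce-class GPUs.
        notes.append("floatX = float64 on a GPU; float32 is the usual choice")
    if uses_gpuarray and parser.has_option("lib", "cnmem"):
        # Assumption: the gpuarray backend reads gpuarray.preallocate instead.
        notes.append("lib.cnmem targets the old device=gpu backend")
    if uses_gpuarray and parser.has_section("nvcc"):
        # Assumption: nvcc.* options belong to the old CUDA backend.
        notes.append("[nvcc] options target the old device=gpu backend")
    return notes


if __name__ == "__main__":
    for note in check_theanorc(THEANORC):
        print("-", note)
```

Separately, the log in the original post reports cuDNN version 6021 together with Theano's warning that this cuDNN is more recent than Theano, with a suggestion to update Theano or downgrade cuDNN to 5.1, so the Theano and cuDNN versions may be worth aligning before changing anything else.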
