failed with "make draft MODEL_NAME=test" #26

Open
xixiaoyao opened this issue Jun 14, 2022 · 2 comments
@xixiaoyao

Logs are as follows, thanks.

convert squad examples to features: 100%|█████████████████████████████████████████████████████████████████████████| 902/902 [00:00<00:00, 2092.37it/s]
add example index and unique id: 100%|██████████████████████████████████████████████████████████████████████████| 902/902 [00:00<00:00, 439863.06it/s]
06/14/2022 22:26:05 - INFO - main - Number of trainable params: 258,127,108
Selected optimization level O1: Insert automatic casts around Pytorch functions and Tensor methods.

Defaults for this optimization level are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
Warning: multi_tensor_applier fused unscale kernel is unavailable, possibly because apex was installed without --cuda_ext --cpp_ext. Using Python fallback. Original ImportError was: ModuleNotFoundError("No module named 'amp_C'")
06/14/2022 22:26:05 - INFO - main - ***** Running training *****
06/14/2022 22:26:05 - INFO - main - Num examples = 1218
06/14/2022 22:26:05 - INFO - main - Num Epochs = 2
06/14/2022 22:26:05 - INFO - main - Instantaneous batch size per GPU = 48
06/14/2022 22:26:05 - INFO - main - Total train batch size (w. parallel, distributed & accumulation) = 384
06/14/2022 22:26:05 - INFO - main - Gradient Accumulation steps = 1
06/14/2022 22:26:05 - INFO - main - Total optimization steps = 8
Epoch: 0%| | 0/2 [00:00<?, ?it/s]06/14/2022 22:26:05 - INFO - main -
[Epoch 1]
06/14/2022 22:26:05 - INFO - main - Initialize pre-batch of size 2 for Epoch 1

Traceback (most recent call last):
File "train_rc.py", line 593, in
main()
File "train_rc.py", line 537, in main
global_step, tr_loss = train(args, train_dataset, model, tokenizer)
File "train_rc.py", line 222, in train
outputs = model(**inputs)
File "/ssd3/wangxiao/anaconda3/envs/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/ssd3/wangxiao/anaconda3/envs/py38/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/ssd3/wangxiao/anaconda3/envs/py38/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/ssd3/wangxiao/anaconda3/envs/py38/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
output.reraise()
File "/ssd3/wangxiao/anaconda3/envs/py38/lib/python3.8/site-packages/torch/_utils.py", line 425, in reraise
raise self.exc_type(msg)
StopIteration: Caught StopIteration in replica 0 on device 0.
Original Traceback (most recent call last):
File "/ssd3/wangxiao/anaconda3/envs/py38/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
output = module(*input, **kwargs)
File "/ssd3/wangxiao/anaconda3/envs/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/ssd1/zhangyiming/densephrase/DensePhrases/densephrases/encoder.py", line 132, in forward
start, end = self.embed_phrase(input_ids, attention_mask, token_type_ids)
File "/ssd1/zhangyiming/densephrase/DensePhrases/densephrases/encoder.py", line 94, in embed_phrase
outputs = self.phrase_encoder(
File "/ssd3/wangxiao/anaconda3/envs/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/ssd3/wangxiao/anaconda3/envs/py38/lib/python3.8/site-packages/transformers/modeling_bert.py", line 707, in forward
attention_mask, input_shape, self.device
File "/ssd3/wangxiao/anaconda3/envs/py38/lib/python3.8/site-packages/transformers/modeling_utils.py", line 113, in device
return next(self.parameters()).device
StopIteration

@jhyuklee
Member

Hi, which version of DensePhrases are you using?
I think this might be related to a device (GPU) issue.
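If it is a GPU issue, the usual cause (an assumption on my part, not confirmed from this log alone) is that older transformers versions (the flat transformers/modeling_bert.py path in the trace suggests a pre-4.0 release) compute a module's device via next(self.parameters()), and inside an nn.DataParallel replica on torch >= 1.5 that iterator is empty, so next() raises the StopIteration seen in the traceback. A minimal sketch of that failure mode; the Probe module below is hypothetical and only illustrates the pattern, it is not DensePhrases code:

import torch
import torch.nn as nn

class Probe(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 4)

    def forward(self, x):
        # Same lookup pattern as transformers' modeling_utils `device` property:
        # inside a DataParallel replica on torch >= 1.5, parameters() yields
        # nothing, so next() raises StopIteration.
        device = next(self.parameters()).device
        return self.linear(x.to(device))

model = nn.DataParallel(Probe().cuda())  # replicates across all visible GPUs
out = model(torch.randn(8, 4).cuda())    # raises StopIteration when more than one GPU is visible

If that is what is happening here, pinning the run to a single GPU (for example, CUDA_VISIBLE_DEVICES=0 make draft MODEL_NAME=test) is a common way to check, though whether that workaround applies to DensePhrases v1.0 is also an assumption.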

@xixiaoyao
Author

v1.0
