"Resource temporarily unavailable" when distributed training on full Freebase #240

Open
RweBs opened this issue Oct 28, 2021 · 0 comments

RweBs commented Oct 28, 2021

I am running the following command on a cluster of 4 machines.

DGLBACKEND=pytorch dglke_dist_train --path ~/my_task --ip_config ~/my_task/ip_config8.txt \
--num_client_proc 40 --model TransE_l2 --dataset Freebase --data_path ~/my_task --hidden_dim 128 \
--gamma 10.0 --lr 0.1 --batch_size 1024 --neg_sample_size 256 --max_step 12800 --log_interval 256 \
--batch_size_eval 1024 --neg_sample_size_eval 1024 --test -adv --regularization_coef 1.00E-09 \
--no_save_emb --num_thread 1 >> fb-dglke.txt

I got the following errors:

/usr/local/lib64/python3.6/site-packages/dgl/base.py:25: UserWarning: multigraph will be deprecated.DGL will treat all graphs as multigraph in the future.
  warnings.warn(msg, warn_type)
/usr/local/lib64/python3.6/site-packages/dgl/base.py:25: UserWarning: multigraph will be deprecated.DGL will treat all graphs as multigraph in the future.
  warnings.warn(msg, warn_type)
/usr/local/lib64/python3.6/site-packages/dgl/base.py:25: UserWarning: multigraph will be deprecated.DGL will treat all graphs as multigraph in the future.
  warnings.warn(msg, warn_type)
/usr/local/lib64/python3.6/site-packages/dgl/base.py:25: UserWarning: multigraph will be deprecated.DGL will treat all graphs as multigraph in the future.
  warnings.warn(msg, warn_type)
Traceback (most recent call last):
  File "/usr/local/bin/dglke_server", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.6/site-packages/dglke/kvserver.py", line 232, in main
    start_server(args)
  File "/usr/local/lib/python3.6/site-packages/dglke/kvserver.py", line 227, in start_server
    my_server.start()
  File "/usr/local/lib64/python3.6/site-packages/dgl/contrib/dis_kvstore.py", line 509, in start
    _sender_connect(self._sender)
  File "/usr/local/lib64/python3.6/site-packages/dgl/network.py", line 98, in _sender_connect
    _CAPI_DGLSenderConnect(sender)
  File "/usr/local/lib64/python3.6/site-packages/dgl/_ffi/_ctypes/function.py", line 190, in __call__
    ctypes.byref(ret_val), ctypes.byref(ret_tcode)))
  File "/usr/local/lib64/python3.6/site-packages/dgl/_ffi/base.py", line 62, in check_call
    raise DGLError(py_str(_LIB.DGLGetLastError()))
dgl._ffi.base.DGLError: Resource temporarily unavailable

File "/usr/local/lib/python3.6/site-packages/dglke/models/pytorch/tensor_models.py", line 77, in decorated_function
    raise exception.__class__(trace)
dgl._ffi.base.DGLError: Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/dglke/models/pytorch/tensor_models.py", line 65, in _queue_result
    res = func(*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/dglke/train_pytorch.py", line 1492, in dist_train_test
    client = connect_to_kvstore(args, entity_pb, relation_pb, l2g)
  File "/usr/local/lib/python3.6/site-packages/dglke/train_pytorch.py", line 1111, in connect_to_kvstore
    my_client.connect()
  File "/usr/local/lib64/python3.6/site-packages/dgl/contrib/dis_kvstore.py", line 953, in connect
    _receiver_wait(self._receiver, client_ip, int(client_port), self._server_count)
  File "/usr/local/lib64/python3.6/site-packages/dgl/network.py", line 116, in _receiver_wait
    _CAPI_DGLReceiverWait(receiver, ip_addr, int(port), int(num_sender))
  File "/usr/local/lib64/python3.6/site-packages/dgl/_ffi/_ctypes/function.py", line 190, in __call__
    ctypes.byref(ret_val), ctypes.byref(ret_tcode)))
  File "/usr/local/lib64/python3.6/site-packages/dgl/_ffi/base.py", line 62, in check_call
    raise DGLError(py_str(_LIB.DGLGetLastError()))
dgl._ffi.base.DGLError: Resource temporarily unavailable

terminate called after throwing an instance of 'dmlc::Error'
  what():  [11:13:56] /opt/dgl/src/graph/network/socket_communicator.cc:144: Check failed: tmp != -1 (-1 vs. -1) :
Stack trace:
  [bt] (0) /usr/local/lib64/python3.6/site-packages/dgl/libdgl.so(dgl::network::SocketSender::SendLoop(dgl::network::TCPSocket*, dgl::network::MessageQueue*)+0x7a6) [0x7f3002adce16]
  [bt] (1) /lib64/libstdc++.so.6(+0xb5070) [0x7f305d5ee070]
  [bt] (2) /lib64/libpthread.so.0(+0x7dd5) [0x7f306f1f6dd5]
  [bt] (3) /lib64/libc.so.6(clone+0x6d) [0x7f306e816ead]

When I ran the following commands, I found that the numbers of server and client processes differed from machine to machine:

ps -ef | grep dglke_server | grep -v grep | wc -l (results on the 4 machines: 8, 7, 8, 8)
ps -ef | grep dglke_client | grep -v grep | wc -l (results on the 4 machines: 161, 106, 148, 110)
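
For comparing these counts across machines, a small helper along the following lines might be convenient. It is only a sketch: `count_processes` mirrors the `grep ... | grep -v grep | wc -l` pipeline on captured `ps -ef` text, and `find_outliers` flags machines whose count differs from the most common one. The host names `machine0` to `machine3` are hypothetical, and the mapping of the four reported results to machines in that order is an assumption.

```python
from collections import Counter


def count_processes(ps_output: str, name: str) -> int:
    """Count lines of `ps -ef` output that mention `name`, skipping grep itself."""
    return sum(
        1
        for line in ps_output.splitlines()
        if name in line and "grep" not in line
    )


def find_outliers(counts_by_host: dict) -> dict:
    """Return the hosts whose count differs from the most common count."""
    if not counts_by_host:
        return {}
    most_common, _ = Counter(counts_by_host.values()).most_common(1)[0]
    return {host: n for host, n in counts_by_host.items() if n != most_common}


if __name__ == "__main__":
    # Hypothetical host names; the counts are the dglke_server results above.
    server_counts = {"machine0": 8, "machine1": 7, "machine2": 8, "machine3": 8}
    print(find_outliers(server_counts))  # {'machine1': 7}
```

With the server counts above, this singles out the one machine running 7 server processes instead of 8, which would be the place to look for a server that exited early.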

Experimental configuration:

Python 3.6.8, dgl 0.4.3, dglke 0.1.0
Each machine has 512 GB of memory

When I change "--num_client_proc 40" to "--num_client_proc 8" or less, it works fine.
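
A possible next diagnostic step, offered as an assumption rather than a confirmed cause: on Linux, "Resource temporarily unavailable" is the standard message for the EAGAIN error code, which can appear when a per-user or per-process limit is reached. Since the failure only shows up with a larger number of client processes, it may be worth printing the open-file and process limits on each machine. The sketch below uses only the Python standard library; whether either limit is actually involved here is not established.

```python
import errno
import os
import resource


def describe_limits() -> dict:
    """Return (soft, hard) limits for open files and for processes/threads."""
    limits = {
        "RLIMIT_NOFILE": resource.RLIMIT_NOFILE,  # max open file descriptors
        "RLIMIT_NPROC": resource.RLIMIT_NPROC,    # max processes for this user
    }
    return {name: resource.getrlimit(res) for name, res in limits.items()}


if __name__ == "__main__":
    # Show the error code and message that the platform associates with EAGAIN.
    print("EAGAIN =", errno.EAGAIN, "->", os.strerror(errno.EAGAIN))
    for name, (soft, hard) in describe_limits().items():
        # resource.RLIM_INFINITY means the limit is unlimited.
        print(name, "soft:", soft, "hard:", hard)
```

Running this on all 4 machines and comparing the output would show whether the machines are configured differently, which could be one explanation for the uneven process counts.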
