Multi-node GPU support is still outstanding #46

mortonjt · 2020-07-08T04:26:01Z

It looks like multi-node GPU support is still an outstanding task. If I execute the following script to run on 4 nodes (16 GPUs):

workers=40
nodes=4
layers=2
RESULTS=results/full_run_w${workers}_n${nodes}_l${layers}
mkdir -p $RESULTS
deepblast-train \
    --train-pairs $DIR/train.txt \
    --test-pairs $DIR/test.txt \
    --valid-pairs $DIR/valid.txt \
    --output-directory $RESULTS \
    --nodes $nodes \
    --num-workers $workers \
    --learning-rate 1e-5 \
    --visualization-fraction 0.001 \
    --batch-size $((64 * nodes)) \
    --layers $layers \
    --grad-accum 10 \
    --gpus 4 \
    --backend ddp

I get the following error:

Traceback (most recent call last):
  File "/home/jmorton/miniconda3/envs/alignment/bin/deepblast-train", line 7, in <module>
    exec(compile(f.read(), __file__, 'exec'))
  File "/home/jmorton/research/gert/deepblast/scripts/deepblast-train", line 67, in <module>
    main(hparams)
  File "/home/jmorton/research/gert/deepblast/scripts/deepblast-train", line 47, in main
    trainer.fit(model)
  File "/home/jmorton/miniconda3/envs/alignment/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 964, in fit
    self.set_random_port()
  File "/home/jmorton/miniconda3/envs/alignment/lib/python3.8/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 392, in set_random_port
    assert self.num_nodes == 1, 'random port can only be called from single node training'
AssertionError: random port can only be called from single node training

It's likely because this line of code was only introduced by a merge yesterday, here: Lightning-AI/pytorch-lightning#2512 (comment)
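For anyone reading the assertion message cold: my understanding is that a random port only makes sense on a single node, because every node in a multi-node job has to rendezvous on the same `MASTER_PORT`. The sketch below illustrates that idea only. It is not the Lightning implementation; `resolve_master_port` and the port range are hypothetical names and values I made up for illustration, and the only real thing it relies on is the `MASTER_PORT` environment variable that `torch.distributed` reads.

```python
import os
import random


def resolve_master_port(num_nodes, env=None):
    """Hypothetical helper: return the rendezvous port as a string.

    Not the pytorch-lightning code; it only illustrates why a random
    port cannot be used when num_nodes > 1.
    """
    env = os.environ if env is None else env

    if num_nodes > 1:
        # All nodes must agree on one port, so it has to come from the
        # launcher (e.g. the cluster job script), not from a per-process draw.
        port = env.get("MASTER_PORT")
        if port is None:
            raise RuntimeError(
                "multi-node training needs MASTER_PORT set by the launcher"
            )
        return str(port)

    # Single node: reuse MASTER_PORT if present, otherwise pick any port.
    # The range below is an arbitrary assumption for this sketch.
    port = str(env.get("MASTER_PORT", random.randint(10000, 19999)))
    env["MASTER_PORT"] = port
    return port
```

With this sketch, `resolve_master_port(4, {"MASTER_PORT": "12910"})` returns the launcher's port unchanged, while `resolve_master_port(4, {})` raises instead of silently picking a different port on each node.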

mortonjt · 2023-03-07T22:08:15Z

Done.

mortonjt mentioned this issue Jul 8, 2020

Fix ddp tests + .test() Lightning-AI/pytorch-lightning#2512

Merged

mortonjt closed this as completed Mar 7, 2023
