Increase NCCL timeout to 3 hours
When training on a large dataset with DDP, the dataset scanning step can take long enough to raise an NCCL timeout error. Increase the process-group timeout from the 30-minute default to 3 hours, matching Ultralytics YOLOv8 (ultralytics/ultralytics#3343).

Signed-off-by: Troy <wudashuo@vip.qq.com>
wudashuo committed Nov 8, 2023
1 parent 84ec8b5 commit ccc374c
Showing 1 changed file with 2 additions and 2 deletions.
4 changes: 2 additions & 2 deletions train.py
@@ -23,7 +23,7 @@
import sys
import time
from copy import deepcopy
-from datetime import datetime
+from datetime import datetime, timedelta
from pathlib import Path

try:
@@ -529,7 +529,7 @@ def main(opt, callbacks=Callbacks()):
assert torch.cuda.device_count() > LOCAL_RANK, 'insufficient CUDA devices for DDP command'
torch.cuda.set_device(LOCAL_RANK)
device = torch.device('cuda', LOCAL_RANK)
-dist.init_process_group(backend='nccl' if dist.is_nccl_available() else 'gloo')
+dist.init_process_group(backend='nccl' if dist.is_nccl_available() else 'gloo', timeout=timedelta(seconds=10800))

# Train
if not opt.evolve:
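
For illustration only, the effect of the changed call can be sketched without a GPU. The helper name `ddp_init_kwargs` and its `timeout_hours` parameter are hypothetical and not part of train.py; the sketch assumes only that `torch.distributed.init_process_group` accepts `backend` and `timeout` keyword arguments, as the diff above shows.

```python
from datetime import timedelta


# Hypothetical helper (not in train.py): builds the keyword arguments that the
# patched call passes to torch.distributed.init_process_group.
def ddp_init_kwargs(nccl_available: bool, timeout_hours: float = 3.0) -> dict:
    return {
        # Same backend selection as the diff: NCCL when available, else Gloo.
        "backend": "nccl" if nccl_available else "gloo",
        # Process-group timeout expressed in hours instead of raw seconds.
        "timeout": timedelta(hours=timeout_hours),
    }


kwargs = ddp_init_kwargs(nccl_available=True)
# The commit's timedelta(seconds=10800) is the same duration as 3 hours.
assert kwargs["timeout"] == timedelta(seconds=10800)

# In a real DDP launch (needs torch and the launcher's environment variables):
#   import torch.distributed as dist
#   dist.init_process_group(**ddp_init_kwargs(dist.is_nccl_available()))
```

Writing the duration as `timedelta(hours=3)` is equivalent to the commit's `timedelta(seconds=10800)` and may be easier to read at a glance.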
