fix getenv
glenn-jocher committed Jun 18, 2021
1 parent fb342fc commit 382ce4f
Showing 1 changed file with 3 additions and 3 deletions.
train.py: 6 changes (3 additions, 3 deletions)
@@ -37,9 +37,9 @@
 from utils.wandb_logging.wandb_utils import WandbLogger, check_wandb_resume

 logger = logging.getLogger(__name__)
-LOCAL_RANK = int(getattr(os.environ, 'LOCAL_RANK', -1))  # https://pytorch.org/docs/stable/elastic/run.html
-RANK = int(getattr(os.environ, 'RANK', -1))
-WORLD_SIZE = int(getattr(os.environ, 'WORLD_SIZE', 1))
+LOCAL_RANK = int(os.getenv('LOCAL_RANK', -1))  # https://pytorch.org/docs/stable/elastic/run.html
+RANK = int(os.getenv('RANK', -1))
+WORLD_SIZE = int(os.getenv('WORLD_SIZE', 1))


 def train(hyp,  # path/to/hyp.yaml or hyp dictionary

1 comment on commit 382ce4f

@glenn-jocher (Member, Author) commented on 382ce4f on Jun 18, 2021


The original code was not correctly retrieving the environment variables: getattr() does attribute lookup, but os.environ is a mapping, so the call always fell back to the default even when the variable was set. os.getenv() fixes that by doing a key lookup.
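
A minimal repro of the bug, added here for illustration (not part of the commit):

import os

os.environ['LOCAL_RANK'] = '3'  # e.g. set per-process by torch.distributed.run

# Attribute lookup: 'LOCAL_RANK' is not an attribute of the os.environ mapping
# object, so the default is always returned, even when the variable is set.
print(getattr(os.environ, 'LOCAL_RANK', -1))   # -1  (bug)

# Key lookup on the environment mapping returns the actual value.
print(int(os.getenv('LOCAL_RANK', -1)))        # 3   (fix)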

Edit: hmm, maybe the default gloo backend is a better choice; we'd have to compare speeds. I saw some super speeds with NCCL. I started up a P4d spot instance and was able to do a 3:09 epoch (3 minutes train, 1 minute single-GPU val) on COCO with this command in the Docker image, but training hung at the end.

python -m torch.distributed.run --nproc_per_node 8 train.py --batch-size 128 --data coco.yaml --weights yolov5l.pt --epochs 1 --device 0,1,2,3,4,5,6,7
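
For reference on the gloo-vs-NCCL point above, a hedged sketch (my own illustration, not YOLOv5 code) of making the backend choice explicit when initializing the process group:

import torch
import torch.distributed as dist

# Illustrative only: prefer NCCL on CUDA machines, fall back to gloo otherwise.
backend = 'nccl' if torch.cuda.is_available() and dist.is_nccl_available() else 'gloo'
dist.init_process_group(backend=backend)  # env:// init reads RANK/WORLD_SIZE/MASTER_ADDR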
