add batch norm synchronization for multi-card training #460

Closed
AlexWang1900 opened this issue Jul 21, 2020 · 1 comment
Labels
enhancement New feature or request

Comments

@AlexWang1900
Contributor

🚀 Feature

In train.py, add the convert_sync_batchnorm line (marked below) before the model is wrapped in DistributedDataParallel:

    import torch.distributed as dist  # torch.distributed is imported as dist in train.py

    if device.type != 'cpu' and torch.cuda.device_count() > 1 and torch.distributed.is_available():
        # proposed addition: convert all BatchNorm layers to SyncBatchNorm
        model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
        dist.init_process_group(backend='nccl',  # distributed backend
                                init_method='tcp://127.0.0.1:9999',  # init method
                                world_size=1,  # number of nodes
                                rank=0)  # node rank
        model = torch.nn.parallel.DistributedDataParallel(model, find_unused_parameters=True)
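
For reference, a minimal standalone sketch (not from this repo) of what convert_sync_batchnorm does to a model; the toy Sequential model below is purely illustrative:

    import torch.nn as nn

    # Toy model for illustration only (not YOLOv5's architecture)
    model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.BatchNorm2d(16), nn.ReLU())

    # Recursively replaces every nn.BatchNorm*d module with nn.SyncBatchNorm,
    # which reduces batch statistics across all DDP processes at forward time
    model = nn.SyncBatchNorm.convert_sync_batchnorm(model)

    print(type(model[1]))  # <class 'torch.nn.modules.batchnorm.SyncBatchNorm'>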


Motivation

I found this while using an EfficientDet model; its training code has this feature.

Pitch

It seems better to use this, but I haven't done an ablation test.

Alternatives

Additional context

@AlexWang1900 added the enhancement (New feature or request) label on Jul 21, 2020
@glenn-jocher
Member

@AlexWang1900 thanks for the suggestion. A recent PR #401 introduced much more multi-GPU functionality, including SyncBatchNorm via python train.py --sync.

The dev work there is still ongoing, though, and may change significantly again soon to introduce an mp.spawn-based approach.
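
For reference, a rough sketch of what an mp.spawn-based DDP launch with SyncBatchNorm typically looks like. This is not the implementation from that PR; the run function, master address/port, and toy model are placeholder assumptions only:

    # Generic mp.spawn DDP sketch, for illustration only (not this repo's code)
    import os
    import torch
    import torch.distributed as dist
    import torch.multiprocessing as mp
    import torch.nn as nn

    def run(rank, world_size):
        # one process per GPU; rank is supplied by mp.spawn
        os.environ['MASTER_ADDR'] = '127.0.0.1'
        os.environ['MASTER_PORT'] = '9999'
        dist.init_process_group(backend='nccl', rank=rank, world_size=world_size)
        torch.cuda.set_device(rank)

        model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.BatchNorm2d(16)).cuda(rank)
        model = nn.SyncBatchNorm.convert_sync_batchnorm(model)  # sync BN stats across ranks
        model = nn.parallel.DistributedDataParallel(model, device_ids=[rank])

        # ... training loop goes here ...
        dist.destroy_process_group()

    if __name__ == '__main__':
        world_size = torch.cuda.device_count()
        mp.spawn(run, args=(world_size,), nprocs=world_size)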
