add batch norm synchronization for multi-card training #460

Closed
AlexWang1900 opened this issue Jul 21, 2020 · 1 comment
Labels
enhancement New feature or request

Comments

@AlexWang1900
Contributor

🚀 Feature

In train.py, add the convert_sync_batchnorm line (marked below) before the model is wrapped in DistributedDataParallel:

    import torch.distributed as dist  # torch.distributed is imported as dist in train.py

    if device.type != 'cpu' and torch.cuda.device_count() > 1 and torch.distributed.is_available():
        # proposed addition: convert all BatchNorm layers to SyncBatchNorm
        model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
        dist.init_process_group(backend='nccl',  # distributed backend
                                init_method='tcp://127.0.0.1:9999',  # init method
                                world_size=1,  # number of nodes
                                rank=0)  # node rank
        model = torch.nn.parallel.DistributedDataParallel(model, find_unused_parameters=True)
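
For reference, a minimal standalone sketch (not from this repo) of what convert_sync_batchnorm does to a model; the toy Sequential model below is purely illustrative:

    import torch.nn as nn

    # Toy model for illustration only (not YOLOv5's architecture)
    model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.BatchNorm2d(16), nn.ReLU())

    # Recursively replaces every nn.BatchNorm*d module with nn.SyncBatchNorm,
    # which reduces batch statistics across all DDP processes at forward time
    model = nn.SyncBatchNorm.convert_sync_batchnorm(model)

    print(type(model[1]))  # <class 'torch.nn.modules.batchnorm.SyncBatchNorm'>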


Motivation

I found this while using an EfficientDet model; its training code has this feature.

Pitch

It seems better to use this, but I haven't done an ablation test.

Alternatives

Additional context

@AlexWang1900 added the enhancement (New feature or request) label on Jul 21, 2020
@glenn-jocher
Member

@AlexWang1900 thanks for the suggestion. A recent PR #401 introduced much more multi-GPU functionality, including SyncBatchNorm via python train.py --sync.

The dev work there is still ongoing, though, and may change significantly again soon to introduce an mp.spawn-based approach.
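
For reference, a rough sketch of what an mp.spawn-based DDP launch with SyncBatchNorm typically looks like. This is not the implementation from that PR; the run function, master address/port, and toy model are placeholder assumptions only:

    # Generic mp.spawn DDP sketch, for illustration only (not this repo's code)
    import os
    import torch
    import torch.distributed as dist
    import torch.multiprocessing as mp
    import torch.nn as nn

    def run(rank, world_size):
        # one process per GPU; rank is supplied by mp.spawn
        os.environ['MASTER_ADDR'] = '127.0.0.1'
        os.environ['MASTER_PORT'] = '9999'
        dist.init_process_group(backend='nccl', rank=rank, world_size=world_size)
        torch.cuda.set_device(rank)

        model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.BatchNorm2d(16)).cuda(rank)
        model = nn.SyncBatchNorm.convert_sync_batchnorm(model)  # sync BN stats across ranks
        model = nn.parallel.DistributedDataParallel(model, device_ids=[rank])

        # ... training loop goes here ...
        dist.destroy_process_group()

    if __name__ == '__main__':
        world_size = torch.cuda.device_count()
        mp.spawn(run, args=(world_size,), nprocs=world_size)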
