
Help with multi-GPU training #2442

Closed
kimalaacer opened this issue Jan 4, 2021 · 3 comments

@kimalaacer
Instructions To Reproduce the Issue:

I am trying to use multi-GPU training from Jupyter within a DLVM (Google Compute Engine instance with 4 Tesla T4 GPUs).
My code only runs on 1 GPU; the other 3 are not utilized.
I am able to train on a custom dataset and get acceptable results, but I would like to use all 4 GPUs for faster training.

  1. Full runnable code or full changes you made:
import os

from detectron2.engine import DefaultTrainer
from detectron2.evaluation import COCOEvaluator
from detectron2.data import DatasetMapper, build_detection_test_loader

class CocoTrainer(DefaultTrainer):

  @classmethod
  def build_evaluator(cls, cfg, dataset_name, output_folder=None):

    if output_folder is None:
        os.makedirs("coco_eval", exist_ok=True)
        output_folder = "coco_eval"

    return COCOEvaluator(dataset_name, cfg, False, output_folder)
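
# Note: LossEvalHook below is not part of detectron2 and has to be provided by the
# user. A minimal sketch of such a validation-loss hook (assuming it should log the
# mean loss over the validation loader every `eval_period` iterations) could be:
import torch
from detectron2.engine import HookBase
from detectron2.utils import comm

class LossEvalHook(HookBase):
    def __init__(self, eval_period, model, data_loader):
        self._period = eval_period
        self._model = model
        self._data_loader = data_loader

    def _do_loss_eval(self):
        # The model stays in training mode during training, so calling it on
        # validation inputs returns the same dict of losses; no gradients needed.
        losses = []
        with torch.no_grad():
            for inputs in self._data_loader:
                loss_dict = self._model(inputs)
                losses.append(sum(loss_dict.values()).item())
        mean_loss = sum(losses) / max(len(losses), 1)
        if comm.is_main_process():
            self.trainer.storage.put_scalar("validation_loss", mean_loss)

    def after_step(self):
        next_iter = self.trainer.iter + 1
        if self._period > 0 and next_iter % self._period == 0:
            self._do_loss_eval()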




class MyTrainer(DefaultTrainer):
    @classmethod
    def build_evaluator(cls, cfg, dataset_name, output_folder=None):
        if output_folder is None:
            output_folder = os.path.join(cfg.OUTPUT_DIR, "inference")
        return COCOEvaluator(dataset_name, cfg, True, output_folder)
                     
    def build_hooks(self):
        hooks = super().build_hooks()
        hooks.insert(-1, LossEvalHook(
            self.cfg.TEST.EVAL_PERIOD,
            self.model,
            build_detection_test_loader(
                self.cfg,
                self.cfg.DATASETS.TEST[0],
                DatasetMapper(self.cfg,True)
            )
        ))
        return hooks

# TRAIN

cfg = get_cfg()
cfg.OUTPUT_DIR = experiment_folder  
cfg.merge_from_file(model_zoo.get_config_file("COCO-InstanceSegmentation/mask_rcnn_X_101_32x8d_FPN_3x.yaml"))
cfg.DATASETS.TRAIN = ("RED_coco_train",)
cfg.DATASETS.TEST = ("RED_coco_val",)
cfg.INPUT.MIN_SIZE_TRAIN = (640,)
cfg.INPUT.MAX_SIZE_TRAIN = 1024

cfg.MODEL.PIXEL_MEAN = mean_bgr

cfg.MODEL.PIXEL_STD = std_bgr

cfg.INPUT.MASK_FORMAT = "polygon"
cfg.MODEL.ANCHOR_GENERATOR.NAME = "DefaultAnchorGenerator"

cfg.DATALOADER.NUM_WORKERS = 4

cfg.MODEL.WEIGHTS = 'my-bucket/detectron2/detectron2/model_zoo/model_final_2d9806.pkl' 
cfg.SOLVER.IMS_PER_BATCH = 32
cfg.SOLVER.CHECKPOINT_PERIOD = 100
cfg.SOLVER.BASE_LR = 0.008  # pick a good LR
cfg.SOLVER.MAX_ITER = 20000   # train longer for a practical dataset
cfg.MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE = 512  # default: 512
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1  # this dataset has only one class
cfg.MODEL.ROI_HEADS.IOU_THRESHOLDS = [0.2]
cfg.SOLVER.REFERENCE_WORLD_SIZE = 4
cfg.CUDNN_BENCHMARK = True
cfg.TEST.EVAL_PERIOD = 500
cfg.MODEL.DEVICE = 'cuda'



os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)
trainer = MyTrainer(cfg) 
#trainer = DefaultTrainer(cfg) #
#trainer = CocoTrainer(cfg) #
trainer.resume_or_load(resume=False)
trainer.scheduler.milestones=cfg.SOLVER.STEPS
trainer.train()
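
For reference, running DefaultTrainer directly like this stays in a single process and therefore uses a single GPU; multi-GPU training in detectron2 goes through detectron2.engine.launch, which spawns one worker process per GPU and expects a function, not the result of calling one. A minimal sketch of wrapping the training above for 4 GPUs (run as a script rather than a notebook cell; build_cfg is a placeholder for the cfg setup shown above):

from detectron2.engine import launch

def main():
    cfg = build_cfg()  # placeholder: the cfg construction shown above
    trainer = MyTrainer(cfg)
    trainer.resume_or_load(resume=False)
    return trainer.train()

if __name__ == "__main__":
    # Pass the function itself (not main()) so launch can fork one worker per GPU.
    launch(main, num_gpus_per_machine=4, num_machines=1, machine_rank=0, dist_url="auto")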

2. What exact command you run:

- I also tried changing the defaults to 4 GPUs, but it did not work.
- Tried: launch(trainer.train(), num_gpus_per_machine=4, num_machines=1, machine_rank=0, dist_url=None, args=args). It was able to train, but still only on a single GPU.
- Also tried !export CUDA_VISIBLE_DEVICES=0,1,2,3, which did not work.
- Tried CUDA_VISIBLE_DEVICES=0,1,2,3 with the following command:
!python train_net.py \
--config-file './configs/COCO-InstanceSegmentation/mask_rcnn_X_101_32x8d_FPN_3x.yaml' \
--num-gpus 4 \
OUTPUT_DIR experiment_folder \
DATASETS.TRAIN "RED_coco_train" \
DATASETS.TEST "RED_coco_val" \
INPUT.MIN_SIZE_TRAIN 640 \
INPUT.MAX_SIZE_TRAIN 1024 \
MODEL.PIXEL_MEAN "[221.595, 192.27, 129.54]" \
MODEL.PIXEL_STD "[10.71, 27.54, 69.36]" \
INPUT.MASK_FORMAT "polygon" \
MODEL.ANCHOR_GENERATOR.NAME "DefaultAnchorGenerator" \
DATALOADER.NUM_WORKERS 4 \
MODEL.WEIGHTS './detectron2/model_zoo/model_final_2d9806.pkl' \
SOLVER.IMS_PER_BATCH 64 \
SOLVER.BASE_LR 0.008 \
SOLVER.MAX_ITER 20000 \
MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE 512 \
MODEL.ROI_HEADS.NUM_CLASSES 1 \
MODEL.ROI_HEADS.IOU_THRESHOLDS [0.2] \
SOLVER.REFERENCE_WORLD_SIZE 4 \
CUDNN_BENCHMARK True \
TEST.EVAL_PERIOD 2000 \
MODEL.DEVICE 'cuda'

This did not use multiple GPUs, and did not perform any training at all.
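
One way to confirm that all four GPUs are at least visible to the Python process is a plain PyTorch check like the following (nothing detectron2-specific, just a sanity check):

import torch

# With CUDA_VISIBLE_DEVICES=0,1,2,3 this should report 4 devices.
print(torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))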

## Expected behavior:
Perform multi-GPU training.

## Environment:

----------------------  ------------------------------------------------------------------------------
sys.platform            linux
Python                  3.7.8 | packaged by conda-forge | (default, Jul 31 2020, 02:25:08) [GCC 7.5.0]
numpy                   1.18.5
detectron2              0.3 @/home/jupyter/my-bucket/detectron2/detectron2
Compiler                GCC 8.3
CUDA compiler           CUDA 11.0
detectron2 arch flags   7.5
DETECTRON2_ENV_MODULE   <not set>
PyTorch                 1.7.1 @/opt/conda/lib/python3.7/site-packages/torch
PyTorch debug build     False
GPU available           True
GPU 0,1,2,3             Tesla T4 (arch=7.5)
CUDA_HOME               /usr/local/cuda
Pillow                  7.2.0
torchvision             0.8.2 @/opt/conda/lib/python3.7/site-packages/torchvision
torchvision arch flags  3.5, 5.0, 6.0, 7.0, 7.5
fvcore                  0.1.2.post20201212
cv2                     4.4.0
----------------------  ------------------------------------------------------------------------------
PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v1.6.0 (Git Hash 5ef631a030a6f73131c77892041042805a06064f)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 10.2
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75
  - CuDNN 7.6.5
  - Magma 2.5.2
  - Build settings: BLAS=MKL, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DUSE_VULKAN_WRAPPER -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, 

@ppwwyyxx
Contributor

ppwwyyxx commented Jan 4, 2021

Please see the example https://github.com/facebookresearch/detectron2/blob/master/tools/train_net.py on how to implement multi-gpu training.

ppwwyyxx closed this as completed Jan 4, 2021
@kimalaacer
Author

Thanks,
I went back and tried that, but it is still not working with 4 GPUs (it only uses one).
I will keep trying.

@kimalaacer
Author

I tried:
!export CUDA_VISIBLE_DEVICES=0,1,2,3
!export NGPU=4
!python -m torch.distributed.launch --nproc_per_node=4 ./tools/train_net.py \
    --num-gpus 4 \
    --config-file './configs/COCO-InstanceSegmentation/mask_rcnn_X_101_32x8d_FPN_3x.yaml' \
    --num-machines 1 \
    --dist-url 'auto' \
    --resume False \
    --eval-only False
    # --machine-rank 0 (left commented out)

but I am getting an error: train_net.py: error: unrecognized arguments: --local_rank=0

It seems torch.distributed.launch is adding a --local_rank argument to the args.
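
For context, torch.distributed.launch does pass --local_rank=<rank> to every worker it spawns, and the argument parser used by train_net.py does not define that flag; train_net.py already spawns its own workers via launch() when given --num-gpus, so the two launchers are normally not combined. If one did want to run the script under torch.distributed.launch anyway, a hedged sketch of absorbing the extra flag (assuming the standard default_argument_parser) would be:

from detectron2.engine import default_argument_parser

parser = default_argument_parser()
# torch.distributed.launch injects --local_rank=<rank>; accept it so argparse
# does not fail with "unrecognized arguments".
parser.add_argument("--local_rank", type=int, default=0)
args = parser.parse_args()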

github-actions bot locked as resolved and limited conversation to collaborators Nov 1, 2021