
Help with multi-GPU training #2442

Closed
kimalaacer opened this issue Jan 4, 2021 · 3 comments

@kimalaacer
Instructions To Reproduce the Issue:

I am trying to use multi-GPU training from Jupyter within a DLVM (Google Compute Engine instance with 4 Tesla T4 GPUs).
My code only runs on 1 GPU; the other 3 are not utilized.
I am able to train on a custom dataset and get acceptable results, but I would like to use all 4 GPUs for faster training.

  1. Full runnable code or full changes you made:
import os

from detectron2.engine import DefaultTrainer
from detectron2.evaluation import COCOEvaluator
from detectron2.data import DatasetMapper, build_detection_test_loader

class CocoTrainer(DefaultTrainer):

  @classmethod
  def build_evaluator(cls, cfg, dataset_name, output_folder=None):

    if output_folder is None:
        os.makedirs("coco_eval", exist_ok=True)
        output_folder = "coco_eval"

    return COCOEvaluator(dataset_name, cfg, False, output_folder)
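
# Note: LossEvalHook below is not part of detectron2 and has to be provided by the
# user. A minimal sketch of such a validation-loss hook (assuming it should log the
# mean loss over the validation loader every `eval_period` iterations) could be:
import torch
from detectron2.engine import HookBase
from detectron2.utils import comm

class LossEvalHook(HookBase):
    def __init__(self, eval_period, model, data_loader):
        self._period = eval_period
        self._model = model
        self._data_loader = data_loader

    def _do_loss_eval(self):
        # The model stays in training mode during training, so calling it on
        # validation inputs returns the same dict of losses; no gradients needed.
        losses = []
        with torch.no_grad():
            for inputs in self._data_loader:
                loss_dict = self._model(inputs)
                losses.append(sum(loss_dict.values()).item())
        mean_loss = sum(losses) / max(len(losses), 1)
        if comm.is_main_process():
            self.trainer.storage.put_scalar("validation_loss", mean_loss)

    def after_step(self):
        next_iter = self.trainer.iter + 1
        if self._period > 0 and next_iter % self._period == 0:
            self._do_loss_eval()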




class MyTrainer(DefaultTrainer):
    @classmethod
    def build_evaluator(cls, cfg, dataset_name, output_folder=None):
        if output_folder is None:
            output_folder = os.path.join(cfg.OUTPUT_DIR, "inference")
        return COCOEvaluator(dataset_name, cfg, True, output_folder)
                     
    def build_hooks(self):
        hooks = super().build_hooks()
        hooks.insert(-1, LossEvalHook(
            self.cfg.TEST.EVAL_PERIOD,
            self.model,
            build_detection_test_loader(
                self.cfg,
                self.cfg.DATASETS.TEST[0],
                DatasetMapper(self.cfg,True)
            )
        ))
        return hooks

# TRAIN

cfg = get_cfg()
cfg.OUTPUT_DIR = experiment_folder  
cfg.merge_from_file(model_zoo.get_config_file("COCO-InstanceSegmentation/mask_rcnn_X_101_32x8d_FPN_3x.yaml"))
cfg.DATASETS.TRAIN = ("RED_coco_train",)
cfg.DATASETS.TEST = ("RED_coco_val",)
cfg.INPUT.MIN_SIZE_TRAIN = (640,)
cfg.INPUT.MAX_SIZE_TRAIN = 1024

cfg.MODEL.PIXEL_MEAN = mean_bgr

cfg.MODEL.PIXEL_STD = std_bgr

cfg.INPUT.MASK_FORMAT = "polygon"
cfg.MODEL.ANCHOR_GENERATOR.NAME = "DefaultAnchorGenerator"

cfg.DATALOADER.NUM_WORKERS = 4

cfg.MODEL.WEIGHTS = 'my-bucket/detectron2/detectron2/model_zoo/model_final_2d9806.pkl' 
cfg.SOLVER.IMS_PER_BATCH = 32
cfg.SOLVER.CHECKPOINT_PERIOD = 100
cfg.SOLVER.BASE_LR = 0.008  # pick a good LR
cfg.SOLVER.MAX_ITER = 20000   # train longer for a practical dataset
cfg.MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE = 512  # default: 512
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1  # this dataset has only one class
cfg.MODEL.ROI_HEADS.IOU_THRESHOLDS = [0.2]
cfg.SOLVER.REFERENCE_WORLD_SIZE = 4
cfg.CUDNN_BENCHMARK = True
cfg.TEST.EVAL_PERIOD = 500
cfg.MODEL.DEVICE = 'cuda'



os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)
trainer = MyTrainer(cfg) 
#trainer = DefaultTrainer(cfg) #
#trainer = CocoTrainer(cfg) #
trainer.resume_or_load(resume=False)
trainer.scheduler.milestones=cfg.SOLVER.STEPS
trainer.train()
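
For reference, running DefaultTrainer directly like this stays in a single process and therefore uses a single GPU; multi-GPU training in detectron2 goes through detectron2.engine.launch, which spawns one worker process per GPU and expects a function, not the result of calling one. A minimal sketch of wrapping the training above for 4 GPUs (run as a script rather than a notebook cell; build_cfg is a placeholder for the cfg setup shown above):

from detectron2.engine import launch

def main():
    cfg = build_cfg()  # placeholder: the cfg construction shown above
    trainer = MyTrainer(cfg)
    trainer.resume_or_load(resume=False)
    return trainer.train()

if __name__ == "__main__":
    # Pass the function itself (not main()) so launch can fork one worker per GPU.
    launch(main, num_gpus_per_machine=4, num_machines=1, machine_rank=0, dist_url="auto")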

2. What exact command you run:

- I also tried changing the defaults to 4 GPUs, but it did not work.
- Tried: launch(trainer.train(), num_gpus_per_machine=4, num_machines=1, machine_rank=0, dist_url=None, args=args). It was able to train, but still only on a single GPU.
- Also tried !export CUDA_VISIBLE_DEVICES=0,1,2,3, which did not work.
- Tried CUDA_VISIBLE_DEVICES=0,1,2,3 with the following command:
!python train_net.py \
--config-file './configs/COCO-InstanceSegmentation/mask_rcnn_X_101_32x8d_FPN_3x.yaml' \
--num-gpus 4 \
OUTPUT_DIR experiment_folder \
DATASETS.TRAIN "RED_coco_train" \
DATASETS.TEST "RED_coco_val" \
INPUT.MIN_SIZE_TRAIN 640 \
INPUT.MAX_SIZE_TRAIN 1024 \
MODEL.PIXEL_MEAN "[221.595, 192.27, 129.54]" \
MODEL.PIXEL_STD "[10.71, 27.54, 69.36]" \
INPUT.MASK_FORMAT "polygon" \
MODEL.ANCHOR_GENERATOR.NAME "DefaultAnchorGenerator" \
DATALOADER.NUM_WORKERS 4 \
MODEL.WEIGHTS './detectron2/model_zoo/model_final_2d9806.pkl' \
SOLVER.IMS_PER_BATCH 64 \
SOLVER.BASE_LR 0.008 \
SOLVER.MAX_ITER 20000 \
MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE 512 \
MODEL.ROI_HEADS.NUM_CLASSES 1 \
MODEL.ROI_HEADS.IOU_THRESHOLDS [0.2] \
SOLVER.REFERENCE_WORLD_SIZE 4 \
CUDNN_BENCHMARK True \
TEST.EVAL_PERIOD 2000 \
MODEL.DEVICE 'cuda'

This did not use multiple GPUs, and did not perform any training at all.
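
One way to confirm that all four GPUs are at least visible to the Python process is a plain PyTorch check like the following (nothing detectron2-specific, just a sanity check):

import torch

# With CUDA_VISIBLE_DEVICES=0,1,2,3 this should report 4 devices.
print(torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))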

## Expected behavior:
Perform multi-GPU training.

## Environment:

----------------------  ------------------------------------------------------------------------------
sys.platform            linux
Python                  3.7.8 | packaged by conda-forge | (default, Jul 31 2020, 02:25:08) [GCC 7.5.0]
numpy                   1.18.5
detectron2              0.3 @/home/jupyter/my-bucket/detectron2/detectron2
Compiler                GCC 8.3
CUDA compiler           CUDA 11.0
detectron2 arch flags   7.5
DETECTRON2_ENV_MODULE   <not set>
PyTorch                 1.7.1 @/opt/conda/lib/python3.7/site-packages/torch
PyTorch debug build     False
GPU available           True
GPU 0,1,2,3             Tesla T4 (arch=7.5)
CUDA_HOME               /usr/local/cuda
Pillow                  7.2.0
torchvision             0.8.2 @/opt/conda/lib/python3.7/site-packages/torchvision
torchvision arch flags  3.5, 5.0, 6.0, 7.0, 7.5
fvcore                  0.1.2.post20201212
cv2                     4.4.0
----------------------  ------------------------------------------------------------------------------
PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v1.6.0 (Git Hash 5ef631a030a6f73131c77892041042805a06064f)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 10.2
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75
  - CuDNN 7.6.5
  - Magma 2.5.2
  - Build settings: BLAS=MKL, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DUSE_VULKAN_WRAPPER -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, 

@ppwwyyxx
Contributor

ppwwyyxx commented Jan 4, 2021

Please see the example https://github.com/facebookresearch/detectron2/blob/master/tools/train_net.py on how to implement multi-gpu training.

ppwwyyxx closed this as completed Jan 4, 2021
@kimalaacer
Author

Thanks,
I went back and tried that, but it is still not working with 4 GPUs (it only uses one).
I will keep trying.

@kimalaacer
Author

I tried:
!export CUDA_VISIBLE_DEVICES=0,1,2,3
!export NGPU=4
!python -m torch.distributed.launch --nproc_per_node=4 ./tools/train_net.py \
    --num-gpus 4 \
    --config-file './configs/COCO-InstanceSegmentation/mask_rcnn_X_101_32x8d_FPN_3x.yaml' \
    --num-machines 1 \
    --dist-url 'auto' \
    --resume False \
    --eval-only False
    # --machine-rank 0 (left commented out)

but I am getting an error: train_net.py: error: unrecognized arguments: --local_rank=0

It seems torch.distributed.launch is adding a --local_rank argument to the args.
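
For context, torch.distributed.launch does pass --local_rank=<rank> to every worker it spawns, and the argument parser used by train_net.py does not define that flag; train_net.py already spawns its own workers via launch() when given --num-gpus, so the two launchers are normally not combined. If one did want to run the script under torch.distributed.launch anyway, a hedged sketch of absorbing the extra flag (assuming the standard default_argument_parser) would be:

from detectron2.engine import default_argument_parser

parser = default_argument_parser()
# torch.distributed.launch injects --local_rank=<rank>; accept it so argparse
# does not fail with "unrecognized arguments".
parser.add_argument("--local_rank", type=int, default=0)
args = parser.parse_args()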

github-actions bot locked as resolved and limited conversation to collaborators Nov 1, 2021