Repeated training not deterministic despite identical setup and reproducibility flags #4260

Open
j-rausch opened this issue May 23, 2022 · 3 comments


Hi, I'm working on an experiment where I noticed large differences between models trained with identical configs and random seeds. I'm trying to understand the causes for this.

I've upgraded to a more recent PyTorch version that introduced flags for deterministic training between multiple executions:
https://pytorch.org/docs/1.11/notes/randomness.html?highlight=reproducibility
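
As a small sanity check related to these flags, the sketch below verifies that `CUBLAS_WORKSPACE_CONFIG` holds one of the two values that the PyTorch documentation for `torch.use_deterministic_algorithms` lists for CUDA 10.2 or later. The helper name `cublas_config_ok` is my own and not part of PyTorch or detectron2; it only inspects the environment and does not touch the GPU.

```python
import os

# Values listed in the PyTorch docs for deterministic cuBLAS behavior (CUDA >= 10.2).
ACCEPTED_CUBLAS_CONFIGS = {":4096:8", ":16:8"}


def cublas_config_ok(environ=os.environ):
    """Return True if CUBLAS_WORKSPACE_CONFIG is set to a documented deterministic value.

    `environ` is any mapping of environment variables; it defaults to the
    current process environment. This is a hypothetical helper, not a library API.
    """
    return environ.get("CUBLAS_WORKSPACE_CONFIG") in ACCEPTED_CUBLAS_CONFIGS


if __name__ == "__main__":
    print("CUBLAS_WORKSPACE_CONFIG ok:", cublas_config_ok())
```

The variable has to be set before the first CUDA call, which is why a check like this would belong at the very top of a training script.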

However, despite using these flags and the most recent detectron2 sources, the final trained models and their validation accuracies can differ substantially on a custom dataset of mine (by about 2 AP).
These differences occur in multiple runs on the same machine (identical device, code, config, random seed).

I've tried to reproduce this problem and also observe it with the unaltered detectron2 demo training code. I've added a minimal script to reproduce the training, and I see rather large differences between the first logged losses of three consecutive runs.
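
To compare runs programmatically rather than by eye, something like the following minimal sketch could be used. It assumes the `d2.utils.events` log line format shown in the logs of this issue (an `iter: N` field followed by `total_loss` and `loss_*` fields). The function names `parse_losses` and `first_divergence` are my own and not part of detectron2.

```python
import re

# "name: value" pairs for the loss terms in a detectron2-style event line.
LOSS_RE = re.compile(r"(total_loss|loss_\w+): ([0-9.eE+-]+)")
# The iteration counter, e.g. "iter: 19".
ITER_RE = re.compile(r"\biter: (\d+)")


def parse_losses(log_text):
    """Return {iteration: {loss_name: value}} for every line that logs an iteration."""
    losses = {}
    for line in log_text.splitlines():
        match = ITER_RE.search(line)
        if match is None:
            continue  # not an event line (config dump, warnings, ...)
        losses[int(match.group(1))] = {
            name: float(value) for name, value in LOSS_RE.findall(line)
        }
    return losses


def first_divergence(run_a, run_b):
    """Return (iteration, {name: (a, b)}) for the earliest shared iteration
    whose logged losses differ, or None if all shared iterations agree."""
    for it in sorted(set(run_a) & set(run_b)):
        diff = {
            name: (run_a[it][name], run_b[it][name])
            for name in run_a[it].keys() & run_b[it].keys()
            if run_a[it][name] != run_b[it][name]
        }
        if diff:
            return it, diff
    return None
```

Feeding it the text of two log files (for example the contents of two hypothetical files `run1.log` and `run2.log`) returns the first logged iteration at which the runs disagree, together with the loss terms that differ. Note that it compares the printed values exactly, so it only sees differences that survive the rounding in the log output.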

Instructions To Reproduce the Issue:

  1. Full runnable code or full changes you made:
    script to reproduce the experiment (deterministic_example.py)
import os
os.environ["CUBLAS_WORKSPACE_CONFIG"]=":4096:8"
import torch
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
torch.use_deterministic_algorithms(True)

from detectron2.config import get_cfg
from detectron2.engine import DefaultTrainer, default_argument_parser, default_setup, launch



def setup(args):
    """
    Create configs and perform basic setups.
    """
    cfg = get_cfg()
    cfg.merge_from_file(args.config_file)
    cfg.merge_from_list(args.opts)
    cfg.freeze()
    default_setup(cfg, args)
    return cfg

def main(args):

    cfg = setup(args)

    trainer = DefaultTrainer(cfg)
    trainer.resume_or_load(resume=False)
    return trainer.train()


if __name__ == "__main__":
    args = default_argument_parser().parse_args()
    print("Command Line Args:", args)
    launch(
        main,
        args.num_gpus,
        num_machines=args.num_machines,
        machine_rank=args.machine_rank,
        dist_url=args.dist_url,
        args=(args,),
    )
git rev-parse HEAD; git diff
e091a07ef573915056f8c2191b774aad0e38d09c
  2. What exact command you run:
CUDA_VISIBLE_DEVICES=0 python deterministic_example.py --num-gpus 1 --config-file ./configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_1x.yaml SOLVER.IMS_PER_BATCH 1 SEED 42 DATALOADER.NUM_WORKERS 1
  3. Full logs or other relevant observations:
Command Line Args: Namespace(config_file='./configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_1x.yaml', resume=False, eval_only=False, num_gpus=1, num_machines=1, machine_rank=0, dist_url='tcp://127.0.0.1:53650', opts=['SOLVER.IMS_PER_BATCH', '1', 'SEED', '42', 'DATALOADER.NUM_WORKERS', '1'])
[05/23 15:49:06 detectron2]: Rank of current process: 0. World size: 1
[05/23 15:49:08 detectron2]: Environment info:
----------------------  --------------------------------------------------------------------------------------------------------------------------
sys.platform            linux
Python                  3.10.4 | packaged by conda-forge | (main, Mar 24 2022, 17:39:04) [GCC 10.3.0]
numpy                   1.22.3
detectron2              0.6 @/rootpath/git/detectron2/detectron2
Compiler                GCC 9.3
CUDA compiler           CUDA 11.5
detectron2 arch flags   6.1
DETECTRON2_ENV_MODULE   <not set>
PyTorch                 1.11.0+cu115 @/rootpath/anaconda3/envs/sgg_torch111_detectron06/lib/python3.10/site-packages/torch
PyTorch debug build     False
GPU available           Yes
GPU 0                   NVIDIA TITAN Xp (arch=6.1)
Driver version          510.47.03
CUDA_HOME               /usr/local/cuda-11.5
Pillow                  9.1.0
torchvision             0.12.0+cu115 @/rootpath/anaconda3/envs/sgg_torch111_detectron06/lib/python3.10/site-packages/torchvision
torchvision arch flags  3.5, 5.0, 6.0, 7.0, 7.5, 8.0, 8.6
fvcore                  0.1.5.post20220504
iopath                  0.1.9
cv2                     4.5.5
----------------------  --------------------------------------------------------------------------------------------------------------------------
PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.5.2 (Git Hash a9302535553c73243c632ad3c4c80beec3d19a1e)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.5
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
  - CuDNN 8.3.2
  - Magma 2.5.2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.5, CUDNN_VERSION=8.3.2, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.11.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF,

[05/23 15:49:08 detectron2]: Command line arguments: Namespace(config_file='./configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_1x.yaml', resume=False, eval_only=False, num_gpus=1, num_machines=1, machine_rank=0, dist_url='tcp://127.0.0.1:53650', opts=['SOLVER.IMS_PER_BATCH', '1', 'SEED', '42', 'DATALOADER.NUM_WORKERS', '1'])
[05/23 15:49:08 detectron2]: Contents of args.config_file=./configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_1x.yaml:
_BASE_: "../Base-RCNN-FPN.yaml"
MODEL:
  WEIGHTS: "detectron2://ImageNetPretrained/MSRA/R-50.pkl"
  MASK_ON: True
  RESNETS:
    DEPTH: 50


  FILTER_EMPTY_ANNOTATIONS: true
  NUM_WORKERS: 1
  REPEAT_THRESHOLD: 0.0
  SAMPLER_TRAIN: TrainingSampler
DATASETS:
  PRECOMPUTED_PROPOSAL_TOPK_TEST: 1000
  PRECOMPUTED_PROPOSAL_TOPK_TRAIN: 2000
  PROPOSAL_FILES_TEST: []
  PROPOSAL_FILES_TRAIN: []
  TEST:
  - coco_2017_val
  TRAIN:
  - coco_2017_train
GLOBAL:
  HACK: 1.0
INPUT:
  CROP:
    ENABLED: false
    SIZE:
    - 0.9
    - 0.9
    TYPE: relative_range
  FORMAT: BGR
  MASK_FORMAT: polygon
  MAX_SIZE_TEST: 1333
  MAX_SIZE_TRAIN: 1333
  MIN_SIZE_TEST: 800
  MIN_SIZE_TRAIN:
  - 640
  - 672
  - 704
  - 736
  - 768
  - 800
  MIN_SIZE_TRAIN_SAMPLING: choice
  RANDOM_FLIP: horizontal
MODEL:
  ANCHOR_GENERATOR:
    ANGLES:
    - - -90
      - 0
      - 90
    ASPECT_RATIOS:
    - - 0.5
      - 1.0
      - 2.0
    NAME
    OFFSET: 0.0
    SIZES:
    - - 32
    - - 64
    - - 128
    - - 256
    - - 512
  BACKBONE:
    FREEZE_AT: 2
    NAME: build_resnet_fpn_backbone
  DEVICE: cuda
  FPN:
    FUSE_TYPE: sum
    IN_FEATURES:
    - res2
    - res3
    - res4
    - res5
    NORM: ''
    OUT_CHANNELS: 256
  KEYPOINT_ON: false
  LOAD_PROPOSALS: false
  MASK_ON: true
  META_ARCHITECTURE: GeneralizedRCNN
  PANOPTIC_FPN:
    COMBINE:
      ENABLED: true
      INSTANCES_CONFIDENCE_THRESH: 0.5
      OVERLAP_THRESH: 0.5
      STUFF_AREA_LIMIT: 4096
    INSTANCE_LOSS_WEIGHT: 1.0
  PIXEL_MEAN:
  - 103.53
  - 116.28
  - 123.675
  PIXEL_STD:
  - 1.0
  - 1.0
  - 1.0
  PROPOSAL_GENERATOR:
    MIN_SIZE: 0
    NAME: RPN
  RESNETS:
    DEFORM_MODULATED: false
    DEFORM_NUM_GROUPS: 1
    DEFORM_ON_PER_STAGE:
    -
    - false
    DEPTH: 50
    NORM: FrozenBN
    NUM_GROUPS: 1
    OUT_FEATURES:
    - res2
    - res3
    - res4
    - res5
    RES2_OUT_CHANNELS: 256
    RES5_DILATION: 1
    STEM_OUT_CHANNELS: 64
    STRIDE_IN_1X1: true
    WIDTH_PER_GROUP: 64
  RETINANET:
    BBOX_REG_LOSS_TYPE: smooth_l1
    BBOX_REG_WEIGHTS: &id002
    - 1.0
    - 1.0
    - 1.0
    - 1.0
    FOCAL_LOSS_ALPHA: 0.25
    FOCAL_LOSS_GAMMA: 2.0
    IN_FEATURES:
    - p3
    - p4
    - p5
    - p6
    - p7
    IOU_LABELS:
    - 0
    - -1
    - 1
    IOU_THRESHOLDS:
    - 0.4
    - 0.5
    NMS_THRESH_TEST: 0.5
    NORM: ''
    NUM_CLASSES: 80
    NUM_CONVS: 4
    PRIOR_PROB: 0.01
    SCORE_THRESH_TEST: 0.05
    SMOOTH_L1_LOSS_BETA: 0.1
    TOPK_CANDIDATES_TEST: 1000
  ROI_BOX_CASCADE_HEAD:
    BBOX_REG_WEIGHTS:
    - &id001
      - 10.0
      - 5.0
      - 5.0
    - - 20.0
      - 20.0
      - 10.0
      - 10.0
    - - 30.0
      - 30.0
      - 15.0
      - 15.0
    IOUS:
    - 0.5
    - 0.6
    - 0.7
  ROI_BOX_HEAD:
    BBOX_REG_LOSS_TYPE: smooth_l1
    BBOX_REG_LOSS_WEIGHT: 1.0
    BBOX_REG_WEIGHTS: *id001
    CLS_AGNOSTIC_BBOX_REG: false
    CONV_DIM: 256
    FC_DIM: 1024
    FED_LOSS_FREQ_WEIGHT_POWER: 0.5
    FED_LOSS_NUM_CLASSES: 50
    NAME: FastRCNNConvFCHead
    NORM: ''
    NUM_CONV: 0
    NUM_FC: 2
    POOLER_RESOLUTION: 7
    POOLER_SAMPLING_RATIO: 0
    POOLER_TYPE: ROIAlignV2
    SMOOTH_L1_BETA: 0.0
    TRAIN_ON_PRED_BOXES: false
    USE_FED_LOSS: false
    USE_SIGMOID_CE: false
  ROI_HEADS:
    BATCH_SIZE_PER_IMAGE: 512
    IN_FEATURES:
    - p2
    - p3
    - p4
    - p5
    IOU_LABELS:
    - 0
    - 1
    IOU_THRESHOLDS:
    - 0.5
    NAME: StandardROIHeads
    NMS_THRESH_TEST: 0.5
    NUM_CLASSES: 80
    POSITIVE_FRACTION: 0.25
    PROPOSAL_APPEND_GT: true
    SCORE_THRESH_TEST: 0.05
  ROI_KEYPOINT_HEAD:
    CONV_DIMS:
    - 512
    - 512
    - 512
    - 512
    - 512
    - 512
    - 512
    - 512
    LOSS_WEIGHT: 1.0
    MIN_KEYPOINTS_PER_IMAGE: 1
    NAME: KRCNNConvDeconvUpsampleHead
    NORMALIZE_LOSS_BY_VISIBLE_KEYPOINTS: true
    NUM_KEYPOINTS: 17
    POOLER_RESOLUTION: 14
    POOLER_SAMPLING_RATIO: 0
    POOLER_TYPE: ROIAlignV2
  ROI_MASK_HEAD:
    CLS_AGNOSTIC_MASK: false
    CONV_DIM: 256
    NAME: MaskRCNNConvUpsampleHead
    NORM: ''
    NUM_CONV: 4
    POOLER_RESOLUTION: 14
    POOLER_SAMPLING_RATIO: 0
    POOLER_TYPE: ROIAlignV2
  RPN:
    BATCH_SIZE_PER_IMAGE: 256
    BBOX_REG_LOSS_TYPE: smooth_l1
    BBOX_REG_LOSS_WEIGHT: 1.0
    BBOX_REG_WEIGHTS: *id002
    BOUNDARY_THRESH: -1
    CONV_DIMS:
    - -1
    HEAD_NAME: StandardRPNHead
    IN_FEATURES:
    -
    - p4
    - p5
    - p6
    IOU_LABELS:
    - 0
    - -1
    - 1
    IOU_THRESHOLDS:
    - 0.3
    - 0.7
    LOSS_WEIGHT: 1.0
    NMS_THRESH: 0.7
    POSITIVE_FRACTION: 0.5
    POST_NMS_TOPK_TEST: 1000
    POST_NMS_TOPK_TRAIN: 1000
    PRE_NMS_TOPK_TEST: 1000
    PRE_NMS_TOPK_TRAIN: 2000
    SMOOTH_L1_BETA: 0.0
  SEM_SEG_HEAD:
    COMMON_STRIDE: 4
    CONVS_DIM: 128
    IGNORE_VALUE: 255
    IN_FEATURES:
    - p2
    - p3
    - p4
    - p5
    LOSS_WEIGHT: 1.0
    NAME: SemSegFPNHead
    NORM: GN
    NUM_CLASSES: 54
  WEIGHTS: detectron2://ImageNetPretrained/MSRA/R-50.pkl
OUTPUT_DIR: ./output
SEED: 42
SOLVER:
  AMP:
    ENABLED: false
  BASE_LR: 0.02
  BASE_LR_END: 0.0
  BIAS_LR_FACTOR: 1.0
  CHECKPOINT_PERIOD: 5000
  CLIP_GRADIENTS:
    CLIP_TYPE: value
    CLIP_VALUE: 1.0
    ENABLED: false
    NORM_TYPE: 2.0
  GAMMA: 0.1
  IMS_PER_BATCH: 1
  LR_SCHEDULER_NAME: WarmupMultiStepLR
  MAX_ITER: 90000
  MOMENTUM: 0.9
  NESTEROV: false
  REFERENCE_WORLD_SIZE: 0
  STEPS:
  - 60000
  - 80000
  WARMUP_FACTOR: 0.001
  WARMUP_ITERS: 1000
  WARMUP_METHOD: linear
  WEIGHT_DECAY: 0.0001
  WEIGHT_DECAY_BIAS: null
  WEIGHT_DECAY_NORM: 0.0
TEST:
  AUG:
    ENABLED: false
    FLIP: true
    MAX_SIZE: 4000
    MIN_SIZES:
    - 400
    - 500
    - 600
    - 700
    - 800
    - 900
    - 1000
    - 1100
    - 1200
  DETECTIONS_PER_IMAGE: 100
  EVAL_PERIOD: 0
  EXPECTED_RESULTS: []
  KEYPOINT_OKS_SIGMAS: []
  PRECISE_BN:
    ENABLED: false
    NUM_ITER: 200
VERSION: 2
VIS_PERIOD: 0

[05/23 15:49:08 detectron2]: Full config saved to ./output/config.yaml

          )
          (conv3): Conv2d(
            64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
            (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
          )
        )
        (2): BottleneckBlock(
          (conv1): Conv2d(
            256, 64, kernel_size=(1, 1), stride=(1, 1), bias=False
            (norm): FrozenBatchNorm2d(num_features=64, eps=1e-05)
          )
          (conv2): Conv2d(
            64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
            (norm): FrozenBatchNorm2d(num_features=64, eps=1e-05)
          )
          (conv3): Conv2d(
            64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
            (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
          )
        )
      )
      (res3): Sequential(
        (0): BottleneckBlock(
          (shortcut): Conv2d(
            256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False
            (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
          )
          (conv1): Conv2d(
            256, 128, kernel_size=(1, 1), stride=(2, 2), bias=False
            (norm): FrozenBatchNorm2d(num_features=128, eps=1e-05)
          )
          (conv2): Conv2d(
            128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
            (norm): FrozenBatchNorm2d(num_features=128, eps=1e-05)
          )
          (conv3): Conv2d(
            128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False
            (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
          )
        )
        (1): BottleneckBlock(
          (conv1): Conv2d(
            512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False
            (norm): FrozenBatchNorm2d(num_features=128, eps=1e-05)
          )
          (conv2): Conv2d(
            128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1
            256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False
            (norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05)
          )
        )
        (1): BottleneckBlock(
          (conv1): Conv2d(
            1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
            (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
          )
          (conv2): Conv2d(
            256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
            (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
          )
          (conv3): Conv2d(
            256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False
            (norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05)
          )
        )
        (2): BottleneckBlock(
          (conv1): Conv2d(
            1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
            (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
          )
          (conv2): Conv2d(
            256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
            (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
          )
          (conv3): Conv2d(
            256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False
            (norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05)
          )
        )
        (3): BottleneckBlock(
          (conv1): Conv2d(
            1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
            (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
          )
          (conv2): Conv2d(
            256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
            (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
          )
          (conv3): Conv2d(
            256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False
            (norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05)
          )
        )
        (4): BottleneckBl
          )
          (conv2): Conv2d(
            512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
            (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
          )
          (conv3): Conv2d(
            512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False
            (norm): FrozenBatchNorm2d(num_features=2048, eps=1e-05)
          )
        )
        (2): BottleneckBlock(
          (conv1): Conv2d(
            2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False
            (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
          )
          (conv2): Conv2d(
            512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
            (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
          )
          (conv3): Conv2d(
            512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False
            (norm): FrozenBatchNorm2d(num_features=2048, eps=1e-05)
          )
        )
      )
    )
  )
  (proposal_generator): RPN(
    (rpn_head): StandardRPNHead(
      (conv): Conv2d(
        256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)
        (activation): ReLU()
      )
      (objectness_logits): Conv2d(256, 3, kernel_size=(1, 1), stride=(1, 1))
      (anchor_deltas): Conv2d(256, 12, kernel_size=(1, 1), stride=(1, 1))
    )
    (anchor_generator): DefaultAnchorGenerator(
      (cell_anchors): BufferList()
    )
  )
  (roi_heads): StandardROIHeads(
    (box_pooler): ROIPooler(
      (level_poolers): ModuleList(
        (0): ROIAlign(output_size=(7, 7), spatial_scale=0.25, sampling_ratio=0, aligned=True)
        (1): ROIAlign(output_size=(7, 7), spatial_scale=0.125, sampling_ratio=0, aligned=True)
        (2): ROIAlign(output_size=(7, 7), spatial_scale=0.0625, sampling_ratio=0, aligned=True)
        (3): ROIAlign(output_size=(7, 7), spatial_scale=0.03125, sampling_ratio=0, aligned=True)
      )
    )
    (box_head): FastRCNNConvFCHead(
      (flatten): Flatten(start_dim=1, end_dim=-1)
      (fc1): Linear(in_features=12544, out_features=1024, bias=True)
      (fc_relu1): ReLU()
      (fc2): Linear(in_features=1024, out_features=1024, bias=True)
      (fc_relu2): ReLU()
    )
    (box_predictor): FastRCNNOutputLayers(
      (cls_score): Linear(in_features=1024, out_features=81, bias=True)
      (bbox_pred): Linear(in_features=1024, out_features=320, bias=True)
    )
    (mask_pooler): ROIPooler(
      (level_poolers): ModuleList(
        (0): ROIAlign(output_size=(14, 14), spatial_scale=0.25, sampling_ratio=0, aligned=True)
        (1): ROIAlign(output_size=(14, 14), spatial_scale=0.125, sampling_ratio=0, aligned=True)
        (2): ROIAlign(output_size=(14, 14), spatial_scale=0.0625, sampling_ratio=0, aligned=True)
        (3): ROIAlign(output_size=(14, 14), spatial_scale=0.03125, sampling_ratio=0, aligned=True)
      )
    )
    (mask_head): MaskRCNNConvUpsampleHead(
      (mask_fcn1): Conv2d(
        256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)
        (activation): ReLU()
      )
      (mask_fcn2): Conv2d(
        256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)
        (activation): ReLU()
      )
      (mask_fcn3): Conv2d(
        256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)
        (activation): ReLU()
      )
      (mask_fcn4): Conv2d(
        256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)
        (activation): ReLU()
      )
      (deconv): ConvTranspose2d(256, 256, kernel_size=(2, 2), stride=(2, 2))
      (deconv_relu): ReLU()
      (predictor): Conv2d(256, 80, kernel_size=(1, 1), stride=(1, 1))
    )
  )
)
[05/23 15:49:30 d2.data.datasets.coco]: Loading datasets/coco/annotations/instances_train2017.json takes 18.03 seconds.
[05/23 15:49:31 d2.data.datasets.coco]: Loaded 118287 images in COCO format from datasets/coco/annotations/instances_train2017.json
[05/23 15:49:37 d2.data.build]: Removed 1021 images with no usable annotations. 117266 images left.
[05/23 15:49:43 d2.data.build]: Distribution of instances among all 80 categories:
|   category    | #instances   |   category   | #instances   |   category    | #instances   |
|:-------------:|:-------------|:------------:|:-------------|:-------------:|:-------------|
|    person     | 257253       |   bicycle    | 7056         |      car      | 43533        |
|  motorcycle   | 8654         |   airplane   | 5129         |      bus      | 6061         |
|     train     | 4570         |    truck     | 9970         |     boat      | 10576        |
| traffic light | 12842        | fire hydrant | 1865         |   stop sign   | 1983         |
| parking meter | 1283         |    bench     | 9820         |     bird      | 10542        |
|      cat      | 4766         |     dog      | 5500         |     horse     | 6567         |
|     sheep     | 9223         |     cow      | 8014         |   elephant    | 5484         |
|     bear      | 1294         |    zebra     | 5269         |    giraffe    | 5128         |
|   backpack    | 8714         |   umbrella   | 11265        |    handbag    | 12342        |
|      tie      | 6448         |   suitcase   | 6112         |    frisbee    | 2681         |
|     skis      | 6623         |  snowboard   | 2681         |  sports ball  | 6299         |
|     kite      | 8802         | baseball bat | 3273         | baseball gl.. | 3747         |
|  skateboard   | 5536         |  surfboard   | 6095         | tennis racket | 4807         |
|    bottle     | 24070        |  wine glass  | 7839         |      cup      | 20574        |
|     fork      | 5474         |    knife     | 7760         |     spoon     | 6159         |
|     bowl      | 14323        |    banana    | 9195         |     apple     | 5776         |
|   sandwich    | 4356         |    orange    | 6302         |   broccoli    | 7261         |
|    carrot     | 7758         |   hot dog    | 2884         |     pizza     | 5807         |
|     donut     | 7005         |     cake     | 6296         |     chair     | 38073        |
|     couch     | 5779         | potted plant | 8631         |      bed      | 4192         |
| dining table  | 15695        |    toilet    | 4149         |      tv       | 5803         |
|    laptop     | 4960         |    mouse     | 2261         |    remote     | 5700         |
|   keyboard    | 2854         |  cell phone  | 6422         |   microwave   | 1672         |
|     oven      | 3334         |   toaster    | 225          |     sink      | 5609         |
| refrigerator  | 2634         |     book     | 24077        |     clock     | 6320         |
|     vase      | 6577         |   scissors   | 1464         |  teddy bear   | 4729         |
|  hair drier   | 198          |  toothbrush  | 1945         |               |              |
|     total     | 849949       |              |              |               |              |
[05/23 15:49:43 d2.data.dataset_mapper]: [DatasetMapper] Augmentations used in training: [ResizeShortestEdge(short_edge_length=(640, 672, 704, 736, 768, 800), max_size=1333, sample_style='choice'), RandomFlip()]
[05/23 15:49:43 d2.data.build]: Using training sampler TrainingSampler
[05/23 15:49:43 d2.data.common]: Serializing 117266 elements to byte tensors and concatenating them all ...
[05/23 15:49:47 d2.data.common]: Serialized dataset takes 451.21 MiB
[05/23 15:50:04 fvcore.common.checkpoint]: [Checkpointer] Loading from detectron2://ImageNetPretrained/MSRA/R-50.pkl ...
[05/23 15:50:04 d2.checkpoint.c2_model_loading]: Renaming Caffe2 weights ......
[05/23 15:50:04 d2.checkpoint.c2_model_loading]: Following weights matched with submodule backbone.bottom_up:
| Names in Model    | Names in Checkpoint      | Shapes                                          |
|:------------------|:-------------------------|:------------------------------------------------|
| res2.0.conv1.*    | res2_0_branch2a_{bn_*,w} | (64,) (64,) (64,) (64,) (64,64,1,1)             |
| res2.0.conv2.*    | res2_0_branch2b_{bn_*,w} | (64,) (64,) (64,) (64,) (64,64,3,3)             |
| res2.0.conv3.*    | res2_0_branch2c_{bn_*,w} | (256,) (256,) (256,) (256,) (256,64,1,1)        |
| res2.0.shortcut.* | res2_0_branch1_{bn_*,w}  | (256,) (256,) (256,) (256,) (256,64,1,1)        |
| res2.1.conv1.*    | res2_1_branch2a_{bn_*,w} | (64,) (64,) (64,) (64,) (64,256,1,1)            |
| res2.1.conv2.*    | res2_1_branch2b_{bn_*,w} | (64,) (64,) (64,) (64,) (64,64,3,3)             |
| res2.1.conv3.*    | res2_1_branch2c_{bn_*,w} | (256,) (256,) (256,) (256,) (256,64,1,1)        |
| res2.2.conv1.*    | res2_2_branch2a_{bn_*,w} | (64,) (64,) (64,) (64,) (64,256,1,1)            |
| res2.2.conv2.*    | res2_2_branch2b_{bn_*,w} | (64,) (64,) (64,) (64,) (64,64,3,3)             |
| res2.2.conv3.*    | res2_2_branch2c_{bn_*,w} | (256,) (256,) (256,) (256,) (256,64,1,1)        |
| res3.0.conv1.*    | res3_0_branch2a_{bn_*,w} | (128,) (128,) (128,) (128,) (128,256,1,1)       |
| res3.0.conv2.*    | res3_0_branch2b_{bn_*,w} | (128,) (128,) (128,) (128,) (128,128,3,3)       |
| res3.0.conv3.*    | res3_0_branch2c_{bn_*,w} | (512,) (512,) (512,) (512,) (512,128,1,1)       |
| res3.0.shortcut.* | res3_0_branch1_{bn_*,w}  | (512,) (512,) (512,) (512,) (512,256,1,1)       |
| res3.1.conv1.*    | res3_1_branch2a_{bn_*,w} | (128,) (128,) (128,) (128,) (128,512,1,1)       |
| res3.1.conv2.*    | res3_1_branch2b_{bn_*,w} | (128,) (128,) (128,) (128,) (128,128,3,3)       |
| res3.1.conv3.*    | res3_1_branch2c_{bn_*,w} | (512,) (512,) (512,) (512,) (512,128,1,1)       |
| res3.2.conv1.*    | res3_2_branch2a_{bn_*,w} | (128,) (128,) (128,) (128,) (128,512,1,1)       |
| res3.2.conv2.*    | res3_2_branch2b_{bn_*,w} | (128,) (128,) (128,) (128,) (128,128,3,3)       |
| res3.2.conv3.*    | res3_2_branch2c_{bn_*,w} | (512,) (512,) (512,) (512,) (512,128,1,1)       |
| res3.3.conv1.*    | res3_3_branch2a_{bn_*,w} | (128,) (128,) (128,) (128,) (128,512,1,1)       |
| res3.3.conv2.*    | res3_3_branch2b_{bn_*,w} | (128,) (128,) (128,) (128,) (128,128,3,3)       |
| res3.3.conv3.*    | res3_3_branch2c_{bn_*,w} | (512,) (512,) (512,) (512,) (512,128,1,1)       |
| res4.0.conv1.*    | res4_0_branch2a_{bn_*,w} | (256,) (256,) (256,) (256,) (256,512,1,1)       |
| res4.0.conv2.*    | res4_0_branch2b_{bn_*,w} | (256,) (256,) (256,) (256,) (256,256,3,3)       |
| res4.0.conv3.*    | res4_0_branch2c_{bn_*,w} | (1024,) (1024,) (1024,) (1024,) (1024,256,1,1)  |
| res4.0.shortcut.* | res4_0_branch1_{bn_*,w}  | (1024,) (1024,) (1024,) (1024,) (1024,512,1,1)  |
| res4.1.conv1.*    | res4_1_branch2a_{bn_*,w} | (256,) (256,) (256,) (256,) (256,1024,1,1)      |
| res4.1.conv2.*    | res4_1_branch2b_{bn_*,w} | (256,) (256,) (256,) (256,) (256,256,3,3)       |
| res4.1.conv3.*    | res4_1_branch2c_{bn_*,w} | (1024,) (1024,) (1024,) (1024,) (1024,256,1,1)  |
| res4.2.conv1.*    | res4_2_branch2a_{bn_*,w} | (256,) (256,) (256,) (256,) (256,1024,1,1)      |
| res4.2.conv2.*    | res4_2_branch2b_{bn_*,w} | (256,) (256,) (256,) (256,) (256,256,3,3)       |
| res4.2.conv3.*    | res4_2_branch2c_{bn_*,w} | (1024,) (1024,) (1024,) (1024,) (1024,256,1,1)  |
| res4.3.conv1.*    | res4_3_branch2a_{bn_*,w} | (256,) (256,) (256,) (256,) (256,1024,1,1)      |
| res4.3.conv2.*    | res4_3_branch2b_{bn_*,w} | (256,) (256,) (256,) (256,) (256,256,3,3)       |
| res4.3.conv3.*    | res4_3_branch2c_{bn_*,w} | (1024,) (1024,) (1024,) (1024,) (1024,256,1,1)  |
| res4.4.conv1.*    | res4_4_branch2a_{bn_*,w} | (256,) (256,) (256,) (256,) (256,1024,1,1)      |
| res4.4.conv2.*    | res4_4_branch2b_{bn_*,w} | (256,) (256,) (256,) (256,) (256,256,3,3)       |
| res4.4.conv3.*    | res4_4_branch2c_{bn_*,w} | (1024,) (1024,) (1024,) (1024,) (1024,256,1,1)  |
backbone.fpn_output2.{bias, weight}
backbone.fpn_output3.{bias, weight}
backbone.fpn_output4.{bias, weight}
backbone.fpn_output5.{bias, weight}
proposal_generator.rpn_head.anchor_deltas.{bias, weight}
proposal_generator.rpn_head.conv.{bias, weight}
proposal_generator.rpn_head.objectness_logits.{bias, weight}
roi_heads.box_head.fc1.{bias, weight}
roi_heads.box_head.fc2.{bias, weight}
roi_heads.box_predictor.bbox_pred.{bias, weight}
roi_heads.box_predictor.cls_score.{bias, weight}
roi_heads.mask_head.deconv.{bias, weight}
roi_heads.mask_head.mask_fcn1.{bias, weight}
roi_heads.mask_head.mask_fcn2.{bias, weight}
roi_heads.mask_head.mask_fcn3.{bias, weight}
roi_heads.mask_head.mask_fcn4.{bias, weight}
roi_heads.mask_head.predictor.{bias, weight}
WARNING [05/23 15:50:04 fvcore.common.checkpoint]: The checkpoint state_dict contains keys that are not used by the model:
  fc1000.{bias, weight}
  stem.conv1.bias
[05/23 15:50:04 d2.engine.train_loop]: Starting training from iteration 0
/rootpath/anaconda3/envs/sgg_torch111_detectron06/lib/python3.10/site-packages/torch/functional.py:568: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at  ../aten/src/ATen/native/TensorShape.cpp:2228.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
[05/23 15:50:12 d2.utils.events]:  eta: 7:44:48  iter: 19  total_loss: 2.345  loss_cls: 0.5814  loss_box_reg: 0.01275  loss_mask: 0.6936  loss_rpn_cls: 0.6719  loss_rpn_loc: 0.0908  time: 0.3151  data_time: 0.0139  lr: 0.00039962  max_mem: 1481M
[05/23 15:50:19 d2.utils.events]:  eta: 8:08:10  iter: 39  total_loss: 1.601  loss_cls: 0.4312  loss_box_reg: 0.04747  loss_mask: 0.6906  loss_rpn_cls: 0.4376  loss_rpn_loc: 0.0764  time: 0.3254  data_time: 0.0026  lr: 0.00079922  max_mem: 1481M
[05/23 15:50:26 d2.utils.events]:  eta: 8:17:54  iter: 59  total_loss: 1.641  loss_cls: 0.4153  loss_box_reg: 0.09799  loss_mask: 0.691  loss_rpn_cls: 0.3649  loss_rpn_loc: 0.1253  time: 0.3259  data_time: 0.0028  lr: 0.0011988  max_mem: 1481M
[05/23 15:50:32 d2.utils.events]:  eta: 8:20:12  iter: 79  total_loss: 1.439  loss_cls: 0.3282  loss_box_reg: 0.09175  loss_mask: 0.6924  loss_rpn_cls: 0.2477  loss_rpn_loc: 0.05234  time: 0.3288  data_time: 0.0027  lr: 0.0015984  max_mem: 1481M
[05/23 15:50:39 d2.utils.events]:  eta: 8:20:06  iter: 99  total_loss: 1.285  loss_cls: 0.2667  loss_box_reg: 0.1191  loss_mask: 0.6891  loss_rpn_cls: 0.154  loss_rpn_loc: 0.05424  time: 0.3274  data_time: 0.0025  lr: 0.001998  max_mem: 1481M
[05/23 15:50:45 d2.utils.events]:  eta: 8:15:39  iter: 119  total_loss: 1.52  loss_cls: 0.346  loss_box_reg: 0.1504  loss_mask: 0.6818  loss_rpn_cls: 0.2181  loss_rpn_loc: 0.09391  time: 0.3256  data_time: 0.0025  lr: 0.0023976  max_mem: 1481M
[05/23 15:50:51 d2.utils.events]:  eta: 8:12:57  iter: 139  total_loss: 1.546  loss_cls: 0.2511  loss_box_reg: 0.1242  loss_mask: 0.6869  loss_rpn_cls: 0.2738  loss_rpn_loc: 0.04643  time: 0.3242  data_time: 0.0027  lr: 0.0027972  max_mem: 1481M
[05/23 15:50:58 d2.utils.events]:  eta: 8:12:51  iter: 159  total_loss: 1.687  loss_cls: 0.3452  loss_box_reg: 0.09927  loss_mask: 0.6778  loss_rpn_cls: 0.2546  loss_rpn_loc: 0.1271  time: 0.3253  data_time: 0.0028  lr: 0.0031968  max_mem: 1481M
[05/23 15:51:05 d2.utils.events]:  eta: 8:15:19  iter: 179  total_loss: 1.557  loss_cls: 0.4099  loss_box_reg: 0.1837  loss_mask: 0.6872  loss_rpn_cls: 0.1388  loss_rpn_loc: 0.06568  time: 0.3271  data_time: 0.0027  lr: 0.0035964  max_mem: 1481M
[05/23 15:51:12 d2.utils.events]:  eta: 8:16:06  iter: 199  total_loss: 1.931  loss_cls: 0.5021  loss_box_reg: 0.2378  loss_mask: 0.6843  loss_rpn_cls: 0.2495  loss_rpn_loc: 0.1568  time: 0.3284  data_time: 0.0035  lr: 0.003996  max_mem: 1481M

run2:

[05/23 15:52:57 d2.utils.events]:  eta: 7:49:54  iter: 19  total_loss: 2.349  loss_cls: 0.5801  loss_box_reg: 0.01275  loss_mask: 0.6936  loss_rpn_cls: 0.6719  loss_rpn_loc: 0.09081  time: 0.3190  data_time: 0.0176  lr: 0.00039962  max_mem: 1481M
[05/23 15:53:04 d2.utils.events]:  eta: 8:10:18  iter: 39  total_loss: 1.603  loss_cls: 0.4004  loss_box_reg: 0.04758  loss_mask: 0.6906  loss_rpn_cls: 0.4404  loss_rpn_loc: 0.07629  time: 0.3276  data_time: 0.0025  lr: 0.00079922  max_mem: 1481M
[05/23 15:53:10 d2.utils.events]:  eta: 8:19:58  iter: 59  total_loss: 1.646  loss_cls: 0.4176  loss_box_reg: 0.1167  loss_mask: 0.6912  loss_rpn_cls: 0.3633  loss_rpn_loc: 0.1252  time: 0.3274  data_time: 0.0026  lr: 0.0011988  max_mem: 1481M
[05/23 15:53:17 d2.utils.events]:  eta: 8:21:51  iter: 79  total_loss: 1.428  loss_cls: 0.299  loss_box_reg: 0.0902  loss_mask: 0.6921  loss_rpn_cls: 0.2449  loss_rpn_loc: 0.05256  time: 0.3296  data_time: 0.0026  lr: 0.0015984  max_mem: 1481M
[05/23 15:53:23 d2.utils.events]:  eta: 8:21:44  iter: 99  total_loss: 1.319  loss_cls: 0.2876  loss_box_reg: 0.1062  loss_mask: 0.6898  loss_rpn_cls: 0.1512  loss_rpn_loc: 0.05531  time: 0.3289  data_time: 0.0027  lr: 0.001998  max_mem: 1481M
[05/23 15:53:30 d2.utils.events]:  eta: 8:17:13  iter: 119  total_loss: 1.441  loss_cls: 0.28  loss_box_reg: 0.1317  loss_mask: 0.6835  loss_rpn_cls: 0.2149  loss_rpn_loc: 0.09209  time: 0.3274  data_time: 0.0025  lr: 0.0023976  max_mem: 1481M
[05/23 15:53:36 d2.utils.events]:  eta: 8:15:03  iter: 139  total_loss: 1.496  loss_cls: 0.272  loss_box_reg: 0.1103  loss_mask: 0.6876  loss_rpn_cls: 0.2564  loss_rpn_loc: 0.04832  time: 0.3262  data_time: 0.0025  lr: 0.0027972  max_mem: 1481M
[05/23 15:53:43 d2.utils.events]:  eta: 8:14:56  iter: 159  total_loss: 1.737  loss_cls: 0.3486  loss_box_reg: 0.06897  loss_mask: 0.678  loss_rpn_cls: 0.2603  loss_rpn_loc: 0.1359  time: 0.3266  data_time: 0.0025  lr: 0.0031968  max_mem: 1481M
[05/23 15:53:49 d2.utils.events]:  eta: 8:16:21  iter: 179  total_loss: 1.525  loss_cls: 0.3834  loss_box_reg: 0.1672  loss_mask: 0.6877  loss_rpn_cls: 0.1623  loss_rpn_loc: 0.08118  time: 0.3272  data_time: 0.0026  lr: 0.0035964  max_mem: 1481M
[05/23 15:53:56 d2.utils.events]:  eta: 8:16:14  iter: 199  total_loss: 1.598  loss_cls: 0.3331  loss_box_reg: 0.1141  loss_mask: 0.6792  loss_rpn_cls: 0.2563  loss_rpn_loc: 0.1831  time: 0.3270  data_time: 0.0026  lr: 0.003996  max_mem: 1481M

run3:

[05/23 15:56:10 d2.utils.events]:  eta: 7:45:39  iter: 19  total_loss: 2.348  loss_cls: 0.5763  loss_box_reg: 0.01275  loss_mask: 0.6936  loss_rpn_cls: 0.6719  loss_rpn_loc: 0.0908  time: 0.3167  data_time: 0.0122  lr: 0.00039962  max_mem: 1481M
[05/23 15:56:16 d2.utils.events]:  eta: 8:10:26  iter: 39  total_loss: 1.605  loss_cls: 0.3891  loss_box_reg: 0.04755  loss_mask: 0.6906  loss_rpn_cls: 0.4403  loss_rpn_loc: 0.07635  time: 0.3277  data_time: 0.0027  lr: 0.00079922  max_mem: 1481M
[05/23 15:56:23 d2.utils.events]:  eta: 8:23:04  iter: 59  total_loss: 1.679  loss_cls: 0.4163  loss_box_reg: 0.1102  loss_mask: 0.6912  loss_rpn_cls: 0.3563  loss_rpn_loc: 0.1251  time: 0.3293  data_time: 0.0031  lr: 0.0011988  max_mem: 1481M
[05/23 15:56:30 d2.utils.events]:  eta: 8:21:28  iter: 79  total_loss: 1.433  loss_cls: 0.3133  loss_box_reg: 0.07978  loss_mask: 0.6921  loss_rpn_cls: 0.2468  loss_rpn_loc: 0.05257  time: 0.3303  data_time: 0.0028  lr: 0.0015984  max_mem: 1481M
[05/23 15:56:36 d2.utils.events]:  eta: 8:22:50  iter: 99  total_loss: 1.317  loss_cls: 0.2764  loss_box_reg: 0.1469  loss_mask: 0.6895  loss_rpn_cls: 0.1487  loss_rpn_loc: 0.05474  time: 0.3291  data_time: 0.0027  lr: 0.001998  max_mem: 1481M
[05/23 15:56:43 d2.utils.events]:  eta: 8:20:03  iter: 119  total_loss: 1.455  loss_cls: 0.3264  loss_box_reg: 0.1456  loss_mask: 0.6827  loss_rpn_cls: 0.209  loss_rpn_loc: 0.09486  time: 0.3281  data_time: 0.0030  lr: 0.0023976  max_mem: 1481M
[05/23 15:56:49 d2.utils.events]:  eta: 8:16:57  iter: 139  total_loss: 1.475  loss_cls: 0.2835  loss_box_reg: 0.09706  loss_mask: 0.6861  loss_rpn_cls: 0.2541  loss_rpn_loc: 0.04725  time: 0.3260  data_time: 0.0027  lr: 0.0027972  max_mem: 1481M
[05/23 15:56:56 d2.utils.events]:  eta: 8:18:19  iter: 159  total_loss: 1.675  loss_cls: 0.3287  loss_box_reg: 0.1219  loss_mask: 0.6776  loss_rpn_cls: 0.2344  loss_rpn_loc: 0.1299  time: 0.3269  data_time: 0.0028  lr: 0.0031968  max_mem: 1481M
[05/23 15:57:02 d2.utils.events]:  eta: 8:19:43  iter: 179  total_loss: 1.568  loss_cls: 0.4459  loss_box_reg: 0.1866  loss_mask: 0.6875  loss_rpn_cls: 0.124  loss_rpn_loc: 0.06825  time: 0.3279  data_time: 0.0027  lr: 0.0035964  max_mem: 1481M
[05/23 15:57:09 d2.utils.events]:  eta: 8:19:37  iter: 199  total_loss: 1.803  loss_cls: 0.4938  loss_box_reg: 0.1835  loss_mask: 0.6884  loss_rpn_cls: 0.2585  loss_rpn_loc: 0.1701  time: 0.3281  data_time: 0.0029  lr: 0.003996  max_mem: 1481M

Expected behavior:

I would expect the losses to be (largely) identical in the default training setup, when using identical machine/code/random seed/config and PyTorch flags for deterministic training.

@jhindel

jhindel commented Jun 17, 2022

I am facing a very similar issue. Did you find the cause of this behaviour, and do you have any suggestions for how to fix it?

@j-rausch
Author

I'm still facing the issue. I haven't debugged this in more detail, but judging only from the losses of the three runs, loss_cls appears to differ the most at the beginning of training.

There have been other issues that have been closed in the past (e.g. #2480), pointing to PyTorch's non-determinism. Perhaps revisiting them with the new deterministic training flags in PyTorch could give new pointers.
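
To make the loss_cls observation easier to check, here is a rough sketch that parses the `d2.utils.events` lines and reports, for each loss, the spread (max minus min) across runs at a given iteration. The helper names `parse_losses` and `spread_per_loss` are made up for this example and are not part of detectron2; the regexes assume the log format shown in the runs above.

```python
import re

# Captures the iteration number and the "name: value" pairs that sit
# between "iter: N" and the standalone "time:" field of a log line.
LINE_RE = re.compile(r"iter:\s*(\d+)\s+(.*?)\s+time:")
PAIR_RE = re.compile(r"(\w+):\s*([0-9.eE+-]+)")


def parse_losses(log_text):
    """Return {iteration: {loss_name: value}} for one run's log text."""
    losses = {}
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if match is None:
            continue  # skip warnings and other non-metric lines
        iteration = int(match.group(1))
        losses[iteration] = {
            name: float(value) for name, value in PAIR_RE.findall(match.group(2))
        }
    return losses


def spread_per_loss(runs, iteration):
    """Return {loss_name: max - min across runs} at the given iteration."""
    names = runs[0][iteration].keys()
    return {
        name: max(run[iteration][name] for run in runs)
        - min(run[iteration][name] for run in runs)
        for name in names
    }


# First logged line (iter 19) of each of the three runs above.
run1 = "[05/23 15:50:12 d2.utils.events]:  eta: 7:44:48  iter: 19  total_loss: 2.345  loss_cls: 0.5814  loss_box_reg: 0.01275  loss_mask: 0.6936  loss_rpn_cls: 0.6719  loss_rpn_loc: 0.0908  time: 0.3151  data_time: 0.0139  lr: 0.00039962  max_mem: 1481M"
run2 = "[05/23 15:52:57 d2.utils.events]:  eta: 7:49:54  iter: 19  total_loss: 2.349  loss_cls: 0.5801  loss_box_reg: 0.01275  loss_mask: 0.6936  loss_rpn_cls: 0.6719  loss_rpn_loc: 0.09081  time: 0.3190  data_time: 0.0176  lr: 0.00039962  max_mem: 1481M"
run3 = "[05/23 15:56:10 d2.utils.events]:  eta: 7:45:39  iter: 19  total_loss: 2.348  loss_cls: 0.5763  loss_box_reg: 0.01275  loss_mask: 0.6936  loss_rpn_cls: 0.6719  loss_rpn_loc: 0.0908  time: 0.3167  data_time: 0.0122  lr: 0.00039962  max_mem: 1481M"

runs = [parse_losses(text) for text in (run1, run2, run3)]
spread = spread_per_loss(runs, 19)
for name, value in sorted(spread.items(), key=lambda item: -item[1]):
    print(f"{name:>14}: {value:.5f}")
```

On these three lines, loss_cls has the largest spread, while loss_box_reg, loss_mask and loss_rpn_cls are identical to the printed precision. Feeding in the full logs of each run would show how the spread evolves over later iterations.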

@j-rausch
Author

j-rausch commented Aug 9, 2022

Is there any news or advice on possible reasons for this issue?
