module._head.anchors._anchors with shape torch.Size([3, 1, 2]) does not match _head.anchors._anchors with shape torch.Size([3, 3, 2]) #1226

jhurliman · 2023-06-27T18:03:22Z

🐛 Describe the bug

I am training a YOLOX model with as few configuration changes as possible: I changed num_classes from 80 to 8 and input_dims from 640x640 to 768x768 to match my training data. Training runs, with good-looking loss curves and improving validation-set performance, but when I load the best checkpoint for inference I get this error (same machine, same conda environment):

ValueError: ckpt layer module._head.anchors._anchors with shape torch.Size([3, 1, 2]) does not match _head.anchors._anchors with shape torch.Size([3, 3, 2]) in the model
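
The error compares a checkpoint key carrying a `module.` prefix (the prefix that PyTorch's DistributedDataParallel wrapper adds to parameter names) against the unprefixed key in the freshly built model. A minimal sketch for listing such mismatches before loading is shown here. It works on plain name-to-shape mappings so that it runs without a checkpoint at hand; the helper names `strip_ddp_prefix` and `find_shape_mismatches` are hypothetical and not part of super-gradients.

```python
from typing import Dict, List, Tuple

Shape = Tuple[int, ...]


def strip_ddp_prefix(key: str, prefix: str = "module.") -> str:
    """Remove the leading wrapper prefix that DDP adds to parameter names."""
    return key[len(prefix):] if key.startswith(prefix) else key


def find_shape_mismatches(
    ckpt_shapes: Dict[str, Shape], model_shapes: Dict[str, Shape]
) -> List[Tuple[str, Shape, Shape]]:
    """Return (key, checkpoint_shape, model_shape) for every key present in both with different shapes."""
    mismatches = []
    for raw_key, ckpt_shape in ckpt_shapes.items():
        key = strip_ddp_prefix(raw_key)
        model_shape = model_shapes.get(key)
        if model_shape is not None and model_shape != ckpt_shape:
            mismatches.append((key, ckpt_shape, model_shape))
    return mismatches


# Shapes copied from the error message above.
ckpt_shapes = {"module._head.anchors._anchors": (3, 1, 2)}
model_shapes = {"_head.anchors._anchors": (3, 3, 2)}

report = find_shape_mismatches(ckpt_shapes, model_shapes)
for key, ckpt_shape, model_shape in report:
    print(f"{key}: checkpoint {ckpt_shape} vs model {model_shape}")
```

With a real checkpoint, the two mappings could be built as `{k: tuple(v.shape) for k, v in state_dict.items()}` from the checkpoint's state dict and from `model.state_dict()`. This only reports the mismatch; it does not explain why the anchors were saved with a different shape.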

Versions

$ python3 collect_env.py
Collecting environment information...
PyTorch version: 2.0.1+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.2 LTS (x86_64)
GCC version: (Ubuntu 11.3.0-1ubuntu1~22.04.1) 11.3.0
Clang version: Could not collect
CMake version: version 3.26.4
Libc version: glibc-2.35

Python version: 3.10.6 (main, May 29 2023, 11:10:38) [GCC 11.3.0] (64-bit runtime)
Python platform: Linux-5.15.0-72-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 11.8.89
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A100-SXM4-40GB
GPU 1: NVIDIA A100-SXM4-40GB
GPU 2: NVIDIA A100-SXM4-40GB
GPU 3: NVIDIA A100-SXM4-40GB
GPU 4: NVIDIA A100-SXM4-40GB
GPU 5: NVIDIA A100-SXM4-40GB
GPU 6: NVIDIA A100-SXM4-40GB
GPU 7: NVIDIA A100-SXM4-40GB

Nvidia driver version: 525.105.17
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Address sizes:                   43 bits physical, 48 bits virtual
Byte Order:                      Little Endian
CPU(s):                          256
On-line CPU(s) list:             0-254
Off-line CPU(s) list:            255
Vendor ID:                       AuthenticAMD
Model name:                      AMD EPYC 7742 64-Core Processor
CPU family:                      23
Model:                           49
Thread(s) per core:              2
Core(s) per socket:              64
Socket(s):                       2
Stepping:                        0
Frequency boost:                 enabled
CPU max MHz:                     2250.0000
CPU min MHz:                     0.0000
BogoMIPS:                        4499.92
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sme sev sev_es
Virtualization:                  AMD-V
L1d cache:                       4 MiB (128 instances)
L1i cache:                       4 MiB (128 instances)
L2 cache:                        64 MiB (128 instances)
L3 cache:                        512 MiB (32 instances)
NUMA node(s):                    2
NUMA node0 CPU(s):               0-63,128-191
NUMA node1 CPU(s):               64-127,192-254
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Not affected
Vulnerability Retbleed:          Mitigation; untrained return thunk; SMT enabled with STIBP protection
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Retpolines, IBPB conditional, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected

Versions of relevant libraries:
[pip3] numpy==1.21.5
[pip3] torch==2.0.1
[pip3] torchaudio==2.0.2
[pip3] torchvision==0.15.1
[pip3] triton==2.0.0
[conda] Could not collect


shaydeci · 2023-06-29T10:27:54Z

@jhurliman can you please post the code snippet you used for loading the network's weights?

BloodAxe · 2023-08-10T10:07:57Z

This should be fixed now that #1184 has been merged.
@jhurliman can you please try training again with 3.1.3 to see whether the problem still persists?
If it does, feel free to re-open the issue.

ortizeg · 2023-08-15T02:30:48Z

Hi, I was able to reproduce the error. I installed the main branch.

I trained yolox_s for one iteration on 4 GPUs in DDP mode using coco128, without modifying the number of classes or anything else, and got the same error when loading the checkpoint.
https://ultralytics.com/assets/coco128.zip

I tried to load the model with the following:
model = models.get(model_name="yolox_s", num_classes=80, checkpoint_path=checkpoint_path)

The error:

Traceback (most recent call last):
  File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/enortiz-m60x42/code/super-gradients/src/super_gradients/training/utils/checkpoint_utils.py", line 67, in adaptive_load_state_dict
    net.load_state_dict(state_dict, strict=strict_bool)
  File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/enortiz-m60x42/code/super-gradients/src/super_gradients/training/models/detection_models/yolo_base.py", line 593, in load_state_dict
    raise RuntimeError(
RuntimeError: Got exception Error(s) in loading state_dict for YoloX_S:
        size mismatch for _head.anchors._anchors: copying a param with shape torch.Size([3, 1, 2]) from checkpoint, the shape in current model is torch.Size([3, 3, 2]).
        size mismatch for _head.anchors._anchor_grid: copying a param with shape torch.Size([3, 1, 1, 1, 1, 2]) from checkpoint, the shape in current model is torch.Size([3, 1, 3, 1, 1, 2])., if a mismatch between expected and given state_dict keys exist, checkpoint may have been saved after fusing conv and bn. use fuse_conv_bn before loading.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/enortiz-m60x42/code/gradients/convert_to_onnx.py", line 7, in <module>
    model = models.get(model_name="yolox_s", num_classes=80, checkpoint_path=checkpoint_path)
  File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/enortiz-m60x42/code/super-gradients/src/super_gradients/training/models/model_factory.py", line 233, in get
    _ = load_checkpoint_to_model(
  File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/enortiz-m60x42/code/super-gradients/src/super_gradients/training/utils/checkpoint_utils.py", line 246, in load_checkpoint_to_model
    adaptive_load_state_dict(net, checkpoint, strict)
  File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/enortiz-m60x42/code/super-gradients/src/super_gradients/training/utils/checkpoint_utils.py", line 70, in adaptive_load_state_dict
    adapted_state_dict = adapt_state_dict_to_fit_model_layer_names(net.state_dict(), state_dict, solver=solver)
  File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/enortiz-m60x42/code/super-gradients/src/super_gradients/training/utils/checkpoint_utils.py", line 178, in adapt_state_dict_to_fit_model_layer_names
    raise ValueError(f"ckpt layer {ckpt_key} with shape {ckpt_val.shape} does not match {model_key}" f" with shape {model_val.shape} in the model")
ValueError: ckpt layer _head.anchors._anchors with shape torch.Size([3, 1, 2]) does not match _head.anchors._anchors with shape torch.Size([3, 3, 2]) in the model

Config:

  - training_hyperparams: coco2017_yolox_train_params
  - dataset_params: coco_detection_dataset_params
  - arch_params: yolox_s_arch_params
  - checkpoint_params: default_checkpoint_params
  - _self_
  - variable_setup

train_dataloader: coco2017_train
val_dataloader: coco2017_val

load_checkpoint: False
resume: False
# num_classes: 1
data_dir: "data/coco128/"

dataset_params:
  train_dataset_params:
    data_dir: ${data_dir}
  train_dataloader_params:
    batch_size: 32
  val_dataset_params:
    data_dir: ${data_dir}

training_hyperparams:
  resume: ${resume}
  loss: yolox_fast_loss


architecture: yolox_s

multi_gpu: DDP
num_gpus: 4

experiment_suffix: res${dataset_params.train_dataset_params.input_dim}
experiment_name: ${architecture}_sub_${experiment_suffix}

Version 3.1.3+main:

Python 3.9.17 (main, Jul  5 2023, 20:41:20) 
[GCC 11.2.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import super_gradients
The console stream is logged into /home/azureuser/sg_logs/console.log
[2023-08-15 02:31:43] INFO - crash_tips_setup.py - Crash tips is enabled. You can set your environment variable to CRASH_HANDLER=FALSE to disable it
/anaconda/envs/gradients/lib/python3.9/site-packages/_distutils_hack/__init__.py:33: UserWarning: Setuptools is replacing distutils.
  warnings.warn("Setuptools is replacing distutils.")

>>> super_gradients.__version__
'3.1.3'

ortizeg · 2023-08-15T16:55:21Z

@BloodAxe update: I saw that some YOLOX-related commits landed today. If I train now and run the same code as above on the current main commit, everything works fine. But if I use the sample model I trained last night with 3.1.3 (plus whatever was on main as of last night), I get a different error, shown below. I'll try to track down the issue on my end as well, but figured this might be useful for you. Should I share the model to help?

The console stream is logged into /home/azureuser/sg_logs/console.log
[2023-08-15 16:33:01] INFO - crash_tips_setup.py - Crash tips is enabled. You can set your environment variable to CRASH_HANDLER=FALSE to disable it
/anaconda/envs/gradients/lib/python3.9/site-packages/_distutils_hack/__init__.py:33: UserWarning: Setuptools is replacing distutils.
  warnings.warn("Setuptools is replacing distutils.")
Traceback (most recent call last):
  File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/enortiz-m60/code/super-gradients/src/super_gradients/training/models/detection_models/yolo_base.py", line 593, in load_state_dict
    super().load_state_dict(state_dict, strict)
  File "/anaconda/envs/gradients/lib/python3.9/site-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for YoloX_S:
        Unexpected key(s) in state_dict: "stride", "_head.anchors._stride", "_head.anchors._anchors", "_head.anchors._anchor_grid", "_head._modules_list.14.stride". 

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/enortiz-m60/code/super-gradients/src/super_gradients/training/utils/checkpoint_utils.py", line 86, in adaptive_load_state_dict
    net.load_state_dict(state_dict, strict=strict_bool)
  File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/enortiz-m60/code/super-gradients/src/super_gradients/training/models/detection_models/yolo_base.py", line 595, in load_state_dict
    raise RuntimeError(
RuntimeError: Got exception Error(s) in loading state_dict for YoloX_S:
        Unexpected key(s) in state_dict: "stride", "_head.anchors._stride", "_head.anchors._anchors", "_head.anchors._anchor_grid", "_head._modules_list.14.stride". , if a mismatch between expected and given state_dict keys exist, checkpoint may have been saved after fusing conv and bn. use fuse_conv_bn before loading.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/enortiz-m60/code/gradients/convert_to_onnx.py", line 8, in <module>
    model = models.get(model_name="yolox_s", num_classes=80, checkpoint_path=checkpoint_path).eval()
  File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/enortiz-m60/code/super-gradients/src/super_gradients/training/models/model_factory.py", line 233, in get
    _ = load_checkpoint_to_model(
  File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/enortiz-m60/code/super-gradients/src/super_gradients/training/utils/checkpoint_utils.py", line 1513, in load_checkpoint_to_model
    adaptive_load_state_dict(net, checkpoint, strict)
  File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/enortiz-m60/code/super-gradients/src/super_gradients/training/utils/checkpoint_utils.py", line 89, in adaptive_load_state_dict
    adapted_state_dict = adapt_state_dict_to_fit_model_layer_names(net.state_dict(), state_dict, solver=solver)
  File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/enortiz-m60/code/super-gradients/src/super_gradients/training/utils/checkpoint_utils.py", line 1446, in adapt_state_dict_to_fit_model_layer_names
    new_ckpt_dict = solver(model_state_dict, source_ckpt)
  File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/enortiz-m60/code/super-gradients/src/super_gradients/training/utils/checkpoint_utils.py", line 200, in __call__
    raise ValueError(f"ckpt layer {ckpt_key} with shape {ckpt_val.shape} does not match {model_key}" f" with shape {model_val.shape} in the model")
ValueError: ckpt layer stride with shape torch.Size([3]) does not match _backbone._modules_list.0.conv.weight with shape torch.Size([32, 3, 6, 6]) in the model
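
The first traceback lists five keys that exist in the older checkpoint but not in the current model. The final ValueError then pairs `stride` with the first backbone weight, which suggests (an inference from the message, not confirmed in this thread) that the fallback solver matches keys by position, so the extra entries shift everything out of alignment. A minimal sketch of removing those keys from a state dict before loading is shown here. The helper `drop_keys` is hypothetical, and whether these buffers can safely be discarded for a given checkpoint is an assumption that should be verified.

```python
from typing import Any, Dict, Iterable

# Keys reported as unexpected in the traceback above.
UNEXPECTED_KEYS = (
    "stride",
    "_head.anchors._stride",
    "_head.anchors._anchors",
    "_head.anchors._anchor_grid",
    "_head._modules_list.14.stride",
)


def drop_keys(state_dict: Dict[str, Any], keys_to_drop: Iterable[str]) -> Dict[str, Any]:
    """Return a copy of state_dict without the listed keys; the original order of the rest is kept."""
    drop = set(keys_to_drop)
    return {key: value for key, value in state_dict.items() if key not in drop}


# Toy stand-in for an old checkpoint's state dict (values are placeholders, not tensors).
old_state = {
    "stride": [8, 16, 32],
    "_backbone._modules_list.0.conv.weight": "weights",
    "_head.anchors._anchors": "anchors",
}

cleaned = drop_keys(old_state, UNEXPECTED_KEYS)
print(list(cleaned))  # ['_backbone._modules_list.0.conv.weight']
```

The cleaned dictionary could then be passed to the model's `load_state_dict`. How the state dict is stored inside the checkpoint file is not shown in this thread, so that step is left out.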

shaydeci added the ❔ Need more info label Jun 29, 2023

BloodAxe closed this as completed Aug 10, 2023
