
Error training yolo_nas_l #1997

Open · Vdol22 opened this issue May 20, 2024 · 4 comments

@Vdol22 commented May 20, 2024

💡 Your Question

Hi! I'm stuck trying to train yolo_nas_l on custom data. I've followed several guides and notebooks, yet I keep hitting the same error: "You can use sliding window validation callback, but your model does not support sliding window inference. Please either remove the callback or use the model that supports sliding inference: "Segformer"".
Here's the code:

from super_gradients.common.object_names import Models
from super_gradients.training import Trainer, models
from super_gradients.training.losses import PPYoloELoss
from super_gradients.training.models.detection_models.pp_yolo_e import PPYoloEPostPredictionCallback
from super_gradients import init_trainer
from super_gradients.training.dataloaders.dataloaders import (
    coco_detection_yolo_format_train, 
    coco_detection_yolo_format_val
)
from super_gradients.training.metrics import DetectionMetrics

trainer = Trainer(experiment_name="YOLO_LEARN", ckpt_root_dir="checkpoints")
model = models.get(model_name=Models.YOLO_NAS_L, num_classes=1, pretrained_weights="coco")

dataset_params = {
    'data_dir': '.',
    'train_images_dir': 'images/train',
    'train_labels_dir': 'labels/train',
    'val_images_dir': 'images/val',
    'val_labels_dir': 'labels/val',
    'classes': ['person']
}

BATCH_SIZE = 8
WORKERS = 1

train_loader = coco_detection_yolo_format_train(
    dataset_params={
        'data_dir': dataset_params['data_dir'],
        'images_dir': dataset_params['train_images_dir'],
        'labels_dir': dataset_params['train_labels_dir'],
        'classes': dataset_params['classes']
    },
    dataloader_params={
        'batch_size': BATCH_SIZE,
        'num_workers': WORKERS
    }
)

valid_loader = coco_detection_yolo_format_val(
    dataset_params={
        'data_dir': dataset_params['data_dir'],
        'images_dir': dataset_params['val_images_dir'],
        'labels_dir': dataset_params['val_labels_dir'],
        'classes': dataset_params['classes']
    },
    dataloader_params={
        'batch_size': BATCH_SIZE,
        'num_workers': WORKERS
    }
)

training_params = {
    "max_epochs": 300,
    "warmup_mode": "LinearBatchLRWarmup",
    "warmup_initial_lr": 1e-6,
    "lr_warmup_epochs": 3,
    "initial_lr": 5e-4,
    "lr_mode": "CosineLRScheduler",
    "cosine_final_lr_ratio": 0.1,
    "loss": PPYoloELoss(
        use_static_assigner=False,
        num_classes=1
    ),
    "optimizer": "AdamW",
    "optimizer_params": {"weight_decay": 0.0001},
    "ema": True,
    "ema_params": {"decay": 0.9997, "decay_type": "threshold"},
    "valid_metrics_list": [
        DetectionMetrics(
            score_thres=0.1,
            top_k_predictions=300,
            num_cls=1,
            normalize_targets=True,
            post_prediction_callback=PPYoloEPostPredictionCallback(
                score_threshold=0.01,
                nms_top_k=1000,
                max_predictions=300,
                nms_threshold=0.7
            )
        )
    ],
    "metric_to_watch": 'mAP@0.50:0.95',
    "greater_metric_to_watch_is_better": True
}

trainer.train(model=model, training_params=training_params, train_loader=train_loader, valid_loader=valid_loader)

Here's the output:

Indexing dataset annotations: 100%|██████████| 4/4 [00:00<00:00, 1999.91it/s]
Indexing dataset annotations: 100%|██████████| 3/3 [00:00<00:00, 2981.73it/s]

StopIteration                             Traceback (most recent call last)
Cell In[9], line 1
----> 1 trainer.train(model=model, training_params=training_params, train_loader=train_loader, valid_loader=valid_loader)

File ~\AppData\Roaming\Python\Python39\site-packages\super_gradients\training\sg_trainer\sg_trainer.py:1482, in Trainer.train(self, model, training_params, train_loader, valid_loader, test_loaders, additional_configs_to_log)
   1475     raise ValueError(
   1476         "You can use sliding window validation callback, but your model does not support sliding window "
   1477         "inference. Please either remove the callback or use the model that supports sliding inference: "
   1478         "Segformer"
   1479     )
   1481 if isinstance(model, SupportsInputShapeCheck):
-> 1482     first_train_batch = next(iter(self.train_loader))
   1483     inputs, _, _ = sg_trainer_utils.unpack_batch_items(first_train_batch)
   1484     model.validate_input_shape(inputs.size())

File C:\utils\anaconda3\envs\py39\lib\site-packages\torch\utils\data\dataloader.py:631, in _BaseDataLoaderIter.__next__(self)
    628 if self._sampler_iter is None:
    629     # TODO(https://github.com/pytorch/pytorch/issues/76750)
    630     self._reset()  # type: ignore[call-arg]
--> 631 data = self._next_data()
    632 self._num_yielded += 1
    633 if self._dataset_kind == _DatasetKind.Iterable and \
    634         self._IterableDataset_len_called is not None and \
    635         self._num_yielded > self._IterableDataset_len_called:

File C:\utils\anaconda3\envs\py39\lib\site-packages\torch\utils\data\dataloader.py:1319, in _MultiProcessingDataLoaderIter._next_data(self)
   1317     if not self._persistent_workers:
   1318         self._shutdown_workers()
-> 1319     raise StopIteration
   1321 # Now `self._rcvd_idx` is the batch index we want to fetch
   1322 
   1323 # Check if the next sample has already been generated
   1324 if len(self._task_info[self._rcvd_idx]) == 2:

Please help; your lib looks so promising, yet I don't understand what I'm doing wrong.

Versions

PyTorch version: 2.3.0
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Windows 11 Pro
GCC version: Could not collect
Clang version: Could not collect
CMake version: Could not collect
Libc version: N/A

Python version: 3.9.19 (main, May 6 2024, 20:12:36) [MSC v.1916 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-10-10.0.22631-SP0
Is CUDA available: True
CUDA runtime version: 12.1.66
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 2050
Nvidia driver version: 552.22
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Revision=

Versions of relevant libraries:
[pip3] numpy==1.23.0
[pip3] onnx==1.15.0
[pip3] onnx-simplifier==0.4.36
[pip3] onnxruntime==1.15.0
[pip3] onnxsim==0.4.36
[pip3] torch==2.3.0
[pip3] torchaudio==2.3.0
[pip3] torchmetrics==0.8.0
[pip3] torchvision==0.18.0
[conda] blas 1.0 mkl
[conda] mkl 2021.4.0 pypi_0 pypi
[conda] mkl-service 2.4.0 py39h2bbff1b_0
[conda] mkl_fft 1.3.1 py39h277e83a_0
[conda] mkl_random 1.2.2 py39hf11a4ad_0
[conda] numpy 1.23.0 pypi_0 pypi
[conda] numpy-base 1.24.3 py39h005ec55_0
[conda] pytorch 2.3.0 py3.9_cuda12.1_cudnn8_0 pytorch
[conda] pytorch-cuda 12.1 hde6ce7c_5 pytorch
[conda] pytorch-mutex 1.0 cuda pytorch
[conda] torch 2.3.0 pypi_0 pypi
[conda] torchaudio 2.3.0 pypi_0 pypi
[conda] torchmetrics 0.8.0 pypi_0 pypi
[conda] torchvision 0.18.0 pypi_0 pypi

@BloodAxe (Collaborator)

I don't think this has anything to do with sliding window inference; that message just happens to sit nearby in the file the stack trace points into. If you look closely at the stack trace, the "-> " markers show where the error actually comes from.
Overall, it looks like an exception is raised inside the DataLoader while it tries to assemble a batch.

For this I suggest testing whether you can get a single batch, or even a single sample, from the dataset. To simplify debugging, it's best to turn off all workers (num_workers: 0) when creating the DataLoader. That way the exception is raised in the main process with a clearer message that should give you a full picture of what is happening. Looking forward to seeing that error message.
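For example, a rough sketch (reusing dataset_params and BATCH_SIZE from your snippet; the returned loader is a regular PyTorch DataLoader, so it exposes .dataset):

from super_gradients.training.dataloaders.dataloaders import coco_detection_yolo_format_train

# Rebuild the train loader with no worker processes so any exception
# surfaces directly in the main process.
debug_loader = coco_detection_yolo_format_train(
    dataset_params={
        'data_dir': dataset_params['data_dir'],
        'images_dir': dataset_params['train_images_dir'],
        'labels_dir': dataset_params['train_labels_dir'],
        'classes': dataset_params['classes']
    },
    dataloader_params={'batch_size': BATCH_SIZE, 'num_workers': 0}
)

print("indexed samples:", len(debug_loader.dataset))  # how many images were found
print("batches per epoch:", len(debug_loader))        # 0 would explain a StopIteration

sample = debug_loader.dataset[0]   # dataset-level access, bypassing sampler/collate
batch = next(iter(debug_loader))   # loader-level access, exercising the full pipeline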

@Vdol22 (Author) commented May 20, 2024

Thank you kindly for the quick reply.

"turn off all workers (num_workers: 0) when creating the DataLoader"

There it is:

StopIteration                             Traceback (most recent call last)
Cell In[9], line 1
----> 1 trainer.train(model=model, training_params=training_params, train_loader=train_loader, valid_loader=valid_loader)

File ~\AppData\Roaming\Python\Python39\site-packages\super_gradients\training\sg_trainer\sg_trainer.py:1482, in Trainer.train(self, model, training_params, train_loader, valid_loader, test_loaders, additional_configs_to_log)
   1475     raise ValueError(
   1476         "You can use sliding window validation callback, but your model does not support sliding window "
   1477         "inference. Please either remove the callback or use the model that supports sliding inference: "
   1478         "Segformer"
   1479     )
   1481 if isinstance(model, SupportsInputShapeCheck):
-> 1482     first_train_batch = next(iter(self.train_loader))
   1483     inputs, _, _ = sg_trainer_utils.unpack_batch_items(first_train_batch)
   1484     model.validate_input_shape(inputs.size())

File C:\utils\anaconda3\envs\py39\lib\site-packages\torch\utils\data\dataloader.py:631, in _BaseDataLoaderIter.__next__(self)
    628 if self._sampler_iter is None:
    629     # TODO(https://github.com/pytorch/pytorch/issues/76750)
    630     self._reset()  # type: ignore[call-arg]
--> 631 data = self._next_data()
    632 self._num_yielded += 1
    633 if self._dataset_kind == _DatasetKind.Iterable and \
    634         self._IterableDataset_len_called is not None and \
    635         self._num_yielded > self._IterableDataset_len_called:

File C:\utils\anaconda3\envs\py39\lib\site-packages\torch\utils\data\dataloader.py:674, in _SingleProcessDataLoaderIter._next_data(self)
    673 def _next_data(self):
--> 674     index = self._next_index()  # may raise StopIteration
    675     data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
    676     if self._pin_memory:

File C:\utils\anaconda3\envs\py39\lib\site-packages\torch\utils\data\dataloader.py:621, in _BaseDataLoaderIter._next_index(self)
    620 def _next_index(self):
--> 621     return next(self._sampler_iter)

@Vdol22 (Author) commented May 20, 2024

After some debugging, I found that trying to fetch a single batch from each loader:

train_loader_iter = iter(train_loader)
try:
    train_batch = next(train_loader_iter)
    display("Train Batch:", train_batch)
except StopIteration:
    display("No data fetched from train_loader")

valid_loader_iter = iter(valid_loader)
try:
    valid_batch = next(valid_loader_iter)
    display("Valid Batch:", valid_batch)
except StopIteration:
    display("No data fetched from valid_loader")

prints 'No data fetched from train_loader'. The valid_loader, however, works just fine.
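Worth noting: the traceback ends in _next_index, i.e. the sampler is exhausted before a single batch is produced. A quick check (a sketch; train_loader is the loader from the original snippet):

print("batches per epoch:", len(train_loader))        # 0 reproduces the StopIteration
print("indexed samples:", len(train_loader.dataset))  # 4 according to the indexing log

A loader that reports 0 batches over a non-empty dataset points at the batching params (batch_size, drop_last) rather than at the data itself.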

@Vdol22 (Author) commented May 21, 2024

UPD: removing worker_init_fn from the training dataloader params seems to have gotten it running:

train_loader = coco_detection_yolo_format_train(
    dataset_params={
        'data_dir': dataset_params['data_dir'],
        'images_dir': dataset_params['train_images_dir'],
        'labels_dir': dataset_params['train_labels_dir'],
        'classes': dataset_params['classes']
    },
    dataloader_params={
        'batch_size': BATCH_SIZE,
        'num_workers': WORKERS,
        'shuffle': True,
        'drop_last': False,
        'pin_memory': True,
        # 'worker_init_fn': {
        #     '_target_': 'super_gradients.training.utils.utils.load_func',
        #     'dotpath': 'super_gradients.training.datasets.datasets_utils.worker_init_reset_seed'
        # },
        'collate_fn': 'DetectionCollateFN'
    }
)

It is strange, though, that the epoch progress bar now shows 1/1. There are only 4 photos in my dataset (I was just trying to get training to run), so maybe that's the cause.
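For what it's worth, 1/1 is exactly what standard PyTorch batch-count arithmetic predicts for 4 images at batch size 8 (a worked sketch using the numbers from this thread):

import math

num_samples = 4  # images indexed in the train set
batch_size = 8

# drop_last=False keeps the final partial batch:
print(math.ceil(num_samples / batch_size))  # 1 -> matches the 1/1 progress bar

# drop_last=True would discard the partial batch entirely:
print(num_samples // batch_size)            # 0 -> a loader that yields no batches

With drop_last=True the same dataset would yield zero batches per epoch, which would also surface as the earlier StopIteration; whether the default train dataloader sets drop_last=True (which would make the explicit 'drop_last': False, rather than the removed worker_init_fn, the actual fix) may be worth checking.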
