
Running Hyperparameter Evolution raises ValueError #12897

Closed · 1 task done
RAHUL01-09 opened this issue Apr 8, 2024 · 3 comments
Labels
bug (Something isn't working), Stale

Comments

@RAHUL01-09

### Search before asking

  • I have searched the YOLOv5 issues and found no similar bug report.

### YOLOv5 Component

Training, Evolution

### Bug

I was trying to train my custom model locally on an Nvidia RTX 3050, but training raises a ValueError. I checked, and it raises the same error on the coco128 dataset.
This is the dump:

(.venv-cuda121) PS C:\workspace\adis\yolov5> python train.py --img 640 --batch 4 --epochs 3 --data coco128.yaml --weights yolov5s.pt --cache --evolve
train: weights=yolov5s.pt, cfg=, data=coco128.yaml, hyp=data\hyps\hyp.scratch-low.yaml, epochs=3, batch_size=4, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, noplots=False, evolve=300, evolve_population=data\hyps, resume_evolve=None, bucket=, cache=ram, image_weights=False, device=, multi_scale=False, single_cls=False, optimizer=SGD, sync_bn=False, workers=8, project=runs\train, name=exp, exist_ok=False, quad=False, cos_lr=False, label_smoothing=0.0, patience=100, freeze=[0], save_period=-1, seed=0, local_rank=-1, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest, ndjson_console=False, ndjson_file=False
github: up to date with https://github.com/ultralytics/yolov5
YOLOv5  v7.0-296-gae4ef3b2 Python-3.11.1 torch-2.2.2+cu121 CUDA:0 (NVIDIA GeForce RTX 3050 Laptop GPU, 4096MiB)

hyperparameters: lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.01041, hsv_s=0.54703, hsv_v=0.27739, degrees=0.0, translate=0.04591, scale=0.75544, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=0.85834, mixup=0.04266, copy_paste=0.0, anchors=3
Comet: run 'pip install comet_ml' to automatically track and visualize YOLOv5  runs in Comet
Overriding model.yaml anchors with anchors=3

                 from  n    params  module                                  arguments
  0                -1  1      3520  models.common.Conv                      [3, 32, 6, 2, 2]
  1                -1  1     18560  models.common.Conv                      [32, 64, 3, 2]
  2                -1  1     18816  models.common.C3                        [64, 64, 1]
  3                -1  1     73984  models.common.Conv                      [64, 128, 3, 2]
  4                -1  2    115712  models.common.C3                        [128, 128, 2]
  5                -1  1    295424  models.common.Conv                      [128, 256, 3, 2]
  6                -1  3    625152  models.common.C3                        [256, 256, 3]
  7                -1  1   1180672  models.common.Conv                      [256, 512, 3, 2]
  8                -1  1   1182720  models.common.C3                        [512, 512, 1]
  9                -1  1    656896  models.common.SPPF                      [512, 512, 5]
 10                -1  1    131584  models.common.Conv                      [512, 256, 1, 1]
 11                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 12           [-1, 6]  1         0  models.common.Concat                    [1]
 13                -1  1    361984  models.common.C3                        [512, 256, 1, False]
 14                -1  1     33024  models.common.Conv                      [256, 128, 1, 1]
 15                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 16           [-1, 4]  1         0  models.common.Concat                    [1]
 17                -1  1     90880  models.common.C3                        [256, 128, 1, False]
 18                -1  1    147712  models.common.Conv                      [128, 128, 3, 2]
 19          [-1, 14]  1         0  models.common.Concat                    [1]
 20                -1  1    296448  models.common.C3                        [256, 256, 1, False]
 21                -1  1    590336  models.common.Conv                      [256, 256, 3, 2]
 22          [-1, 10]  1         0  models.common.Concat                    [1]
 23                -1  1   1182720  models.common.C3                        [512, 512, 1, False]
 24      [17, 20, 23]  1    229245  models.yolo.Detect                      [80, [[0, 1, 2, 3, 4, 5], [0, 1, 2, 3, 4, 5], [0, 1, 2, 3, 4, 5]], [128, 256, 512]]
Model summary: 214 layers, 7235389 parameters, 7235389 gradients, 16.6 GFLOPs

Transferred 348/349 items from yolov5s.pt
AMP: checks passed
optimizer: SGD(lr=0.01) with parameter groups 57 weight(decay=0.0), 60 weight(decay=0.0005), 60 bias
train: Scanning C:\workspace\adis\datasets\coco128\labels\train2017.cache... 126 images, 2 backgrounds, 0 corrupt: 100%|██████████| 128/128 [00:00<?, ?it/s]
train: Caching images (0.1GB ram): 100%|██████████| 128/128 [00:00<00:00, 1829.21it/s]
val: Scanning C:\workspace\adis\datasets\coco128\labels\train2017.cache... 126 images, 2 backgrounds, 0 corrupt: 100%|██████████| 128/128 [00:00<?, ?it/s]

AutoAnchor: 0.36 anchors/target, 0.097 Best Possible Recall (BPR). Anchors are a poor fit to dataset , attempting to improve...
AutoAnchor: WARNING  Extremely small objects found: 3 of 929 labels are <3 pixels in size
AutoAnchor: Running kmeans for 9 anchors on 928 points...
AutoAnchor: Evolving anchors with Genetic Algorithm: fitness = 0.6715: 100%|██████████| 1000/1000 [00:00<00:00, 2909.40it/s]
AutoAnchor: thr=0.25: 0.9925 best possible recall, 3.71 anchors past thr
AutoAnchor: n=9, img_size=640, metric_all=0.261/0.672-mean/best, past_thr=0.478-mean: 11,11, 20,27, 51,57, 125,86, 92,175, 140,287, 280,226, 378,368, 549,444
AutoAnchor: Done  (optional: update model *.yaml to use these anchors in the future)
Plotting labels to runs\evolve\exp4\labels.jpg...
Image sizes 640 train, 640 val
Using 4 dataloader workers
Logging results to runs\evolve\exp4
Starting training for 3 epochs...

      Epoch    GPU_mem   box_loss   obj_loss   cls_loss  Instances       Size
  0%|          | 0/32 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "C:\workspace\adis\yolov5\train.py", line 848, in <module>
    main(opt)
  File "C:\workspace\adis\yolov5\train.py", line 754, in main
    results = train(hyp.copy(), opt, device, callbacks)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\workspace\adis\yolov5\train.py", line 356, in train
    for i, (imgs, targets, paths, _) in pbar:  # batch -------------------------------------------------------------
  File "C:\workspace\adis\yolov5\.venv-cuda121\Lib\site-packages\tqdm\std.py", line 1181, in __iter__
    for obj in iterable:
  File "C:\workspace\adis\yolov5\utils\dataloaders.py", line 239, in __iter__
    yield next(self.iterator)
          ^^^^^^^^^^^^^^^^^^^
  File "C:\workspace\adis\yolov5\.venv-cuda121\Lib\site-packages\torch\utils\data\dataloader.py", line 631, in __next__
    data = self._next_data()
           ^^^^^^^^^^^^^^^^^
  File "C:\workspace\adis\yolov5\.venv-cuda121\Lib\site-packages\torch\utils\data\dataloader.py", line 1346, in _next_data
    return self._process_data(data)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\workspace\adis\yolov5\.venv-cuda121\Lib\site-packages\torch\utils\data\dataloader.py", line 1372, in _process_data
    data.reraise()
  File "C:\workspace\adis\yolov5\.venv-cuda121\Lib\site-packages\torch\_utils.py", line 722, in reraise
    raise exception
ValueError: Caught ValueError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "C:\workspace\adis\yolov5\.venv-cuda121\Lib\site-packages\torch\utils\data\_utils\worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
           ^^^^^^^^^^^^^^^^^^^^
  File "C:\workspace\adis\yolov5\.venv-cuda121\Lib\site-packages\torch\utils\data\_utils\fetch.py", line 51, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\workspace\adis\yolov5\.venv-cuda121\Lib\site-packages\torch\utils\data\_utils\fetch.py", line 51, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
            ~~~~~~~~~~~~^^^^^
  File "C:\workspace\adis\yolov5\utils\dataloaders.py", line 777, in __getitem__
    img, labels = mixup(img, labels, *self.load_mosaic(random.choice(self.indices)))
                                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Program Files\Python311\Lib\random.py", line 369, in choice
    if not seq:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

If this is an Nvidia-related bug, here is my info from nvidia-smi:

Mon Apr  8 20:41:10 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 546.21                 Driver Version: 546.21       CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                     TCC/WDDM  | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3050 ...  WDDM  | 00000000:01:00.0 Off |                  N/A |
| N/A   36C    P0               7W /  40W |      0MiB /  4096MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

### Environment

YOLOv5  v7.0-296-gae4ef3b2 Python-3.11.1 torch-2.2.2+cu121 CUDA:0 (NVIDIA GeForce RTX 3050 Laptop GPU, 4096MiB)

### Minimal Reproducible Example

_No response_

### Additional

_No response_

### Are you willing to submit a PR?

- [ ] Yes I'd like to help by submitting a PR!
RAHUL01-09 added the bug label Apr 8, 2024

github-actions bot (Contributor) commented Apr 8, 2024

👋 Hello @RAHUL01-09, thank you for your interest in YOLOv5 🚀! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.

If this is a 🐛 Bug Report, please provide a minimum reproducible example to help us debug it.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset image examples and training logs, and verify you are following our Tips for Best Training Results.

Requirements

Python>=3.8.0 with all requirements.txt installed including PyTorch>=1.8. To get started:

git clone https://github.com/ultralytics/yolov5  # clone
cd yolov5
pip install -r requirements.txt  # install

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

YOLOv5 CI

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training, validation, inference, export and benchmarks on macOS, Windows, and Ubuntu every 24 hours and on every commit.

Introducing YOLOv8 🚀

We're excited to announce the launch of our latest state-of-the-art (SOTA) object detection model for 2023 - YOLOv8 🚀!

Designed to be fast, accurate, and easy to use, YOLOv8 is an ideal choice for a wide range of object detection, image segmentation and image classification tasks. With YOLOv8, you'll be able to quickly and accurately detect objects in real-time, streamline your workflows, and achieve new levels of accuracy in your projects.

Check out our YOLOv8 Docs for details and get started with:

pip install ultralytics

@glenn-jocher (Member) commented

@RAHUL01-09 hello! Thanks for the thorough detailing of the issue you're encountering. 😊

From the traceback provided, it looks like the error originates from how the mixup data augmentation is applied. Specifically, the error message `ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()` hints at a condition in the code expecting a single boolean value but receiving an array instead.
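
A minimal sketch of that failure mode, assuming `self.indices` is a NumPy array and Python 3.11 is used (where `random.choice()` starts with the `if not seq:` check shown at random.py line 369 in the traceback):

```python
import random

import numpy as np

# Stand-in for self.indices; the assumption here is that it is a NumPy array
# rather than a plain list or range.
indices = np.arange(128)

try:
    # On Python 3.11, random.choice() truth-tests the sequence ("if not seq:"),
    # which is ambiguous for a multi-element NumPy array.
    random.choice(indices)
except ValueError as err:
    print(err)  # "The truth value of an array with more than one element is ambiguous..."
```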

As your environment information and error traces don’t directly suggest an issue with NVIDIA drivers or CUDA versions, and considering this seems to be related only to the Python code execution, my suggestion would be to:

  1. Ensure your dataset is correctly structured as per the YOLOv5 documentation recommendations.
  2. Try running the training without the --evolve flag to see whether the issue persists (see the example command after this list). This can help narrow down whether the problem is specifically tied to the hyperparameter evolution process.
  3. Double-check that the versions of Python and PyTorch you're using are well supported. Although unlikely to be the cause, it's always good to rule out compatibility issues.
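
For example, reusing the exact command from the report above but dropping the flag:

python train.py --img 640 --batch 4 --epochs 3 --data coco128.yaml --weights yolov5s.pt --cache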

If the problem persists without the `--evolve` flag, or the adjustments above don't help, a deeper dive into the mixup implementation within `utils/dataloaders.py` might be necessary. Given the specific nature of the error, you might want to check whether `self.indices` is an array-like object (for example a NumPy array) whose ambiguous truth value trips up `random.choice()` during the mixup data preparation.
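
As a sketch only, assuming the root cause is `random.choice()` truth-testing a NumPy array of indices, one way to side-step the check is to draw a plain integer position instead (or to convert `self.indices` to a Python list once when the dataset is built). The helper name `pick_index` below is hypothetical, not part of the YOLOv5 codebase:

```python
import random

import numpy as np


def pick_index(indices):
    """Return a random element without truth-testing the container.

    Hypothetical helper: works the same for lists, ranges, and NumPy arrays,
    because it never evaluates the container in a boolean context.
    """
    return indices[random.randrange(len(indices))]


# Usage sketch for the mixup line in LoadImagesAndLabels.__getitem__ (dataloaders.py):
#   img, labels = mixup(img, labels, *self.load_mosaic(pick_index(self.indices)))
print(pick_index(np.arange(128)))  # stand-in for self.indices
```

Equivalently, setting `self.indices = list(self.indices)` at construction time would keep the existing `random.choice(self.indices)` call working, since truth-testing a list is unambiguous.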

I hope this helps! If you're continuing to run into issues, please ensure all relevant details are updated in your report or comments to aid further diagnostics. Thanks for contributing to the YOLOv5 community!

github-actions bot (Contributor) commented May 9, 2024

👋 Hello there! We wanted to give you a friendly reminder that this issue has not had any recent activity and may be closed soon, but don't worry - you can always reopen it if needed. If you still have any questions or concerns, please feel free to let us know how we can help.

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLO 🚀 and Vision AI ⭐

github-actions bot added the Stale label May 9, 2024
github-actions bot closed this as not planned May 19, 2024