Queries related to Ultralytics YOLOv9 to detect multiple video feed and prediction #14021
@itskhaledmohammad hi there, thank you for reaching out with your detailed questions! Let's address each of your queries to help you achieve your objectives with YOLOv9.

**1. Multiple Camera Inputs and Combined Predictions**

To handle multiple camera inputs and combine predictions, you can capture each feed separately, run inference on every frame, and merge the per-camera results. Here's a basic example in Python:

```python
import cv2
from ultralytics import YOLO

# Initialize the YOLOv9 model (e.g. the 'yolov9c.pt' checkpoint)
model = YOLO('yolov9c.pt')

# Capture from multiple cameras
cap1 = cv2.VideoCapture(0)
cap2 = cv2.VideoCapture(1)
cap3 = cv2.VideoCapture(2)

while True:
    ret1, frame1 = cap1.read()
    ret2, frame2 = cap2.read()
    ret3, frame3 = cap3.read()

    if ret1 and ret2 and ret3:
        # Run inference on each frame
        results1 = model(frame1)
        results2 = model(frame2)
        results3 = model(frame3)

        # Combine results (implement your merging logic here)
        combined_results = merge_results([results1, results2, results3])

        # Display results
        for result in combined_results:
            result.show()

    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap1.release()
cap2.release()
cap3.release()
cv2.destroyAllWindows()
```

You will need to implement the `merge_results` function with whatever cross-camera merging logic fits your setup.

**2. Background Color for Training**

A consistent background can indeed help with training. Black is a good choice, but other solid colors like white can also work well. The key is consistency. Stripes or patterns might introduce unnecessary complexity.

**3. Number of Images and Augmentation**

While 1500 images per class is ideal, you can start with fewer images and use augmentation to increase your effective dataset size. Techniques like rotation, scaling, flipping, and color jittering can help. Here's an example using Albumentations:

```python
import albumentations as A
from albumentations.pytorch import ToTensorV2

transform = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.RandomBrightnessContrast(p=0.2),
    A.Rotate(limit=15, p=0.5),
    A.Resize(640, 640),
    ToTensorV2()
])
```

**4. Training with Single Instances**

Training with single instances is fine. Ensure your dataset also includes images with multiple instances to help the model generalize. You can manually create such images or use augmentation to simulate them.

**5. Image Resolution**

Training on 1920x1080 images is feasible but requires more computational resources. You can downscale images to a resolution like 640x640, which is commonly used in YOLO models, to speed up training and reduce memory usage.

**6. Using Depth Data**

Depth data from OAK-D cameras can improve detection, especially for occlusions and overlapping objects. You can incorporate depth information as an additional input to your model, or use it to filter and refine detections post-inference.

For further details and examples, you can refer to our FAQ. I hope this helps! Feel free to ask if you have more questions. 😊
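Since `merge_results` is left to you, here is one hypothetical sketch of a merging strategy, using plain `(class_name, confidence)` tuples to stand in for the detections you would extract from the Ultralytics `Results` objects (e.g. from `result.boxes`). It keeps the best confidence per class across cameras, so an object seen clearly by any one camera survives the merge:

```python
# Hypothetical cross-camera merge: each camera contributes a list of
# (class_name, confidence) detections; keep the highest confidence
# reported for each class by any camera.
def merge_results(per_camera_detections):
    best = {}
    for detections in per_camera_detections:
        for class_name, confidence in detections:
            if confidence > best.get(class_name, 0.0):
                best[class_name] = confidence
    # Return merged detections sorted by confidence, highest first
    return sorted(best.items(), key=lambda item: item[1], reverse=True)

cam1 = [("beverage_can", 0.35), ("chips", 0.88)]  # can barely visible from the top
cam2 = [("beverage_can", 0.91)]                   # side view sees the can clearly
cam3 = []                                         # nothing confident from camera 3
print(merge_results([cam1, cam2, cam3]))
# → [('beverage_can', 0.91), ('chips', 0.88)]
```

If you need fused bounding boxes rather than just class-level votes, you would instead map each camera's boxes into a common coordinate frame and apply cross-view NMS, which requires calibrated camera extrinsics.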
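To make the "filter and refine detections post-inference" idea from point 6 concrete, here is a hypothetical sketch: since the tray sits at a roughly known distance from each OAK-D camera, you can reject boxes whose median depth falls outside the expected tray range (the box format, depth units, and the 300-900 mm window below are all assumptions for illustration):

```python
import numpy as np

# Hypothetical post-inference filter: keep only boxes whose median depth,
# sampled from an aligned OAK-D depth map (in millimetres), falls inside
# the expected tray distance range. Box format: (x1, y1, x2, y2, conf).
def filter_by_depth(boxes, depth_map, min_mm=300, max_mm=900):
    kept = []
    for x1, y1, x2, y2, conf in boxes:
        region = depth_map[y1:y2, x1:x2]
        valid = region[region > 0]  # 0 means "no depth measured" on OAK-D
        if valid.size and min_mm <= np.median(valid) <= max_mm:
            kept.append((x1, y1, x2, y2, conf))
    return kept

# Toy depth map: tray plane at ~600 mm, background at ~2000 mm
depth = np.full((480, 640), 2000, dtype=np.uint16)
depth[100:300, 100:300] = 600

on_tray  = (120, 120, 200, 200, 0.9)  # lies on the tray plane -> kept
far_away = (400, 350, 500, 450, 0.8)  # background false positive -> dropped
print(filter_by_depth([on_tray, far_away], depth))
# → [(120, 120, 200, 200, 0.9)]
```

The median (rather than mean) makes the check robust to invalid-depth pixels and to background visible around an object's edges.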
Hi everyone,
So I need your help on this.
The what and the why:
What I am trying to achieve is this: I will take video feeds from 3 inputs, and YOLO will detect the possible items on the tray. An item might be visible in one camera but invisible or only partially visible in another, and YOLO might be unsure about an object in one camera while detecting the same object with high confidence from another. For example, a beverage can might not be detectable from the top (you can only see the opening, and most beverage cans have similar-looking openings), but it may be detectable from another angle where the body is visible. Hence the multiple cameras.
Questions:
1. How do I take input from multiple cameras and predict the objects by combining all 3 results? What procedure should I follow?
2. As you can see in the picture, I tried to keep the background a constant black, hoping that would help with training and prediction. Is there any particular background other than black that might help, like white or stripes?
3. Since I have a fixed background, how many pictures of each object do I need? The documentation says > 1500 images per class, but given my scenario do I need that many? Taking 1500 images of each food item would be really hard every time. If I take around 36-40 pictures of each item (each item will be a class) and then augment them, would that work? If yes, what type of augmentation do you recommend?
4. Also, when training on a new item, I would like to take multiple images of a single item, so there would be a single instance of that item in every image, but items will appear together in the real scenario, say two instances of a pack of chips plus beverages. Would that be a problem, and is there a particular way to train for that?
5. The cameras are 1080p, so the images I am training on are 1920x1080. Is that a problem, or will I have to downscale them? If so, to what resolution?
6. I am using OAK-D cameras, which are stereo cameras. Can I use the depth data, or any other data they provide, to improve detection in my case?

I did go through the Ultralytics documentation, which is where these questions came up. Thanks in advance; hoping to get guidelines from you guys.