Queries related to Ultralytics YOLOv9 to detect multiple video feed and prediction #14021
@itskhaledmohammad hi there, thank you for reaching out with your detailed questions! Let's address each of your queries to help you achieve your objectives with YOLOv9.

**1. Multiple Camera Inputs and Combined Predictions**

To handle multiple camera inputs and combine predictions, you can capture each feed separately, run inference on every frame, and merge the per-camera results. Here's a basic example in Python:

```python
import cv2
from ultralytics import YOLO

# Initialize the YOLOv9 model (e.g. the 'yolov9c.pt' checkpoint)
model = YOLO('yolov9c.pt')

# Capture from multiple cameras
cap1 = cv2.VideoCapture(0)
cap2 = cv2.VideoCapture(1)
cap3 = cv2.VideoCapture(2)

while True:
    ret1, frame1 = cap1.read()
    ret2, frame2 = cap2.read()
    ret3, frame3 = cap3.read()

    if ret1 and ret2 and ret3:
        # Run inference on each frame
        results1 = model(frame1)
        results2 = model(frame2)
        results3 = model(frame3)

        # Combine results (implement your merging logic here)
        combined_results = merge_results([results1, results2, results3])

        # Display results
        for result in combined_results:
            result.show()

    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap1.release()
cap2.release()
cap3.release()
cv2.destroyAllWindows()
```

You will need to implement the `merge_results` function with whatever cross-camera merging logic fits your setup.

**2. Background Color for Training**

A consistent background can indeed help with training. Black is a good choice, but other solid colors like white can also work well. The key is consistency. Stripes or patterns might introduce unnecessary complexity.

**3. Number of Images and Augmentation**

While 1500 images per class is ideal, you can start with fewer images and use augmentation to increase your effective dataset size. Techniques like rotation, scaling, flipping, and color jittering can help. Here's an example using Albumentations:

```python
import albumentations as A
from albumentations.pytorch import ToTensorV2

transform = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.RandomBrightnessContrast(p=0.2),
    A.Rotate(limit=15, p=0.5),
    A.Resize(640, 640),
    ToTensorV2()
])
```

**4. Training with Single Instances**

Training with single instances is fine. Ensure your dataset also includes images with multiple instances to help the model generalize. You can manually create such images or use augmentation to simulate them.

**5. Image Resolution**

Training on 1920x1080 images is feasible but requires more computational resources. You can downscale images to a resolution like 640x640, which is commonly used in YOLO models, to speed up training and reduce memory usage.

**6. Using Depth Data**

Depth data from OAK-D cameras can improve detection, especially for occlusions and overlapping objects. You can incorporate depth information as an additional input to your model, or use it to filter and refine detections post-inference.

For further details and examples, you can refer to our FAQ. I hope this helps! Feel free to ask if you have more questions. 😊
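Since `merge_results` is left to you, here is one hypothetical sketch of a merging strategy, using plain `(class_name, confidence)` tuples to stand in for the detections you would extract from the Ultralytics `Results` objects (e.g. from `result.boxes`). It keeps the best confidence per class across cameras, so an object seen clearly by any one camera survives the merge:

```python
# Hypothetical cross-camera merge: each camera contributes a list of
# (class_name, confidence) detections; keep the highest confidence
# reported for each class by any camera.
def merge_results(per_camera_detections):
    best = {}
    for detections in per_camera_detections:
        for class_name, confidence in detections:
            if confidence > best.get(class_name, 0.0):
                best[class_name] = confidence
    # Return merged detections sorted by confidence, highest first
    return sorted(best.items(), key=lambda item: item[1], reverse=True)

cam1 = [("beverage_can", 0.35), ("chips", 0.88)]  # can barely visible from the top
cam2 = [("beverage_can", 0.91)]                   # side view sees the can clearly
cam3 = []                                         # nothing confident from camera 3
print(merge_results([cam1, cam2, cam3]))
# → [('beverage_can', 0.91), ('chips', 0.88)]
```

If you need fused bounding boxes rather than just class-level votes, you would instead map each camera's boxes into a common coordinate frame and apply cross-view NMS, which requires calibrated camera extrinsics.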
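To make the "filter and refine detections post-inference" idea from point 6 concrete, here is a hypothetical sketch: since the tray sits at a roughly known distance from each OAK-D camera, you can reject boxes whose median depth falls outside the expected tray range (the box format, depth units, and the 300-900 mm window below are all assumptions for illustration):

```python
import numpy as np

# Hypothetical post-inference filter: keep only boxes whose median depth,
# sampled from an aligned OAK-D depth map (in millimetres), falls inside
# the expected tray distance range. Box format: (x1, y1, x2, y2, conf).
def filter_by_depth(boxes, depth_map, min_mm=300, max_mm=900):
    kept = []
    for x1, y1, x2, y2, conf in boxes:
        region = depth_map[y1:y2, x1:x2]
        valid = region[region > 0]  # 0 means "no depth measured" on OAK-D
        if valid.size and min_mm <= np.median(valid) <= max_mm:
            kept.append((x1, y1, x2, y2, conf))
    return kept

# Toy depth map: tray plane at ~600 mm, background at ~2000 mm
depth = np.full((480, 640), 2000, dtype=np.uint16)
depth[100:300, 100:300] = 600

on_tray  = (120, 120, 200, 200, 0.9)  # lies on the tray plane -> kept
far_away = (400, 350, 500, 450, 0.8)  # background false positive -> dropped
print(filter_by_depth([on_tray, far_away], depth))
# → [(120, 120, 200, 200, 0.9)]
```

The median (rather than mean) makes the check robust to invalid-depth pixels and to background visible around an object's edges.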
Hi everyone,
So I need your help on this.
The what and the why:
What I am trying to achieve is this: I will take video feeds from 3 inputs, and YOLO will detect the possible items on the tray. An item might be visible in one camera but invisible or only partially visible in another, and YOLO might be unsure about an object in one camera while detecting the same object with high confidence from another. For example, a beverage can might not be detectable from the top (you can only see the opening, and most beverage cans have similar-looking openings), but it may be detectable from another angle where the body is visible. Hence the multiple cameras.
Questions:
1. How do I take input from multiple cameras and predict the objects by combining all 3 results? What procedure should I follow?
2. As you can see in the picture, I tried to keep the background a constant black, hoping that would help with training and prediction. Is there any particular background other than black that might help, like white or stripes?
3. Since I have a fixed background, how many pictures of each object do I need? The documentation says > 1500 images per class, but given my scenario do I need that many? Taking 1500 images of each food item would be really hard every time. If I take around 36-40 pictures of each item (each item will be a class) and then augment them, would that work? If yes, what type of augmentation do you recommend?
4. Also, when training on a new item, I would like to take multiple images of a single item, so there would be a single instance of that item in every image, but items will appear together in the real scenario, say two instances of a pack of chips plus beverages. Would that be a problem, and is there a particular way to train for that?
5. The cameras are 1080p, so the images I am training on are 1920x1080. Is that a problem, or will I have to downscale them? If so, to what resolution?
6. I am using OAK-D cameras, which are stereo cameras. Can I use the depth data, or any other data they provide, to improve detection in my case?

I did go through the Ultralytics documentation, which is where these questions came up. Thanks in advance; hoping to get guidelines from you guys.