Combining Two Datasets in YOLO #754

Closed
1 task done
jaisenbe58r opened this issue Jul 2, 2024 · 8 comments
Labels: fixed (Bug is resolved), question (A HUB question that does not involve a bug)

Comments

@jaisenbe58r

Search before asking

Question

Hello Ultralytics Team,

I am currently working on a project using YOLO and I have two separate datasets that I would like to combine for training and validation purposes. I would like to know if it is possible to combine these datasets, and if so, how should I structure my data.yml file to properly reference both datasets?

Example Details:

  • Dataset 1:
    • Images: /path/to/datasets/dataset1/images/
    • Labels: /path/to/datasets/dataset1/labels/
  • Dataset 2:
    • Images: /path/to/datasets/dataset2/images/
    • Labels: /path/to/datasets/dataset2/labels/

Folder Structure:

/dataset/detect/
├── dataset1/
│   ├── images/
│   │   ├── train/
│   │   ├── val/
│   └── labels/
│       ├── train/
│       ├── val/
├── dataset2/
│   ├── images/
│   │   ├── train/
│   │   ├── val/
│   └── labels/
│       ├── train/
│       ├── val/

Proposed data.yml Configuration:

# File: data.yml

train: 
  - dataset1/images/train
  - dataset2/images/train
val: 
  - dataset1/images/val
  - dataset2/images/val

# Number of classes
nc: <number_of_classes>

# Class names
names: ['class1', 'class2', 'class3', ...]

Could you please confirm if this setup is correct and whether it is possible to combine datasets in this manner? If there are any additional steps or considerations I need to take into account, please let me know.

Thank you for your assistance!

Best regards,
Jaime

Additional

No response

@jaisenbe58r added the question label Jul 2, 2024

github-actions bot commented Jul 2, 2024

👋 Hello @jaisenbe58r, thank you for raising an issue about Ultralytics HUB 🚀! Please visit our HUB Docs to learn more:

  • Quickstart. Start training and deploying YOLO models with HUB in seconds.
  • Datasets: Preparing and Uploading. Learn how to prepare and upload your datasets to HUB in YOLO format.
  • Projects: Creating and Managing. Group your models into projects for improved organization.
  • Models: Training and Exporting. Train YOLOv5 and YOLOv8 models on your custom datasets and export them to various formats for deployment.
  • Integrations. Explore different integration options for your trained models, such as TensorFlow, ONNX, OpenVINO, CoreML, and PaddlePaddle.
  • Ultralytics HUB App. Learn about the Ultralytics App for iOS and Android, which allows you to run models directly on your mobile device.
    • iOS. Learn about YOLO CoreML models accelerated on Apple's Neural Engine on iPhones and iPads.
    • Android. Explore TFLite acceleration on mobile devices.
  • Inference API. Understand how to use the Inference API for running your trained models in the cloud to generate predictions.

If this is a 🐛 Bug Report, please provide screenshots and steps to reproduce your problem to help us get started working on a fix.

If this is a ❓ Question, please provide as much information as possible, including dataset, model, environment details etc. so that we might provide the most helpful response.

We try to respond to all issues as promptly as possible. Thank you for your patience!

@sergiuwaxmann
Member

@jaisenbe58r Hello!
Unfortunately, the structure you suggested won't work. You need to create a new dataset that follows the structure described in our documentation.

@sergiuwaxmann self-assigned this Jul 2, 2024
@sergiuwaxmann added the fixed label Jul 2, 2024
@jaisenbe58r
Author

Hello @sergiuwaxmann,

That's unfortunate to hear. Thank you for your clarification and assistance.

Best regards,
Jaime.

@pderrenger
Member

Hello @jaisenbe58r,

Thank you for your understanding. To combine your datasets effectively, you can create a new dataset directory that includes the images and labels from both datasets. Here's a step-by-step guide to help you structure your combined dataset:

  1. Create a New Directory: Create a new directory, say combined_dataset/, and within it, create subdirectories for images/train, images/val, labels/train, and labels/val.

  2. Merge Images and Labels: Copy the images and labels from both datasets into the respective subdirectories of combined_dataset/.

  3. Update the YAML File: Create a new YAML file for the combined dataset. Here's an example configuration:

# File: combined_dataset.yaml

path: ../datasets/combined_dataset  # root directory of the combined dataset
train: images/train  # train images (relative to 'path')
val: images/val  # val images (relative to 'path')

# Number of classes
nc: <number_of_classes>

# Class names
names: ['class1', 'class2', 'class3', ...]

  4. Zip and Validate the Dataset: Zip the combined_dataset/ directory, then validate the archive before uploading to Ultralytics HUB to ensure it's correctly formatted:

from ultralytics.hub import check_dataset

# Checks the zipped dataset; task="detect" matches an object detection dataset
check_dataset("path/to/combined_dataset.zip", task="detect")

  5. Upload to Ultralytics HUB: Upload the zipped dataset to Ultralytics HUB following the instructions in our documentation.

By following these steps, you can seamlessly combine your datasets and utilize them for training and validation.
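
As a rough illustration of steps 1, 2, and the zipping part of the steps above, here is a minimal Python sketch. The function names merge_datasets and zip_dataset are hypothetical helpers, not part of the ultralytics package. Prefixing each file with its dataset's directory name is an assumption added here so that identically named files from the two datasets do not overwrite each other; images and labels receive the same prefix, so each image keeps its matching label.

```python
import shutil
from pathlib import Path

SPLITS = ("train", "val")
KINDS = ("images", "labels")


def merge_datasets(sources, dest):
    """Copy images and labels from each source dataset into dest.

    Every file name is prefixed with the source directory name
    (e.g. dataset1_0001.jpg) to avoid collisions between datasets.
    Returns the number of files copied.
    """
    dest = Path(dest)
    copied = 0
    for src in map(Path, sources):
        for kind in KINDS:
            for split in SPLITS:
                src_dir = src / kind / split
                if not src_dir.is_dir():
                    continue  # skip splits a dataset does not have
                out_dir = dest / kind / split
                out_dir.mkdir(parents=True, exist_ok=True)
                for f in sorted(src_dir.iterdir()):
                    if f.is_file():
                        shutil.copy2(f, out_dir / f"{src.name}_{f.name}")
                        copied += 1
    return copied


def zip_dataset(dest):
    """Create <dest>.zip next to dest, with the dataset directory at the archive root."""
    dest = Path(dest).resolve()
    archive = shutil.make_archive(
        str(dest), "zip", root_dir=dest.parent, base_dir=dest.name
    )
    return Path(archive)


# Example usage (paths are placeholders):
# merge_datasets(["/path/to/datasets/dataset1", "/path/to/datasets/dataset2"],
#                "/path/to/datasets/combined_dataset")
# zip_dataset("/path/to/datasets/combined_dataset")
```

This sketch assumes both datasets already use the same class indices in their label files. If they do not, the labels need to be remapped before merging.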

If you encounter any issues or need further assistance, feel free to reach out. We're here to help!

@jaisenbe58r
Author

Hello @pderrenger,

Thank you for the detailed explanation and the step-by-step guide on combining the datasets. It is very helpful.

I have a follow-up question: Is there no way to structure the data.yml file by adding multiple paths for the images, something like this?

# File: combined_dataset.yaml

path: ../datasets/combined_dataset  # root directory of the combined dataset
train: 
  - dataset1/images/train
  - dataset2/images/train
val: 
  - dataset1/images/val
  - dataset2/images/val

# Number of classes
nc: <number_of_classes>

# Class names
names: ['class1', 'class2', 'class3', ...]

Would this approach not be possible or supported by the Ultralytics framework? It would be more convenient to maintain the original datasets separately.

Thank you again for your assistance!

Best regards,
Jaime.

@pderrenger
Member

Hello @jaisenbe58r,

Thank you for your kind words and for the follow-up question!

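Currently, Ultralytics HUB does not support specifying multiple paths for the train and val entries in the dataset YAML file; it expects a single path for each, so the approach you suggested would not work there. (For local training with the ultralytics package, the header comments in the bundled dataset YAML files such as coco8.yaml do describe a list form, [path/to/imgs1, path/to/imgs2, ..], but that does not carry over to HUB uploads.)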

To maintain the original datasets separately while still combining them for training, you can use symbolic links (symlinks) to create a unified directory structure without duplicating the data. Here's how you can do it:

  1. Create a New Directory: Create a new directory, say combined_dataset/, and within it, create subdirectories for images/train, images/val, labels/train, and labels/val.

  2. Create Symlinks: Use symlinks to link the images and labels from both datasets into the respective subdirectories of combined_dataset/. Here’s an example of how to create symlinks in a Unix-based system:

# Create symlinks for training images
ln -s /path/to/datasets/dataset1/images/train/* /path/to/combined_dataset/images/train/
ln -s /path/to/datasets/dataset2/images/train/* /path/to/combined_dataset/images/train/

# Create symlinks for validation images
ln -s /path/to/datasets/dataset1/images/val/* /path/to/combined_dataset/images/val/
ln -s /path/to/datasets/dataset2/images/val/* /path/to/combined_dataset/images/val/

# Create symlinks for training labels
ln -s /path/to/datasets/dataset1/labels/train/* /path/to/combined_dataset/labels/train/
ln -s /path/to/datasets/dataset2/labels/train/* /path/to/combined_dataset/labels/train/

# Create symlinks for validation labels
ln -s /path/to/datasets/dataset1/labels/val/* /path/to/combined_dataset/labels/val/
ln -s /path/to/datasets/dataset2/labels/val/* /path/to/combined_dataset/labels/val/

  3. Update the YAML File: Use the same YAML configuration as before, pointing to the combined_dataset/ directory:

# File: combined_dataset.yaml

path: ../datasets/combined_dataset  # root directory of the combined dataset
train: images/train  # train images (relative to 'path')
val: images/val  # val images (relative to 'path')

# Number of classes
nc: <number_of_classes>

# Class names
names: ['class1', 'class2', 'class3', ...]

This way, you can maintain the original datasets separately while creating a unified structure for training purposes.
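
The symlink steps above can also be sketched in Python, which may be convenient when the two datasets share file names: in that case the ln -s commands fail with "File exists" for the second dataset. The helper link_datasets below is hypothetical (not part of the ultralytics package). It keeps the original file names, as the shell commands do, and returns the links it had to skip so collisions are visible rather than silent.

```python
import os
from pathlib import Path


def link_datasets(sources, dest, splits=("train", "val")):
    """Symlink every image and label file from each source dataset into dest.

    Returns the paths of links that were skipped because a file or link
    with the same name already existed in the combined directory.
    """
    dest = Path(dest)
    skipped = []
    for src in map(Path, sources):
        for kind in ("images", "labels"):
            for split in splits:
                src_dir = src / kind / split
                if not src_dir.is_dir():
                    continue  # skip splits a dataset does not have
                out_dir = dest / kind / split
                out_dir.mkdir(parents=True, exist_ok=True)
                for f in sorted(src_dir.iterdir()):
                    link = out_dir / f.name
                    if link.exists() or link.is_symlink():
                        skipped.append(str(link))  # name collision
                        continue
                    # Link to the absolute path so the link stays valid
                    # regardless of the current working directory.
                    os.symlink(f.resolve(), link)
    return skipped


# Example usage (paths are placeholders):
# collisions = link_datasets(
#     ["/path/to/datasets/dataset1", "/path/to/datasets/dataset2"],
#     "/path/to/combined_dataset",
# )
# print(f"{len(collisions)} files skipped due to name collisions")
```

If the returned list is not empty, renaming the colliding files (for example with a per-dataset prefix applied to both the image and its label) is one way to resolve it.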

If you have any further questions or need additional assistance, feel free to ask. We're here to help!

@jaisenbe58r
Author

Hello @pderrenger,

Thank you for your detailed response and for clarifying the limitations of the current Ultralytics framework.

I appreciate your suggestion to use symbolic links to create a unified directory structure without duplicating data. This solution effectively addresses my concern about maintaining the original datasets separately while combining them for training.

I will proceed with creating the symlinks and updating the YAML file as you recommended.

Thank you once again for your assistance and support!

Best regards,
Jaime.

@pderrenger
Member

Hello @jaisenbe58r,

You're very welcome! I'm glad to hear that the suggestion to use symbolic links was helpful and addresses your concern about maintaining the original datasets separately.

If you encounter any issues while setting up the symlinks or have any further questions as you proceed, please don't hesitate to reach out. We're here to support you every step of the way.

Additionally, if you haven't already, please ensure that you're using the latest versions of the Ultralytics packages to benefit from the latest features and bug fixes. This can help avoid any potential issues that may have already been resolved in recent updates.

For any future issues or questions, providing a minimum reproducible example can greatly assist us in diagnosing and resolving your concerns more efficiently. You can find more information on creating such examples here.

Thank you for your engagement and for being a part of the YOLO community. Happy training!


3 participants