Default DataLoader `shuffle=True` for training #5623

werner-duvaud · 2021-11-12T05:56:41Z

If issue #5622 is relevant, I have added a shuffle argument to create_dataloader. It defaults to False and is set to True only when creating the training dataloader.
This shuffle parameter is then passed to either the DataLoader or the DistributedSampler to keep the behaviour consistent. (In the PyTorch docs, shuffle defaults to False for DataLoader and to True for DistributedSampler.)
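
A minimal sketch of the idea described above, assuming PyTorch is installed. `make_loader` is a hypothetical helper written for illustration, not the `create_dataloader` function from utils/datasets.py. `num_replicas` and `rank` are passed explicitly so the sampler can be built without initialising a process group.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler


def make_loader(dataset, batch_size, shuffle=False, rank=-1, world_size=1):
    """Hypothetical helper: route one shuffle flag to the right component."""
    sampler = None
    if rank != -1:
        # Distributed case: the sampler owns the shuffling.
        sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank, shuffle=shuffle)
    # DataLoader rejects shuffle=True together with a sampler,
    # so shuffle is only forwarded when no sampler is used.
    return DataLoader(dataset, batch_size=batch_size, shuffle=shuffle and sampler is None, sampler=sampler)


dataset = TensorDataset(torch.arange(8))
val_loader = make_loader(dataset, batch_size=4)                  # sequential order
train_loader = make_loader(dataset, batch_size=4, shuffle=True)  # random order each epoch
ddp_loader = make_loader(dataset, batch_size=4, shuffle=True, rank=0, world_size=2)
```

With a single explicit flag, the single-process and distributed paths behave the same way instead of relying on the two different library defaults.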

🛠️ PR Summary

Made with ❤️ by Ultralytics Actions

🌟 Summary

Enhanced data loading with shuffle option for YOLOv5 training.

📊 Key Changes

train.py now passes a shuffle=True argument to the data loader, enabling data shuffling during training.
create_dataloader function in utils/datasets.py now accepts a shuffle argument.
Rectangular training (rect=True) and shuffling are made mutually exclusive to avoid potential conflicts.
Refactored the use of DataLoader and InfiniteDataLoader to allow attribute updates, with a conditional use of DistributedSampler based on rank.
Adjusted imports in utils/datasets.py for clearer code structure.

🎯 Purpose & Impact

Shuffling the training data each epoch is a common technique that can improve generalization and reduce overfitting.
By preventing the incompatible configuration of rectangular batches with shuffling, the PR ensures stable training behavior.
The changes enhance scalability and usability for distributed training scenarios, potentially improving user experiences during training on various hardware setups.
Overall, these modifications might contribute to more robust model performance and smoother user experiences with the YOLOv5 training pipeline.

github-actions

👋 Hello @werner-duvaud, thank you for submitting a 🚀 PR! To allow your work to be integrated as seamlessly as possible, we advise you to:

✅ Verify your PR is up-to-date with upstream/master. If your PR is behind upstream/master, an automatic GitHub Actions rebase may be attempted by including the /rebase command in a comment body, or by running the following code, replacing 'feature' with the name of your local branch:

git remote add upstream https://github.com/ultralytics/yolov5.git
git fetch upstream
git checkout feature  # <----- replace 'feature' with local branch name
git merge upstream/master
git push -u origin -f

✅ Verify all Continuous Integration (CI) checks are passing.
✅ Reduce changes to the absolute minimum required for your bug fix or feature addition. "It is not daily increase but daily decrease, hack away the unessential. The closer to the source, the less wastage there is." -Bruce Lee

glenn-jocher · 2021-11-12T12:43:01Z

@werner-duvaud one conflict here is --rect, which forces single-image (no mosaic) minimum rectangular batch sizes. This precomputes batch dimensions based on the dataset sorted by aspect ratio. val.py uses this by default, for example, but it's a setting we also have in train.py.

So we need something like `shuffle = shuffle and not rect` in place.
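
To illustrate why rectangular batching and shuffling conflict, here is a simplified sketch, not the code in utils/datasets.py. `rect_batch_shapes`, its parameters, and the default `img_size` and `stride` values are assumptions chosen for the example. Exact fractions are used only to keep the rounding deterministic.

```python
import math
from fractions import Fraction


def rect_batch_shapes(sizes, batch_size, img_size=640, stride=32):
    """Return (order, shapes) for a list of (width, height) image sizes."""
    # Aspect ratio is height / width, kept exact to avoid float rounding.
    ratios = [Fraction(h, w) for w, h in sizes]
    # Images are visited in aspect-ratio order, not in random order.
    order = sorted(range(len(sizes)), key=lambda i: ratios[i])
    shapes = []
    for start in range(0, len(order), batch_size):
        batch = [ratios[i] for i in order[start:start + batch_size]]
        lo, hi = min(batch), max(batch)
        if hi < 1:        # every image is wide: shrink the height
            h, w = hi, Fraction(1)
        elif lo > 1:      # every image is tall: shrink the width
            h, w = Fraction(1), 1 / lo
        else:             # mixed batch: fall back to a square
            h, w = Fraction(1), Fraction(1)
        # Round each side up to a multiple of the stride.
        shapes.append((math.ceil(h * img_size / stride) * stride,
                       math.ceil(w * img_size / stride) * stride))
    return order, shapes


sizes = [(640, 480), (480, 640), (1280, 720), (600, 800)]
order, shapes = rect_batch_shapes(sizes, batch_size=2)
print(order)   # [2, 0, 1, 3]
print(shapes)  # [(480, 640), (640, 480)]
```

Each precomputed shape only fits the images at fixed positions in the sorted order, so drawing indices at random would pair images with the wrong batch shape.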

werner-duvaud · 2021-11-12T14:21:08Z

@glenn-jocher Thanks!

I have added `and not rect`.

The downside could be that when passing shuffle=True to create_dataloader, one expects random indices, but if rect is True, shuffle is overridden. The only shuffle-like effect then comes from the sort by aspect ratio, which is not fully random in some cases.
It seems OK to me given the purpose of the rect argument. If necessary, I can add a log warning when both shuffle and rect are set to True in create_dataloader, to guard against future modifications.
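
The override plus warning discussed here could look roughly like the sketch below. `resolve_shuffle` and the message wording are illustrative assumptions, not necessarily what was merged.

```python
import logging

LOGGER = logging.getLogger(__name__)


def resolve_shuffle(rect: bool, shuffle: bool) -> bool:
    """Hypothetical helper: return the shuffle value that is actually used."""
    if rect and shuffle:
        # Rectangular batches depend on the aspect-ratio ordering, so rect wins.
        LOGGER.warning("--rect is incompatible with DataLoader shuffle, setting shuffle=False")
        return False
    return shuffle
```

Logging the override makes the silent `shuffle and not rect` behaviour visible to anyone who passes both options.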

(I have rebased to make the CI happy)

glenn-jocher · 2021-11-12T18:04:25Z

@werner-duvaud yes, good idea. Can you add a logger warning for this case?

EDIT: added this update and a bit of cleanup myself in 89abf9e


glenn-jocher · 2021-11-13T11:28:46Z

@werner-duvaud cleaned this up a bit and added a rect-shuffle conflict warning and handling. Evaluating now on VOC. Also linked to #2582 to close that (long-running) TODO.

glenn-jocher · 2021-11-13T12:07:37Z

@werner-duvaud PR is merged. Thank you for your contributions to YOLOv5 🚀 and Vision AI ⭐

glenn-jocher · 2021-11-13T12:10:46Z

@werner-duvaud I evaluated this PR against master on VOC finetuning for 50 epochs, and the results show a slight improvement in most metrics and losses, particularly in objectness loss and mAP@0.5, perhaps indicating that the shuffle addition may help delay overtraining.

https://wandb.ai/glenn-jocher/VOC

* Fix shuffle DataLoader argument
* Add shuffle argument
* Disable shuffle when rect
* Cleanup, add rect warning
* [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)
* Cleanup2
* Cleanup3

Co-authored-by: Glenn Jocher <glenn.jocher@ultralytics.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

github-actions bot reviewed Nov 12, 2021


werner-duvaud added 3 commits November 12, 2021 15:02

Fix shuffle DataLoader argument

20fb666

Add shuffle argument

d8fd61d

Disable shuffle when rect

c363fb7

glenn-jocher linked an issue Nov 13, 2021 that may be closed by this pull request

Dataset not shuffled (when not in DDP mode) #5622

Closed

2 tasks

glenn-jocher changed the title ~~Fix shuffle DataLoader argument Fix #5622~~ Default DataLoader shuffle=True for training Nov 13, 2021

glenn-jocher assigned werner-duvaud Nov 13, 2021

glenn-jocher and others added 3 commits November 13, 2021 12:19

Cleanup, add rect warning

89abf9e

[pre-commit.ci] auto fixes from pre-commit.com hooks

83fbe49

for more information, see https://pre-commit.ci

Cleanup2

1e3f48c

glenn-jocher linked an issue Nov 13, 2021 that may be closed by this pull request

shuffle index #2582

Closed

Cleanup3

1873938

glenn-jocher merged commit 09d1703 into ultralytics:master Nov 13, 2021

glenn-jocher mentioned this pull request Nov 13, 2021

[BUG] Data shuffling is off by default outside of mosaic shuffling for training in the dataloader #4961

Closed

glenn-jocher mentioned this pull request Feb 22, 2022

YOLOv5 v6.1 release #6739

Merged
