
Update DDP for torch.distributed.run with gloo backend #3680

Merged: 35 commits from DDP_run into master on Jun 19, 2021

Conversation

glenn-jocher
Member

glenn-jocher commented on Jun 18, 2021

@NanoCode012 I'm experimenting with a few DDP updates here.

🛠️ PR Summary

Made with ❤️ by Ultralytics Actions

🌟 Summary

Enhanced logging and introduced distributed training modifications.

📊 Key Changes

  • Introduced colorstr function to colorize the printed options in detect.py, export.py, and test.py.
  • Removed set_logging calls from conditionals to allow logging across different modes.
  • Added LOCAL_RANK, RANK, and WORLD_SIZE global variables for distributed data parallel (DDP) support in train.py (see the sketch after this list).
  • Refactored train.py to use DDP-friendly WORLD_SIZE, RANK, and LOCAL_RANK variables.
  • Modified create_dataloader in utils/datasets.py to support DDP by removing the world_size argument.
  • Added dist.barrier() calls in utils/torch_utils.py to synchronize DDP processes.
  • Updated the WandbLogger to handle DDP rank conditions using the environment variable RANK.
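
For context, torch.distributed.run passes rank information to each worker via environment variables rather than the --local_rank argument used by torch.distributed.launch. A minimal sketch of the retrieval pattern (variable names follow the PR description; the non-DDP defaults are assumptions):

```python
import os

# torch.distributed.run sets these for every worker process; the fallbacks
# cover single-GPU/CPU runs launched without the distributed launcher.
LOCAL_RANK = int(os.getenv('LOCAL_RANK', -1))  # GPU index on this node
RANK = int(os.getenv('RANK', -1))              # global process rank
WORLD_SIZE = int(os.getenv('WORLD_SIZE', 1))   # total number of processes
```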

🎯 Purpose & Impact

  • 📈 Improved Logging: Colorized output enhances readability, helping users track the progress and settings easily.
  • 💻 DDP Support: Changes create a solid foundation for distributed training, allowing efficient scaling across multiple GPUs and improving training times.
  • 🔄 Easier Collaboration: These updates could facilitate collaborative development and training on different systems or clusters.

@NanoCode012
Contributor

NanoCode012 commented Jun 18, 2021

Hi Glenn, a quick run from this branch on the Docker image (commit 382ce4f) gives me the output below. It somewhat trains, but stopped at the end.

Output
[INFO] 2021-06-18 12:16:14,143 run: Running torch.distributed.run with args: ['/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py', '--master_port', '9980', '--nproc_per_node', '2', 'train.py', '--weights', 'yolov5s.pt', '--epochs', '3', '--img', '320', '--device', '0,1']
[INFO] 2021-06-18 12:16:14,149 run: Using nproc_per_node=2.
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
[INFO] 2021-06-18 12:16:14,150 api: Starting elastic_operator with launch configs:
  entrypoint       : train.py
  min_nodes        : 1
  max_nodes        : 1
  nproc_per_node   : 2
  run_id           : none
  rdzv_backend     : static
  rdzv_endpoint    : 127.0.0.1:9980
  rdzv_configs     : {'rank': 0, 'timeout': 900}
  max_restarts     : 3
  monitor_interval : 5
  log_dir          : None
  metrics_cfg      : {}

[INFO] 2021-06-18 12:16:14,153 local_elastic_agent: log directory set to: /tmp/torchelastic_nxmozyoe/none_tmcpzgfk
[INFO] 2021-06-18 12:16:14,153 api: [default] starting workers for entrypoint: python
[INFO] 2021-06-18 12:16:14,153 api: [default] Rendezvous'ing worker group
[INFO] 2021-06-18 12:16:14,153 static_tcp_rendezvous: Creating TCPStore as the c10d::Store implementation
/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py:52: FutureWarning: This is an experimental API and will be changed in future.
  warnings.warn(
[INFO] 2021-06-18 12:16:14,159 api: [default] Rendezvous complete for workers. Result:
  restart_count=0
  master_addr=127.0.0.1
  master_port=9980
  group_rank=0
  group_world_size=1
  local_ranks=[0, 1]
  role_ranks=[0, 1]
  global_ranks=[0, 1]
  role_world_sizes=[2, 2]
  global_world_sizes=[2, 2]

[INFO] 2021-06-18 12:16:14,160 api: [default] Starting worker group
[INFO] 2021-06-18 12:16:14,160 __init__: Setting worker0 reply file to: /tmp/torchelastic_nxmozyoe/none_tmcpzgfk/attempt_0/0/error.json
[INFO] 2021-06-18 12:16:14,161 __init__: Setting worker1 reply file to: /tmp/torchelastic_nxmozyoe/none_tmcpzgfk/attempt_0/1/error.json
github: skipping check (Docker image), for updates see https://github.com/ultralytics/yolov5
{'RANK': 1, 'LOCAL_RANK': 1, 'WORLD_SIZE': 2}
YOLOv5 🚀 v5.0-223-g382ce4f torch 1.9.0+cu102 CUDA:0 (Tesla V100-SXM2-32GB, 32510.5MB)
                                              CUDA:1 (Tesla V100-SXM2-32GB, 32510.5MB)

{'RANK': 0, 'LOCAL_RANK': 0, 'WORLD_SIZE': 2}
Added key: store_based_barrier_key:1 to store for rank: 0
Rank 0: Completed store-based barrier for 2 nodes.
Namespace(adam=False, artifact_alias='latest', batch_size=8, bbox_interval=-1, bucket='', cache_images=False, cfg='', data='data/coco128.yaml', device='0,1', entity=None, epochs=3, evolve=False, exist_ok=False, hyp='data/hyp.scratch.yaml', image_weights=False, img_size=[320, 320], label_smoothing=0.0, linear_lr=False, multi_scale=False, name='exp', noautoanchor=False, nosave=False, notest=False, project='runs/train', quad=False, rect=False, resume=False, save_dir='runs/train/exp10', save_period=-1, single_cls=False, sync_bn=False, total_batch_size=16, upload_dataset=False, weights='yolov5s.pt', workers=8)
hyperparameters: lr0=0.01, lrf=0.2, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0
tensorboard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/
wandb: Install Weights & Biases for YOLOv5 logging with 'pip install wandb' (recommended)

                 from  n    params  module                                  arguments                     
  0                -1  1      3520  models.common.Focus                     [3, 32, 3]                    
  1                -1  1     18560  models.common.Conv                      [32, 64, 3, 2]                
  2                -1  1     18816  models.common.C3                        [64, 64, 1]                   
  3                -1  1     73984  models.common.Conv                      [64, 128, 3, 2]               
  4                -1  1    156928  models.common.C3                        [128, 128, 3]                 
  5                -1  1    295424  models.common.Conv                      [128, 256, 3, 2]              
  6                -1  1    625152  models.common.C3                        [256, 256, 3]                 
  7                -1  1   1180672  models.common.Conv                      [256, 512, 3, 2]              
  8                -1  1    656896  models.common.SPP                       [512, 512, [5, 9, 13]]        
  9                -1  1   1182720  models.common.C3                        [512, 512, 1, False]          
 10                -1  1    131584  models.common.Conv                      [512, 256, 1, 1]              
 11                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 12           [-1, 6]  1         0  models.common.Concat                    [1]                           
 13                -1  1    361984  models.common.C3                        [512, 256, 1, False]          
 14                -1  1     33024  models.common.Conv                      [256, 128, 1, 1]              
 15                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 16           [-1, 4]  1         0  models.common.Concat                    [1]                           
 17                -1  1     90880  models.common.C3                        [256, 128, 1, False]          
 18                -1  1    147712  models.common.Conv                      [128, 128, 3, 2]              
 19          [-1, 14]  1         0  models.common.Concat                    [1]                           
 20                -1  1    296448  models.common.C3                        [256, 256, 1, False]          
 21                -1  1    590336  models.common.Conv                      [256, 256, 3, 2]              
 22          [-1, 10]  1         0  models.common.Concat                    [1]                           
 23                -1  1   1182720  models.common.C3                        [512, 512, 1, False]          
 24      [17, 20, 23]  1    229245  models.yolo.Detect                      [80, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [128, 256, 512]]
/opt/conda/lib/python3.8/site-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at  /pytorch/c10/core/TensorImpl.h:1156.)
  return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
/opt/conda/lib/python3.8/site-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at  /pytorch/c10/core/TensorImpl.h:1156.)
  return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
Model Summary: 283 layers, 7276605 parameters, 7276605 gradients, 17.1 GFLOPs

Transferred 362/362 items from yolov5s.pt
Scaled weight_decay = 0.0005
Optimizer groups: 62 .bias, 62 conv.weight, 59 other
train: Scanning '../coco128/labels/train2017' images and labels...128 found, 0 missing, 2 empty, 0 corrupted: 100%|███████████████| 128/128 [00:00<00:00, 1334.50it/s]
train: New cache created: ../coco128/labels/train2017.cache
train: Scanning '../coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|██████████████████| 128/128 [00:00<?, ?it/s]
val: Scanning '../coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|████████████████████| 128/128 [00:00<?, ?it/s][W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
val: Scanning '../coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|████████████████████| 128/128 [00:00<?, ?it/s]
val: Scanning '../coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|████████████████████| 128/128 [00:00<?, ?it/s]
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
Plotting labels... 
val: Scanning '../coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|████████████████████| 128/128 [00:01<?, ?it/s]
val: Scanning '../coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|████████████████████| 128/128 [00:01<?, ?it/s]
val: Scanning '../coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|████████████████████| 128/128 [00:01<?, ?it/s]
val: Scanning '../coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|████████████████████| 128/128 [00:01<?, ?it/s]
val: Scanning '../coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|████████████████████| 128/128 [00:01<?, ?it/s]
val: Scanning '../coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|████████████████████| 128/128 [00:01<?, ?it/s]

autoanchor: Analyzing anchors... anchors/target = 3.97, Best Possible Recall (BPR) = 0.9580. Attempting to improve anchors, please wait...
autoanchor: WARNING: Extremely small objects found. 35 of 929 labels are < 3 pixels in size.
autoanchor: Running kmeans for 9 anchors on 927 points...
autoanchor: thr=0.25: 0.9623 best possible recall, 3.58 anchors past thr
autoanchor: n=9, img_size=320, metric_all=0.252/0.634-mean/best, past_thr=0.474-mean: 11,12,  31,34,  74,42,  46,87,  132,90,  78,161,  192,151,  173,273,  305,189
autoanchor: Evolving anchors with Genetic Algorithm: fitness = 0.6718: 100%|█████████████████████████████████████████████████████| 1000/1000 [00:01<00:00, 938.65it/s]
autoanchor: thr=0.25: 0.9925 best possible recall, 3.71 anchors past thr
autoanchor: n=9, img_size=320, metric_all=0.261/0.672-mean/best, past_thr=0.478-mean: 6,6,  10,13,  25,28,  64,45,  43,87,  64,130,  139,118,  184,194,  313,214
autoanchor: New anchors saved to model. Update model *.yaml to use these anchors in the future.

Image sizes 320 train, 320 test
Using 4 dataloader workers
Logging results to runs/train/exp10
Starting training for 3 epochs...

     Epoch   gpu_mem       box       obj       cls     total    labels  img_size
       0/2    0.644G   0.08607   0.03836   0.03987    0.1643       111       320:  12%|██████▏                                          | 1/8 [00:08<01:01,  8.78s/it]Reducer buckets have been rebuilt in this iteration.
       0/2     1.59G   0.07716   0.04663   0.03862    0.1624        86       320: 100%|█████████████████████████████████████████████████| 8/8 [00:11<00:00,  1.41s/it]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|███████████████████████████████████████████| 8/8 [00:03<00:00,  2.04it/s]
                 all        128        929      0.387      0.485      0.404      0.225

     Epoch   gpu_mem       box       obj       cls     total    labels  img_size
       1/2     1.59G   0.06844   0.04651   0.03561    0.1506        80       320: 100%|█████████████████████████████████████████████████| 8/8 [00:02<00:00,  3.29it/s]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|███████████████████████████████████████████| 8/8 [00:02<00:00,  3.32it/s]
                 all        128        929      0.453      0.488       0.46      0.275

     Epoch   gpu_mem       box       obj       cls     total    labels  img_size
       2/2     1.59G   0.06269   0.05292   0.03293    0.1485       150       320: 100%|█████████████████████████████████████████████████| 8/8 [00:02<00:00,  3.02it/s]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|███████████████████████████████████████████| 8/8 [00:04<00:00,  1.80it/s]
                 all        128        929       0.51       0.49      0.493        0.3
3 epochs completed in 0.012 hours.

Optimizer stripped from runs/train/exp10/weights/last.pt, 14.8MB
Optimizer stripped from runs/train/exp10/weights/best.pt, 14.8MB

I'll try to check your changes. Right now, I notice a few things.

  • my training got stuck on the last line with Optimizer stripped.... Perhaps a worker did not exit, or it relates to a wandb update that I haven't been following recently
Output when Ctrl+C
Optimizer stripped from runs/train/exp10/weights/best.pt, 14.8MB
^CTraceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 637, in <module>
    main()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 629, in main
    run(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 621, in run
    elastic_launch(
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 238, in launch_agent
    result = agent.run()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py", line 125, in wrapper
    result = f(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 700, in run
    result = self._invoke_run(role)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 828, in _invoke_run
    time.sleep(monitor_interval)
KeyboardInterrupt
  • There are a lot of extra logs with the new run module. I wonder if we can hide them.
  • Warning: Leaking Caffe2 thread-pool after fork. <- this warning appears while loading the dataset

Unfortunately, since this is quite new, there aren't any examples we can follow.

train.py (outdated)
```python
assert torch.cuda.device_count() > LOCAL_RANK, 'too few GPUS for DDP command'
torch.cuda.set_device(LOCAL_RANK)
device = torch.device('cuda', LOCAL_RANK)
dist.init_process_group(backend="gloo")  # distributed backend
```
Contributor

nccl should be the faster backend for DDP. I recall that Windows only supports gloo, however.

@glenn-jocher
Copy link
Member Author

glenn-jocher commented Jun 18, 2021

@NanoCode012 I see the same hanging at the end of training. This warning seems to imply that the hanging may be related to the dist.barrier() we use in torch_distributed_zero_first():

[W ProcessGroupNCCL.cpp:1569] Rank 3 using best-guess GPU 3 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.

```python
@contextmanager
def torch_distributed_zero_first(local_rank: int):
    """
    Decorator to make all processes in distributed training wait for each local_master to do something.
    """
    if local_rank not in [-1, 0]:
        torch.distributed.barrier()
    yield
    if local_rank == 0:
        torch.distributed.barrier()
```
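
Following the warning's suggestion, one mitigation would be to pass device_ids to barrier() so each rank pins the collective to a known GPU. A sketch only, not the code in this PR; note that device_ids is accepted only by the nccl backend:

```python
from contextlib import contextmanager

import torch.distributed as dist


@contextmanager
def torch_distributed_zero_first(local_rank: int):
    """Make all processes wait for the local master, pinning barriers to a known GPU."""
    if local_rank not in [-1, 0]:
        dist.barrier(device_ids=[local_rank])  # nccl-only keyword argument
    yield
    if local_rank == 0:
        dist.barrier(device_ids=[0])
```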

@glenn-jocher
Member Author

@NanoCode012 agreed there are a lot of warnings/messages everywhere.

Part of this newly appeared in torch 1.9:

/opt/conda/lib/python3.8/site-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at  /pytorch/c10/core/TensorImpl.h:1156.)
  return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)

And this part maybe either be due to 1.9 or the Docker image.

[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)

This is just my own debug code to remove before merge.

{'RANK': 0, 'LOCAL_RANK': 0, 'WORLD_SIZE': 2}

The rest of the extra output is due to torch.distributed.run.

@NanoCode012
Contributor

NanoCode012 commented Jun 18, 2021

@NanoCode012 I see the same hanging at the end of training. This warning seems to imply that the hanging may be related to the dist.barrier() we use in torch_distributed_zero_first():

[W ProcessGroupNCCL.cpp:1569] Rank 3 using best-guess GPU 3 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.

I think this warning is because of this comment in the run module:
https://github.com/pytorch/pytorch/blob/d5988c5eca0221e9ef58918e4f0b504940cb926a/torch/distributed/run.py#L210-L212

However, I notice that when I re-ran the latest code from this branch, I also got this error, but not in my earlier run. Perhaps it's something you changed after commit 382ce4f?

Edit: I think the cause of this error is the change to nccl in 8ae9ea1. Checking out the commit before it does not give me that warning. This is quite confusing...

@glenn-jocher changed the title from Update DDP for torch.distributed.run to Update DDP for torch.distributed.run with gloo backend on Jun 19, 2021
@glenn-jocher
Member Author

glenn-jocher commented Jun 19, 2021

@NanoCode012 ok I found a fix: torch.distributed.destroy_process_group() needs to be called outside of the main train() function. Now it works with both nccl and gloo, using both torch.distributed.launch and torch.distributed.run.

EDIT: I can't profile this right now as there's no EC2 spot availability, but I will try to profile this week, and we can revert to nccl if need be, or even introduce an if statement with if torch.distributed.is_nccl_available().
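
For reference, a minimal sketch of that arrangement, with the backend check and the process-group teardown outside train() (illustrative structure, not the exact merged code):

```python
import os

import torch
import torch.distributed as dist

LOCAL_RANK = int(os.getenv('LOCAL_RANK', -1))  # set by torch.distributed.run
RANK = int(os.getenv('RANK', -1))


def train():
    pass  # forward/backward/optimizer loop runs here


def main():
    if LOCAL_RANK != -1:  # DDP mode
        torch.cuda.set_device(LOCAL_RANK)
        dist.init_process_group(backend='nccl' if dist.is_nccl_available() else 'gloo')
    train()
    # key fix: destroy the process group after train() returns, not inside it
    if RANK in [-1, 0] and dist.is_initialized():
        dist.destroy_process_group()


if __name__ == '__main__':
    main()
```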

glenn-jocher merged commit fad27c0 into master on Jun 19, 2021
glenn-jocher deleted the DDP_run branch on June 19, 2021 at 14:30
@lleye

lleye commented Jul 20, 2021

Hi @glenn-jocher @NanoCode012, I'm having the same issues here.

[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)

[E ProcessGroupNCCL.cpp:566] [Rank #] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=60000) ran for 60960 milliseconds before timing out.

RuntimeError: replicas[0][0] in this process with sizes [48, 12, 3, 3] appears not to match sizes of the same param in process 0.

[E ProcessGroupNCCL.cpp:325] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'

Training on multiple A100s using DDP in Docker.

@glenn-jocher
Member Author

glenn-jocher commented Jul 20, 2021

@lleye this is a merged PR. If you have an issue that meets the criteria below, I recommend you open a new issue with code to reproduce.

How to create a Minimal, Reproducible Example

When asking a question, people will be better able to provide help if you provide code that they can easily understand and use to reproduce the problem. This is referred to by community members as creating a minimum reproducible example. Your code that reproduces the problem should be:

  • Minimal – Use as little code as possible that still produces the same problem
  • Complete – Provide all parts someone else needs to reproduce your problem in the question itself
  • Reproducible – Test the code you're about to provide to make sure it reproduces the problem

In addition to the above requirements, for Ultralytics to provide assistance your code should be:

  • Current – Verify that your code is up-to-date with current GitHub master, and if necessary git pull or git clone a new copy to ensure your problem has not already been resolved by previous commits.
  • Unmodified – Your problem must be reproducible without any modifications to the codebase in this repository. Ultralytics does not provide support for custom code ⚠️.

If you believe your problem meets all of the above criteria, please close this issue and raise a new one using the 🐛 Bug Report template and providing a minimum reproducible example to help us better understand and diagnose your problem.

Thank you! 😃

robin-maillot added a commit to robin-maillot/yolov5 that referenced this pull request Sep 22, 2021
* ConfusionMatrix `normalize=True` fix (ultralytics#3587)

* train.py GPU memory fix (ultralytics#3590)

* train.py GPU memory fix

* ema

* cuda

* cuda

* zeros input

* to device

* batch index 0

* W&B: Allow changed in config variable ultralytics#3588

* Update `dataset_stats()` (ultralytics#3593)

@kalenmike this is a PR to add image filenames and labels to our stats dictionary and to save the dictionary to JSON. Save location is next to the train labels.cache file. The single JSON contains all stats for entire dataset.

Usage example:
```python
from utils.datasets import *

dataset_stats('coco128.yaml', verbose=True)
```

* Delete __init__.py (ultralytics#3596)

* Simplify README.md (ultralytics#3530)

* Update README.md

* added hosted images

* added new logo

* testing image hosting

* changed svgs to pngs

* removed old header

* Update README.md

* correct colab image source

* splash.jpg

* rocket and W&B fix

* added contributing template

* added social media to top section

* increased size of top social media

* cleanup and updates

* rearrange quickstarts

* API cleanup

* PyTorch Hub cleanup

* Add tutorials

* cleanup

* update CONTRIBUTING.md

* Update README.md

* update wandb link

* Update README.md

* remove tutorials header

* update environments and integrations

* Comment API image

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* double spaces after section

* Update README.md

* Update README.md

Co-authored-by: Glenn Jocher <glenn.jocher@ultralytics.com>

* Update datasets.py (ultralytics#3591)

* 'changes-in_dataset'

* Update datasets.py

Co-authored-by: Glenn Jocher <glenn.jocher@ultralytics.com>

* Download COCO and VOC by default (ultralytics#3608)

* Suppress wandb images size mismatch warning (ultralytics#3611)

* supress wandb images size mismatch warning

* supress wandb images size mismatch warning

* PEP8 reformat and optimize imports

Co-authored-by: Glenn Jocher <glenn.jocher@ultralytics.com>

* Fix incorrect end epoch comment (ultralytics#3612)

* Update `check_file()` (ultralytics#3622)

* Update `check_file()`

* Update datasets.py

* Update README.md (ultralytics#3624)

* FROM nvcr.io/nvidia/pytorch:21.05-py3 (ultralytics#3633)

* Add `**/*.torchscript.pt` (ultralytics#3634)

* Update `verify_image_label()` (ultralytics#3635)

* RUN pip install --no-cache -U torch torchvision (ultralytics#3637)

* Assert non-premature end of JPEG images (ultralytics#3638)

* premature end of JPEG images

* PEP8 reformat

Co-authored-by: Glenn Jocher <glenn.jocher@ultralytics.com>

* Update CONTRIBUTING.md (ultralytics#3645)

* Update CONTRIBUTING.md

* Update CONTRIBUTING.md

* Update CONTRIBUTING.md

* Update CONTRIBUTING.md

* Update CONTRIBUTING.md (ultralytics#3647)

* `is_coco` list fix (ultralytics#3646)

* Update README.md (ultralytics#3650)

Be more user-friendly to new users

* Update `dataset_stats()` to list of dicts (ultralytics#3657)

* Update `dataset_stats()` to list of dicts

@kalenmike

* Update datasets.py

* Remove `/weights` directory (ultralytics#3659)

* Remove `/weights` directory

* cleanup

* Update download_weights.sh comment (ultralytics#3662)

* Update train.py (ultralytics#3667)

* Update `train(hyp, *args)` to accept `hyp` file or dict (ultralytics#3668)

* Update TensorBoard (ultralytics#3669)

* Update `WORLD_SIZE` and `RANK` retrieval (ultralytics#3670)

* Cache v0.3: improved corrupt image/label reporting (ultralytics#3676)

* Cache v0.3: improved corrupt image/label reporting

Fix for ultralytics#3656 (comment)

* cleanup

* EMA changes for pre-model's batch_size (ultralytics#3681)

* EMA changes for pre-model's batch_size

* Update train.py

* Update torch_utils.py

Co-authored-by: Glenn Jocher <glenn.jocher@ultralytics.com>

* Update README.md (ultralytics#3684)

* Update cache check (ultralytics#3691)

Swapped order of operations for faster first per ultralytics@f527704#r52362419

* Skip HSV augmentation when hyperparameters are [0, 0, 0] (ultralytics#3686)

* Create shortcircuit in augment_hsv when hyperparameter are zero

* implement faster opt-in

Co-authored-by: Glenn Jocher <glenn.jocher@ultralytics.com>

* Slightly modify CLI execution (ultralytics#3687)

* Slightly modify CLI execution

This simple change makes it easier to run the primary functions of this
repo (train/detect/test) from within Python. An object which represents
`opt` can be constructed and fed to the `main` function of each of these
modules, rather than having to call the lower level functions directly,
or run the module as a script.

* Update export.py

Add CLI parsing update for more convenient module usage within Python.

Co-authored-by: Lewis Belcher <lb@desupervised.io>

* Reformat (ultralytics#3694)

* Update DDP for `torch.distributed.run` with `gloo` backend (ultralytics#3680)

* Update DDP for `torch.distributed.run`

* Add LOCAL_RANK

* remove opt.local_rank

* backend="gloo|nccl"

* print

* print

* debug

* debug

* os.getenv

* gloo

* gloo

* gloo

* cleanup

* fix getenv

* cleanup

* cleanup destroy

* try nccl

* return opt

* add --local_rank

* add timeout

* add init_method

* gloo

* move destroy

* move destroy

* move print(opt) under if RANK

* destroy only RANK 0

* move destroy inside train()

* restore destroy outside train()

* update print(opt)

* cleanup

* nccl

* gloo with 60 second timeout

* update namespace printing

* Eliminate `total_batch_size` variable (ultralytics#3697)

* Eliminate `total_batch_size` variable

* cleanup

* Update train.py

* Add torch DP warning (ultralytics#3698)

* Add `train.run()` method (ultralytics#3700)

* Update train.py explicit arguments

* Update train.py

* Add run method

* Update DDP backend `if dist.is_nccl_available()` (ultralytics#3705)

* [x]W&B: Don't resume transfer learning runs (ultralytics#3604)

* Allow config cahnge

* Allow val change in wandb config

* Don't resume transfer learning runs

* Add entity in log dataset

* Update 4 main ops for paths and .run() (ultralytics#3715)

* Add yolov5/ to path

* rename functions to run()

* cleanup

* rename fix

* CI fix

* cleanup find models/export.py

* Fix `img2label_paths()` order (ultralytics#3720)

* Fix `img2label_paths()` order

* fix, 1

* Fix typo (ultralytics#3729)

* Backwards compatible cache version checks (ultralytics#3730)

* Update readme.

* Update `check_datasets()` for dynamic unzip path (ultralytics#3732)

@kalenmike

* Create `data/hyps` directory (ultralytics#3747)

* Force non-zero hyp evolution weights `w` (ultralytics#3748)

Fix for ultralytics#3741

* Edit comment (ultralytics#3759)

edit comment

* Add optional dataset.yaml `path` attribute (ultralytics#3753)

* Add optional dataset.yaml `path` attribute

@kalenmike

* pass locals to python scripts

* handle lists

* update coco128.yaml

* Capitalize first letter

* add test key

* finalize GlobalWheat2020.yaml

* finalize objects365.yaml

* finalize SKU-110K.yaml

* finalize SKU-110K.yaml

* finalize VisDrone.yaml

* NoneType fix

* update download comment

* voc to VOC

* update

* update VOC.yaml

* update VOC.yaml

* remove dashes

* delete get_voc.sh

* force coco and coco128 to ../datasets

* Capitalize Argoverse_HD.yaml

* Capitalize Objects365.yaml

* update Argoverse_HD.yaml

* coco segments fix

* VOC single-thread

* update Argoverse_HD.yaml

* update data_dict in test handling

* create root

* COCO annotations JSON fix (ultralytics#3764)

* Add `xyxy2xywhn()` (ultralytics#3765)

* Edit Comments for numpy2torch tensor process

Edit Comments for numpy2torch tensor process

* add xyxy2xywhn

add xyxy2xywhn

* add xyxy2xywhn

* formatting

* pass arguments

pass arguments

* edit comment for xyxy2xywhn()

edit comment for xyxy2xywhn()

* cleanup datasets.py

Co-authored-by: Glenn Jocher <glenn.jocher@ultralytics.com>

* Remove DDP MultiHeadAttention fix (ultralytics#3768)

* fix/incorrect_fitness_import (ultralytics#3770)

* W&B: Update Tables API and comply with new dataset_check (ultralytics#3772)

* Update tables API and windows path fix

* update dataset check

* NGA xView 2018 Dataset Auto-Download (ultralytics#3775)

* update clip_coords for numpy

* uncomment

* cleanup

* Add autosplits

* fix

* cleanup

* Update README.md fix banner width (ultralytics#3785)

* Objectness IoU Sort (ultralytics#3610)

Co-authored-by: U-LAPTOP-5N89P8V7\banhu <ban.huang@foxmail.com>

* Update objectness IoU sort (ultralytics#3786)

* Create hyp.scratch-p6.yaml (ultralytics#3787)

* Fix datasets for aws and get_coco.sh (ultralytics#3788)

* merge master

* Update get_coco.sh

* Update seeds for single-GPU reproducibility (ultralytics#3789)

For seed=0 on single-GPU.

* Update Usage examples (ultralytics#3790)

* nvcr.io/nvidia/pytorch:21.06-py3 (ultralytics#3791)

* Update Dockerfile (ultralytics#3792)

* FROM nvcr.io/nvidia/pytorch:21.05-py3 (ultralytics#3794)

* Fix competition link (ultralytics#3799)

* link to the competition repaired

* Update README.md

Co-authored-by: Glenn Jocher <glenn.jocher@ultralytics.com>

* Fix warmup `accumulate` (ultralytics#3722)

* gradient accumulation during warmup in train.py

Context:
`accumulate` is the number of batches/gradients accumulated before calling the next optimizer.step().
During warmup, it is ramped up from 1 to the final value nbs / batch_size. 
Although I have not seen this in other libraries, I like the idea. During warmup, as grads are large, too-large steps are more of an issue than gradient noise due to small steps.

The bug:
The condition to perform the opt step is wrong
> if ni % accumulate == 0:
This produces irregular step sizes if `accumulate` is not constant. It becomes relevant when batch_size is small and `accumulate` changes many times during warmup.

This demo also shows the proposed solution, to use a ">=" condition instead:
https://colab.research.google.com/drive/1MA2z2eCXYB_BC5UZqgXueqL_y1Tz_XVq?usp=sharing

Further, I propose not to restrict the number of warmup iterations to >= 1000. If the user changes hyp['warmup_epochs'], this causes unexpected behavior. Also, it makes evolution unstable if this parameter were to be optimized.
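
A minimal sketch of the proposed ">=" condition with last-step tracking (illustrative values; loss/optimizer calls are stubbed out as comments):

```python
import numpy as np

nbs = 64            # nominal batch size
batch_size = 4      # actual batch size (small, so accumulate changes often)
nw = 300            # number of warmup iterations (example value)
last_opt_step = -1  # iteration of the last optimizer step

for ni in range(1000):  # ni = integrated batch counter
    # accumulate ramps from 1 up to nbs / batch_size during warmup
    accumulate = max(1, int(round(np.interp(ni, [0, nw], [1, nbs / batch_size]))))
    # loss.backward() would run here
    if ni - last_opt_step >= accumulate:  # ">=" replaces "ni % accumulate == 0"
        # optimizer.step(); optimizer.zero_grad() would run here
        last_opt_step = ni
```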

* replace last_opt_step tracking by do_step(ni)

* add docstrings

* move down nw

* Update train.py

* revert math import move

Co-authored-by: Glenn Jocher <glenn.jocher@ultralytics.com>

* Add feature map visualization (ultralytics#3804)

* Add feature map visualization

Add a feature_visualization function to visualize the mid feature map of the model.

* Update yolo.py

* remove boolean from forward and reorder if statement

* remove print from forward

* General cleanup

* Indent

* Update plots.py

Co-authored-by: Glenn Jocher <glenn.jocher@ultralytics.com>

* Update `feature_visualization()` (ultralytics#3807)

* Update `feature_visualization()`

Only plot for data with height, width > 1

* cleanup

* Cleanup

* Fix for `dataset_stats()` with updated data.yaml (ultralytics#3819)

@kalenmike

* Move IoU functions to metrics.py (ultralytics#3820)

* Concise `TransformerBlock()` (ultralytics#3821)

* Update setup.py to use utf8 everywhere.

* Update setup.py to use utf8 everywhere again.

* Fix `LoadStreams()` dataloader frame skip issue (ultralytics#3833)

* Update datasets.py to read every 4th frame of streams

* Update datasets.py

Co-authored-by: Glenn Jocher <glenn.jocher@ultralytics.com>

* Plot `AutoShape()` detections in ascending order (ultralytics#3843)

* Copy-Paste augmentation for YOLOv5 (ultralytics#3845)

* Copy-paste augmentation initial commit

* if any segments

* Add obscuration rejection

* Add copy_paste hyperparameter

* Update comments

* Created using Colaboratory

* Created using Colaboratory

* Add EXIF rotation to YOLOv5 Hub inference (ultralytics#3852)

* rotating an image according to its exif tag

* Update common.py

* Update datasets.py

* Update datasets.py

faster

* delete extraneous gpg file

* Update common.py

Co-authored-by: Glenn Jocher <glenn.jocher@ultralytics.com>

* `--evolve 300` generations CLI argument (ultralytics#3863)

* evolve command accepts argument for number of generations

* evolve generations argument used in evolve for loop

* evolve argument boolean fixes

* default to 300 evolve generations

* Update train.py

Co-authored-by: John San Soucie <jsansoucie@whoi.edu>
Co-authored-by: Glenn Jocher <glenn.jocher@ultralytics.com>

* Add multi-stream saving feature (ultralytics#3864)

* Added the recording feature for multiple streams

Thanks for the very cool repo!!
I was trying to record multiple feeds at the same time, but the current version of the detector only had one video writer and one vid_path!
So the streams were not being saved; they were only initialized with one frame, and the process didn't record the whole thing.

Fix:
I made lists of `vid_writer` and `vid_path`, and the `i` from the loop over `pred` selects the writer that needs to write.

I hope this helps. Thanks!
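
For illustration, a sketch of the per-stream writer bookkeeping this describes (a simplified standalone function; names and the mp4v codec are assumptions):

```python
import cv2


def write_stream_frames(frames, save_paths, vid_paths, vid_writers, fps=30):
    """Write one frame per stream, (re)opening each writer when its target path changes."""
    for i, (im0, save_path) in enumerate(zip(frames, save_paths)):
        if vid_paths[i] != save_path:  # new video for stream i
            vid_paths[i] = save_path
            if isinstance(vid_writers[i], cv2.VideoWriter):
                vid_writers[i].release()  # release the previous writer
            h, w = im0.shape[:2]
            vid_writers[i] = cv2.VideoWriter(save_path, cv2.VideoWriter_fourcc(*'mp4v'), fps, (w, h))
        vid_writers[i].write(im0)
```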

* Cleanup list lengths

* batch size variable

* Update datasets.py

Co-authored-by: Glenn Jocher <glenn.jocher@ultralytics.com>

* Created using Colaboratory

* Models `*.yaml` reformat (ultralytics#3875)

* Create `utils/augmentations.py` (ultralytics#3877)

* Create `utils/augmentations.py`

* cleanup

* Improved BGR2RGB speeds (ultralytics#3880)

* Update BGR2RGB ops

* speed improvements

* cleanup

* Evolution commented `hyp['anchors']` fix (ultralytics#3887)

Fix for `KeyError: 'anchors'` error when start hyperparameter evolution:
```bash
python train.py --evolve
```

```bash
Traceback (most recent call last):
  File "E:\yolov5\train.py", line 623, in <module>
    hyp[k] = max(hyp[k], v[1])  # lower limit
KeyError: 'anchors'
```

* Hub models `map_location=device` (ultralytics#3894)

* Hub models `map_location=device`

* cleanup

* YOLOv5 + Albumentations integration (ultralytics#3882)

* Albumentations integration

* ToGray p=0.01

* print confirmation

* create instance in dataloader init method

* improved version handling

* transform not defined fix

* assert string update

* create check_version()

* add spaces

* update class comment

* Save PyTorch Hub models to `/root/hub/cache/dir` (ultralytics#3904)

* Create hubconf.py

* Add save_dir variable

Co-authored-by: Glenn Jocher <glenn.jocher@ultralytics.com>

* Feature visualization update (ultralytics#3920)

* Feature visualization update

* Save to jpg (faster)

* Save to png

* Fix `torch.hub.list('ultralytics/yolov5')` pathlib bug (ultralytics#3921)

* Update `setattr()` default for Hub PIL images (ultralytics#3923)

Fix inference from PIL source.

* `feature_visualization()` CUDA fix (ultralytics#3925)

* Update `dataset_stats()` for zipped datasets (ultralytics#3926)

* Update `dataset_stats()` for zipped datasets

@kalenmike

* cleanup

* Fix inconsistent NMS IoU value for COCO (ultralytics#3934)

Evaluation of 'best' and 'last' models will use the same params as the evaluation during the training phase. 
This PR fixes ultralytics#3907

* Created using Colaboratory

* Feature visualization improvements 32 (ultralytics#3947)

* Update augmentations.py (ultralytics#3948)

* Cache v0.4 update (ultralytics#3954)

* Numerical stability fix for Albumentations (ultralytics#3958)

* Update `albumentations>=1.0.2` (ultralytics#3966)

* Update `np.random.random()` to `random.random()` (ultralytics#3967)

* Update requirements.txt `albumentations>=1.0.2` (ultralytics#3972)

* `Ensemble()` visualize fix (ultralytics#3973)

* fix visualize error

* Revert "fix visualize error"

* add visualise profile

* Created using Colaboratory

* Update `probability` to `p` (ultralytics#3980)

* Alert (no detections) (ultralytics#3984)

* `Detections()` class `print()` overload

* Update common.py

* Update README.md (ultralytics#3996)

* Rename `test.py` to `val.py` (ultralytics#4000)

* W&B sweeps support (ultralytics#3938)

* Add support for W&B Sweeps

* Update and reformat

* Update search space

* reformat

* reformat sweep.py

* Update sweep.py

* Move sweeps files to wandb dir

* Remove print

Co-authored-by: Glenn Jocher <glenn.jocher@ultralytics.com>

* Update greetings.yml (ultralytics#4024)

* Update greetings.yml

* Update greetings.yml

* Add `--sync-bn` known issue (ultralytics#4032)

* Add `--sync-bn` known issue

* Update train.py

* Update greetings.yml (ultralytics#4037)

* Update README.md (ultralytics#4041)

* Update README.md

* Update README.md

* Update README.md

* AutoShape PosixPath support (ultralytics#4047)

* AutoShape PosixPath support

Usage example:

```python
from pathlib import Path

model = ...
file = Path('data/images/zidane.jpg')

results = model(file)
```

* Update common.py

* `val.py` refactor (ultralytics#4053)

* val.py refactor

* cleanup

* cleanup

* cleanup

* cleanup

* save after eval

* opt.imgsz bug fix

* wandb refactor

* dataloader to train_loader

* capitalize global variables

* runs/hub/exp to runs/detect/exp

* refactor wandb logging

* Refactor wandb operations (ultralytics#4061)

Co-authored-by: Ayush Chaurasia <ayush.chaurarsia@gmail.com>

* Module `super().__init__()` (ultralytics#4065)

* Module `super().__init__()`

* remove NMS

* Missing `nc` and `names` handling in check_dataset() (ultralytics#4066)

* Created using Colaboratory

* Albumentations >= 1.0.3 (ultralytics#4068)

* W&B: fix refactor bugs (ultralytics#4069)

* Refactor `export.py` (ultralytics#4080)

* Refactor `export.py`

* cleanup

* Update check_requirements()

* Update export.py

* Addition refactor `export.py` (ultralytics#4089)

* Addition refactor `export.py`

* Update export.py

* Add train.py ``--img-size` floor (ultralytics#4099)

* Update resume.py (ultralytics#4115)

* Fix indentation in `log_training_progress()` (ultralytics#4126)

* Update README.md (ultralytics#4134)

* ONNX inference update (ultralytics#4073)

* Rename `opset_version` to `opset` (ultralytics#4135)

* Update train.py (ultralytics#4136)

* Refactor train.py

* Update imports

* Update imports

* Update optimizer

* cleanup

* Refactor train.py and val.py `loggers` (ultralytics#4137)

* Update loggers

* Config

* Update val.py

* cleanup

* fix1

* fix2

* fix3 and reformat

* format sweep.py

* Logger() class

* cleanup

* cleanup2

* wandb package import fix

* wandb package import fix2

* txt fix

* fix4

* fix5

* fix6

* drop wandb into utils/loggers

* fix 7

* rename loggers/wandb_logging to loggers/wandb

* Update message

* Update message

* Update message

* cleanup

* Fix x axis bug

* fix rank 0 issue

* cleanup

* Update README.md (ultralytics#4143)

* Add `export.py` ONNX inference suggestion (ultralytics#4146)

* Created using Colaboratory

* New CSV Logger (ultralytics#4148)

* New CSV Logger

* cleanup

* move batch plots into Logger

* rename comment

* Remove total loss from progress bar

* mloss :-1 bug fix

* Update plot_results()

* Update plot_results()

* plot_results bug fix

* Created using Colaboratory

* Update dataset headers (ultralytics#4162)

* Update script headers (ultralytics#4163)

* Update download script headers

* cleanup

* bug fix attempt

* bug fix attempt2

* bug fix attempt3

* cleanup

* Improve docstrings and run names (ultralytics#4174)

* Update comments header (ultralytics#4184)

* Train from `--data path/to/dataset.zip` feature (ultralytics#4185)

* Train from `--data path/to/dataset.zip` feature

* Update dataset_stats()

* cleanup

* cleanup2

* Create yolov5-bifpn.yaml (ultralytics#4195)

* Update Hub Path inputs (ultralytics#4200)

* W&B: Restructure code to support the new dataset_check() feature (ultralytics#4197)

* Improve docstrings and run names

* default wandb login prompt with timeout

* return key

* Update api_key check logic

* Properly support zipped dataset feature

* update docstring

* Revert tuorial change

* extend changes to log_dataset

* add run name

* bug fix

* bug fix

* Update comment

* fix import check

* remove unused import

* Hardcore .yaml file extension

* reduce code

* Reformat using pycharm

Co-authored-by: Glenn Jocher <glenn.jocher@ultralytics.com>

* Update yolov5-bifpn.yaml (ultralytics#4208)

* W&B: More improvements and refactoring (ultralytics#4205)

* Improve docstrings and run names

* default wandb login prompt with timeout

* return key

* Update api_key check logic

* Properly support zipped dataset feature

* update docstring

* Revert tuorial change

* extend changes to log_dataset

* add run name

* bug fix

* bug fix

* Update comment

* fix import check

* remove unused import

* Hardcore .yaml file extension

* reduce code

* Reformat using pycharm

* Remove redundant try catch

* More refactoring and bug fixes

* retry

* Reformat using pycharm

* respect LOGGERS include list

Co-authored-by: Glenn Jocher <glenn.jocher@ultralytics.com>

* PyCharm reformat (ultralytics#4209)

* PyCharm reformat

* YAML reformat

* Markdown reformat

* Add `@try_except` decorator (ultralytics#4224)

* Explicit `requirements.txt` location (ultralytics#4225)

* Suppress torch 1.9.0 max_pool2d() warning (ultralytics#4227)

* Created using Colaboratory

* Created using Colaboratory

* Fix weight decay comment (ultralytics#4228)

* Update profiler (ultralytics#4236)

* Add `python train.py --freeze N` argument (ultralytics#4238)

* Add freeze as an argument

I train on different platforms, and sometimes I want to freeze some layers. I have to go into the code and change it, and also keep track of how many layers I froze on each platform. Please add the number of layers to freeze as an argument in future versions. Thanks.

* Update train.py

* Update train.py

* Cleanup

Co-authored-by: Glenn Jocher <glenn.jocher@ultralytics.com>

* Update `profile()` for CUDA Memory allocation (ultralytics#4239)

* Update profile()

* Update profile()

* Update profile()

* Update profile()

* Update profile()

* Update profile()

* Update profile()

* Update profile()

* Update profile()

* Update profile()

* Update profile()

* Update profile()

* Cleanup

* Add `train.py` and `val.py` callbacks (ultralytics#4220)

* added callbacks

* Update callbacks.py

* Update train.py

* Update val.py

* Fix CamlCase add staticmethod

* Refactor logger into callbacks

* Cleanup

* New callback on_val_image_end()

* Add curves and results images to TensorBoard

Co-authored-by: Glenn Jocher <glenn.jocher@ultralytics.com>

* W&B: suppress warnings (ultralytics#4257)

* Improve docstrings and run names

* default wandb login prompt with timeout

* return key

* Update api_key check logic

* Properly support zipped dataset feature

* update docstring

* Revert tuorial change

* extend changes to log_dataset

* add run name

* bug fix

* bug fix

* Update comment

* fix import check

* remove unused import

* Hardcore .yaml file extension

* reduce code

* Reformat using pycharm

* Remove redundant try catch

* More refactoring and bug fixes

* retry

* Reformat using pycharm

* respect LOGGERS include list

* call wandblogger.log instead of wandb.log

Co-authored-by: Glenn Jocher <glenn.jocher@ultralytics.com>

* Update AP calculation (ultralytics#4260)

* Update AP calculation

* Cleanup

* Remove original

* Update Autoshape forward header (ultralytics#4271)

* Update variables (ultralytics#4273)

* Add `DWConvClass()` (ultralytics#4274)

* Add `DWConvClass()`

* Cleanup

* Cleanup2

* Update 'results saved to' string (ultralytics#4275)

* W&B: Fix sweep bug (ultralytics#4276)

* Improve docstrings and run names

* default wandb login prompt with timeout

* return key

* Update api_key check logic

* Properly support zipped dataset feature

* update docstring

* Revert tuorial change

* extend changes to log_dataset

* add run name

* bug fix

* bug fix

* Update comment

* fix import check

* remove unused import

* Hardcore .yaml file extension

* reduce code

* Reformat using pycharm

* Remove redundant try catch

* More refactoring and bug fixes

* retry

* Reformat using pycharm

* respect LOGGERS include list

* call wandblogger.log instead of wandb.log

* Fix Sweep bug

Co-authored-by: Glenn Jocher <glenn.jocher@ultralytics.com>

* Feature `python train.py --cache disk` (ultralytics#4049)

* Add cache-on-disk and cache-directory to cache images on disk

* Fix load_image with cache_on_disk

* Add no_cache flag for load_image

* Revert the parts('logging' and a new line) that do not need to be modified

* Add the assertion for shapes of cached images

* Add a suffix string for cached images

* Fix boundary-error of letterbox for load_mosaic

* Add prefix as cache-key of cache-on-disk

* Update cache-function on disk

* Add psutil in requirements.txt

* Update train.py

* Cleanup1

* Cleanup2

* Skip existing npy

* Include re-space

* Export return character fix

Co-authored-by: Glenn Jocher <glenn.jocher@ultralytics.com>

* Fixed logging level in distributed mode (ultralytics#4284)

Co-authored-by: fkwong <huangfuqiang@transai.cn>

* Simplify callbacks (ultralytics#4289)

* Evolve in CSV format (ultralytics#4307)

* Update evolution to CSV format

* Update

* Update

* Update

* Update

* Update

* reset args

* reset args

* reset args

* plot_results() fix

* Cleanup

* Cleanup2

* Update newline (ultralytics#4308)

* Update README.md (ultralytics#4309)

remove unnecessary "`"

* Simpler code for DWConvClass (ultralytics#4310)

* more simpler code for DWConvClass

more simpler code for DWConvClass

* remove DWConv function

* Replace DWConvClass with DWConv

* `int(mlc)` (ultralytics#4385)

* Fix module count in parse_model (ultralytics#4379)

Co-authored-by: yangyuantao <yangyuantao@transai.cn>

* Created using Colaboratory

* Update README.md (ultralytics#4387)

* W&B: Add advanced features tutorial (ultralytics#4384)

* Improve docstrings and run names

* default wandb login prompt with timeout

* return key

* Update api_key check logic

* Properly support zipped dataset feature

* update docstring

* Revert tuorial change

* extend changes to log_dataset

* add run name

* bug fix

* bug fix

* Update comment

* fix import check

* remove unused import

* Hardcore .yaml file extension

* reduce code

* Reformat using pycharm

* Remove redundant try catch

* More refactoring and bug fixes

* retry

* Reformat using pycharm

* respect LOGGERS include list

* Initial readme update

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

Co-authored-by: Glenn Jocher <glenn.jocher@ultralytics.com>

* W&B: Fix for 4360 (ultralytics#4388)

* Improve docstrings and run names

* default wandb login prompt with timeout

* return key

* Update api_key check logic

* Properly support zipped dataset feature

* update docstring

* Revert tuorial change

* extend changes to log_dataset

* add run name

* bug fix

* bug fix

* Update comment

* fix import check

* remove unused import

* Hardcore .yaml file extension

* reduce code

* Reformat using pycharm

* Remove redundant try catch

* More refactoring and bug fixes

* retry

* Reformat using pycharm

* respect LOGGERS include list

* Fix

* fix

Co-authored-by: Glenn Jocher <glenn.jocher@ultralytics.com>

* Fix rename `utils.google_utils` to `utils.downloads` (ultralytics#4393)

* Simplify ONNX inference command (ultralytics#4405)

* No cache option for reading datasets (ultralytics#4376)

* no cache option

* no cache option

* bit change

* changed to 0,1 instead of True False

* Update train.py

* Update datasets.py

Co-authored-by: Glenn Jocher <glenn.jocher@ultralytics.com>

* Update plots.py (ultralytics#4407)

* Add `yolov5s-ghost.yaml` (ultralytics#4412)

* Add yolov5s-ghost.yaml

* Finish C3Ghost

* Add C3Ghost to list

* Add C3Ghost to number of repeats if statement

* Fixes

* Cleanup

* Remove `encoding='ascii'` (ultralytics#4413)

* Remove `encoding='ascii'`

* Reinstate `encoding='ascii'` in emojis()

* Merge PIL and OpenCV in `plot_one_box(use_pil=False)` (ultralytics#4416)

* Merge PIL and OpenCV box plotting functions

* Add ASCII check to plot_one_box

* Cleanup

* Cleanup2

* Created using Colaboratory

* Standardize headers and docstrings (ultralytics#4417)

* Implement new headers

* Reformat 1

* Reformat 2

* Reformat 3 - math

* Reformat 4 - yaml

* Add `SPPF()` layer (ultralytics#4420)

* Add `SPPF()` layer

* Cleanup

* Add credit

* Created using Colaboratory

* Remove DDP process group timeout (ultralytics#4422)

* Update hubconf.py attempt_load  import (ultralytics#4428)

* TFLite prep (ultralytics#4436)

* Add TensorFlow and TFLite export (ultralytics#1127)

* Add models/tf.py for TensorFlow and TFLite export

* Set auto=False for int8 calibration

* Update requirements.txt for TensorFlow and TFLite export

* Read anchors directly from PyTorch weights

* Add --tf-nms to append NMS in TensorFlow SavedModel and GraphDef export

* Remove check_anchor_order, check_file, set_logging from import

* Reformat code and optimize imports

* Autodownload model and check cfg

* update --source path, img-size to 320, single output

* Adjust representative_dataset

* Put representative dataset in tfl_int8 block

* detect.py TF inference

* weights to string

* weights to string

* cleanup tf.py

* Add --dynamic-batch-size

* Add xywh normalization to reduce calibration error

* Update requirements.txt

TensorFlow 2.3.1 -> 2.4.0 to avoid int8 quantization error

* Fix imports

Move C3 from models.experimental to models.common

* Add models/tf.py for TensorFlow and TFLite export

* Set auto=False for int8 calibration

* Update requirements.txt for TensorFlow and TFLite export

* Read anchors directly from PyTorch weights

* Add --tf-nms to append NMS in TensorFlow SavedModel and GraphDef export

* Remove check_anchor_order, check_file, set_logging from import

* Reformat code and optimize imports

* Autodownload model and check cfg

* update --source path, img-size to 320, single output

* Adjust representative_dataset

* detect.py TF inference

* Put representative dataset in tfl_int8 block

* weights to string

* weights to string

* cleanup tf.py

* Add --dynamic-batch-size

* Add xywh normalization to reduce calibration error

* Update requirements.txt

TensorFlow 2.3.1 -> 2.4.0 to avoid int8 quantization error

* Fix imports

Move C3 from models.experimental to models.common

* implement C3() and SiLU()

* Fix reshape dim to support dynamic batching

* Add epsilon argument in tf_BN, which is different between TF and PT

* Set stride to None if not using PyTorch, and do not warmup without PyTorch

* Add list support in check_img_size()

* Add list input support in detect.py

* sys.path.append('./') to run from yolov5/

* Add int8 quantization support for TensorFlow 2.5

* Add get_coco128.sh

* Remove --no-tfl-detect in models/tf.py (Use tf-android-tfl-detect branch for EdgeTPU)

* Update requirements.txt

* Replace torch.load() with attempt_load()

* Update requirements.txt

* Add --tf-raw-resize to set half_pixel_centers=False

* Add --agnostic-nms for TF class-agnostic NMS

* Cleanup after merge

* Cleanup2 after merge

* Cleanup3 after merge

* Add tf.py docstring with credit and usage

* pb saved_model and tflite use only one model in detect.py

* Add use cases in docstring of tf.py

* Remove redundant `stride` definition

* Remove keras direct import

* Fix `check_requirements(('tensorflow>=2.4.1',))`

Co-authored-by: Glenn Jocher <glenn.jocher@ultralytics.com>

* Fix default `--weights yolov5s.pt` (ultralytics#4458)

* Fix missing labels after albumentations (ultralytics#4455)

* fix missing labels after augmentation

* Update datasets.py

Cleanup

Co-authored-by: Huu Quan <huuquan@HuuQuans-MacBook.local>
Co-authored-by: Glenn Jocher <glenn.jocher@ultralytics.com>

* `check_requirements(('coremltools',))` (ultralytics#4478)

* `check_requirements(('coremltools',))`

* Update ci-testing.yml

* Update ci-testing.yml

* W&B: Refactor the wandb_utils.py file (ultralytics#4496)

* Improve docstrings and run names

* default wandb login prompt with timeout

* return key

* Update api_key check logic

* Properly support zipped dataset feature

* update docstring

* Revert tuorial change

* extend changes to log_dataset

* add run name

* bug fix

* bug fix

* Update comment

* fix import check

* remove unused import

* Hardcode `.yaml` file extension

* reduce code

* Reformat using pycharm

* Remove redundant try catch

* More refactoring and bug fixes

* retry

* Reformat using pycharm

* respect LOGGERS include list

* Fix

* fix

* refactor constructor

* refactor

* refactor

* refactor

* PyCharm reformat

Co-authored-by: Glenn Jocher <glenn.jocher@ultralytics.com>

* Add `install=True` argument to `check_requirements` (ultralytics#4512)

* Add `install=True` argument to `check_requirements`

* Update general.py

* Automatic TFLite uint8 determination (ultralytics#4515)

* Auto TFLite uint8 detection

This PR automatically determines whether TFLite models are uint8-quantized rather than accepting a manual argument (a sketch of the idea follows below).

The quantization determination is based on @zldrobit comment ultralytics#1127 (comment)

* Cleanup
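
A minimal sketch of the auto-detection idea, reading the dtype from the interpreter's input details (model path illustrative):

```python
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path='yolov5s-int8.tflite')  # hypothetical path
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
int8 = input_details[0]['dtype'] == np.uint8  # True if the model is uint8-quantized
print(f'uint8 model: {int8}')
```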

* Fix for `python models/yolo.py --profile` (ultralytics#4541)

Profiling fix copies input to Detect layer to circumvent inplace changes to the feature maps.
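
A sketch of the idea (the helper name and run count are illustrative): time the module on a clone of its input, so in-place ops inside the layer leave the original feature maps untouched.

```python
import time
import torch

def profile_layer(m, x, runs=10):
    xc = x.clone()  # copy the input to circumvent in-place changes
    t0 = time.time()
    with torch.no_grad():
        for _ in range(runs):
            _ = m(xc)
    return (time.time() - t0) / runs  # mean forward time in seconds
```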

* Auto-fix corrupt JPEGs (ultralytics#4548)

* Autofix corrupt JPEGs

This PR automatically re-saves corrupt JPEGs and trains with the re-saved images. WARNING: this will overwrite the existing corrupt JPEGs in a dataset and replace them with correct JPEGs, though the file size may increase and the image contents may not be exactly the same due to lossy JPEG compression schemes. Results may vary by JPEG decoder and hardware.

Current behavior is to exclude corrupt JPEGs from training with a warning to the user, but many users have been complaining about large parts of their dataset being excluded from training.

* Clarify re-save reason

* Fix for corrupt JPEGs auto-fix PR (ultralytics#4560)

The auto-fix corrupt JPEGs PR introduced a bug whereby the f.seek() operation read all of the bytes in the image, leaving the PIL image nothing to read upon the .save() operation.

Fix was to re-open the image using PIL before saving.
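
A sketch of the combined behavior under stated assumptions (path and detection heuristic illustrative, not the exact repository code); note the fresh Image.open() before .save(), which is the fix described above:

```python
from PIL import Image

im_file = 'datasets/images/corrupt.jpg'  # hypothetical path
with open(im_file, 'rb') as f:
    f.seek(-2, 2)  # jump to the last two bytes of the file
    corrupt = f.read() != b'\xff\xd9'  # missing JPEG end-of-image marker
if corrupt:
    # Re-open with PIL before saving: a consumed handle has nothing left to write
    Image.open(im_file).save(im_file, 'JPEG', subsampling=0, quality=100)
    print(f'WARNING: corrupt JPEG restored and re-saved: {im_file}')
```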

* Fix for AP calculation limits 0.0 - 1.0 (ultralytics#4563)

This PR aligns AP computation with the practices of Detectron2 and MMDetection.

Problem first noted by @yusiyoh in ultralytics#4546
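
A sketch of AP integration with the recall sentinels clamped to the 0.0 - 1.0 limits (function name illustrative; the 101-point grid follows the COCO convention):

```python
import numpy as np

def average_precision(recall, precision):
    # recall must be sorted ascending; sentinels clamp the curve to 0.0 - 1.0
    mrec = np.concatenate(([0.0], recall, [1.0]))
    mpre = np.concatenate(([1.0], precision, [0.0]))
    mpre = np.flip(np.maximum.accumulate(np.flip(mpre)))  # monotonic precision envelope
    x = np.linspace(0.0, 1.0, 101)  # 101-point interpolation (COCO)
    return np.trapz(np.interp(x, mrec, mpre), x)  # area under the interpolated curve
```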

* ONNX opset 13 (ultralytics#4566)

* Add EarlyStopping feature (ultralytics#4576)

* Add EarlyStopping feature (a minimal sketch follows this commit list)

* Add comment

* Cleanup

* Cleanup2

* debug

* debug2

* debug3

* debug3

* debug4

* debug5

* debug6

* debug7

* debug8

* debug9

* debug10

* debug11

* debug12

* Cleanup

* Add TODO for known DDP issue
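
A minimal sketch of a patience-based stopper, assuming fitness is a scalar where higher is better (as with a weighted-mAP metric):

```python
class EarlyStopping:
    # Stop training when fitness fails to improve for `patience` consecutive epochs
    def __init__(self, patience=30):
        self.best_fitness = 0.0
        self.best_epoch = 0
        self.patience = patience

    def __call__(self, epoch, fitness):
        if fitness >= self.best_fitness:  # >= tolerates flat early epochs
            self.best_epoch, self.best_fitness = epoch, fitness
        return epoch - self.best_epoch >= self.patience  # True -> stop training
```

Under DDP such a check would run on RANK 0 with the decision broadcast to the other ranks, presumably the known issue flagged in the TODO above.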

* Remove `image_weights` DDP code (ultralytics#4579)

* Initial commit

* Update

* Add `Profile()` profiler (ultralytics#4587)

* Add `Profile()` profiler

* CamelCase Timeout
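
A sketch of a Profile()-style timing helper usable as both a decorator and a context manager (print format illustrative):

```python
import time
from contextlib import ContextDecorator

class Profile(ContextDecorator):
    # Usage: @Profile() decorator or `with Profile():` context manager
    def __enter__(self):
        self.start = time.time()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        print(f'Profile results: {time.time() - self.start:.5f}s')
```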

* Fix bug in `plot_one_box` when label is `None` (ultralytics#4588)

* Create `Annotator()` class (ultralytics#4591)

* Add Annotator() class (sketched after this commit list)

* Download Arial

* 2x for loop

* Cleanup

* tuple 2 list

* max_size=1920

* bold logging results to

* tolist()

* im = annotator.im

* PIL save in detect.py

* Smart asarray in detect.py

* revert to cv2.imwrite

* Cleanup

* Return result asarray

* Add `Profile()` profiler

* CamelCase Timeout

* Resize after mosaic

* pillow>=8.0.0

* daemon imwrite

* Add cv2 support

* Remove plot_wh_methods and plot_one_box

* pil=False for hubconf.py annotations

* im.shape bug fix

* colorstr common.py

* join daemons

* Update t.daemon

* Removed daemon saving
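
A minimal sketch of an Annotator-style wrapper over PIL (class shape, defaults, and the load_default font are illustrative; the commits above also add a cv2 path and an Arial.ttf download):

```python
import numpy as np
from PIL import Image, ImageDraw, ImageFont

class Annotator:
    def __init__(self, im, line_width=3):
        # Accept either a PIL image or a numpy array
        self.im = im if isinstance(im, Image.Image) else Image.fromarray(im)
        self.draw = ImageDraw.Draw(self.im)
        self.lw = line_width
        self.font = ImageFont.load_default()

    def box_label(self, box, label='', color=(255, 0, 0)):
        # box is (x1, y1, x2, y2) in pixel coordinates
        self.draw.rectangle(box, width=self.lw, outline=color)
        if label:
            self.draw.text((box[0], box[1] - 10), label, fill=color, font=self.font)

    def result(self):
        return np.asarray(self.im)  # annotated image as array, as in 'Return result asarray'
```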

* Auto-UTF handling (ultralytics#4594)

* Re-order `plots.py` to class-first (ultralytics#4595)

* Created using Colaboratory

* Update mosaic plots font size (ultralytics#4596)

* TensorBoard `on_train_end()` speed improvements (ultralytics#4605)

* Created using Colaboratory

* Auto-download Arial.ttf on init (ultralytics#4606)

* Auto-download Arial.ttf on init

* Fix ROOT

* Fix: add P2 layer 21 to yolov5-p2.yaml `Detect()` inputs (ultralytics#4608)

Layer 21 carries the feature information for xsmall objects

* Update `check_git_status()` warning (ultralytics#4610)

* W&B: Don't log models in evolve operation (ultralytics#4611)

* Close `matplotlib` plots after opening (ultralytics#4612)

* Close plots

* Replace fig.close() for plt.close()

* DDP `torch.jit.trace()` `--sync-bn` fix (ultralytics#4615)

* Remove assert

* debug0

* trace=not opt.sync

* sync to sync_bn fix

* Cleanup

* Fix for Arial.ttf redownloads with hub inference (ultralytics#4627)

* Fix 2 for Arial.ttf redownloads with hub inference (ultralytics#4628)

* Fix 3 for Arial.ttf redownloads with hub inference (ultralytics#4629)

Fix 3 for Arial.ttf redownloads with hub inference, follow-on to ultralytics#4628.

* Checkpoint code.

* Fix for `plot_evolve()` string argument (ultralytics#4639)

* Fix `is_coco` on missing `data['val']` key (ultralytics#4642)

* Fix workers to 1 for Windows and fix an issue with image_size not being used correctly during training

* Remove mojo files.

* Add mojo_test.py and update gitignore.

* Move entity and project to variables.

* Install dependencies only when needed and make the wheel (whl) search more generic.

* Fix missing parameter in _find_module_wheel_path.

* Remove extra prints.

* Fix weights download bug and pretraining always using yolov5s weights.

* Update code to work with Ultralytics YOLOv5:4 env.

* Add confidence threshold plot

* Minor cleanup of azure_wrapper.

* Fix click/typer incompatibility before 4.0.0

* Restore gitignore and remove wrong error import print in Azure wrapper.

* Fix wrong typer version in requirements.

Co-authored-by: Glenn Jocher <glenn.jocher@ultralytics.com>
Co-authored-by: Ayush Chaurasia <ayush.chaurarsia@gmail.com>
Co-authored-by: Kalen Michael <kalenmike@gmail.com>
Co-authored-by: masood azhar <masoodazhar60@gmail.com>
Co-authored-by: Wei Quan <quan.we@gmail.com>
Co-authored-by: xiaowk5516 <59595896+xiaowk5516@users.noreply.github.com>
Co-authored-by: Mai Thanh Minh <thanhminh.mr@gmail.com>
Co-authored-by: SpongeBab <2078825250@qq.com>
Co-authored-by: ZouJiu1 <34758215+ZouJiu1@users.noreply.github.com>
Co-authored-by: lb-desupervised <86119248+lb-desupervised@users.noreply.github.com>
Co-authored-by: Lewis Belcher <lb@desupervised.io>
Co-authored-by: fcakyon <34196005+fcakyon@users.noreply.github.com>
Co-authored-by: Robin <robin@nanovare.com>
Co-authored-by: Yonghye Kwon <developer.0hye@gmail.com>
Co-authored-by: Piotr Skalski <SkalskiP@users.noreply.github.com>
Co-authored-by: U-LAPTOP-5N89P8V7\banhu <ban.huang@foxmail.com>
Co-authored-by: batrlatom <tomas.batrla@gmail.com>
Co-authored-by: yellowdolphin <42343818+yellowdolphin@users.noreply.github.com>
Co-authored-by: Zigarss <32835472+Zigars@users.noreply.github.com>
Co-authored-by: Feras Oughali <47706157+feras-oughali@users.noreply.github.com>
Co-authored-by: Valentin Aliferov <vaaliferov@gmail.com>
Co-authored-by: san-soucie <44901782+san-soucie@users.noreply.github.com>
Co-authored-by: John San Soucie <jsansoucie@whoi.edu>
Co-authored-by: ketan-b <54092325+ketan-b@users.noreply.github.com>
Co-authored-by: johnohagan <86861886+johnohagan@users.noreply.github.com>
Co-authored-by: jmiranda-laplateforme <67475949+jmiranda-laplateforme@users.noreply.github.com>
Co-authored-by: Eldar Kurtic <eldar.ciki@gmail.com>
Co-authored-by: KEN <33506506+seven320@users.noreply.github.com>
Co-authored-by: imyhxy <imyhxy@gmail.com>
Co-authored-by: IneovaAI <67843470+IneovaAI@users.noreply.github.com>
Co-authored-by: junji hashimoto <junjihashimoto@users.noreply.github.com>
Co-authored-by: fkwong <huangfuqiang@transai.cn>
Co-authored-by: Sudhanshu Singh <sudhanshufromearth@gmail.com>
Co-authored-by: Yuantao Yang <31794133+orangeccc@users.noreply.github.com>
Co-authored-by: yangyuantao <yangyuantao@transai.cn>
Co-authored-by: Ahmad Mustafa Anis <47111429+ahmadmustafaanis@users.noreply.github.com>
Co-authored-by: Omid Sadeghnezhad <58780720+OmidSa75@users.noreply.github.com>
Co-authored-by: Jiacong Fang <zldrobit@126.com>
Co-authored-by: Huu Quan, CAP <huuquan1994@users.noreply.github.com>
Co-authored-by: Huu Quan <huuquan@HuuQuans-MacBook.local>
Co-authored-by: Takumi Karasawa <zaki19930927@gmail.com>
Co-authored-by: Yukun Xia <yukunx@cs.cmu.edu>
Co-authored-by: vincent <vincent@nanovare.com>
BjarneKuehl pushed a commit to fhkiel-mlaip/yolov5 that referenced this pull request Aug 26, 2022
…cs#3680)

* Update DDP for `torch.distributed.run` (a setup sketch follows this commit list)

* Add LOCAL_RANK

* remove opt.local_rank

* backend="gloo|nccl"

* print

* print

* debug

* debug

* os.getenv

* gloo

* gloo

* gloo

* cleanup

* fix getenv

* cleanup

* cleanup destroy

* try nccl

* return opt

* add --local_rank

* add timeout

* add init_method

* gloo

* move destroy

* move destroy

* move print(opt) under if RANK

* destroy only RANK 0

* move destroy inside train()

* restore destroy outside train()

* update print(opt)

* cleanup

* nccl

* gloo with 60 second timeout

* update namespace printing
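
A sketch of the setup these commits converge on, assuming launch via `python -m torch.distributed.run --nproc_per_node N train.py` (the 60-second timeout and the gloo fallback mirror the commit messages; placement within train.py is simplified here):

```python
import os
from datetime import timedelta

import torch
import torch.distributed as dist

# Ranks come from environment variables set by torch.distributed.run
LOCAL_RANK = int(os.getenv('LOCAL_RANK', -1))
RANK = int(os.getenv('RANK', -1))
WORLD_SIZE = int(os.getenv('WORLD_SIZE', 1))

if LOCAL_RANK != -1:  # DDP mode
    torch.cuda.set_device(LOCAL_RANK)
    dist.init_process_group(
        backend='nccl' if dist.is_nccl_available() else 'gloo',
        timeout=timedelta(seconds=60),
    )

# ... training loop ...

if WORLD_SIZE > 1 and RANK == 0:  # destroy only on RANK 0, as in the commits above
    dist.destroy_process_group()
```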