
Update DDP for torch.distributed.run with gloo backend #3680

Merged: 35 commits from DDP_run into master on Jun 19, 2021

Conversation

glenn-jocher
Member

glenn-jocher commented on Jun 18, 2021

@NanoCode012 I'm experimenting with a few DDP updates here.

🛠️ PR Summary

Made with ❤️ by Ultralytics Actions

🌟 Summary

Enhanced logging and introduced distributed training modifications.

📊 Key Changes

  • Introduced colorstr function to colorize the printed options in detect.py, export.py, and test.py.
  • Removed set_logging calls from conditionals to allow logging across different modes.
  • Added LOCAL_RANK, RANK, and WORLD_SIZE global variables for distributed data parallel (DDP) support in train.py (see the sketch after this list).
  • Refactored train.py to use DDP-friendly WORLD_SIZE, RANK, and LOCAL_RANK variables.
  • Modified create_dataloader in utils/datasets.py to support DDP by removing the world_size argument.
  • Added dist.barrier() calls in utils/torch_utils.py to synchronize DDP processes.
  • Updated the WandbLogger to handle DDP rank conditions using the environment variable RANK.
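
For context, torch.distributed.run passes rank information to each worker via environment variables rather than the --local_rank argument used by torch.distributed.launch. A minimal sketch of the retrieval pattern (variable names follow the PR description; the non-DDP defaults are assumptions):

```python
import os

# torch.distributed.run sets these for every worker process; the fallbacks
# cover single-GPU/CPU runs launched without the distributed launcher.
LOCAL_RANK = int(os.getenv('LOCAL_RANK', -1))  # GPU index on this node
RANK = int(os.getenv('RANK', -1))              # global process rank
WORLD_SIZE = int(os.getenv('WORLD_SIZE', 1))   # total number of processes
```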

🎯 Purpose & Impact

  • 📈 Improved Logging: Colorized output enhances readability, helping users track the progress and settings easily.
  • 💻 DDP Support: Changes create a solid foundation for distributed training, allowing efficient scaling across multiple GPUs and improving training times.
  • 🔄 Easier Collaboration: These updates could facilitate collaborative development and training on different systems or clusters.

@NanoCode012
Contributor

NanoCode012 commented Jun 18, 2021

Hi Glenn, a quick run from this branch on the Docker image (commit 382ce4f) gives me the output below. It somewhat trains, but stopped at the end.

Output
[INFO] 2021-06-18 12:16:14,143 run: Running torch.distributed.run with args: ['/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py', '--master_port', '9980', '--nproc_per_node', '2', 'train.py', '--weights', 'yolov5s.pt', '--epochs', '3', '--img', '320', '--device', '0,1']
[INFO] 2021-06-18 12:16:14,149 run: Using nproc_per_node=2.
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
[INFO] 2021-06-18 12:16:14,150 api: Starting elastic_operator with launch configs:
  entrypoint       : train.py
  min_nodes        : 1
  max_nodes        : 1
  nproc_per_node   : 2
  run_id           : none
  rdzv_backend     : static
  rdzv_endpoint    : 127.0.0.1:9980
  rdzv_configs     : {'rank': 0, 'timeout': 900}
  max_restarts     : 3
  monitor_interval : 5
  log_dir          : None
  metrics_cfg      : {}

[INFO] 2021-06-18 12:16:14,153 local_elastic_agent: log directory set to: /tmp/torchelastic_nxmozyoe/none_tmcpzgfk
[INFO] 2021-06-18 12:16:14,153 api: [default] starting workers for entrypoint: python
[INFO] 2021-06-18 12:16:14,153 api: [default] Rendezvous'ing worker group
[INFO] 2021-06-18 12:16:14,153 static_tcp_rendezvous: Creating TCPStore as the c10d::Store implementation
/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py:52: FutureWarning: This is an experimental API and will be changed in future.
  warnings.warn(
[INFO] 2021-06-18 12:16:14,159 api: [default] Rendezvous complete for workers. Result:
  restart_count=0
  master_addr=127.0.0.1
  master_port=9980
  group_rank=0
  group_world_size=1
  local_ranks=[0, 1]
  role_ranks=[0, 1]
  global_ranks=[0, 1]
  role_world_sizes=[2, 2]
  global_world_sizes=[2, 2]

[INFO] 2021-06-18 12:16:14,160 api: [default] Starting worker group
[INFO] 2021-06-18 12:16:14,160 __init__: Setting worker0 reply file to: /tmp/torchelastic_nxmozyoe/none_tmcpzgfk/attempt_0/0/error.json
[INFO] 2021-06-18 12:16:14,161 __init__: Setting worker1 reply file to: /tmp/torchelastic_nxmozyoe/none_tmcpzgfk/attempt_0/1/error.json
github: skipping check (Docker image), for updates see https://github.com/ultralytics/yolov5
{'RANK': 1, 'LOCAL_RANK': 1, 'WORLD_SIZE': 2}
YOLOv5 🚀 v5.0-223-g382ce4f torch 1.9.0+cu102 CUDA:0 (Tesla V100-SXM2-32GB, 32510.5MB)
                                              CUDA:1 (Tesla V100-SXM2-32GB, 32510.5MB)

{'RANK': 0, 'LOCAL_RANK': 0, 'WORLD_SIZE': 2}
Added key: store_based_barrier_key:1 to store for rank: 0
Rank 0: Completed store-based barrier for 2 nodes.
Namespace(adam=False, artifact_alias='latest', batch_size=8, bbox_interval=-1, bucket='', cache_images=False, cfg='', data='data/coco128.yaml', device='0,1', entity=None, epochs=3, evolve=False, exist_ok=False, hyp='data/hyp.scratch.yaml', image_weights=False, img_size=[320, 320], label_smoothing=0.0, linear_lr=False, multi_scale=False, name='exp', noautoanchor=False, nosave=False, notest=False, project='runs/train', quad=False, rect=False, resume=False, save_dir='runs/train/exp10', save_period=-1, single_cls=False, sync_bn=False, total_batch_size=16, upload_dataset=False, weights='yolov5s.pt', workers=8)
hyperparameters: lr0=0.01, lrf=0.2, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0
tensorboard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/
wandb: Install Weights & Biases for YOLOv5 logging with 'pip install wandb' (recommended)

                 from  n    params  module                                  arguments                     
  0                -1  1      3520  models.common.Focus                     [3, 32, 3]                    
  1                -1  1     18560  models.common.Conv                      [32, 64, 3, 2]                
  2                -1  1     18816  models.common.C3                        [64, 64, 1]                   
  3                -1  1     73984  models.common.Conv                      [64, 128, 3, 2]               
  4                -1  1    156928  models.common.C3                        [128, 128, 3]                 
  5                -1  1    295424  models.common.Conv                      [128, 256, 3, 2]              
  6                -1  1    625152  models.common.C3                        [256, 256, 3]                 
  7                -1  1   1180672  models.common.Conv                      [256, 512, 3, 2]              
  8                -1  1    656896  models.common.SPP                       [512, 512, [5, 9, 13]]        
  9                -1  1   1182720  models.common.C3                        [512, 512, 1, False]          
 10                -1  1    131584  models.common.Conv                      [512, 256, 1, 1]              
 11                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 12           [-1, 6]  1         0  models.common.Concat                    [1]                           
 13                -1  1    361984  models.common.C3                        [512, 256, 1, False]          
 14                -1  1     33024  models.common.Conv                      [256, 128, 1, 1]              
 15                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 16           [-1, 4]  1         0  models.common.Concat                    [1]                           
 17                -1  1     90880  models.common.C3                        [256, 128, 1, False]          
 18                -1  1    147712  models.common.Conv                      [128, 128, 3, 2]              
 19          [-1, 14]  1         0  models.common.Concat                    [1]                           
 20                -1  1    296448  models.common.C3                        [256, 256, 1, False]          
 21                -1  1    590336  models.common.Conv                      [256, 256, 3, 2]              
 22          [-1, 10]  1         0  models.common.Concat                    [1]                           
 23                -1  1   1182720  models.common.C3                        [512, 512, 1, False]          
 24      [17, 20, 23]  1    229245  models.yolo.Detect                      [80, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [128, 256, 512]]
/opt/conda/lib/python3.8/site-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at  /pytorch/c10/core/TensorImpl.h:1156.)
  return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
/opt/conda/lib/python3.8/site-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at  /pytorch/c10/core/TensorImpl.h:1156.)
  return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
Model Summary: 283 layers, 7276605 parameters, 7276605 gradients, 17.1 GFLOPs

Transferred 362/362 items from yolov5s.pt
Scaled weight_decay = 0.0005
Optimizer groups: 62 .bias, 62 conv.weight, 59 other
train: Scanning '../coco128/labels/train2017' images and labels...128 found, 0 missing, 2 empty, 0 corrupted: 100%|███████████████| 128/128 [00:00<00:00, 1334.50it/s]
train: New cache created: ../coco128/labels/train2017.cache
train: Scanning '../coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|██████████████████| 128/128 [00:00<?, ?it/s]
val: Scanning '../coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|████████████████████| 128/128 [00:00<?, ?it/s][W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
val: Scanning '../coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|████████████████████| 128/128 [00:00<?, ?it/s]
val: Scanning '../coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|████████████████████| 128/128 [00:00<?, ?it/s]
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
Plotting labels... 
val: Scanning '../coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|████████████████████| 128/128 [00:01<?, ?it/s]
val: Scanning '../coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|████████████████████| 128/128 [00:01<?, ?it/s]
val: Scanning '../coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|████████████████████| 128/128 [00:01<?, ?it/s]
val: Scanning '../coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|████████████████████| 128/128 [00:01<?, ?it/s]
val: Scanning '../coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|████████████████████| 128/128 [00:01<?, ?it/s]
val: Scanning '../coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|████████████████████| 128/128 [00:01<?, ?it/s]

autoanchor: Analyzing anchors... anchors/target = 3.97, Best Possible Recall (BPR) = 0.9580. Attempting to improve anchors, please wait...
autoanchor: WARNING: Extremely small objects found. 35 of 929 labels are < 3 pixels in size.
autoanchor: Running kmeans for 9 anchors on 927 points...
autoanchor: thr=0.25: 0.9623 best possible recall, 3.58 anchors past thr
autoanchor: n=9, img_size=320, metric_all=0.252/0.634-mean/best, past_thr=0.474-mean: 11,12,  31,34,  74,42,  46,87,  132,90,  78,161,  192,151,  173,273,  305,189
autoanchor: Evolving anchors with Genetic Algorithm: fitness = 0.6718: 100%|█████████████████████████████████████████████████████| 1000/1000 [00:01<00:00, 938.65it/s]
autoanchor: thr=0.25: 0.9925 best possible recall, 3.71 anchors past thr
autoanchor: n=9, img_size=320, metric_all=0.261/0.672-mean/best, past_thr=0.478-mean: 6,6,  10,13,  25,28,  64,45,  43,87,  64,130,  139,118,  184,194,  313,214
autoanchor: New anchors saved to model. Update model *.yaml to use these anchors in the future.

Image sizes 320 train, 320 test
Using 4 dataloader workers
Logging results to runs/train/exp10
Starting training for 3 epochs...

     Epoch   gpu_mem       box       obj       cls     total    labels  img_size
       0/2    0.644G   0.08607   0.03836   0.03987    0.1643       111       320:  12%|██████▏                                          | 1/8 [00:08<01:01,  8.78s/it]Reducer buckets have been rebuilt in this iteration.
       0/2     1.59G   0.07716   0.04663   0.03862    0.1624        86       320: 100%|█████████████████████████████████████████████████| 8/8 [00:11<00:00,  1.41s/it]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|███████████████████████████████████████████| 8/8 [00:03<00:00,  2.04it/s]
                 all        128        929      0.387      0.485      0.404      0.225

     Epoch   gpu_mem       box       obj       cls     total    labels  img_size
       1/2     1.59G   0.06844   0.04651   0.03561    0.1506        80       320: 100%|█████████████████████████████████████████████████| 8/8 [00:02<00:00,  3.29it/s]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|███████████████████████████████████████████| 8/8 [00:02<00:00,  3.32it/s]
                 all        128        929      0.453      0.488       0.46      0.275

     Epoch   gpu_mem       box       obj       cls     total    labels  img_size
       2/2     1.59G   0.06269   0.05292   0.03293    0.1485       150       320: 100%|█████████████████████████████████████████████████| 8/8 [00:02<00:00,  3.02it/s]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|███████████████████████████████████████████| 8/8 [00:04<00:00,  1.80it/s]
                 all        128        929       0.51       0.49      0.493        0.3
3 epochs completed in 0.012 hours.

Optimizer stripped from runs/train/exp10/weights/last.pt, 14.8MB
Optimizer stripped from runs/train/exp10/weights/best.pt, 14.8MB

I'll try to check your changes. Right now, I notice a few things.

  • my training got stuck on the last line with Optimizer stripped.... Perhaps a worker did not exit, or it relates to a wandb update that I haven't been following recently
Output when Ctrl+C
Optimizer stripped from runs/train/exp10/weights/best.pt, 14.8MB
^CTraceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 637, in <module>
    main()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 629, in main
    run(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 621, in run
    elastic_launch(
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 238, in launch_agent
    result = agent.run()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py", line 125, in wrapper
    result = f(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 700, in run
    result = self._invoke_run(role)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 828, in _invoke_run
    time.sleep(monitor_interval)
KeyboardInterrupt
  • There are a lot of extra logs with the new run module. I wonder if we can hide them.
  • Warning: Leaking Caffe2 thread-pool after fork. <- this warning appears while loading the dataset

Unfortunately, since this is quite new, there aren't any examples we can follow.

train.py (outdated)
```python
assert torch.cuda.device_count() > LOCAL_RANK, 'too few GPUS for DDP command'
torch.cuda.set_device(LOCAL_RANK)
device = torch.device('cuda', LOCAL_RANK)
dist.init_process_group(backend="gloo")  # distributed backend
```
Contributor

nccl should be the faster backend for DDP. I recall that Windows only supports gloo, however.

@glenn-jocher
Copy link
Member Author

glenn-jocher commented Jun 18, 2021

@NanoCode012 I see the same hanging at the end of training. This warning seems to imply that the hanging may be related to the dist.barrier() we use in torch_distributed_zero_first():

[W ProcessGroupNCCL.cpp:1569] Rank 3 using best-guess GPU 3 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.

```python
@contextmanager
def torch_distributed_zero_first(local_rank: int):
    """
    Decorator to make all processes in distributed training wait for each local_master to do something.
    """
    if local_rank not in [-1, 0]:
        torch.distributed.barrier()
    yield
    if local_rank == 0:
        torch.distributed.barrier()
```
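
Following the warning's suggestion, one mitigation would be to pass device_ids to barrier() so each rank pins the collective to a known GPU. A sketch only, not the code in this PR; note that device_ids is accepted only by the nccl backend:

```python
from contextlib import contextmanager

import torch.distributed as dist


@contextmanager
def torch_distributed_zero_first(local_rank: int):
    """Make all processes wait for the local master, pinning barriers to a known GPU."""
    if local_rank not in [-1, 0]:
        dist.barrier(device_ids=[local_rank])  # nccl-only keyword argument
    yield
    if local_rank == 0:
        dist.barrier(device_ids=[0])
```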

@glenn-jocher
Member Author

@NanoCode012 agreed there are a lot of warnings/messages everywhere.

Part of this newly appeared in torch 1.9:

/opt/conda/lib/python3.8/site-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at  /pytorch/c10/core/TensorImpl.h:1156.)
  return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)

And this part maybe either be due to 1.9 or the Docker image.

[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)

This is just my own debug code to remove before merge.

{'RANK': 0, 'LOCAL_RANK': 0, 'WORLD_SIZE': 2}

The rest of the extra output is due to torch.distributed.run.

@NanoCode012
Contributor

NanoCode012 commented Jun 18, 2021

@NanoCode012 I see the same hanging at the end of training. This warning seems to imply that the hanging may be related to the dist.barrier() we use in torch_distributed_zero_first():

[W ProcessGroupNCCL.cpp:1569] Rank 3 using best-guess GPU 3 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.

I think this warning is because of this comment in the run module:
https://github.com/pytorch/pytorch/blob/d5988c5eca0221e9ef58918e4f0b504940cb926a/torch/distributed/run.py#L210-L212

However, I notice that when I re-ran the latest code from this branch, I also got this error, but not in my earlier run. Perhaps it's something you changed after commit 382ce4f?

Edit: I think the cause of this error is the change to nccl in 8ae9ea1. Checking out the commit before it does not give me that warning. This is quite confusing...

@glenn-jocher changed the title from Update DDP for torch.distributed.run to Update DDP for torch.distributed.run with gloo backend on Jun 19, 2021
@glenn-jocher
Member Author

glenn-jocher commented Jun 19, 2021

@NanoCode012 ok I found a fix: torch.distributed.destroy_process_group() needs to be called outside of the main train() function. Now it works with both nccl and gloo, using both torch.distributed.launch and torch.distributed.run.

EDIT: I can't profile this right now as there's no EC2 spot availability, but I will try to profile this week, and we can revert to nccl if need be, or even introduce an if statement with if torch.distributed.is_nccl_available().
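
For reference, a minimal sketch of that arrangement, with the backend check and the process-group teardown outside train() (illustrative structure, not the exact merged code):

```python
import os

import torch
import torch.distributed as dist

LOCAL_RANK = int(os.getenv('LOCAL_RANK', -1))  # set by torch.distributed.run
RANK = int(os.getenv('RANK', -1))


def train():
    pass  # forward/backward/optimizer loop runs here


def main():
    if LOCAL_RANK != -1:  # DDP mode
        torch.cuda.set_device(LOCAL_RANK)
        dist.init_process_group(backend='nccl' if dist.is_nccl_available() else 'gloo')
    train()
    # key fix: destroy the process group after train() returns, not inside it
    if RANK in [-1, 0] and dist.is_initialized():
        dist.destroy_process_group()


if __name__ == '__main__':
    main()
```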

glenn-jocher merged commit fad27c0 into master on Jun 19, 2021
glenn-jocher deleted the DDP_run branch on June 19, 2021 at 14:30
@lleye

lleye commented Jul 20, 2021

Hi @glenn-jocher @NanoCode012, I'm having the same issues here.

[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)

[E ProcessGroupNCCL.cpp:566] [Rank #] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=60000) ran for 60960 milliseconds before timing out.

RuntimeError: replicas[0][0] in this process with sizes [48, 12, 3, 3] appears not to match sizes of the same param in process 0.

[E ProcessGroupNCCL.cpp:325] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'

Training on multiple A100s using DDP in Docker.

@glenn-jocher
Member Author

glenn-jocher commented Jul 20, 2021

@lleye this is a merged PR. If you have an issue that meets the criteria below, I recommend you open a new issue with code to reproduce.

How to create a Minimal, Reproducible Example

When asking a question, people will be better able to provide help if you provide code that they can easily understand and use to reproduce the problem. This is referred to by community members as creating a minimum reproducible example. Your code that reproduces the problem should be:

  • Minimal – Use as little code as possible that still produces the same problem
  • Complete – Provide all parts someone else needs to reproduce your problem in the question itself
  • Reproducible – Test the code you're about to provide to make sure it reproduces the problem

In addition to the above requirements, for Ultralytics to provide assistance your code should be:

  • Current – Verify that your code is up-to-date with current GitHub master, and if necessary git pull or git clone a new copy to ensure your problem has not already been resolved by previous commits.
  • Unmodified – Your problem must be reproducible without any modifications to the codebase in this repository. Ultralytics does not provide support for custom code ⚠️.

If you believe your problem meets all of the above criteria, please close this issue and raise a new one using the 🐛 Bug Report template and providing a minimum reproducible example to help us better understand and diagnose your problem.

Thank you! 😃

robin-maillot added a commit to robin-maillot/yolov5 that referenced this pull request Sep 22, 2021
* ConfusionMatrix `normalize=True` fix (ultralytics#3587)

* train.py GPU memory fix (ultralytics#3590)

* train.py GPU memory fix

* ema

* cuda

* cuda

* zeros input

* to device

* batch index 0

* W&B: Allow changed in config variable ultralytics#3588

* Update `dataset_stats()` (ultralytics#3593)

@kalenmike this is a PR to add image filenames and labels to our stats dictionary and to save the dictionary to JSON. Save location is next to the train labels.cache file. The single JSON contains all stats for entire dataset.

Usage example:
```python
from utils.datasets import *

dataset_stats('coco128.yaml', verbose=True)
```

* Delete __init__.py (ultralytics#3596)

* Simplify README.md (ultralytics#3530)

* Update README.md

* added hosted images

* added new logo

* testing image hosting

* changed svgs to pngs

* removed old header

* Update README.md

* correct colab image source

* splash.jpg

* rocket and W&B fix

* added contributing template

* added social media to top section

* increased size of top social media

* cleanup and updates

* rearrange quickstarts

* API cleanup

* PyTorch Hub cleanup

* Add tutorials

* cleanup

* update CONTRIBUTING.md

* Update README.md

* update wandb link

* Update README.md

* remove tutorials header

* update environments and integrations

* Comment API image

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* double spaces after section

* Update README.md

* Update README.md

Co-authored-by: Glenn Jocher <glenn.jocher@ultralytics.com>

* Update datasets.py (ultralytics#3591)

* 'changes-in_dataset'

* Update datasets.py

Co-authored-by: Glenn Jocher <glenn.jocher@ultralytics.com>

* Download COCO and VOC by default (ultralytics#3608)

* Suppress wandb images size mismatch warning (ultralytics#3611)

* supress wandb images size mismatch warning

* supress wandb images size mismatch warning

* PEP8 reformat and optimize imports

Co-authored-by: Glenn Jocher <glenn.jocher@ultralytics.com>

* Fix incorrect end epoch comment (ultralytics#3612)

* Update `check_file()` (ultralytics#3622)

* Update `check_file()`

* Update datasets.py

* Update README.md (ultralytics#3624)

* FROM nvcr.io/nvidia/pytorch:21.05-py3 (ultralytics#3633)

* Add `**/*.torchscript.pt` (ultralytics#3634)

* Update `verify_image_label()` (ultralytics#3635)

* RUN pip install --no-cache -U torch torchvision (ultralytics#3637)

* Assert non-premature end of JPEG images (ultralytics#3638)

* premature end of JPEG images

* PEP8 reformat

Co-authored-by: Glenn Jocher <glenn.jocher@ultralytics.com>

* Update CONTRIBUTING.md (ultralytics#3645)

* Update CONTRIBUTING.md

* Update CONTRIBUTING.md

* Update CONTRIBUTING.md

* Update CONTRIBUTING.md

* Update CONTRIBUTING.md (ultralytics#3647)

* `is_coco` list fix (ultralytics#3646)

* Update README.md (ultralytics#3650)

Be more user-friendly to new users

* Update `dataset_stats()` to list of dicts (ultralytics#3657)

* Update `dataset_stats()` to list of dicts

@kalenmike

* Update datasets.py

* Remove `/weights` directory (ultralytics#3659)

* Remove `/weights` directory

* cleanup

* Update download_weights.sh comment (ultralytics#3662)

* Update train.py (ultralytics#3667)

* Update `train(hyp, *args)` to accept `hyp` file or dict (ultralytics#3668)

* Update TensorBoard (ultralytics#3669)

* Update `WORLD_SIZE` and `RANK` retrieval (ultralytics#3670)

* Cache v0.3: improved corrupt image/label reporting (ultralytics#3676)

* Cache v0.3: improved corrupt image/label reporting

Fix for ultralytics#3656 (comment)

* cleanup

* EMA changes for pre-model's batch_size (ultralytics#3681)

* EMA changes for pre-model's batch_size

* Update train.py

* Update torch_utils.py

Co-authored-by: Glenn Jocher <glenn.jocher@ultralytics.com>

* Update README.md (ultralytics#3684)

* Update cache check (ultralytics#3691)

Swapped order of operations for faster first per ultralytics@f527704#r52362419

* Skip HSV augmentation when hyperparameters are [0, 0, 0] (ultralytics#3686)

* Create shortcircuit in augment_hsv when hyperparameter are zero

* implement faster opt-in

Co-authored-by: Glenn Jocher <glenn.jocher@ultralytics.com>

* Slightly modify CLI execution (ultralytics#3687)

* Slightly modify CLI execution

This simple change makes it easier to run the primary functions of this
repo (train/detect/test) from within Python. An object which represents
`opt` can be constructed and fed to the `main` function of each of these
modules, rather than having to call the lower level functions directly,
or run the module as a script.

* Update export.py

Add CLI parsing update for more convenient module usage within Python.

Co-authored-by: Lewis Belcher <lb@desupervised.io>

* Reformat (ultralytics#3694)

* Update DDP for `torch.distributed.run` with `gloo` backend (ultralytics#3680)

* Update DDP for `torch.distributed.run`

* Add LOCAL_RANK

* remove opt.local_rank

* backend="gloo|nccl"

* print

* print

* debug

* debug

* os.getenv

* gloo

* gloo

* gloo

* cleanup

* fix getenv

* cleanup

* cleanup destroy

* try nccl

* return opt

* add --local_rank

* add timeout

* add init_method

* gloo

* move destroy

* move destroy

* move print(opt) under if RANK

* destroy only RANK 0

* move destroy inside train()

* restore destroy outside train()

* update print(opt)

* cleanup

* nccl

* gloo with 60 second timeout

* update namespace printing

* Eliminate `total_batch_size` variable (ultralytics#3697)

* Eliminate `total_batch_size` variable

* cleanup

* Update train.py

* Add torch DP warning (ultralytics#3698)

* Add `train.run()` method (ultralytics#3700)

* Update train.py explicit arguments

* Update train.py

* Add run method

* Update DDP backend `if dist.is_nccl_available()` (ultralytics#3705)

* [x]W&B: Don't resume transfer learning runs (ultralytics#3604)

* Allow config cahnge

* Allow val change in wandb config

* Don't resume transfer learning runs

* Add entity in log dataset

* Update 4 main ops for paths and .run() (ultralytics#3715)

* Add yolov5/ to path

* rename functions to run()

* cleanup

* rename fix

* CI fix

* cleanup find models/export.py

* Fix `img2label_paths()` order (ultralytics#3720)

* Fix `img2label_paths()` order

* fix, 1

* Fix typo (ultralytics#3729)

* Backwards compatible cache version checks (ultralytics#3730)

* Update readme.

* Update `check_datasets()` for dynamic unzip path (ultralytics#3732)

@kalenmike

* Create `data/hyps` directory (ultralytics#3747)

* Force non-zero hyp evolution weights `w` (ultralytics#3748)

Fix for ultralytics#3741

* Edit comment (ultralytics#3759)

edit comment

* Add optional dataset.yaml `path` attribute (ultralytics#3753)

* Add optional dataset.yaml `path` attribute

@kalenmike

* pass locals to python scripts

* handle lists

* update coco128.yaml

* Capitalize first letter

* add test key

* finalize GlobalWheat2020.yaml

* finalize objects365.yaml

* finalize SKU-110K.yaml

* finalize SKU-110K.yaml

* finalize VisDrone.yaml

* NoneType fix

* update download comment

* voc to VOC

* update

* update VOC.yaml

* update VOC.yaml

* remove dashes

* delete get_voc.sh

* force coco and coco128 to ../datasets

* Capitalize Argoverse_HD.yaml

* Capitalize Objects365.yaml

* update Argoverse_HD.yaml

* coco segments fix

* VOC single-thread

* update Argoverse_HD.yaml

* update data_dict in test handling

* create root

* COCO annotations JSON fix (ultralytics#3764)

* Add `xyxy2xywhn()` (ultralytics#3765)

* Edit Comments for numpy2torch tensor process

Edit Comments for numpy2torch tensor process

* add xyxy2xywhn

add xyxy2xywhn

* add xyxy2xywhn

* formatting

* pass arguments

pass arguments

* edit comment for xyxy2xywhn()

edit comment for xyxy2xywhn()

* cleanup datasets.py

Co-authored-by: Glenn Jocher <glenn.jocher@ultralytics.com>

* Remove DDP MultiHeadAttention fix (ultralytics#3768)

* fix/incorrect_fitness_import (ultralytics#3770)

* W&B: Update Tables API and comply with new dataset_check (ultralytics#3772)

* Update tables API and windows path fix

* update dataset check

* NGA xView 2018 Dataset Auto-Download (ultralytics#3775)

* update clip_coords for numpy

* uncomment

* cleanup

* Add autosplits

* fix

* cleanup

* Update README.md fix banner width (ultralytics#3785)

* Objectness IoU Sort (ultralytics#3610)

Co-authored-by: U-LAPTOP-5N89P8V7\banhu <ban.huang@foxmail.com>

* Update objectness IoU sort (ultralytics#3786)

* Create hyp.scratch-p6.yaml (ultralytics#3787)

* Fix datasets for aws and get_coco.sh (ultralytics#3788)

* merge master

* Update get_coco.sh

* Update seeds for single-GPU reproducibility (ultralytics#3789)

For seed=0 on single-GPU.

* Update Usage examples (ultralytics#3790)

* nvcr.io/nvidia/pytorch:21.06-py3 (ultralytics#3791)

* Update Dockerfile (ultralytics#3792)

* FROM nvcr.io/nvidia/pytorch:21.05-py3 (ultralytics#3794)

* Fix competition link (ultralytics#3799)

* link to the competition repaired

* Update README.md

Co-authored-by: Glenn Jocher <glenn.jocher@ultralytics.com>

* Fix warmup `accumulate` (ultralytics#3722)

* gradient accumulation during warmup in train.py

Context:
`accumulate` is the number of batches/gradients accumulated before calling the next optimizer.step().
During warmup, it is ramped up from 1 to the final value nbs / batch_size. 
Although I have not seen this in other libraries, I like the idea. During warmup, as grads are large, too-large steps are more of an issue than gradient noise due to small steps.

The bug:
The condition to perform the opt step is wrong
> if ni % accumulate == 0:
This produces irregular step sizes if `accumulate` is not constant. It becomes relevant when batch_size is small and `accumulate` changes many times during warmup.

This demo also shows the proposed solution, to use a ">=" condition instead:
https://colab.research.google.com/drive/1MA2z2eCXYB_BC5UZqgXueqL_y1Tz_XVq?usp=sharing

Further, I propose not to restrict the number of warmup iterations to >= 1000. If the user changes hyp['warmup_epochs'], this causes unexpected behavior. Also, it makes evolution unstable if this parameter were to be optimized.
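
A minimal sketch of the proposed ">=" condition with last-step tracking (illustrative values; loss/optimizer calls are stubbed out as comments):

```python
import numpy as np

nbs = 64            # nominal batch size
batch_size = 4      # actual batch size (small, so accumulate changes often)
nw = 300            # number of warmup iterations (example value)
last_opt_step = -1  # iteration of the last optimizer step

for ni in range(1000):  # ni = integrated batch counter
    # accumulate ramps from 1 up to nbs / batch_size during warmup
    accumulate = max(1, int(round(np.interp(ni, [0, nw], [1, nbs / batch_size]))))
    # loss.backward() would run here
    if ni - last_opt_step >= accumulate:  # ">=" replaces "ni % accumulate == 0"
        # optimizer.step(); optimizer.zero_grad() would run here
        last_opt_step = ni
```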

* replace last_opt_step tracking by do_step(ni)

* add docstrings

* move down nw

* Update train.py

* revert math import move

Co-authored-by: Glenn Jocher <glenn.jocher@ultralytics.com>

* Add feature map visualization (ultralytics#3804)

* Add feature map visualization

Add a feature_visualization function to visualize the mid feature map of the model.

* Update yolo.py

* remove boolean from forward and reorder if statement

* remove print from forward

* General cleanup

* Indent

* Update plots.py

Co-authored-by: Glenn Jocher <glenn.jocher@ultralytics.com>

* Update `feature_visualization()` (ultralytics#3807)

* Update `feature_visualization()`

Only plot for data with height, width > 1

* cleanup

* Cleanup

* Fix for `dataset_stats()` with updated data.yaml (ultralytics#3819)

@kalenmike

* Move IoU functions to metrics.py (ultralytics#3820)

* Concise `TransformerBlock()` (ultralytics#3821)

* Update setup.py to use utf8 everywhere.

* Update setup.py to use utf8 everywhere again.

* Fix `LoadStreams()` dataloader frame skip issue (ultralytics#3833)

* Update datasets.py to read every 4th frame of streams

* Update datasets.py

Co-authored-by: Glenn Jocher <glenn.jocher@ultralytics.com>

* Plot `AutoShape()` detections in ascending order (ultralytics#3843)

* Copy-Paste augmentation for YOLOv5 (ultralytics#3845)

* Copy-paste augmentation initial commit

* if any segments

* Add obscuration rejection

* Add copy_paste hyperparameter

* Update comments

* Created using Colaboratory

* Created using Colaboratory

* Add EXIF rotation to YOLOv5 Hub inference (ultralytics#3852)

* rotating an image according to its exif tag

* Update common.py

* Update datasets.py

* Update datasets.py

faster

* delete extraneous gpg file

* Update common.py

Co-authored-by: Glenn Jocher <glenn.jocher@ultralytics.com>

* `--evolve 300` generations CLI argument (ultralytics#3863)

* evolve command accepts argument for number of generations

* evolve generations argument used in evolve for loop

* evolve argument boolean fixes

* default to 300 evolve generations

* Update train.py

Co-authored-by: John San Soucie <jsansoucie@whoi.edu>
Co-authored-by: Glenn Jocher <glenn.jocher@ultralytics.com>

* Add multi-stream saving feature (ultralytics#3864)

* Added the recording feature for multiple streams

Thanks for the very cool repo!!
I was trying to record multiple feeds at the same time, but the current version of the detector only had one video writer and one vid_path!
So the streams were not being saved; they were only initialized with one frame, and the process didn't record the whole thing.

Fix:
I made lists of `vid_writer` and `vid_path`, and the `i` from the loop over `pred` selects the writer that needs to write.

I hope this helps. Thanks!
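
For illustration, a sketch of the per-stream writer bookkeeping this describes (a simplified standalone function; names and the mp4v codec are assumptions):

```python
import cv2


def write_stream_frames(frames, save_paths, vid_paths, vid_writers, fps=30):
    """Write one frame per stream, (re)opening each writer when its target path changes."""
    for i, (im0, save_path) in enumerate(zip(frames, save_paths)):
        if vid_paths[i] != save_path:  # new video for stream i
            vid_paths[i] = save_path
            if isinstance(vid_writers[i], cv2.VideoWriter):
                vid_writers[i].release()  # release the previous writer
            h, w = im0.shape[:2]
            vid_writers[i] = cv2.VideoWriter(save_path, cv2.VideoWriter_fourcc(*'mp4v'), fps, (w, h))
        vid_writers[i].write(im0)
```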

* Cleanup list lengths

* batch size variable

* Update datasets.py

Co-authored-by: Glenn Jocher <glenn.jocher@ultralytics.com>

* Created using Colaboratory

* Models `*.yaml` reformat (ultralytics#3875)

* Create `utils/augmentations.py` (ultralytics#3877)

* Create `utils/augmentations.py`

* cleanup

* Improved BGR2RGB speeds (ultralytics#3880)

* Update BGR2RGB ops

* speed improvements

* cleanup

* Evolution commented `hyp['anchors']` fix (ultralytics#3887)

Fix for `KeyError: 'anchors'` error when start hyperparameter evolution:
```bash
python train.py --evolve
```

```bash
Traceback (most recent call last):
  File "E:\yolov5\train.py", line 623, in <module>
    hyp[k] = max(hyp[k], v[1])  # lower limit
KeyError: 'anchors'
```

* Hub models `map_location=device` (ultralytics#3894)

* Hub models `map_location=device`

* cleanup

* YOLOv5 + Albumentations integration (ultralytics#3882)

* Albumentations integration

* ToGray p=0.01

* print confirmation

* create instance in dataloader init method

* improved version handling

* transform not defined fix

* assert string update

* create check_version()

* add spaces

* update class comment

* Save PyTorch Hub models to `/root/hub/cache/dir` (ultralytics#3904)

* Create hubconf.py

* Add save_dir variable

Co-authored-by: Glenn Jocher <glenn.jocher@ultralytics.com>

* Feature visualization update (ultralytics#3920)

* Feature visualization update

* Save to jpg (faster)

* Save to png

* Fix `torch.hub.list('ultralytics/yolov5')` pathlib bug (ultralytics#3921)

* Update `setattr()` default for Hub PIL images (ultralytics#3923)

Fix inference from PIL source.

* `feature_visualization()` CUDA fix (ultralytics#3925)

* Update `dataset_stats()` for zipped datasets (ultralytics#3926)

* Update `dataset_stats()` for zipped datasets

@kalenmike

* cleanup

* Fix inconsistent NMS IoU value for COCO (ultralytics#3934)

Evaluation of 'best' and 'last' models will use the same params as the evaluation during the training phase. 
This PR fixes ultralytics#3907

* Created using Colaboratory

* Feature visualization improvements 32 (ultralytics#3947)

* Update augmentations.py (ultralytics#3948)

* Cache v0.4 update (ultralytics#3954)

* Numerical stability fix for Albumentations (ultralytics#3958)

* Update `albumentations>=1.0.2` (ultralytics#3966)

* Update `np.random.random()` to `random.random()` (ultralytics#3967)

* Update requirements.txt `albumentations>=1.0.2` (ultralytics#3972)

* `Ensemble()` visualize fix (ultralytics#3973)

* fix visualize error

* Revert "fix visualize error"

* add visualise profile

* Created using Colaboratory

* Update `probability` to `p` (ultralytics#3980)

* Alert (no detections) (ultralytics#3984)

* `Detections()` class `print()` overload

* Update common.py

* Update README.md (ultralytics#3996)

* Rename `test.py` to `val.py` (ultralytics#4000)

* W&B sweeps support (ultralytics#3938)

* Add support for W&B Sweeps

* Update and reformat

* Update search space

* reformat

* reformat sweep.py

* Update sweep.py

* Move sweeps files to wandb dir

* Remove print

Co-authored-by: Glenn Jocher <glenn.jocher@ultralytics.com>

* Update greetings.yml (ultralytics#4024)

* Update greetings.yml

* Update greetings.yml

* Add `--sync-bn` known issue (ultralytics#4032)

* Add `--sync-bn` known issue

* Update train.py

* Update greetings.yml (ultralytics#4037)

* Update README.md (ultralytics#4041)

* Update README.md

* Update README.md

* Update README.md

* AutoShape PosixPath support (ultralytics#4047)

* AutoShape PosixPath support

Usage example:

```python
from pathlib import Path

model = ...
file = Path('data/images/zidane.jpg')

results = model(file)
```

* Update common.py

* `val.py` refactor (ultralytics#4053)

* val.py refactor

* cleanup

* cleanup

* cleanup

* cleanup

* save after eval

* opt.imgsz bug fix

* wandb refactor

* dataloader to train_loader

* capitalize global variables

* runs/hub/exp to runs/detect/exp

* refactor wandb logging

* Refactor wandb operations (ultralytics#4061)

Co-authored-by: Ayush Chaurasia <ayush.chaurarsia@gmail.com>

* Module `super().__init__()` (ultralytics#4065)

* Module `super().__init__()`

* remove NMS

* Missing `nc` and `names` handling in check_dataset() (ultralytics#4066)

* Created using Colaboratory

* Albumentations >= 1.0.3 (ultralytics#4068)

* W&B: fix refactor bugs (ultralytics#4069)

* Refactor `export.py` (ultralytics#4080)

* Refactor `export.py`

* cleanup

* Update check_requirements()

* Update export.py

* Addition refactor `export.py` (ultralytics#4089)

* Addition refactor `export.py`

* Update export.py

* Add train.py ``--img-size` floor (ultralytics#4099)

* Update resume.py (ultralytics#4115)

* Fix indentation in `log_training_progress()` (ultralytics#4126)

* Update README.md (ultralytics#4134)

* ONNX inference update (ultralytics#4073)

* Rename `opset_version` to `opset` (ultralytics#4135)

* Update train.py (ultralytics#4136)

* Refactor train.py

* Update imports

* Update imports

* Update optimizer

* cleanup

* Refactor train.py and val.py `loggers` (ultralytics#4137)

* Update loggers

* Config

* Update val.py

* cleanup

* fix1

* fix2

* fix3 and reformat

* format sweep.py

* Logger() class

* cleanup

* cleanup2

* wandb package import fix

* wandb package import fix2

* txt fix

* fix4

* fix5

* fix6

* drop wandb into utils/loggers

* fix 7

* rename loggers/wandb_logging to loggers/wandb

* Update message

* Update message

* Update message

* cleanup

* Fix x axis bug

* fix rank 0 issue

* cleanup

* Update README.md (ultralytics#4143)

* Add `export.py` ONNX inference suggestion (ultralytics#4146)

* Created using Colaboratory

* New CSV Logger (ultralytics#4148)

* New CSV Logger

* cleanup

* move batch plots into Logger

* rename comment

* Remove total loss from progress bar

* mloss :-1 bug fix

* Update plot_results()

* Update plot_results()

* plot_results bug fix

* Created using Colaboratory

* Update dataset headers (ultralytics#4162)

* Update script headers (ultralytics#4163)

* Update download script headers

* cleanup

* bug fix attempt

* bug fix attempt2

* bug fix attempt3

* cleanup

* Improve docstrings and run names (ultralytics#4174)

* Update comments header (ultralytics#4184)

* Train from `--data path/to/dataset.zip` feature (ultralytics#4185)

* Train from `--data path/to/dataset.zip` feature

* Update dataset_stats()

* cleanup

* cleanup2

* Create yolov5-bifpn.yaml (ultralytics#4195)

* Update Hub Path inputs (ultralytics#4200)

* W&B: Restructure code to support the new dataset_check() feature (ultralytics#4197)

* Improve docstrings and run names

* default wandb login prompt with timeout

* return key

* Update api_key check logic

* Properly support zipped dataset feature

* update docstring

* Revert tuorial change

* extend changes to log_dataset

* add run name

* bug fix

* bug fix

* Update comment

* fix import check

* remove unused import

* Hardcore .yaml file extension

* reduce code

* Reformat using pycharm

Co-authored-by: Glenn Jocher <glenn.jocher@ultralytics.com>

* Update yolov5-bifpn.yaml (ultralytics#4208)

* W&B: More improvements and refactoring (ultralytics#4205)

* Improve docstrings and run names

* default wandb login prompt with timeout

* return key

* Update api_key check logic

* Properly support zipped dataset feature

* update docstring

* Revert tuorial change

* extend changes to log_dataset

* add run name

* bug fix

* bug fix

* Update comment

* fix import check

* remove unused import

* Hardcore .yaml file extension

* reduce code

* Reformat using pycharm

* Remove redundant try catch

* More refactoring and bug fixes

* retry

* Reformat using pycharm

* respect LOGGERS include list

Co-authored-by: Glenn Jocher <glenn.jocher@ultralytics.com>

* PyCharm reformat (ultralytics#4209)

* PyCharm reformat

* YAML reformat

* Markdown reformat

* Add `@try_except` decorator (ultralytics#4224)

* Explicit `requirements.txt` location (ultralytics#4225)

* Suppress torch 1.9.0 max_pool2d() warning (ultralytics#4227)

* Created using Colaboratory

* Created using Colaboratory

* Fix weight decay comment (ultralytics#4228)

* Update profiler (ultralytics#4236)

* Add `python train.py --freeze N` argument (ultralytics#4238)

* Add freeze as an argument

I train on different platforms, and sometimes I want to freeze some layers. I have to go into the code and change it, and also keep track of how many layers I froze on each platform. Please add the number of layers to freeze as an argument in future versions. Thanks.

* Update train.py

* Update train.py

* Cleanup

Co-authored-by: Glenn Jocher <glenn.jocher@ultralytics.com>

* Update `profile()` for CUDA Memory allocation (ultralytics#4239)

* Update profile()

* Update profile()

* Update profile()

* Update profile()

* Update profile()

* Update profile()

* Update profile()

* Update profile()

* Update profile()

* Update profile()

* Update profile()

* Update profile()

* Cleanup

* Add `train.py` and `val.py` callbacks (ultralytics#4220)

* added callbacks

* Update callbacks.py

* Update train.py

* Update val.py

* Fix CamlCase add staticmethod

* Refactor logger into callbacks

* Cleanup

* New callback on_val_image_end()

* Add curves and results images to TensorBoard

Co-authored-by: Glenn Jocher <glenn.jocher@ultralytics.com>

* W&B: suppress warnings (ultralytics#4257)

* Improve docstrings and run names

* default wandb login prompt with timeout

* return key

* Update api_key check logic

* Properly support zipped dataset feature

* update docstring

* Revert tuorial change

* extend changes to log_dataset

* add run name

* bug fix

* bug fix

* Update comment

* fix import check

* remove unused import

* Hardcore .yaml file extension

* reduce code

* Reformat using pycharm

* Remove redundant try catch

* More refactoring and bug fixes

* retry

* Reformat using pycharm

* respect LOGGERS include list

* call wandblogger.log instead of wandb.log

Co-authored-by: Glenn Jocher <glenn.jocher@ultralytics.com>

* Update AP calculation (ultralytics#4260)

* Update AP calculation

* Cleanup

* Remove original

* Update Autoshape forward header (ultralytics#4271)

* Update variables (ultralytics#4273)

* Add `DWConvClass()` (ultralytics#4274)

* Add `DWConvClass()`

* Cleanup

* Cleanup2

* Update 'results saved to' string (ultralytics#4275)

* W&B: Fix sweep bug (ultralytics#4276)

* Improve docstrings and run names

* default wandb login prompt with timeout

* return key

* Update api_key check logic

* Properly support zipped dataset feature

* update docstring

* Revert tuorial change

* extend changes to log_dataset

* add run name

* bug fix

* bug fix

* Update comment

* fix import check

* remove unused import

* Hardcore .yaml file extension

* reduce code

* Reformat using pycharm

* Remove redundant try catch

* More refactoring and bug fixes

* retry

* Reformat using pycharm

* respect LOGGERS include list

* call wandblogger.log instead of wandb.log

* Fix Sweep bug

Co-authored-by: Glenn Jocher <glenn.jocher@ultralytics.com>

* Feature `python train.py --cache disk` (ultralytics#4049)

* Add cache-on-disk and cache-directory to cache images on disk

* Fix load_image with cache_on_disk

* Add no_cache flag for load_image

* Revert the parts('logging' and a new line) that do not need to be modified

* Add the assertion for shapes of cached images

* Add a suffix string for cached images

* Fix boundary-error of letterbox for load_mosaic

* Add prefix as cache-key of cache-on-disk

* Update cache-function on disk

* Add psutil in requirements.txt

* Update train.py

* Cleanup1

* Cleanup2

* Skip existing npy

* Include re-space

* Export return character fix

Co-authored-by: Glenn Jocher <glenn.jocher@ultralytics.com>

* Fixed logging level in distributed mode (ultralytics#4284)

Co-authored-by: fkwong <huangfuqiang@transai.cn>

* Simplify callbacks (ultralytics#4289)

* Evolve in CSV format (ultralytics#4307)

* Update evolution to CSV format

* Update

* Update

* Update

* Update

* Update

* reset args

* reset args

* reset args

* plot_results() fix

* Cleanup

* Cleanup2

* Update newline (ultralytics#4308)

* Update README.md (ultralytics#4309)

remove unnecessary "`"

* Simpler code for DWConvClass (ultralytics#4310)

* more simpler code for DWConvClass

more simpler code for DWConvClass

* remove DWConv function

* Replace DWConvClass with DWConv

* `int(mlc)` (ultralytics#4385)

* Fix module count in parse_model (ultralytics#4379)

Co-authored-by: yangyuantao <yangyuantao@transai.cn>

* Created using Colaboratory

* Update README.md (ultralytics#4387)

* W&B: Add advanced features tutorial (ultralytics#4384)

* Improve docstrings and run names

* default wandb login prompt with timeout

* return key

* Update api_key check logic

* Properly support zipped dataset feature

* update docstring

* Revert tuorial change

* extend changes to log_dataset

* add run name

* bug fix

* bug fix

* Update comment

* fix import check

* remove unused import

* Hardcore .yaml file extension

* reduce code

* Reformat using pycharm

* Remove redundant try catch

* More refactoring and bug fixes

* retry

* Reformat using pycharm

* respect LOGGERS include list

* Initial readme update

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

Co-authored-by: Glenn Jocher <glenn.jocher@ultralytics.com>

* W&B: Fix for 4360 (ultralytics#4388)

* Improve docstrings and run names

* default wandb login prompt with timeout

* return key

* Update api_key check logic

* Properly support zipped dataset feature

* update docstring

* Revert tuorial change

* extend changes to log_dataset

* add run name

* bug fix

* bug fix

* Update comment

* fix import check

* remove unused import

* Hardcore .yaml file extension

* reduce code

* Reformat using pycharm

* Remove redundant try catch

* More refactoring and bug fixes

* retry

* Reformat using pycharm

* respect LOGGERS include list

* Fix

* fix

Co-authored-by: Glenn Jocher <glenn.jocher@ultralytics.com>

* Fix rename `utils.google_utils` to `utils.downloads` (ultralytics#4393)

* Simplify ONNX inference command (ultralytics#4405)

* No cache option for reading datasets (ultralytics#4376)

* no cache option

* no cache option

* bit change

* changed to 0,1 instead of True False

* Update train.py

* Update datasets.py

Co-authored-by: Glenn Jocher <glenn.jocher@ultralytics.com>

* Update plots.py (ultralytics#4407)

* Add `yolov5s-ghost.yaml` (ultralytics#4412)

* Add yolov5s-ghost.yaml

* Finish C3Ghost

* Add C3Ghost to list

* Add C3Ghost to number of repeats if statement

* Fixes

* Cleanup

* Remove `encoding='ascii'` (ultralytics#4413)

* Remove `encoding='ascii'`

* Reinstate `encoding='ascii'` in emojis()

* Merge PIL and OpenCV in `plot_one_box(use_pil=False)` (ultralytics#4416)

* Merge PIL and OpenCV box plotting functions

* Add ASCII check to plot_one_box

* Cleanup

* Cleanup2

* Created using Colaboratory

* Standardize headers and docstrings (ultralytics#4417)

* Implement new headers

* Reformat 1

* Reformat 2

* Reformat 3 - math

* Reformat 4 - yaml

* Add `SPPF()` layer (ultralytics#4420)

* Add `SPPF()` layer

* Cleanup

* Add credit

* Created using Colaboratory

* Remove DDP process group timeout (ultralytics#4422)

* Update hubconf.py attempt_load  import (ultralytics#4428)

* TFLite prep (ultralytics#4436)

* Add TensorFlow and TFLite export (ultralytics#1127)

* Add models/tf.py for TensorFlow and TFLite export

* Set auto=False for int8 calibration

* Update requirements.txt for TensorFlow and TFLite export

* Read anchors directly from PyTorch weights

* Add --tf-nms to append NMS in TensorFlow SavedModel and GraphDef export

* Remove check_anchor_order, check_file, set_logging from import

* Reformat code and optimize imports

* Autodownload model and check cfg

* update --source path, img-size to 320, single output

* Adjust representative_dataset

* Put representative dataset in tfl_int8 block

* detect.py TF inference

* weights to string

* weights to string

* cleanup tf.py

* Add --dynamic-batch-size

* Add xywh normalization to reduce calibration error

* Update requirements.txt

TensorFlow 2.3.1 -> 2.4.0 to avoid int8 quantization error

* Fix imports

Move C3 from models.experimental to models.common

* Add models/tf.py for TensorFlow and TFLite export

* Set auto=False for int8 calibration

* Update requirements.txt for TensorFlow and TFLite export

* Read anchors directly from PyTorch weights

* Add --tf-nms to append NMS in TensorFlow SavedModel and GraphDef export

* Remove check_anchor_order, check_file, set_logging from import

* Reformat code and optimize imports

* Autodownload model and check cfg

* update --source path, img-size to 320, single output

* Adjust representative_dataset

* detect.py TF inference

* Put representative dataset in tfl_int8 block

* weights to string

* weights to string

* cleanup tf.py

* Add --dynamic-batch-size

* Add xywh normalization to reduce calibration error

* Update requirements.txt

TensorFlow 2.3.1 -> 2.4.0 to avoid int8 quantization error

* Fix imports

Move C3 from models.experimental to models.common

* implement C3() and SiLU()

* Fix reshape dim to support dynamic batching

* Add epsilon argument in tf_BN, which is different between TF and PT

* Set stride to None if not using PyTorch, and do not warmup without PyTorch

* Add list support in check_img_size()

* Add list input support in detect.py

* sys.path.append('./') to run from yolov5/

* Add int8 quantization support for TensorFlow 2.5

* Add get_coco128.sh

* Remove --no-tfl-detect in models/tf.py (Use tf-android-tfl-detect branch for EdgeTPU)

* Update requirements.txt

* Replace torch.load() with attempt_load()

* Update requirements.txt

* Add --tf-raw-resize to set half_pixel_centers=False

* Add --agnostic-nms for TF class-agnostic NMS

* Cleanup after merge

* Cleanup2 after merge

* Cleanup3 after merge

* Add tf.py docstring with credit and usage

* pb saved_model and tflite use only one model in detect.py

* Add use cases in docstring of tf.py

* Remove redundant `stride` definition

* Remove keras direct import

* Fix `check_requirements(('tensorflow>=2.4.1',))`

Co-authored-by: Glenn Jocher <glenn.jocher@ultralytics.com>

* Fix default `--weights yolov5s.pt` (ultralytics#4458)

* Fix missing labels after albumentations (ultralytics#4455)

* fix missing labels after augmentation

* Update datasets.py

Cleanup

Co-authored-by: Huu Quan <huuquan@HuuQuans-MacBook.local>
Co-authored-by: Glenn Jocher <glenn.jocher@ultralytics.com>

* `check_requirements(('coremltools',))` (ultralytics#4478)

* `check_requirements(('coremltools',))`

* Update ci-testing.yml

* Update ci-testing.yml

* W&B: Refactor the wandb_utils.py file (ultralytics#4496)

* Improve docstrings and run names

* default wandb login prompt with timeout

* return key

* Update api_key check logic

* Properly support zipped dataset feature

* update docstring

* Revert tuorial change

* extend changes to log_dataset

* add run name

* bug fix

* bug fix

* Update comment

* fix import check

* remove unused import

* Hardcode `.yaml` file extension

* reduce code

* Reformat using pycharm

* Remove redundant try catch

* More refactoring and bug fixes

* retry

* Reformat using pycharm

* respect LOGGERS include list

* Fix

* fix

* refactor constructor

* refactor

* refactor

* refactor

* PyCharm reformat

Co-authored-by: Glenn Jocher <glenn.jocher@ultralytics.com>

* Add `install=True` argument to `check_requirements` (ultralytics#4512)

* Add `install=True` argument to `check_requirements`

* Update general.py

* Automatic TFLite uint8 determination (ultralytics#4515)

* Auto TFLite uint8 detection

This PR automatically determines whether TFLite models are uint8-quantized rather than accepting a manual argument (a sketch of the idea follows below).

The quantization determination is based on @zldrobit comment ultralytics#1127 (comment)

* Cleanup
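
A minimal sketch of the auto-detection idea, reading the dtype from the interpreter's input details (model path illustrative):

```python
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path='yolov5s-int8.tflite')  # hypothetical path
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
int8 = input_details[0]['dtype'] == np.uint8  # True if the model is uint8-quantized
print(f'uint8 model: {int8}')
```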

* Fix for `python models/yolo.py --profile` (ultralytics#4541)

Profiling fix copies input to Detect layer to circumvent inplace changes to the feature maps.
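
A sketch of the idea (the helper name and run count are illustrative): time the module on a clone of its input, so in-place ops inside the layer leave the original feature maps untouched.

```python
import time
import torch

def profile_layer(m, x, runs=10):
    xc = x.clone()  # copy the input to circumvent in-place changes
    t0 = time.time()
    with torch.no_grad():
        for _ in range(runs):
            _ = m(xc)
    return (time.time() - t0) / runs  # mean forward time in seconds
```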

* Auto-fix corrupt JPEGs (ultralytics#4548)

* Autofix corrupt JPEGs

This PR automatically re-saves corrupt JPEGs and trains with the re-saved images. WARNING: this will overwrite the existing corrupt JPEGs in a dataset and replace them with correct JPEGs, though the file size may increase and the image contents may not be exactly the same due to lossy JPEG compression schemes. Results may vary by JPEG decoder and hardware.

Current behavior is to exclude corrupt JPEGs from training with a warning to the user, but many users have been complaining about large parts of their dataset being excluded from training.

* Clarify re-save reason

* Fix for corrupt JPEGs auto-fix PR (ultralytics#4560)

The auto-fix corrupt JPEGs PR introduced a bug whereby the f.seek() operation read all of the bytes in the image, leaving the PIL image nothing to read upon the .save() operation.

Fix was to re-open the image using PIL before saving.
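
A sketch of the combined behavior under stated assumptions (path and detection heuristic illustrative, not the exact repository code); note the fresh Image.open() before .save(), which is the fix described above:

```python
from PIL import Image

im_file = 'datasets/images/corrupt.jpg'  # hypothetical path
with open(im_file, 'rb') as f:
    f.seek(-2, 2)  # jump to the last two bytes of the file
    corrupt = f.read() != b'\xff\xd9'  # missing JPEG end-of-image marker
if corrupt:
    # Re-open with PIL before saving: a consumed handle has nothing left to write
    Image.open(im_file).save(im_file, 'JPEG', subsampling=0, quality=100)
    print(f'WARNING: corrupt JPEG restored and re-saved: {im_file}')
```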

* Fix for AP calculation limits 0.0 - 1.0 (ultralytics#4563)

This PR aligns AP computation with the practices of Detectron2 and MMDetection.

Problem first noted by @yusiyoh in ultralytics#4546
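
A sketch of AP integration with the recall sentinels clamped to the 0.0 - 1.0 limits (function name illustrative; the 101-point grid follows the COCO convention):

```python
import numpy as np

def average_precision(recall, precision):
    # recall must be sorted ascending; sentinels clamp the curve to 0.0 - 1.0
    mrec = np.concatenate(([0.0], recall, [1.0]))
    mpre = np.concatenate(([1.0], precision, [0.0]))
    mpre = np.flip(np.maximum.accumulate(np.flip(mpre)))  # monotonic precision envelope
    x = np.linspace(0.0, 1.0, 101)  # 101-point interpolation (COCO)
    return np.trapz(np.interp(x, mrec, mpre), x)  # area under the interpolated curve
```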

* ONNX opset 13 (ultralytics#4566)

* Add EarlyStopping feature (ultralytics#4576)

* Add EarlyStopping feature (a minimal sketch follows this commit list)

* Add comment

* Cleanup

* Cleanup2

* debug

* debug2

* debug3

* debug3

* debug4

* debug5

* debug6

* debug7

* debug8

* debug9

* debug10

* debug11

* debug12

* Cleanup

* Add TODO for known DDP issue
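
A minimal sketch of a patience-based stopper, assuming fitness is a scalar where higher is better (as with a weighted-mAP metric):

```python
class EarlyStopping:
    # Stop training when fitness fails to improve for `patience` consecutive epochs
    def __init__(self, patience=30):
        self.best_fitness = 0.0
        self.best_epoch = 0
        self.patience = patience

    def __call__(self, epoch, fitness):
        if fitness >= self.best_fitness:  # >= tolerates flat early epochs
            self.best_epoch, self.best_fitness = epoch, fitness
        return epoch - self.best_epoch >= self.patience  # True -> stop training
```

Under DDP such a check would run on RANK 0 with the decision broadcast to the other ranks, presumably the known issue flagged in the TODO above.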

* Remove `image_weights` DDP code (ultralytics#4579)

* Initial commit

* Update

* Add `Profile()` profiler (ultralytics#4587)

* Add `Profile()` profiler

* CamelCase Timeout
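
A sketch of a Profile()-style timing helper usable as both a decorator and a context manager (print format illustrative):

```python
import time
from contextlib import ContextDecorator

class Profile(ContextDecorator):
    # Usage: @Profile() decorator or `with Profile():` context manager
    def __enter__(self):
        self.start = time.time()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        print(f'Profile results: {time.time() - self.start:.5f}s')
```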

* Fix bug in `plot_one_box` when label is `None` (ultralytics#4588)

* Create `Annotator()` class (ultralytics#4591)

* Add Annotator() class (sketched after this commit list)

* Download Arial

* 2x for loop

* Cleanup

* tuple 2 list

* max_size=1920

* bold logging results to

* tolist()

* im = annotator.im

* PIL save in detect.py

* Smart asarray in detect.py

* revert to cv2.imwrite

* Cleanup

* Return result asarray

* Add `Profile()` profiler

* CamelCase Timeout

* Resize after mosaic

* pillow>=8.0.0

* daemon imwrite

* Add cv2 support

* Remove plot_wh_methods and plot_one_box

* pil=False for hubconf.py annotations

* im.shape bug fix

* colorstr common.py

* join daemons

* Update t.daemon

* Removed daemon saving
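
A minimal sketch of an Annotator-style wrapper over PIL (class shape, defaults, and the load_default font are illustrative; the commits above also add a cv2 path and an Arial.ttf download):

```python
import numpy as np
from PIL import Image, ImageDraw, ImageFont

class Annotator:
    def __init__(self, im, line_width=3):
        # Accept either a PIL image or a numpy array
        self.im = im if isinstance(im, Image.Image) else Image.fromarray(im)
        self.draw = ImageDraw.Draw(self.im)
        self.lw = line_width
        self.font = ImageFont.load_default()

    def box_label(self, box, label='', color=(255, 0, 0)):
        # box is (x1, y1, x2, y2) in pixel coordinates
        self.draw.rectangle(box, width=self.lw, outline=color)
        if label:
            self.draw.text((box[0], box[1] - 10), label, fill=color, font=self.font)

    def result(self):
        return np.asarray(self.im)  # annotated image as array, as in 'Return result asarray'
```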

* Auto-UTF handling (ultralytics#4594)

* Re-order `plots.py` to class-first (ultralytics#4595)

* Created using Colaboratory

* Update mosaic plots font size (ultralytics#4596)

* TensorBoard `on_train_end()` speed improvements (ultralytics#4605)

* Created using Colaboratory

* Auto-download Arial.ttf on init (ultralytics#4606)

* Auto-download Arial.ttf on init

* Fix ROOT

* Fix: add P2 layer 21 to yolov5-p2.yaml `Detect()` inputs (ultralytics#4608)

Layer 21 carries the feature information for xsmall objects

* Update `check_git_status()` warning (ultralytics#4610)

* W&B: Don't log models in evolve operation (ultralytics#4611)

* Close `matplotlib` plots after opening (ultralytics#4612)

* Close plots

* Replace fig.close() for plt.close()

* DDP `torch.jit.trace()` `--sync-bn` fix (ultralytics#4615)

* Remove assert

* debug0

* trace=not opt.sync

* sync to sync_bn fix

* Cleanup

* Fix for Arial.ttf redownloads with hub inference (ultralytics#4627)

* Fix 2 for Arial.ttf redownloads with hub inference (ultralytics#4628)

* Fix 3 for Arial.ttf redownloads with hub inference (ultralytics#4629)

Fix 3 for Arial.ttf redownloads with hub inference, follow-on to ultralytics#4628.

* Checkpoint code.

* Fix for `plot_evolve()` string argument (ultralytics#4639)

* Fix `is_coco` on missing `data['val']` key (ultralytics#4642)

* Fix workers to 1 for Windows and fix an issue with image_size not being used correctly during training

* Remove mojo files.

* Add mojo_test.py and update gitignore.

* Move entity and project to variables.

* Install dependencies only when needed and make the wheel (whl) search more generic.

* Fix missing parameter in _find_module_wheel_path.

* Remove extra prints.

* Fix weights download bug and pretraining always using yolov5s weights.

* Update code to work with Ultralytics YOLOv5:4 env.

* Add confidence threshold plot

* Minor cleanup of azure_wrapper.

* Fix click/typer incompatibility before 4.0.0

* Restore gitignore and remove wrong error import print in Azure wrapper.

* Fix wrong typer version in requirements.

Co-authored-by: Glenn Jocher <glenn.jocher@ultralytics.com>
Co-authored-by: Ayush Chaurasia <ayush.chaurarsia@gmail.com>
Co-authored-by: Kalen Michael <kalenmike@gmail.com>
Co-authored-by: masood azhar <masoodazhar60@gmail.com>
Co-authored-by: Wei Quan <quan.we@gmail.com>
Co-authored-by: xiaowk5516 <59595896+xiaowk5516@users.noreply.github.com>
Co-authored-by: Mai Thanh Minh <thanhminh.mr@gmail.com>
Co-authored-by: SpongeBab <2078825250@qq.com>
Co-authored-by: ZouJiu1 <34758215+ZouJiu1@users.noreply.github.com>
Co-authored-by: lb-desupervised <86119248+lb-desupervised@users.noreply.github.com>
Co-authored-by: Lewis Belcher <lb@desupervised.io>
Co-authored-by: fcakyon <34196005+fcakyon@users.noreply.github.com>
Co-authored-by: Robin <robin@nanovare.com>
Co-authored-by: Yonghye Kwon <developer.0hye@gmail.com>
Co-authored-by: Piotr Skalski <SkalskiP@users.noreply.github.com>
Co-authored-by: U-LAPTOP-5N89P8V7\banhu <ban.huang@foxmail.com>
Co-authored-by: batrlatom <tomas.batrla@gmail.com>
Co-authored-by: yellowdolphin <42343818+yellowdolphin@users.noreply.github.com>
Co-authored-by: Zigarss <32835472+Zigars@users.noreply.github.com>
Co-authored-by: Feras Oughali <47706157+feras-oughali@users.noreply.github.com>
Co-authored-by: Valentin Aliferov <vaaliferov@gmail.com>
Co-authored-by: san-soucie <44901782+san-soucie@users.noreply.github.com>
Co-authored-by: John San Soucie <jsansoucie@whoi.edu>
Co-authored-by: ketan-b <54092325+ketan-b@users.noreply.github.com>
Co-authored-by: johnohagan <86861886+johnohagan@users.noreply.github.com>
Co-authored-by: jmiranda-laplateforme <67475949+jmiranda-laplateforme@users.noreply.github.com>
Co-authored-by: Eldar Kurtic <eldar.ciki@gmail.com>
Co-authored-by: KEN <33506506+seven320@users.noreply.github.com>
Co-authored-by: imyhxy <imyhxy@gmail.com>
Co-authored-by: IneovaAI <67843470+IneovaAI@users.noreply.github.com>
Co-authored-by: junji hashimoto <junjihashimoto@users.noreply.github.com>
Co-authored-by: fkwong <huangfuqiang@transai.cn>
Co-authored-by: Sudhanshu Singh <sudhanshufromearth@gmail.com>
Co-authored-by: Yuantao Yang <31794133+orangeccc@users.noreply.github.com>
Co-authored-by: yangyuantao <yangyuantao@transai.cn>
Co-authored-by: Ahmad Mustafa Anis <47111429+ahmadmustafaanis@users.noreply.github.com>
Co-authored-by: Omid Sadeghnezhad <58780720+OmidSa75@users.noreply.github.com>
Co-authored-by: Jiacong Fang <zldrobit@126.com>
Co-authored-by: Huu Quan, CAP <huuquan1994@users.noreply.github.com>
Co-authored-by: Huu Quan <huuquan@HuuQuans-MacBook.local>
Co-authored-by: Takumi Karasawa <zaki19930927@gmail.com>
Co-authored-by: Yukun Xia <yukunx@cs.cmu.edu>
Co-authored-by: vincent <vincent@nanovare.com>
BjarneKuehl pushed a commit to fhkiel-mlaip/yolov5 that referenced this pull request Aug 26, 2022
…cs#3680)

* Update DDP for `torch.distributed.run` (a setup sketch follows this commit list)

* Add LOCAL_RANK

* remove opt.local_rank

* backend="gloo|nccl"

* print

* print

* debug

* debug

* os.getenv

* gloo

* gloo

* gloo

* cleanup

* fix getenv

* cleanup

* cleanup destroy

* try nccl

* return opt

* add --local_rank

* add timeout

* add init_method

* gloo

* move destroy

* move destroy

* move print(opt) under if RANK

* destroy only RANK 0

* move destroy inside train()

* restore destroy outside train()

* update print(opt)

* cleanup

* nccl

* gloo with 60 second timeout

* update namespace printing
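
A sketch of the setup these commits converge on, assuming launch via `python -m torch.distributed.run --nproc_per_node N train.py` (the 60-second timeout and the gloo fallback mirror the commit messages; placement within train.py is simplified here):

```python
import os
from datetime import timedelta

import torch
import torch.distributed as dist

# Ranks come from environment variables set by torch.distributed.run
LOCAL_RANK = int(os.getenv('LOCAL_RANK', -1))
RANK = int(os.getenv('RANK', -1))
WORLD_SIZE = int(os.getenv('WORLD_SIZE', 1))

if LOCAL_RANK != -1:  # DDP mode
    torch.cuda.set_device(LOCAL_RANK)
    dist.init_process_group(
        backend='nccl' if dist.is_nccl_available() else 'gloo',
        timeout=timedelta(seconds=60),
    )

# ... training loop ...

if WORLD_SIZE > 1 and RANK == 0:  # destroy only on RANK 0, as in the commits above
    dist.destroy_process_group()
```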