wandb: Network error (ReadTimeout), entering retry loop. See wandb\debug-internal.log for full traceback. #2840

Closed
Zigars opened this issue Apr 19, 2021 · 14 comments · Fixed by #2882
Labels
bug Something isn't working

Comments

@Zigars
Contributor

Zigars commented Apr 19, 2021

🐛 Bug

I'm trying to use your repo to train a YOLOv4 network, because the YOLOv4 code (https://github.com/WongKinYiu/PyTorch_YOLOv4) is outdated, no longer maintained, and has many bugs.
When I train with my own yolov4-tiny.yaml, I hit this error. I think it happens because my network cannot connect to wandb's server. Before today I could train normally, but starting a few minutes ago I have run python train.py many times and training never starts.

To Reproduce (REQUIRED)

python train.py

Output:

YOLOv5  2021-4-15 torch 1.7.1 CUDA:0 (GRID V100D-32Q, 32638.0MB)

Namespace(adam=False, artifact_alias='latest', batch_size=64, bbox_interval=-1, bucket='', cache_images=False, cfg='models/yolov4-tiny.yaml', data='datai/Visdrone.yaml', device='', entity=None, epochs=300, evolve=False, exist_ok=False, global_rank=-1, hyp='data/hyp.scratch.yaml', image_weights=False, img_size=[640, 640], label_smoothing=0.0, linear_lr=False, local_rank=-1, multi_scale=False, name='exp', noautoanchor=False, nosave=False, notest=False, project='runs/train', quad=False, rect=False, resume=False, save_dir='runs\\train\\exp8', save_period=-1, single_cls=False, sync_bn=False, total_batch_size=64, upload_dataset=False, weights='', workers=8, world_size=1)
tensorboard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/
hyperparameters: lr0=0.01, lrf=0.2, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0
wandb: Currently logged in as: zigar (use `wandb login --relogin` to force relogin)
wandb: Network error (ReadTimeout), entering retry loop. See wandb\debug-internal.log for full traceback.

Environment

  • OS: Windows 10
  • GPU: GRID V100D-32Q, 32638.0MB

@Zigars Zigars added the bug Something isn't working label Apr 19, 2021
@github-actions
Contributor

github-actions bot commented Apr 19, 2021

👋 Hello @Zigars, thank you for your interest in 🚀 YOLOv5! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.

If this is a 🐛 Bug Report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we can not help you.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset images, training logs, screenshots, and a public link to online W&B logging if available.

For business inquiries or professional support requests please visit https://www.ultralytics.com or email Glenn Jocher at glenn.jocher@ultralytics.com.

Requirements

Python 3.8 or later with all requirements.txt dependencies installed, including torch>=1.7. To install run:

$ pip install -r requirements.txt

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

CI CPU testing

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training (train.py), testing (test.py), inference (detect.py) and export (export.py) on MacOS, Windows, and Ubuntu every 24 hours and on every commit.

@Zigars
Contributor Author

Zigars commented Apr 19, 2021

Also, when I switch to yolov5s.yaml, I still cannot train normally. Is there some way to turn off wandb so that I can train normally? I have logged in to wandb before, and my network can't open wandb.ai either.

@Zigars
Contributor Author

Zigars commented Apr 19, 2021

I uninstalled wandb and that solved this bug...

@glenn-jocher
Member

@Zigars hi sorry to hear about your logging issues! Sometimes network interruptions can prevent logging to wandb, though this should not cause an error. @AyushExel, our W&B contact may have some more info.

I saw you have a VisDrone.yaml. I've seen that this dataset is pretty popular, so please consider submitting a Pull Request to add your VisDrone.yaml and, if possible, a get_visdrone.sh file to help future users auto-download this dataset. Thank you!

@Zigars
Contributor Author

Zigars commented Apr 19, 2021

@glenn-jocher I'm so happy to get your reply! I enjoy using your yolov5 code for object detection training, it's a great repo! Recently I have been doing some research that uses YOLO for detection on the VisDrone dataset. I'm sorry, but I'm not familiar with git or scripting, so a PR or a get_visdrone.sh is difficult for me. If you want the VisDrone.yaml and the ready-made VisDrone dataset (I downloaded it from VisDrone and converted it to COCO format), I can send them to your email.

@glenn-jocher
Member

@Zigars hey great! I think you can attach files directly to these messages, so maybe you can just attach your visdrone yaml and the code you used to download and convert to YOLO format and I could do the PR.

@Zigars
Contributor Author

Zigars commented Apr 19, 2021

Hi @glenn-jocher, I spent some time rewriting my conversion code, because the original code was a little ugly. :(

I'm attaching visdrone.yaml, the script trans_yolo.py, and a VisDrone-test.zip dataset archive.

visdrone.yaml includes the data paths, nc=10, and the class names.

You can convert VisDrone to YOLO format using trans_yolo.py.

Because the original dataset is too large, you can download VisDrone-DET from GitHub and put the annotations and images in a single VisDrone-DET directory, laid out like VisDrone-test.zip.

VisDrone2019-DET dataset

VisDrone-test.zip is for testing the conversion code. It includes test-dev, train, and val data, with 10 images and annotations for each of the 3 splits. You can delete every file except the annotations and images, then run python trans_yolo.py to test the conversion. Remember to fix the paths: I provide both a relative-path and an absolute-path option, and I tested both successfully with your train.py.
VisDrone.zip

@AyushExel
Contributor

@Zigars Thanks for filing this issue. As @glenn-jocher said, network interruptions can prevent wandb from logging data to the dashboard, but they should not cause errors. Can you please confirm which version of the wandb client you're using (run pip list and check which version of wandb is installed)?
If you're facing network issues and can't log with wandb at the moment, there's another recommended way to handle this case:

  • run wandb offline to enable offline mode. This will track your experiments but will not try to upload anything to the cloud
  • run wandb sync when you're back online to sync everything to your wandb dashboard.
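
For anyone who launches training from a script, the offline step above can be sketched in Python. This is only an illustration under assumptions: it relies on the wandb client reading the `WANDB_MODE` environment variable with the value `offline`, and `offline_env` is a hypothetical helper name, not part of wandb or YOLOv5.

```python
import os


def offline_env(base_env=None):
    """Return a copy of the environment with W&B set to offline mode.

    Assumption: the installed wandb client honours WANDB_MODE=offline,
    which keeps run data on disk instead of contacting the server.
    The input mapping is copied, so the caller's environment is untouched.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["WANDB_MODE"] = "offline"
    return env


# Hypothetical usage (not executed here):
#   import subprocess, sys
#   subprocess.run([sys.executable, "train.py"], env=offline_env(), check=True)
# Later, once the network is back, `wandb sync` uploads the stored runs.
```

Passing a copied environment to the child process keeps the change local to that one training run, so other shells and scripts keep logging online.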

@Zigars
Contributor Author

Zigars commented Apr 19, 2021

@AyushExel Sorry, I solved this bug by uninstalling wandb. I think I had updated to the latest version of wandb. The terminal would hang, and train.py still did not work at that time. I'm attaching a debug-internal.log so that you can fix this bug. Thank you for your reply!
It's 23:24 now in China, so I will try your recommended way tomorrow. Good night!
debug-internal.log

@glenn-jocher
Member

glenn-jocher commented Apr 21, 2021

@Zigars awesome thanks! I'll see if I can convert this into a PR so future users can autodownload VisDrone more easily.

TODO: VisDrone autodownload PR

@glenn-jocher glenn-jocher added TODO and removed TODO labels Apr 21, 2021
@glenn-jocher glenn-jocher linked a pull request Apr 21, 2021 that will close this issue
11 tasks
@glenn-jocher
Member

@Zigars I've used your example to create a working visdrone.yaml with autodownload capability in PR #2882. Please take a look there and let me know what you think. One thing I don't understand is this line. I'm guessing this is an ignore region?

if row[4] == '0':  # TODO explain this line
    continue

@glenn-jocher
Member

@Zigars actually, even better, could you update this line in the PR with a better explanation for this? Then you will also show up as an official PR author for the repo, giving you credit for your work!

@Zigars
Contributor Author

Zigars commented Apr 22, 2021

@glenn-jocher hi! I'm SOOOO happy to contribute a PR to yolov5! Thanks so much! And I can answer your question: this line is there because the original VisDrone-DET has 12 classes. It includes two extra classes, 'ignored regions' and 'others'. Checking row[4] == '0' in the original annotations removes these two classes, so that we get the 10 useful classes to train our network. Also, this dataset is particularly difficult to train on: with yolov5s at 300 epochs I can only get 17.0 mAP, maybe because most of the targets are small. Recently I have been running an experiment to get faster and higher-mAP small object detection.
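
To make the filter described above concrete, here is a minimal sketch of converting one annotation line. It is not the trans_yolo.py attached earlier or the code in the PR. It assumes the published VisDrone-DET field order (left, top, width, height, score, category, truncation, occlusion) and that categories 1 to 10 map to YOLO classes 0 to 9. The function name `visdrone_line_to_yolo` is hypothetical.

```python
def visdrone_line_to_yolo(line, img_w, img_h):
    """Convert one VisDrone-DET annotation line to a YOLO label, or None.

    Assumed field order: left, top, width, height, score, category,
    truncation, occlusion (comma separated, pixel units).
    """
    row = line.strip().split(",")
    if row[4] == "0":  # the filter from the thread: drops the two unused classes
        return None
    cls = int(row[5]) - 1  # assumption: categories 1..10 become classes 0..9
    left, top, w, h = map(int, row[:4])
    # YOLO format: class, then box centre and size normalised by image size
    return (cls, (left + w / 2) / img_w, (top + h / 2) / img_h, w / img_w, h / img_h)
```

Returning None for skipped rows lets the caller filter them out before writing each label file.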

@glenn-jocher
Member

glenn-jocher commented Apr 22, 2021

@Zigars ah I understand now! Yes the dataset is difficult. With this sort of data (very small objects) you should really train at higher resolution with a P6 model, i.e.:

python train.py --data visdrone.yaml --weights yolov5m6.pt --batch-size 32 --img 1280

EDIT: actually maybe the P6 model doesn't matter, as it's targeted for larger objects, but definitely a higher resolution like 1280 or 1920 would help this dataset.
