Easy way to save checkpoints for Colab user #640

Closed
TaoXieSZ opened this issue Aug 6, 2020 · 15 comments · Fixed by #660
Labels
enhancement New feature or request Stale


@TaoXieSZ
Contributor

TaoXieSZ commented Aug 6, 2020

🚀 Feature

It would be more convenient for Colab users to save checkpoints in Google Drive rather than in yolov5/runs.

My idea

Just change roughly lines 458 to 464 (in my current version):

if not opt.evolve:
    tb_writer = None
    if opt.local_rank in [-1, 0]:
        print('Start Tensorboard with "tensorboard --logdir=runs", view at http://localhost:6006/')
        # Change the path here
        tb_writer = SummaryWriter(log_dir=increment_dir('/content/drive/My Drive/yolov5-checkpoints/exp', opt.name))

    train(hyp, opt, device, tb_writer)
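The snippet relies on yolov5's `increment_dir` to pick a fresh numbered folder for each run, so repeated runs on Drive do not overwrite each other. The sketch below is a hypothetical stand-in using only the standard library: `next_exp_dir` is not part of yolov5, and its numbering scheme (`exp0`, `exp1`, `exp2_colab`) is an assumption made for illustration.

```python
import os


def next_exp_dir(base, name=""):
    """Hypothetical stand-in for increment_dir(): return '<base><n>' or '<base><n>_<name>'.

    Assumption: n is one more than the highest index already used by a sibling
    directory such as exp0, exp1 or exp3_colab; the very first run gets index 0.
    """
    parent, stem = os.path.split(base)
    parent = parent or "."
    # Create the parent folder if needed, e.g. on the mounted Google Drive
    os.makedirs(parent, exist_ok=True)
    used = []
    for entry in os.listdir(parent):
        if os.path.isdir(os.path.join(parent, entry)) and entry.startswith(stem):
            index = entry[len(stem):].split("_", 1)[0]  # 'exp3_colab' -> '3'
            if index.isdigit():
                used.append(int(index))
    n = max(used) + 1 if used else 0
    suffix = f"_{name}" if name else ""
    return os.path.join(parent, f"{stem}{n}{suffix}")
```

With such a helper, the writer in the snippet could be created as `SummaryWriter(log_dir=next_exp_dir('/content/drive/My Drive/yolov5-checkpoints/exp', opt.name))`; the helper only returns the path and leaves creating the run folder itself to `SummaryWriter`.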
TaoXieSZ added the enhancement (New feature or request) label on Aug 6, 2020
@glenn-jocher
Member

@ChristopherSTAN can you point TensorBoard to a Google Drive folder like you have? That would be really cool: all of your work would be saved, and you could keep track of experiments this way.

@glenn-jocher
Member

glenn-jocher commented Aug 7, 2020

This is a really good pro tip for Colab users! Maybe we should add a --log-dir argument to train.py to enable this?
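A `--logdir` option could be added with `argparse` along these lines. This is a minimal sketch only: the `parse_opt` helper, the `runs/` default and the help strings are assumptions for illustration, not the code from the eventual PR.

```python
import argparse


def parse_opt(argv=None):
    """Hypothetical parser showing only the logging-related options."""
    parser = argparse.ArgumentParser(description="train.py options (sketch)")
    # Assumed option: root folder for TensorBoard logs and weights.
    # On Colab it could point at the mounted Google Drive so runs survive session resets.
    parser.add_argument("--logdir", type=str, default="runs/", help="logging directory")
    parser.add_argument("--name", default="", help="optional suffix for the experiment folder")
    return parser.parse_args(argv)
```

For example, `parse_opt(['--logdir', '/content/drive/My Drive/yolov5-checkpoints'])` would make every run land in that Drive folder, while omitting the flag keeps the local `runs/` default.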

@TaoXieSZ
Contributor Author

TaoXieSZ commented Aug 7, 2020

@glenn-jocher You reminded me about that.

However, I haven't looked at TensorBoard in a long time. I just tried it, and it can only point to data inside the yolov5 folder.

As for the argument, that's up to you, LOL. I think it would be more convenient for Colab users if you did. mmdetection has a --work-dir argument; that sparked this idea, and I found it lets me save checkpoints to the larger Google Drive.

BTW, in my experience, using TensorBoard often slows down notebooks and causes disconnections (maybe Google is trying to prevent over-usage), so I skip it.

@glenn-jocher
Member

glenn-jocher commented Aug 7, 2020

It does work! Wow, so this is a backdoor to permanence with Colab. You can actually log all of your experiments straight to drive, and then pick up where you left off the next day without having to move any files. This is a real game changer for colab dev work. I'll add a PR for the argparser --logdir argument.


@TaoXieSZ
Contributor Author

TaoXieSZ commented Aug 7, 2020

@glenn-jocher It is really amazing!

@glenn-jocher
Member

All done. Thanks for the great idea @ChristopherSTAN!

@TaoXieSZ
Contributor Author

TaoXieSZ commented Aug 7, 2020

@glenn-jocher It's just feedback from a heavy user. Looking forward to an even better YOLOv5 in the future.

BTW, I noticed the default bbox loss is now CIoU, so maybe you should update the logging entry to avoid confusion.

@glenn-jocher
Member

@ChristopherSTAN yes, you are correct, it's now CIoU. I need to update the comment to a criterion-agnostic term like 'box' or 'regression'.

@glenn-jocher
Member

TODO: Update GIoU labels to criterion-agnostic terms.

@github-actions
Contributor

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@maheeeetaaa

maheeeetaaa commented Jan 24, 2021

@glenn-jocher Hello, I have been trying to train YOLOv5 v4, and it seems the training arguments have changed. Before, I used --logdir, and when training stopped (because I work on Colab) I would rerun it and it would pick up from where it left off. Now it doesn't: even when I set the new weights, training starts as if there had been no previous training. The epoch number doesn't reset, but all the mAP graphs show training starting from the beginning. What should I do?

Here are my arguments:

!python train.py --img 320 --batch 128 --epochs 200 --data /content/YoloV5Data/data.yaml --cfg ./models/yolov5s.yaml --weights /content/drive/Yolov5S_320/exp5/weights/last.pt --project /content/drive/Yolov5S_320/

@glenn-jocher
Member

glenn-jocher commented Jan 24, 2021

@maheeeetaaa yes, the local directory logging structure was unified in #1377. Training results are saved to runs/train/exp.

You may resume an interrupted training run very simply:

python train.py --resume  # automatically select most recent run
python train.py --resume path/to/last.pt  # manually specify run to resume
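Automatically selecting the most recent run can be pictured as a recursive search for the newest `last*.pt` checkpoint. A minimal sketch, assuming "newest" means the most recent modification time; `find_latest_checkpoint` is a hypothetical name, and yolov5's own lookup may differ in its details. Searching a Drive folder instead of `runs/` is what would let a fresh Colab session find a run saved on Drive.

```python
import glob
import os


def find_latest_checkpoint(search_dir="."):
    """Hypothetical helper: newest last*.pt anywhere under search_dir, or '' if none.

    Assumption: 'newest' is judged by modification time (os.path.getmtime).
    """
    pattern = os.path.join(search_dir, "**", "last*.pt")
    candidates = glob.glob(pattern, recursive=True)
    return max(candidates, key=os.path.getmtime) if candidates else ""
```

The returned path is what one would pass explicitly as `python train.py --resume <path>`; an empty string means there is nothing to resume.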

@Leprechault
@TaoXieSZ I'm new to this subject, and I don't understand whether your proposed change to lines 458 to 464 goes in the model yaml file or in another file. Could you please help me?

@glenn-jocher
Member

@Leprechault runs can be logged anywhere now, so @TaoXieSZ's comment is no longer applicable. To log a run to any directory, use the --project argument along with the --name argument:
python train.py --project runs/train --name exp
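With `--project` and `--name`, results go to `<project>/<name>`, so pointing `--project` at a mounted Google Drive folder keeps them across Colab sessions. The sketch below shows one way such a path could be resolved, assuming a taken folder gets a numeric suffix (exp, exp2, exp3 and so on). `resolve_save_dir`, its `exist_ok` switch and the `parse_opt` defaults are hypothetical, and yolov5's exact behavior may differ.

```python
import argparse
from pathlib import Path


def resolve_save_dir(project, name, exist_ok=False):
    """Hypothetical helper: <project>/<name>, or <project>/<name>2, <name>3, ... if taken."""
    path = Path(project) / name
    if exist_ok or not path.exists():
        return path
    n = 2
    # Assumption: the first free numeric suffix starting at 2 is used
    while (Path(project) / f"{name}{n}").exists():
        n += 1
    return Path(project) / f"{name}{n}"


def parse_opt(argv=None):
    """Hypothetical parser with just the two options discussed above."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--project", default="runs/train", help="save to project/name")
    parser.add_argument("--name", default="exp", help="save to project/name")
    return parser.parse_args(argv)
```

For instance, `opt = parse_opt(['--project', '/content/drive/My Drive/yolov5', '--name', 'exp'])` followed by `resolve_save_dir(opt.project, opt.name)` yields a Drive folder that a later session can find again.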

@Leprechault

Thanks very much @glenn-jocher !!!!

4 participants