
Error when saving period and parameters #6452

Closed
1 task done
Juanjojr9 opened this issue Jan 27, 2022 · 26 comments · Fixed by #6512 or #6611
Labels
question (Further information is requested)

Comments

@Juanjojr9

Search before asking

Question

Hi, I have searched but have not found a similar question.

I was training a long model and, just in case there was a problem, I set the --save-period parameter, as you can see in the following picture:

image
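
(The screenshot is not reproduced here. A training command of roughly this shape, with a hypothetical dataset path and illustrative values, shows where the flag goes:)

    # Hypothetical reconstruction of the setup: log a checkpoint every 5 epochs via --save-period.
    !python train.py --img 1280 --batch 1 --epochs 100 --data data.yaml --weights yolov5s.pt --save-period 5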

I then had a problem and wanted to resume the interrupted run, since my Google Colab session expired and I lost all my session data, but it gave me another problem.

image

When I compared it with this tutorial I found: https://colab.research.google.com/github/wandb/examples/blob/master/colabs/yolo/Train_and_Debug_YOLOv5_Models_with_Weights_%26_Biases.ipynb#scrollTo=jwcBfF5OvAHk

In that training it saves a checkpoint every epoch, while mine should save every 5 epochs, but it doesn't save anything:

image

image

It's a real bummer, as the model had been training for several hours.

Does anyone know how to fix this, or how I can save the model and the parameters in case I get an error again? Any help is appreciated.

I also wanted to ask whether it is possible to download the data locally into a folder instead of using the Weights & Biases website.

Thank you very much for your help.

JJ

Additional

No response

@Juanjojr9 added the question label Jan 27, 2022
@AyushExel
Contributor

@Juanjojr9 Hey, the command seems correct. Can you check the run that you want to resume and see if the artifact is present there? You can even share the run link here and I'll look into it.

@Juanjojr9
Author

Hello! First of all, thank you very much for helping me.

If you can tell me how to verify it, I will try to do so; or if you tell me which link you are referring to, I will share it without any problem.

image

@AyushExel
Contributor

@Juanjojr9 in your command you have --save-period set to 10, which means the model will only be saved every 10 epochs. If you click on Artifacts or look at the charts, you'll notice that the experiment didn't run for 10 epochs, which is why the model is not logged. I suggest reducing the save period.

@Juanjojr9
Author

Hi @AyushExel, thank you for your answer.

I think I got the wrong example earlier. Anyway, I did a quick test and now it does save, as you can see in the image below:

image

I cancelled the run to simulate a problem. At first I thought it was going to work, but then I ran into a problem.

image

Analysing it, I realised that it changes the parameters I had set before:

image

My question is: when I run !python train.py --resume wandb-artifact://{crashed_run_path}, do I also have to set the above parameters?

Is there another way to save the data locally so the information isn't lost in case the Google Colab session expires?

Thank you

@AyushExel
Contributor

@Juanjojr9 your syntax is correct. The problem you're seeing in the picture above is that you're running out of CUDA memory; read the last line of the error message.
The same syntax should work on Colab or on your own system if there is sufficient CUDA memory.
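
(For concreteness, the resume command under discussion takes a W&B run path of the form entity/project/run_id; the path below is a made-up example, not taken from this thread:)

    # Resume a crashed YOLOv5 run from the checkpoint artifact logged to W&B.
    # "myteam/yolo-project/3abc12de" is a placeholder run path; substitute your own.
    !python train.py --resume wandb-artifact://myteam/yolo-project/3abc12de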

@Juanjojr9
Author

@AyushExel I have not explained myself well

The reason it is running out of CUDA memory is that the parameters are being changed.
I have the batch size set to 1, but when I run --resume wandb-artifact://{crashed_run_path}, it changes the batch size and sets it to 16.
You can check it by looking at the pictures; otherwise, I would not have been able to run it the first time.
And besides the batch size, other parameters such as the epochs, imgsz, etc. change as well.

My question is whether I should run the command by setting the parameters as well, i.e. as follows:
image

If not, I don't understand why, when trying to resume the run, it changes the parameters I set when I trained the network.

image

@AyushExel
Contributor

@Juanjojr9 okay, understood. This should not happen. I'll fix it this coming week. Thanks for reporting.

@Juanjojr9
Author

I don't know if the problem is on my end. I will keep running different tests.

Thank you for creating this incredible and majestic tool.
I am delighted to be learning about this topic.

Yours faithfully,
Juanjo

@AyushExel
Contributor

@Juanjojr9 I've pushed a PR with the fix for this

@Juanjojr9
Author

@AyushExel Thank you very much for the help and the quick solution. I look forward to using it when it is ready

Best regards.

JJ

@glenn-jocher
Member

@Juanjojr9 good news 😃! Your original issue may now be fixed ✅ in PR #6512 by @AyushExel. To receive this update:

  • Git – git pull from within your yolov5/ directory or git clone https://github.com/ultralytics/yolov5 again
  • PyTorch Hub – force-reload with model = torch.hub.load('ultralytics/yolov5', 'yolov5s', force_reload=True)
  • Notebooks – view the updated notebooks (Open In Colab, Open In Kaggle)
  • Docker – sudo docker pull ultralytics/yolov5:latest to update your image

Thank you for spotting this issue and informing us of the problem. Please let us know if this update resolves the issue for you, and feel free to inform us of any other issues you discover or feature requests that come to mind. Happy trainings with YOLOv5 🚀!

@Juanjojr9 reopened this Feb 3, 2022
@AyushExel
Contributor

@Juanjojr9 Your problem should be fixed in the latest release; the resume command will now remember the batch size. Please verify. Thanks!

@Juanjojr9
Author

@AyushExel Hi !

I have been running different tests and I don't think it's working properly yet, though I could be wrong.
Below is the evidence:

These are my parameters:

image

image

I interrupted the training and then resumed it.

image

At first I thought it worked well, as it starts at the right epoch. But there are some things that don't seem logical to me:

image

  • Marked in pink: the parameters displayed are not correct.
  • Marked in red: the size of the images is not correct.
  • Marked in blue: The gpu_mem and the run time are totally different.
  • Marked in green: a new directory is created. If you look at the opt.yaml files, they don't match at all.

image

image

Therefore, the parameters still do not match.

I don't know if I'm wrong.

@AyushExel
Contributor

@Juanjojr9 You might be right about the image size, as we're not restoring that. But can you please check the batch size from your wandb run config? It should be in the Overview tab of your wandb run. The batch size should match before and after resume.
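
(If it helps, the same config can also be read programmatically with the W&B public API. This is only a sketch: the run path is a placeholder, and the key names assume the YOLOv5 logger stores the parsed opt values, so adjust them to whatever the Overview tab actually shows:)

    import wandb

    api = wandb.Api()
    # Placeholder run path: replace with your own entity/project/run_id.
    run = api.run("myteam/yolo-project/3abc12de")
    # run.config holds the options logged for that run.
    print(run.config.get("batch_size"), run.config.get("imgsz"), run.config.get("epochs"))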

@Juanjojr9
Author

@AyushExel I'm trying to check the batch size after restarting, but I don't know where to see it.
Can you tell me the steps to see it?

image

@AyushExel
Contributor

@Juanjojr9 You're on the right screen. Just scroll down a bit further

image

@Juanjojr9
Author

@AyushExel But how do I know whether that is the configuration from before cancelling or from after resuming?

You can see the following:

image

But you can see, among other things, that the --imgsz parameter is set to 1280 and --resume is false. Therefore, I believe these are the configuration parameters from before interrupting the run.

They are not the parameters after resuming.

image

image

What is your opinion?

@AyushExel
Contributor

I think the batch size is being restored but not the image size. You can still pass those params with the resume flag.
Resume isn't perfect; as you pointed out, we can probably do a better job of remembering the other hyperparameters.
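
(Following that suggestion, a resume call that re-specifies the settings explicitly might look like the line below. The run path is a placeholder, and this assumes extra flags are honored alongside --resume, as suggested above, rather than being a confirmed behaviour:)

    # Placeholder run path; the --img/--batch values are just the ones discussed in this thread.
    !python train.py --resume wandb-artifact://myteam/yolo-project/3abc12de --img 1280 --batch 1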

@Juanjojr9
Author

@AyushExel If you were right, then in the images of the parameter settings, besides the batch size being set to 1, the imgsz parameter should be 640 and not 1280, since the image size is not supposed to be saved. It would be the same with the other parameters.

From my humble point of view, I don't think it saves the parameters properly. As you can see in the screenshots above, when it resumes it creates a new opt.yaml file, and the parameters set there have nothing to do with the ones set before.

And even if the batch size were saved correctly, it would not be of much use, as the other parameters have changed and the original run has not really been resumed. Therefore the results would be erroneous.

@AyushExel
Contributor

@Juanjojr9 I think this requires a deeper look. If you look at the PR linked above, you'll see that the hyp dict, batch size and epochs are restored from the run. So if that's not showing up in the experiment, it's probably being overwritten somewhere. I'll take a deeper look, because the cause of the problem seems to be located somewhere else.

@Juanjojr9
Author

@AyushExel Ok, thank you very much for your help. I will be waiting for your answer, as it is a very interesting tool.

I also wanted to tell you that the wandb.login() command has been giving me problems in Google Colab for several days. I don't know if it's my fault. I will keep looking for information.

Thank you again

@AyushExel
Contributor

@Juanjojr9 thanks for reporting. I'll work on verifying the resume issue again this week.
What is the issue with wandb.login()?

@Juanjojr9
Author

@AyushExel When I run the command, it is as if it cannot start the session: it hangs for too long, and this causes the Google Colab session to crash and restart.
Even so, I'll keep testing in case I'm wrong.

@AyushExel
Contributor

@Juanjojr9 Okay, I investigated the resume further. The batch size was being remembered but the image size wasn't. I've made a PR to remember that. Currently, here are the params that are remembered when resuming:
opt.weights, opt.save_period, opt.batch_size, opt.bbox_interval, opt.epochs, opt.hyp, opt.imgsz

If there are any more things that need to be remembered, please let me know.
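
(As a rough illustration of what "remembering" these params means, not the actual YOLOv5 implementation, the resume path can copy the logged run config back onto the parsed options before training starts. Names and config keys below are simplified assumptions:)

    import argparse
    import wandb

    # Options that the resume logic is said to restore from the crashed run.
    RESTORED_KEYS = ["weights", "save_period", "batch_size",
                     "bbox_interval", "epochs", "hyp", "imgsz"]

    def restore_opt_from_wandb(opt: argparse.Namespace, run_path: str) -> argparse.Namespace:
        """Overwrite selected fields of opt with values logged to the given W&B run."""
        run = wandb.Api().run(run_path)  # run_path: "entity/project/run_id" (placeholder format)
        for key in RESTORED_KEYS:
            if key in run.config:
                setattr(opt, key, run.config[key])
        return opt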

@glenn-jocher linked a pull request Feb 12, 2022 that will close this issue
@glenn-jocher
Member

@Juanjojr9 good news 😃! Your original issue may now be fixed ✅ in PR #6611 by @AyushExel. To receive this update:

  • Git – git pull from within your yolov5/ directory or git clone https://github.com/ultralytics/yolov5 again
  • PyTorch Hub – force-reload with model = torch.hub.load('ultralytics/yolov5', 'yolov5s', force_reload=True)
  • Notebooks – view the updated notebooks (Open In Colab, Open In Kaggle)
  • Docker – sudo docker pull ultralytics/yolov5:latest to update your image

Thank you for spotting this issue and informing us of the problem. Please let us know if this update resolves the issue for you, and feel free to inform us of any other issues you discover or feature requests that come to mind. Happy trainings with YOLOv5 🚀!

@Juanjojr9
Author

@AyushExel Thank you very much for the help. I'll try it out and see how it works.

Best regards.

JJ
