
Error when saving period and parameters #6452

Closed
1 task done
Juanjojr9 opened this issue Jan 27, 2022 · 26 comments · Fixed by #6512 or #6611
Labels
question (Further information is requested)

Comments

@Juanjojr9

Search before asking

Question

Hi, I have searched but have not found a similar question.

I was training a long model and, just in case there was a problem, I set the --save-period parameter, as you can see in the following picture:

image
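
(The screenshot is not reproduced here. A training command of roughly this shape, with a hypothetical dataset path and illustrative values, shows where the flag goes:)

    # Hypothetical reconstruction of the setup: log a checkpoint every 5 epochs via --save-period.
    !python train.py --img 1280 --batch 1 --epochs 100 --data data.yaml --weights yolov5s.pt --save-period 5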

I then had a problem and wanted to resume the interrupted run, since my Google Colab session expired and I lost all my session data, but it gave me another problem.

image

When I compared it with this tutorial I found: https://colab.research.google.com/github/wandb/examples/blob/master/colabs/yolo/Train_and_Debug_YOLOv5_Models_with_Weights_%26_Biases.ipynb#scrollTo=jwcBfF5OvAHk

In that training it saves a checkpoint every epoch, while mine should save every 5 epochs, but it doesn't save anything:

image

image

It's a real bummer, as the model had been training for several hours.

Does anyone know how to fix this, or how I can save the model and the parameters in case I get an error again? Any help is appreciated.

I also wanted to ask whether it is possible to download the data locally into a folder instead of using the Weights & Biases website.

Thank you very much for your help.

JJ

Additional

No response

@Juanjojr9 added the question label Jan 27, 2022
@AyushExel
Contributor

@Juanjojr9 Hey, the command seems correct. Can you check the run that you want to resume and see if the artifact is present there? You can even share the run link here and I'll look into it.

@Juanjojr9
Author

Hello! First of all, thank you very much for helping me.

If you can tell me how to verify it, I will try to do so; or if you tell me which link you are referring to, I will share it without any problem.

image

@AyushExel
Contributor

@Juanjojr9 in your command you have --save-period set to 10, which means the model will only be saved every 10 epochs. If you click on Artifacts or look at the charts, you'll notice that the experiment didn't run for 10 epochs, which is why the model is not logged. I suggest reducing the save period.

@Juanjojr9
Author

Hi @AyushExel, thank you for your answer.

I think I got the wrong example earlier. Anyway, I did a quick test and now it does save, as you can see in the image below:

image

I cancelled the run to simulate a problem. At first I thought it was going to work, but then I ran into a problem.

image

Analysing it, I realised that it changes the parameters I had set before:

image

My question is: when I run !python train.py --resume wandb-artifact://{crashed_run_path}, do I also have to set the above parameters?

Is there another way to save the data locally so the information isn't lost in case the Google Colab session expires?

Thank you

@AyushExel
Contributor

@Juanjojr9 your syntax is correct. The problem you're seeing in the picture above is that you're running out of CUDA memory; read the last line of the error message.
The same syntax should work on Colab or on your own system if there is sufficient CUDA memory.
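
(For concreteness, the resume command under discussion takes a W&B run path of the form entity/project/run_id; the path below is a made-up example, not taken from this thread:)

    # Resume a crashed YOLOv5 run from the checkpoint artifact logged to W&B.
    # "myteam/yolo-project/3abc12de" is a placeholder run path; substitute your own.
    !python train.py --resume wandb-artifact://myteam/yolo-project/3abc12de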

@Juanjojr9
Author

@AyushExel I have not explained myself well

The reason it is running out of CUDA memory is that the parameters are being changed.
I have the batch size set to 1, but when I run --resume wandb-artifact://{crashed_run_path}, it changes the batch size and sets it to 16.
You can check it by looking at the pictures; otherwise, I would not have been able to run it the first time.
And besides the batch size, other parameters such as the epochs, imgsz, etc. change as well.

My question is whether I should run the command by setting the parameters as well, i.e. as follows:
image

If not, I don't understand why, when trying to resume the run, it changes the parameters I set when I trained the network.

image

@AyushExel
Contributor

@Juanjojr9 okay, understood. This should not happen. I'll fix it this coming week. Thanks for reporting.

@Juanjojr9
Author

I don't know if the problem is on my end. I will keep running different tests.

Thank you for creating this incredible and majestic tool.
I am delighted to be learning about this topic.

Yours faithfully,
Juanjo

@AyushExel
Contributor

@Juanjojr9 I've pushed a PR with the fix for this

@Juanjojr9
Author

@AyushExel Thank you very much for the help and the quick solution. I look forward to using it when it is ready

Best regards.

JJ

@glenn-jocher
Member

@Juanjojr9 good news 😃! Your original issue may now be fixed ✅ in PR #6512 by @AyushExel. To receive this update:

  • Git – git pull from within your yolov5/ directory or git clone https://github.com/ultralytics/yolov5 again
  • PyTorch Hub – force-reload with model = torch.hub.load('ultralytics/yolov5', 'yolov5s', force_reload=True)
  • Notebooks – view the updated notebooks (Open In Colab, Open In Kaggle)
  • Docker – sudo docker pull ultralytics/yolov5:latest to update your image

Thank you for spotting this issue and informing us of the problem. Please let us know if this update resolves the issue for you, and feel free to inform us of any other issues you discover or feature requests that come to mind. Happy trainings with YOLOv5 🚀!

@Juanjojr9 reopened this Feb 3, 2022
@AyushExel
Contributor

@Juanjojr9 Your problem should be fixed in the latest release; the resume command will now remember the batch size. Please verify. Thanks!

@Juanjojr9
Author

@AyushExel Hi !

I have been running different tests and I don't think it's working properly yet, though I could be wrong.
Below is the evidence:

These are my parameters:

image

image

I interrupted the training and then resumed it.

image

At first I thought it worked well, as it starts at the right epoch. But there are some things that don't seem logical to me:

image

  • Marked in pink: the parameters displayed are not correct.
  • Marked in red: the size of the images is not correct.
  • Marked in blue: The gpu_mem and the run time are totally different.
  • Marked in green: a new directory is created. If you look at the opt.yaml files, they don't match at all.

image

image

Therefore, the parameters still do not match.

I don't know if I'm wrong.

@AyushExel
Contributor

@Juanjojr9 You might be right about the image size, as we're not restoring that. But can you please check the batch size from your wandb run config? It should be in the Overview tab of your wandb run. The batch size should match before and after resume.
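
(If it helps, the same config can also be read programmatically with the W&B public API. This is only a sketch: the run path is a placeholder, and the key names assume the YOLOv5 logger stores the parsed opt values, so adjust them to whatever the Overview tab actually shows:)

    import wandb

    api = wandb.Api()
    # Placeholder run path: replace with your own entity/project/run_id.
    run = api.run("myteam/yolo-project/3abc12de")
    # run.config holds the options logged for that run.
    print(run.config.get("batch_size"), run.config.get("imgsz"), run.config.get("epochs"))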

@Juanjojr9
Author

@AyushExel I'm trying to check the batch size after restarting, but I don't know where to see it.
Can you tell me the steps to see it?

image

@AyushExel
Contributor

@Juanjojr9 You're on the right screen. Just scroll down a bit further

image

@Juanjojr9
Author

@AyushExel But how do I know whether that is the configuration from before cancelling or from after resuming?

You can see the following:

image

But you can see, among other things, that the --imgsz parameter is set to 1280 and --resume is false. Therefore, I believe these are the configuration parameters from before interrupting the run.

They are not the parameters after resuming.

image

image

What is your opinion?

@AyushExel
Contributor

I think the batch size is being restored but not the image size. You can still pass those params with the resume flag.
Resume isn't perfect; as you pointed out, we can probably do a better job of remembering the other hyperparameters.
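
(Following that suggestion, a resume call that re-specifies the settings explicitly might look like the line below. The run path is a placeholder, and this assumes extra flags are honored alongside --resume, as suggested above, rather than being a confirmed behaviour:)

    # Placeholder run path; the --img/--batch values are just the ones discussed in this thread.
    !python train.py --resume wandb-artifact://myteam/yolo-project/3abc12de --img 1280 --batch 1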

@Juanjojr9
Author

@AyushExel If you were right, then in the images of the parameter settings, besides the batch size being set to 1, the imgsz parameter should be 640 and not 1280, since the image size is not supposed to be saved. It would be the same with the other parameters.

From my humble point of view, I don't think it saves the parameters properly. As you can see in the screenshots above, when it resumes it creates a new opt.yaml file, and the parameters set there have nothing to do with the ones set before.

And even if the batch size were saved correctly, it would not be of much use, as the other parameters have changed and the original run has not really been resumed. Therefore the results would be erroneous.

@AyushExel
Contributor

@Juanjojr9 I think this requires a deeper look. If you look at the PR linked above, you'll see that the hyp dict, batch size and epochs are restored from the run. So if that's not showing up in the experiment, it's probably being overwritten somewhere. I'll take a deeper look, because the cause of the problem seems to be located somewhere else.

@Juanjojr9
Author

@AyushExel Ok, thank you very much for your help. I will be waiting for your answer, as it is a very interesting tool.

I also wanted to tell you that the wandb.login() command has been giving me problems in Google Colab for several days. I don't know if it's my fault. I will keep looking for information.

Thank you again

@AyushExel
Contributor

@Juanjojr9 thanks for reporting. I'll work on verifying the resume issue again this week.
What is the issue with wandb.login()?

@Juanjojr9
Author

@AyushExel When I run the command, it is as if it cannot start the session: it hangs for too long, and this causes the Google Colab session to crash and restart.
Even so, I'll keep testing in case I'm wrong.

@AyushExel
Contributor

@Juanjojr9 Okay, I investigated the resume further. The batch size was being remembered but the image size wasn't. I've made a PR to remember that. Currently, here are the params that are remembered when resuming:
opt.weights, opt.save_period, opt.batch_size, opt.bbox_interval, opt.epochs, opt.hyp, opt.imgsz

If there are any more things that need to be remembered, please let me know.
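
(As a rough illustration of what "remembering" these params means, not the actual YOLOv5 implementation, the resume path can copy the logged run config back onto the parsed options before training starts. Names and config keys below are simplified assumptions:)

    import argparse
    import wandb

    # Options that the resume logic is said to restore from the crashed run.
    RESTORED_KEYS = ["weights", "save_period", "batch_size",
                     "bbox_interval", "epochs", "hyp", "imgsz"]

    def restore_opt_from_wandb(opt: argparse.Namespace, run_path: str) -> argparse.Namespace:
        """Overwrite selected fields of opt with values logged to the given W&B run."""
        run = wandb.Api().run(run_path)  # run_path: "entity/project/run_id" (placeholder format)
        for key in RESTORED_KEYS:
            if key in run.config:
                setattr(opt, key, run.config[key])
        return opt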

@glenn-jocher linked a pull request Feb 12, 2022 that will close this issue
@glenn-jocher
Member

@Juanjojr9 good news 😃! Your original issue may now be fixed ✅ in PR #6611 by @AyushExel. To receive this update:

  • Git – git pull from within your yolov5/ directory or git clone https://github.com/ultralytics/yolov5 again
  • PyTorch Hub – force-reload with model = torch.hub.load('ultralytics/yolov5', 'yolov5s', force_reload=True)
  • Notebooks – view the updated notebooks (Open In Colab, Open In Kaggle)
  • Docker – sudo docker pull ultralytics/yolov5:latest to update your image

Thank you for spotting this issue and informing us of the problem. Please let us know if this update resolves the issue for you, and feel free to inform us of any other issues you discover or feature requests that come to mind. Happy trainings with YOLOv5 🚀!

@Juanjojr9
Author

@AyushExel Thank you very much for the help. I'll try it out and see how it works.

Best regards.

JJ
