
Continue training on best.pt after running all epochs successfully #7343

Closed
1 task done
dhruvildave22 opened this issue Apr 8, 2022 · 24 comments
Labels
question Further information is requested

Comments

@dhruvildave22

Search before asking

Question

Hello!
Can anybody resolve my confusion about continuing training on best.pt after all epochs have run successfully?

I trained a yolov5s.pt model on COCO plus a custom dataset for 65 epochs, and it created a best.pt model. I want to run more epochs on it to get better precision and recall, so I started training from best.pt, which already reflects those 65 epochs. I initially thought that training from best.pt would continue from the 66th epoch, but after seeing the precision and recall values, I assume it starts fresh again.
I tried --resume, but it can only be used if the training was interrupted.
Is there any way I can continue training from where I finished with best.pt?

Thanks
Dhruvil Dave

Additional

No response

@dhruvildave22 dhruvildave22 added the question Further information is requested label Apr 8, 2022
@github-actions
Contributor

github-actions bot commented Apr 8, 2022

👋 Hello @dhruvildave22, thank you for your interest in YOLOv5 🚀! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.

If this is a 🐛 Bug Report, please provide screenshots and minimum viable code to reproduce your issue; otherwise we cannot help you.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset images, training logs, screenshots, and a public link to online W&B logging if available.

For business inquiries or professional support requests please visit https://ultralytics.com or email support@ultralytics.com.

Requirements

Python>=3.7.0 with all requirements.txt installed including PyTorch>=1.7. To get started:

git clone https://github.com/ultralytics/yolov5  # clone
cd yolov5
pip install -r requirements.txt  # install

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

CI CPU testing

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training (train.py), validation (val.py), inference (detect.py) and export (export.py) on MacOS, Windows, and Ubuntu every 24 hours and on every commit.

@glenn-jocher
Member

glenn-jocher commented Apr 8, 2022

@dhruvildave22 👋 Hello! Thanks for asking about resuming training. YOLOv5 🚀 Learning Rate (LR) schedulers follow predefined LR curves for the fixed number of --epochs defined at training start (default=300), and are designed to fall to a minimum LR on the final epoch for best training results. For this reason you can not modify the number of epochs once training has started.

LR Curves
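For intuition, here is a minimal sketch of a linear LR schedule of this kind built with PyTorch's LambdaLR. It is illustrative only, not the exact YOLOv5 implementation, and the lrf value is an assumption; the point is that the curve is tied to the total --epochs chosen at training start:

import torch

model = torch.nn.Linear(10, 2)  # stand-in model for illustration
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

epochs, lrf = 300, 0.01  # total epochs and final LR fraction (assumed values)
lf = lambda x: (1 - x / epochs) * (1.0 - lrf) + lrf  # multiplier decays linearly from 1.0 to lrf
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lf)

for epoch in range(epochs):
    # ... train one epoch ...
    scheduler.step()  # the LR only reaches its minimum on the final epoch of the planned schedule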

If your training was interrupted for any reason you may continue where you left off using the --resume argument. If your training fully completed, you can start a new training from any model using the --weights argument. Examples:

Resume Single-GPU

You may not change settings when resuming, and no additional arguments other than --resume should be passed, with an optional path to the checkpoint you'd like to resume from. If no checkpoint is passed the most recently updated last.pt in your yolov5/ directory is automatically found and used:

python train.py --resume  # automatically find latest checkpoint (searches yolov5/ directory)
python train.py --resume path/to/last.pt  # specify resume checkpoint

Resume Multi-GPU

Multi-GPU DDP trainings must be resumed with the same GPUs and DDP command, i.e. assuming 8 GPUs:

python -m torch.distributed.run --nproc_per_node 8 train.py --resume  # resume latest checkpoint
python -m torch.distributed.run --nproc_per_node 8 train.py --resume path/to/last.pt  # specify resume checkpoint

Start from Pretrained

If you would like to start training from a fully trained model, use the --weights argument, not the --resume argument:

python train.py --weights path/to/best.pt  # start from pretrained model

Good luck 🍀 and let us know if you have any other questions!

@dhruvildave22
Author

@glenn-jocher So, if I train a model for 70 epochs and after a successful run it creates a best.pt, but the precision is not very good, I have to train more on best.pt.
My question here is:
If I have to run 70 more epochs, will best.pt already contain all the training from the earlier 70 epochs? And when I restart training on it, should I pass --epochs 70 --weights best.pt, or do I have to pass --epochs 140?

Thanks
Dhruvil Dave

@VYRION-Ai

@glenn-jocher So, if I train a model for 70 epochs and after a successful run it creates a best.pt, but the precision is not very good, I have to train more on best.pt. My question here is: if I have to run 70 more epochs, will best.pt already contain all the training from the earlier 70 epochs? And when I restart training on it, should I pass --epochs 70 --weights best.pt, or do I have to pass --epochs 140?

Thanks Dhruvil Dave

--epochs 70 --weights best.pt
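For context, a complete command along those lines might look like the following sketch (the dataset YAML, image size and weights path are placeholder assumptions):

python train.py --data data.yaml --img 640 --epochs 70 --weights runs/train/exp/weights/best.pt  # fine-tune best.pt for 70 more epochs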

@javiercalle97

Is it possible to restart a training run that stopped early due to patience? I set patience to 20, but looking at the mAP@0.5:0.95 curve, I think it was still improving and just needed more than 20 epochs without improvement.

I have tried to resume the training after increasing the patience value, but this error appears:

assert start_epoch > 0, f'{weights} training to {epochs} epochs is finished, nothing to resume.'
AssertionError: runs/train/exp6/weights/last.pt training to 300 epochs is finished, nothing to resume.

but the last epoch was 160.

@glenn-jocher
Member

glenn-jocher commented Apr 8, 2022

@javiercalle97 👋 Hello! Thanks for asking about resuming training. YOLOv5 🚀 Learning Rate (LR) schedulers follow predefined LR curves for the fixed number of --epochs defined at training start (default=300), and are designed to fall to a minimum LR on the final epoch for best training results. For this reason you can not modify the number of epochs once training has started.

LR Curves

If your training was interrupted for any reason you may continue where you left off using the --resume argument. If your training fully completed, you can start a new training from any model using the --weights argument. Examples:

Resume Single-GPU

You may not change settings when resuming, and no additional arguments other than --resume should be passed, with an optional path to the checkpoint you'd like to resume from. If no checkpoint is passed the most recently updated last.pt in your yolov5/ directory is automatically found and used:

python train.py --resume  # automatically find latest checkpoint (searches yolov5/ directory)
python train.py --resume path/to/last.pt  # specify resume checkpoint

Resume Multi-GPU

Multi-GPU DDP trainings must be resumed with the same GPUs and DDP command, i.e. assuming 8 GPUs:

python -m torch.distributed.run --nproc_per_node 8 train.py --resume  # resume latest checkpoint
python -m torch.distributed.run --nproc_per_node 8 train.py --resume path/to/last.pt  # specify resume checkpoint

Start from Pretrained

If you would like to start training from a fully trained model, use the --weights argument, not the --resume argument:

python train.py --weights path/to/best.pt  # start from pretrained model

Good luck 🍀 and let us know if you have any other questions!

@dhruvildave22
Author

Thanks @glenn-jocher,
Closing the issue 🙂

@619732457

Hello, I want to add 500 more epochs to a model that has already completed 500 epochs of training, without restarting a new training run. I read the information above carefully, but I still don't quite understand how to do it.
For example, I now have epochs 0-499 and I want to get epochs 500-999.

@glenn-jocher
Member

glenn-jocher commented May 8, 2022

@619732457 👋 Hello! Thanks for asking about resuming training. YOLOv5 🚀 Learning Rate (LR) schedulers follow predefined LR curves for the fixed number of --epochs defined at training start (default=300), and are designed to fall to a minimum LR on the final epoch for best training results. For this reason you can not modify the number of epochs once training has started.

LR Curves

If your training was interrupted for any reason you may continue where you left off using the --resume argument. If your training fully completed, you can start a new training from any model using the --weights argument. Examples:

Resume Single-GPU

You may not change settings when resuming, and no additional arguments other than --resume should be passed, with an optional path to the checkpoint you'd like to resume from. If no checkpoint is passed the most recently updated last.pt in your yolov5/ directory is automatically found and used:

python train.py --resume  # automatically find latest checkpoint (searches yolov5/ directory)
python train.py --resume path/to/last.pt  # specify resume checkpoint

Resume Multi-GPU

Multi-GPU DDP trainings must be resumed with the same GPUs and DDP command, i.e. assuming 8 GPUs:

python -m torch.distributed.run --nproc_per_node 8 train.py --resume  # resume latest checkpoint
python -m torch.distributed.run --nproc_per_node 8 train.py --resume path/to/last.pt  # specify resume checkpoint

Start from Pretrained

If you would like to start training from a fully trained model, use the --weights argument, not the --resume argument:

python train.py --weights path/to/best.pt  # start from pretrained model

Good luck 🍀 and let us know if you have any other questions!

@Transigent
Contributor

This is confusing, because there is more than one answer on StackOverflow, like this one https://stackoverflow.com/a/72922281/521342, that says the run can be resumed for more epochs by changing the number of epochs in the opt.yaml file and using the --resume syntax.

@glenn-jocher indicates in the response above that the LR schedulers are designed to taper off to a minimum at the epoch count specified in the original training command.

It is still not clear what the impact on the scheduler values is if the opt.yaml file is modified and the run is then resumed.
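To make the scheduler concern concrete, here is a small illustrative calculation using the same linear-decay assumption as the sketch earlier in this thread (lrf = 0.01 is an assumed value): the LR multiplier at a given epoch depends on the total epoch count, so editing epochs in opt.yaml before resuming changes the LR the run will use from that point on.

lrf = 0.01  # assumed final LR fraction
lf = lambda x, epochs: (1 - x / epochs) * (1.0 - lrf) + lrf  # linear decay multiplier

epoch = 160
print(lf(epoch, 300))  # ~0.47 -> multiplier if the original 300-epoch schedule is kept
print(lf(epoch, 600))  # ~0.74 -> higher multiplier if opt.yaml is edited to 600 epochs before resuming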

@arielkantorovich

I don't see a concrete answer here. If I continue training, for example for another 200 epochs from the best.pt weights, the mAP starts very low again. Is there no option to continue training for more epochs once the run has finished? Since the program completed, I can only use --resume if the training happened to be interrupted.

@glenn-jocher
Member

@arielkantorovich hello! I understand the confusion regarding resuming training for additional epochs in YOLOv5.
I apologize for any ambiguity in my previous response.

The number of epochs in the opt.yaml file is only used during the initial training process when python train.py is first executed. Once the training process finishes and you have the best.pt or last.pt weights file, you cannot simply resume training with additional epochs by modifying the opt.yaml file. The LR schedulers in YOLOv5 are designed to taper off to a minimum value at the epoch count specified in the original training command, and it is not recommended to modify this value.

You can resume training from the most recently saved checkpoint file using --resume path/to/last.pt. However, this only resumes training from where it left off and does not add any additional epochs. If you want to add more epochs to an already-trained model, you can start a new training run from best.pt by using the --weights path/to/best.pt argument.

I hope this clears up any confusion. Let us know if you have any further questions or concerns!

@arielkantorovich

Thank you for your reply. When I train from weights/last.pt, I see that my mAP is very low and does not really continue to increase from the last results. Why is this happening?
I expected it to continue around the same results. To be more concrete: I trained for 200 epochs and got a last.pt at around 0.251 mAP; now, when I train again with --weights last.pt, my training starts with an mAP of 0.0447.

@glenn-jocher
Member

@arielkantorovich hello!

Thank you for reaching out. It is not uncommon to see a lower mAP at the start when you launch a new training run from a last.pt checkpoint with --weights. In that case only the model weights are carried over: the epoch counter, optimizer state, EMA and LR schedule all start fresh, so the first epochs run at a relatively high learning rate (including warmup), which temporarily perturbs the already-converged weights and can make the mAP drop before it recovers.

You could try lowering the initial learning rate (lr0) or increasing warmup_epochs in the hyperparameter YAML used for the new run. Another option is to start the new training run from the best.pt checkpoint file instead of last.pt by using the --weights path/to/best.pt argument, which initializes training with the best weights achieved during the previous sessions.
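If it helps, here is a minimal sketch of that kind of fine-tuning setup. It assumes the default hyperparameter file lives at data/hyps/hyp.scratch-low.yaml; the path, file name and values may differ between YOLOv5 versions and are illustrative only:

cp data/hyps/hyp.scratch-low.yaml data/hyps/hyp.finetune-custom.yaml  # copy the default hyperparameters
# edit lr0 (e.g. 0.01 -> 0.001) and warmup_epochs in the copied file, then pass it to train.py:
python train.py --weights path/to/best.pt --hyp data/hyps/hyp.finetune-custom.yaml --data data.yaml --epochs 100 --img 640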

I hope this helps. Let us know if you have any other questions or concerns.

@aravindchakravarti

Hi @glenn-jocher,

Sorry, I am asking the same question as above. I have read all the posts here; I have the same problem and have been unable to solve it.

I am training YOLOv5 in Google Colab. Since time in Colab is limited, and since I don't know how many epochs I will need to reach the desired performance, I want to train step by step.

  • I started training a YOLOv5 (nano) model from scratch, trained for 100 epochs, and saved the results + weights to Google Drive:
    !python train.py --img 640 --epochs 100 --data data.yaml --weights ' ' --cfg yolov5n.yaml --batch-size 90 --name person_detection --project /content/drive/MyDrive/YoloV5_Ultralytics/results/ --patience 25
  • Then used the best.pt from the step above to train the network again for another 100 epochs:
    !python train.py --img 640 --epochs 100 --data data.yaml --weights path/to/best.pt --cfg yolov5n.yaml --batch-size 90 --name person_detection --project /content/drive/MyDrive/YoloV5_Ultralytics/results/ --patience 25
  • Repeat step 2 until the desired mAP is reached.

Unfortunately, whenever I continue training (i.e., from the previous best.pt), I see that YOLOv5 training starts from scratch. Can you please help?


@glenn-jocher
Member

@aravindchakravarti 👋

Thank you for detailing your steps. What you are seeing is expected behaviour for --weights: when you start a new run with --weights path/to/best.pt, YOLOv5 initializes the model from those weights, but the epoch counter, optimizer state and LR schedule start fresh, so the logs look like a brand-new training run rather than a continuation of the previous one.

However, it's essential to ensure that the best.pt file is being saved correctly. You could verify this by inspecting the file size and modified timestamps of the best.pt file to confirm that it is being updated at each checkpoint.
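One quick way to check is to inspect the checkpoint file directly. The sketch below assumes the checkpoint is a dict containing an 'epoch' entry, as YOLOv5 checkpoints typically are; the path is a placeholder, and it should be run from inside the yolov5/ repo so the pickled model class can be found:

import os, torch

ckpt_path = 'runs/train/person_detection/weights/best.pt'  # hypothetical path
print(os.path.getsize(ckpt_path), os.path.getmtime(ckpt_path))  # file size and last-modified time

ckpt = torch.load(ckpt_path, map_location='cpu')
print(ckpt.get('epoch'))         # epoch at which this checkpoint was saved (often -1 once a run has fully finished)
print(ckpt.get('best_fitness'))  # best fitness recorded so far, if present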

In other words, a new run started from best.pt does benefit from the previously learned weights even though its metrics restart from epoch 0. If you find that your mAP is not improving when fine-tuning from best.pt, you could consider lowering the initial learning rate (lr0 in your hyperparameter YAML) and adjusting other hyperparameters to fine-tune the training process.

I hope this information helps. Please feel free to reach out if you have any further questions or issues!

@Varun3713-creator

@glenn-jocher hello,
I have trained a YOLOv9 model on my dataset using:
!python train.py \
--batch 16 --epochs 50 --img 640 --device 0 --min-items 0 --close-mosaic 15 \
--data /content/drive/MyDrive/mango_readyallclasses/data.yaml \
--weights {HOME}/weights/gelan-c.pt \
--cfg models/detect/gelan-c.yaml \
--hyp hyp.scratch-high.yaml

Now I want to add more epochs to get a higher mAP. Can you share the command to resume the training from here for YOLOv9? Please reply as soon as possible.

@glenn-jocher
Member

Hello @Varun3713-creator,

Thank you for reaching out! It looks like you're interested in resuming your training to add more epochs and improve your mAP. To do this, you can use the --resume flag with the path to your last checkpoint. Here’s how you can do it:

  1. Ensure you have the latest version: First, make sure you are using the latest versions of torch and the YOLOv5 repository. You can update your repository with the following commands:

    git pull
    pip install -r requirements.txt
  2. Resume Training: To resume training from your last checkpoint, you can use the --resume flag. This will continue training from where it left off, using the most recent checkpoint saved in your training directory. Here’s an example command:

    !python train.py --resume /content/drive/MyDrive/mango_readyallclasses/weights/last.pt

If you encounter any issues or if the training does not resume as expected, please provide a minimum reproducible code example so we can investigate further. You can find more information on how to create a minimum reproducible example here. This will help us better understand the problem and provide a more accurate solution.

Feel free to reach out if you have any other questions or need further assistance. Happy training! 🚀

@Varun3713-creator

Varun3713-creator commented Jun 28, 2024 via email

@glenn-jocher
Member

Hello @Varun3713-creator,

Thank you for your follow-up question! I understand the confusion regarding specifying the number of epochs when resuming training.

When you use the --resume flag, YOLOv5 automatically continues training from the last saved checkpoint, including the number of epochs left to train. You do not need to specify the number of epochs again. The training will resume from where it left off, continuing until it reaches the total number of epochs specified in the original training command.

For example, if you initially set --epochs 50 and the training was interrupted at epoch 30, using --resume will continue training from epoch 30 to epoch 50.

If you want to extend the total number of epochs beyond the original setting, you can modify the opt.yaml file or specify the new total number of epochs directly in the command line. Here’s how you can do it:

  1. Modify the Total Epochs: If you initially trained for 50 epochs and now want to train for an additional 50 epochs (total 100 epochs), you can specify the new total number of epochs when resuming:
    !python train.py --resume /content/drive/MyDrive/mango_readyallclasses/weights/last.pt --epochs 100

This command will continue training from the last checkpoint and extend the total number of epochs to 100.

Please ensure you are using the latest versions of torch and the YOLOv5 repository to avoid any compatibility issues. If you encounter any further issues, providing a minimum reproducible code example will help us investigate and provide a more accurate solution. You can find more information on creating a minimum reproducible example here.

Feel free to reach out if you have any other questions or need further assistance. Happy training! 🚀

@Varun3713-creator

Varun3713-creator commented Jul 1, 2024 via email

@Varun3713-creator

Varun3713-creator commented Jul 2, 2024 via email

@Glenn

Glenn commented Jul 2, 2024

Can you guys please stop tagging me; I'm not the right Glenn, and I keep having to unsubscribe from your threads!

@glenn-jocher
Member

Hello @Glenn,

Thank you for bringing this to our attention. We apologize for the inconvenience caused by the incorrect tagging. We'll ensure to be more careful in the future to avoid tagging the wrong person.

To the community: please note that @Glenn is not the Glenn Jocher involved in this project, so please avoid tagging that account. Let's focus on addressing the issue at hand.


@Varun3713-creator,

Thank you for your patience. The error you're encountering, "local variable 'epoch' referenced before assignment," suggests there might be an issue with the way the resume functionality is being handled in your specific setup.

To help us investigate further, could you please provide a minimum reproducible code example? This will allow us to better understand the context and pinpoint the issue. You can find guidance on creating a minimum reproducible example here.

In the meantime, please ensure you are using the latest versions of torch and the YOLOv5 repository. You can update your repository with the following commands:

git pull
pip install -r requirements.txt

If the issue persists, you might want to try the following workaround to manually specify the number of epochs when resuming training:

  1. Manually Adjust Epochs: Instead of using the --resume flag, you can manually load the weights and specify the total number of epochs. Here’s an example command:
    !python train.py --weights /content/drive/MyDrive/mango_readyallclasses/weights/last.pt --epochs 100 --data /content/drive/MyDrive/mango_readyallclasses/data.yaml --cfg models/detect/gelan-c.yaml --hyp hyp.scratch-high.yaml --batch 16 --img 640 --device 0 --min-items 0 --close-mosaic 15

This command will start a new training session using the weights from last.pt and continue training for a total of 100 epochs.

Please let us know if this resolves your issue or if you need further assistance. We're here to help! 😊
