
debugging custom models #107

Open
dribnet opened this issue Sep 16, 2021 · 5 comments

dribnet commented Sep 16, 2021

TL;DR: custom training is great! Is there a good config or way to debug the quality of results on small-ish datasets?


I've managed to train my own custom models using the excellent additions provided by @rom1504 in #54 and have hooked this up to clip + vqgan back propagation successfully. However, so far the samples from my models are a bit glitchy. For example, with a custom dataset of images such as the following:

[image: example]

I'm only able to get a sample that looks something like this:

[image: painting_16_06]

Or similarly when I train on a dataset of sketches and images like these:

[image: Sketch (40)]

My clip + vqgan back propagation of "spider" with that model turns out like this:

[image: sunset_ink1_15_01]

So there is evidence that the model is picking up some coarse information such as color distributions, but the results are far from what I would expect using a simpler model such as StyleGAN on the same dataset.

So my questions:

  • Is there an easy change to instead more lightly fine-tune an existing model on my dataset? This would probably be sufficient for my purposes and perhaps better in a low-data regime (e.g. a 200-2000 image training set), and hopefully more robust to collapse, etc.
  • Is there a recommended strategy to monitor / diagnose / fix the training regimen? The reconstructions during training in the logs directory look fine. Other issues such as Very confused by the discriminator loss #93 seem to hint that the discriminator loss is the main metric to watch, but aren't clear on how to course-correct, etc.
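For reference, here is a minimal sketch of one way to pull the logged scalars (e.g. the discriminator loss) back out of the TensorBoard event files written under the run's logs directory. The run directory name and the tag filter are assumptions rather than anything documented by this repo, so check the tag names that actually show up in your own event files.

```python
# Sketch: list logged scalar tags and their latest values for a training run.
# The run directory name is hypothetical; tag names depend on the logger setup.
import glob

from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

run_dir = "logs/2021-09-16T00-00-00_custom_vqgan"  # hypothetical run directory

for event_file in glob.glob(f"{run_dir}/**/events.out.tfevents.*", recursive=True):
    acc = EventAccumulator(event_file)
    acc.Reload()
    for tag in acc.Tags()["scalars"]:
        if "loss" in tag or "disc" in tag:
            events = acc.Scalars(tag)
            last = events[-1]
            print(f"{tag}: {len(events)} points, last value {last.value:.4f} at step {last.step}")
```

Plotting these curves over training (rather than just printing the last value) is what would actually make something like discriminator collapse visible.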
@mrapplexz

> Is there an easy change to instead more lightly fine-tune an existing model on my dataset?

I've managed to fine-tune an existing model with these steps:

  1. Download the existing weights and config (e.g. https://heibox.uni-heidelberg.de/d/8088892a516d4e3baf92/)
  2. Create the directories <taming-transformers repo root>/logs/<some name>/configs and <taming-transformers repo root>/logs/<some name>/checkpoints
  3. Put the downloaded last.ckpt file into the newly created checkpoints directory
  4. Rename the downloaded model.yaml file to <some name>-project.yaml and put it into the configs directory
  5. Add these lines to the end of the <some name>-project.yaml file. Don't forget to adapt some values, as you would when training a model from scratch (a sketch for generating the image list files follows after these steps)
data:
  target: main.DataModuleFromConfig
  params:
    batch_size: 5
    num_workers: 8
    train:
      target: taming.data.custom.CustomTrain
      params:
        training_images_list_file: some/training.txt
        size: 256
    validation:
      target: taming.data.custom.CustomTest
      params:
        test_images_list_file: some/test.txt
        size: 256
  6. Run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file logs/<some name>/checkpoints/last.ckpt`
  7. Run `python main.py -t True --gpus <gpus> --resume logs/<some name>` and the training process should start :)
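For the training_images_list_file / test_images_list_file entries above: as far as I can tell, taming.data.custom.CustomTrain and CustomTest just expect plain text files with one image path per line. A minimal sketch for generating them (the dataset path, the some/ output location and the 90/10 split are arbitrary choices for illustration):

```python
# Sketch: write the image list files referenced in the config above,
# one image path per line. Paths and split ratio are arbitrary examples.
import random
from pathlib import Path

image_dir = Path("/data/my_dataset")  # hypothetical dataset location
paths = sorted(str(p) for p in image_dir.rglob("*")
               if p.suffix.lower() in {".jpg", ".jpeg", ".png"})

random.seed(0)
random.shuffle(paths)
split = int(0.9 * len(paths))  # 90% train / 10% test

Path("some").mkdir(exist_ok=True)
Path("some/training.txt").write_text("\n".join(paths[:split]) + "\n")
Path("some/test.txt").write_text("\n".join(paths[split:]) + "\n")
print(f"{split} training images, {len(paths) - split} test images")
```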


dribnet commented Sep 27, 2021

Thanks heaps @mrapplexz - this is indeed working well for me. So far I'm surprised how powerful even 100 iterations of fine-tuning is (I'll probably tweak the learning rate down, etc.), but this recipe was hugely helpful in getting me unblocked!
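For anyone else wanting to turn the learning rate down: the configs keep it under model.base_learning_rate, and main.py appears to scale that by the number of GPUs and the batch size, so the effective rate is larger than the value written in the yaml. A minimal sketch of lowering it in the project config with OmegaConf (the 10x reduction is just my guess, not a recommendation from the authors):

```python
# Sketch: lower base_learning_rate in an existing project config.
# The 10x reduction is arbitrary; main.py further scales this base value
# by ngpus * batch_size (and gradient accumulation) at startup.
from omegaconf import OmegaConf

cfg_path = "logs/<some name>/configs/<some name>-project.yaml"  # placeholder path
cfg = OmegaConf.load(cfg_path)

print("old base_learning_rate:", cfg.model.base_learning_rate)
cfg.model.base_learning_rate = float(cfg.model.base_learning_rate) / 10.0
OmegaConf.save(cfg, cfg_path)
print("new base_learning_rate:", cfg.model.base_learning_rate)
```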

@rromb rromb self-assigned this Sep 30, 2021

Awj2021 commented Apr 19, 2024

@mrapplexz @dribnet Hi, thank you for your amazing ideas, but a few points confuse me. When resuming the model, how should I set the number of training steps? For example, I have 1M images.


Awj2021 commented Apr 19, 2024

And I have another question, as discussed in issue #93: if I use a different dataset (e.g., a medical image dataset) to fine-tune the model, the disc_start = 0 setting shown in https://heibox.uni-heidelberg.de/d/8088892a516d4e3baf92/ may not be a good choice. But I am still training the model, so this is just a guess for now.
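If it does turn out to matter, one option may be to delay the discriminator rather than starting it at step 0. As far as I can tell, disc_start sits under model.params.lossconfig.params in the project yaml and is the global step at which the adversarial loss is switched on; a small sketch (the 10000-step value is just an illustrative guess):

```python
# Sketch: delay the GAN/discriminator loss by raising disc_start.
# The config path and the 10000-step value are illustrative only.
from omegaconf import OmegaConf

cfg_path = "logs/<some name>/configs/<some name>-project.yaml"
cfg = OmegaConf.load(cfg_path)

loss_params = cfg.model.params.lossconfig.params
print("old disc_start:", loss_params.disc_start)
loss_params.disc_start = 10000  # keep the discriminator off for the first steps
OmegaConf.save(cfg, cfg_path)
```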

@matthew-wave

> Is there an easy change to instead more lightly fine-tune an existing model on my dataset?
>
> I've managed to fine-tune an existing model with these steps: [steps quoted from @mrapplexz above]

Hello, thank you very much for your answer; it has been very helpful to me. I ran `python -m pytorch_lightning.utilities.upgrade_checkpoint --file logs/must_finish/vq_f8_16384/checkpoints/last.ckpt`, but after this command a `CUDA error: out of memory` is displayed, which confuses me. I am using the .ckpt file you linked to.
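My current guess is that the checkpoint tensors get restored onto the GPU during the upgrade while the GPU is already occupied; here is a minimal sketch of re-saving the checkpoint on CPU before running the upgrade again (this is only an assumption about the cause, not something I've confirmed):

```python
# Sketch: force all checkpoint tensors onto CPU, then rerun the upgrade command.
import torch

ckpt_path = "logs/must_finish/vq_f8_16384/checkpoints/last.ckpt"
ckpt = torch.load(ckpt_path, map_location="cpu")  # load everything to CPU memory
torch.save(ckpt, ckpt_path)                       # overwrite with the CPU copy
# then: python -m pytorch_lightning.utilities.upgrade_checkpoint --file logs/must_finish/vq_f8_16384/checkpoints/last.ckpt
```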
