Fixes issue with model reconstruction of the upper half of the image & saves model checkpoint in s3 #193

Closed
srmsoumya wants to merge 7 commits


srmsoumya
Collaborator

@srmsoumya srmsoumya commented Mar 26, 2024

This PR fixes an issue where the model reconstructed only the bottom 50% of the image during validation, and it stores model checkpoints in S3.

  • Adds a shuffle argument to ClayModule that is set to False by default
  • Logs model checkpoints to an AWS S3 bucket

Fixes #156 #138
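
To illustrate why a `shuffle` flag can matter here, below is a minimal sketch of patch masking. It is not the ClayModule code: `split_patches`, the 0.5 mask ratio, and the row-major patch ordering are assumptions made for illustration. The point is only that, without shuffling, the same contiguous block of patches is masked on every image, so the model only ever sees (and reconstructs) a fixed region. Which half ends up hidden depends on the real implementation.

```python
import random


def split_patches(num_patches, mask_ratio, shuffle, seed=None):
    """Return (masked, visible) patch indices for one image (hypothetical helper)."""
    # Row-major ordering is assumed: index 0 is the top-left patch.
    order = list(range(num_patches))
    if shuffle:
        random.Random(seed).shuffle(order)
    num_masked = int(mask_ratio * num_patches)
    return sorted(order[:num_masked]), sorted(order[num_masked:])


# A 256 x 256 image with 16 x 16 patches gives a 16 x 16 grid of patches.
num_patches = (256 // 16) ** 2

# Without shuffling, the masked set is always the first half of the indices,
# which under the row-major assumption is the top 8 rows of the patch grid.
masked, visible = split_patches(num_patches, mask_ratio=0.5, shuffle=False)

# With shuffling, the masked patches are spread over the whole image.
masked_rand, visible_rand = split_patches(num_patches, 0.5, shuffle=True, seed=0)
```

With `shuffle=False` every call returns the same split, which matches the symptom of one half of the image never being trained on.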

SRM added 4 commits March 11, 2024 13:03
- Lr -> 1e-4 to 1e-5
- Data -> Size: 256 x 256, patchsize: 16
- Log checkpoints to s3
- Save model params along with optimizer & epoch state
Member

@yellowcap yellowcap left a comment

Looks good except for some strange test errors 🐈

srmsoumya pushed a commit that referenced this pull request Apr 19, 2024
half of the image & saves model checkpoint in s3 (#193)

- Fix issue with not shuffling during validation run. Use shuffle=True while training & validation.
- Log to devseed-gaia account of wandb & save checkpoints on s3.
- Update params for v0.2 model run
    - Lr -> 1e-4 to 1e-5
    - Data -> Size: 256 x 256, patchsize: 16
    - Log checkpoints to s3
    - Save model params along with optimizer & epoch state
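
As a rough sketch of what "model params along with optimizer & epoch state" means for resuming a run, the example below bundles the three pieces into one payload and round-trips it through bytes. The helper names and the toy values are hypothetical, and `pickle` is only a stand-in for whatever the training framework actually writes; the upload to S3 is not shown.

```python
import io
import pickle


def build_checkpoint(model_params, optimizer_state, epoch):
    """Bundle everything needed to resume training (hypothetical layout)."""
    return {
        "epoch": epoch,
        "model_params": model_params,
        "optimizer_state": optimizer_state,
    }


def dump_checkpoint(checkpoint):
    """Serialize the checkpoint to bytes, e.g. for upload to object storage."""
    buffer = io.BytesIO()
    pickle.dump(checkpoint, buffer)
    return buffer.getvalue()


def load_checkpoint(blob):
    """Restore a checkpoint produced by dump_checkpoint."""
    return pickle.loads(blob)


# Toy values standing in for real tensors and optimizer buffers.
ckpt = build_checkpoint(
    model_params={"encoder.weight": [0.1, 0.2]},
    optimizer_state={"lr": 1e-5, "step": 1200},
    epoch=3,
)
blob = dump_checkpoint(ckpt)
restored = load_checkpoint(blob)

# Training would continue from the epoch after the saved one.
next_epoch = restored["epoch"] + 1
```

Saving only the model parameters would be enough for inference, but the optimizer and epoch state are what allow a run to pick up where it stopped.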
@srmsoumya
Collaborator Author

@weiji14 I am getting an error when creating a conda-lock.yml file with the new dependency.

Encountered problems while solving:
  - package pytorch-2.1.0-cuda120py38h1932296_301 requires cuda-version >=12.0,<13, but none of the providers can be installed

For now, I have merged this branch with main, as we need to develop v1 on top of v0.2. We can fix the issues with conda-lock & do a v0.2 release next week.

@srmsoumya srmsoumya closed this Apr 19, 2024
weiji14 added a commit that referenced this pull request Apr 21, 2024
Remove the `--platform linux-64` flag since unified lockfile is for linux-64, osx-64 and osx-arm64 as of #164. Also re-locking the conda-lock.yml file after 2a9ef9d/#193.
@weiji14
Contributor

weiji14 commented Apr 21, 2024

> @weiji14 I am getting an error when creating a conda-lock.yml file with the new dependency.
>
> Encountered problems while solving:
>   - package pytorch-2.1.0-cuda120py38h1932296_301 requires cuda-version >=12.0,<13, but none of the providers can be installed

Hmm, did you run `conda-lock lock --mamba --file environment.yml --with-cuda=12.0`? I get the same error as you when I leave out the `--with-cuda=12.0` flag. For reference, my conda-lock/mamba versions are:

$ conda-lock --version
conda-lock, version 2.5.6
$ mamba --version
mamba 1.5.8
conda 24.3.0

I'll patch this up at #225, and also update the docs slightly under the Note section in https://clay-foundation.github.io/model/installation.html#advanced about re-locking the conda-lock.yml file.
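
For anyone scripting the re-lock step, here is a small sketch that assembles the command quoted above as an argument list. The helper name `conda_lock_argv` and its parameters are hypothetical; the flags themselves are only the ones already mentioned in this thread.

```python
def conda_lock_argv(env_file="environment.yml", cuda_version="12.0", use_mamba=True):
    """Build the conda-lock invocation discussed above (hypothetical helper)."""
    argv = ["conda-lock", "lock"]
    if use_mamba:
        argv.append("--mamba")
    argv += ["--file", env_file]
    # Omitting this flag reproduced the cuda-version solver error in this thread.
    if cuda_version is not None:
        argv.append(f"--with-cuda={cuda_version}")
    return argv


command = conda_lock_argv()
print(" ".join(command))
```

The list can be passed to `subprocess.run(command, check=True)` in an environment where conda-lock is installed.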

weiji14 added a commit that referenced this pull request Apr 23, 2024
Remove the `--platform linux-64` flag since unified lockfile is for linux-64, osx-64 and osx-arm64 as of #164. Also re-locking the conda-lock.yml file after 2a9ef9d/#193.
Development

Successfully merging this pull request may close these issues.

Upper image does not train?
3 participants