
Sweeps not initializing properly with PyTorch Lightning #1059

Closed
braaannigan opened this issue May 22, 2020 · 9 comments

@braaannigan

wandb --version && python --version && uname
wandb, version 0.8.36
Python 3.7.6
Linux

Description

I'm trying to initialize a sweep using the WandbLogger for PyTorch Lightning. I'm following the Keras example in 'Intro to Hyperparameter Sweeps with W&B.ipynb'. I'm running it in Jupyter on my own machine.

Basic problem: nothing gets logged to wandb when I run the sweep.
Notable feature: when I start the sweep, it initializes a new hyperparameter config and starts a new run, but then it initializes another run. Nothing gets logged to either of them.

Individual runs are fine.

What I Did

sweep_config = {
    'method': 'random', #grid, random
    'metric': {
      'name': 'val_loss',
      'goal': 'minimize'   
    },
    'parameters': {
        'lr': {
            'min': 1e-4,
            'max': 1e-1
        },
    }
}
sweep_id = wandb.sweep(sweep_config, entity="user", project="project-name")

Then I specify the training function:

wandb_logger = WandbLogger()

def train():
    config_defaults = {
        'epochs': 5,
        'bs': 64,
        'lr': 1e-3,
        'seed': 42
    }
    wandb.init(config=config_defaults)
    config = config_defaults
    hparams = Namespace(
        lr=config['lr'],
        bs=config['bs']
    )
    wandb_logger.log_hyperparams(hparams)
    model = AutoEncoder(hparams)
    trainer = pl.Trainer(
        logger=wandb_logger,
        max_epochs=config['epochs'])
    trainer.fit(model)

I then call the sweep

wandb.agent(sweep_id, train)

and get the following output at the start:

INFO:wandb.wandb_agent:Running runs: []
INFO:wandb.wandb_agent:Agent received command: run
INFO:wandb.wandb_agent:Agent starting run with config:
	lr: 0.020506108917114917

wandb: Agent Starting Run: 59fu3sst with config:
	lr: 0.020506108917114917
wandb: Agent Started Run: 59fu3sst
Logging results to Weights & Biases (Documentation).
Project page: https://app.wandb.ai/user/proj-name
Sweep page: https://app.wandb.ai/user/proj-name/sweeps/20thclh6
Run page: https://app.wandb.ai/user/proj-name/runs/59fu3sst

INFO:wandb.run_manager:system metrics and metadata threads started
INFO:wandb.run_manager:checking resume status, waiting at most 10 seconds
INFO:wandb.run_manager:resuming run from id: UnVuOnYxOjU5ZnUzc3N0OmVmZi1kaW0tcmVkLXByb2plY3Q6bGJyYW5uaWdhbg==
INFO:wandb.run_manager:upserting run before process can begin, waiting at most 10 seconds
INFO:wandb.run_manager:saving pip packages
INFO:wandb.run_manager:initializing streaming files api
INFO:wandb.run_manager:unblocking file change observer, beginning sync with W&B servers

Logging results to Weights & Biases (Documentation).
Project page: https://app.wandb.ai/user/proj-name
Run page: https://app.wandb.ai/user/proj-name/runs/xv9xywx7

So it starts the run and gives the sweep page, but then seems to initialise a new run.
There's no additional wandb code in the model; it's a standard PyTorch Lightning set-up.

Any suggestions?

@cvphelps
Contributor

Hi there, could you try specifying the entity and project in wandb.init to match your sweep config:
wandb.init(project="your-project-name", entity="your-username")

Could you share a link to a sweep where you're seeing this issue?
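For reference, a rough sketch of how that could look in the train() function above (just a sketch; the project/entity strings are placeholders and should match the values passed to wandb.sweep):

import wandb

config_defaults = {'epochs': 5, 'bs': 64, 'lr': 1e-3, 'seed': 42}

def train():
    # Pass the same project/entity that were used when creating the sweep,
    # so the agent's run attaches to the right sweep ("project-name" and
    # "user" are placeholders here).
    run = wandb.init(config=config_defaults, project="project-name", entity="user")
    config = run.config  # picks up sweep-provided values such as lr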

@braaannigan
Author

Hi @cvphelps

Apologies for the delay. That still hasn't worked. I've put together a minimal example here with a simple autoencoder.

from argparse import Namespace
import math

import numpy as np
import pandas as pd

import torch
from torch import nn
from torch.nn import functional as F
from torch.utils.data import DataLoader, random_split,Dataset,SubsetRandomSampler,ConcatDataset
from torch.optim import Adam
from torch import Tensor
from torch.autograd import Variable
import pytorch_lightning as pl
from pytorch_lightning.loggers import WandbLogger
import wandb

class DatasetLoader(Dataset):
    def __init__(self):
        self.data = np.random.randn(100,768)

    def __len__(self):
        return self.data.shape[0]

    def __getitem__(self, item):
        return self.data[item].astype(np.float32)
class AutoEncoder(pl.LightningModule):
    def __init__(self,hparams):
        super().__init__()
        self.hparams = hparams
        hiddenDims = [384,192]
        hiddenDims = hiddenDims[:self.hparams.hdims+1]
        modules = []
        inDim = 768
        for hDim in hiddenDims:
            modules.append(
                nn.Sequential(
                nn.Linear(inDim, hDim),
                nn.ReLU())
                    )
            inDim = hDim
        self.encoder = nn.Sequential(*modules)
        modules = []
        hiddenDims.reverse()
        hiddenDims.append(768)
        for hDim in hiddenDims[1:]:
            print(inDim,hDim)
            modules.append(
                nn.Sequential(
                nn.Linear(inDim, hDim),
                nn.ReLU())
                    )
            inDim = hDim
        self.decoder = nn.Sequential(*modules)

    def forward(self, x):
        encoded = self.encoder(x)
        decoded = self.decoder(encoded)

        return encoded,decoded

    def prepare_data(self):
        data = DatasetLoader()        
        self.train_data = data[:]

    def train_valid_loaders(self,dataset, valid_fraction = 0.1, **kwargs):
        num_train = len(dataset)
        indices = list(range(num_train))
        split = int(math.floor(valid_fraction* num_train))
        np.random.seed(17)
        np.random.shuffle(indices)
        if 'num_workers' not in kwargs:
            kwargs['num_workers'] = 1

        train_idx, valid_idx = indices[split:], indices[:split]
        train_sampler = SubsetRandomSampler(train_idx)
        valid_sampler = SubsetRandomSampler(valid_idx)

        train_loader = DataLoader(dataset,sampler=train_sampler,pin_memory=True,
                                                   **kwargs)
        valid_loader = DataLoader(dataset,sampler=valid_sampler,pin_memory=True,
                                                   **kwargs)
        return train_loader, valid_loader
    def train_dataloader(self):
        train_loader,_ = self.train_valid_loaders(self.train_data, valid_fraction = 0.1,batch_size=self.hparams.bs)
        return train_loader

    def configure_optimizers(self):
        return Adam(self.parameters(), lr=self.hparams.lr, weight_decay=self.hparams.wd)

    def training_step(self, batch, batch_idx):
        x = batch
        encoded,decoded = self(x) 
        loss = F.mse_loss(decoded,x)
        logs = {'train_loss': loss}
        return {'loss': loss,'log':logs}

    def val_dataloader(self):
        _,valid_loader = self.train_valid_loaders(self.train_data,batch_size=640)
        return valid_loader

    def validation_step(self, batch, batch_idx):
        x = batch
        encoded,decoded = self(x)
        loss = F.mse_loss(decoded,x)
        logs = {'val_loss': loss}
        return {'val_loss': loss,'log':logs}

    def validation_epoch_end(self, outputs):
        avg_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
        tensorboard_logs = {'val_loss': avg_loss,'step': self.current_epoch}
        return {'avg_val_loss': avg_loss, 'log': tensorboard_logs}

# random data
x = np.random.random_sample((3, 768))
x = torch.tensor(x).float()
hparams = Namespace(
    lr =  1e-3,
    wd = 1e-5,
    hdims = 6,
    bs = 64
    )
# Init sweep
sweep_config = {
    'method': 'random', #grid, random
    'metric': {
      'name': 'val_loss',
      'goal': 'minimize'   
    },
    'parameters': {
        'hdims': {
            'values':[1,2]
        },
    }
}
sweep_id = wandb.sweep(sweep_config, entity="user", project="proj-name")

def train():
    config_defaults = {
        'lr': 1e-3,
        'epochs': 3,
        'wd': 1e-5,
        'bs': 64
    }
    run = wandb.init(config=config_defaults, project='proj-name', entity="lbrannigan")
    config = run.config
    hparams = Namespace(
        lr=config['lr'],
        bs=config['bs'],
        hdims=config['hdims'],
        wd=config['wd']
    )
    hdims = config['hdims']
    wandb_logger = WandbLogger(name=str(hdims)+'enc', project='proj-name', entity="user")

    wandb_logger.log_hyperparams(hparams)
    model = AutoEncoder(hparams)
    trainer = pl.Trainer(
        logger=wandb_logger,
        max_epochs=3)
    trainer.fit(model)

wandb.agent(sweep_id, function=train)

In this case the run again initialises twice (i.e. two Run page links are generated)
and then errors out with the error message:

RuntimeError: cuda runtime error (3) : initialization error at /opt/conda/conda-bld/pytorch_1549636813070/work/aten/src/THC/THCGeneral.cpp:55

@borisdayma
Contributor

Thanks @braaannigan. I changed a few details (see my notebook below) but I can confirm the issue. Sweeps currently work with pytorch-lightning in scripts (see this example) but not in jupyter environments.

@vanpelt I think this is because we added reinit=True in pytorch-lightning due to a distributed computing issue.

Based on my understanding, this creates a new run in Jupyter (and detaches the sweep run).

Should we make a modification in pytorch-lightning to have reinit=True only if we are sure we are not in a notebook environment?
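Something along these lines, as a rough sketch (the notebook-detection helper below is hypothetical, not an existing pytorch-lightning or wandb API):

import wandb

def _in_notebook():
    # Hypothetical helper: True when running inside a Jupyter/IPython kernel.
    try:
        from IPython import get_ipython
        shell = get_ipython()
        return shell is not None and shell.__class__.__name__ == "ZMQInteractiveShell"
    except ImportError:
        return False

# Only force a fresh run when running as a plain script, so that in notebooks
# the run created by the sweep agent is reused instead of being detached.
run = wandb.init(reinit=not _in_notebook())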

Here is a notebook to reproduce the issue, based on code from @braaannigan.

@drozzy

drozzy commented Jun 6, 2020

FYI, this happens in Jupyter even without sweeps, just when you're trying to use WandbLogger and wandb.init together.

I am retrieving my experiment like this (this avoids the call to wandb.init):
run = wandb_logger.experiment

Of course, the problem with this is that I can't pass the parameters I want logged to wandb.init.

@borisdayma
Contributor

Actually you can log hyper-parameters with this object through run.config.update(dict).
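For example (a minimal sketch; the parameter names are just placeholders):

from pytorch_lightning.loggers import WandbLogger

wandb_logger = WandbLogger()
run = wandb_logger.experiment               # the underlying wandb run, no extra wandb.init()
run.config.update({'lr': 1e-3, 'bs': 64})   # log hyper-parameters on that run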

@vanpelt
Contributor

vanpelt commented Jun 7, 2020

Is this happening from within Jupyter or when run via python?

@borisdayma
Contributor

Hi,

These issues should now be solved.

Here are some examples for running sweeps with pytorch-lightning:

* with a script: [lightning-kitti](https://github.com/borisdayma/lightning-kitti)

* with a colab or jupyter notebook: see at the end of [this colab](https://colab.research.google.com/drive/16d1uctGaw2y9KhGBlINNTsWpmlXdJwRW?usp=sharing)

Let me know if you still run into any issue.

@ariG23498
Contributor

Hey folks,
Since a solution is available and this thread has been inactive, we are closing this ticket.
In the past year we've substantially reworked the CLI and UI for Weights & Biases. Please comment to reopen. 😄

@kotchin

kotchin commented Oct 28, 2022

> Hi,
>
> These issues should now be solved.
>
> Here are some examples for running sweeps with pytorch-lightning:
>
> * with a script: [lightning-kitti](https://github.com/borisdayma/lightning-kitti)
>
> * with a colab or jupyter notebook: see at the end of [this colab](https://colab.research.google.com/drive/16d1uctGaw2y9KhGBlINNTsWpmlXdJwRW?usp=sharing)
>
> Let me know if you still run into any issue.

Thank you for your help. I currently have an experiment setup using LightningCLI which I enjoy, with yaml files for configuration, and everything seems to be working well. I was wondering: would it be out of scope to consider a wandb sweep experiment using LightningCLI? I have searched online and cannot find anything published about using LightningCLI to set up a wandb sweep. My understanding of sweeps is very limited, but it would be great to see whether the two can be used together (and how). Again, if I'm misinformed about the relevance of this use-case, please let me know. Thank you.

Edit: to clarify, the question is specifically around how to initialize agents with the right configuration, with the agents making use of LightningCLI. The agent needs to be aware of the configuration and sweep_id, using a yaml file.

Edit 2: it seems this is not currently possible as of 1.7.7; however, this could change with 1.8 if I understand correctly. 1.8 seems to introduce args in the LightningCLI call, which allows passing the configuration file.
Assuming this is correct, the following minimal changes are required to instantiate agents which make use of LightningCLI (to be made within the main Python file instantiating the LightningCLI):

from:

from pytorch_lightning.cli import LightningCLI
from myDataModule import myDataModule
from myModule import myModule

cli = LightningCLI(myModule, myDataModule)

to:

from pytorch_lightning.cli import LightningCLI
from myDataModule import myDataModule
from myModule import myModule

import wandb
import yaml

# Load the default/base configuration from a yaml file
with open('configs/base_config_clean.yaml', 'r') as fh:
    base_config = yaml.safe_load(fh)

# wandb.init merges these defaults with the parameters provided by the sweep agent
wandb.init(config=base_config)
config = wandb.config

# Pass the resulting configuration on to LightningCLI
cli = LightningCLI(myModule, myDataModule, args=config)

This allows making use of a default base config, which is then updated with the parameters provided by the sweep controller to the agent.
The agent is called exactly as normal, using wandb agent <SWEEP_ID>. This still needs to be verified using either the beta release or the final release.
