'Trainer' object has no attribute 'proc_rank' #2267

Closed
vishal-burman opened this issue Jun 19, 2020 · 8 comments · Fixed by #2269
Labels
help wanted (Open to be worked on) · question (Further information is requested)

Comments

@vishal-burman

🐛 Bug

The first epoch runs to completion and then the error in the title is thrown from the is_logger() method.

To Reproduce

AttributeError                            Traceback (most recent call last)

<ipython-input-14-1b9ebf437115> in <module>()
      3 trainer = pl.Trainer(**train_params)
      4 
----> 5 trainer.fit(model)

8 frames

<ipython-input-3-bb983543bb31> in is_logger(self)
      8 
      9   def is_logger(self):
---> 10     return self.trainer.proc_rank <= 0
     11 
     12   def forward(

AttributeError: 'Trainer' object has no attribute 'proc_rank'

Code sample

class T5FineTuner(pl.LightningModule):
  def __init__(self, hparams):
    super(T5FineTuner, self).__init__()
    self.hparams = hparams
    
    self.model = T5ForConditionalGeneration.from_pretrained(hparams.model_name_or_path)
    self.tokenizer = T5Tokenizer.from_pretrained(hparams.tokenizer_name_or_path)
  
  def is_logger(self):
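    # NOTE: Trainer.proc_rank was renamed in 0.8.x (see discussion below), so this line raises the AttributeError above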
    return self.trainer.proc_rank <= 0
  
  def forward(
      self, input_ids, attention_mask=None, decoder_input_ids=None, decoder_attention_mask=None, lm_labels=None
  ):
    return self.model(
        input_ids,
        attention_mask=attention_mask,
        decoder_input_ids=decoder_input_ids,
        decoder_attention_mask=decoder_attention_mask,
        lm_labels=lm_labels,
    )

  def _step(self, batch):
    lm_labels = batch["target_ids"]
    lm_labels[lm_labels[:, :] == self.tokenizer.pad_token_id] = -100

    outputs = self(
        input_ids=batch["source_ids"],
        attention_mask=batch["source_mask"],
        lm_labels=lm_labels,
        decoder_attention_mask=batch['target_mask']
    )

    loss = outputs[0]

    return loss

  def training_step(self, batch, batch_idx):
    loss = self._step(batch)

    tensorboard_logs = {"train_loss": loss}
    return {"loss": loss, "log": tensorboard_logs}
  
  def training_epoch_end(self, outputs):
    avg_train_loss = torch.stack([x["loss"] for x in outputs]).mean()
    tensorboard_logs = {"avg_train_loss": avg_train_loss}
    return {"avg_train_loss": avg_train_loss, "log": tensorboard_logs, 'progress_bar': tensorboard_logs}

  def validation_step(self, batch, batch_idx):
    loss = self._step(batch)
    return {"val_loss": loss}
  
  def validation_epoch_end(self, outputs):
    avg_loss = torch.stack([x["val_loss"] for x in outputs]).mean()
    tensorboard_logs = {"val_loss": avg_loss}
    return {"avg_val_loss": avg_loss, "log": tensorboard_logs, 'progress_bar': tensorboard_logs}

  def configure_optimizers(self):
    "Prepare optimizer and schedule (linear warmup and decay)"

    model = self.model
    no_decay = ["bias", "LayerNorm.weight"]
    optimizer_grouped_parameters = [
        {
            "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
            "weight_decay": self.hparams.weight_decay,
        },
        {
            "params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)],
            "weight_decay": 0.0,
        },
    ]
    optimizer = AdamW(optimizer_grouped_parameters, lr=self.hparams.learning_rate, eps=self.hparams.adam_epsilon)
    self.opt = optimizer
    return [optimizer]
  
  def optimizer_step(self, epoch, batch_idx, optimizer, optimizer_idx, second_order_closure=None):
    if self.trainer.use_tpu:
      xm.optimizer_step(optimizer)
    else:
      optimizer.step()
    optimizer.zero_grad()
    self.lr_scheduler.step()
  
  def get_tqdm_dict(self):
    tqdm_dict = {"loss": "{:.3f}".format(self.trainer.avg_loss), "lr": self.lr_scheduler.get_last_lr()[-1]}

    return tqdm_dict

  def train_dataloader(self):
    train_dataset = get_dataset(tokenizer=self.tokenizer, type_path="train", args=self.hparams)
    dataloader = DataLoader(train_dataset, batch_size=self.hparams.train_batch_size, drop_last=True, shuffle=True, num_workers=4)
    t_total = (
        (len(dataloader.dataset) // (self.hparams.train_batch_size * max(1, self.hparams.n_gpu)))
        // self.hparams.gradient_accumulation_steps
        * float(self.hparams.num_train_epochs)
    )
    scheduler = get_linear_schedule_with_warmup(
        self.opt, num_warmup_steps=self.hparams.warmup_steps, num_training_steps=t_total
    )
    self.lr_scheduler = scheduler
    return dataloader

  def val_dataloader(self):
    val_dataset = get_dataset(tokenizer=self.tokenizer, type_path="val", args=self.hparams)
    return DataLoader(val_dataset, batch_size=self.hparams.eval_batch_size, num_workers=4)

Expected behavior
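
Training continues past the first epoch without raising an AttributeError, as it did in version 0.7.6.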

Environment

Please copy and paste the output from our environment collection script (or fill out the checklist below manually).

You can get the script and run it with:

wget https://github.com/raw/PyTorchLightning/pytorch-lightning/master/tests/collect_env_details.py
# For security purposes, please check the contents of collect_env_details.py before running it.
python collect_env_details.py
  • PyTorch Version (e.g., 1.0):
  • OS (e.g., Linux): Linux
  • How you installed PyTorch (conda, pip, source): pip
  • Build command you used (if compiling from source):
  • Python version: 3.6.9
  • CUDA/cuDNN version: 10.1
  • GPU models and configuration: NVIDIA K80
  • Any other relevant information:

Additional context

This code runs under version 0.7.6, but it breaks in the latest release.

vishal-burman added the bug (Something isn't working) and help wanted (Open to be worked on) labels on Jun 19, 2020
@github-actions
Contributor

Hi! Thanks for your contribution, great first issue!

Borda added the question (Further information is requested) label and removed the bug (Something isn't working) label on Jun 19, 2020
@Borda
Member

Borda commented Jun 19, 2020

proc_rank was renamed to global_rank in #2166

EDIT: it is also recommended to use the rank_zero_only wrapper
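
For reference, a minimal sketch of how the posted is_logger() method could be updated for 0.8.x following the rename above. The save_samples helper is hypothetical (not from the issue), and the exact import path for rank_zero_only may differ between versions (older releases expose it via pytorch_lightning.utilities.distributed):

from pytorch_lightning.utilities import rank_zero_only

class T5FineTuner(pl.LightningModule):
  ...

  def is_logger(self):
    # proc_rank no longer exists; global_rank is 0 on the main process
    return self.trainer.global_rank <= 0

  # alternatively, decorate rank-0-only helpers so they become no-ops on other ranks
  @rank_zero_only
  def save_samples(self):  # hypothetical helper, not from the issue
    ...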

@williamFalcon
Contributor

but also, why do you need to check the proc rank?

@vishal-burman
Author

Actually I borrowed most of the code from HuggingFace's T5 finetuning script, but I guess if it's not needed then I will remove it.

@sshleifer
Contributor

sshleifer commented Jun 19, 2020

@vishal-burman do you have DDP working with transformers and pytorch-lightning==0.8.1?
I am struggling. I fixed this proc_rank bug, but now I get NaN loss in both fp16 and fp32. Very strange.

@vishal-burman
Author

@sshleifer I am also struggling to make DDP work. According to the PyTorch documentation, there is a warning not to change model parameters after DDP construction. I wonder if that could be the cause.
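
For what it's worth, the ordering that documentation describes looks roughly like the sketch below (purely illustrative; a single-process gloo group is used just so it runs on CPU, and none of these names come from the issue):

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# single-process group so the sketch is runnable without multiple GPUs
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = torch.nn.Linear(4, 2)
model.weight.requires_grad_(False)  # freeze / modify parameters BEFORE wrapping
ddp_model = DDP(model)
# DDP registers its gradient hooks and buckets at construction time, so adding,
# removing, or re-freezing parameters after this point is what the warning covers.

dist.destroy_process_group()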

@williamFalcon
Contributor

williamFalcon commented Jun 19, 2020

Can you guys post a minimal example that is breaking? In Lightning we don't change model stuff once DDP starts. Maybe transformers is doing that?

But either way, the best thing is for us to have a model or test to test against.

@mrm8488

mrm8488 commented Jun 22, 2020

QUICK FIX: if you are training your model on a single GPU, set:

  def is_logger(self):
    return True
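
A more general variant, per the rename noted above, would be to return self.trainer.global_rank <= 0, which also behaves correctly for multi-GPU/DDP runs.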
