'Trainer' object has no attribute 'proc_rank' #2267

Closed
vishal-burman opened this issue Jun 19, 2020 · 8 comments · Fixed by #2269
Labels
help wanted (Open to be worked on) · question (Further information is requested)

Comments

@vishal-burman

🐛 Bug

The first epoch runs to completion and then the error in the title is thrown from the is_logger() method.

To Reproduce

AttributeError                            Traceback (most recent call last)

<ipython-input-14-1b9ebf437115> in <module>()
      3 trainer = pl.Trainer(**train_params)
      4 
----> 5 trainer.fit(model)

8 frames

<ipython-input-3-bb983543bb31> in is_logger(self)
      8 
      9   def is_logger(self):
---> 10     return self.trainer.proc_rank <= 0
     11 
     12   def forward(

AttributeError: 'Trainer' object has no attribute 'proc_rank'

Code sample

class T5FineTuner(pl.LightningModule):
  def __init__(self, hparams):
    super(T5FineTuner, self).__init__()
    self.hparams = hparams
    
    self.model = T5ForConditionalGeneration.from_pretrained(hparams.model_name_or_path)
    self.tokenizer = T5Tokenizer.from_pretrained(hparams.tokenizer_name_or_path)
  
  def is_logger(self):
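    # NOTE: Trainer.proc_rank was renamed in 0.8.x (see discussion below), so this line raises the AttributeError above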
    return self.trainer.proc_rank <= 0
  
  def forward(
      self, input_ids, attention_mask=None, decoder_input_ids=None, decoder_attention_mask=None, lm_labels=None
  ):
    return self.model(
        input_ids,
        attention_mask=attention_mask,
        decoder_input_ids=decoder_input_ids,
        decoder_attention_mask=decoder_attention_mask,
        lm_labels=lm_labels,
    )

  def _step(self, batch):
    lm_labels = batch["target_ids"]
    lm_labels[lm_labels[:, :] == self.tokenizer.pad_token_id] = -100

    outputs = self(
        input_ids=batch["source_ids"],
        attention_mask=batch["source_mask"],
        lm_labels=lm_labels,
        decoder_attention_mask=batch['target_mask']
    )

    loss = outputs[0]

    return loss

  def training_step(self, batch, batch_idx):
    loss = self._step(batch)

    tensorboard_logs = {"train_loss": loss}
    return {"loss": loss, "log": tensorboard_logs}
  
  def training_epoch_end(self, outputs):
    avg_train_loss = torch.stack([x["loss"] for x in outputs]).mean()
    tensorboard_logs = {"avg_train_loss": avg_train_loss}
    return {"avg_train_loss": avg_train_loss, "log": tensorboard_logs, 'progress_bar': tensorboard_logs}

  def validation_step(self, batch, batch_idx):
    loss = self._step(batch)
    return {"val_loss": loss}
  
  def validation_epoch_end(self, outputs):
    avg_loss = torch.stack([x["val_loss"] for x in outputs]).mean()
    tensorboard_logs = {"val_loss": avg_loss}
    return {"avg_val_loss": avg_loss, "log": tensorboard_logs, 'progress_bar': tensorboard_logs}

  def configure_optimizers(self):
    "Prepare optimizer and schedule (linear warmup and decay)"

    model = self.model
    no_decay = ["bias", "LayerNorm.weight"]
    optimizer_grouped_parameters = [
        {
            "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
            "weight_decay": self.hparams.weight_decay,
        },
        {
            "params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)],
            "weight_decay": 0.0,
        },
    ]
    optimizer = AdamW(optimizer_grouped_parameters, lr=self.hparams.learning_rate, eps=self.hparams.adam_epsilon)
    self.opt = optimizer
    return [optimizer]
  
  def optimizer_step(self, epoch, batch_idx, optimizer, optimizer_idx, second_order_closure=None):
    if self.trainer.use_tpu:
      xm.optimizer_step(optimizer)
    else:
      optimizer.step()
    optimizer.zero_grad()
    self.lr_scheduler.step()
  
  def get_tqdm_dict(self):
    tqdm_dict = {"loss": "{:.3f}".format(self.trainer.avg_loss), "lr": self.lr_scheduler.get_last_lr()[-1]}

    return tqdm_dict

  def train_dataloader(self):
    train_dataset = get_dataset(tokenizer=self.tokenizer, type_path="train", args=self.hparams)
    dataloader = DataLoader(train_dataset, batch_size=self.hparams.train_batch_size, drop_last=True, shuffle=True, num_workers=4)
    t_total = (
        (len(dataloader.dataset) // (self.hparams.train_batch_size * max(1, self.hparams.n_gpu)))
        // self.hparams.gradient_accumulation_steps
        * float(self.hparams.num_train_epochs)
    )
    scheduler = get_linear_schedule_with_warmup(
        self.opt, num_warmup_steps=self.hparams.warmup_steps, num_training_steps=t_total
    )
    self.lr_scheduler = scheduler
    return dataloader

  def val_dataloader(self):
    val_dataset = get_dataset(tokenizer=self.tokenizer, type_path="val", args=self.hparams)
    return DataLoader(val_dataset, batch_size=self.hparams.eval_batch_size, num_workers=4)

Expected behavior
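
Training continues past the first epoch without raising an AttributeError, as it did in version 0.7.6.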

Environment

Please copy and paste the output from our environment collection script (or fill out the checklist below manually).

You can get the script and run it with:

wget https://github.com/raw/PyTorchLightning/pytorch-lightning/master/tests/collect_env_details.py
# For security purposes, please check the contents of collect_env_details.py before running it.
python collect_env_details.py
  • PyTorch Version (e.g., 1.0):
  • OS (e.g., Linux): Linux
  • How you installed PyTorch (conda, pip, source): pip
  • Build command you used (if compiling from source):
  • Python version: 3.6.9
  • CUDA/cuDNN version: 10.1
  • GPU models and configuration: NVIDIA K80
  • Any other relevant information:

Additional context

This code runs under version 0.7.6, but it breaks in the latest release.

vishal-burman added the bug (Something isn't working) and help wanted (Open to be worked on) labels on Jun 19, 2020
@github-actions
Contributor

Hi! Thanks for your contribution, great first issue!

Borda added the question (Further information is requested) label and removed the bug (Something isn't working) label on Jun 19, 2020
@Borda
Member

Borda commented Jun 19, 2020

proc_rank was renamed to global_rank in #2166

EDIT: it is also recommended to use the rank_zero_only wrapper
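
For reference, a minimal sketch of how the posted is_logger() method could be updated for 0.8.x following the rename above. The save_samples helper is hypothetical (not from the issue), and the exact import path for rank_zero_only may differ between versions (older releases expose it via pytorch_lightning.utilities.distributed):

from pytorch_lightning.utilities import rank_zero_only

class T5FineTuner(pl.LightningModule):
  ...

  def is_logger(self):
    # proc_rank no longer exists; global_rank is 0 on the main process
    return self.trainer.global_rank <= 0

  # alternatively, decorate rank-0-only helpers so they become no-ops on other ranks
  @rank_zero_only
  def save_samples(self):  # hypothetical helper, not from the issue
    ...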

@williamFalcon
Contributor

but also, why do you need to check the proc rank?

@vishal-burman
Author

Actually I borrowed most of the code from HuggingFace's T5 finetuning script, but I guess if it's not needed then I will remove it.

@sshleifer
Contributor

sshleifer commented Jun 19, 2020

@vishal-burman do you have DDP working with transformers and pytorch-lightning==0.8.1?
I am struggling. I fixed this proc_rank bug, but now I get NaN loss in both fp16 and fp32. Very strange.

@vishal-burman
Author

@sshleifer I am also struggling to make DDP work. According to the PyTorch documentation, there is a warning not to change model parameters after DDP construction. I wonder if that could be the cause.
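
For what it's worth, the ordering that documentation describes looks roughly like the sketch below (purely illustrative; a single-process gloo group is used just so it runs on CPU, and none of these names come from the issue):

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# single-process group so the sketch is runnable without multiple GPUs
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = torch.nn.Linear(4, 2)
model.weight.requires_grad_(False)  # freeze / modify parameters BEFORE wrapping
ddp_model = DDP(model)
# DDP registers its gradient hooks and buckets at construction time, so adding,
# removing, or re-freezing parameters after this point is what the warning covers.

dist.destroy_process_group()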

@williamFalcon
Contributor

williamFalcon commented Jun 19, 2020

Can you guys post a minimal example that is breaking? In Lightning we don't change model stuff once DDP starts. Maybe transformers is doing that?

But either way, the best thing is for us to have a model or test to test against.

@mrm8488

mrm8488 commented Jun 22, 2020

QUICK FIX: if you are training your model on a single GPU, set:

  def is_logger(self):
    return True
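
A more general variant, per the rename noted above, would be to return self.trainer.global_rank <= 0, which also behaves correctly for multi-GPU/DDP runs.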
