
Learning rate finder crashes if accumulate_grad_batches is not set to 1 #1726

Closed
RafailFridman opened this issue May 4, 2020 · 8 comments · Fixed by #1801
Labels
question Further information is requested

Comments

@RafailFridman

I'm not sure if it is expected behavior or a bug, but when I'm trying to find a learning rate like this:

trainer = pl.Trainer(gpus=[1], accumulate_grad_batches=8)
lr_finder = trainer2.lr_find(model, min_lr=1e-8, max_lr=1e-1, num_training=300)

It throws AttributeError: 'NoneType' object has no attribute 'item', which happens on line 335 of lr_finder.py: current_loss = trainer.running_loss.last().item()

When I remove accumulate_grad_batches=8, everything works as expected.
If this is expected behavior, I suggest adding a more informative error message.
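
In case it helps, the traceback boils down to calling .item() on None: trainer.running_loss.last() apparently returns None when no loss has been recorded yet, which is what seems to happen while gradients are still being accumulated. A minimal, hypothetical sketch of the failure mode (not Lightning's actual running-loss class):

class RunningLossSketch:
    def __init__(self):
        self._losses = []

    def last(self):
        # None until at least one loss has been recorded,
        # e.g. while gradients are still being accumulated
        return self._losses[-1] if self._losses else None

running_loss = RunningLossSketch()
current_loss = running_loss.last().item()  # AttributeError: 'NoneType' object has no attribute 'item'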

@RafailFridman RafailFridman added the question Further information is requested label May 4, 2020
@SkafteNicki
Member

Just to be sure: is it a typo that the trainer that gets initialized is called trainer while the one used with the learning rate finder is called trainer2, or are these two different trainers?

@RafailFridman
Author

@SkafteNicki yeah, sorry, I just tried different trainers and copied the wrong one.
Can you please check whether this error occurs on your side?

@SkafteNicki
Member

This is very strange, because the accumulate_grad_batches variable is overridden by the learning rate finder's own num_accumulation_steps argument while it is running. I will look into what is causing this error.

Just to be sure, do you want to accumulate gradients during the learning rate finder run, or only during the later fitting?
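
For reference, the override behaviour I mean looks roughly like this (just a sketch of the pattern, not the actual lr_find code; the names simply follow this discussion):

def run_lr_find_sketch(trainer, model, num_accumulation_steps=1):
    # remember the user's setting, override it for the range test, restore it afterwards
    saved_accumulation = trainer.accumulate_grad_batches
    trainer.accumulate_grad_batches = num_accumulation_steps
    try:
        pass  # run the short LR range test here
    finally:
        trainer.accumulate_grad_batches = saved_accumulation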

@RafailFridman
Author

I want to accumulate gradient batches during training, so I suppose I should set the accumulate_grad_batches parameter just as I do for the training phase. Am I misunderstanding something?

@SkafteNicki
Member

No, nothing wrong with your understanding of the code. I have found a solution to the problem and will create a PR soon.

@florisdf

I'm having the same error. Any solutions ready to be pulled in?

@jopo666

jopo666 commented May 11, 2020

For now, just use the num_accumulation_steps option of the learning rate finder itself:

trainer = pl.Trainer(gpus=1, accumulate_grad_batches=1)
lr_finder = trainer.lr_find(model, num_accumulation_steps=8)

[Edit: this workaround does not actually work; see the comment below.]

@alexstoken

@jopo666 @florisdf I don't think that will solve the problem if the goal is to accumulate gradients during the lr_find experiment. The trainer's global_step, which should only increment when the learning rate is updated, increments on every batch during the lr_find experiment, regardless of num_accumulation_steps. The counter resets itself after the finder is done running, but adding a print statement at line 434 or line 471 of training_loop.py shows that the learning rate (and the gradients) are updated on every batch.

Tested on a nightly from last week.
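
If you want to check this from user code without editing training_loop.py, something like the callback below should work (a sketch based on the current 0.7.x-style Callback API; hook names may differ between versions). Pass callbacks=[LRStepMonitor()] to the Trainer and run lr_find as usual:

import pytorch_lightning as pl

class LRStepMonitor(pl.Callback):
    # print the learning rate and global_step after every batch, to see whether
    # they change on every batch or only every accumulate_grad_batches batches
    def on_batch_end(self, trainer, pl_module):
        lr = trainer.optimizers[0].param_groups[0]["lr"]
        print(f"global_step={trainer.global_step} lr={lr:.3e}")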
