
Why I have to write .epoch_end and all the *_step #2619

Closed
FrancescoSaverioZuppichini opened this issue Jul 16, 2020 · 15 comments
Labels
feature (Is an improvement or enhancement), help wanted (Open to be worked on)

Comments

@FrancescoSaverioZuppichini

🚀 Feature

Dear all,

Thank you for this amazing library. However, some of its design choices lead to avoidable boilerplate code.

Let's take an example:

import torch
import torch.nn.functional as F
from pytorch_lightning import LightningModule
from pytorch_lightning.metrics.functional import accuracy, auroc, f1_score


class TestModule(LightningModule):
    def __init__(self, model: torch.nn.Module):
        super().__init__()
        self.model = model

    def forward(self, x):
        return self.model(x)

    def get_metrics(self, pred, y):
        return {
            'accuracy': accuracy(pred, y),
            'f1': f1_score(pred, y, num_classes=2),
            'roc_auc': auroc(pred, y)
        }

    def step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        pred = torch.argmax(y_hat, dim=1)
        metrics = self.get_metrics(pred, y)
        loss = F.cross_entropy(y_hat, y)
        
        return loss, metrics

    def training_step(self, batch, batch_idx):
        loss, metrics = self.step(batch, batch_idx)
        return {'loss': loss, 'metrics': metrics}

    def validation_step(self, batch, batch_idx):
        loss, metrics = self.step(batch, batch_idx)
        return {'val_loss': loss, 'metrics': metrics}

    def test_step(self, batch, batch_idx):
        loss, metrics = self.step(batch, batch_idx)
        return {'test_loss': loss, 'metrics': metrics}

Why do I have to write all the *_step methods?

Most of the time you want to use the same loss and return the same metrics you use in training_step.

We should have a .step method that is used as a fallback whenever one of the *_step methods is not implemented; this would remove most of the boilerplate code.
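A minimal sketch of the proposed fallback, in plain Python (the class and method names here are illustrative, not part of the Lightning API): a base class whose *_step hooks all delegate to a single user-defined step unless the subclass overrides them.

```python
# Hypothetical base class: every *_step hook falls back to `step`,
# so subclasses that share one loop only implement `step` once.
class StepFallbackModule:
    def step(self, batch, batch_idx):
        raise NotImplementedError

    def training_step(self, batch, batch_idx):
        return self.step(batch, batch_idx)

    def validation_step(self, batch, batch_idx):
        return self.step(batch, batch_idx)

    def test_step(self, batch, batch_idx):
        return self.step(batch, batch_idx)


class MyModule(StepFallbackModule):
    def step(self, batch, batch_idx):
        # shared logic: compute the loss once, reuse it in all three loops
        x, y = batch
        return {"loss": abs(x - y)}
```

A subclass can still override any individual *_step when its loop genuinely differs; the fallback only covers the common case.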

Why do I have to manually compute the average metrics for each epoch?

I was expecting PyTorch Lightning to compute the average of each metric by itself, as other libraries do. Is there a specific reason for not doing it?
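For context, this is the kind of manual reduction Lightning required at the time: each *_epoch_end hook receives the list of per-step outputs and must average them itself. Plain-Python averaging is shown for brevity; real code would typically use torch.stack(...).mean().

```python
# Manual epoch-level averaging, as required by the *_epoch_end hooks.
# `outputs` is the list of dicts returned by validation_step over the epoch.
def validation_epoch_end(outputs):
    avg_loss = sum(o["val_loss"] for o in outputs) / len(outputs)
    # return the averaged value both as a result and under 'log'
    # so the logger picks it up
    return {"val_loss": avg_loss, "log": {"val_loss": avg_loss}}
```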

Thank you,

Best Regards,

Francesco

@FrancescoSaverioZuppichini added the feature and help wanted labels Jul 16, 2020
@github-actions
Contributor

Hi! thanks for your contribution!, great first issue!

@awaelchli
Member

See PR #2615, which tries to simplify this.

@FrancescoSaverioZuppichini
Author

FrancescoSaverioZuppichini commented Jul 16, 2020

@awaelchli Thank you for your reply. IMHO the new syntax proposed in #2615 seems even weirder and does not solve the issue.

Why do you have to reinvent the wheel? Most libraries have something like:


tr = Trainer(..., batch_metrics=[Accuracy()], epoch_metrics=[Accuracy(), ROCAUC()])
...

This has three advantages:

  1. it makes sense
  2. no need to explain in the docs how metrics must be returned in order to be picked up by the logger, etc.
  3. no need to write more code than needed

Then one can also choose to manually return a metric from the *_step functions.

To save memory when computing epoch metrics, you can give each metric a default name and store the batch values only if a corresponding epoch metric exists.
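A rough sketch of how such automatic epoch averaging could work internally (purely hypothetical, not Lightning code): the trainer records each per-batch metric value and reduces them at the end of the epoch.

```python
from collections import defaultdict


class MetricAccumulator:
    """Hypothetical helper: collects per-batch metric values and
    averages them at the end of the epoch."""

    def __init__(self):
        self._values = defaultdict(list)

    def update(self, name, value):
        # called once per batch by the (hypothetical) trainer loop
        self._values[name].append(value)

    def epoch_averages(self):
        # reduce every tracked metric to its epoch mean
        return {name: sum(vs) / len(vs) for name, vs in self._values.items()}
```

A trainer built around this would only need the user to name the metrics once; the per-batch bookkeeping and the epoch reduction happen behind the scenes.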

What do you think?

Cheers,

Francesco

@awaelchli
Member

Why do you have to reinvent the wheel? Most libraries have something like:

I disagree. PyTorch Lightning is not trying to be like other libraries. Which parts do you see as "reinvented"?

  1. it makes sense

Not to me, by a large margin. But your example code perfectly highlights the design concept behind LightningModule: everything specific to training and evaluating a system goes into the LightningModule. The Trainer can be thought of as the encapsulation of all the "boilerplate" (the PyTorch training loop, etc.) and does not contain any research-specific properties (like metrics or how losses are computed).
A Trainer object can also be used for several different LightningModules, so having these metrics as arguments to the Trainer is not a good thing in my view.

  2. no need to explain in the doc how to properly return the metrics or to return them in order to be added to the logger etc

It looks like the docs will become simpler with the structured results feature, but remain backward compatible.

  3. no need to write more code than needed

With the structured results, you won't have to implement the *_end and *_epoch_end hooks unless you want to do crazy things, as I understand it.

@FrancescoSaverioZuppichini
Author

FrancescoSaverioZuppichini commented Jul 16, 2020

Thank you @awaelchli for the fast reply!

I disagree. PyTorch Lightning is not trying to be like other libraries. Which parts do you see as "reinvented"?

It reminds me of web development some years ago, when everybody claimed they were doing something different... but this is off topic.

Well, in my opinion, if I have to write the same code three times (training_step, validation_step, etc.), it means something was not well designed and can definitely be improved! I hope you agree with at least one point :) To be honest, you have to write the same code three times for the *_step methods and three times for the *_end methods if you also want to log something at the end of each epoch.

So having these metrics as argument to the Trainer is not a good thing in my view.

You can pass them to the .fit method :)

Anyway, thank you again for the fast reply. I really appreciate it!

Cheers,

Francesco

@awaelchli
Member

You can pass them to the .fit method :)

My same argument applies there; it's no different from the Trainer-args case. I want the metrics etc. in my LightningModule, not outside it. If I wanted to pass them in from the outside, I would make them a hyperparameter.

@PyTorchLightning/core-contributors any thoughts here?

@FrancescoSaverioZuppichini
Author

@awaelchli Thank you for your reply. At this point, I don't see any point in continuing the discussion. I will just copy and paste the code over and over again hoping somebody will implement a new library with better practices soon. Thank you.

Be safe!

Francesco

@awaelchli
Member

About the code repetitions you mention, there is a way to get rid of them: you can define your default validation_step and test_step methods in a separate class and then inherit from it in your various LightningModules. By the way, I like the idea of a general-purpose step method.
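The inheritance suggestion could look roughly like this (illustrative names only; compute_loss is an assumed method that each concrete module provides):

```python
# A mixin holding the shared step logic; concrete modules inherit it
# alongside LightningModule and only implement compute_loss.
class SharedStepsMixin:
    def _shared_step(self, batch, batch_idx, prefix):
        loss = self.compute_loss(batch)  # assumed to exist on the subclass
        return {f"{prefix}_loss": loss}

    def training_step(self, batch, batch_idx):
        return {"loss": self.compute_loss(batch)}

    def validation_step(self, batch, batch_idx):
        return self._shared_step(batch, batch_idx, "val")

    def test_step(self, batch, batch_idx):
        return self._shared_step(batch, batch_idx, "test")
```

Each project's LightningModule then inherits the mixin and writes the shared logic exactly once, while remaining free to override any single hook.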

@williamFalcon
Contributor

williamFalcon commented Jul 26, 2020

in the latest version on master you don’t need step_end or epoch_end.

to use the same step you can just call validation_step from training_step.

docs will be updated in a few days.

Before

(to be fair, only training_step was required)

def training_step(...):
    return loss

def training_step_end(...):
    ...

def training_epoch_end(...):
    ...

Now:

def training_step(...):
    result = TrainResult(minimize=loss)
    return result

To use the same loop for val, test and train

def step(...):
    return something

def training_step(...):
    something = self.step(...)
    result = TrainResult(minimize=something)
    return result

def validation_step(...):
    something = self.step(...)
    result = EvalResult(checkpoint_on=something)
    result.log('val_loss', something)
    return result

def test_step(...):
    result = self.validation_step(...)
    result.log('test_loss', result['val_loss'])  # re-log under the test name
    return result

Maybe for your applications you use the same loop for all three... however, that's very unusual; usually what happens in the train step is different from the other loops. And the metrics need to be named according to the loop you are in... For instance, in the train loop you log or track things on every batch. In the val/test loop you only care about full-epoch statistics.

So, I'm curious @FrancescoSaverioZuppichini. In what application are you able to use the same loop for everything?

@williamFalcon
Contributor

williamFalcon commented Jul 26, 2020

Ok, docs have been updated!

Quick start guide

https://pytorch-lightning.readthedocs.io/en/latest/new-project.html

Or the walk-through

https://pytorch-lightning.readthedocs.io/en/latest/introduction_guide.html#validating

@FrancescoSaverioZuppichini is that more in line with what you were hoping for? :)

@FrancescoSaverioZuppichini
Author

Hi @williamFalcon,

I hope you are doing great. Thank you for the reply, I am doing image classification and I want to return the same metrics from all my dataset splits.

I have found it very hard to understand what I have to return from each *_step, and I code a lot. I am sorry to say, but the new APIs do not look great; they seem (to me) to have complicated the codebase even more. What I want is to just pass a metric and have it work. I think this is something people expect to be super easy, because it is!

Let me know if I can help or you are fine with the current implementation.

Thanks :)

Francesco

@williamFalcon
Contributor

williamFalcon commented Jul 26, 2020

Could you give me an example of how it might work? if it meets the following requirements we could consider it:

  • it needs to allow the user to only log
  • it needs to allow the user to only show in progress bar
  • it can allow both or neither
  • it can be the metric for every step
  • or it can be the metric for the full epoch
  • or it can be both
  • and for more complicated use cases (Lightning is a pro tool after all and not something for beginners), i might want to do a distributed softmax and would want access to all outputs of all GPUs on the same node.
  • the thing i want to log is not just MSE loss or accuracy... but something more complicated like a GAN loss with a custom regularizer or even a perceptual loss using MSE on VGG feature maps.

i think your metrics passed to trainer suggestion is interesting but i think this has a few pitfalls:

  1. you don’t have control as a user of what actually happens in that metric.
  2. models often depend on certain metrics... removing them from the model makes it unusable.
  3. what about non-standard metrics? if your work is 99% accuracy based then ok, but that's unlikely to be true in the majority of cases.

also, what about when i want accuracy but calculated in a weird way, or for a different type of batch structure? ie: metric learning or something. then my model is no longer self-contained

thanks!

@FrancescoSaverioZuppichini
Author

FrancescoSaverioZuppichini commented Sep 13, 2020

@williamFalcon sorry for the late reply, I was super busy and this topic went under the radar, my bad. I see your point but (with all due respect) I think you are wrong. Unfortunately, I don't have the time now to lay out my counter-arguments, and I suspect the discussion would go on forever. Very quickly: you can have default behavior and allow users to implement specific methods, as you are doing now; this is classic OOP.

I am down for a quick voice chat if you would like to hear my suggestion in detail, or I will comment here in the not-so-near future. In the meantime, I will drop PyTorch Lightning since it doesn't speed up my productivity and, sorry to say, bugs are everywhere!

Thank you for the discussion ;)

Bests

@williamFalcon
Contributor

I hear you!
Unfortunately, Lightning is a professional tool, so it's not meant to abstract away details as you're suggesting.

This is why it's used to create new state of the art methods and new research.
(example: https://github.com/mindslab-ai/cotatron).

You might be looking for something else that doesn't require as much pytorch knowledge or is meant for simpler cases.

This is unfortunately not the aim of lightning.

@FrancescoSaverioZuppichini
Author

FrancescoSaverioZuppichini commented Sep 14, 2020

Hi @williamFalcon, thank you for the reply! I have a lot of programming experience and I have been using PyTorch for years. But if I have to write tons of code just to do very basic image classification, then probably something is not so well designed in your library (+ bugs). Don't get me wrong, there are lots of good features.

TensorFlow is also used to create SOTA methods, but it is a terrible tool. IMHO a library should be easy to use for easy tasks and hard for hard tasks, not hard for both. I hope no offense is taken, but because you are the creator, you are not so willing to listen :)

Thank you :)

Bests,

P.S. You can close the issue now
