This repository has been archived by the owner on Dec 16, 2022. It is now read-only.

[WIP] Add a pytorch lightning trainer #3860

Closed
wants to merge 1 commit into from

Conversation

matt-gardner
Contributor

I'm still noodling around, but in about 20 minutes I was able to get a simple test passing. This needs a lot of work to actually be usable, but the proof of concept is here, and it doesn't look like it's very much work, or very much code, that we'd have to maintain.

I only used Params here because that's what the test I was copying did. Still to do:

  • Completely remove the Params bit from the code that's added here, and just make a TrainWithLightning object that takes concrete objects as arguments (that we can construct FromParams if anyone wants to use that pipeline). This probably also requires a Registrable AllenNlpLightningModule that can be subclassed for different configurations (a rough sketch of what such a module might look like follows this comment).
  • Make sure this actually trains a real model, and serializes stuff how we want (we almost certainly want to create our typical model archive, for example, though I'm not sure what pytorch lightning does by default)
  • Add whatever other bells and whistles people might want (not really sure what's necessary here; if anyone reading this uses or wants to use pytorch lightning, let me know what features should definitely be supported)

Bottom line: this is definitely possible, and should be very easy to support. I don't know exactly what that support should look like, as I have no experience with this other trainer. If you have specific requests, comment below. I also will be out of town for a week and likely won't be working on this for a little bit, so if anyone is super eager and wants to try fleshing this out, let me know.
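
For readers who haven't seen the shim pattern being described, here is a minimal, hypothetical sketch of a LightningModule wrapping an AllenNLP Model. The class name AllenNlpLightningModule is borrowed from the list above; none of this is the code from this PR, and the optimizer choice and learning rate are placeholders.

```python
# A rough, hypothetical sketch (not the code in this PR): a LightningModule
# that wraps an AllenNLP Model so that pytorch lightning's Trainer can drive
# the training loop.  The class name is borrowed from the to-do list above;
# the optimizer and learning rate are placeholder choices.
import pytorch_lightning as pl
import torch
from allennlp.models import Model


class AllenNlpLightningModule(pl.LightningModule):
    def __init__(self, model: Model, lr: float = 1e-3):
        super().__init__()
        self.model = model
        self.lr = lr

    def training_step(self, batch, batch_idx):
        # AllenNLP models take a batch as keyword arguments and return an
        # output dict that includes a "loss" key when labels are present.
        output = self.model(**batch)
        return output["loss"]

    def configure_optimizers(self):
        return torch.optim.Adam(self.model.parameters(), lr=self.lr)
```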

@matt-gardner
Contributor Author

Also, if you're excited about this possibility, please put an emoji reaction on the PR description, so we know how much to prioritize this work.

@bryant1410
Contributor

Also, if you're excited about this possibility, please put an emoji reaction on the PR description, so we know how much to prioritize this work.

Now that you mention this, there's FeatHub, which lets people propose and vote for features. See this for an example.

@DeNeutoy
Contributor

@matt-gardner how does this interact with the integration of some callbacks into the original trainer? Should I wait on that, given this PR? We should have either this or the current trainer with some additional callbacks; we should not retain two trainers. I would be in favour of this pytorch lightning approach, as I think it will make it easier for (and encourage) people to use allennlp as a library.

@matt-gardner
Contributor Author

I think we should have both for now, until we have some evidence on what people prefer and what actually works better. This is not a Trainer subclass, it's an entirely separate way of training. We just have two different ways of doing things.

@matt-gardner
Contributor Author

And before you say, "two ways of doing things is confusing", I don't think so in this case, because we're not actually implementing a trainer here. We're implementing a shim that connects allennlp model and data code with the pytorch lightning trainer, just as our TrainModel shim connects allennlp model and data code with our trainer. Having these two options, and clearly explaining them, shows people that you can use whatever pieces of allennlp you want with whatever other libraries you want to use. We have a trainer that works, and has some allennlp-specific customizations. If you don't care about those, and want a different trainer, that's fine; here's how to do it.
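
To make the "two options" concrete, here is a hedged usage sketch of the lightning side, assuming the AllenNlpLightningModule sketch earlier in this thread, plus a my_model and train_loader built with the usual allennlp model and data code (both names are placeholders, not anything defined in this PR):

```python
import pytorch_lightning as pl

# Placeholder objects: `my_model` is an allennlp Model and `train_loader` is a
# torch DataLoader over batches it understands; both are assumed, not from this PR.
lightning_module = AllenNlpLightningModule(my_model)  # shim sketched above
trainer = pl.Trainer(max_epochs=10)  # pytorch lightning drives the loop
trainer.fit(lightning_module, train_loader)
```

Checkpointing and the usual allennlp model-archive creation (the second to-do item above) would still need to be wired in separately, e.g. through a lightning callback.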

@matt-gardner
Contributor Author

Another argument: we should definitely not make our library entirely dependent on another trainer, because then, if that other trainer decides to change things in a way that we're not happy with, we're in a pretty bad position and our library is totally unusable. We should have a built-in trainer.

What this work is trying to do is demonstrate that our data and model code are also compatible with other trainers. I'm not really sure why this is a bad thing.

@bryant1410
Contributor

What about doing it in another repo, that has both dependencies?

@williamFalcon

Cool that you guys are adding this! Happy to help if needed.
Feel free to join the lightning slack and ping any of us there.

https://join.slack.com/t/pytorch-lightning/shared_invite/enQtODU5ODIyNTUzODQwLTFkMDg5Mzc1MDBmNjEzMDgxOTVmYTdhYjA1MDdmODUyOTg2OGQ1ZWZkYTQzODhhNzdhZDA3YmNhMDhlMDY4YzQ

@mikerossgithub
Contributor

@matt-gardner I plan to take a shot at the details in the next few days. Should I submit as a separate PR or make a pull-request into this pull-request?

@DeNeutoy
Contributor

DeNeutoy commented Mar 4, 2020

@mikerossgithub Matt is traveling at the moment, so I'd suggest a separate PR - we can merge them or close this one later. Thanks for contributing!

@mikerossgithub
Contributor

I checked in a version to the pytorch-lightning library. I'm not really sure which project the code belongs in, as it needs to remain compatible with both allennlp and pytorch-lightning.

@matt-gardner
Contributor Author

Thanks @mikerossgithub, that's great! For the next week, we're pretty focused on EMNLP submissions and our 1.0 release, but I should be able to look at what you have and what else is left in a week or two. I'm fine with this code living in either place; @williamFalcon, any opinion on which repo should have it?

@matt-gardner
Contributor Author

Within the next week, I'm going to make a github template repository that gives an easy starting place for someone who wants to use pytorch lightning with allennlp data and model code. It'll be similar to this, but with the fixes I mentioned above. A template project seems like the right way to provide this kind of integration.

Given this planned work, I'm closing this PR.

@williamFalcon

williamFalcon commented Jun 18, 2020

@mikerossgithub (@matt-gardner) It depends what you are looking for:

  1. If it's just a demo, then we can add it to the lightning demos (which will be tested in our CI, including on GPUs).
  2. If it's more functional and meant to intermingle with other models, then bolts is the best place.
    https://github.com/PyTorchLightning/pytorch-lightning-bolts.
    Bolts is meant to be a collection of contributions built in lightning (i.e. allenAI can have its own path in bolts with your models, etc., including attribution and author information).

In terms of using lightning as a backbone, we're heavily used at many labs and companies. As such, we work closely with them to make sure they have what they need. If there's something specific you might want, just ping me on our Slack.

The downside of having a parallel trainer is that we are moving very fast and adding new features/testing/support, which you'll likely miss out on. For instance, we now support TPU training and are currently working on doing TPU testing on CI as well. Not to mention all the integrations and other partners that we have, in addition to the almost 200 contributors + 8 core members + staff members on the lightning team working around the clock to add new features and make the library more robust :)

@tkornuta-nvidia

It'll be similar to this, but with the fixes I mentioned above. A template project seems like the right way to provide this kind of integration.

Hi @matt-gardner any updates on this project?

@williamFalcon

Matt and I spoke a few weeks back. We’re allocating some resources to build the templates for the team. cc @nateraw
