
Update/wrap to cuda-convnet2 #1044

Open
nouiz opened this issue Jul 25, 2014 · 16 comments

@nouiz (Member) commented Jul 25, 2014

Time to upgrade pylearn2's wrapper of cuda-convnet:
https://code.google.com/p/cuda-convnet2/
https://plus.google.com/u/0/+AlexKrizhevsky/posts/GeGh4j7kDcR

We need to check the license; it is Apache. We will also probably need to select between the old and new versions depending on the user's GPU, as the new one doesn't handle this itself. Or at least, test that the new one works and isn't slower on an older GTX 580.

@nouiz (Member, Author) commented Jul 25, 2014

Other info by @memimo on pylearn-dev:

I'm not sure what you mean by requesting temp memory. But yes, it still uses the B01C order.

I think this is still the best option for pylearn2, for the following reasons:

- Its interface hasn't changed much, so we can update our wrapper with the least amount of effort (see the sketch below).
- It's not just conv2D that we care about: cuda-convnet also has code for optimized pooling.
- It's now optimized for the Titan Black and K20 that we use at LISA.
- It supports multi-GPU architectures.
- The only other library that meets our needs would be Caffe, and apparently its license is not that flexible.
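
For reference, here is a minimal sketch of how the existing pylearn2 cuda-convnet wrapper is used today; if cuda-convnet2's interface really is that close, an updated wrapper should be callable in much the same way. The stride/pad/partial_sum values are only illustrative.

```python
# Sketch of current pylearn2 cuda-convnet usage (illustrative values only).
from theano.sandbox.cuda.basic_ops import gpu_contiguous
from pylearn2.sandbox.cuda_convnet.filter_acts import FilterActs

def conv_with_cuda_convnet(images_bc01, filters_bc01):
    # The cuda-convnet kernels want channels first and batch last, so
    # dimshuffle from Theano's usual (batch, channel, row, col) layout.
    images = gpu_contiguous(images_bc01.dimshuffle(1, 2, 3, 0))
    filters = gpu_contiguous(filters_bc01.dimshuffle(1, 2, 3, 0))
    conv_op = FilterActs(stride=1, partial_sum=1, pad=0)
    out = conv_op(images, filters)
    # Shuffle back to the conventional Theano layout.
    return out.dimshuffle(3, 0, 1, 2)
```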

@benanne commented Jul 25, 2014

This is probably worth keeping an eye on: https://github.com/soumith/convnet-benchmarks
Not too many results there yet, but it will be nice to have some raw performance numbers. I hope raw performance will also factor into the decision on which implementation(s) to wrap :)

@benanne commented Jul 28, 2014

Soumith just posted some results for a single convolutional layer (see the README in his repo). It looks like this is definitely going to be worth the effort :)

@dwf (Contributor) commented Jul 28, 2014

If it's Apache-licensed then we cannot include it in pylearn2 directly. The Apache license is incompatible with BSD, and we need to stay BSD for a variety of reasons. We'll have to refactor around allowing different convolution "plugins".

I think the Theano ops are the right layer at which to do this.
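
To make the "plugins" idea concrete, here is a purely hypothetical sketch of selecting a convolution backend at that layer; the cuda_convnet2_wrappers package and its filter_acts function are invented names for illustration, not existing APIs.

```python
# Hypothetical "convolution plugin" selection; the BSD implementation ships
# with Theano, while the Apache-licensed backend would live in a separate,
# optional package (names below are made up).
def get_conv2d(backend='theano'):
    """Return a callable conv(images, filters) for the requested backend."""
    if backend == 'theano':
        from theano.tensor.nnet import conv2d  # BSD-licensed, always available
        return conv2d
    if backend == 'cuda_convnet2':
        # Only importable if the user chose to install the separate package.
        from cuda_convnet2_wrappers import filter_acts
        return filter_acts
    raise ValueError("unknown convolution backend: %r" % backend)
```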


@madisonmay (Contributor) commented:

This Stack Exchange post suggests that you should be fine including Apache-licensed code in a BSD-licensed project, provided that you also include the Apache license file released with the cuda-convnet2 module: http://programmers.stackexchange.com/questions/40561/is-bsd-license-compatible-with-apache. Here's the relevant bit of the Apache license: http://www.apache.org/licenses/LICENSE-2.0.html#redistribution. I don't know whether the reasons that pylearn2 must stay BSD prohibit that arrangement, though.

@bhack commented Aug 8, 2014

Take a look here: http://www.apache.org/legal/resolved.html#category-a

@nouiz (Member, Author) commented Aug 8, 2014

Hi,

The Apache license puts many more restrictions on users than BSD does. For example, if you use the Apache-licensed code, you agree not to sue the author of that code over copyright. Mixing parts of pylearn2 or Theano code with code under such restrictions puts users in a strange area: if they have that code installed but have manually disabled its usage, what happens? I'm not a lawyer, but I have seen many companies with questions about the restrictions the BSD license puts on them. If we start mixing licenses, it will be even harder for them to know what they are agreeing to use without running into problems.

I think we could look into making a separate repo, but add something to setup.py so that if it isn't installed, the user is prompted to decide whether to install it. That way they make a clear decision about what they want to do, as in the sketch below.
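
A tiny, hypothetical sketch of that setup: pylearn2 would only attempt to import the separately-distributed wrapper and, if it is missing, tell the user where to get it (the package name below is a placeholder, not a real distribution).

```python
# Hypothetical optional-dependency check for the separately-licensed wrapper.
try:
    import cuda_convnet2_wrappers  # Apache-licensed, lives in its own repo
except ImportError:
    cuda_convnet2_wrappers = None
    print("cuda-convnet2 wrappers not installed; they are distributed "
          "separately under the Apache license. Install them from the "
          "separate repo if you agree to that license.")
```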

Anyway, in any case, we first need someone to write the wrapper. We can always change later where it is included, or move it to a new repo. But I don't think we will include it directly in pylearn2. At least, not before much more verification, which I won't do anytime soon.

Fred

@madisonmay (Contributor) commented:

Is there anyone actively working on this port? I'd be very interested in moving this issue forward technically, even if there are licensing constraints we'd have to consider later on when integrating with pylearn2. The multi-GPU support on offer would be an excellent value-add for pylearn2.

This year's ILSVRC competition featured VGG's convnet, trained for roughly 4-6 weeks on 4 GPUs. On a single GPU that kind of computation would be infeasible, and it would be great to have pylearn2 help facilitate research at that scale.

I understand that @goodfeli and @dwf were responsible for the original wrapper around cuda-convnet for pylearn2, and I would be curious to hear your estimates for a port of Krizhevsky's cuda-convnet2 library. A cursory comparison of cuda-convnet2 suggests that the high-level interface to the library has stayed very similar, so I would anticipate a port being feasible with a few weekends' worth of dedicated work. I'd also appreciate a quick assessment of whether Krizhevsky's hybrid data/model parallelism method (http://arxiv.org/pdf/1404.5997v2.pdf) would play well with Theano -- if not, pure data parallelism (a toy sketch follows below) might provide most of the benefit with a smaller amount of effort.

Even if multi-GPU support requires a longer-term porting effort, the improved training times on Kepler GPUs would still be a nice value-add.
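
For what it's worth, plain data parallelism is conceptually simple; here is a toy, single-process numpy sketch (grad_fn stands for any function returning the gradient of the loss with respect to params; a real multi-GPU setup would evaluate the shards concurrently on separate devices rather than in a loop):

```python
import numpy as np

def data_parallel_step(params, batch_x, batch_y, grad_fn, lr=0.01, n_gpus=4):
    """Toy illustration of data parallelism: split the minibatch into one
    shard per GPU, compute the gradient of each shard with the same
    parameters, average the shard gradients, and apply one shared update."""
    xs = np.array_split(batch_x, n_gpus)
    ys = np.array_split(batch_y, n_gpus)
    shard_grads = [grad_fn(params, x, y) for x, y in zip(xs, ys)]
    mean_grad = sum(shard_grads) / float(n_gpus)
    return params - lr * mean_grad
```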

@goodfeli (Contributor) commented Sep 8, 2014

As long as the interface is indeed similar, you're right that it should only be a few weekends of work to copy-paste our wrapper and make it work with the new library.

theano-dev would be a better place to ask about this stuff, especially multi-GPU support.

The cuda-convnet wrappers really should not be in pylearn2; I just put them there because theano_linear was there. In my opinion all of pylearn2.linear should be in Theano, but I think that's still controversial. I think it's less controversial that the ops underlying it should be in Theano.

Side comment: there's no 'c' in Krizhevsky.

@nouiz (Member, Author) commented Sep 8, 2014

cuda-convnet2 can't be put in Theano or pylearn2 due to its license. It should be in a separate repo.

Moving the cuda-convnet wrappers to Theano makes sense, but they will probably become useless (NVIDIA just released its own library with convolution routines), so I don't think it is worth the time to move them. Just leave them there for history. Given the NVIDIA release, we will probably finish wrapping the convolution code in Theano this week, so I don't think it would be wise to spend time on cuda-convnet2 unless we see a clear reason. For now, I don't see one. I'm pretty sure multi-GPU can be done with the NVIDIA lib. Maybe we need to do it manually, but I think it can be done. @abergeron, do you have the same impression?

For multi-GPU, we should talk about that on theano-dev. We have a short-term plan to finish that in Theano, but the "short" term always seems to take longer. A "very short" term would mean making just a convolution op multi-GPU, and not everything. If you are interested in helping or continuing this discussion, start a new thread on theano-dev.

Fred

@benanne commented Sep 8, 2014

Has cuDNN been compared against cuda-convnet2? I found it odd that the blog post about cuDNN made no mention of it. Soumith's benchmarks seem to indicate that cuda-convnet2 beats the Caffe GEMM approach for a few configurations (https://github.com/soumith/convnet-benchmarks). Since cuDNN is supposedly only 1.2x-1.3x faster than Caffe, it might still be beneficial to use cuda-convnet2 for certain configurations.

It might not be worth the effort, though... perhaps it would be a good idea to hold off on that decision until cuDNN support is implemented, so it can be included in the benchmarks. If cuda-convnet2 still turns out to have an edge for some input configurations, a more informed decision can be made.
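
If anyone wants to sanity-check numbers locally in the meantime, a rough timing sketch in Theano could look like the following (this is not Soumith's benchmark code; the shapes are arbitrary and careful GPU benchmarking would also need explicit synchronization):

```python
# Rough sketch for timing one forward convolution in Theano; pass in the
# implementation to compare, e.g. conv_fn=theano.tensor.nnet.conv2d.
import time
import numpy as np
import theano
import theano.tensor as T

def time_forward(conv_fn, image_shape=(128, 3, 128, 128),
                 filter_shape=(96, 3, 11, 11), n_runs=10):
    x, w = T.tensor4('x'), T.tensor4('w')
    f = theano.function([x, w], conv_fn(x, w))
    img = np.random.randn(*image_shape).astype(theano.config.floatX)
    filt = np.random.randn(*filter_shape).astype(theano.config.floatX)
    f(img, filt)  # warm-up call so compilation is not timed
    start = time.time()
    for _ in range(n_runs):
        f(img, filt)
    return (time.time() - start) / n_runs  # mean seconds per forward pass
```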

@nouiz (Member, Author) commented Sep 8, 2014

My guess is that cuDNN will keep getting updated until it always beats cuda-convnet2. But the future is not certain! So if someone wants to work on that, go ahead; I won't discourage anyone from doing it. I just don't want people to have the wrong expectations.

I agree, it would be good to have it in the benchmarks so we know the current speed.

@madisonmay (Contributor) commented:

@goodfeli, thanks for the analysis. It seems like the general consensus is that any sort of integration should be addressed at the Theano level rather than the pylearn2 level, so I will gladly move that discussion to the theano-dev mailing list. And thanks for the correction with regard to Krizhevsky.

@nouiz, it looks like Caffe's integration of cuDNN (https://github.com/BVLC/caffe/pull/1046/files) required many thousands of lines of code, so I'm not sure how short-term that project will be. I'd like to stay up to date on its progress, though. I was unable to find an open issue / PR about multi-GPU support on the Theano GitHub page -- if one does exist, do you think you could drop in a link to it?

I'm of the opinion that it would still be worth pursuing the cuda-convnet2 integration in parallel since, as @benanne mentions, it's unlikely that the difference in performance between the two will be too substantial.

@nouiz (Member, Author) commented Sep 8, 2014

There is no ticket about cuDNN. I just created one:

Theano/Theano#2094

Last Friday, @abergeron finished the first version of our wrapper for their convolution code. This is what could give the biggest speedup. I think we can have that merged into Theano this week.
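
Once that is merged, usage from user code is expected to look roughly like this (the module path and argument names are my reading of the in-progress wrapper and may still change before the merge):

```python
# Expected cuDNN-backed convolution call in Theano (subject to change).
import theano
import theano.tensor as T
from theano.sandbox.cuda.dnn import dnn_conv

images = T.tensor4('images')    # (batch, channels, rows, cols)
filters = T.tensor4('filters')  # (n_filters, channels, rows, cols)
out = dnn_conv(images, filters, border_mode='valid', subsample=(1, 1))
f = theano.function([images, filters], out)
```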

@madisonmay (Contributor) commented:

Yeah, the estimate in Krizhevsky's paper was that ~90% of the speedup from multi-gpu support could be achieved by supporting data parallelism in the conv layers. Thanks for creating that ticket.

@benanne commented Dec 29, 2014

Bump :) Is this still being considered? Soumith's latest benchmarks (https://github.com/soumith/convnet-benchmarks) show that cuda-convnet2 is pretty competitive for some configurations, even compared to cuDNN R2.

I am still using the cuda-convnet wrappers a lot, because even on the GTX 980 I can still get substantial speedups from them compared to all the other convolution implementations now available in Theano. So I imagine cuda-convnet2 would probably be even faster for my use cases.

I'm willing to help with this if I can be of any use, but someone else would need to take the lead, as I'm not comfortable at all with C/C++.
