
Theano fft experimental version #5

Merged: 3 commits, Jul 29, 2014
Conversation

@nouiz (Contributor) commented Jul 29, 2014

This adds a benchmark of Theano's experimental FFT version. I also tried to make it clearer that this is work in progress and which conclusions can't be inferred.

@nouiz (Contributor, Author) commented Jul 29, 2014

You probably want to rewrite my modifications to the README, as my English needs an upgrade :)

soumith added a commit that referenced this pull request on Jul 29, 2014: "Theano fft experimental version"
soumith merged commit 7044171 into soumith:master on Jul 29, 2014
@soumith (Owner) commented Jul 29, 2014

thanks Frederic! I will :)

@soumith (Owner) commented Jul 29, 2014

@nouiz do I have to reinstall Theano, or is there a custom module that I have to install? Right now it gives me this:

> THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python pylearn2_benchmark.py
Using gpu device 0: GeForce GTX TITAN Black

CONFIG: input = 3 x 128 x 128 * ker = 3 x 96 x 11 x 11 ( bs = 128 , stride = 1 )
Input shape: (128, 128)
Detector space: (118, 118)
Output space: (118, 118)
pylearn2.models.mlp.ConvElemwise: 296.912916004 GFLOP/s ( tm = 0.418362498283 )
Traceback (most recent call last):
  File "pylearn2_benchmark.py", line 103, in <module>
    on_unused_input='ignore', mode=mode)
  File "/home/fatbox/code/convnet-benchmarks/theano/Theano/theano/compile/function.py", line 223, in function
    profile=profile)
  File "/home/fatbox/code/convnet-benchmarks/theano/Theano/theano/compile/pfunc.py", line 511, in pfunc
    on_unused_input=on_unused_input)
  File "/home/fatbox/code/convnet-benchmarks/theano/Theano/theano/compile/function_module.py", line 1332, in orig_function
    defaults)
  File "/home/fatbox/code/convnet-benchmarks/theano/Theano/theano/compile/function_module.py", line 1198, in create
    _fn, _i, _o = self.linker.make_thunk(input_storage=input_storage_lists)
  File "/home/fatbox/code/convnet-benchmarks/theano/Theano/theano/gof/link.py", line 489, in make_thunk
    output_storage=output_storage)[:3]
  File "/home/fatbox/code/convnet-benchmarks/theano/Theano/theano/gof/vm.py", line 882, in make_all
    no_recycling))
  File "/home/fatbox/code/convnet-benchmarks/theano/Theano/theano/sandbox/cuda/fftconv.py", line 58, in make_thunk
    from theano.misc.pycuda_utils import to_gpuarray
  File "/home/fatbox/code/convnet-benchmarks/theano/Theano/theano/misc/pycuda_utils.py", line 2, in <module>
    import pycuda.gpuarray
ImportError: ('The following error happened while compiling the node', CuFFTOp(GpuContiguous.0), '\n', 'No module named pycuda.gpuarray')

@benanne commented Jul 29, 2014

The current FFT-based implementation in Theano depends on PyCUDA and (unless they've modified it in the meantime) scikits.cuda, so you will need those two packages to be able to run this.
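For reference, a minimal import check (just a sketch; the module names come from the traceback above and from fftconv.py):

import pycuda.gpuarray           # the import that fails in the traceback above
from scikits.cuda import cublas  # used by fftconv.py for the batched GEMM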

@soumith (Owner) commented Jul 29, 2014

@benanne, could you help me modify the README.md in the Theano section to add the additional setup instructions?

@benanne commented Jul 29, 2014

What kind of instructions do you mean? Just the added dependencies?

@soumith (Owner) commented Jul 29, 2014

Right now these are the instructions:
Install Theano:

git clone git://github.com/Theano/Theano.git
cd Theano
sudo python setup.py develop

Install pylearn2:

git clone git://github.com/lisa-lab/pylearn2.git
cd pylearn2
sudo python setup.py develop

Launch the script:

THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python pylearn2_benchmark.py 

@soumith (Owner) commented Jul 29, 2014

I'm assuming I have to add install instructions for pycuda and scikit-learn, correct?

@benanne commented Jul 29, 2014

I see. That should suffice for the legacy kernels and the wrapped cuda-convnet code. For the FFT-based implementation you will indeed need pycuda / scikits.cuda (not scikit-learn) as dependencies.

@soumith (Owner) commented Jul 29, 2014

Cool, made some more progress, but it still errors out. Is there a specific version of pycuda/scikits.cuda that I require?

> THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python pylearn2_benchmark.py
Using gpu device 0: GeForce GTX TITAN Black

CONFIG: input = 3 x 128 x 128 * ker = 3 x 96 x 11 x 11 ( bs = 128 , stride = 1 )
Input shape: (128, 128)
Detector space: (118, 118)
Output space: (118, 118)
pylearn2.models.mlp.ConvElemwise: 291.007529832 GFLOP/s ( tm = 0.426852285862 )
Traceback (most recent call last):
  File "pylearn2_benchmark.py", line 108, in <module>
    fprop()
  File "/home/fatbox/code/convnet-benchmarks/theano/Theano/theano/compile/function_module.py", line 589, in __call__
    self.fn.thunks[self.fn.position_of_error])
  File "/home/fatbox/code/convnet-benchmarks/theano/Theano/theano/compile/function_module.py", line 579, in __call__
    outputs = self.fn()
  File "/home/fatbox/code/convnet-benchmarks/theano/Theano/theano/sandbox/cuda/fftconv.py", line 328, in thunk
    output_b_pycuda)
  File "/home/fatbox/code/convnet-benchmarks/theano/Theano/theano/sandbox/cuda/fftconv.py", line 275, in sc_complex_dot_batched
    cublas.cublasCgemmBatched(handle, transb, transa, m, n, k, alpha,
AttributeError: 'module' object has no attribute 'cublasCgemmBatched'
Apply node that caused the error: BatchedComplexDotOp(GpuContiguous.0, GpuContiguous.0)
Inputs types: [CudaNdarrayType(float32, 4D), CudaNdarrayType(float32, 4D)]
Inputs shapes: [(8320, 128, 3, 2), (8320, 3, 96, 2)]
Inputs strides: [(768, 6, 2, 1), (576, 192, 2, 1)]
Inputs scalar values: ['not scalar', 'not scalar']

HINT: Re-running with most Theano optimization disabled could give you a back-traces when this node was created. This can be done with by setting the Theano flags optimizer=fast_compile
HINT: Use the Theano flag 'exception_verbosity=high' for a debugprint of this apply node.
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
cuMemFree failed: invalid value
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
cuMemFree failed: invalid value
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
cuMemFree failed: invalid value
-------------------------------------------------------------------
PyCUDA ERROR: The context stack was not empty upon module cleanup.
-------------------------------------------------------------------
A context was still active when the context stack was being
cleaned up. At this point in our execution, CUDA may already
have been deinitialized, so there is no way we can finish
cleanly. The program will be aborted now.
Use Context.pop() to avoid this problem.
-------------------------------------------------------------------
Aborted (core dumped)

@soumith (Owner) commented Jul 29, 2014

I installed the latest versions listed on PyPI:

https://pypi.python.org/pypi/pycuda
https://pypi.python.org/pypi/scikits.cuda

@benanne commented Jul 29, 2014

Yeah, that scikits.cuda version is too old; you'll need 0.5.0 at least. The cublasCgemmBatched wrapper was something I added when I worked on this. Sorry, I should have mentioned this earlier.
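A quick way to verify an install is new enough (a sketch; the attribute name is the one from the traceback above):

from scikits.cuda import cublas

# cublasCgemmBatched is the wrapper the FFT convolution calls; per the
# comment above, it only exists in scikits.cuda 0.5.0 and later.
assert hasattr(cublas, 'cublasCgemmBatched'), 'scikits.cuda too old, need >= 0.5.0'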

@soumith (Owner) commented Jul 29, 2014

ok great, works now! thanks.

@soumith (Owner) commented Jul 29, 2014

Just to confirm (so that this place isn't another war zone): the FFT version looks to be about 2x faster than ConvElemwise for the first case.

Does that sound about right?
pylearn2.models.mlp.ConvElemwise: 299.74149507 GFLOP/s ( tm = 0.414414525032 )
(fft experimental) pylearn2.models.mlp.ConvElemwise: 595.44122732 GFLOP/s ( tm = 0.208613753319 )
pylearn2.sandbox.cuda_convnet: 1354.6420678 GFLOP/s ( tm = 0.0916974544525 )

@benanne commented Jul 29, 2014

Could be, I haven't tested the current implementation myself. It'll also depend on the input size a lot. I believe you're using 3 input feature maps at the moment - in my experience, the FFT-based version will be mostly beneficial when there are a lot of input feature maps, because this becomes the inner dimension of a batched dot product in the Fourier domain.

Note that it will also have some overhead on the first run, because the FFT plan has to be created. Subsequent runs should be faster.
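To make the shape argument concrete: the traceback above shows BatchedComplexDotOp inputs of shapes (8320, 128, 3, 2) and (8320, 3, 96, 2), i.e. 8320 frequency bins, batch size 128, 3 input maps, 96 output maps, with the trailing 2 holding the real/imaginary parts. A NumPy sketch of the per-frequency dot product (hypothetical variable names):

import numpy as np

n_freq, batch, n_in, n_out = 8320, 128, 3, 96
x_f = (np.random.randn(n_freq, batch, n_in)
       + 1j * np.random.randn(n_freq, batch, n_in)).astype(np.complex64)
w_f = (np.random.randn(n_freq, n_in, n_out)
       + 1j * np.random.randn(n_freq, n_in, n_out)).astype(np.complex64)

# One (batch x n_in) @ (n_in x n_out) product per frequency bin: n_in is the
# reduction dimension, so with only 3 input maps each product does very little
# arithmetic, which is why more input feature maps favor the FFT approach.
y_f = np.einsum('fbi,fio->fbo', x_f, w_f)   # shape (n_freq, batch, n_out)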

@soumith (Owner) commented Jul 29, 2014

Indeed, it is very fast for the later layers. Full log here:
https://github.com/soumith/convnet-benchmarks/blob/master/theano/output.log

@benanne commented Jul 29, 2014

Very cool :) I should mention though that the Gflop/s metric doesn't really make sense for the FFT implementation: it's not actually performing that many floating point operations; the FFT approach just needs fewer. 7 Tflop/s is actually more than the maximum the Titan is capable of (about 4.5 Tflop/s).

I suppose the same goes for the Toeplitz-matrix approach that Caffe uses; it will also need a different number of flops for a given convolution.

That said, it's still useful to see how many Gflop/s it is equivalent to compared to a naive implementation.
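For what it's worth, the equivalent figure appears to be the flop count of a direct convolution divided by the measured time (a sketch; this formula reproduces the ~300 Gflop/s reported for the legacy path above):

def direct_conv_gflops(batch, n_in, n_out, out_h, out_w, k_h, k_w, seconds):
    # 2 flops (one multiply, one add) per kernel tap per output element
    flops = 2.0 * batch * n_out * out_h * out_w * n_in * k_h * k_w
    return flops / seconds / 1e9

# First config: batch 128, 3 -> 96 maps, 118x118 output, 11x11 kernel,
# tm = 0.4144 s for the legacy path:
print(direct_conv_gflops(128, 3, 96, 118, 118, 11, 11, 0.4144))  # ~299.8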

@soumith (Owner) commented Jul 29, 2014

I am changing the metrics as we speak :)

@soumith (Owner) commented Jul 29, 2014

One last question before I add this entry into the table:

  • Where is the code located for this (experimental FFT) module?
  • What is its status wrt usability (does it have :backward() code as well, is it unit tested, etc.)?

@benanne commented Jul 29, 2014

You can find the code here: https://github.com/Theano/Theano/blob/master/theano/sandbox/cuda/fftconv.py

Regarding usability: afaik there are tests; I don't know if anyone has tried using it 'in production' though. The main problem with it is that it uses a lot of memory, so it isn't applicable to every use case. No free lunch! :)

By :backward() I assume you mean the gradient. The way this is implemented is as an optimization that replaces Theano's own ConvOp with the FFT-based one. Because this only happens in the optimization phase, the gradient has already been calculated at that point, so the convolutions that are part of the gradient are also replaced by their FFT versions automatically. In short, it does not have its own gradient implementation, but because of the way Theano works, this is not necessary. The implementation of Theano's own ConvOp is reused.
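A sketch of what that looks like from the user's side (the optimizer flag names are the ones documented for Theano's sandbox FFT convolution at the time; treat them as an assumption to verify):

import theano
import theano.tensor as T
from theano.tensor.nnet import conv

x = T.tensor4('x')
w = T.tensor4('w')
y = conv.conv2d(x, w)             # a plain ConvOp in the graph
gx, gw = T.grad(y.sum(), [x, w])  # the gradient convolutions are ConvOps too

# Compiling with the FFT optimizations enabled, e.g.
#   THEANO_FLAGS=...,optimizer_including=conv_fft_valid:conv_fft_full
# swaps the forward and gradient convolutions for their FFT counterparts.
f = theano.function([x, w], [y, gx, gw])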

@soumith (Owner) commented Jul 29, 2014

Awesome, I am going to add this in for now with the information you provided. Thank you.

I have finished up the benchmark code for all the other libraries (except ccv) for backpropagation as well (i.e. calculating gradients wrt the image and wrt the parameters). Would anyone from LISA Lab modify this test to add the backpropagation timings as well?

@soumith (Owner) commented Jul 29, 2014

Added! And made the table report numbers for 5 configurations. This module is currently the FASTEST!

@benanne commented Jul 29, 2014

Cool! I have a feeling that might change when you get to the backward pass ;)

@liuliu (Contributor) commented Jul 29, 2014

I could help with making ccv work with your configuration. For larger kernel windows, FFT is the best, although AlexNet and MattNet have small kernels. Probably small kernels work better just because we have fewer samples, though.

@soumith (Owner) commented Jul 29, 2014

Thanks, I will add the :backward numbers for all the modules this weekend (I've already finished it for ccn2, caffe and torch). I hope someone can change the Theano benchmark to incorporate the backward pass numbers as well.

@soumith (Owner) commented Jul 29, 2014

@liuliu that would be great! The way the ccv benchmark works right now is a little hacky, and I didn't find the time to modify cwc-bench for each of the configurations. Thanks!

@liuliu (Contributor) commented Jul 29, 2014

@soumith, yeah, looking closer at your layer configuration, it is a bit hard to get ccv's numbers because each layer's output doesn't match the next layer's input. Due to all the assertion checks for layer output/input consistency, you would probably have to go all the way down to calling the actual CUDA kernels to get some numbers out.

@liuliu (Contributor) commented Jul 29, 2014

Also, you probably want to run with some real data (such as images) rather than freshly allocated, uninitialized memory. I noticed that for all-zero regions, my TITAN card sometimes cheats and has a shorter running time.
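One easy way to do that (a sketch with hypothetical names): fill the benchmark buffers with random values instead of leaving freshly allocated memory uninitialized, e.g.

import numpy as np

rng = np.random.RandomState(1234)
# First config: batch of 128, 3 feature maps of 128x128.
inputs = rng.uniform(-1.0, 1.0, size=(128, 3, 128, 128)).astype('float32')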

@nouiz (Contributor, Author) commented Jul 30, 2014

I just saw the updated table results. I think you should keep the FFT entry marked as experimental.

Also, maybe it would be great to add a section with the limitations of each module. For example, Theano (legacy) is the slowest, but the most versatile in the sizes/shapes accepted.

Also, what about a new column with the extra temporary memory needed for each method? Caffe needs memory for the Toeplitz matrix, and the FFT method needs extra memory too. For example, we could express the extra memory needed as a formula over the shapes, like batchsize*kernelsize.
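To illustrate the kind of formula (a rough sketch based only on the shapes visible in this thread; the exact accounting is an assumption): the FFT method keeps Fourier transforms of the inputs, the kernels, and the outputs. A real 2-D FFT of an H x W image has H * (W // 2 + 1) complex bins, which for the first config is exactly the 8320 seen in the traceback above:

batch, n_in, n_out, H, W = 128, 3, 96, 128, 128
n_freq = H * (W // 2 + 1)  # = 8320, matching the BatchedComplexDotOp shapes
# complex64 is 8 bytes; transformed inputs + kernels + outputs:
extra_bytes = 8 * n_freq * (batch * n_in + n_in * n_out + batch * n_out)
print(extra_bytes / 1024.0 ** 2, 'MiB')  # roughly 823 MiB for this layer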

