
Theano fft experimental version #5

Merged: 3 commits, Jul 29, 2014
Conversation

@nouiz (Contributor) commented Jul 29, 2014

This adds a benchmark of Theano's experimental FFT version. I also tried to make it clearer that this is work in progress and which conclusions can't be inferred.

@nouiz (Contributor, Author) commented Jul 29, 2014

You probably want to rewrite my modifications to the README, as my English needs an upgrade :)

soumith added a commit that referenced this pull request on Jul 29, 2014: "Theano fft experimental version"
soumith merged commit 7044171 into soumith:master on Jul 29, 2014
@soumith (Owner) commented Jul 29, 2014

thanks Frederic! I will :)

@soumith (Owner) commented Jul 29, 2014

@nouiz do I have to reinstall Theano, or is there a custom module that I have to install? Right now it gives me this:

> THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python pylearn2_benchmark.py
Using gpu device 0: GeForce GTX TITAN Black

CONFIG: input = 3 x 128 x 128 * ker = 3 x 96 x 11 x 11 ( bs = 128 , stride = 1 )
Input shape: (128, 128)
Detector space: (118, 118)
Output space: (118, 118)
pylearn2.models.mlp.ConvElemwise: 296.912916004 GFLOP/s ( tm = 0.418362498283 )
Traceback (most recent call last):
  File "pylearn2_benchmark.py", line 103, in <module>
    on_unused_input='ignore', mode=mode)
  File "/home/fatbox/code/convnet-benchmarks/theano/Theano/theano/compile/function.py", line 223, in function
    profile=profile)
  File "/home/fatbox/code/convnet-benchmarks/theano/Theano/theano/compile/pfunc.py", line 511, in pfunc
    on_unused_input=on_unused_input)
  File "/home/fatbox/code/convnet-benchmarks/theano/Theano/theano/compile/function_module.py", line 1332, in orig_function
    defaults)
  File "/home/fatbox/code/convnet-benchmarks/theano/Theano/theano/compile/function_module.py", line 1198, in create
    _fn, _i, _o = self.linker.make_thunk(input_storage=input_storage_lists)
  File "/home/fatbox/code/convnet-benchmarks/theano/Theano/theano/gof/link.py", line 489, in make_thunk
    output_storage=output_storage)[:3]
  File "/home/fatbox/code/convnet-benchmarks/theano/Theano/theano/gof/vm.py", line 882, in make_all
    no_recycling))
  File "/home/fatbox/code/convnet-benchmarks/theano/Theano/theano/sandbox/cuda/fftconv.py", line 58, in make_thunk
    from theano.misc.pycuda_utils import to_gpuarray
  File "/home/fatbox/code/convnet-benchmarks/theano/Theano/theano/misc/pycuda_utils.py", line 2, in <module>
    import pycuda.gpuarray
ImportError: ('The following error happened while compiling the node', CuFFTOp(GpuContiguous.0), '\n', 'No module named pycuda.gpuarray')

@benanne commented Jul 29, 2014

The current FFT-based implementation in Theano depends on PyCUDA and (unless they've modified it in the meantime) scikits.cuda, so you will need those two packages to be able to run this.
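For reference, a minimal import check (just a sketch; the module names come from the traceback above and from fftconv.py):

import pycuda.gpuarray           # the import that fails in the traceback above
from scikits.cuda import cublas  # used by fftconv.py for the batched GEMM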

@soumith (Owner) commented Jul 29, 2014

@benanne, could you help me modify the README.md in the Theano section to add the additional setup instructions?

@benanne commented Jul 29, 2014

What kind of instructions do you mean? Just the added dependencies?

@soumith (Owner) commented Jul 29, 2014

Right now these are the instructions:
Install Theano:

git clone git://github.com/Theano/Theano.git
cd Theano
sudo python setup.py develop

Install pylearn2:

git clone git://github.com/lisa-lab/pylearn2.git
cd pylearn2
sudo python setup.py develop

Launch the script:

THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python pylearn2_benchmark.py 

@soumith (Owner) commented Jul 29, 2014

I'm assuming I have to add install instructions for pycuda and scikit-learn, correct?

@benanne commented Jul 29, 2014

I see. That should suffice for the legacy kernels and the wrapped cuda-convnet code. For the FFT-based implementation you will indeed need pycuda / scikits.cuda (not scikit-learn) as dependencies.

@soumith (Owner) commented Jul 29, 2014

Cool, made some more progress, but it still errors out. Is there a specific version of pycuda/scikits.cuda that I require?

> THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python pylearn2_benchmark.py
Using gpu device 0: GeForce GTX TITAN Black

CONFIG: input = 3 x 128 x 128 * ker = 3 x 96 x 11 x 11 ( bs = 128 , stride = 1 )
Input shape: (128, 128)
Detector space: (118, 118)
Output space: (118, 118)
pylearn2.models.mlp.ConvElemwise: 291.007529832 GFLOP/s ( tm = 0.426852285862 )
Traceback (most recent call last):
  File "pylearn2_benchmark.py", line 108, in <module>
    fprop()
  File "/home/fatbox/code/convnet-benchmarks/theano/Theano/theano/compile/function_module.py", line 589, in __call__
    self.fn.thunks[self.fn.position_of_error])
  File "/home/fatbox/code/convnet-benchmarks/theano/Theano/theano/compile/function_module.py", line 579, in __call__
    outputs = self.fn()
  File "/home/fatbox/code/convnet-benchmarks/theano/Theano/theano/sandbox/cuda/fftconv.py", line 328, in thunk
    output_b_pycuda)
  File "/home/fatbox/code/convnet-benchmarks/theano/Theano/theano/sandbox/cuda/fftconv.py", line 275, in sc_complex_dot_batched
    cublas.cublasCgemmBatched(handle, transb, transa, m, n, k, alpha,
AttributeError: 'module' object has no attribute 'cublasCgemmBatched'
Apply node that caused the error: BatchedComplexDotOp(GpuContiguous.0, GpuContiguous.0)
Inputs types: [CudaNdarrayType(float32, 4D), CudaNdarrayType(float32, 4D)]
Inputs shapes: [(8320, 128, 3, 2), (8320, 3, 96, 2)]
Inputs strides: [(768, 6, 2, 1), (576, 192, 2, 1)]
Inputs scalar values: ['not scalar', 'not scalar']

HINT: Re-running with most Theano optimization disabled could give you a back-traces when this node was created. This can be done with by setting the Theano flags optimizer=fast_compile
HINT: Use the Theano flag 'exception_verbosity=high' for a debugprint of this apply node.
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
cuMemFree failed: invalid value
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
cuMemFree failed: invalid value
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
cuMemFree failed: invalid value
-------------------------------------------------------------------
PyCUDA ERROR: The context stack was not empty upon module cleanup.
-------------------------------------------------------------------
A context was still active when the context stack was being
cleaned up. At this point in our execution, CUDA may already
have been deinitialized, so there is no way we can finish
cleanly. The program will be aborted now.
Use Context.pop() to avoid this problem.
-------------------------------------------------------------------
Aborted (core dumped)

@soumith (Owner) commented Jul 29, 2014

I installed the latest versions listed on PyPI:

https://pypi.python.org/pypi/pycuda
https://pypi.python.org/pypi/scikits.cuda

@benanne commented Jul 29, 2014

Yeah, that scikits.cuda version is too old; you'll need 0.5.0 at least. The cublasCgemmBatched wrapper was something I added when I worked on this. Sorry, I should have mentioned this earlier.
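A quick way to verify an install is new enough (a sketch; the attribute name is the one from the traceback above):

from scikits.cuda import cublas

# cublasCgemmBatched is the wrapper the FFT convolution calls; per the
# comment above, it only exists in scikits.cuda 0.5.0 and later.
assert hasattr(cublas, 'cublasCgemmBatched'), 'scikits.cuda too old, need >= 0.5.0'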

@soumith (Owner) commented Jul 29, 2014

ok great, works now! thanks.

@soumith (Owner) commented Jul 29, 2014

Just to confirm (so that this place isn't another war zone): the FFT version looks to be about 2x faster than ConvElemwise for the first case.

Does that sound about right?
pylearn2.models.mlp.ConvElemwise: 299.74149507 GFLOP/s ( tm = 0.414414525032 )
(fft experimental) pylearn2.models.mlp.ConvElemwise: 595.44122732 GFLOP/s ( tm = 0.208613753319 )
pylearn2.sandbox.cuda_convnet: 1354.6420678 GFLOP/s ( tm = 0.0916974544525 )

@benanne commented Jul 29, 2014

Could be, I haven't tested the current implementation myself. It'll also depend on the input size a lot. I believe you're using 3 input feature maps at the moment - in my experience, the FFT-based version will be mostly beneficial when there are a lot of input feature maps, because this becomes the inner dimension of a batched dot product in the Fourier domain.

Note that it will also have some overhead on the first run, because the FFT plan has to be created. Subsequent runs should be faster.
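To make the shape argument concrete: the traceback above shows BatchedComplexDotOp inputs of shapes (8320, 128, 3, 2) and (8320, 3, 96, 2), i.e. 8320 frequency bins, batch size 128, 3 input maps, 96 output maps, with the trailing 2 holding the real/imaginary parts. A NumPy sketch of the per-frequency dot product (hypothetical variable names):

import numpy as np

n_freq, batch, n_in, n_out = 8320, 128, 3, 96
x_f = (np.random.randn(n_freq, batch, n_in)
       + 1j * np.random.randn(n_freq, batch, n_in)).astype(np.complex64)
w_f = (np.random.randn(n_freq, n_in, n_out)
       + 1j * np.random.randn(n_freq, n_in, n_out)).astype(np.complex64)

# One (batch x n_in) @ (n_in x n_out) product per frequency bin: n_in is the
# reduction dimension, so with only 3 input maps each product does very little
# arithmetic, which is why more input feature maps favor the FFT approach.
y_f = np.einsum('fbi,fio->fbo', x_f, w_f)   # shape (n_freq, batch, n_out)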

@soumith (Owner) commented Jul 29, 2014

Indeed, it is very fast for the later layers. Full log here:
https://github.com/soumith/convnet-benchmarks/blob/master/theano/output.log

@benanne commented Jul 29, 2014

Very cool :) I should mention though that the Gflop/s metric doesn't really make sense for the FFT implementation: it's not actually performing that many floating point operations; the FFT approach just needs fewer. 7 Tflop/s is actually more than the maximum the Titan is capable of (about 4.5 Tflop/s).

I suppose the same goes for the Toeplitz-matrix approach that Caffe uses; it will also need a different number of flops for a given convolution.

That said, it's still useful to see how many Gflop/s it is equivalent to compared to a naive implementation.
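For what it's worth, the equivalent figure appears to be the flop count of a direct convolution divided by the measured time (a sketch; this formula reproduces the ~300 Gflop/s reported for the legacy path above):

def direct_conv_gflops(batch, n_in, n_out, out_h, out_w, k_h, k_w, seconds):
    # 2 flops (one multiply, one add) per kernel tap per output element
    flops = 2.0 * batch * n_out * out_h * out_w * n_in * k_h * k_w
    return flops / seconds / 1e9

# First config: batch 128, 3 -> 96 maps, 118x118 output, 11x11 kernel,
# tm = 0.4144 s for the legacy path:
print(direct_conv_gflops(128, 3, 96, 118, 118, 11, 11, 0.4144))  # ~299.8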

@soumith (Owner) commented Jul 29, 2014

I am changing the metrics as we speak :)

@soumith (Owner) commented Jul 29, 2014

One last question before I add this entry into the table:

  • Where is the code located for this (experimental FFT) module?
  • What is its status wrt usability (does it have :backward() code as well, is it unit tested, etc.)?

@benanne commented Jul 29, 2014

You can find the code here: https://github.com/Theano/Theano/blob/master/theano/sandbox/cuda/fftconv.py

Regarding usability: afaik there are tests; I don't know if anyone has tried using it 'in production' though. The main problem with it is that it uses a lot of memory, so it isn't applicable to every use case. No free lunch! :)

By :backward() I assume you mean the gradient. The way this is implemented is as an optimization that replaces Theano's own ConvOp with the FFT-based one. Because this only happens in the optimization phase, the gradient has already been calculated at that point, so the convolutions that are part of the gradient are also replaced by their FFT versions automatically. In short, it does not have its own gradient implementation, but because of the way Theano works, this is not necessary. The implementation of Theano's own ConvOp is reused.
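A sketch of what that looks like from the user's side (the optimizer flag names are the ones documented for Theano's sandbox FFT convolution at the time; treat them as an assumption to verify):

import theano
import theano.tensor as T
from theano.tensor.nnet import conv

x = T.tensor4('x')
w = T.tensor4('w')
y = conv.conv2d(x, w)             # a plain ConvOp in the graph
gx, gw = T.grad(y.sum(), [x, w])  # the gradient convolutions are ConvOps too

# Compiling with the FFT optimizations enabled, e.g.
#   THEANO_FLAGS=...,optimizer_including=conv_fft_valid:conv_fft_full
# swaps the forward and gradient convolutions for their FFT counterparts.
f = theano.function([x, w], [y, gx, gw])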

@soumith (Owner) commented Jul 29, 2014

Awesome, I am going to add this in for now with the information you provided. Thank you.

I have finished up the benchmark code for all the other libraries (except ccv) for backpropagation as well (i.e. calculating gradients wrt the image and wrt the parameters). Would anyone from LISA Lab modify this test to add the backpropagation timings as well?

@soumith (Owner) commented Jul 29, 2014

Added! And made the table report numbers for 5 configurations. This module is currently the FASTEST!

@benanne commented Jul 29, 2014

Cool! I have a feeling that might change when you get to the backward pass ;)

@liuliu (Contributor) commented Jul 29, 2014

I could help with making ccv work with your configuration. For larger kernel windows, FFT is the best, although AlexNet and MattNet have small kernels. Probably small kernels work better just because we have fewer samples, though.

@soumith (Owner) commented Jul 29, 2014

Thanks, I will add the :backward numbers for all the modules this weekend (I've already finished it for ccn2, caffe and torch). I hope someone can change the Theano benchmark to incorporate the backward pass numbers as well.

@soumith (Owner) commented Jul 29, 2014

@liuliu that would be great! The way the ccv benchmark works right now is a little hacky, and I didn't find the time to modify cwc-bench for each of the configurations. Thanks!

@liuliu (Contributor) commented Jul 29, 2014

@soumith, yeah, looking closer at your layer configuration, it is a bit hard to get ccv's numbers because each layer's output doesn't match the next layer's input. Due to all the assertion checks for layer output/input consistency, you would probably have to go all the way down to calling the actual CUDA kernels to get some numbers out.

@liuliu (Contributor) commented Jul 29, 2014

Also, you probably want to run with some real data (such as images) rather than freshly allocated, uninitialized memory. I noticed that for all-zero regions, my TITAN card sometimes cheats and has a shorter running time.
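One easy way to do that (a sketch with hypothetical names): fill the benchmark buffers with random values instead of leaving freshly allocated memory uninitialized, e.g.

import numpy as np

rng = np.random.RandomState(1234)
# First config: batch of 128, 3 feature maps of 128x128.
inputs = rng.uniform(-1.0, 1.0, size=(128, 3, 128, 128)).astype('float32')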

@nouiz (Contributor, Author) commented Jul 30, 2014

I just saw the updated table results. I think you should keep the FFT entry marked as experimental.

Also, maybe it would be great to add a section with the limitations of each module. For example, Theano (legacy) is the slowest, but the most versatile in the sizes/shapes accepted.

Also, what about a new column with the extra temporary memory needed for each method? Caffe needs memory for the Toeplitz matrix, and the FFT method needs extra memory too. For example, we could express the extra memory needed as a formula over the shapes, like batchsize*kernelsize.
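To illustrate the kind of formula (a rough sketch based only on the shapes visible in this thread; the exact accounting is an assumption): the FFT method keeps Fourier transforms of the inputs, the kernels, and the outputs. A real 2-D FFT of an H x W image has H * (W // 2 + 1) complex bins, which for the first config is exactly the 8320 seen in the traceback above:

batch, n_in, n_out, H, W = 128, 3, 96, 128, 128
n_freq = H * (W // 2 + 1)  # = 8320, matching the BatchedComplexDotOp shapes
# complex64 is 8 bytes; transformed inputs + kernels + outputs:
extra_bytes = 8 * n_freq * (batch * n_in + n_in * n_out + batch * n_out)
print(extra_bytes / 1024.0 ** 2, 'MiB')  # roughly 823 MiB for this layer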

