
After FFT & Winograd, what next? #110

Open
ottolu opened this issue Jul 19, 2016 · 3 comments

@ottolu

ottolu commented Jul 19, 2016

Thanks @scott-gray @andravin for the awesome Winograd work. It really makes small conv kernels run super fast!
And the cuDNN team implemented it so quickly and cleanly, which makes life much easier. Good job, @jdemouth!

After these fancy ideas, I can't help wondering: what can we do to speed up training next?

Following the path of mathematics-based matrix multiplication optimizations, with Winograd we have likely reached the ceiling. Beyond Winograd, I know François Le Gall did some great work around 2014, but with no breakthrough improvement.
(Edit: my bad, thanks @andravin for the reminder.)

Another interesting thing is the lack of FP16x2 support on GP104, which we had really looked forward to; instead we got full-throughput dp4a, a very powerful int8 compute capability.
I think that in theory 8 bits is enough to carry the information with quantization. How we could make good use of dp4a in training would be interesting.
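
In case it helps ground the discussion: dp4a takes two registers holding four packed 8-bit values each, plus a 32-bit accumulator, and returns the accumulator plus the 4-way dot product. A minimal NumPy sketch of that arithmetic (just a CPU emulation for illustration, not the intrinsic itself):

```python
import numpy as np

def dp4a_sim(a4, b4, c):
    """Emulate dp4a: a4 and b4 are four signed 8-bit values, c is a 32-bit
    accumulator; the result is c plus their dot product, all in int32."""
    a4 = np.asarray(a4, dtype=np.int8).astype(np.int32)
    b4 = np.asarray(b4, dtype=np.int8).astype(np.int32)
    return np.int32(c) + np.dot(a4, b4)

# Accumulate a longer int8 dot product four elements at a time, the way an
# int8 GEMM inner loop built on dp4a would.
x = np.random.randint(-128, 128, size=64, dtype=np.int8)
w = np.random.randint(-128, 128, size=64, dtype=np.int8)
acc = np.int32(0)
for i in range(0, 64, 4):
    acc = dp4a_sim(x[i:i + 4], w[i:i + 4], acc)
assert acc == np.dot(x.astype(np.int32), w.astype(np.int32))
```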

So I wonder if @scott-gray @andravin @jdemouth @hughperkins @bhack @soumith and others could share any ideas about this.
Sorry I can't cc everyone in the community who cares about and contributes to DL performance. Any ideas are warmly welcome!

@soumith, if you think this isn't the right place to discuss this topic, please feel free to close it, and sorry for the bother. ;)

@andravin

andravin commented Jul 20, 2016

Hi @nomi-wei, just a clarification: our fast convnet algorithms use Winograd's convolution algorithms. But the same Shmuel Winograd did co-author the Coppersmith-Winograd fast matrix multiplication algorithm, so the confusion is understandable (I probably should not even mention that Winograd also devised fast DFT algorithms ;-)
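
For readers who have only seen the matrix multiplication algorithms, here is a minimal NumPy sketch of the 1-D Winograd minimal filtering algorithm F(2,3) that the convnet kernels build on, using the transform matrices from the Lavin & Gray paper; it produces two outputs of a 3-tap filter with 4 multiplications instead of 6:

```python
import numpy as np

# F(2,3) transform matrices (Lavin & Gray, "Fast Algorithms for
# Convolutional Neural Networks").
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=float)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=float)

def winograd_f23(d, g):
    """Two outputs of a 3-tap correlation: Y = AT @ [(G @ g) * (BT @ d)]."""
    return AT @ ((G @ g) * (BT @ d))

d = np.random.randn(4)  # input tile of 4 samples
g = np.random.randn(3)  # 3-tap filter
direct = np.array([d[0:3] @ g, d[1:4] @ g])  # direct sliding-window result
assert np.allclose(winograd_f23(d, g), direct)
```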

@ottolu
Author

ottolu commented Jul 21, 2016

@andravin Haha, my bad. Thanks for your clarification, it's really helpful. ;-)
I hadn't gotten the book you referenced, so I thought you might be using Winograd's famous matrix multiplication method for this. LoL. No wonder I still found your approach hard to understand after learning these matrix multiplication algorithms from scratch over the past few days.

Thanks again!
BTW, Andrew, I wonder if you're still working on improving these conv kernels? If so, that would be awesome.

@hughperkins
Contributor

I think in theory 8-bit is enough to carry the information with quantization.

Googling for dp4a turns up a thread with Scott Gray in it as the first hit :-) https://devtalk.nvidia.com/default/topic/934562/cuda-programming-and-performance/nvidia-pascal-geforce-gtx-1080-amp-gtx-1070/post/4889687/ So I would say he's aware of it :-)

I was actually pondering dabbling with ints way back in 2014 http://computer-go.org/pipermail/computer-go/2014-December/007105.html ... but it's just one of many things that never survived contact with finite-hours-in-the-day :-)

Considering the effort involved in making GPUs work, and work quickly, I would think the first thing to do might be to demonstrate with normal CPU code that you can get OK results. You could just fire up torch and create torch.ByteTensors.
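
Something along the lines of the following NumPy sketch, for instance (the sizes and the per-tensor scaling scheme here are just placeholders): quantize the operands to int8, do the GEMM with 32-bit accumulation so the products cannot overflow the accumulator, and compare against the FP32 reference:

```python
import numpy as np

def quantize_int8(x):
    """Symmetric linear quantization to int8; returns values and scale."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

a = np.random.randn(64, 64).astype(np.float32)
b = np.random.randn(64, 64).astype(np.float32)
qa, sa = quantize_int8(a)
qb, sb = quantize_int8(b)

# int8 x int8 GEMM with int32 accumulation, then rescale to real values.
c_int32 = qa.astype(np.int32) @ qb.astype(np.int32)
c_approx = c_int32.astype(np.float32) * (sa * sb)
c_ref = a @ b

print("max relative error:", np.abs(c_approx - c_ref).max() / np.abs(c_ref).max())
```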

A few questions which occur:

  • How will you deal with overflows? It's one thing to have multiplications truncated to some maximum value; it's another for them to overflow into the opposite sign...
  • 8 bits means there are only 256 different values. How will you deal with, well, gradients and stuff?

Hmmm, I'm simply reciting back to you the questions that were put to me when I mentioned the idea myself :-) http://computer-go.org/pipermail/computer-go/2014-December/007106.html
