Parallelize Forward / Backward by Depth #547
Comments
One of the designs this feature could speed up, I think, is the model in the diagram on page 13 of this publication:
Pyramids and any model with late fusion [1, plus others and more to come] should likewise benefit.
[1] Large-Scale Video Classification with Convolutional Neural Networks
Fresh meat from CVPR 🍖
Actually any two non-overlapping paths could be run in parallel, even if they are of different lengths.

Sergio
@sguada right, advancing by depth covers that case too: execute in parallel depth-by-depth, and if any particular path completes early that's fine, just keep going until the execution of the deepest layer. There's no requirement for equal length. There has to be some logic to decide the number of streams / handles though. To start this could simply be manually selected.
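A minimal sketch of what that depth-by-depth loop could look like, assuming a hypothetical `layers_by_depth` grouping and a stand-in `Layer::Forward()` with no arguments (Caffe's real `Forward` takes bottom/top blob vectors); the thread-per-layer dispatch is only illustrative, and in practice the workers would likely be CUDA streams / cuBLAS handles with a manually chosen cap:

```cpp
#include <future>
#include <vector>

// Stand-in for a net layer; simplified signature for illustration only.
struct Layer {
  virtual void Forward() = 0;
  virtual ~Layer() {}
};

// Run all layers at the same depth concurrently, then synchronize before
// moving one level deeper. Paths of unequal length simply stop
// contributing work at the depths they do not reach.
void ForwardByDepth(const std::vector<std::vector<Layer*> >& layers_by_depth) {
  for (const std::vector<Layer*>& level : layers_by_depth) {
    std::vector<std::future<void> > pending;
    for (Layer* layer : level) {
      // One task per layer at this depth; the number of concurrent
      // workers (streams / handles) could be capped manually to start.
      pending.push_back(std::async(std::launch::async,
                                   [layer]() { layer->Forward(); }));
    }
    for (std::future<void>& f : pending) {
      f.wait();  // barrier: this depth finishes before the next begins
    }
  }
}
```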
The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data.
@kloudkl I don't know whether GraphLab and GraphChi could also be useful:
A simple graph traversal to make a depth -> layers mapping should suffice for our purposes. Thanks for the project pointers all the same.
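For reference, one way such a traversal could look, assuming each layer knows which earlier layers produce its inputs and that layers are already listed in topological order (as in a net definition); the names here are illustrative, not Caffe's actual Net internals:

```cpp
#include <algorithm>
#include <map>
#include <vector>

// Illustrative depth assignment over the layer DAG: a layer's depth is
// 1 + the maximum depth of the layers producing its inputs.
// parents[i] holds the indices of the layers feeding layer i.
std::map<int, std::vector<int> > GroupLayersByDepth(
    const std::vector<std::vector<int> >& parents) {
  std::vector<int> depth(parents.size(), 0);
  std::map<int, std::vector<int> > layers_at_depth;
  for (int i = 0; i < static_cast<int>(parents.size()); ++i) {
    for (int p : parents[i]) {
      depth[i] = std::max(depth[i], depth[p] + 1);
    }
    layers_at_depth[depth[i]].push_back(i);
  }
  // Each map entry lists the layers that can run in parallel at that depth.
  return layers_at_depth;
}
```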
Yes, but what kind of parallelization paradigm? Multi-thread, multi-device, (long term) distributed, or any combination of these?
My only concern with paths of different lengths is that one can be …

Sergio
Our parallelization goal is entirely single-node. We have single-process, multi-thread / multi-device parallelism in mind. Distributed computation has its place, but in my opinion there's no point pursuing it while there are still important single-node gains to be made. Of course anyone is free to pursue whatever parallelization they want, but this is the present direction of the project.
That was my thinking. We can always engage in fancier parallelization later if need be, but depth ordering should suffice.
Have you tried parallelizing on a multi-device node using NVBLAS (#194), which only requires dynamically linking the shared library?
@kloudkl no, because I want to control the communication and only …

It could be interesting to try distributing all BLAS operations with …
NVBLAS is such a low-hanging fruit that it is really worth some benchmarks. But I don't have access to multi-GPU devices in the near future. I hope someone interested will be able to do so.
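For anyone who does get to run those benchmarks, the setup is roughly an nvblas.conf plus preloading the library in front of the regular BLAS; the paths, GPU list, and benchmark binary below are placeholders, not a tested configuration:

```
# nvblas.conf (example values; adjust paths for your system)
NVBLAS_LOGFILE nvblas.log
# CPU BLAS to fall back on for operations NVBLAS does not offload
NVBLAS_CPU_BLAS_LIB /usr/lib/libopenblas.so
# Which GPUs to use, e.g. "0 1", or ALL
NVBLAS_GPU_LIST ALL
NVBLAS_AUTOPIN_MEM_ENABLED
```

```sh
# Interpose NVBLAS ahead of the regular BLAS for an existing binary.
export NVBLAS_CONFIG_FILE=/path/to/nvblas.conf
LD_PRELOAD=libnvblas.so ./your_benchmark_binary
```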
I'm a little skeptical because it has a shared host memory model that …

That said, I've only given a cursory look at cublasXT and would welcome …
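For context on that concern, a bare-bones cublasXt call (per the cublasXt API in the CUDA toolkit) operates on host-resident matrices that the library tiles and shuttles to the selected GPUs behind the scenes, which is the shared host memory model referred to above; this is a sketch only, with error checking omitted:

```cpp
#include <cublasXt.h>

// Sketch of the cublasXt usage model: A, B, C are plain host pointers;
// the library decides how to tile them and move tiles across the GPUs
// named in DeviceSelect. Communication is implicit, not under the
// caller's control.
void XtGemmExample(int m, int n, int k,
                   const float* A, const float* B, float* C) {
  cublasXtHandle_t handle;
  cublasXtCreate(&handle);

  int devices[2] = {0, 1};  // example: spread the GEMM over two GPUs
  cublasXtDeviceSelect(handle, 2, devices);

  const float alpha = 1.0f, beta = 0.0f;
  cublasXtSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                m, n, k,
                &alpha, A, m, B, k, &beta, C, m);  // column-major leading dims

  cublasXtDestroy(handle);
}
```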
Forward and Backward are done in sequence by layer ID at the moment. In principle, all Forward / Backward steps at the same depth in the DAG can be executed in parallel.
In DAG models where single layer operations do not saturate the host / device, this should improve performance.
As I understand it, this would be done with batched cuBLAS calls and streams for parallel kernel execution at each depth in the model.
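As a rough illustration of that idea (not a proposed implementation), each layer slot at a given depth could be bound to its own CUDA stream and cuBLAS handle so their GEMMs and kernels are eligible to overlap; the `DepthWorker` name and helpers below are made up for the sketch:

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

// One stream + one cuBLAS handle per concurrent layer slot, so that
// layers at the same depth can issue their work independently.
struct DepthWorker {
  cudaStream_t stream;
  cublasHandle_t handle;
};

std::vector<DepthWorker> MakeWorkers(int n) {
  std::vector<DepthWorker> workers(n);
  for (int i = 0; i < n; ++i) {
    cudaStreamCreate(&workers[i].stream);
    cublasCreate(&workers[i].handle);
    // cuBLAS calls made with this handle now queue on its stream, so
    // GEMMs from different layers at the same depth may overlap.
    cublasSetStream(workers[i].handle, workers[i].stream);
  }
  return workers;
}

// After issuing one layer's Forward on each worker, a depth barrier is just:
void SyncDepth(const std::vector<DepthWorker>& workers) {
  for (const DepthWorker& w : workers) {
    cudaStreamSynchronize(w.stream);
  }
}
```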