
Feature Request: plug-in support for new devices #4359

Closed
vrv opened this issue Sep 13, 2016 · 77 comments

@vrv commented Sep 13, 2016

We need to be able to support using new devices in TensorFlow without requiring editing the TensorFlow binary source code.

This requires a few changes. Among many others:

  • The ability to dynamically register an implementation / factories for alternative devices
  • New or better APIs for allowing OpKernel's to use arbitrary device resources (e.g., right now we assume either the use of StreamExecutor or EigenDevice implementations, but that's obviously not general).
@vrv vrv self-assigned this Sep 13, 2016
@aodhan-domhnaill

A few questions.

About Feature Request

By "dynamically", I assume you mean at compile time, but with self-contained code changes. If you mean at runtime via dynamic module loading, I don't know offhand how to do that, but I think we can figure it out.

How do you wish to handle the kernel code? Should I plan for new hardware to be supported by developers modifying tensorflow/core/kernels/* and registering the new hardware, or do you want this separated as well? This is currently handled with macros and registration systems in the C++, but I ask from a build perspective. We could make a new folder that contains all the code for extra devices along with all of its build scripts. If this is done, I can write template files that contain all the basic structural code (many projects do this for configuration scripts), which a user could essentially rename and be ready to go.

About TensorFlow

TensorFlow Device Contexts, Streams, and Context Switching

@vrv (Author) commented Sep 14, 2016

Yes, I do mean self-contained code changes: a user should be able to use an existing binary install of TensorFlow that doesn't contain the custom device code, but then call a function to "load a device" and then another function to "load the kernels" for that device.

The latter is already possible for existing devices via the tf.load_op_library() mechanism, so theoretically something similar could be done for a new tf.load_device().
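For context, an op/kernel library loaded with tf.load_op_library() is just a shared object whose registrations run from static initializers at load time, and a hypothetical tf.load_device() could lean on the same property. Below is a minimal sketch of such a library; the KpuPing op and the "KPU" device type string are made up for illustration.

// kpu_ops.cc: compiled into kpu_ops.so and loaded with tf.load_op_library().
// Hypothetical sketch; the op name and device type are placeholders.
#include "tensorflow/core/framework/op.h"
#include "tensorflow/core/framework/op_kernel.h"

namespace tensorflow {

REGISTER_OP("KpuPing").Doc("Hypothetical no-input, no-output test op.");

class KpuPingOp : public OpKernel {
 public:
  explicit KpuPingOp(OpKernelConstruction* ctx) : OpKernel(ctx) {}
  void Compute(OpKernelContext* ctx) override {
    // A real kernel would talk to the device here.
  }
};

// Static registration runs when the .so is loaded, so no TensorFlow rebuild is needed.
REGISTER_KERNEL_BUILDER(Name("KpuPing").Device("KPU"), KpuPingOp);

}  // namespace tensorflow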

I'll answer your other question on SO, but I don't think the answer there will be instructive for new devices. Every device has its own execution model, and so the way the device code is written needs to take into account the execution model. On CPU, the ThreadPoolDevice is just a device that uses Eigen's multi-threaded CPU support to implement operations. On GPU, we have to manage streams and execution order carefully because GPU devices are asynchronous, not synchronous.

If you told us more about the execution model of your hardware, we might be able to suggest something more relevant -- it's likely that copying the CPU or the GPU way of doing things is not the right solution.

@aodhan-domhnaill

I need to talk with my supervisor before giving too many details. Here is the public info, but as a high-level summary, the chip is basically a cluster computer reduced to fit on a single chip. The many DSPs need to work together to do anything useful.

I didn't know about the operation interface! It's pretty awesome, and I definitely think that is what I want to build.

It would seem that at a minimum, a developer would need to write an Allocator, Device, DeviceFactory, and DeviceContext. This would give a non-functional device because there are no kernels registered to it. As I was developing, I noticed that some kernels seemed to be core functions like ConstOp, AssignOp, NoOp, etc. that are needed for other things to work. It would seem that a user wouldn't want to code these explicitly as they are kind of obvious and redundant. Do you think these can/should be automatically built into the framework so that every device at least has these working out of the box?

@vrv (Author) commented Sep 14, 2016

Yes, those four pieces are the minimal requirements just to get the basics working, and then you'd have to register kernels for every op that was supported.

You're right that some ops are probably trivially implementable as long as some of the basics above are implemented. We'd have to think about how to 'auto register kernels' for all devices. However, for things like NoOp, it shouldn't be too hard: it would just be an include and a registration such as

REGISTER_KERNEL_BUILDER(Name("NoOp").Device(DEVICE_GPU), NoOp);
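For concreteness, here is a rough sketch of the other minimal pieces (Device, DeviceFactory, and a stand-in allocator). The names are hypothetical and the constructor/factory signatures follow the ThreadPoolDevice of roughly this era of TensorFlow, so treat it as a sketch rather than a drop-in implementation.

// kpu_device.cc: hypothetical sketch of the minimal device plumbing for a new
// "KPU" device type. Signatures follow ThreadPoolDevice from this era of
// TensorFlow and may need adjusting for other versions; cpu_allocator() is a
// stand-in for a real device-memory Allocator.
#include "tensorflow/core/common_runtime/device_factory.h"
#include "tensorflow/core/common_runtime/local_device.h"
#include "tensorflow/core/framework/allocator.h"
#include "tensorflow/core/lib/core/errors.h"
#include "tensorflow/core/public/session_options.h"

namespace tensorflow {

class KpuDevice : public LocalDevice {
 public:
  KpuDevice(const SessionOptions& options, const string& name)
      : LocalDevice(options,
                    Device::BuildDeviceAttributes(
                        name, DeviceType("KPU"), Bytes(256 << 20),
                        DeviceLocality(), "hypothetical KPU device"),
                    cpu_allocator()) {}

  Allocator* GetAllocator(AllocatorAttributes attr) override {
    return cpu_allocator();  // a real device would return its own Allocator
  }

  Status MakeTensorFromProto(const TensorProto& tensor_proto,
                             const AllocatorAttributes alloc_attrs,
                             Tensor* tensor) override {
    Tensor parsed(tensor_proto.dtype());
    if (!parsed.FromProto(cpu_allocator(), tensor_proto)) {
      return errors::InvalidArgument("Cannot parse tensor from proto");
    }
    *tensor = parsed;  // a real device would copy host memory to the device here
    return Status::OK();
  }

  Status Sync() override { return Status::OK(); }
};

class KpuDeviceFactory : public DeviceFactory {
 public:
  Status CreateDevices(const SessionOptions& options,
                       const string& name_prefix,
                       std::vector<Device*>* devices) override {
    devices->push_back(new KpuDevice(options, name_prefix + "/device:KPU:0"));
    return Status::OK();
  }
};

// Static registration: makes "/device:KPU:0" available to the placer.
REGISTER_LOCAL_DEVICE_FACTORY("KPU", KpuDeviceFactory);

}  // namespace tensorflow

Kernels are then registered against the same "KPU" type string, as in the NoOp line above.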

@aodhan-domhnaill commented Sep 14, 2016

How do I add headers to be installed into the TensorFlow include path? And how about the .proto files? I need to "compile" them to headers and have them installed.

I need several framework and common_runtime headers to compile device code. I tried to look through the Bazel files, but couldn't find anything obvious.

EDIT:

I tried a workaround of just setting my include path to the root of the Git repo, but found that things like #include "unsupported/Eigen/CXX11/Tensor" are being pulled in. These would seem to be unnecessary for a "general" device. What should I do about this? I figure my options are: just include them (what I will do in the meantime), or write stripped-down headers that contain only the minimum necessary to construct a device. IMO the second option is nicer long term, but certainly more annoying to do.

EDIT 2:

Also, the Eigen files don't appear to have header guards, which causes recursive includes.

@aodhan-domhnaill

FYI, I got a fake device working with contexts, factories, kernels, and all, so I will begin trying to make it self-contained. I posted various questions to StackOverflow and elsewhere; those are more educational. The library problem above is more important for the purposes of this issue.

Thanks a bunch!

@vrv (Author) commented Sep 15, 2016

  1. At the moment, unsupported/Eigen/CXX11/Tensor is required by the framework -- it's how you can map our tensor.h Tensor into typed and shaped tensors.

  2. Adding @keveman regarding how to add TF headers to the include path. I thought https://github.com/tensorflow/tensorflow/blob/master/tensorflow/g3doc/get_started/os_setup.md#setting-up-tensorflow-for-development might be relevant, but I'm not sure.

@aodhan-domhnaill

I got all the code down to one file, with no changes to the main code base except the changes below. I list my solution ideas, but they involve modifying the main code, so I wanted to make sure it's okay.

diff --git a/tensorflow/core/util/device_name_utils.cc b/tensorflow/core/util/device_name_utils.cc
index 5816dbd..50309ed 100644
--- a/tensorflow/core/util/device_name_utils.cc
+++ b/tensorflow/core/util/device_name_utils.cc
@@ -162,6 +162,16 @@ bool DeviceNameUtils::ParseFullName(StringPiece fullname, ParsedName* p) {
       }
       progress = true;
     }
+    if (str_util::ConsumePrefix(&fullname, "/kpu:") ||
+        str_util::ConsumePrefix(&fullname, "/KPU:")) {
+      p->has_type = true;
+      p->type = "KPU";  // Treat '/kpu:..' as uppercase '/device:KPU:...'
+      p->has_id = !str_util::ConsumePrefix(&fullname, "*");
+      if (p->has_id && !ConsumeNumber(&fullname, &p->id)) {
+        return false;
+      }
+      progress = true;
+    }

     if (!progress) {
       return false;

To get rid of this one, I was thinking of using the existing registration, but that means exposing this function. I could just make the underlying std::unordered_map a private/protected member of the DeviceFactory class and then make DeviceNameUtils a friend. Or just expose the function handle in the header.

diff --git a/tensorflow/python/framework/device.py b/tensorflow/python/framework/device.py
index 8f5125d..505c2a9 100644
--- a/tensorflow/python/framework/device.py
+++ b/tensorflow/python/framework/device.py
@@ -155,7 +155,9 @@ class DeviceSpec(object):
         elif ly == 2 and y[0] == "task":
           self.task = y[1]
         elif ((ly == 1 or ly == 2) and
-              ((y[0].upper() == "GPU") or (y[0].upper() == "CPU"))):
+              ((y[0].upper() == "GPU") or 
+               (y[0].upper() == "CPU") or
+               (y[0].upper() == "KPU"))):
           if self.device_type is not None:
             raise ValueError("Cannot specify multiple device types: %s" % spec)
           self.device_type = y[0].upper()

If there is Python access to the same interface as above, then I can just use that. I am a fan of less configuration.

@vrv (Author) commented Sep 16, 2016

I think if you name your device "/device:KPU:0" instead of just "/KPU:0", it should work without any additional edits. "/cpu:0" and "/gpu:0" were shortcuts before we realized it was helpful to have the "/device" qualifier like we do for the other parts of a full device name. Can you give that a try and let me know?
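For what it's worth, a quick check that the generic parser already accepts the qualified form for an arbitrary type (the "KPU" type here is just the example from this thread):

// name_check.cc: sketch showing that "/device:<TYPE>:<id>" parses generically.
#include <iostream>
#include "tensorflow/core/util/device_name_utils.h"

int main() {
  tensorflow::DeviceNameUtils::ParsedName parsed;
  const bool ok = tensorflow::DeviceNameUtils::ParseFullName(
      "/job:localhost/replica:0/task:0/device:KPU:0", &parsed);
  // Expected: ok == true, parsed.type == "KPU", parsed.id == 0, with no edits
  // to device_name_utils.cc or device.py.
  std::cout << ok << " " << parsed.type << " " << parsed.id << std::endl;
  return 0;
}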

@aodhan-domhnaill

Yup that worked!

Device mapping:
/job:localhost/replica:0/task:0/device:FPU:0 -> physical description
/job:localhost/replica:0/task:0/device:FPU:1 -> physical description
/job:localhost/replica:0/task:0/device:FPU:2 -> physical description
/job:localhost/replica:0/task:0/device:FPU:3 -> physical description

Once I figure out how to get those includes and compile it outside of TF, I will submit a pull request with documentation à la "Adding an Op".

After that, I would like to discuss how best to support my particular hardware. My company will probably want this to be private, as I will need to tell you details, so would you be willing to do it over email rather than as a GitHub issue?

@vrv (Author) commented Sep 16, 2016

Adding @petewarden

@aodhan-domhnaill

@keveman

I have been manually adding headers to my include path, and I need:

tensorflow/core/framework/device_attributes.pb_text.h
tensorflow/core/framework/op_segment.h
tensorflow/core/framework/resource_mgr.h
tensorflow/core/framework/tensor.pb_text.h
tensorflow/core/framework/type_index.h
tensorflow/core/graph/edgeset.h
tensorflow/core/graph/graph.h
tensorflow/core/graph/types.h
tensorflow/core/arena.h
tensorflow/core/lib/gtl/int_type.h
tensorflow/core/lib/gtl/iterator_range.h
tensorflow/core/public/session_options.h

Even after manually adding symlinks to the TensorFlow source for all of the above headers, I still get this error:

g++ -std=c++11 -shared fpu_device.cc -o fpu.so -fPIC -I /usr/local/lib/python2.7/dist-packages/tensorflow/include
In file included from /usr/local/lib/python2.7/dist-packages/tensorflow/include/tensorflow/core/common_runtime/local_device.h:19:0,
                 from fpu_device.cc:10:
/usr/local/lib/python2.7/dist-packages/tensorflow/include/tensorflow/core/common_runtime/device.h: In member function ‘std::string tensorflow::Device::DebugString() const’:
/usr/local/lib/python2.7/dist-packages/tensorflow/include/tensorflow/core/common_runtime/device.h:139:74: error: ‘ProtoDebugString’ was not declared in this scope
   string DebugString() const { return ProtoDebugString(device_attributes_); }
                                                                          ^
fpu_device.cc: In member function ‘virtual tensorflow::Status tensorflow::FPUDevice::MakeTensorFromProto(const tensorflow::TensorProto&, tensorflow::AllocatorAttributes, tensorflow::Tensor*)’:
fpu_device.cc:84:41: error: ‘ProtoDebugString’ was not declared in this scope
            ProtoDebugString(tensor_proto));

@petewarden (Contributor)

ProtoDebugString() is generated by tensorflow/tools/proto_text, which is a bit like protoc but for a minimal footprint class instead of a full protobuf one.

@aodhan-domhnaill

Thanks. Do you know how I can get it to be recognized when I am compiling it into an external module?

I am currently running

g++ -std=c++11 -shared fpu_device.cc -o fpu.so -fPIC -I /usr/local/lib/python2.7/dist-packages/tensorflow/include

@DavidNorman (Contributor) commented Sep 22, 2016

Hello. I think I should introduce myself. I am working for Graphcore, a UK startup making a graph processor / accelerator. It seems that I am traveling the same path as Aidan.

I have a vaguely functioning device, although I was stuck on the 'having to edit device_name_utils.cc and device.py' problem. Being able to build a dynamically linked device/kernels module would indeed be very useful. I have isolated all of my device/kernel code into the third_party branch. Does this make sense?

My device depends on an external library, and so I have also added some code to the workspace.bzl file. With only one workspace for the whole of TensorFlow, there doesn't seem to be a way around this. Is there something I have missed about Bazel that would remove the need for that?

Cheers

@DavidNorman (Contributor)

I saw the StackOverflow conversation about DeviceContexts.

Can you clarify the difference between Devices and DeviceContexts? Why would I not store device-specific information in my Device class, for instance handles to device-specific structures, mapped memory, etc.?

@aodhan-domhnaill

@DavidNorman If you look back a couple of posts, I had the same problem. The trick is calling your device something like "device:FPU". In my pull request, I do this.

Also, @DavidNorman, were you able to get your code to compile as a .so file? I have trouble with the includes. Or are you compiling into the TensorFlow main body?

@aodhan-domhnaill

@DavidNorman with regard to DeviceContexts, I forget where I heard it, but I think @vrv told me that GPUs have multiple contexts that handle different parts of the computation, like memory access, pushing kernels onto the device, etc.

Also, if you could put both of your questions on StackOverflow, I will answer them there for better accessibility in the future, e.g. my question about compilation.

@DavidNorman (Contributor)

@aidan-plenert-macdonald thanks for the info. I do not have a separate .so file at the moment (or a .dylib, as it would be on my Mac). I have a few more hurdles to get past before I try to make that happen, I think. The requirement for an external library means I have to modify workspace.bzl, so I'm not sure there is much point unless I can figure out how to avoid that.

Have posted this question on StackOverflow.

@aodhan-domhnaill

@vrv I was wondering if you have any further information about adding the includes. That is all we still need.

@vrv (Author) commented Sep 27, 2016

I really don't know :( Going to assign this one to @keveman (I'll be OOO soon for a week).

@vrv vrv assigned keveman and vrv and unassigned vrv Sep 27, 2016
@aodhan-domhnaill

Got it!! It's a simple one-liner in a BUILD file.

Bottom of the page

@aodhan-domhnaill

@DavidNorman
You were looking at the difference between DeviceContexts and Devices. I am debating between two designs. In the first, a Device represents the actual hardware, with several DeviceContexts controlling virtual compute nodes (my device can allocate compute resources dynamically), each of which handles a vertex in my TF compute graph (I will use "node" for hardware and "vertex" for TF operations). My concern is that TF doesn't appear to support binding Allocators to DeviceContexts, which I would need so that allocation for each TF vertex happens within my virtual compute node. In the second, Devices are my virtual compute nodes and each is bound to a TF vertex, but then I can't size them dynamically, because I believe the DeviceFactories create devices before the TF compute graph is assembled.

Can I ask how you are assigning compute resources to nodes in the TF compute graph? Are you using Device to control hardware or is this done with contexts? Do you have similar problems with allocation? Do you know anything about TF's ability to do automatic resource allocation? I believe TF can auto-assign devices to operations.

@vrv Is there any way to bind Allocators to specific device contexts?

@vrv (Author) commented Sep 29, 2016

@aidan-plenert-macdonald: Typically an Allocator manages the memory for an entire device.

In the GPU case, there is one allocator for each GPU device, but a GPU device has hundreds to thousands of "cores", each of which has access to the global memory that the allocator is responsible for allocating and that programs are responsible for using properly. CUDA's programming model additionally allows local, fine-grained sharing of memory among cores, but that is specific to the CUDA language and we don't really touch it, since it's not part of the global memory pool.

I suspect that your device is really more like many loosely coupled cores with message passing, which is a model we haven't really optimized for. We would normally treat each 'core' as a separate device, and then your memcopy primitives (device to device) would use whatever high-performance inter-node communication primitives you wanted.

Alternatively, if you'd rather treat your entire processor as a single "device", it might still be possible: allocator.h has an AllocatorAttributes structure, which is a 64-bit opaque value blob. The top 8 bits of that structure are reserved for device-specific use.

OpKernel has a GetAllocator() function that takes allocator attributes, so it might be possible for you to have the DeviceContext contain information that an OpKernel can use to set the appropriate bits of the AllocatorAttributes, and then you'd implement DeviceBase::GetAllocator() to return a different allocator based on the top 8 bits of the AllocatorAttributes. If you used all 8 bits as an allocator id, you'd be able to address 256 different allocators in a single device.

Without knowing too much about your device, I'm not sure which approach is better, but those might help you make progress.
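A rough sketch of that allocator-id idea. The helper and namespace names are hypothetical, and the bit layout is an assumption taken from the description above; check allocator.h for the exact width of the attribute value and which bits are free.

// allocator_pick.cc: hypothetical sketch of mapping the device-specific bits of
// AllocatorAttributes to one Allocator per virtual compute node. The shift below
// assumes a 32-bit value with the top 8 bits free for device use; adjust to match
// the AllocatorAttributes layout in tensorflow/core/framework/allocator.h.
#include <vector>
#include "tensorflow/core/framework/allocator.h"

namespace kpu {  // hypothetical plug-in namespace

inline tensorflow::Allocator* PickAllocator(
    tensorflow::AllocatorAttributes attr,
    const std::vector<tensorflow::Allocator*>& sub_allocators,
    tensorflow::Allocator* fallback) {
  const unsigned id = (attr.value >> 24) & 0xffu;  // device-specific allocator id
  return id < sub_allocators.size() ? sub_allocators[id] : fallback;
}

// The device would call this from its DeviceBase::GetAllocator() override, and
// an OpKernel would stash the virtual-node id before asking for an allocator:
//   AllocatorAttributes attr;
//   attr.value |= (node_id & 0xffu) << 24;
//   Allocator* a = context->device()->GetAllocator(attr);

}  // namespace kpu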

@eliben (Contributor) commented Jan 27, 2017

@RichDubielzig not sure I understand the maximum code size requirement. Care to elaborate? Where in the API would you expect to see this?

@RichDubielzig

I would expect to see code size considerations taken into account in the generation of kernel code from LLVM IR. The instruction memory for individual cores in Knureon is quite small, and it looks like XLA could JIT kernel code that would overrun program memory. Is there a mechanism to break up a block of IR into smaller pieces?

@eliben (Contributor) commented Jan 27, 2017

@RichDubielzig this functionality is not supported by the current CPU/GPU backends; however, if you plan to add a custom backend for your architecture, you could implement it there.

@aodhan-domhnaill

@eliben Do you mean implement it in the main TensorFlow, or just in our library? Can you recommend a good place that it might fit into nicely?

@RichDubielzig

I'm afraid I'm having trouble understanding how to handle synchronization and asynchronous kernel resources when developing in TensorFlow using TensorFlow objects. Here is my problem:

The knureon system can be thought of as a large pool of completely independent cores which can be assigned operations at any time. So, for example, given the data flow below:

A: m = a * b
B: n = c *d
C: o = e * f

If I have 128 available cores, then I might request A to run on 64 cores, B to run on another 64 cores, and then I would need to wait for one of the first two operations to complete before I can run C on 64 cores. To this end, I have created a little launch queue object that can hold computation requests until compute resources are available, then run them. In pseudocode, my naive OpKernel Compute() implementation is below. Note that this employs a Knureon-specific command queue object which can be used both to launch operations and to pend on running ones. Multiple queues are allowed to exist and run in parallel on a single Knureon system.

Compute( context  input):
  Get input tensors A & B from context
  Also get a Knureon command queue handle Q from the DeviceContext object.
  Figure out the dimensions of the tensors
  Assemble a call f(A,B,dims) to my Knureon matrix multiplier algorithm
  Queue the tuple (f(),Q) in the launch queue.
  We can now block on Q until f() completes.

(in launch queue, in a separate thread:)
  Wait for resources to run f()
  dispatch f() using the queue Q on the Knureon complex.

The problem should be apparent: I don't want TensorFlow to sit around waiting on Q if there are other operations that can be deployed right away on available resources. The alternative appears to be to use an asynchronous opkernel:

ComputeAsync( context  input, DoneCallback done):
  Get input tensors A & B from context
  Also get a Knureon command queue handle Q from the DeviceContext object.
  Figure out the dimensions of the tensors
  Assemble a call f(A,B,dims) to my Knureon matrix multiplier algorithm
  Queue the tuple (f(), Q, done) in the launch queue.
  Return immediately

(in launch queue, in a separate thread:)
  Wait for resources to run f()
  run f() using Q on the Knureon complex and spawn a task that will wait on Q.  This task will call the ComputeAsync done() call back when f() is complete.

But I am not sure if this is the right approach, as it seems to be reserved for high-latency operations such as receiving over a network. I've been over and through the code, and unfortunately this has only increased my confusion. I see that the GPU uses a StreamExecutor, but the presence of all the ThenXxxx() functions in the Stream prototype makes me suspect that it is not what I want.

I have also noticed that OpKernel Compute() methods can be called in parallel from concurrent Executor threads. So do I even need to sweat about parallel execution at all? When an OpKernel's Compute() method is invoked, am I guaranteed that there will be no benefit to running asynchronously because all the other OpKernels managed by my threads' Executor have data dependencies on my operation?

Thank you in advance and my apologies for rambling. I've had to spend a few days figuring out how to phrase this question in a coherent manner, and I'm not sure I've met my goal, so if you need anything clarified please let me know.

@GiuseppeDiGuglielmo commented Feb 23, 2017

Hi! As many others I am really interested in defining my own devices.

I read this thread as well as some others. Is there any documentation available to implement new devices (either in the TensorFlow source code or as a separate module)?

@RichDubielzig and @aidan-plenert-macdonald were working on a guide, but at the moment I could only see this:
https://github.com/knuedge/tensorflow/blob/36e0cdf04f294bfd51931d4f78e291590ed0d3ec/tensorflow/g3doc/hardware/adding_support/index.md

Is there anything more recent (targeting release 1.0)?

Thank you!

@vrv (Author) commented Feb 24, 2017

@CUinNYC indeed, that doc is a great attempt to show how it's done, and that's the basic idea of how the scaffolding works, but the implementation details differ for every new hardware device, so it takes care to figure out exactly how to implement the device code.

We're working with a few external folks such as those in this thread to provide some early guidance as to how it's done, and at some point we'll have some nice examples to point at (beyond say, our CPU and GPU implementations).

@RichDubielzig: the dataflow model of TF execution means that any operations that can be run at a given time will be run by the executor (assuming there are enough inter-op-parallelism threads to run them). So yes, it is possible to have multiple "matmul" nodes running at once if both can be run at the same time.
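As a concrete knob, the size of that inter-op pool is configurable through SessionOptions; a short C++ sketch (the Python ConfigProto equivalent works the same way):

// session_threads.cc: sketch bounding how many ops the executor runs at once.
#include "tensorflow/core/public/session.h"
#include "tensorflow/core/public/session_options.h"

int main() {
  tensorflow::SessionOptions options;
  options.config.set_inter_op_parallelism_threads(8);  // up to 8 ops in flight
  tensorflow::Session* session = tensorflow::NewSession(options);
  // ... build a GraphDef, session->Create(graph), session->Run(...), etc.
  delete session;
  return 0;
}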

On CPU, we use Eigen for a lot of the computation, and each op has a shared threadpool on which to execute -- so those threads will typically always be busy if there's lots of op work to do (though I admit, it probably won't be super efficient to context switch all the time, but I digress).

On GPU, we use GPU streams to enqueue asynchronous ops: execution of an OpKernel just queues the work to be done, and we use streams to enforce execution order at the actual device.

@hawkinsp might have more to say here.

@GiuseppeDiGuglielmo commented Feb 24, 2017

@vrv I really appreciate the initial documentation. Thanks to it, I successfully created a "fake CPU" (really easy indeed). It required minor changes because of the TensorFlow code evolution (r1.0), so having an external-module approach would be more portable. Let me know if you need help or feedback on the preliminary documentation.

In particular, I am interested in how to allocate memory for accelerators. My goal is to reduce memory copies as much as possible.

If anybody has comments or examples, I would really appreciate them.
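For what it's worth, device memory plugs in through the tensorflow::Allocator interface. A minimal hypothetical sketch for an accelerator with its own memory follows; knu_malloc/knu_free are placeholders for whatever the vendor runtime provides.

// knu_allocator.cc: hypothetical sketch of a device-memory Allocator.
#include <string>
#include "tensorflow/core/framework/allocator.h"

extern "C" {
void* knu_malloc(size_t bytes);   // placeholder vendor API
void knu_free(void* ptr);         // placeholder vendor API
}

namespace tensorflow {

class KnuDeviceAllocator : public Allocator {
 public:
  string Name() override { return "knu_device"; }

  void* AllocateRaw(size_t alignment, size_t num_bytes) override {
    // A real implementation would honor `alignment` and track statistics.
    return knu_malloc(num_bytes);
  }

  void DeallocateRaw(void* ptr) override { knu_free(ptr); }
};

}  // namespace tensorflow

The Device then returns this allocator from GetAllocator(), and MakeTensorFromProto() plus the device's copy paths are where host/device transfers (and opportunities to avoid them) live.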

@RichDubielzig

Just following up on my question: I am seeing results with this approach:

  1. Define my operation as an AsyncOpKernel. Kernel pseudocode:
ComputeAsync( context  input, DoneCallback done):
  Get input tensors A & B from context
  Also get a Knureon command queue handle Q from the DeviceContext object.
  Figure out the dimensions of the tensors
  Assemble a call f(A,B,dims) to my Knureon matrix multiplier algorithm
  Queue the tuple (f(), Q, done) in the launch queue.
  Return immediately
  2. In the launch queue, in a separate thread:
(as part of the init routine, we have created our own dedicated ThreadPool, which includes a thread to run this wait/spawn loop)

  Wait for resources R to run f()
  Schedule a task to run f() on R resources using Q and calling done().
  3. In the task scheduled above:
  Deploy f() to R and kick it off.
  Wait on Q to finish
  Release R back to the Knureon environment
  Call the AsyncOpKernel done() callback
  Exit
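A rough C++ skeleton of that pattern, with the Knureon-specific pieces left as hypothetical placeholders; only the AsyncOpKernel and registration plumbing are standard TensorFlow API, and "KPU" stands for whatever type string the device factory registered.

// knu_matmul.cc: hypothetical AsyncOpKernel skeleton for the flow above.
#include "tensorflow/core/framework/op_kernel.h"

namespace tensorflow {

class KnuMatMulOp : public AsyncOpKernel {
 public:
  explicit KnuMatMulOp(OpKernelConstruction* ctx) : AsyncOpKernel(ctx) {}

  void ComputeAsync(OpKernelContext* ctx, DoneCallback done) override {
    const Tensor& a = ctx->input(0);
    const Tensor& b = ctx->input(1);
    // Hand the work and the done callback to the launch queue and return
    // immediately, so the executor is free to dispatch other ready ops.
    EnqueueLaunch(a, b, ctx, std::move(done));
  }

 private:
  // Placeholder: a real version pushes (f, Q, done) onto a device-specific
  // launch queue; a worker thread dispatches f when cores are free, waits on Q,
  // allocates and fills the op's output via ctx, and only then calls done().
  // Here it simply completes immediately.
  void EnqueueLaunch(const Tensor& a, const Tensor& b, OpKernelContext* ctx,
                     DoneCallback done) {
    done();
  }
};

REGISTER_KERNEL_BUILDER(Name("MatMul").Device("KPU"), KnuMatMulOp);

}  // namespace tensorflow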

@RichDubielzig

Asked a new question on StackOverflow about the ConstTensor, which doesn't map to the DMAHelper::base() function:

http://stackoverflow.com/questions/42707600/tensorflow-how-to-get-pointer-to-data-contents-of-consttensor

@RichDubielzig

Posted another question on StackOverflow regarding my confusion about DeviceContext and when it needs to be used. I am revisiting the issue because it turns out we have a smaller limit than I thought on open queues with the system, and I'm wondering whether I should:

  1. redesign with the idea of mapping one DeviceContext per queue, or
  2. keep the current design of one DeviceContext per OpKernelContext and manage queues in a separate pool as they are needed.

http://stackoverflow.com/questions/42869340/tensorflow-why-does-a-gpu-device-only-have-one-device-context-and-does-it-rea

@RichDubielzig

Another issue we have run into in attempting to debug XLA: It doesn't seem like we are able to exercise any backend compiler when running in a debugger. The issue is here:

#9194

@vrv (Author) commented Apr 13, 2017

@RichDubielzig I've been told that XLA-related development discussion has been moved to https://groups.google.com/forum/#!forum/xla-dev -- you might get more help / feedback there now.

@RichDubielzig

Asked a new question related to this on StackOverflow. Will also post to the xla-dev forum.

https://stackoverflow.com/questions/44271775/can-tensorflow-xla-be-supported-by-plug-ins

@vrv vrv removed their assignment Jul 10, 2017
@hawkinsp hawkinsp removed their assignment Dec 20, 2017
@tensorflowbutler (Member)

It has been 14 days with no activity and this issue has an assignee. Please update the label and/or status accordingly.


@eliben eliben removed their assignment Jan 4, 2018
@shivaniag (Contributor)

Automatically closing due to lack of recent activity. Please update the issue when new information becomes available, and we will reopen the issue. Thanks!
