Update: actually, most layers can be efficiently ported by using the cltorch methods directly from lua. A notable exception to this is the convolutional layers.
Anyway, if you do want/need to port a layer that can't be efficiently implemented using cltorch methods, directly in lua, here are some guidelines for how you might proceed.
- make sure that cunn is checked out to directory cunn, at the same level as the clnn directory
- change into the clnn directory
- create a directory port
- run python util/port.py, which will do a first-cut port of the cuda files from the ../cunn directory into the port subdirectory
- use meld or similar to copy the two or so files from the desired layer across into the clnn directory
- add the .cpp file to CMakeLists.txt
- change the init function, at the bottom of the layer's .cpp file, to not be static
- add a call to the layer's init function to the init.cpp file
- add some includes to the top of the .cpp file:
#include "luaT.h"
#include "THClTensor.h"
#include "THClTensorMath.h"
#include "THClBlas.h"
#include "THClKernels.h"
#include "templates/TemplatedKernel.h"
#include <iostream>
#include <string>
using namespace std;
- comment out any cuda-stuff in the .cpp file, and add THError("Not implemented"); in their place
- try building, and fix any build errors
In the .cl file:
- replace each float * foo kernel parameter with global float *foo_data, int foo_offset
- at the start of the kernel, for each of these parameters, put:
global float *foo = foo_data + foo_offset;
- put global in front of any float * variables that are derived from these variables
- put local in front of any float * variables derived from any float variables
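To make the foo_data / foo_offset convention concrete, here is a minimal host-side sketch in plain C++ that mimics one work-item of a rewritten kernel. The kernel name scaleKernel, its parameters, and the explicit gid argument (standing in for get_global_id(0)) are all hypothetical and are not taken from clnn; in the real .cl file the pointers would carry the global qualifier, as the comments note.

```cpp
#include <vector>

// Host-side stand-in for one work-item of a ported kernel (hypothetical
// example, not part of clnn). In the .cl file the signature would read:
//   kernel void scaleKernel(global float *in_data, int in_offset,
//                           global float *out_data, int out_offset, int n)
void scaleKernel(float *in_data, int in_offset,
                 float *out_data, int out_offset,
                 int n, int gid /* stands in for get_global_id(0) */) {
  // Recreate the pointers that the original cuda code expected.
  // In OpenCL this would be: global float *in = in_data + in_offset;
  float *in = in_data + in_offset;
  float *out = out_data + out_offset;
  if (gid < n) {
    out[gid] = 2.0f * in[gid];
  }
}

// Runs every emulated work-item over a tensor that starts partway into
// its storage buffer (the tensor data begins at index 2).
std::vector<float> runScaleExample() {
  std::vector<float> storage = {9.0f, 9.0f, 1.0f, 2.0f, 3.0f};
  std::vector<float> result(3, 0.0f);
  for (int gid = 0; gid < 3; gid++) {
    scaleKernel(storage.data(), 2, result.data(), 0, 3, gid);
  }
  return result;
}
```

The offset presumably exists because a tensor can start partway into its storage, and an OpenCL buffer is handed to a kernel as a whole object rather than as an already-offset pointer.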
- Add a stringify section to the bottom of the .cpp file. It should look something like:
std::string MyNewLayer_getKernelTemplate() {
// [[[cog
// import stringify
// stringify.write_kernel( "kernel", "MyNewLayer.cl" )
// ]]]
// [[[end]]]
return kernelSource;
}
- change the bit saying MyNewLayer.cl to have the actual name of the .cl file
- change the name of the method to replace MyNewLayer with the actual name of the layer
- cd into the build directory, run ccmake .., and change option DEV_RUN_COG to ON, and do configure and generate
- rebuild => the bottom of the .cpp file should now contain the .cl source code, as a c++ std::string
- copy the declaration of this method to the top of the .cpp file
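As a rough picture of the end result of the stringify steps, the sketch below shows a declaration at the top of the file and a definition whose cog section has been filled in. The kernel text is a made-up placeholder, and the exact code that stringify emits is an assumption here; the real generated section may be laid out differently.

```cpp
#include <string>

// Declaration copied to the top of the .cpp file, so that code above the
// definition can call it.
std::string MyNewLayer_getKernelTemplate();

// Roughly what the bottom of the file might look like after cog has run.
// The kernel source below is a hypothetical placeholder, not real clnn code.
std::string MyNewLayer_getKernelTemplate() {
  // [[[cog
  // import stringify
  // stringify.write_kernel( "kernel", "MyNewLayer.cl" )
  // ]]]
  const char *kernelSource =
    "kernel void myKernel(global float *out_data, int out_offset) {\n"
    "  global float *out = out_data + out_offset;\n"
    "  out[get_global_id(0)] = 0.0f;\n"
    "}\n";
  // [[[end]]]
  return kernelSource;
}
```

Because the source is embedded as a string, the .cl file does not need to be located at runtime; the cost is that the build has to be re-run with DEV_RUN_COG set to ON whenever the .cl file changes.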
Calling the kernel comprises the following parts:
- create a kernel templater, something like
TemplatedKernel kernelBuilder(THClState_getCl(state));
- If there are any templated parameters to replace (not discussed in this doc yet), you'll need to pass those to the templater now
- create the kernel
- you need to create a unique name. This will be used to look up the compiled kernel, so it can be re-used. If it is not sufficiently unique, it will collide with other kernels of the same name, and the wrong kernel will be called ;-)
- give the name of the cl file (this won't affect anything, it's just used for error messages; not so critical)
- you need to provide the name of the stringify function you created above
- you need to provide the exact name of the kernel function; if it's wrong, then the kernel won't be able to run
std::string uniqueName = __FILE__ "maxpool";
CLKernel *kernel = kernelBuilder.buildKernel(uniqueName, __FILE__,
SpatialMaxPooling_getKernelTemplate(), "maxpool");
- create a THClKernels object, from the kernel object you created just now
THClKernels k(state, kernel);
- pass in parameters
k.in(input);
k.out(output);
k.out(indices);
k.in((int)(nbatch*nInputPlane*nOutputCols*nOutputRows));
k.in((int)0);
k.in((int)nInputPlane);
k.in((int)nInputRows);
k.in((int)nInputCols);
k.in((int)kH);
k.in((int)kW);
k.in((int)dH);
k.in((int)dW);
- call the kernel :-)
k.run(blocks, threads);
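The warning above about unique names can be illustrated with a small name-keyed cache. This is only a sketch of the lookup-and-reuse behaviour described in the steps: FakeKernel, FakeKernelCache and makeUniqueName are hypothetical stand-ins, not the real CLKernel or TemplatedKernel classes, whose internals may differ.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for a compiled kernel; not the real CLKernel class.
struct FakeKernel {
  std::string functionName;
};

// Hypothetical name-keyed cache, illustrating lookup and re-use by unique name.
class FakeKernelCache {
 public:
  // Returns the cached kernel for uniqueName, building it only on first use.
  FakeKernel *buildKernel(const std::string &uniqueName,
                          const std::string &functionName) {
    auto it = kernels_.find(uniqueName);
    if (it == kernels_.end()) {
      it = kernels_.emplace(uniqueName, FakeKernel{functionName}).first;
    }
    return &it->second;
  }

 private:
  std::map<std::string, FakeKernel> kernels_;
};

// Adjacent string literals are concatenated by the compiler, so
// __FILE__ "maxpool" is a single literal that prefixes the kernel name
// with the source file name.
std::string makeUniqueName() {
  return __FILE__ "maxpool";
}
```

In this sketch, asking for a second, different kernel under a name that is already cached silently returns the first one, which is the collision to avoid. Prefixing with __FILE__ keeps names from different layers apart.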
Now build, and run it, and fix any issues :-P
I'm using OpenCL on an nVidia device, which means no kind of debugging or profiling is available to me.
What I tend to do is:
- focus on a single thread, on a single workgroup
- that's easy to do, just put:
if(get_global_id(0) == 0 && get_global_id(1) == 0 ) {
// only one thread here :-)
}
- comment out anything that changes the output tensor
// out_data[i] = sum;
- use our single thread to write out interesting data to the output tensor, that we can then read from the lua, like
if(get_global_id(0) == 0 && get_global_id(1) == 0 ) {
for(int i = 0; i < 6; i++ ) {
out_data[i] = smem[i];
}
}
- it's a bit more painful than using printf and cout and stuff, but it's workable :-)
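The single-thread debugging trick above can be tried out on the host before touching the kernel. The following sketch emulates a 4x4 grid of work-items in plain C++; debugWorkItem, runDebugExample, the grid size and the sample values are all hypothetical, with gid0 and gid1 standing in for get_global_id(0) and get_global_id(1), and smem standing in for the kernel's scratch buffer.

```cpp
#include <vector>

// Host-side emulation of the debugging trick (hypothetical names throughout).
void debugWorkItem(int gid0, int gid1,
                   const std::vector<float> &smem,
                   std::vector<float> &out_data) {
  float sum = 0.0f;
  for (float v : smem) {
    sum += v;
  }
  // The normal write to the output tensor is commented out while debugging:
  // out_data[gid1 * 4 + gid0] = sum;
  (void)sum;

  // Only one work-item dumps intermediate values into the output tensor.
  if (gid0 == 0 && gid1 == 0) {
    for (int i = 0; i < 6; i++) {
      out_data[i] = smem[i];
    }
  }
}

// Runs a 4x4 grid of emulated work-items and returns the "output tensor".
std::vector<float> runDebugExample() {
  std::vector<float> smem = {0.5f, 1.5f, 2.5f, 3.5f, 4.5f, 5.5f};
  std::vector<float> out_data(16, -1.0f);
  for (int gid1 = 0; gid1 < 4; gid1++) {
    for (int gid0 = 0; gid0 < 4; gid0++) {
      debugWorkItem(gid0, gid1, smem, out_data);
    }
  }
  return out_data;
}
```

Pre-filling the output with a sentinel such as -1 makes it easy to see which elements the debugging thread actually wrote when the tensor is read back from lua.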