
[Backend][hlib][v0.3] External IPs Integration Support for HeteroCL #170

Merged: 15 commits merged into cornell-zhang:v0.3 on May 2, 2020

Conversation

@hecmay (Collaborator) commented Mar 24, 2020

In this PR, we enable support for integrating HLS and RTL IPs into HeteroCL. The external IPs are pre-defined functions in hlib, consisting of both a functional behavior-level description (used for LLVM JIT simulation) and IP information (e.g., interface ports, IP file directory). Take the vector-add RTL IP as an example: users simply call the pre-defined function in hlib.

    A = hcl.placeholder(in_shape, name="A")
    B = hcl.placeholder(in_shape, name="B")

    def func(A, B):
        return hlib.op.extern.vector_add_rtl(A, B)

    s = hcl.create_schedule([A, B], func)

The IP integration will happen in the code generation phase, where the code generator creates the corresponding Makefile and XML options to integrate the RTL / HLS IPs.
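
The behavior-level description can also be exercised directly through HeteroCL's LLVM JIT simulation before any hardware flow is run. Below is a minimal sketch under assumed values (the 1-D in_shape and the random test data are only illustrative):

    import numpy as np
    import heterocl as hcl
    import hlib

    hcl.init()
    in_shape = (32,)                          # assumed shape for illustration
    A = hcl.placeholder(in_shape, name="A")
    B = hcl.placeholder(in_shape, name="B")

    def func(A, B):
        return hlib.op.extern.vector_add_rtl(A, B)

    s = hcl.create_schedule([A, B], func)
    f = hcl.build(s)                          # default target: LLVM JIT simulation

    hcl_A = hcl.asarray(np.random.randint(10, size=in_shape))
    hcl_B = hcl.asarray(np.random.randint(10, size=in_shape))
    hcl_C = hcl.asarray(np.zeros(in_shape))   # output buffer for the returned tensor
    f(hcl_A, hcl_B, hcl_C)                    # run the functional simulation
    print(hcl_C.asnumpy())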

Tutorial on Adding an HLSC/OpenCL IP into HeteroCL:

This tutorial walks you through the main steps to create, simulate, and deploy an HLS (i.e., HLSC or OpenCL) IP in HeteroCL. We take FFT (Fast Fourier Transform) as an example. How the FFT algorithm works is out of scope for this tutorial; please check this link if you are interested.

Create a behavior-level function

The behavior-level function is the functionally equivalent HeteroCL code of the HLS IP to be integrated. This part is recommended if you want to verify that the IP works correctly along with the other components in the program using HeteroCL's LLVM JIT simulation. The HeteroCL version of FFT is available in the master branch.

For the HeteroCL implementation of the algorithm, you can either create and return tensors, or update the passed-in tensors. The algorithm should be wrapped in a HeteroCL super stage, hcl.Stage("ExternModule") in this example. If you do not want to run any SW simulation, simply creating some dummy HeteroCL statements under the super stage also works (not recommended).

    import heterocl as hcl
    from hlib.op.extern import create_top_module

    def fft_module(X_real, X_imag):
        # step 1. create the behavior-level function for the soft IP
        with hcl.Stage("ExternModule") as Module:
            # implement the FFT logic with the HeteroCL API, e.g.
            # hcl.update(X_real, lambda *args: ...), or create the output
            # tensors F_real / F_imag with hcl.compute((L,), lambda i: ...)
            pass

        # step 2. configure the soft IP
        # (L is the FFT length; F_real / F_imag are the output tensors
        #  produced by the FFT logic above)
        dicts = dict()

        # IP function name
        dicts["name"] = "hls::fft<config>"

        # tensor inputs: (name, dtype) tuples passed to the IP function
        tensors = [X_real, X_imag, F_real, F_imag]
        dicts["args"] = [(_.name, _.dtype) for _ in tensors]

        # IP function header and calling convention
        dicts["header"] = """
    #include "hls_fft.h"
    #include <complex>
    struct config : hls::ip_fft::params_t {
      static const unsigned ordering_opt = hls::ip_fft::natural_order;
      static const unsigned config_width = 16; // FFT_CONFIG_WIDTH
    };
    typedef std::complex<ap_fixed<16,1>> fxpComplex;
    """

        # statements to be inserted around the IP function call
        dicts["ip_func"] = """
    hls::ip_fft::config_t<config> fft_config;
    hls::ip_fft::status_t<config> fft_status;
    fft_config.setDir(0);
    fft_config.setSch(0x2AB);
    fxpComplex xn[{}];
    fxpComplex xk[{}];
    for (int i = 0; i < {}; i++)
        xn[i] = fxpComplex({}[i], {}[i]);
    hls::fft<config>(xn, xk, &fft_config, &fft_status);
    for (int i = 0; i < {}; i++) {{
        {}[i] = xk[i].real();
        {}[i] = xk[i].imag();
    }}
    """.format(L, L, L, X_real.name, X_imag.name,
               L, F_real.name, F_imag.name)

        # pass the dictionary specifying the header, pre-function and post-function cfg
        create_top_module(Module, dicts, ip_type="hls")

        return F_real, F_imag  # returned for use in the host program (see below)

Configure the inputs, outputs, and core logic of the software IP module

To configure the IP and let HeteroCL integrate it, you need to pass the IP information to the create_top_module function provided by HeteroCL, as shown in the snippet above. We use this function to create a top-level module (which will be mapped to, e.g., an OpenCL kernel function in the code generation stage) for the soft IP. We also support integrating the IP within a top module.

The dicts argument is the core of the HLS IP integration process: it allows users to directly insert raw HLS statements into the HeteroCL program. Since most advanced C/C++ features cannot be expressed with HeteroCL, we leave the IP configuration to users to keep the IP integration flexible. Users can insert HLS code into the header, as well as right before and after the IP function.

Notice that the input and output arguments must be tensors; if users want to use an IP function with other data types, like the complex data type in this example, the conversion logic must be implemented in dicts["ip_func"]. In a later release, we plan to add an automatic detection algorithm to generate the data type conversion logic.
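
In short, the dicts argument boils down to four entries. A schematic sketch (names taken from the FFT example above, with the long string bodies elided):

    # schematic view of the IP configuration dictionary
    dicts = dict()
    dicts["name"] = "hls::fft<config>"                     # IP function name
    dicts["args"] = [(t.name, t.dtype) for t in tensors]   # tensor (name, dtype) pairs
    dicts["header"] = "..."    # raw HLS code emitted into the generated header
    dicts["ip_func"] = "..."   # raw HLS code wrapping the IP call, incl. type conversion
    create_top_module(Module, dicts, ip_type="hls")        # register the soft IP module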

Data movement with HLS IP

There are three IP types (RTL / HLS / Host). IP cores of type RTL and HLS must be moved to the device scope using .to (as shown in the snippet below). The IP core is the minimum placement unit from the view of the data placement API; namely, you cannot move tensors inside an IP core back and forth between device and host.

    A = hcl.placeholder(in_shape, name="A")
    B = hcl.placeholder(in_shape, name="B")

    def kernel(A, B):
        real, imag = fft_module(A, B)
        return hcl.compute((length,), lambda x:
            hcl.sqrt(real[x] * real[x] + imag[x] * imag[x]), name="abs")

    s = hcl.create_schedule([A, B], kernel)
    s.to([A, B], target.xcel)

The code for this example is available here: https://github.com/Hecmay/heterocl/blob/extern/hlib/python/hlib/op/extern.py#L202

@zhangzhiru commented Mar 24, 2020

This is an excellent starting point.

What does .op mean?
Also, do we have to put external libs under hlib? We need to be more careful naming the libraries. In this case, we need to have a separate lib for Xilinx and further separate the HLS and RTL IPs.

@hecmay (Collaborator, Author) commented Mar 24, 2020

> This is an excellent starting point.
>
> What does .op mean?
> Also, do we have to put external libs under hlib? We need to be more careful naming the libraries. In this case, we need to have a separate lib for Xilinx and further separate the HLS and RTL IPs.

.op means operator. hlib.op includes many common operations (e.g., exp or NN layers). I put the external IP APIs at the same level for regularity and consistency. For now, all of the external libs are under the hlib folder.

Each IP core will be marked with a specific attribute indicating its target FPGA and level of abstraction. I will also add another IR pass to support automatic data type transformation for external IP calls (e.g., transforming a tensor to hls::stream<ap_axiu<>>).

@hecmay (Collaborator, Author) commented Mar 30, 2020

New features introduced in this PR:

  1. Code generator for TCL / Makefile: the integrated RTL IP is treated as a blackbox, for which we need to add additional flags to the Makefile as well as extra TCL scripts to specify the IP's port interface.
  2. New IR node for device placement: a new ExternModule IR node is introduced in this PR. This IR node wraps all statements running on a specific device (e.g., an SSD or another node in the cluster). The new code generator gives us more flexibility to support different devices with various requirements, for example:

     s.to(tensorA, target.host.Flash)
     s.to(tensorB, target.HBM)

@hecmay (Collaborator, Author) commented Apr 7, 2020

Integration granularity of the external RTL IPs.

Ideally, we want to integrate all the RTL IPs as blackboxes into our kernel program, where we can simply call the RTL IP as a sub-function and the EDA tool will replace the function call with the user-provided RTL code.

def kernel(image):
    out1 = hlib.op.extern.rtl.image_filter(image)
    out2 = hlib.op.extern.rtl.refine(out1)
    return out2

s.to(image, target.xcel)
s.to(kernel.out2, target.host)

However, to integrate RTL IPs into an HLSC program, we need an interface specification configuration file like https://github.com/Xilinx/HLS-Tiny-Tutorials/blob/master/misc_rtl_as_blackbox/rtl_model.json, which is oftentimes available from neither the users nor HeteroCL.

@seanlatias (Collaborator):
Can you fix the tests?

@seanlatias (Collaborator):
Please also replace your first post with your documentation so that users do not need to scroll down to see it.

@hecmay (Collaborator, Author) commented Apr 25, 2020

> Please also replace your first post with your documentation so that users do not need to scroll down to see it.

Moved the tutorial to the top. Will fix the test now.

@hecmay (Collaborator, Author) commented Apr 25, 2020

Data Movement in a Heterogeneous Memory System

In this proposal we use HBM as an example. The channel or bank allocation for DDR and PLRAM fits the same interface proposed here.

The assignment of HBM channels goes along with compute unit (CU) replication. We should assign different channels to each argument of each CU duplicate to maximize bandwidth. Here is the proposed interface:

1) We can specify the kernel number (i.e., how many CUs to duplicate) in the data movement API with the splitting_factor option. In this case, multiple CU duplicates are created, and the inputs are split evenly and assigned to different HBM channels (if the total number is greater than 32, some arguments will share the same HBM channel).

2) We can split the input tensors along a single dimension using the splitting_dim option. In this case, we reshape the input tensors and split them along a certain dimension. In the example below, we split the input tensor along the 0-th dimension, and 16 CU duplicates are generated accordingly.

A = hcl.placeholder(in_shape, name="A")
B = hcl.placeholder(in_shape, name="B")

def kernel(...):
    # algorithm...

# create custom platform 
config = {
    "host": hcl.device.cpu("intel", "e5"),
    "xcel": {
        hcl.device.fpga("xilinx", "xcvu19p"),
        hcl.device.gpu("nvidia", "gtx-1080") 
    }
}
p = hcl.platform.custom(config)

# case 1. move tensors to HBM with splitting factor: the input tensors are 
# split into multiple pieces and each piece assigned to a separate CU
A_new, B_new = s.to([A, B], p.xcel, if=p.xcel.hbm, splitting_factor=3)

# case 2. assign the channel explicitly with a single CU
A_new, B_new = s.to([A, B], p.xcel, if=p.xcel.hbm.bank0)

# case 3. reshape and split along a certain dimension
s.reshape([A, B], (2, 16))
A_new, B_new = s.to([A, B], p.xcel, if=p.xcel.hbm, splitting_dim=0)

@zhangzhiru:
> A_new, B_new = s.to([A, B], p.xcel, if=p.xcel.hbm.bank0)

This is a good starting point. As always, we need to streamline the terms. Does bank correspond to a virtual channel? Also, I suggest we use bank[0] instead of bank0.

@zhangzhiru commented Apr 25, 2020

> case 1. move tensors to HBM with splitting factor: the input tensors are
> split into multiple pieces and each piece assigned to a separate CU
> A_new, B_new = s.to([A, B], p.xcel, if=p.xcel.hbm, splitting_factor=3)

I don't think it's a good idea to mix compute and memory customizations. Here we should combine .to() with a separate .parallel() primitive to clearly indicate which kernel we are duplicating.

@zhangzhiru:
> s.reshape([A, B], (2, 16))
> A_new, B_new = s.to([A, B], p.xcel, if=p.xcel.hbm, splitting_dim=0)

Similar to my previous comment, we shall cascade .to() with another reshape/partition primitive. It's really important not to entangle multiple optimizations in one primitive.

@hecmay (Collaborator, Author) commented Apr 25, 2020

> case 1. move tensors to HBM with splitting factor: the input tensors are
> split into multiple pieces and each piece assigned to a separate CU
> A_new, B_new = s.to([A, B], p.xcel, if=p.xcel.hbm, splitting_factor=3)
>
> I don't think it's a good idea to mix compute and memory customizations. Here we should combine .to() with a separate .parallel() primitive to clearly indicate which kernel we are duplicating.

We do not have such a kernel here to apply the parallel primitive to; that is why I used this entangled approach as a workaround. All stages that depend on the tensors moved to the device form a kernel, as shown in the example here: if we move tensors A and B to the device and move tensor ret back to the host, then the combination of all stages in the middle (i.e., stage 1 to k) is considered the kernel in this program.

A = hcl.placeholder((10,))
B = hcl.placeholder((10,))

# stage 1 to stage k
# .... compute something

ret = hcl.compute((10,), lambda *args: ...)
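
Concretely, a fuller sketch of this pattern under the same assumptions (the stage bodies here are placeholders, and the .to endpoints follow the earlier examples in this thread):

    A = hcl.placeholder((10,), name="A")
    B = hcl.placeholder((10,), name="B")

    def kernel(A, B):
        # stage 1 ... stage k: anything computed from the tensors moved to the device
        C = hcl.compute((10,), lambda i: A[i] + B[i], name="stage1")
        ret = hcl.compute((10,), lambda i: C[i] * 2, name="ret")
        return ret

    s = hcl.create_schedule([A, B], kernel)
    s.to([A, B], target.xcel)        # inputs moved to the device
    s.to(kernel.ret, target.host)    # result moved back to the host
    # => the stages between the two .to calls ("stage1" ... "ret") form the kernel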

I cannot find a clean and concise way to specify the range of the kernel. @seanlatias Do you have any suggestions?

@zhangzhiru:
> split into multiple pieces and each piece assigned to a separate CU

I thought the CU you're referring to here has to correspond to a compute kernel that needs to be duplicated? If not, why are we moving the tensor to the device?

@hecmay (Collaborator, Author) commented Apr 26, 2020

The discussion for heterogeneous memory placement has been moved to #180.

@hecmay hecmay changed the title [WIP][Utils] External IPs Integration Support for HeteroCL [Utils] External IPs Integration Support for HeteroCL Apr 29, 2020
@hecmay hecmay requested a review from seanlatias May 1, 2020 03:01
@seanlatias seanlatias merged commit 220c7a6 into cornell-zhang:v0.3 May 2, 2020
@seanlatias seanlatias changed the title [Utils] External IPs Integration Support for HeteroCL [Backend][hlib] External IPs Integration Support for HeteroCL May 2, 2020
@seanlatias seanlatias added the v0.3 label May 3, 2020
@seanlatias seanlatias changed the title [Backend][hlib] External IPs Integration Support for HeteroCL [Backend][hlib][v0.3] External IPs Integration Support for HeteroCL May 3, 2020