
[Backend][hlib][v0.3] External IPs Integration Support for HeteroCL #170

Merged: 15 commits merged into cornell-zhang:v0.3 on May 2, 2020

Conversation

@hecmay (Collaborator) commented Mar 24, 2020

In this PR, we enable support for integrating HLS and RTL IPs into HeteroCL. The external IPs are pre-defined functions in hlib, consisting of both a functional behavior-level description (used for LLVM JIT simulation) and IP information (e.g., interface ports, IP file directory). Take the vector-add RTL IP as an example: users simply call the pre-defined function in hlib.

    A = hcl.placeholder(in_shape, name="A")
    B = hcl.placeholder(in_shape, name="B")

    def func(A, B):
        return hlib.op.extern.vector_add_rtl(A, B)

    s = hcl.create_schedule([A, B], func)

The IP integration will happen in the code generation phase, where the code generator creates the corresponding Makefile and XML options to integrate the RTL / HLS IPs.
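
The behavior-level description can also be exercised directly through HeteroCL's LLVM JIT simulation before any hardware flow is run. Below is a minimal sketch under assumed values (the 1-D in_shape and the random test data are only illustrative):

    import numpy as np
    import heterocl as hcl
    import hlib

    hcl.init()
    in_shape = (32,)                          # assumed shape for illustration
    A = hcl.placeholder(in_shape, name="A")
    B = hcl.placeholder(in_shape, name="B")

    def func(A, B):
        return hlib.op.extern.vector_add_rtl(A, B)

    s = hcl.create_schedule([A, B], func)
    f = hcl.build(s)                          # default target: LLVM JIT simulation

    hcl_A = hcl.asarray(np.random.randint(10, size=in_shape))
    hcl_B = hcl.asarray(np.random.randint(10, size=in_shape))
    hcl_C = hcl.asarray(np.zeros(in_shape))   # output buffer for the returned tensor
    f(hcl_A, hcl_B, hcl_C)                    # run the functional simulation
    print(hcl_C.asnumpy())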

Tutorial on Adding an HLSC/OpenCL IP into HeteroCL:

This tutorial walks you through the main steps to create, simulate, and deploy an HLS (i.e., HLSC or OpenCL) IP in HeteroCL. We take FFT (Fast Fourier Transform) as an example. How the FFT algorithm works is out of scope for this tutorial; please check this link if you are interested.

Create a behavior-level function

The behavior-level function is the functionally equivalent HeteroCL code of the HLS IP to be integrated. This part is recommended if you want to verify that the IP works correctly along with the other components in the program using HeteroCL's LLVM JIT simulation. The HeteroCL version of FFT is available in the master branch.

For the HeteroCL implementation of the algorithm, you can either create and return tensors, or update the passed-in tensors. The algorithm should be wrapped in a HeteroCL super stage, hcl.Stage("ExternModule") in this example. If you do not want to run any SW simulation, simply creating some dummy HeteroCL statements under the super stage also works (not recommended).

    import heterocl as hcl
    from hlib.op.extern import create_top_module

    def fft_module(X_real, X_imag):
        # step 1. create the behavior-level function for the soft IP
        with hcl.Stage("ExternModule") as Module:
            # implement the FFT logic with the HeteroCL API, e.g.
            # hcl.update(X_real, lambda *args: ...), or create the output
            # tensors F_real / F_imag with hcl.compute((L,), lambda i: ...)
            pass

        # step 2. configure the soft IP
        # (L is the FFT length; F_real / F_imag are the output tensors
        #  produced by the FFT logic above)
        dicts = dict()

        # IP function name
        dicts["name"] = "hls::fft<config>"

        # tensor inputs: (name, dtype) tuples passed to the IP function
        tensors = [X_real, X_imag, F_real, F_imag]
        dicts["args"] = [(_.name, _.dtype) for _ in tensors]

        # IP function header and calling convention
        dicts["header"] = """
    #include "hls_fft.h"
    #include <complex>
    struct config : hls::ip_fft::params_t {
      static const unsigned ordering_opt = hls::ip_fft::natural_order;
      static const unsigned config_width = 16; // FFT_CONFIG_WIDTH
    };
    typedef std::complex<ap_fixed<16,1>> fxpComplex;
    """

        # statements to be inserted around the IP function call
        dicts["ip_func"] = """
    hls::ip_fft::config_t<config> fft_config;
    hls::ip_fft::status_t<config> fft_status;
    fft_config.setDir(0);
    fft_config.setSch(0x2AB);
    fxpComplex xn[{}];
    fxpComplex xk[{}];
    for (int i = 0; i < {}; i++)
        xn[i] = fxpComplex({}[i], {}[i]);
    hls::fft<config>(xn, xk, &fft_config, &fft_status);
    for (int i = 0; i < {}; i++) {{
        {}[i] = xk[i].real();
        {}[i] = xk[i].imag();
    }}
    """.format(L, L, L, X_real.name, X_imag.name,
               L, F_real.name, F_imag.name)

        # pass the dictionary specifying the header, pre-function and post-function cfg
        create_top_module(Module, dicts, ip_type="hls")

        return F_real, F_imag  # returned for use in the host program (see below)

Configure the inputs, outputs, and core logic of the software IP module

To configure the IP and let HeteroCL integrate it, you need to pass the IP information to the create_top_module function provided by HeteroCL, as shown in the snippet above. We use this function to create a top-level module (which will be mapped to, e.g., an OpenCL kernel function in the code generation stage) for the soft IP. We also support integrating the IP within a top module.

The dicts argument is the core of the HLS IP integration process: it allows users to directly insert raw HLS statements into the HeteroCL program. Since most advanced C/C++ features cannot be expressed with HeteroCL, we leave the IP configuration to users to keep the IP integration flexible. Users can insert HLS code into the header, as well as right before and after the IP function.

Notice that the input and output arguments must be tensors; if users want to use an IP function with other data types, like the complex data type in this example, the conversion logic must be implemented in dicts["ip_func"]. In a later release, we plan to add an automatic detection algorithm to generate the data type conversion logic.
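
In short, the dicts argument boils down to four entries. A schematic sketch (names taken from the FFT example above, with the long string bodies elided):

    # schematic view of the IP configuration dictionary
    dicts = dict()
    dicts["name"] = "hls::fft<config>"                     # IP function name
    dicts["args"] = [(t.name, t.dtype) for t in tensors]   # tensor (name, dtype) pairs
    dicts["header"] = "..."    # raw HLS code emitted into the generated header
    dicts["ip_func"] = "..."   # raw HLS code wrapping the IP call, incl. type conversion
    create_top_module(Module, dicts, ip_type="hls")        # register the soft IP module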

Data movement with HLS IP

There are three IP types (RTL / HLS / Host). IP cores of type RTL and HLS must be moved to the device scope using .to (as shown in the snippet below). The IP core is the minimum placement unit from the view of the data placement API; namely, you cannot move tensors inside an IP core back and forth between device and host.

    A = hcl.placeholder(in_shape, name="A")
    B = hcl.placeholder(in_shape, name="B")

    def kernel(A, B):
        real, imag = fft_module(A, B)
        return hcl.compute((length,), lambda x:
            hcl.sqrt(real[x] * real[x] + imag[x] * imag[x]), name="abs")

    s = hcl.create_schedule([A, B], kernel)
    s.to([A, B], target.xcel)

The code for this example is available here: https://github.com/Hecmay/heterocl/blob/extern/hlib/python/hlib/op/extern.py#L202

@zhangzhiru commented Mar 24, 2020

This is an excellent starting point.

What does .op mean?
Also, do we have to put external libs under hlib? We need to be more careful naming the libraries. In this case, we need to have a separate lib for Xilinx and further separate the HLS and RTL IPs.

@hecmay (Collaborator, Author) commented Mar 24, 2020

> This is an excellent starting point.
>
> What does .op mean?
> Also, do we have to put external libs under hlib? We need to be more careful naming the libraries. In this case, we need to have a separate lib for Xilinx and further separate the HLS and RTL IPs.

.op means operator. hlib.op includes many common operations (e.g., exp or NN layers). I put the external IP APIs at the same level for regularity and consistency. For now, all of the external libs are under the hlib folder.

Each IP core will be marked with a specific attribute indicating its target FPGA and level of abstraction. I will also add another IR pass to support automatic data type transformation for external IP calls (e.g., transforming a tensor to hls::stream<ap_axiu<>>).

@hecmay (Collaborator, Author) commented Mar 30, 2020

New features introduced in this PR:

  1. Code generator for TCL / Makefile: the integrated RTL IP is treated as a blackbox, for which we need to add additional flags to the Makefile as well as extra TCL scripts to specify the IP's port interface.
  2. New IR node for device placement: a new ExternModule IR node is introduced in this PR. This IR node wraps all statements running on a specific device (e.g., an SSD or another node in the cluster). The new code generator gives us more flexibility to support different devices with various requirements, for example:

     s.to(tensorA, target.host.Flash)
     s.to(tensorB, target.HBM)

@hecmay (Collaborator, Author) commented Apr 7, 2020

Integration granularity of the external RTL IPs.

Ideally, we want to integrate all the RTL IPs as blackboxes into our kernel program, where we can simply call the RTL IP as a sub-function and the EDA tool will replace the function call with the user-provided RTL code.

def kernel(image):
    out1 = hlib.op.extern.rtl.image_filter(image)
    out2 = hlib.op.extern.rtl.refine(out1)
    return out2

s.to(image, target.xcel)
s.to(kernel.out2, target.host)

However, to integrate RTL IPs into an HLSC program, we need an interface specification configuration file like https://github.com/Xilinx/HLS-Tiny-Tutorials/blob/master/misc_rtl_as_blackbox/rtl_model.json, which is oftentimes available from neither the users nor HeteroCL.

@seanlatias (Collaborator):
Can you fix the tests?

@seanlatias (Collaborator):
Please also replace your first post with your documentation so that users do not need to scroll down to see it.

@hecmay (Collaborator, Author) commented Apr 25, 2020

> Please also replace your first post with your documentation so that users do not need to scroll down to see it.

Moved the tutorial to the top. Will fix the test now.

@hecmay (Collaborator, Author) commented Apr 25, 2020

Data Movement in a Heterogeneous Memory System

In this proposal we use HBM as an example. The channel or bank allocation for DDR and PLRAM fits the same interface proposed here.

The assignment of HBM channels goes along with compute unit (CU) replication. We should assign different channels to each argument of each CU duplicate to maximize bandwidth. Here is the proposed interface:

1) We can specify the kernel number (i.e., how many CUs to duplicate) in the data movement API with the splitting_factor option. In this case, multiple CU duplicates are created, and the inputs are split evenly and assigned to different HBM channels (if the total number is greater than 32, some arguments will share the same HBM channel).

2) We can split the input tensors along a single dimension using the splitting_dim option. In this case, we reshape the input tensors and split them along a certain dimension. In the example below, we split the input tensor along the 0-th dimension, and 16 CU duplicates are generated accordingly.

A = hcl.placeholder(in_shape, name="A")
B = hcl.placeholder(in_shape, name="B")

def kernel(...):
    # algorithm...

# create custom platform 
config = {
    "host": hcl.device.cpu("intel", "e5"),
    "xcel": {
        hcl.device.fpga("xilinx", "xcvu19p"),
        hcl.device.gpu("nvidia", "gtx-1080") 
    }
}
p = hcl.platform.custom(config)

# case 1. move tensors to HBM with splitting factor: the input tensors are 
# split into multiple pieces and each piece assigned to a separate CU
A_new, B_new = s.to([A, B], p.xcel, if=p.xcel.hbm, splitting_factor=3)

# case 2. assign the channel explicitly with a single CU
A_new, B_new = s.to([A, B], p.xcel, if=p.xcel.hbm.bank0)

# case 3. reshape and split along a certain dimension
s.reshape([A, B], (2, 16))
A_new, B_new = s.to([A, B], p.xcel, if=p.xcel.hbm, splitting_dim=0)

@zhangzhiru:
> A_new, B_new = s.to([A, B], p.xcel, if=p.xcel.hbm.bank0)

This is a good starting point. As always, we need to streamline the terms. Does bank correspond to a virtual channel? Also, I suggest we use bank[0] instead of bank0.

@zhangzhiru commented Apr 25, 2020

> case 1. move tensors to HBM with splitting factor: the input tensors are
> split into multiple pieces and each piece assigned to a separate CU
> A_new, B_new = s.to([A, B], p.xcel, if=p.xcel.hbm, splitting_factor=3)

I don't think it's a good idea to mix compute and memory customizations. Here we should combine .to() with a separate .parallel() primitive to clearly indicate which kernel we are duplicating.

@zhangzhiru:
> s.reshape([A, B], (2, 16))
> A_new, B_new = s.to([A, B], p.xcel, if=p.xcel.hbm, splitting_dim=0)

Similar to my previous comment, we shall cascade .to() with another reshape/partition primitive. It's really important not to entangle multiple optimizations in one primitive.

@hecmay (Collaborator, Author) commented Apr 25, 2020

> case 1. move tensors to HBM with splitting factor: the input tensors are
> split into multiple pieces and each piece assigned to a separate CU
> A_new, B_new = s.to([A, B], p.xcel, if=p.xcel.hbm, splitting_factor=3)
>
> I don't think it's a good idea to mix compute and memory customizations. Here we should combine .to() with a separate .parallel() primitive to clearly indicate which kernel we are duplicating.

We do not have such a kernel here to apply the parallel primitive to; that is why I used this entangled approach as a workaround. All stages that depend on the tensors moved to the device form a kernel, as shown in the example here: if we move tensors A and B to the device and move tensor ret back to the host, then the combination of all stages in the middle (i.e., stage 1 to k) is considered the kernel in this program.

A = hcl.placeholder((10,))
B = hcl.placeholder((10,))

# stage 1 to stage k
# .... compute something

ret = hcl.compute((10,), lambda *args: ...)
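
Concretely, a fuller sketch of this pattern under the same assumptions (the stage bodies here are placeholders, and the .to endpoints follow the earlier examples in this thread):

    A = hcl.placeholder((10,), name="A")
    B = hcl.placeholder((10,), name="B")

    def kernel(A, B):
        # stage 1 ... stage k: anything computed from the tensors moved to the device
        C = hcl.compute((10,), lambda i: A[i] + B[i], name="stage1")
        ret = hcl.compute((10,), lambda i: C[i] * 2, name="ret")
        return ret

    s = hcl.create_schedule([A, B], kernel)
    s.to([A, B], target.xcel)        # inputs moved to the device
    s.to(kernel.ret, target.host)    # result moved back to the host
    # => the stages between the two .to calls ("stage1" ... "ret") form the kernel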

I cannot find a clean and concise way to specify the range of the kernel. @seanlatias Do you have any suggestions?

@zhangzhiru:
> split into multiple pieces and each piece assigned to a separate CU

I thought the CU you're referring to here has to correspond to a compute kernel that needs to be duplicated? If not, why are we moving the tensor to the device?

@hecmay (Collaborator, Author) commented Apr 26, 2020

The discussion for heterogeneous memory placement has been moved to #180.

@hecmay hecmay changed the title [WIP][Utils] External IPs Integration Support for HeteroCL [Utils] External IPs Integration Support for HeteroCL Apr 29, 2020
@hecmay hecmay requested a review from seanlatias May 1, 2020 03:01
@seanlatias seanlatias merged commit 220c7a6 into cornell-zhang:v0.3 May 2, 2020
@seanlatias seanlatias changed the title [Utils] External IPs Integration Support for HeteroCL [Backend][hlib] External IPs Integration Support for HeteroCL May 2, 2020
@seanlatias seanlatias added the v0.3 label May 3, 2020
@seanlatias seanlatias changed the title [Backend][hlib] External IPs Integration Support for HeteroCL [Backend][hlib][v0.3] External IPs Integration Support for HeteroCL May 3, 2020