diff --git a/CHANGELOG.md b/CHANGELOG.md index 65b05e3a4..1251d790b 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -29,6 +29,7 @@ mimpid = 0x01040312 -> Version 01.04.03.12 -> v1.4.3.12 | Date | Version | Comment | Link | |:----:|:-------:|:--------|:----:| +| 18.03.2024 | 1.9.6.9 | :sparkles: update CFU example: now implementing the Extended Tiny Encryption Algorithm (XTEA) | [#855](https://github.com/stnolting/neorv32/pull/855) | | 16.03.2024 | 1.9.6.8 | rework cache system: L1 + L2 caches, all based on the generic cache component | [#853](https://github.com/stnolting/neorv32/pull/853) | | 16.03.2024 | 1.9.6.7 | cache optimizations: add read-only option, add option to disable direct/uncached accesses | [#851](https://github.com/stnolting/neorv32/pull/851) | | 15.03.2024 | 1.9.6.6 | :warning: clean-up configuration generics (remove XBUS endianness configuration; refine JEDED/VENDORID configuration); rearrange SYSINFO.SOC bits | [#850](https://github.com/stnolting/neorv32/pull/850) | diff --git a/docs/datasheet/cpu_cfu.adoc b/docs/datasheet/cpu_cfu.adoc index cb74ed92f..2e738acc0 100644 --- a/docs/datasheet/cpu_cfu.adoc +++ b/docs/datasheet/cpu_cfu.adoc @@ -2,8 +2,8 @@ :sectnums: === Custom Functions Unit (CFU) -The Custom Functions Unit is the central part of the <<_zxcfu_isa_extension>> and represents -the actual hardware module, which can be used to implement _custom RISC-V instructions_. +The Custom Functions Unit (CFU) is the central part of the NEORV32-specific <<_zxcfu_isa_extension>> and +represents the actual hardware module that can be used to implement **custom RISC-V instructions**. The CFU is intended for operations that are inefficient in terms of performance, latency, energy consumption or program memory requirements when implemented entirely in software. Some potential application fields and exemplary @@ -13,23 +13,27 @@ use-cases might include: * **Cryptographic:** bit substitution and permutation * **Communication:** conversions like binary to gray-code; multiply-add operations * **Image processing:** look-up-tables for color space transformations -* implementing instructions from **other RISC-V ISA extensions** that are not yet supported by the NEORV32 +* implementing instructions from **other RISC-V ISA extensions** that are not yet supported by NEORV32 -[NOTE] -The CFU is not intended for complex and _CPU-independent_ functional units that implement complete accelerators +The CFU is not intended for complex and **CPU-independent** functional units that implement complete accelerators (like block-based AES encryption). These kind of accelerators should be implemented as memory-mapped <<_custom_functions_subsystem_cfs>>. A comparison of all NEORV32-specific chip-internal hardware extension options is provided in the user guide section https://stnolting.github.io/neorv32/ug/#_adding_custom_hardware_modules[Adding Custom Hardware Modules]. +.Default CFU Hardware Example +[TIP] +The default CFU module (`rtl/core/neorv32_cpu_cp_cfu.vhd`) implements the _Extended Tiny Encryption Algorithm (XTEA)_ +as "real world" application example. + :sectnums: ==== CFU Instruction Formats The custom instructions executed by the CFU utilize a specific opcode space in the `rv32` 32-bit instruction -space that has been explicitly reserved for user-defined extensions by the RISC-V specifications ("Guaranteed -Non-Standard Encoding Space"). The NEORV32 CFU uses the `custom` opcodes to identify the instructions implemented -by the CFU and to differentiate between the different instruction formats. The according binary encoding of these +encoding space that has been explicitly reserved for user-defined extensions by the RISC-V specifications ("Guaranteed +Non-Standard Encoding Space"). The NEORV32 CFU uses the `custom-*` opcodes to identify the instructions implemented +by the CFU and to differentiate between the available instruction formats. The according binary encoding of these opcodes is shown below: * `custom-0`: `0001011` RISC-V standard, used for <<_cfu_r3_type_instructions>> @@ -44,9 +48,10 @@ opcodes is shown below: The R3-type CFU instructions operate on two source registers `rs1` and `rs2` and return the processing result to the destination register `rd`. The actual operation can be defined by using the `funct7` and `funct3` bit fields. These immediates can also be used to pass additional data to the CFU like offsets, look-up-tables addresses or -shift-amounts. However, the actual functionality is entirely user-defined. +shift-amounts. However, the actual functionality is entirely user-defined. Note that all immediate values are +always compile-time-static. -Example operation: `rd <= rs1 xnor rs2` +Example operation: `rd <= rs1 xnor rs2` (bit-wise XNOR) .CFU R3-type instruction format image::cfu_r3type_instruction.png[align=center] @@ -74,9 +79,10 @@ R3-type instructions can be implemented (7-bit + 3-bit = 10 bit -> 1024 differen The R4-type CFU instructions operate on three source registers `rs1`, `rs2` and `rs2` and return the processing result to the destination register `rd`. The actual operation can be defined by using the `funct3` bit field. Alternatively, this immediate can also be used to pass additional data to the CFU like offsets, look-up-tables -addresses or shift-amounts. However, the actual functionality is entirely user-defined. +addresses or shift-amounts. However, the actual functionality is entirely user-defined. Note that all immediate +values are always compile-time-static. -Example operation: `rd <= (rs1 * rs2 + rs3)[31:0]` +Example operation: `rd <= (rs1 * rs2 + rs3)[31:0]` (multiply-and-accumulate; "MAC") .CFU R4-type instruction format image::cfu_r4type_instruction.png[align=center] @@ -111,9 +117,9 @@ The R5-type CFU instructions operate on four source registers `rs1`, `rs2`, `rs3 processing result to the destination register `rd`. As all bits of the instruction word are used to encode the five registers and the opcode, no further immediate bits are available to specify the actual operation. There are two different R5-type instruction with two different opcodes available. Hence, only two R5-type operations -can be implemented out of the box. +can be implemented by default. -Example operation: `rd <= rs1 & rs2 & rs3 & rs4` +Example operation: `rd <= rs1 & rs2 & rs3 & rs4` (bit-wise AND of 4 operands) .CFU R5-type instruction A format image::cfu_r5type_instruction_a.png[align=center] @@ -207,7 +213,6 @@ neorv32_cpu_csr_write(CSR_CFUREG0, 0xabcdabcd); // write data to CFU CSR 0 uint32_t tmp = neorv32_cpu_csr_read(CSR_CFUREG3); // read data from CFU CSR 3 ---- - .Additional CFU-internal CSRs [TIP] If more than four CFU-internal CSRs are required the designer can implement an "indirect access mechanism" based @@ -215,35 +220,35 @@ on just two of the default CSRs: one CSR is used to configure the index while th data with the indexed CFU-internal CSR - this concept is similar to the RISC-V Indirect CSR Access Extension Specification (`Smcsrind`). +.Security Considerations +[NOTE] +The CFU CSRs are mapped to the user-mode CSR space so software running at _any privilege level_ can access these +CSRs. However, accesses can be constrained to certain privilege level (see <<_custom_instructions_hardware>>). + :sectnums: ==== Custom Instructions Hardware The actual functionality of the CFU's custom instructions is defined by the user-defined logic inside -the CFU hardware module `rtl/core/neorv32_cpu_cp_cfu.vhd`. +the CFU hardware module `rtl/core/neorv32_cpu_cp_cfu.vhd`. This file is highly commented to illustrate the +hardware design considerations. CFU operations can be entirely combinatorial (like bit-reversal) so the result is available at the end of the current clock cycle. Operations can also take several clock cycles to complete (like multiplications) and may also include internal states and memories. The CFU's internal control unit takes care of interfacing the custom user logic to the CPU pipeline. -.CFU Hardware Example & More Details -[TIP] -The default CFU hardware module already implement some exemplary instructions that are used for illustration -by the CFU example program. See the CFU's VHDL source file (`rtl/core/neorv32_cpu_cp_cfu.vhd`), which -is highly commented to explain the available signals, implementation options and the handshake with the CPU pipeline. - .CFU Hardware Resource Requirements [NOTE] Enabling the CFU and actually implementing R4-type and/or R5-type instructions (or more precisely, using the according operands for the CFU hardware) will add one or two, respectively, additional read ports to the core's register file significantly increasing resource requirements. -.CFU Access +.CFU Access Privilege Levels [NOTE] The CFU is accessible from all privilege modes (including CFU-internal registers accessed via the indirects CSR access mechanism). It is the task of the CFU designers to add according access-constraining logic if certain CFU -states shall not be exposed to all privilege levels (i.e. exncryption keys). +states shall not be exposed to all privilege levels (i.e. encryption keys). .CFU Execution Time [NOTE] diff --git a/rtl/core/neorv32_cpu_cp_cfu.vhd b/rtl/core/neorv32_cpu_cp_cfu.vhd index 36f68d8bc..86fd3136f 100644 --- a/rtl/core/neorv32_cpu_cp_cfu.vhd +++ b/rtl/core/neorv32_cpu_cp_cfu.vhd @@ -1,5 +1,5 @@ -- ################################################################################################# --- # << NEORV32 CPU - Co-Processor: Custom (Instructions) Functions Unit >> # +-- # << NEORV32 CPU - Co-Processor: Custom (RISC-V Instructions) Functions Unit (CFU) >> # -- # ********************************************************************************************* # -- # For custom/user-defined RISC-V instructions (R3-type, R4-type and R5-type formats). See the # -- # CPU's documentation for more information. Also take a look at the "software-counterpart" of # @@ -67,7 +67,7 @@ end neorv32_cpu_cp_cfu; architecture neorv32_cpu_cp_cfu_rtl of neorv32_cpu_cp_cfu is - -- CFU Control - do not modify! ---------------------------- + -- CFU Control --------------------------------------------- -- ------------------------------------------------------------ type control_t is record busy : std_ulogic; -- CFU is busy @@ -85,29 +85,39 @@ architecture neorv32_cpu_cp_cfu_rtl of neorv32_cpu_cp_cfu is constant r5typeA_c : std_ulogic_vector(1 downto 0) := "10"; -- R5-type instruction A (custom-2 opcode) constant r5typeB_c : std_ulogic_vector(1 downto 0) := "11"; -- R5-type instruction B (custom-3 opcode) - -- User-Defined Logic -------------------------------------- -- ------------------------------------------------------------ - -- multiply-add unit (r4-type instruction example) -- - type madd_t is record - sreg : std_ulogic_vector(2 downto 0); -- 3 cycles latency = 3 bits in arbitration shift register - done : std_ulogic; - -- - opa : std_ulogic_vector(XLEN-1 downto 0); - opb : std_ulogic_vector(XLEN-1 downto 0); - opc : std_ulogic_vector(XLEN-1 downto 0); - mul : std_ulogic_vector(2*XLEN-1 downto 0); - res : std_ulogic_vector(2*XLEN-1 downto 0); + -- xtea instructions (funct3 bit-field) -- + constant xtea_enc_v0_c : std_ulogic_vector(2 downto 0) := "000"; + constant xtea_enc_v1_c : std_ulogic_vector(2 downto 0) := "001"; + constant xtea_dec_v0_c : std_ulogic_vector(2 downto 0) := "010"; + constant xtea_dec_v1_c : std_ulogic_vector(2 downto 0) := "011"; + constant xtea_init_c : std_ulogic_vector(2 downto 0) := "100"; + + -- xtea round-key adjusting -- + constant xtea_delta_c : std_ulogic_vector(31 downto 0) := x"9e3779b9"; + + -- xtea key storage (accessed via CFU CSRs) -- + type key_mem_t is array (0 to 3) of std_ulogic_vector(31 downto 0); + signal key_mem : key_mem_t; + + -- xtea processing logic -- + type xtea_t is record + done : std_ulogic_vector(1 downto 0); -- multi-cycle operation SREG + opa : std_ulogic_vector(31 downto 0); -- input operand a + opb : std_ulogic_vector(31 downto 0); -- input operand b + sum : std_ulogic_vector(31 downto 0); -- round key buffer + res : std_ulogic_vector(31 downto 0); -- operation results end record; - signal madd : madd_t; + signal xtea : xtea_t; - -- custom control and status registers (CSRs) -- - signal cfu_csr_0, cfu_csr_1 : std_ulogic_vector(XLEN-1 downto 0); + -- xtea helper -- + signal tmp_a, tmp_b, tmp_x, tmp_y, tmp_z, tmp_r : std_ulogic_vector(31 downto 0); begin -- ************************************************************************************************************************** - -- This controller is required to handle the CFU <-> CPU interface. Do not modify! + -- This controller is required to handle the CFU <-> CPU interface. -- ************************************************************************************************************************** -- CFU Controller ------------------------------------------------------------------------- @@ -122,13 +132,13 @@ begin control.busy <= '0'; elsif rising_edge(clk_i) then res_o <= (others => '0'); -- default; all CPU co-processor outputs are logically OR-ed - if (control.busy = '0') then -- idle - if (start_i = '1') then -- trigger new CFU operation - control.busy <= '1'; + if (control.busy = '0') then -- CFU is idle + control.busy <= start_i; -- trigger new CFU operation + else -- CFU operation in progress + res_o <= control.result; -- output result only if CFU is processing; has to be all-zero otherwise + if (control.done = '1') or (ctrl_i.cpu_trap = '1') then -- operation done or abort if trap (exception) + control.busy <= '0'; end if; - elsif (control.done = '1') or (ctrl_i.cpu_trap = '1') then -- operation done? abort if trap (exception) - res_o <= control.result; -- output result for just one cycle, CFU output has to be all-zero otherwise - control.busy <= '0'; end if; end if; end process cfu_control; @@ -143,7 +153,7 @@ begin -- ************************************************************************************************************************** - -- CFU Interface Documentation + -- CFU Hardware Documentation -- ************************************************************************************************************************** -- ---------------------------------------------------------------------------------------- @@ -221,7 +231,6 @@ begin -- -- [NOTE] If the signal is not set within a bound time window (default = 512 cycles) the CFU operation is -- automatically terminated by the hardware and an illegal instruction exception is raised. This feature can also be - -- be used to implement custom CFU exceptions (for example to indicate invalid CFU operations). -- ---------------------------------------------------------------------------------------- -- CFU-Internal Control and Status Registers (CFU-CSRs) @@ -241,145 +250,130 @@ begin -- ************************************************************************************************************************** - -- Actual CFU User Logic Example - replace this with your custom logic + -- Actual CFU User Logic Example: XTEA - Extended Tiny Encryption Algorithm (replace this with your custom logic) -- ************************************************************************************************************************** - -- CFU-Internal Control and Status Registers (CFU-CSRs) ----------------------------------- + -- This CFU example implements the Extended Tiny Encryption Algorithm (XTEA). + -- The CFU provides 5 custom instructions to accelerate encryption and decryption using dedicated hardware. + -- The RTL code is not optimized (not for area, not for clock speed, not for performance) and was + -- implemented according to a software C reference (https://de.wikipedia.org/wiki/Extended_Tiny_Encryption_Algorithm). + + + -- CFU-Internal Control and Status Registers (CFU-CSRs): 128-Bit Key Storage -------------- -- ------------------------------------------------------------------------------------------- -- synchronous write access -- csr_write_access: process(rstn_i, clk_i) begin if (rstn_i = '0') then - cfu_csr_0 <= (others => '0'); - cfu_csr_1 <= (others => '0'); + key_mem <= (others => (others => '0')); elsif rising_edge(clk_i) then - if (csr_we_i = '1') and (csr_addr_i = "00") then - cfu_csr_0 <= csr_wdata_i; - end if; - if (csr_we_i = '1') and (csr_addr_i = "01") then - cfu_csr_1 <= csr_wdata_i; + if (csr_we_i = '1') then + key_mem(to_integer(unsigned(csr_addr_i))) <= csr_wdata_i; end if; end if; end process csr_write_access; -- asynchronous read access -- - csr_read_access: process(csr_addr_i, cfu_csr_0, cfu_csr_1) - begin - case csr_addr_i is - when "00" => csr_rdata_o <= cfu_csr_0; -- CSR0: simple read/write register - when "01" => csr_rdata_o <= cfu_csr_1; -- CSR1: simple read/write register - when "10" => csr_rdata_o <= x"1234abcd"; -- CSR2: hardwired/read-only register - when others => csr_rdata_o <= (others => '0'); -- CSR3: not implemented - end case; - end process csr_read_access; + csr_rdata_o <= key_mem(to_integer(unsigned(csr_addr_i))); - -- Iterative Multiply-Add Unit ------------------------------------------------------------ + -- XTEA Processing Core ------------------------------------------------------------------ -- ------------------------------------------------------------------------------------------- - -- iteration control -- - madd_control: process(rstn_i, clk_i) + xtea_core: process(rstn_i, clk_i) begin if (rstn_i = '0') then - madd.sreg <= (others => '0'); + xtea.done <= (others => '0'); + xtea.opa <= (others => '0'); + xtea.opb <= (others => '0'); + xtea.sum <= (others => '0'); + xtea.res <= (others => '0'); elsif rising_edge(clk_i) then - -- operation trigger -- - if (control.busy = '0') and -- CFU is idle (ready for next operation) - (start_i = '1') and -- CFU is actually triggered by a custom instruction word - (control.rtype = r4type_c) and -- this is a R4-type instruction - (control.funct3(2 downto 1) = "00") then -- trigger only for specific funct3 values - madd.sreg(0) <= '1'; - else - madd.sreg(0) <= '0'; + + -- shift register for computation delay -- + xtea.done(0) <= '0'; -- default + xtea.done(1) <= xtea.done(0); + + -- trigger new operation -- + if (start_i = '1') and (control.rtype = r3type_c) then -- execution trigger and correct instruction type + xtea.opa <= rs1_i; -- buffer input operand rs1 (for improved physical timing) + xtea.opb <= rs2_i; -- buffer input operand rs2 (for improved physical timing) + xtea.done(0) <= '1'; -- result is available in the 2nd cycle end if; - -- simple shift register for tracking operation -- - madd.sreg(madd.sreg'left downto 1) <= madd.sreg(madd.sreg'left-1 downto 0); -- shift left - end if; - end process madd_control; - -- processing has reached last stage (= done) when sreg's MSB is set -- - madd.done <= madd.sreg(madd.sreg'left); + -- data processing -- + if (xtea.done(0) = '1') then -- second-stage execution trigger + -- update "sum" round key -- + if (control.funct3(2) = '1') then -- initialize + xtea.sum <= xtea.opa; -- set initial round key + elsif (control.funct3(1 downto 0) = xtea_enc_v0_c(1 downto 0)) then -- encrypt v0 + xtea.sum <= std_ulogic_vector(unsigned(xtea.sum) + unsigned(xtea_delta_c)); + elsif (control.funct3(1 downto 0) = xtea_dec_v1_c(1 downto 0)) then -- decrypt v1 + xtea.sum <= std_ulogic_vector(unsigned(xtea.sum) - unsigned(xtea_delta_c)); + end if; + -- process "v" operands -- + if (control.funct3(1) = '0') then -- encrypt + xtea.res <= std_ulogic_vector(unsigned(tmp_b) + unsigned(tmp_r)); + else -- decrypt + xtea.res <= std_ulogic_vector(unsigned(tmp_b) - unsigned(tmp_r)); + end if; + end if; - -- arithmetic core -- - madd_core: process(rstn_i, clk_i) - begin - if (rstn_i = '0') then - madd.opa <= (others => '0'); - madd.opb <= (others => '0'); - madd.opc <= (others => '0'); - madd.mul <= (others => '0'); - madd.res <= (others => '0'); - elsif rising_edge(clk_i) then - -- stage 0: buffer input operands -- - madd.opa <= rs1_i; - madd.opb <= rs2_i; - madd.opc <= rs3_i; - -- stage 1: multiply rs1 and rs2 -- - madd.mul <= std_ulogic_vector(unsigned(madd.opa) * unsigned(madd.opb)); - -- stage 2: add rs3 to multiplication result -- - madd.res <= std_ulogic_vector(unsigned(madd.mul) + unsigned(madd.opc)); end if; - end process madd_core; + end process xtea_core; + + -- helpers -- + tmp_a <= xtea.opb when (control.funct3(0) = '0') else xtea.opa; -- v1 / v0 select + tmp_b <= xtea.opa when (control.funct3(0) = '0') else xtea.opb; -- v0 / v1 select + tmp_x <= xtea.opb(27 downto 0) & "0000" when (control.funct3(0) = '0') else xtea.opa(27 downto 0) & "0000"; -- v << 4 + tmp_y <= "00000" & xtea.opb(31 downto 5) when (control.funct3(0) = '0') else "00000" & xtea.opa(31 downto 5); -- v >> 5 + tmp_z <= key_mem(to_integer(unsigned(xtea.sum(1 downto 0)))) when (control.funct3(0) = '0') else -- key[sum & 3] + key_mem(to_integer(unsigned(xtea.sum(12 downto 11)))); -- key[(sum >> 11) & 3] + tmp_r <= std_ulogic_vector(unsigned(tmp_x xor tmp_y) + unsigned(tmp_a)) xor std_ulogic_vector(unsigned(xtea.sum) + unsigned(tmp_z)); - -- Output select -------------------------------------------------------------------------- + -- Function Result Select ----------------------------------------------------------------- -- ------------------------------------------------------------------------------------------- - out_select: process(control, rs1_i, rs2_i, rs3_i, rs4_i, madd) + result_select: process(control, xtea) begin case control.rtype is - when r3type_c => -- R3-type instructions + when r3type_c => -- R3-type instructions; function select via "funct7" and "funct3" -- ---------------------------------------------------------------------- - -- This is a simple ALU that implements four pure-combinatorial instructions. - -- The actual function is selected by the "funct3" bit-field. - case control.funct3 is - when "000" => -- funct3 = "000": bit-reversal of rs1 - control.result <= bit_rev_f(rs1_i); + case control.funct3 is -- Just check "funct3" here; "funct7" bit-field is ignored + when xtea_enc_v0_c | xtea_enc_v1_c | xtea_dec_v0_c | xtea_dec_v1_c => -- encryption/decryption + control.result <= xtea.res; -- processing result + control.done <= xtea.done(xtea.done'left); -- multi-cycle processing done when set + when others => -- initialization and all further unspecified operations + control.result <= (others => '0'); -- just output zero control.done <= '1'; -- pure-combinatorial, so we are done "immediately" - when "001" => -- funct3 = "001": XNOR input operands - control.result <= not (rs1_i xor rs2_i); - control.done <= '1'; -- pure-combinatorial, so we are done "immediately" - when others => -- not implemented - control.result <= (others => '0'); - control.done <= '0'; -- this will cause an illegal instruction exception after timeout end case; - when r4type_c => -- R4-type instructions + when r4type_c => -- R4-type instructions; function select via "funct3" -- ---------------------------------------------------------------------- - -- This is an iterative multiply-and-add unit that requires several cycles for processing. - -- The actual function is selected by the lowest bit of the "funct3" bit-field. - case control.funct3 is - when "000" => -- funct3 = "000": multiply-add low-part result: rs1*rs2+r3 [31:0] - control.result <= madd.res(31 downto 0); - control.done <= madd.done; -- iterative, wait for unit to finish - when "001" => -- funct3 = "001": multiply-add high-part result: rs1*rs2+r3 [63:32] - control.result <= madd.res(63 downto 32); - control.done <= madd.done; -- iterative, wait for unit to finish - when others => -- not implemented - control.result <= (others => '0'); - control.done <= '0'; -- this will cause an illegal instruction exception after timeout - end case; + control.result <= (others => '0'); -- no logic implemented + control.done <= '0'; -- this will cause an illegal instruction exception after timeout when r5typeA_c => -- R5-type instruction A -- ---------------------------------------------------------------------- -- No function/immediate bit-fields are available for this instruction type. -- Hence, there is just one operation that can be implemented. - control.result <= rs1_i and rs2_i and rs3_i and rs4_i; -- AND-all - control.done <= '1'; -- pure-combinatorial, so we are done "immediately" + control.result <= (others => '0'); -- no logic implemented + control.done <= '0'; -- this will cause an illegal instruction exception after timeout when r5typeB_c => -- R5-type instruction B -- ---------------------------------------------------------------------- -- No function/immediate bit-fields are available for this instruction type. -- Hence, there is just one operation that can be implemented. - control.result <= rs1_i xor rs2_i xor rs3_i xor rs4_i; -- XOR-all - control.done <= '1'; -- pure-combinatorial, so we are done "immediately" + control.result <= (others => '0'); -- no logic implemented + control.done <= '0'; -- this will cause an illegal instruction exception after timeout when others => -- undefined -- ---------------------------------------------------------------------- - control.result <= (others => '0'); - control.done <= '0'; + control.result <= (others => '0'); -- no logic implemented + control.done <= '0'; -- this will cause an illegal instruction exception after timeout end case; - end process out_select; + end process result_select; end neorv32_cpu_cp_cfu_rtl; diff --git a/rtl/core/neorv32_package.vhd b/rtl/core/neorv32_package.vhd index ca7ee24d4..bf72c4270 100644 --- a/rtl/core/neorv32_package.vhd +++ b/rtl/core/neorv32_package.vhd @@ -52,7 +52,7 @@ package neorv32_package is -- Architecture Constants ----------------------------------------------------------------- -- ------------------------------------------------------------------------------------------- - constant hw_version_c : std_ulogic_vector(31 downto 0) := x"01090608"; -- hardware version + constant hw_version_c : std_ulogic_vector(31 downto 0) := x"01090609"; -- hardware version constant archid_c : natural := 19; -- official RISC-V architecture ID constant XLEN : natural := 32; -- native data path width diff --git a/sw/example/demo_cfu/main.c b/sw/example/demo_cfu/main.c index e81c88492..e74325923 100644 --- a/sw/example/demo_cfu/main.c +++ b/sw/example/demo_cfu/main.c @@ -36,7 +36,7 @@ /**********************************************************************//** * @file demo_cfu/main.c * @author Stephan Nolting - * @brief Example program showing how to use the CFU's custom instructions. + * @brief Example program showing how to use the CFU's custom instructions (XTEA example). * @note Take a look at the highly-commented "hardware-counterpart" of this CFU * example in 'rtl/core/neorv32_cpu_cp_cfu.vhd'. **************************************************************************/ @@ -49,8 +49,59 @@ /**@{*/ /** UART BAUD rate */ #define BAUD_RATE 19200 -/** Number of test cases per CFU instruction */ -#define TESTCASES 4 +/** Number XTEA cycles */ +#define XTEA_CYCLES 20 +/** Input data size (in number of 32-bit words), has to be even */ +#define DATA_NUM 64 +/**@}*/ + + +/**********************************************************************//** + * @name Define macros for easy custom instruction wrapping + **************************************************************************/ +/**@{*/ +#define xtea_hw_init(sum) neorv32_cfu_r3_instr(0b0000000, 0b100, sum, 0) +#define xtea_hw_enc_v0_step(v0, v1) neorv32_cfu_r3_instr(0b0000000, 0b000, v0, v1) +#define xtea_hw_enc_v1_step(v0, v1) neorv32_cfu_r3_instr(0b0000000, 0b001, v0, v1) +#define xtea_hw_dec_v0_step(v0, v1) neorv32_cfu_r3_instr(0b0000000, 0b010, v0, v1) +#define xtea_hw_dec_v1_step(v0, v1) neorv32_cfu_r3_instr(0b0000000, 0b011, v0, v1) +/**@}*/ + +// The CFU custom instructions can be used as plain C functions as they are simple "intrinsics". +// There are 4 "prototype primitives" for the CFU instructions (define in sw/lib/include/neorv32_cfu.h): +// +// > neorv32_cfu_r3_instr(funct7, funct3, rs1, rs2) - for r3-type instructions (custom-0 opcode) +// > neorv32_cfu_r4_instr(funct3, rs1, rs2, rs3) - for r4-type instructions (custom-1 opcode) +// > neorv32_cfu_r5_instr_a(rs1, rs2, rs3, rs4) - for r5-type instruction A (custom-2 opcode) +// > neorv32_cfu_r5_instr_b(rs1, rs2, rs3, rs4) - for r5-type instruction B (custom-3 opcode) +// +// Every instance of these functions is converted into a single 32-bit RISC-V instruction word +// without any calling overhead at all (see the generated assembly code). +// +// The "rs*" source operands can be literals, variables, function return values, ... - you name it. +// The 7-bit immediate ("funct7") and the 3-bit immediate ("funct3") values can be used to pass +// compile-time static literal data to the CFU or to do a fine-grained function selection. +// +// Each "neorv32_cfu_r*" function returns a 32-bit data word of type uint32_t that represents +// the processing result of the according instruction. + + +/**********************************************************************//** + * @name Global variables + **************************************************************************/ +/**@{*/ +/** XTEA delta (round-key update) */ +const uint32_t xtea_delta = 0x9e3779b9; +/** Encryption/decryption key (128-bit) */ +const uint32_t key[4] = {0x207230ba, 0x1ffba710, 0xc45271ef, 0xdd01768a}; +/** Encryption input data */ +uint32_t input_data[DATA_NUM]; +/** Encryption results */ +uint32_t cypher_data_sw[DATA_NUM], cypher_data_hw[DATA_NUM]; +/** Decryption results */ +uint32_t plain_data_sw[DATA_NUM], plain_data_hw[DATA_NUM]; +/** Timing data */ +uint32_t time_enc_sw, time_enc_hw, time_dec_sw, time_dec_hw; /**@}*/ @@ -72,172 +123,234 @@ uint32_t xorshift32(void) { /**********************************************************************//** - * Main function + * XTEA encryption - software reference + * + * Source: https://de.wikipedia.org/wiki/Extended_Tiny_Encryption_Algorithm + * + * @param[in] num_cycles Number of encryption cycles. + * @param[in,out] v Encryption data/result array (2x32-bit). + * @param[in] k Encryption key array (4x32-bit). + **************************************************************************/ +void xtea_sw_encipher(uint32_t num_cycles, uint32_t *v, const uint32_t k[4]) { + + uint32_t i = 0; + uint32_t v0 = v[0]; + uint32_t v1 = v[1]; + uint32_t sum = 0; + + for (i=0; i < num_cycles; i++) { + v0 += (((v1 << 4) ^ (v1 >> 5)) + v1) ^ (sum + k[sum & 3]); + sum += xtea_delta; + v1 += (((v0 << 4) ^ (v0 >> 5)) + v0) ^ (sum + k[(sum>>11) & 3]); + } + + v[0] = v0; + v[1] = v1; +} + + +/**********************************************************************//** + * XTEA decryption - software reference + * + * Source: https://de.wikipedia.org/wiki/Extended_Tiny_Encryption_Algorithm + * + * @param[in] num_cycles Number of encryption cycles. + * @param[in,out] v Decryption data/result array (2x32-bit). + * @param[in] k Decryption key array (4x32-bit). + **************************************************************************/ +void xtea_sw_decipher(unsigned int num_cycles, uint32_t *v, const uint32_t k[4]) { + + uint32_t i = 0; + uint32_t v0 = v[0]; + uint32_t v1 = v[1]; + uint32_t sum = xtea_delta * num_cycles; + + for (i=0; i < num_cycles; i++) { + v1 -= (((v0 << 4) ^ (v0 >> 5)) + v0) ^ (sum + k[(sum>>11) & 3]); + sum -= xtea_delta; + v0 -= (((v1 << 4) ^ (v1 >> 5)) + v1) ^ (sum + k[sum & 3]); + } + + v[0] = v0; + v[1] = v1; +} + + +/**********************************************************************//** + * Main function: run pure-SW XTEA and compare with HW-XTEA * - * @note This program requires the CFU and UART0. + * @note This program requires the CFU, UART0 and the Zicntr ISA extension. * * @return 0 if execution was successful **************************************************************************/ int main() { - uint32_t i, rs1, rs2, rs3, rs4; + uint32_t i, j; + uint32_t v[2]; // initialize NEORV32 run-time environment neorv32_rte_setup(); - // setup UART at default baud rate, no interrupts - neorv32_uart0_setup(BAUD_RATE, 0); - // check if UART0 is implemented if (neorv32_uart0_available() == 0) { - return 1; // UART0 not available, exit + return -1; // UART0 not available, exit } - // check if the CFU is implemented at all (the CFU is wrapped in the core's "Zxcfu" ISA extension) + // setup UART0 at default baud rate, no interrupts + neorv32_uart0_setup(BAUD_RATE, 0); + + // check if the CFU is implemented (the CFU is wrapped in the core's "Zxcfu" ISA extension) if (neorv32_cpu_cfu_available() == 0) { neorv32_uart0_printf("ERROR! CFU ('Zxcfu' ISA extensions) not implemented!\n"); - return 1; + return -1; + } + + // check if the CPU base counters are implemented + if ((neorv32_cpu_csr_read(CSR_MXISA) & (1 << CSR_MXISA_ZICNTR)) == 0) { + neorv32_uart0_printf("ERROR! Base counters ('Zicntr' ISA extensions) not implemented!\n"); + return -1; } + // check if data size configuration is even + if ((DATA_NUM & 1) != 0) { + neorv32_uart0_printf("ERROR! DATA_NUM has to be even!\n"); + return -1; + } // intro neorv32_uart0_printf("\n<<< NEORV32 Custom Functions Unit (CFU) - Custom Instructions Example >>>\n\n"); - neorv32_uart0_printf("[NOTE] This program assumes the _default_ CFU hardware module, which\n" - " implements simple and exemplary data processing instructions.\n\n"); - -/* - The CFU custom instructions can be used as plain C functions as they are simple "intrinsics". - - There are 4 "prototype primitives" for the CFU instructions (define in sw/lib/include/neorv32_cfu.h): - - > neorv32_cfu_r3_instr(funct7, funct3, rs1, rs2) - for r3-type instructions (custom-0 opcode) - > neorv32_cfu_r4_instr(funct3, rs1, rs2, rs3) - for r4-type instructions (custom-1 opcode) - > neorv32_cfu_r5_instr_a(rs1, rs2, rs3, rs4) - for r5-type instruction A (custom-2 opcode) - > neorv32_cfu_r5_instr_b(rs1, rs2, rs3, rs4) - for r5-type instruction B (custom-3 opcode) - - Every "call" of these functions is turned into a single 32-bit ISC-V instruction word - without any calling overhead at all (see the generated assembly code). - - The "rs*" operands can be literals, variables, function return values, ... - you name it. - The 7-bit immediate ("funct7") and the 3-bit immediate ("funct3") values can be used to pass - _compile-time static_ literals to the CFU or to do a fine-grained function selection. - - Each "neorv32_cfu_r*" function returns a 32-bit data word of type uint32_t that represents - the result of the according instruction. -*/ + neorv32_uart0_printf("[NOTE] This program assumes the default CFU hardware module that\n" + " implements the Extended Tiny Encryption Algorithm (XTEA).\n\n"); // ---------------------------------------------------------- - // R3-type instructions (up to 1024 custom instructions) + // XTEA example // ---------------------------------------------------------- - neorv32_uart0_printf("\n--- CFU R3-Type: Bit-Reversal Instruction ---\n"); - for (i=0; i