✨ Add CFU R4-type instructions #449

Merged
merged 12 commits into from
Dec 5, 2022
1 change: 1 addition & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -32,6 +32,7 @@ mimpid = 0x01040312 => Version 01.04.03.12 => v1.4.3.12

| Date (*dd.mm.yyyy*) | Version | Comment |
|:-------------------:|:-------:|:--------|
| 03.12.2022 | 1.7.8.2 | :sparkles: new option to add custom R4-type RISC-V instructions to **CFU**; rework CFU hardware module, intrinsic library and example program; [#449](https://github.com/stnolting/neorv32/pull/449) |
| 01.12.2022 | 1.7.8.1 | package cleanup; [#447](https://github.com/stnolting/neorv32/pull/447) |
| 28.11.2022 | [**:rocket:1.7.8**](https://github.com/stnolting/neorv32/releases/tag/v1.7.8) | **New release** |
| 14.11.2022 | 1.7.7.9 | minor rtl edits and code optimizations; [#442](https://github.com/stnolting/neorv32/pull/442) |
2 changes: 1 addition & 1 deletion README.md
Expand Up @@ -140,7 +140,7 @@ and *Privileged Architecture Specification* ([pdf](https://github.com/stnolting/
* implements **all** standard RISC-V exceptions and interrupts (including MTI, MEI & MSI)
* 16 fast interrupt request channels as NEORV32-specific extension
* custom functions unit ([CFU](https://stnolting.github.io/neorv32/#_custom_functions_unit_cfu) as `Zxcfu` ISA extension)
for up to 1024 _custom RISC-V instructions_
for up to 1024 R3-type and up to 8 R4-type _custom RISC-V instructions_
* _intrinsic_ libraries for the `B` and `Zfinx` extensions

**Memory**
16 changes: 9 additions & 7 deletions docs/datasheet/cpu.adoc
Expand Up @@ -654,24 +654,26 @@ Any additional flags within the `fence.i` instruction word are ignore by the har

==== **`Zxcfu`** Custom Instructions Extension (CFU)

The `Zxcfu` presents a NEORV32-specific _custom RISC-V_ ISA extension (`Z` = sub-extension, `x` = platform-specific
The `Zxcfu` presents a NEORV32-specific extension to the RISC-V ISA (`Z` = sub-extension, `x` = platform-specific
custom extension, `cfu` = name of the custom extension). When enabled via the <<_cpu_extension_riscv_zxcfu>> configuration
generic, this ISA extension adds the <<_custom_functions_unit_cfu>> to the CPU core. The CFU is a module that
allows adding **custom RISC-V instructions** to the processor core.

The CPU is implemented as ALU co-processor and is integrated right into the CPU's pipeline providing minimal data
transfer latency as it has direct access to the core's register file. Up to 1024 custom instructions can be
implemented within the CFU. These instructions are mapped to an OPCODE space that has been explicitly reserved by
The CFU is implemented as an additional ALU co-processor and is integrated directly into the CPU's pipeline, providing minimal
data transfer latency as it has direct access to the core's register file. The CFU supports **RISC-V R3-type** instructions
as well as **RISC-V R4-type** instructions. Up to 1024 custom R3-type instructions and up to 8 custom R4-type instructions
can be implemented within the CFU. These instructions are mapped to an opcode space that has been explicitly reserved by
the RISC-V spec for custom extensions.

Software can utilize the custom instructions by using _intrinsic functions_, which are inline assembly functions that
behave like "regular" C functions.
behave like regular C functions.

[TIP]
For more information regarding the CFU see section <<_custom_functions_unit_cfu>>.
For more detailed information regarding the CFU, its hardware and the corresponding software interface
see section <<_custom_functions_unit_cfu>>.

[TIP]
The CFU / `Zxcfu` ISA extension is intended for application-specific _instructions_.
The CFU module / `Zxcfu` ISA extension is intended for user-defined **instructions**.
If you would like to add more complex accelerators or interfaces that can also operate independently of
the CPU take a look at the memory-mapped <<_custom_functions_subsystem_cfs>>.

197 changes: 117 additions & 80 deletions docs/datasheet/cpu_cfu.adoc
Expand Up @@ -4,104 +4,161 @@

The Custom Functions Unit is the central part of the <<_zxcfu_custom_instructions_extension_cfu>> and represents
the actual hardware module, which is used to implement _custom RISC-V instructions_. The concept of the NEORV32
CFU has been highly inspired by https://github.com/google/CFU-Playground[google's CFU-Playground].
CFU has been highly inspired by https://github.com/google/CFU-Playground[Google's CFU-Playground].

The CFU is intended for operations that are inefficient in terms of performance, latency, energy consumption or
program memory requirements when implemented in pure software. Some potential application fields and exemplary
program memory requirements when implemented entirely in software. Some potential application fields and exemplary
use-cases might include:

* **AI:** sub-word / vector / SIMD operations like adding all four bytes of a 32-bit data word
* **Cryptographic:** bit substitution and permutation
* **Communication:** conversions like binary to gray-code
* **Communication:** conversions like binary to gray-code; multiply-add operations
* **Image processing:** look-up-tables for color space transformations
* implementing instructions from other RISC-V ISA extensions that are not yet supported by the NEORV32
* implementing instructions from **other RISC-V ISA extensions** that are not yet supported by the NEORV32

[NOTE]
The CFU is not intended for complex and autonomous functional units that implement complete accelerators
like block-based AES de-/encoding). Such accelerator can be implemented within the <<_custom_functions_subsystem_cfs>>.
The CFU is not intended for complex and CPU-independent functional units that implement complete accelerators
(like block-based AES encryption). These kinds of accelerators are better implemented within the
<<_custom_functions_subsystem_cfs>>.
A comparison of all chip-internal hardware extension options is provided in the user guide section
https://stnolting.github.io/neorv32/ug/#_adding_custom_hardware_modules[Adding Custom Hardware Modules].


:sectnums:
==== Custom CFU Instructions - General
==== CFU Instruction Formats

The custom instruction utilize a specific instruction space that has been explicitly reserved for user-defined
extensions by the RISC-V specifications ("_Guaranteed Non-Standard Encoding Space_"). The NEORV32 CFU uses the
_CUSTOM0_ opcode to identify custom instructions. The binary encoding of this opcode is `0001011`.
The custom instructions executed by the CFU use a part of the `rv32` 32-bit instruction space that has been explicitly
reserved for user-defined extensions by the RISC-V specifications ("_Guaranteed Non-Standard Encoding Space_"). The NEORV32
CFU uses the `custom-0` and `custom-1` opcodes to identify the custom instructions implemented by the CFU and to differentiate
between two instruction formats (note: both formats are common RISC-V instruction format types).
The `custom-0` opcode is used to implement custom **R3-type** instructions while the `custom-1` opcode is used to
implement custom **R4-type** instructions. The corresponding binary encoding of these opcodes is shown below:

The custom instructions processed by the CFU use the 32-bit **R2-type** RISC-V instruction format, which consists
of six bit-fields:
* `custom-0`: `0001011` (R3-type instructions)
* `custom-1`: `0101011` (R4-type instructions)

.CFU Instructions - Exceptions
[NOTE]
The CPU control logic will only analyze the opcode of the custom instructions to check if the
instruction word is valid. All remaining bit-fields are **not checked** by the CPU instruction decoding logic.
Hence, a custom CFU instruction can never raise an illegal instruction exception. If the CFU is not
implemented at all (`Zxcfu` ISA extension is not enabled) any instruction with opcode custom-0 or custom-1
will raise an illegal instruction exception.
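The opcode-only check described above can be sketched as a small software model. This is an illustration, not the NEORV32 decoder: the names `cfu_classify`, `cfu_raises_illegal` and `cfu_format_t` are hypothetical, and only the two opcode values are taken from this section.

```c
#include <stdbool.h>
#include <stdint.h>

/* Opcode values from this section; all other names are hypothetical. */
#define CFU_OPCODE_CUSTOM0 0x0BU /* 0b0001011, R3-type */
#define CFU_OPCODE_CUSTOM1 0x2BU /* 0b0101011, R4-type */

typedef enum { CFU_FMT_NONE, CFU_FMT_R3, CFU_FMT_R4 } cfu_format_t;

/* Classify an instruction word by looking at the 7-bit opcode only.
 * All remaining bit-fields are deliberately ignored. */
cfu_format_t cfu_classify(uint32_t instr_word) {
    switch (instr_word & 0x7FU) {
        case CFU_OPCODE_CUSTOM0: return CFU_FMT_R3;
        case CFU_OPCODE_CUSTOM1: return CFU_FMT_R4;
        default:                 return CFU_FMT_NONE;
    }
}

/* Model of the exception rule for CFU opcodes: a custom-0/custom-1 word is
 * illegal only if the Zxcfu extension is disabled. Words with other opcodes
 * are outside the scope of this model and always yield false here. */
bool cfu_raises_illegal(uint32_t instr_word, bool zxcfu_enabled) {
    return (cfu_classify(instr_word) != CFU_FMT_NONE) && !zxcfu_enabled;
}
```

With `Zxcfu` enabled, `cfu_raises_illegal` returns `false` for any word carrying one of the two opcodes, regardless of the other bit-fields.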


:sectnums:
==== CFU R3-Type Instructions

The R3-type CFU instructions operate on two source registers and return the processing result to the destination register.
The actual operation can be defined by using the `funct7` and `funct3` bit fields. These immediates can also be used to
pass additional data to the CFU like offsets, look-up table addresses or shift amounts. However, the actual
functionality is entirely user-defined.

Example operation: `rd <= rs1 xnor rs2`

.CFU R3-type instruction format
image::cfu_r3type_instruction.png[align=center]

* `funct7`: 7-bit immediate
* `rs2`: address of second source register
* `rs1`: address of first source register
* `funct3`: 3-bit immediate
* `rd`: address of destination register
* `opcode`: always `0001011` to identify custom instructions
* `opcode`: always `0001011` (RISC-V "custom-0" opcode)

.CFU instruction format (RISC-V R2-type)
image::cfu_r2type_instruction.png[align=center]
.RISC-V compatibility
[NOTE]
The CFU R3-type instruction format is compliant with the RISC-V ISA specification.

.Instruction encoding space
[NOTE]
Obviously, all bit-fields including the immediates have to be static at compile time.
By using the `funct7` and `funct3` bit-fields entirely for selecting the actual operation, a total of 1024 custom R3-type
instructions can be implemented (7 bits + 3 bits = 10 bits -> 1024 different values).
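To make the bit-field layout concrete, the following sketch packs the R3-type fields into a 32-bit instruction word using the standard RISC-V R-type bit positions. The helper names `cfu_r3_encode` and `cfu_r3_select` are hypothetical, and combining `funct7` and `funct3` into a single 10-bit selector in this order is an assumption made for illustration; the CFU logic is free to interpret the fields differently.

```c
#include <assert.h>
#include <stdint.h>

#define CFU_OPCODE_CUSTOM0 0x0BU /* 0b0001011 */

/* Pack the R3-type fields (standard RISC-V R-type layout):
 * funct7[31:25] rs2[24:20] rs1[19:15] funct3[14:12] rd[11:7] opcode[6:0].
 * rd, rs1 and rs2 are register addresses (0..31), not register contents. */
uint32_t cfu_r3_encode(uint32_t funct7, uint32_t funct3,
                       uint32_t rd, uint32_t rs1, uint32_t rs2) {
    assert(funct7 < 128U && funct3 < 8U);
    assert(rd < 32U && rs1 < 32U && rs2 < 32U);
    return (funct7 << 25) | (rs2 << 20) | (rs1 << 15) |
           (funct3 << 12) | (rd << 7) | CFU_OPCODE_CUSTOM0;
}

/* Hypothetical 10-bit operation selector: funct7 in the upper 7 bits,
 * funct3 in the lower 3 bits, giving values 0..1023. */
uint32_t cfu_r3_select(uint32_t instr_word) {
    uint32_t funct7 = (instr_word >> 25) & 0x7FU;
    uint32_t funct3 = (instr_word >> 12) & 0x7U;
    return (funct7 << 3) | funct3;
}
```

For example, `cfu_r3_encode(0, 5, 10, 11, 12)` yields `0x00C5D50B`, and the selector of a word with all-ones `funct7` and `funct3` is 1023, the last of the 1024 possible values.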

.Custom Instructions - Exceptions

:sectnums:
==== CFU R4-Type Instructions

The R4-type CFU instructions operate on three source registers and return the processing result to the destination register.
The actual operation can be defined by using the `funct3` bit field. Alternatively, this immediate can also be used to
pass additional data to the CFU like offsets, look-up table addresses or shift amounts. However, the actual
functionality is entirely user-defined.

Example operation: `rd <= (rs1 * rs2 + rs3)[31:0]`

.CFU R4-type instruction format
image::cfu_r4type_instruction.png[align=center]

* `rs3`: address of third source register
* `rs2`: address of second source register
* `rs1`: address of first source register
* `funct3`: 3-bit immediate
* `rd`: address of destination register
* `opcode`: always `0101011` (RISC-V "custom-1" opcode)

.RISC-V compatibility
[NOTE]
The CPU control logic can only check the _CUSTOM0_ opcode of the custom instructions to check if the
instruction word is valid. It cannot check the `funct3` and `funct7` bit-fields since they are
implementation-defined. Hence, a custom CFU instruction can never raise an illegal instruction exception.
However, custom will raise an illegal instruction exception if the CFU is not enabled/implemented
(i.e. `Zxcfu` ISA extension is not enabled).
The CFU R4-type instruction format is compliant with the RISC-V ISA specification.

The CFU operates on the two source operands and return the processing result to the destination register.
The actual instruction to be performed can be defined by using the `funct7` and `funct3` bit fields.
These immediate bit-fields can also be used to pass additional data to the CFU like offsets, look-up-tables
addresses or shift-amounts. However, the actual functionality is completely user-defined.
.Unused instruction bits
[NOTE]
The RISC-V ISA specification defines bits [26:25] of the R4-type instruction word to be all-zero. These bits are ignored
by the hardware (CFU and illegal instruction check logic) and should be set to all-zero to preserve compatibility with
future implementations.

.Instruction encoding space
[NOTE]
By using the `funct3` bit-field entirely for selecting the actual operation, a total of 8 custom R4-type instructions
can be implemented (3 bits -> 8 different values).
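The R4-type layout can be sketched in the same way. The helper names `cfu_r4_encode` and `cfu_r4_reserved_bits_clear` are hypothetical; the bit positions follow the standard RISC-V R4-type format, with bits [26:25] left at zero as recommended in the note on unused instruction bits.

```c
#include <assert.h>
#include <stdint.h>

#define CFU_OPCODE_CUSTOM1 0x2BU /* 0b0101011 */

/* Pack the R4-type fields (standard RISC-V R4-type layout):
 * rs3[31:27] zero[26:25] rs2[24:20] rs1[19:15] funct3[14:12] rd[11:7] opcode[6:0].
 * rd, rs1, rs2 and rs3 are register addresses (0..31). */
uint32_t cfu_r4_encode(uint32_t funct3, uint32_t rd,
                       uint32_t rs1, uint32_t rs2, uint32_t rs3) {
    assert(funct3 < 8U);
    assert(rd < 32U && rs1 < 32U && rs2 < 32U && rs3 < 32U);
    /* bits [26:25] are never set here, so they stay all-zero */
    return (rs3 << 27) | (rs2 << 20) | (rs1 << 15) |
           (funct3 << 12) | (rd << 7) | CFU_OPCODE_CUSTOM1;
}

/* Returns 1 if the unused bits [26:25] of an R4-type word are all-zero. */
int cfu_r4_reserved_bits_clear(uint32_t instr_word) {
    return ((instr_word >> 25) & 0x3U) == 0U;
}
```

A check like `cfu_r4_reserved_bits_clear` could be used in a test bench or a disassembler script to flag hand-written instruction words that set the unused bits, even though the hardware ignores them.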

.Hardware resource requirements
[WARNING]
Enabling the CFU and actually implementing R4-type instructions (or more precisely, using `rs3` inside the CFU hardware
module) will add another read port to the core's register file, increasing resource requirements. For example, on an
FPGA platform that supports dual-port RAMs this will _double_ the number of BRAMs required to implement the register
file.


:sectnums:
==== Using Custom Instructions in Software

The custom instructions provided by the CFU are included into plain C code by using **intrinsics**. Intrinsics
behave like "normal" functions but under the hood they are a set of macros that hide the complexity of inline assembly.
Using such intrinsics removes the need to modify the compiler, built-in libraries and the assembler when including custom
Using intrinsics removes the need to modify the compiler, built-in libraries or the assembler when including custom
instructions.

The NEORV32 software framework provides 8 pre-defined custom instructions macros, which are defined in
`sw/lib/include/neorv32_cpu_cfu.h`. Each intrinsic provides an implicit definition of the instruction word's
`funct3` bit-field:
The NEORV32 software framework provides two pre-defined prototypes for custom instructions, which are defined in
`sw/lib/include/neorv32_cpu_cfu.h`, one for R3-type instructions and one for R4-type instructions:

.CFU instruction prototypes
[source,c]
----
neorv32_cfu_cmd0(funct7, rs1, rs2) // funct3 = 000
neorv32_cfu_cmd1(funct7, rs1, rs2) // funct3 = 001
neorv32_cfu_cmd2(funct7, rs1, rs2) // funct3 = 010
neorv32_cfu_cmd3(funct7, rs1, rs2) // funct3 = 011
neorv32_cfu_cmd4(funct7, rs1, rs2) // funct3 = 100
neorv32_cfu_cmd5(funct7, rs1, rs2) // funct3 = 101
neorv32_cfu_cmd6(funct7, rs1, rs2) // funct3 = 110
neorv32_cfu_cmd7(funct7, rs1, rs2) // funct3 = 111
neorv32_cfu_r3_instr(funct7, funct3, rs1, rs2) // R3-type instruction
neorv32_cfu_r4_instr(funct3, rs1, rs2, rs3) // R4-type instruction
----

Each intrinsic functions always returns a 32-bit value (the processing result). Furthermore,
each intrinsic function requires three arguments:
The intrinsic functions always return a 32-bit value of type `uint32_t` (the processing result), which can be discarded
when not needed. Each intrinsic function requires several arguments depending on the instruction type:

* `funct7` - 7-bit immediate (R3-type)
* `funct3` - 3-bit immediate (R3-type, R4-type)
* `rs3` - source operand 3, 32-bit (R4-type)
* `rs2` - source operand 2, 32-bit (R3-type, R4-type)
* `rs1` - source operand 1, 32-bit (R3-type, R4-type)

* `funct7` - 7-bit immediate
* `rs2` - source operand 2, 32-bit
* `rs1` - source operand 1, 32-bit
[NOTE]
The literals (immediate bit-fields `funct3` and `funct7`) have to be **static at compile time**.

The `funct7` bit-field is used to pass a 7-bit literal to the CFU. The `rs1` and `rs2` arguments to pass the
actual data to the CFU. These arguments can be populated with variables or literals. The following example
show how to pass arguments when executing `neorv32_cfu_cmd6`: `funct7` is set to all-zero, `rs1` is given
the literal _2751_ and `rs2` is given a variable that contains the return value from `some_function()`.
The `funct3` and `funct7` bit-fields are used to pass 3-bit or 7-bit literals to the CFU. The `rs1`, `rs2` and `rs3`
arguments pass the actual data to the CFU. These register arguments can be populated with variables or literals.
The following example shows how to pass arguments when executing both CFU instruction types:

.CFU instruction usage example
[source,c]
----
uint32_t opb = some_function();
uint32_t res = neorv32_cfu_cmd6(0b0000000, 2751, opb);
uint32_t tmp = some_function();
uint32_t res = neorv32_cfu_r3_instr(0b0000000, 0b101, tmp, 123);
uint32_t foo = neorv32_cfu_r4_instr(0b011, tmp, res, some_array[i]);
----
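Application code that calls the intrinsics can be exercised on a host PC by substituting a software model. The sketch below is such a stand-in and not part of the NEORV32 library: the `mock_cfu_*` names and the selector values are hypothetical, and the two operations are simply the example operations given for the R3-type and R4-type formats.

```c
#include <stdint.h>

/* Hypothetical selector values. The real mapping of funct3/funct7 to
 * operations is defined by the user logic inside the CFU. */
#define MOCK_R3_FUNCT3_XNOR 0x0U
#define MOCK_R4_FUNCT3_MADD 0x0U

/* Host-side stand-in with the same argument order as the R3-type prototype. */
uint32_t mock_cfu_r3_instr(uint32_t funct7, uint32_t funct3,
                           uint32_t rs1, uint32_t rs2) {
    (void)funct7; /* unused in this model */
    if (funct3 == MOCK_R3_FUNCT3_XNOR) {
        return ~(rs1 ^ rs2); /* rd <= rs1 xnor rs2 */
    }
    return 0U; /* unmodelled operations return zero in this sketch */
}

/* Host-side stand-in with the same argument order as the R4-type prototype. */
uint32_t mock_cfu_r4_instr(uint32_t funct3,
                           uint32_t rs1, uint32_t rs2, uint32_t rs3) {
    if (funct3 == MOCK_R4_FUNCT3_MADD) {
        /* rd <= (rs1 * rs2 + rs3)[31:0]: compute in 64 bits, keep the low 32 */
        return (uint32_t)((uint64_t)rs1 * rs2 + rs3);
    }
    return 0U;
}
```

Keeping the argument order identical to the intrinsic prototypes means a test build can switch between the mock and the real intrinsics with a single preprocessor alias.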

.CFU Example Program
Expand All @@ -113,42 +170,22 @@ The example program is located in `sw/example/demo_cfu`.
:sectnums:
==== Custom Instructions Hardware

The actual functionality of the CFU's custom instruction is defined by the logic in the CFU itself.
It is the responsibility of the designer to implement this logic within the CFU hardware module
`rtl/core/neorv32_cpu_cp_cfu.vhd`.

The CFU hardware module receives the data from instruction word's immediate bit-fields and also
the operation data, which is fetched from the CPU's register file.

.CFU instruction data passing example
[source,c]
----
uint32_t opb = 0x12345678UL;
uint32_t res = neorv32_cfu_cmd6(0b0100111, 0x00cafe00, opb);
----

In this example the CFU hardware module receives the two source operands as 32-bit signal
and the immediate values as 7-bit and 3-bit signals:

* `rs1_i` (32-bit) contains the data from the `rs1` register (here = `0x00cafe00`)
* `rs2_i` (32-bit) contains the data from the `rs2` register (here = 0x12345678)
* `control.funct3` (3-bit) contains the immediate value from the `funct3` bit-field (here = `0b110`; "cmd6")
* `control.funct7` (7-bit) contains the immediate value from the `funct7` bit-field (here = `0b0100111`)

The CFU executes the according instruction (for example this is selected by the `control.funct3` signal)
and provides the operation result in the 32-bit `control.result` signal. The processing can be entirely
combinatorial, so the result is available at the end of the current clock cycle. Processing can also
take several clock cycles and may also include internal states and memories. As soon as the CFU has
completed operations it sets the `control.done` signal high.
The actual functionality of the CFU's custom instructions is defined by the user-defined logic inside
the CFU hardware module `rtl/core/neorv32_cpu_cp_cfu.vhd`.

.CFU Hardware Example & More Details
[TIP]
The default CFU module already implement some exemplary instructions that are used for illustration
The default CFU hardware module already implements some exemplary instructions that are used for illustration
by the CFU example program. See the CFU's VHDL source file (`rtl/core/neorv32_cpu_cp_cfu.vhd`), which
is highly commented to explain the available signals and the handshake with the CPU pipeline.

CFU operations can be entirely combinatorial (like bit-reversal) so the result is available at the end of
the current clock cycle. Operations can also take several clock cycles to complete (like multiplications)
and may also include internal states and memories. The CFU's internal controller unit takes care of
interfacing the custom user logic to the CPU's pipeline.

.CFU Execution Time
[NOTE]
The CFU is not required to finish processing within a bound time.
However, the designer should keep in mind that the CPU is **stalled** until the CFU has finished processing.
This also means the CPU cannot react to pending interrupts. Nevertheless, interrupt requests will still be queued.
The CFU is not required to finish processing within a bounded time. However, you should keep in mind that the
CPU is _stalled_ until the CFU has finished processing. This also means the CPU cannot react to pending
interrupts during this time, which affects real-time behavior (interrupt requests will still be queued).
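The effect of a multi-cycle operation on the CPU can be illustrated with a small software model. This is a sketch under stated assumptions, not a description of the NEORV32 hardware: the `cfu_model_*` names are hypothetical, and the bit-serial multiplier with its 32-step latency is an arbitrary example of an operation that needs several clock cycles.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a multi-cycle CFU operation: a bit-serial
 * shift-and-add multiplier that consumes one multiplier bit per clock. */
typedef struct {
    uint32_t acc;   /* running result (wraps modulo 2^32) */
    uint32_t a;     /* multiplicand, shifted left each step */
    uint32_t b;     /* multiplier, shifted right each step */
    uint32_t steps; /* remaining clock cycles */
    bool     done;  /* set once the result is valid */
} cfu_model_t;

void cfu_model_start(cfu_model_t *m, uint32_t rs1, uint32_t rs2) {
    m->acc = 0U;
    m->a = rs1;
    m->b = rs2;
    m->steps = 32U;
    m->done = false;
}

/* Advance the model by one clock cycle. */
void cfu_model_clock(cfu_model_t *m) {
    if (m->done) {
        return;
    }
    if (m->b & 1U) {
        m->acc += m->a;
    }
    m->a <<= 1;
    m->b >>= 1;
    if (--m->steps == 0U) {
        m->done = true;
    }
}

/* Run one operation to completion and count the cycles during which the
 * (modelled) CPU would be stalled and unable to service interrupts. */
uint32_t cfu_model_run(cfu_model_t *m, uint32_t rs1, uint32_t rs2,
                       uint32_t *stall_cycles) {
    cfu_model_start(m, rs1, rs2);
    *stall_cycles = 0U;
    while (!m->done) {
        cfu_model_clock(m);
        (*stall_cycles)++;
    }
    return m->acc;
}
```

In this model every operation stalls for 32 cycles, so the worst-case added interrupt latency is easy to bound. A design with data-dependent or unbounded latency would make that bound correspondingly harder to state.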
Binary file removed docs/figures/cfu_r2type_instruction.png
Binary file not shown.
Binary file added docs/figures/cfu_r3type_instruction.png
Binary file added docs/figures/cfu_r4type_instruction.png