✨ Add "ASIC style" register file option #736

Merged
merged 8 commits into from
Nov 25, 2023
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -32,6 +32,7 @@ mimpid = 0x01040312 -> Version 01.04.03.12 -> v1.4.3.12

| Date (*dd.mm.yyyy*) | Version | Comment |
|:-------------------:|:-------:|:--------|
| 25.11.2023 | 1.9.1.6 | :sparkles: add option for "ASIC style" register file that provides a full/dedicated hardware reset; [#736](https://github.com/stnolting/neorv32/pull/736) |
| 23.11.2023 | 1.9.1.5 | clean-up & rework CPU branch logic; [#735](https://github.com/stnolting/neorv32/pull/735) |
| 21.11.2023 | 1.9.1.4 | :bug: fix bug in handling of "misaligned instruction exception"; [#734](https://github.com/stnolting/neorv32/pull/734) |
| 20.11.2023 | 1.9.1.3 | :bug: fix wiring of FPU exception flags; [#733](https://github.com/stnolting/neorv32/pull/733) |
87 changes: 62 additions & 25 deletions docs/datasheet/cpu.adoc
@@ -87,44 +87,66 @@ caused by speculative execution (like Spectre or Meltdown).
:sectnums:
==== CPU Register File

The data register file contains the general purpose "`x`" architecture registers. For the `rv32i` ISA there are 32 32-bit registers
and for the `rv32e` ISA there are 16 32-bit registers. Register zero (`x0`/`zero`) always read as zero and any write access to it
is discarded.
The data register file contains the general purpose architecture registers `x0` to `x31`. For the `rv32e` ISA only the lower
16 registers are implemented. Register zero (`x0`/`zero`) always reads as zero and any write access to it has no effect.
Up to four individual synchronous read ports allow fetching up to four register operands at once. Write and read accesses
are mutually exclusive as they happen in separate cycles. Hence, there is no need to consider "read-during-write"
behavior.

The register file is implemented as synchronous memory with synchronous read and write accesses. Register `zero` is also mapped to
a _physical memory location_ in the register file. By this, there is no need to add a further multiplexer to "insert" zero if reading
from register `zero` reducing logic requirements and shortening the critical path. Furthermore, the whole register file can be mapped
entirely to FPGA block RAM.
The register file provides two different implementation options configured via the top's `REGFILE_HW_RST` generic.

The memory of the register file uses two access ports: a read-only port for reading register `rs2` (second source operand) and a
read/write port for reading register `rs1` (first source operand) or for writing processing results to register `rd` (destination register).
Hence, a simple dual-port RAM can be used to implement the entire register file. From a functional point of view, read and write accesses to
the register file do never occur in the same clock cycle, so no bypass logic is required at all.
* `REGFILE_HW_RST = false` (default): In this configuration the register file is implemented as a plain memory array without a
dedicated hardware reset. This architecture allows inferring FPGA block RAM for the entire register file, resulting in minimal
logic utilization and optimal timing.
* `REGFILE_HW_RST = true`: This configuration is based on individual FFs that provide a dedicated hardware reset.
Hence, the register file cannot be mapped to FPGA block RAM. This option should only be selected if the application requires a
reset of the register file (e.g. for security reasons) or if the design shall be synthesized for an **ASIC** implementation.

.Register File Reset
[IMPORTANT]
The CPU register file does **not** provide any reset capabilities (in order to allow mapping to block RAM).
Hence, all integer registers (`x1` to `x15`/`x31`) have unknown values after a hardware reset and can still contain
sensitive data like encryption keys.
The state of this configuration generic can be checked by software via the <<_mxisa>> CSR.

.FPGA Implementation
[WARNING]
Enabling the `REGFILE_HW_RST` option for FPGA implementation is not recommended as this will massively increase the amount
of required logic resources.

.Implementation of the `zero` Register within FPGA Block RAM
[NOTE]
Register `zero` is also mapped to a _physical memory location_ within the register file's block RAM. This avoids
a further multiplexer to "insert" zero when reading from register `zero`, which reduces logic requirements and shortens the
critical path. However, it also requires that the physical storage bits of register `zero` are explicitly initialized (set
to zero) by the hardware. The CPU control does this transparently without additional processing overhead.

.Block RAM Ports
[NOTE]
The default register file configuration uses two access ports: a read-only port for reading register `rs2` (second source operand)
and a read/write port for reading register `rs1` (first source operand) and for writing processing results to register `rd`
(destination register). Hence, a simple dual-port RAM can be used to implement the entire register file. From a functional point
of view, read and write accesses to the register file never occur in the same clock cycle, so no bypass logic is required.
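
The two register file options and the handling of register `zero` can be illustrated with a small behavioral model. This is a sketch in C, not the RTL: the type `regfile_t`, the function names and the `zero_we` argument (loosely mirroring the `rf_zero_we` control signal) are assumptions made for illustration only.

```c
#include <stdint.h>
#include <string.h>

#define RF_NUM_REGS 32u /* rv32i; an rv32e build would use 16 */

/* Hypothetical model: x0 occupies a real storage location like any other register. */
typedef struct {
    uint32_t mem[RF_NUM_REGS];
} regfile_t;

/* Models REGFILE_HW_RST = true: a hardware reset clears every register. */
static void rf_hw_reset(regfile_t *rf) {
    memset(rf->mem, 0, sizeof rf->mem);
}

/* Write port. Normal writes to x0 are dropped; zero_we forces a write of
 * zero to x0, which is how the RAM-style option initializes that location. */
static void rf_write(regfile_t *rf, unsigned rd, uint32_t data, int zero_we) {
    rd &= RF_NUM_REGS - 1u;
    if (zero_we) {
        rf->mem[0] = 0u;
    } else if (rd != 0u) {
        rf->mem[rd] = data;
    }
}

/* Read port: no special case (no multiplexer) for x0. */
static uint32_t rf_read(const regfile_t *rf, unsigned rs) {
    return rf->mem[rs & (RF_NUM_REGS - 1u)];
}
```

In this model the RAM-style option corresponds to never calling `rf_hw_reset`: all registers except `x0` keep whatever value the storage held before, and `x0` only reads as zero once the forced write has happened.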


:sectnums:
==== CPU Arithmetic Logic Unit

The arithmetic/logic unit (ALU) is used for processing data from the register file and also for memory and branch address computations.
All simple <<_i_isa_extension>> processing operations (`add`, `and`, ...) are implemented as combinatorial logic requiring only a single cycle to
complete. More sophisticated instructions (shift operations from the base ISA and all further ISA extensions) are processed by so-called
"ALU co-processors".
The arithmetic/logic unit (ALU) is used for actual data processing as well as generating memory and branch addresses.
All "simple" <<_i_isa_extension>> computational instructions (like `add` and `or`) are implemented as plain combinatorial logic
requiring only a single cycle to complete. More sophisticated instructions like shift operations or multiplications are processed
by so-called "ALU co-processors".

The co-processors are implemented as iterative units that require several cycles to complete processing. Besides the base ISA's shift instructions,
the co-processors are used to implement all further processing-based ISA extensions (e.g. <<_m_isa_extension>> and
<<_b_isa_extension>>).
The co-processors are implemented as iterative units that require several cycles to complete processing. Besides the base ISA's
shift instructions, the co-processors are used to implement all further processing-based ISA extensions (e.g. <<_m_isa_extension>>
and <<_b_isa_extension>>).

.Multi-Cycle Execution Monitor
[NOTE]
The CPU control will raise an illegal instruction exception if a multi-cycle functional unit (like the <<_custom_functions_unit_cfu>>)
does not complete processing in a bound amount of time (configured via the package's `monitor_mc_tmo_c` constant; default = 512 clock cycles).
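
The idea behind the execution monitor can be sketched as a simple cycle counter. The model below is an illustration in C, not the actual control logic: the names `mc_monitor` and `MONITOR_MC_TMO` are hypothetical, and whether a unit that finishes exactly on the last permitted cycle still counts as "in time" is an assumption of this sketch.

```c
/* Default bound from the text (monitor_mc_tmo_c); the macro name is hypothetical. */
#define MONITOR_MC_TMO 512u

typedef enum { MC_DONE, MC_TIMEOUT } mc_status_t;

/* unit_latency: number of cycles the functional unit needs to complete.
 * Returns MC_TIMEOUT if the unit is not done within `timeout` cycles,
 * which is where the CPU would raise an illegal instruction exception. */
static mc_status_t mc_monitor(unsigned unit_latency, unsigned timeout) {
    for (unsigned cycle = 1u; cycle <= timeout; cycle++) {
        if (cycle >= unit_latency) {
            return MC_DONE; /* unit signalled completion in time */
        }
    }
    return MC_TIMEOUT;
}
```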

.Tuning Options
[TIP]
The ALU architecture can be tuned for an application-specific area-vs-performance trade-off. The `FAST_SHIFT_EN` and `FAST_MUL_EN`
generics can be used to implement performance-optimized barrel shifters and DSP blocks, respectively. See sections <<_i_isa_extension>>,
<<_b_isa_extension>> and <<_m_isa_extension>> for specific examples.


:sectnums:
==== CPU Bus Unit
@@ -494,6 +516,11 @@ the following sub-extensions:
| Carry-less multiply | `clmul` `clmulh` `clmulr` | 36
|=======================

.Barrel Shifter
[TIP]
Shift operations can be accelerated (at the cost of additional logic resources) by enabling the `FAST_SHIFT_EN`
configuration option that will replace the (time-variant) bit-serial shifter by a (time-constant) barrel shifter.


==== `C` ISA Extension

@@ -561,6 +588,11 @@ will clear/flush the data cache and resynchronize it with main memory.
The `wfi` instruction is used to enter <<_sleep_mode>>. Executing the `wfi` instruction in user-mode
will raise an illegal instruction exception if the `TW` bit of <<_mstatus>> is set.

.Barrel Shifter
[TIP]
Shift operations can be accelerated (at the cost of additional logic resources) by enabling the `FAST_SHIFT_EN`
configuration option that will replace the (time-variant) bit-serial shifter by a (time-constant) barrel shifter.
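
The difference between the two shifter styles can be made concrete with a small model of a logical left shift. This is a sketch in C under stated assumptions: the names are hypothetical, and the `cycles` field counts iterations of the model, not the actual instruction latency of the CPU.

```c
#include <stdint.h>

typedef struct {
    uint32_t result;
    unsigned cycles; /* model-level iteration count, not real CPU latency */
} shift_result_t;

/* Bit-serial shifter: one bit position per iteration, so the time
 * depends on the shift amount (time-variant). */
static shift_result_t sll_serial(uint32_t value, unsigned shamt) {
    shift_result_t r = { value, 0u };
    shamt &= 31u;
    while (shamt != 0u) {
        r.result <<= 1;
        r.cycles++;
        shamt--;
    }
    return r;
}

/* Barrel shifter: five fixed multiplexer stages (shift by 1, 2, 4, 8, 16)
 * evaluated in one pass, so the time is independent of the shift amount. */
static shift_result_t sll_barrel(uint32_t value, unsigned shamt) {
    shift_result_t r = { value, 1u };
    for (unsigned stage = 0u; stage < 5u; stage++) {
        if ((shamt >> stage) & 1u) {
            r.result <<= (1u << stage);
        }
    }
    return r;
}
```

Both functions return the same result for every shift amount; only the iteration count differs, which is the area-vs-time trade-off the option selects.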


==== `M` ISA Extension

@@ -575,6 +607,11 @@ Hardware-accelerated integer multiplication and division operations are availabl
| Division | `div` `divu` `rem` `remu` | 36
|=======================

.DSP Blocks
[TIP]
Multiplication operations can be accelerated (at the cost of additional logic resources) by enabling the `FAST_MUL_EN`
configuration option that will replace the (time-variant) bit-serial multiplier by (time-constant) FPGA DSP blocks.


==== `U` ISA Extension

@@ -852,10 +889,10 @@ defined by the NEORV32 core library (the runtime environment _RTE_) and can be u
with the pre-defined RTE function. The <<_mcause>>, <<_mepc>>, <<_mtval>> and <<_mtinst>> columns show the value being
written to the according CSRs when a trap is triggered:

* **I-PC** - address of interrupted instruction (instruction has _not_ been executed yet)
* **I-PC** - address of intercepted instruction (instruction has _not_ been executed yet)
* **PC** - address of instruction that caused the trap (instruction has been executed)
* **ADR** - bad data memory access address that caused the trap
* **INS** - the (decompressed) instruction word that caused the trap
* **INS** - the transformed/decompressed instruction word that caused the trap
* **0** - zero

.NEORV32 Trap Listing
3 changes: 2 additions & 1 deletion docs/datasheet/cpu_csr.adoc
@@ -1005,7 +1005,8 @@ discover ISA sub-extensions and CPU configuration options
| 11 | `CSR_MXISA_SDTRIG` | r/- | <<_sdtrig_isa_extension>> available
| 19:12 | - | r/- | hardwired to zero
| 20 | `CSR_MXISA_IS_SIM` | r/- | set if CPU is being **simulated** (⚠️ not guaranteed)
| 31:21 | - | r/- | hardwired to zero
| 28:21 | - | r/- | hardwired to zero
| 29 | `CSR_MXISA_RFHWRST` | r/- | full hardware reset of register file available when set (`REGFILE_HW_RST`)
| 30 | `CSR_MXISA_FASTMUL` | r/- | fast multiplication available when set (`FAST_MUL_EN`)
| 31 | `CSR_MXISA_FASTSHIFT` | r/- | fast shifts available when set (`FAST_SHIFT_EN`)
|=======================
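
Software can test the tuning-option flags by masking the value read from the CSR. The following is a minimal sketch in C that decodes an already-read 32-bit value. The bit positions are taken from the table above, while the enum and function names are hypothetical. How the CSR value is obtained on the target (for example via a `csrr` instruction) is not shown here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions from the mxisa table; the identifiers are illustrative only. */
enum {
    MXISA_BIT_RFHWRST   = 29, /* REGFILE_HW_RST */
    MXISA_BIT_FASTMUL   = 30, /* FAST_MUL_EN */
    MXISA_BIT_FASTSHIFT = 31  /* FAST_SHIFT_EN */
};

/* Returns true if the given bit is set in the CSR value. */
static bool mxisa_flag(uint32_t mxisa, unsigned bit) {
    return ((mxisa >> bit) & 1u) != 0u;
}

static bool mxisa_regfile_hw_reset(uint32_t mxisa) {
    return mxisa_flag(mxisa, MXISA_BIT_RFHWRST);
}

static bool mxisa_fast_mul(uint32_t mxisa) {
    return mxisa_flag(mxisa, MXISA_BIT_FASTMUL);
}

static bool mxisa_fast_shift(uint32_t mxisa) {
    return mxisa_flag(mxisa, MXISA_BIT_FASTSHIFT);
}
```

A start-up routine could use `mxisa_regfile_hw_reset()` to decide whether it still needs to clear the integer registers explicitly after a reset.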
13 changes: 4 additions & 9 deletions docs/datasheet/soc.adoc
@@ -183,12 +183,6 @@ If optional modules (like CPU extensions or peripheral devices) are not enabled
will not be synthesized at all. Hence, the disabled modules do not increase area and power requirements
and do not impact timing.

.Configuration Check
[NOTE]
Not all configuration combinations are valid. The processor RTL code provides sanity checks to inform the user
during synthesis/simulation if an invalid combination has been detected. It is recommended to run a quick simulation
using the provided simulation/GHDL scripts to verify the configuration of the processor generics.

.Table Abbreviations
[NOTE]
The generic type "`suv(x:y)`" is an abbreviation for "`std_ulogic_vector(x downto y)`".
@@ -219,9 +213,10 @@ The generic type "`suv(x:y)`" is an abbreviation for "`std_ulogic_vector(x downt
| `CPU_EXTENSION_RISCV_Zihpm` | boolean | false | Enable <<_zihpm_isa_extension>> (hardware performance monitors).
| `CPU_EXTENSION_RISCV_Zmmul` | boolean | false | Enable <<_zmmul_isa_extension>> (hardware-based integer multiplication).
| `CPU_EXTENSION_RISCV_Zxcfu` | boolean | false | Enable NEORV32-specific <<_zxcfu_isa_extension>> (custom RISC-V instructions).
4+^| **CPU Tuning Options**
| `FAST_MUL_EN` | boolean | false | Implement fast (but large) full-parallel multipliers (trying to infer DSP blocks).
| `FAST_SHIFT_EN` | boolean | false | Implement fast (but large) full-parallel barrel shifters.
4+^| **CPU <<_architecture>> Tuning Options**
| `FAST_MUL_EN` | boolean | false | Implement fast but large full-parallel multipliers (trying to infer DSP blocks); see section <<_cpu_arithmetic_logic_unit>>.
| `FAST_SHIFT_EN` | boolean | false | Implement fast but large full-parallel barrel shifters; see section <<_cpu_arithmetic_logic_unit>>.
| `REGFILE_HW_RST` | boolean | false | Implement full hardware reset for register file (prevents inferring block RAM); see section <<_cpu_register_file>>.
4+^| **Physical Memory Protection (<<_pmp_isa_extension>>)**
| `PMP_NUM_REGIONS` | natural | 0 | Number of implemented PMP regions (0..16).
| `PMP_MIN_GRANULARITY` | natural | 4 | Minimal region granularity in bytes. Has to be a power of two, min 4.
6 changes: 5 additions & 1 deletion rtl/core/neorv32_cpu.vhd
@@ -64,6 +64,7 @@ entity neorv32_cpu is
-- Tuning Options --
FAST_MUL_EN : boolean; -- use DSPs for M extension's multiplier
FAST_SHIFT_EN : boolean; -- use barrel shifter for shift operations
REGFILE_HW_RST : boolean; -- implement full hardware reset for register file
-- Physical Memory Protection (PMP) --
PMP_NUM_REGIONS : natural range 0 to 16; -- number of regions (0..16)
PMP_MIN_GRANULARITY : natural; -- minimal region granularity in bytes, has to be a power of 2, min 4 bytes
@@ -191,6 +192,7 @@ begin
-- Tuning Options --
FAST_MUL_EN => FAST_MUL_EN, -- use DSPs for M extension's multiplier
FAST_SHIFT_EN => FAST_SHIFT_EN, -- use barrel shifter for shift operations
REGFILE_HW_RST => REGFILE_HW_RST, -- implement full hardware reset for register file
-- Physical memory protection (PMP) --
PMP_EN => pmp_enable_c, -- physical memory protection enabled
-- Hardware Performance Monitors (HPM) --
@@ -256,13 +258,15 @@ begin
-- -------------------------------------------------------------------------------------------
neorv32_cpu_regfile_inst: entity neorv32.neorv32_cpu_regfile
generic map (
RVE_EN => CPU_EXTENSION_RISCV_E, -- implement embedded RF extension?
RST_EN => REGFILE_HW_RST, -- enable dedicated hardware reset ("ASIC style")
RVE_EN => CPU_EXTENSION_RISCV_E, -- implement embedded RF extension
RS3_EN => regfile_rs3_en_c, -- enable 3rd read port
RS4_EN => regfile_rs4_en_c -- enable 4th read port
)
port map (
-- global control --
clk_i => clk_i, -- global clock, rising edge
rstn_i => rstn_i, -- global reset, low-active, async
ctrl_i => ctrl, -- main control bus
-- data input --
alu_i => alu_res, -- ALU result
12 changes: 7 additions & 5 deletions rtl/core/neorv32_cpu_control.vhd
@@ -75,6 +75,7 @@ entity neorv32_cpu_control is
-- Tuning Options --
FAST_MUL_EN : boolean; -- use DSPs for M extension's multiplier
FAST_SHIFT_EN : boolean; -- use barrel shifter for shift operations
REGFILE_HW_RST : boolean; -- implement full hardware reset for register file
-- Physical memory protection (PMP) --
PMP_EN : boolean; -- physical memory protection enabled
-- Hardware Performance Monitors (HPM) --
@@ -611,7 +612,7 @@ begin
if (execute_engine.ir(instr_funct3_msb_c) = '0') then -- beq / bne
execute_engine.branch_taken <= cmp_i(cmp_equal_c) xor execute_engine.ir(instr_funct3_lsb_c);
else -- blt(u) / bge(u)
execute_engine.branch_taken <= cmp_i(cmp_less_c) xor execute_engine.ir(instr_funct3_lsb_c);
execute_engine.branch_taken <= cmp_i(cmp_less_c) xor execute_engine.ir(instr_funct3_lsb_c);
end if;
else -- unconditional branch
execute_engine.branch_taken <= '1';
@@ -649,7 +650,7 @@ begin
execute_engine.pc <= execute_engine.next_pc(XLEN-1 downto 1) & '0';
end if;

-- next PC: address of next logic instruction --
-- next PC: address of next instruction --
case execute_engine.state is

when TRAP_ENTER => -- starting trap environment
@@ -1015,10 +1016,10 @@ begin
when BRANCHED => -- delay cycle to wait for reset of pipeline front-end (instruction fetch)
-- ------------------------------------------------------------
execute_engine.state_nxt <= DISPATCH;
-- house keeping: use this state to (re-)initialize the register file's x0/zero register --
if (reset_x0_c = true) then -- if x0 is a "real" register that has to be initialized to zero
-- house keeping: use this state also to (re-)initialize the register file's x0/zero register --
if (REGFILE_HW_RST = false) then -- x0 does not provide a dedicated hardware reset
ctrl_nxt.rf_mux <= rf_mux_csr_c; -- this will return 0 since csr.re_nxt is zero
ctrl_nxt.rf_zero_we <= '1'; -- allow/force write access to x0
ctrl_nxt.rf_zero_we <= '1'; -- force write access to x0
end if;

when MEM_REQ => -- trigger memory request
@@ -2094,6 +2095,7 @@ begin
-- misc --
csr_rdata(20) <= bool_to_ulogic_f(is_simulation_c); -- is this a simulation?
-- tuning options --
csr_rdata(29) <= bool_to_ulogic_f(REGFILE_HW_RST); -- full hardware reset of register file
csr_rdata(30) <= bool_to_ulogic_f(FAST_MUL_EN); -- DSP-based multiplication (M extensions only)
csr_rdata(31) <= bool_to_ulogic_f(FAST_SHIFT_EN); -- parallel logic for shifts (barrel shifters)
