stnolting · Jul 29, 2023
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -32,6 +32,7 @@ mimpid = 0x01040312 -> Version 01.04.03.12 -> v1.4.3.12
 
 | Date (*dd.mm.yyyy*) | Version | Comment |
 |:-------------------:|:-------:|:--------|
+| 29.07.2023 | 1.8.7.4 | RTL cleanup and optimizations (fewer synthesis warnings, lower resource requirements); [#660](https://github.com/stnolting/neorv32/pull/660) |
 | 28.07.2023 | 1.8.7.3 | :warning: reworked **SYSINFO** module; clean-up address space layout; clean-up assertion notes; [#659](https://github.com/stnolting/neorv32/pull/659) |
 | 27.07.2023 | 1.8.7.2 | :bug: make sure that IMEM/DMEM size is always a power of two; [#658](https://github.com/stnolting/neorv32/pull/658) |
 | 27.07.2023 | 1.8.7.1 | :warning: remove `CUSTOM_ID` generic; cleanup and re-layout `NEORV32_SYSINFO.SOC` bits; (:bug:) fix gateway's generics (`positive` -> `natural` as these generics are allowed to be zero); [#657](https://github.com/stnolting/neorv32/pull/657) |

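Note on the version encoding referenced in the changelog hunk header (`mimpid = 0x01040312 -> v1.4.3.12`): each byte of the word carries one version field, and its hex digits are read as decimal digits. A minimal C sketch of that decoding follows. `format_hw_version` is a hypothetical helper name used only for illustration; it is not part of the NEORV32 software library.

```c
#include <stdint.h>
#include <stdio.h>

/* Format a version word such as 0x01040312 as "1.4.3.12".
 * Assumption (taken from the changelog header example): one version field
 * per byte, most significant byte first, hex digits read as decimal digits.
 * format_hw_version() is a hypothetical helper, not a NEORV32 API. */
int format_hw_version(uint32_t mimpid, char *buf, size_t len) {
    return snprintf(buf, len, "%x.%x.%x.%x",
                    (unsigned)((mimpid >> 24) & 0xffu),
                    (unsigned)((mimpid >> 16) & 0xffu),
                    (unsigned)((mimpid >> 8) & 0xffu),
                    (unsigned)(mimpid & 0xffu));
}
```

Under the same scheme, the version added by this change (1.8.7.4) would correspond to the word `0x01080704`.
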
diff --git a/docs/datasheet/cpu.adoc b/docs/datasheet/cpu.adoc
@@ -145,7 +145,8 @@ The control unit is split into a "front-end" and a "back-end".
 
 The front-end is responsible for fetching instructions in chunks of 32-bits. This can be a single aligned 32-bit instruction,
 two aligned 16-bit instructions or a mixture of those. The instructions including control and exception information are stored
-to a FIFO queue - the instruction prefetch buffer (IPB). The depth of this FIFO can be configured by the `CPU_IPB_ENTRIES` top generic.
+to a FIFO queue - the instruction prefetch buffer (IPB). This FIFO has a depth of two entries by default but can be customized
+via the `ipb_depth_c` VHDL package constant.
 
 The FIFO allows the front-end to do "speculative" instruction fetches, as it keeps fetching the next consecutive instruction
 all the time. This also allows to decouple front-end (instruction fetch) and back-end (instruction execution) so both modules
@@ -695,7 +696,6 @@ Auto-increment of the HPMs can be deactivated individually via the <<_mcountinhi
 This is a sub-extension of the <<_m_isa_extension>> ISA extension. It implements only the multiplication operations
 of the `M` extensions and is intended for size-constrained setups that require hardware-based
 integer multiplications but not hardware-based divisions, which will be computed entirely in software.
-This extension requires only ~50% of the hardware utilization of the "full" `M` extension.
 
 
 ==== `Zxcfu` ISA Extension

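Note on the `Zmmul` hunk above: with this extension, divisions are computed entirely in software. The C sketch below shows one classic way to do that, bit-serial restoring (shift-and-subtract) division. It illustrates the technique only; it is not the routine the compiler's runtime library actually uses, and `udiv32` is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>

/* Unsigned 32-bit restoring division (shift-and-subtract).
 * Returns the quotient; the remainder is stored to *rem if rem is not NULL.
 * Illustration only; udiv32() is a hypothetical helper. */
uint32_t udiv32(uint32_t n, uint32_t d, uint32_t *rem) {
    uint32_t q = 0;
    uint64_t r = 0; /* 64-bit so the shift below cannot overflow */
    for (int i = 31; i >= 0; i--) {
        r = (r << 1) | ((n >> i) & 1u); /* bring down the next dividend bit */
        if (r >= d) {                   /* divisor fits: subtract, set quotient bit */
            r -= d;
            q |= (uint32_t)1u << i;
        }
    }
    if (rem != NULL) {
        *rem = (uint32_t)r;
    }
    return q;
}
```

For a zero divisor this loop yields an all-ones quotient and leaves the dividend as the remainder, which matches the results RISC-V defines for `divu`/`remu` by zero.
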
diff --git a/docs/datasheet/soc.adoc b/docs/datasheet/soc.adoc
@@ -221,7 +221,6 @@ The generic type "`suv(x:y)`" is an abbreviation for "`std_ulogic_vector(x downt
 4+^| **CPU Tuning Options**
 | `FAST_MUL_EN`           | boolean   | false      | Implement fast (but large) full-parallel multipliers (trying to infer DSP blocks).
 | `FAST_SHIFT_EN`         | boolean   | false      | Implement fast (but large) full-parallel barrel shifters.
-| `CPU_IPB_ENTRIES`       | natural   | 1          | Number of entries in the CPU's instruction prefetch buffer.
 4+^| **Physical Memory Protection (<<_pmp_isa_extension>>)**
 | `PMP_NUM_REGIONS`       | natural   | 0          | Number of implemented PMP regions (0..16).
 | `PMP_MIN_GRANULARITY`   | natural   | 4          | Minimal region granularity in bytes. Has to be a power of two, min 4.
@@ -459,23 +458,29 @@ A pending FIRQ has to be explicitly cleared by writing zero to the according <<_
 
 As a 32-bit architecture the NEORV32 can access a 4GB physical address space. By default, this address space is
 split into six main regions. Each region provides specific _physical memory attributes_ ("PMAs") that define
-the access capabilities.
+the access capabilities (`rwxac`; `r` = read permission, `w` = write permission, `x` = execute permission,
+`a` = atomic access support, `c` = cached CPU access).
 
 .NEORV32 Processor Address Space (Default Configuration)
 image::address_space.png[900]
 
+.Main Address Regions
 [cols="<1,^4,^2,<7"]
 [options="header",grid="rows"]
 |=======================
-| # | Region                      | PMAs  | Description
-| 1 | Internal IMEM address space | `rwx` | For instructions (=code) and constants; mapped to the internal <<_instruction_memory_imem>>.
-| 2 | Internal DMEM address space | `rwx` | For application runtime data (heap, stack, etc.); mapped to the internal <<_data_memory_dmem>>).
-| 3 | Memory-mapped XIP flash     | `r-x` | Memory-mapped access to the <<_execute_in_place_module_xip>> SPI flash.
-| 4 | Bootloader address space    | `r-x` | Read-only memory for the internal <<_bootloader_rom_bootrom>> containing the default <<_bootloader>>.
-| 5 | IO/peripheral address space | `rwx` | Processor-internal peripherals / IO devices.
-| 6 | The "**void**"              | `rwx` | Unmapped address space. All accesses to this region(s) are redirected to the <<_processor_external_memory_interface_wishbone>> (if implemented).
+| # | Region                      | PMAs    | Description
+| 1 | Internal IMEM address space | `rwxac` | For instructions (=code) and constants; mapped to the internal <<_instruction_memory_imem>>.
+| 2 | Internal DMEM address space | `rwxac` | For application runtime data (heap, stack, etc.); mapped to the internal <<_data_memory_dmem>>.
+| 3 | Memory-mapped XIP flash     | `r-xac` | Memory-mapped access to the <<_execute_in_place_module_xip>> SPI flash.
+| 4 | Bootloader address space    | `r-xa-` | Read-only memory for the internal <<_bootloader_rom_bootrom>> containing the default <<_bootloader>>.
+| 5 | IO/peripheral address space | `rwxa-` | Processor-internal peripherals / IO devices.
+| 6 | The "**void**"              | `rwxac` | Unmapped address space. All accesses to this region(s) are redirected to the <<_processor_external_memory_interface_wishbone>> (if implemented).
 |=======================
 
+.Custom PMAs
+[NOTE]
+Physical memory attributes can be customized (constrained) using the CPU's <<_pmp_isa_extension>>.
+
 The CPU can access all of the 32-bit address space from the instruction fetch interface and also from the data access
 interface. Both interfaces can be equipped with optional caches (<<_processor_internal_data_cache_dcache>> and
 <<_processor_internal_instruction_cache_icache>>). The two CPU interfaces are multiplexed by a simple bus switch into

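Note on the PMA column introduced in the hunk above: each attribute string has five positions in the fixed order `rwxac`, with `-` marking an attribute the region does not provide. A small C sketch of how such a string could be turned into a bit mask follows. The flag values and the name `pma_parse` are assumptions made for this example and do not mirror any hardware register or NEORV32 API.

```c
#include <string.h>

/* Hypothetical flag bits, one per position of the "rwxac" string. */
enum {
    PMA_R = 1 << 0, /* read permission */
    PMA_W = 1 << 1, /* write permission */
    PMA_X = 1 << 2, /* execute permission */
    PMA_A = 1 << 3, /* atomic access support */
    PMA_C = 1 << 4  /* cached CPU access */
};

/* Parse a five-character PMA string such as "r-xa-".
 * Returns the flag mask, or -1 if the string is malformed. */
int pma_parse(const char *s) {
    static const char order[] = "rwxac";
    int flags = 0;
    if (s == NULL || strlen(s) != 5) {
        return -1;
    }
    for (int i = 0; i < 5; i++) {
        if (s[i] == order[i]) {
            flags |= 1 << i;  /* attribute present */
        } else if (s[i] != '-') {
            return -1;        /* wrong letter for this position */
        }
    }
    return flags;
}
```

With these definitions, the bootloader region's `r-xa-` maps to `PMA_R | PMA_X | PMA_A`, while the IO region's `rwxa-` additionally sets `PMA_W`.
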
diff --git a/docs/datasheet/software.adoc b/docs/datasheet/software.adoc
@@ -265,6 +265,8 @@ The following default compiler flags are used for compiling an application. Thes
 | `-lgcc`               | Make sure we have no unresolved references to internal GCC library subroutines.
 | `-mno-fdiv`           | Use built-in software functions for floating-point divisions and square roots (since the according instructions are not supported yet).
 | `-g`                  | Include debugging information/symbols in ELF.
+| `-mstrict-align`      | Unaligned memory accesses cannot be resolved by the hardware and require emulation.
+| `-mbranch-cost=...`   | Branches cost a lot of cycles on a multi-cycle architecture.
 |=======================
 
 :sectnums:
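Note on the new `-mstrict-align` flag: the flag keeps the compiler from generating unaligned memory accesses, but application code can still produce them, for example by casting a byte pointer at an odd offset to `uint32_t *`. One portable way to avoid that is to assemble the word from individual bytes. The sketch below assumes little-endian data; `load_le32` is a hypothetical helper name.

```c
#include <stdint.h>

/* Read a little-endian 32-bit value from an arbitrarily aligned address
 * using byte accesses only, so no unaligned word access is required.
 * load_le32() is a hypothetical helper for illustration. */
uint32_t load_le32(const uint8_t *p) {
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}
```

A typical use is parsing packed protocol or file headers, where fields sit at odd byte offsets inside a receive buffer.
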

diff --git a/docs/figures/address_space.png b/docs/figures/address_space.png
diff --git a/docs/userguide/application_specific_configuration.adoc b/docs/userguide/application_specific_configuration.adoc
@@ -23,8 +23,6 @@ multiplications, `FAST_SHIFT_EN => true` use a fast barrel shifter for shift ope
 * Implement the instruction cache: `ICACHE_EN => true`
 * Use as many _internal_ memory as possible to reduce memory access latency: `MEM_INT_IMEM_EN => true` and
 `MEM_INT_DMEM_EN => true`, maximize `MEM_INT_IMEM_SIZE` and `MEM_INT_DMEM_SIZE`
-* Increase the CPU's instruction prefetch buffer size: if **no** instruction cache is implemented `CPU_IPB_ENTRIES` should be
-quite large
 * _To be continued..._
 
 
@@ -53,7 +51,6 @@ also reduces program code size by approximately 30%.
 * If not explicitly used/required, exclude the CPU standard counters `[m]instret[h]`
 (number of instruction) and `[m]cycle[h]` (number of cycles) from synthesis by disabling the `Zicntr` ISA extension
 (note, this is not RISC-V compliant).
-* Reduce the CPU's prefetch buffer size (`CPU_IPB_ENTRIES`) to its minimum (=1).
 * Map CPU shift operations to a small and iterative shifter unit (`FAST_SHIFT_EN => false`).
 * If you have unused DSP block available, you can map multiplication operations to those slices instead of
 using LUTs to implement the multiplier (`FAST_MUL_EN => true`).

diff --git a/rtl/core/neorv32_cpu.vhd b/rtl/core/neorv32_cpu.vhd
@@ -66,7 +66,6 @@ entity neorv32_cpu is
     -- Extension Options --
     FAST_MUL_EN                  : boolean; -- use DSPs for M extension's multiplier
     FAST_SHIFT_EN                : boolean; -- use barrel shifter for shift operations
-    CPU_IPB_ENTRIES              : natural; -- entries in instruction prefetch buffer, has to be a power of 2, min 1
     -- Physical Memory Protection (PMP) --
     PMP_NUM_REGIONS              : natural; -- number of regions (0..16)
     PMP_MIN_GRANULARITY          : natural; -- minimal region granularity in bytes, has to be a power of 2, min 4 bytes
@@ -99,14 +98,10 @@ end neorv32_cpu;
 
 architecture neorv32_cpu_rtl of neorv32_cpu is
 
-  -- local constants: additional register file read ports --
+  -- auto-configuration --
   constant regfile_rs3_en_c : boolean := CPU_EXTENSION_RISCV_Zxcfu or CPU_EXTENSION_RISCV_Zfinx; -- 3rd register file read port (rs3)
   constant regfile_rs4_en_c : boolean := CPU_EXTENSION_RISCV_Zxcfu; -- 4th register file read port (rs4)
 
-  -- local constant: instruction prefetch buffer depth --
-  constant ipb_override_c : boolean := (CPU_EXTENSION_RISCV_C = true) and (CPU_IPB_ENTRIES < 2); -- override IPB size: set to 2?
-  constant ipb_depth_c    : natural := cond_sel_natural_f(ipb_override_c, 2, CPU_IPB_ENTRIES);
-
   -- local signals --
   signal ctrl        : ctrl_bus_t; -- main control bus
   signal imm         : std_ulogic_vector(XLEN-1 downto 0); -- immediate
@@ -120,7 +115,7 @@ architecture neorv32_cpu_rtl of neorv32_cpu is
   signal mem_rdata   : std_ulogic_vector(XLEN-1 downto 0); -- memory read data
   signal cp_done     : std_ulogic; -- ALU co-processor operation done
   signal alu_exc     : std_ulogic; -- ALU exception
-  signal bus_d_wait  : std_ulogic; -- wait for current bus data access
+  signal bus_d_wait  : std_ulogic; -- wait for current data bus access
   signal csr_rdata   : std_ulogic_vector(XLEN-1 downto 0); -- csr read data
   signal mar         : std_ulogic_vector(XLEN-1 downto 0); -- memory address register
   signal ma_load     : std_ulogic; -- misaligned load data address
@@ -143,7 +138,7 @@ begin
   -- -------------------------------------------------------------------------------------------
   -- say hello --
   assert false report
-    "The NEORV32 RISC-V Processor (Version 0x" & to_hstring32_f(hw_version_c) & ") - github.com/stnolting/neorv32" severity note;
+    "The NEORV32 RISC-V Processor Version 0x" & to_hstring32_f(hw_version_c) & " - github.com/stnolting/neorv32" severity note;
 
   -- CPU ISA configuration --
   assert false report
@@ -175,12 +170,6 @@ begin
   assert not (CPU_BOOT_ADDR(1 downto 0) /= "00") report
     "NEORV32 CPU CONFIG ERROR! <CPU_BOOT_ADDR> has to be 32-bit aligned." severity error;
 
-  -- Instruction prefetch buffer --
-  assert not (is_power_of_two_f(CPU_IPB_ENTRIES) = false) report
-    "NEORV32 CPU CONFIG ERROR! Number of entries in instruction prefetch buffer <CPU_IPB_ENTRIES> has to be a power of two." severity error;
-  assert not (ipb_override_c = true) report
-    "NEORV32 CPU CONFIG WARNING! Overriding <CPU_IPB_ENTRIES> configuration (setting =2) because C ISA extension is enabled." severity warning;
-
   -- PMP --
   assert not (PMP_NUM_REGIONS > 16) report
     "NEORV32 CPU CONFIG ERROR! Number of PMP regions <PMP_NUM_REGIONS> out of valid range (0..16)." severity error;
@@ -233,7 +222,6 @@ begin
     -- Tuning Options --
     FAST_MUL_EN                  => FAST_MUL_EN,                  -- use DSPs for M extension's multiplier
     FAST_SHIFT_EN                => FAST_SHIFT_EN,                -- use barrel shifter for shift operations
-    CPU_IPB_ENTRIES              => ipb_depth_c,                  -- entries is instruction prefetch buffer, has to be a power of 2, min 1
     -- Physical memory protection (PMP) --
     PMP_NUM_REGIONS              => PMP_NUM_REGIONS,              -- number of regions (0..16)
     PMP_MIN_GRANULARITY          => PMP_MIN_GRANULARITY,          -- minimal region granularity in bytes, has to be a power of 2, min 4 bytes
@@ -323,10 +311,10 @@ begin
     csr_i  => csr_rdata, -- CSR read data
     pc2_i  => next_pc,   -- next PC
     -- data output --
-    rs1_o  => rs1,       -- operand 1
-    rs2_o  => rs2,       -- operand 2
-    rs3_o  => rs3,       -- operand 3
-    rs4_o  => rs4        -- operand 4
+    rs1_o  => rs1,       -- rs1
+    rs2_o  => rs2,       -- rs2
+    rs3_o  => rs3,       -- rs3
+    rs4_o  => rs4        -- rs4
   );