
neorv32_fifo vivado implementation issue #753

Closed
doud3157 opened this issue Dec 18, 2023 · 13 comments · Fixed by #754 or #827
Assignees
Labels
enhancement New feature or request HW hardware-related

Comments

@doud3157

Describe the bug
With Vivado 2023.1, the implementation of neorv32_fifo is incorrect. It uses CLB registers instead of BRAM.

To Reproduce
Synthesize the neorv32_fifo standalone with a depth of 1024 bytes (for example).

Expected behavior
The neorv32_fifo should use BRAM and not CLB registers when DEPTH > 1.

Screenshots
NA

Environment:

  • Windows 11
  • Vivado 2023.1

Hardware:

  • Zynq ultrascale+

Additional context
The problem is due to the usage of the record fifo_t. The data member (type fifo_data_t) should be defined outside of the type fifo_t to help Vivado. The processes (sync_read and async_read) should be modified to use the new definition of data.
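
For illustration, a minimal sketch of the suggested refactoring (the exact signal and record-field names used here, such as fifo_mem, fifo.we and fifo.w_pnt, are assumptions rather than the actual NEORV32 sources):

-- keep the memory array as a standalone signal instead of a field inside the fifo_t record,
-- so Vivado can recognize it as a RAM; only the control state remains in the record
signal fifo_mem : fifo_data_t; -- array (0 to fifo_depth_c-1) of std_ulogic_vector

sync_read: process(clk_i)
begin
  if rising_edge(clk_i) then
    if (fifo.we = '1') then -- write port
      fifo_mem(to_integer(unsigned(fifo.w_pnt(fifo.w_pnt'left-1 downto 0)))) <= wdata_i;
    end if;
    -- synchronous read port now indexes a plain array signal
    rdata_o <= fifo_mem(to_integer(unsigned(fifo.r_pnt(fifo.r_pnt'left-1 downto 0))));
  end if;
end process sync_read;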

@stnolting
Owner

Hey @doud3157. Thanks for finding this!

I know Vivado sometimes has issues with multi-dimensional arrays (and a record could be seen as such), but I think the problem is somewhere else. This "dual-access" to the FIFO RAM might be an issue:

if (fifo_depth_c = 1) then
  rdata_o <= fifo.data(0);
else
  rdata_o <= fifo.data(to_integer(unsigned(fifo.r_pnt(fifo.r_pnt'left-1 downto 0))));
end if;

Anyway, I will do some tests and update the FIFO module.

@stnolting stnolting added HW hardware-related optimization Make things faster, smaller and more efficient enhancement New feature or request and removed optimization Make things faster, smaller and more efficient labels Dec 18, 2023
@stnolting stnolting self-assigned this Dec 18, 2023
@stnolting stnolting linked a pull request Dec 19, 2023 that will close this issue
@robhancocksed
Contributor

It looks like this might still be an issue, at least with respect to the SLINK module where it generates a 33 bit wide FIFO. If I set the FIFO depths to 1024 words for RX and 512 for TX, for example, it seems to not be inferring block RAM but 33-bit registers, which uses a large amount of logic resources.

@stnolting stnolting reopened this Feb 22, 2024
@stnolting
Owner

I've just tested that using Vivado - and you are right, Vivado is trying to build the SLINK's FIFOs from FF primitives. It seems like Vivado is not happy with these size checks:

if rising_edge(clk_i) then
  if (fifo_depth_c > 1) then -- prevent a NULL assertion for fifo_depth_c of 1
    rdata_o <= fifo_mem(to_integer(unsigned(r_pnt(r_pnt'left-1 downto 0))));
  else
    rdata_o <= fifo_mem(0);
  end if;
end if;

I think that somehow Vivado is not capable of identifying these checks as "runtime static". Maybe it is better to use if-generate here? 🤔

@mikaelsky what do you think (I think we've added those checks in #778)?

@umarcor
Collaborator

umarcor commented Feb 22, 2024

@stnolting when implementing TLAST, did you attach/concat it to each word/value before sending it to the FIFOs? If that is the case, the problem may be Vivado not being able to use BRAMs as 33 bit words.

Since I believe the main use case of TLAST is signaling the end of a vector, maybe it can be modeled as a peripheral attached to the end of the SLINK FIFOs, which asserts TLAST every N transfers, where N is configurable through a memory-mapped (Wishbone) control bus. So, the peripheral would have three interfaces (a rough sketch follows below):

  • FIFO slave.
  • SLINK master.
  • Wishbone (set of control registers).

Then, when sending a matrix (or a set of vectors of equal length), the user only needs to define the width/height.
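
As a rough, hypothetical sketch of such a TLAST-generator stage (all signal names and the register layout below are made up for illustration, not an existing NEORV32 module):

-- assumed to live inside an architecture where ieee.numeric_std is in scope
signal cnt   : unsigned(31 downto 0);          -- counts accepted transfers
signal n_reg : std_ulogic_vector(31 downto 0); -- "transfers per vector" N, written via the control bus

tlast_gen: process(clk_i)
begin
  if rising_edge(clk_i) then
    if (rstn_i = '0') then
      cnt <= (others => '0');
    elsif (valid_i = '1') and (ready_i = '1') then -- one transfer accepted downstream
      if (cnt = unsigned(n_reg) - 1) then
        cnt <= (others => '0'); -- end of vector reached, restart
      else
        cnt <= cnt + 1;
      end if;
    end if;
  end if;
end process tlast_gen;

tlast_o <= '1' when (cnt = unsigned(n_reg) - 1) else '0'; -- assert TLAST on the N-th transfer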

EDIT

Indeed:

@stnolting
Owner

If that is the case, the problem may be Vivado not being able to use BRAMs as 33 bit words.

That "should" be no problem as the BRAM primitives can operate in 512*36-bit mode - so they are wide enough. The "tlast" identifier is just another "data" bit written to the FIFO.

Since I believe the main use case of TLAST is signaling the end of a vector, maybe it can be modeled as a peripheral attached to the end of the SLINK FIFOs, which asserts TLAST every N transfers, where N is configurable through a memory-mapped (Wishbone) control bus.

That would be a nice work-around here, but at the cost of additional hardware and (software) complexity 🙈

@umarcor
Collaborator

umarcor commented Feb 22, 2024

@stnolting check pages 16 and 17 of https://docs.xilinx.com/v/u/en-US/ug473_7Series_Memory_Resources:

(32) DI[A|B] Data input bus.
(4) DIP[A|B] (1) Data input parity bus. Can be used for additional data inputs.

So, it is true that it can hold up to 36 bits per word, but I'm not sure it can deal with them unless you use two separate ports.

Notes:

  1. The Data-In Buses - DIADI, DIPADIP, DIBDI, and DIPBDIP section has more information on data
    parity pins.

Then, in page 31:

Data-In Buses - DIADI, DIPADIP, DIBDI, and DIPBDIP
Data-in buses provide the new data value to be written into RAM. The regular data-in bus
(DI), plus the parity data-in bus (DIP) when available, have a total width equal to the port
width. For example the 36-bit port data width is represented by DI[31:0] and DIP[3:0], as
shown in Table 1-11 through Table 1-14. See Table 1-15 for SDP mode port name mapping.

@mikaelsky
Collaborator

I would be surprised if it's the boundary checks, mostly as they fall out as compile-time constants. If we believe that is the case, we would need to wrap them in a conditional generate statement that results in either an N-deep memory or a 1-deep memory.

What I did do is just read up on inferred block RAM: https://docs.xilinx.com/r/en-US/ug901-vivado-synthesis/Inferring-UltraRAM-in-Vivado-Synthesis, which provides examples of how to infer a RAM vs. instantiating one directly.

Vivado will try to map into a plurality of block RAMs with a width resolution of 18 bits, e.g., 18, 36, 54, etc.

The one thing Vivado can't do is infer a FIFO, which would have been neat for this use case.

@umarcor
Collaborator

umarcor commented Feb 22, 2024

@mikaelsky I would expect BRAMs to be used, rather than UltraRAM. https://docs.xilinx.com/r/en-US/am007-versal-memory/Block-RAM-Resources

@mikaelsky
Collaborator

@umarcor True true :) The UltraRAM is more an example of how you can ensure that Vivado can infer RAMs (block or ultra).

The chapter you are referring to just provides examples of direct block RAM usage. From the application note you will find the following limitations on block RAM inference.
[screenshot: block RAM inference limitations from UG901]

In the same application note you will find multiple examples of how to infer a block RAM, e.g. https://docs.xilinx.com/r/en-US/ug901-vivado-synthesis/Single-Port-Block-RAM-with-Resettable-Data-Output-VHDL

What is important here is that there are limits to how you can write your code to ensure that Vivado infers a block RAM. If Vivado can't understand the code, it will instead infer flops.
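
For reference, the style of template UG901 gives for an inferable RAM looks roughly like this (a simplified sketch, not copied verbatim from the document):

library ieee;
use ieee.std_logic_1164.all;

entity rams_sp is -- simple single-port, synchronous-read RAM in the UG901 style
  generic (DEPTH : natural := 1024; WIDTH : natural := 33);
  port (
    clk  : in  std_logic;
    we   : in  std_logic;
    addr : in  natural range 0 to DEPTH-1;
    din  : in  std_logic_vector(WIDTH-1 downto 0);
    dout : out std_logic_vector(WIDTH-1 downto 0));
end rams_sp;

architecture rtl of rams_sp is
  type ram_t is array (0 to DEPTH-1) of std_logic_vector(WIDTH-1 downto 0);
  signal ram : ram_t; -- Vivado infers block RAM from this pattern
begin
  process(clk)
  begin
    if rising_edge(clk) then
      if (we = '1') then
        ram(addr) <= din;
      end if;
      dout <= ram(addr); -- synchronous read gives read-first block RAM behavior
    end if;
  end process;
end rtl;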

@umarcor
Collaborator

umarcor commented Feb 22, 2024

What is important here is that there are limits to how you can write your code to ensure that Vivado infers a block RAM. If Vivado can't understand the code, it will instead infer flops.

My point is that I believe using a single 32-bit port allows Vivado to infer a BRAM, but using a 33-bit port does not. That's because the block and port description in the chapter I referred to explains that there is a 32-bit port and a 4-bit port (which is meant for parity but can also be used for extra bits). Unfortunately, I don't have Vivado at hand at the moment, so I cannot test it.

@mikaelsky
Collaborator

I get where you are going. But if you read the little snippet I copied in, Vivado does support "any size and data width" and "Vivado synthesis maps the memory description into one or several RAM".

If we look at the 7-series block RAM data sheet: https://docs.xilinx.com/v/u/en-US/ug473_7Series_Memory_Resources

From the block RAM data sheet we get:
[screenshot from the UG473 block RAM data sheet]

Basically, what I'm hinting at is that I think you are running down the wrong path with the 33-bit vs. 32-bit question. Vivado will do whatever it takes and waste block RAM bits. So, e.g., a 33-bit width will indeed fit in a 36-bit block RAM. The 4 extra bits are there for ECC or parity, but only if you enable it; otherwise they are generally available.

Now, where things do get hairy with the "33 bits vs. 32 bits" is if we are using byte write enables, as the byte writes are tied into the parity bit as well, resulting in a 9-bit byte. In this case you are absolutely right that you only get 32 bits per 36 bits, which will result in a 32-bit block RAM plus an additional 1- or 2-bit block RAM to get a combined 33/34-bit wide block RAM.
[screenshot from UG473]

I will highlight that they always refer to the "maximum bit width" for the data ports. This is because you can indeed freely define the port width, with the above-mentioned caveat that if the chosen bit width fits badly into the power-of-2 scheme that Vivado uses, you will waste RAM bits; e.g., a width of 3 will result in a 4-bit wide block RAM, wasting 25% of the FPGA block RAM in the process.

@stnolting
Owner

What a great discussion that dives deep into FPGAs! :)

I think that the synthesis tool should have no issues with mapping an NxM RAM if

  1. the HDL is written properly
  2. the RAM depth exceeds a certain threshold

I just did some experiments.

If we believe that is the case, we would need to wrap them in a conditional generate statement that results in either an N-deep memory or a 1-deep memory.

That is the point! We need to specify two distinct signals here: one NxM-wide array used if the FIFO depth is > 1 and one M-wide array used if the FIFO depth is = 1. With these changes synthesis happily infers an Nx33-bit block RAM:

|arty7_35t_wrapper | neorv32_inst/io_system.neorv32_slink_inst_true.neorv32_slink_inst/rx_fifo_inst/memory_no_reset.fifo_mem_reg        | 1 K x 33(READ_FIRST)   | W |   | 1 K x 33(WRITE_FIRST)  |   | R | Port A and B     | 0      | 1      | 
|arty7_35t_wrapper | neorv32_inst/io_system.neorv32_slink_inst_true.neorv32_slink_inst/tx_fifo_inst/memory_no_reset.fifo_mem_reg        | 512 x 33(READ_FIRST)   | W |   | 512 x 33(WRITE_FIRST)  |   | R | Port A and B     | 1      | 0      | 
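
Roughly, the two-signal split might look like this (an illustrative sketch only; the names fifo_mem, fifo_reg, w_pnt and FIFO_WIDTH are assumptions, and the actual change is in the linked PR):

memory_deep: if (fifo_depth_c > 1) generate -- "real" FIFO: array signal maps to block RAM
  signal fifo_mem : fifo_data_t; -- (0 to fifo_depth_c-1) x FIFO_WIDTH
begin
  process(clk_i)
  begin
    if rising_edge(clk_i) then
      if (we = '1') then
        fifo_mem(to_integer(unsigned(w_pnt(w_pnt'left-1 downto 0)))) <= wdata_i;
      end if;
      rdata_o <= fifo_mem(to_integer(unsigned(r_pnt(r_pnt'left-1 downto 0))));
    end if;
  end process;
end generate;

memory_shallow: if (fifo_depth_c = 1) generate -- depth-1 corner case: just one register
  signal fifo_reg : std_ulogic_vector(FIFO_WIDTH-1 downto 0);
begin
  process(clk_i)
  begin
    if rising_edge(clk_i) then
      if (we = '1') then
        fifo_reg <= wdata_i;
      end if;
      rdata_o <= fifo_reg;
    end if;
  end process;
end generate;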

I'll draft a PR for this.

@umarcor
Collaborator

umarcor commented Feb 22, 2024

@mikaelsky you are correct. I could run some tests with Vivado v2022.2 on some CI and I can confirm that main, v1.9.5 and v1.9.0 with SLINK FIFO length set to 1024 cannot finish implementation due to resource starvation. However, by applying umarcor@f7946d1 as a dirty test based on #753 (comment), implementation succeeds and the following summary is produced:

+-------------------+------+-------+------------+-----------+-------+
|     Site Type     | Used | Fixed | Prohibited | Available | Util% |
+-------------------+------+-------+------------+-----------+-------+
| Block RAM Tile    |    9 |     0 |          0 |        50 | 18.00 |
|   RAMB36/FIFO*    |    7 |     0 |          0 |        50 | 14.00 |
|     RAMB36E1 only |    7 |       |            |           |       |
|   RAMB18          |    4 |     0 |          0 |       100 |  4.00 |
|     RAMB18E1 only |    4 |       |            |           |       |
+-------------------+------+-------+------------+-----------+-------+

For completeness, I ran the test with the dirty fix on top of main using SLINK depth 4:

+-------------------+------+-------+------------+-----------+-------+
|     Site Type     | Used | Fixed | Prohibited | Available | Util% |
+-------------------+------+-------+------------+-----------+-------+
| Block RAM Tile    |    7 |     0 |          0 |        50 | 14.00 |
|   RAMB36/FIFO*    |    5 |     0 |          0 |        50 | 10.00 |
|     RAMB36E1 only |    5 |       |            |           |       |
|   RAMB18          |    4 |     0 |          0 |       100 |  4.00 |
|     RAMB18E1 only |    4 |       |            |           |       |
+-------------------+------+-------+------------+-----------+-------+
