
Updated Performance test #844

Merged · 22 commits · Mar 22, 2024

Conversation

mikaelsky
Collaborator

Now with functional makefiles and tests

@mikaelsky
Collaborator Author

Created a "clone" of common.mk that makes the software test work.
In the local performance test directory, the makefile will run across all tests, meaning make sim will run all the performance tests for all the test suites by default.

@stnolting
Owner

Hey @mikaelsky!

Thanks for sharing your CPI evaluation setup!

I'm still looking through your code. I do not quite understand why we need exclusive testbenches here. 🤔 Couldn't we just use the default one together with the existing sim target? To run the entire suite we could use a simple script that calls each benchmark's makefile.

@stnolting stnolting added the "SW (software-related)" label Mar 11, 2024
@stnolting stnolting mentioned this pull request Mar 11, 2024
@mikaelsky
Collaborator Author

The "exclusive" testbench was made to ensure that when we run the CPI test we only run with the core core and not all sorts of fun peripherals like cache and external memory. Basically this means you get just the core performance and not the "core+random peripheral that degrades performance" performance.
In theory this could be achieved by adding more variables to the simple test. But my thinking here was: lets keep the simple test simple :)

The change in common.mk, even though small, does allow for an easy add of different testbenches. So e.g. let's say I want to make a testbench that has a new feature in it, I could make a new area "newFeatureTB" that has my new feature setup. Then I can change the NEORV32_SIM_FOLDER in my local makefile to point to my new feature directory and hey presto, I can reuse all of the existing make and sim infrastructure while also testing my new feature :)

There is a simple script that runs all the benchmarks btw. "run_performance.sh".

I had to add the makefile in there as the GitHub pull request script runs through all the examples and calls something like make clean/sim on each. If that make target doesn't exist at that first hierarchy level, the pull request script fails out.
To make the pull request script pass I had to basically bang together a makefile whose only purpose is to run across a list of CPI tests and call the equivalent make target inside each of those directories.

The upside is you can now just call make sim from the performance_test directory and it will run all the performance tests automatically. So there is really no need for a simple script anymore.
I did change the default "all" target in each of the CPI tests to just be "rv32_all" so the makefile "just works".

@stnolting
Owner

The "exclusive" testbench was made to ensure that when we run the CPI test we only run with the core core and not all sorts of fun peripherals like cache and external memory. Basically this means you get just the core performance and not the "core+random peripheral that degrades performance" performance.

That's a good point.

However, the instruction timing description in https://stnolting.github.io/neorv32/#_instruction_sets_and_extensions already shows the architectural "best case" CPI values. These are of course heavily dependent on which ISA extensions are switched on and what the memory system looks like.

So, wouldn't it be better to use your benchmarks to profile a particular / application-specific configuration for its average CPI values?

That's where we could stick to the default (simple) testbench: adjust all the configuration options (e.g. CPU extensions but also memory latencies) according to the real setup to get a full-featured simulation model that now could also be used to benchmark average CPI.

Don't get me wrong, I really like your proposal. I am just afraid of maintaining all these different testbench setups. 🙈

@stnolting stnolting added the "DOC (Improvements or additions to documentation)" and "enhancement (New feature or request)" labels Mar 12, 2024
@mikaelsky
Collaborator Author

:best Hermione impression: "Actually..." ;) The documentation is outdated. The two motivations for the CPI test were to

  1. Update our internal simulation model with 'actual' CPI vs just a CPI of 1.
  2. To update the documentation as e.g. the Zfinx numbers are fairly dated, but I even see some of the I and M stuff being a tad off. With a CPI test suite one could imagine the documentation numbers could be automated.

I do agree the CPI should be measured on a given implementation. Which is what we are doing internally :)

The documentation should be theoretical best case, with the caveat that the theoretical best case is highly dependent on the execute engine state machine and all its cross dependencies. From there it can only go downhill based on the system implementer.

@stnolting
Owner

The documentation is outdated.

That's true (unfortunately) 🙈

Update our internal simulation model with 'actual' CPI vs just a CPI of 1.

Actually (😅), the optimum CPI (cycles per instruction) for the fastest instructions is 2.

To update the documentation as e.g. the Zfinx numbers are fairly dated, but I even see some of the I and M stuff being a tad off. With a CPI test suite one could imagine the documentation numbers could be automated.

Agreed!

But I think we should do this by hand for now (as we have no way to move up-to-date CI results to the documentation yet). So a local run with an adjusted configuration of the default testbench should be fine for this.

I do agree the CPI should be measured on a given implementation. Which is what we are doing internally :)

👍

The documentation should be theoretical best case, with the caveat that the theoretical best case is highly dependent on the execute engine state machine and all its cross dependencies. From there it can only go downhill based on the system implementer.

Right. We should add a note about this to the documentation.


Again, I like this simple-to-use setup: just run your proposed script and get all the interesting CPI stuff. But it feels oversized to have a separate testbench for this (I already find it difficult that we have a VUnit and a "boring normal" testbench). 🤔

But that's just my opinion. If this is something that can help people to easily benchmark their setups then I am fine with it. 😉

@mikaelsky
Collaborator Author

Oh and I added a new feature to the testbench, which is used by the CPI testing system. If you look at the bottom I added a "finish" tester that looks at GPIO(32) assertion and when that happens it calls std.env.finish, which stops the test from running and cleanly exits GHDL.

To use this I had to also update the GHDL call scripts to use VHDL-2008 instead of the default VHDL-92/93, as std.env is a VHDL-2008 feature.
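For reference, a minimal sketch of such a finish tester (the signal names gpio_o and clk_gen are illustrative assumptions, not necessarily the actual testbench signals):

  -- end-of-test detector (sketch): the firmware asserts GPIO output bit 32 when done
  sim_finish: process(clk_gen)
  begin
    if rising_edge(clk_gen) then
      if (gpio_o(32) = '1') then
        report "Test finished." severity note;
        std.env.finish; -- VHDL-2008 only: stop the simulation and exit GHDL cleanly
      end if;
    end if;
  end process;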

Beyond the above modifications to the GHDL calling scripts we would also need to add some parameters we pass into GHDL for controlling features. Not sure how to achieve that with GHDL as well. Having e.g. a "turn off all the stuff" parameter allows us to set enabled core extensions and such in the simple testbench.

@stnolting
Owner

Beyond the above modifications to the GHDL calling scripts we would also need to add some parameters we pass into GHDL for controlling features. Not sure how to achieve that with GHDL as well. Having e.g. a "turn off all the stuff" parameter allows us to set enabled core extensions and such in the simple testbench.

I really like that because that would be a very clean solution. We already do something like this in the RISCOF testbench:

entity neorv32_riscof_tb is
  generic (
    MEM_FILE : string;            -- memory initialization file
    MEM_SIZE : natural := 8*1024; -- total memory size in bytes
    RISCV_E  : boolean := false   -- embedded ISA extension
  );
end neorv32_riscof_tb;

https://github.com/stnolting/neorv32-riscof/blob/88f135e0f9bba7c57a1652460328f802168aa6a3/sim/neorv32_riscof_tb.vhd#L62-L68

There we have all the relevant configuration options exposed as generics (including default values) and we simply override them when calling GHDL, e.g. -gRISCV_E=true.

I would love to have something like this for the default testbench (wink 😉) but that would be a lot of coding effort... 🙈

Oh and I added a new feature to the testbench, which is used by the CPI testing system. If you look at the bottom I added a "finish" tester that looks at GPIO(32) assertion and when that happens it calls std.env.finish, which stops the test from running and cleanly exits GHDL.

This is a nice feature we are also using for the RISCOF testbench. Unfortunately, this is a VHDL-2008 feature and some users might have issues with this (e.g. #847) so I would suggest keeping this for the VUnit testbench only.

Anyway, I'm still not fully convinced that we should add another testbench here. The maintenance overhead is quite heavy already. 🙈 Can't we find a practical compromise? 🤔

@mikaelsky
Collaborator Author

I'll take a look at it; it should be okay to add a few extra generics that I can pass in via GHDL_PARAMS. For GHDL, I assume we are OK with me adding the VHDL-2008 option to the default script?

The one piece of good news is that Vivado xsim does indeed support some VHDL-2008 features; in particular it supports what I added, which is finish, assuming folks are on at least 2020.2, which in the year 2024 we can somewhat assume, or tell them to upgrade to at least 2020.2.
One note for Vivado/xsim: the latest 2023.2 version automatically detects the need for 2008; for earlier versions you might have to push a button / add -2008 on read-in.

https://docs.xilinx.com/r/2020.2-English/ug900-vivado-logic-simulation/Supported-Features


mikaelsky and others added 2 commits March 17, 2024 12:08
…e. Added generics to the simple testcase to be able to control performance features. Updated the GHDL scripts to accept more than 1 command line parameter. Updated the performance test makefiles to call the simple test bench with appropriate generics and GHDL settings
@mikaelsky
Collaborator Author

mikaelsky commented Mar 17, 2024

I've updated the performance test to use the simple test bench without VHDL-2008 features. I use a tuned timeout instead by setting the timeout value from the makefile. It's less flexible but works.
I had to update the ghdl scripts to pass down more than one command line parameter to be able to push down generics from the makefile: $1 just passed the first parameter down, whereas $@ passes $1 through $N down instead.
The simple tb now has a few generics that set various settings related to measuring CPI. The default values are what the existing values were.
I cleaned common.mk back to what it was. I removed the benchmark testbench folder.

Should be a lot easier to maintain now :)

@stnolting
Owner

Hey @mikaelsky

I'll take a look at it; it should be okay to add a few extra generics that I can pass in via GHDL_PARAMS. For GHDL, I assume we are OK with me adding the VHDL-2008 option to the default script?

The one piece of good news is that Vivado xsim does indeed support some VHDL-2008 features; in particular it supports what I added, which is finish, assuming folks are on at least 2020.2, which in the year 2024 we can somewhat assume, or tell them to upgrade to at least 2020.2.

Well, you are right... The truth is that many people are still using older toolchains (like Xilinx ISE). 🙈 So I would suggest keeping VHDL-2008 out of the default simulation setup - at least for the "simple" testbench. The VUnit testbench requires VHDL-2008 anyway.

I've updated the performance test to use the simple test bench without VHDL-2008 features.

Oh, that's nice! I'll have a look at that!

I had to update the ghdl scripts to pass down more than

That should be just fine I think.

The simple tb now has a few generics that set various settings related to measuring CPI. The default values are what the existing values were.

Maybe we could add a matrix-style set of default configurations and use a single "ID" generic to select one specific configuration - just like the LiteX wrapper:

-- core complex configurations --
constant configs_c : configs_t := (
  --               minimal  lite   standard  full
  riscv_c      => ( false,  true,  true,     true  ), -- RISC-V compressed instructions 'C'
  riscv_m      => ( false,  true,  true,     true  ), -- RISC-V hardware mul/div 'M'
  riscv_u      => ( false,  false, true,     true  ), -- RISC-V user mode 'U'
  riscv_zicntr => ( false,  false, true,     true  ), -- RISC-V standard CPU counters 'Zicntr'
  riscv_zihpm  => ( false,  false, false,    true  ), -- RISC-V hardware performance monitors 'Zihpm'
  fast_ops     => ( false,  false, true,     true  ), -- use DSPs and barrel-shifters
  pmp_num      => ( 0,      0,     0,        8     ), -- number of PMP regions (0..16)
  hpm_num      => ( 0,      0,     0,        8     ), -- number of HPM counters (0..29)
  xcache_en    => ( false,  false, true,     true  ), -- external bus cache enabled
  xcache_nb    => ( 0,      0,     32,       64    ), -- number of cache blocks (lines), power of two
  xcache_bs    => ( 0,      0,     32,       32    ), -- size of cache block (line) in bytes, power of two
  mtime        => ( false,  true,  true,     true  )  -- RISC-V machine system timer
);
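For illustration, a sketch of how a single "ID" generic or constant could then select one column of such a matrix (the types and names below are made up for the example, not the actual LiteX wrapper code):

  -- illustrative only: one record field per option, one array column per configuration
  type cfg_row_t is array (0 to 3) of boolean; -- 0=minimal, 1=lite, 2=standard, 3=full
  type cfg_demo_t is record
    riscv_c : cfg_row_t;
    riscv_m : cfg_row_t;
  end record;

  constant cfg_demo_c : cfg_demo_t := (
    --          minimal  lite  standard  full
    riscv_c => ( false,  true, true,     true ),
    riscv_m => ( false,  true, true,     true )
  );

  -- a single selector (e.g. exposed as a generic) picks one column for all options:
  constant config_id_c : natural range 0 to 3 := 2; -- select the "standard" configuration
  constant en_c_ext_c  : boolean := cfg_demo_c.riscv_c(config_id_c);
  constant en_m_ext_c  : boolean := cfg_demo_c.riscv_m(config_id_c);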

🤔

Thanks again for all your work! :)

@stnolting
Owner

Btw, please revert neorv32_application_image.vhd to its original state.
It also happens to me all the time that I check in my local version... 🙈 😅

@mikaelsky
Collaborator Author

After a bit of a typo battle I've changed the performance options to be matrix style. Different from how I normally do this, but hey, it's a style thing :)
Should have undone the default application image 🤞

@stnolting
Owner

stnolting commented Mar 17, 2024

After a bit of a typo battle I've changed the performance options to be matrix style.

Looks really great!

Different from how I normally do this, but hey, it's a style thing :)

How would you have done this?


I think we have a simulation time-out issue right now because of the reworked cache system. Updating the cache sizes should fix this:

Upstream cache configuration: ✔️


Your current cache configuration: ❌


@mikaelsky
Collaborator Author

mikaelsky commented Mar 17, 2024

Got it :) Note to self: when GitHub is faster than your home server, it's time to upgrade :)
It's different styles. I typically spell it out with individual options. This allows DV engineers to control everything themselves versus me having to go in and write/update testbenches for them :) It also makes it easier to do a boatload of sweeps.

As an example, our internal neorv32 regression suite runs ~520 randomized test runs every night that test the core in-design and the core by itself. The main "test" areas are riscv_dv-driven randomized instruction suites for the extensions we use versus an ISA model, the riscv-arch compliance suite versus an ISA model, and randomized IRQ, reset, etc. hits.
By exposing key parameters to my DV team, they were off to the races and began hammering the core in design :)

This is part of what we call "maturing" the core by pushing it into all the little corners that aren't normally tested. I'm working on what can be shared, like coverage numbers and such, as that will help provide confidence in the quality of the core. At least for the features we've tested.

@stnolting
Owner

It's different styles. I typically spell it out with individual options.

Right, that seems to be the cleaner approach. Maybe we can implement that somewhere in the future 😉

This is part of what we call "maturing" the core by pushing it into all the little corners that aren't normally tested. I'm working on what can be shared, like coverage numbers and such, as that will help provide confidence in the quality of the core. At least for the features we've tested.

I'm curious. Do you only use the CPU core or do you also use parts from the SoC (interconnect, caches, peripherals)?

Btw, I think there is still a cache configuration issue that causes a simulation timeout. I'll try to fix that so we can finally merge this. :)

to avoid simulation timeout
@stnolting
Owner

Looks ready to me 👍

@mikaelsky
Collaborator Author

I'll need to double-check what I can disclose of features; beyond that, yes, we indeed use Zfinx.

As we are a Verilog house I've had to limit the exposure to VHDL, so it's really just the core (neorv32_cpu.vhd) and down. We are also using the 2 OCD modules with a few tweaks to fit into the Verilog interconnect. The amount of pushback I get from the core being VHDL.. let's just say, a lot. It's definitely an adoption challenge.

Everything outside neorv32_cpu is homegrown Verilog. There are a few modules that I converted (mtime/wdt) to Verilog and fixed some feature choices. We are using the peripheral bus definition though, so we got that going for us :)
Not sure whether those feature tunings are worth talking about. Highlights are not using an IRQ for the WDT, as that breaks the concept of a WDT, and adding support for SoC lowest-power states, which impacts how the WDT and mtime behave during power state transitions.

@stnolting
Owner

The amount of pushback I get from the core being VHDL.. let's just say, a lot. It's definitely an adoption challenge.

I know these kinds of problems only too well... 🙈 😅

Highlights are not using an IRQ for the WDT, as that breaks the concept of a WDT

Hmm interesting point. I thought that having a "WDT timeout alarm IRQ" was quite common?! Do you think this is something that should not be implemented?

Apart from that... ready to merge? ;)

@mikaelsky
Collaborator Author

Just did a sync of the branch to get the PR up to date with HOT. It was blocking merging. So it will need to complete a new check run. Sorry.

"Hmm interesting point. I thought that having a "WDT timeout alarm IRQ" was quite common?! Do you think this is something that should not be implemented?"

If we look at the purpose of a watchdog timer, it's to "stop a core run wild". So if the core doesn't ping the watchdog before the timer runs out it will stop the core (either by reset or forced clock gating, depending on the system in which the core exists).

When you use an IRQ to trigger the watchdog update, it is quite likely that a "core run wild" will still correctly respond to the ISR and keep pinging the WDT even though it's currently executing code out of a random location.

So for WDTs you normally don't use IRQs to trigger the WDT service event, as the ISR is likely to remain functional even with the firmware itself being dysfunctional. Instead the firmware has to proactively "ping" the watchdog timer at certain intervals. Imagine you have an application running that takes X time. Before jumping into the application you ping the WDT; after you come back from the application you also ping the WDT. If your application breaks and never returns, the WDT will trigger.

The other change was to allow the firmware to write to the countdown timer. This is more a feature specific to SoC firmware applications, as depending on the current state of the firmware you might want to time out every second, but for a lower power state where e.g. you are supposed to wake up every 10s you would want to reconfigure the WDT to time out after >10s.
So this way the firmware will "ping" the watchdog by writing a new countdown value into the watchdog timer register versus writing to a "reset timer" register. A countdown is used as that allows firmware to calculate the timeout as WDT(clock) * timer_value.
This is quite conventional btw. not claiming any inventions here :)
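A minimal sketch of such a firmware-writable down-counting WDT (illustrative only, all names are made up; this is not the actual implementation):

  library ieee;
  use ieee.std_logic_1164.all;
  use ieee.numeric_std.all;

  entity wdt_countdown_sketch is
    port (
      clk_i   : in  std_logic;
      rstn_i  : in  std_logic;
      wren_i  : in  std_logic;                     -- register write = firmware "ping"
      wdata_i : in  std_logic_vector(31 downto 0); -- new countdown value
      alarm_o : out std_logic                      -- timeout action: reset or clock-gate the core
    );
  end entity;

  architecture rtl of wdt_countdown_sketch is
    signal cnt : unsigned(31 downto 0);
  begin
    process(clk_i, rstn_i)
    begin
      if (rstn_i = '0') then
        cnt     <= (others => '1'); -- arm with maximum timeout after reset (assumption)
        alarm_o <= '0';
      elsif rising_edge(clk_i) then
        if (wren_i = '1') then      -- ping: load a new countdown value
          cnt     <= unsigned(wdata_i);
          alarm_o <= '0';
        elsif (cnt /= 0) then       -- timeout = WDT clock period * loaded value
          cnt <= cnt - 1;
        else                        -- firmware never came back: fire the watchdog
          alarm_o <= '1';
        end if;
      end if;
    end process;
  end architecture;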

@stnolting
Owner

Just did a sync of the branch to get the PR up to date with HOT. It was blocking merging. So it will need to complete a new check run. Sorry.

👍

So for WDTs you normally don't use IRQs to trigger the WDT service event, as the ISR is likely to remain functional even with the firmware itself being dysfunctional.

Good point. The intention was to provide some kind of warning before a hard reset happens so the software can take actions like storing relevant data to nonvolatile memory. But I see your point here. Maybe we should remove that interrupt feature from the WDT? 🤔

(That would be nice because then we could use the WDT's FIRQ for another peripheral device 😉)

The other change was to allow the firmware to write to the countdown timer.

You can update the timeout register already. Couldn't that be used to implement the same feature?

Anyway, we should continue this discussion in a separate issue 😉

@stnolting
Owner

If there are no objections, I would like to merge this. 😉

@stnolting stnolting merged commit 3d7012b into stnolting:main Mar 22, 2024
7 checks passed