Variation in counter values #773

mahdi259 · 2024-01-24T15:39:17Z

Describe the bug
I use neorv32 to determine cycle counts that my custom instruction takes in a loop. My intention is to show the benefits that it apply to a certain application instead of using the conventional method.

The problem is that when I add some neorv32_uart0_printf before my experiment block, the total clock cycles changes. It looks storage because start_time() should work independently after printing.

Expected behavior
The total cycle count of a piece of code be constant, regardless of other codes of the program.

Screenshots
If applicable, add screenshots to help explain your problem.

Hardware:
Extensions: M, C, zicntr, zihpm, zxcfu.

As pictures show, removing one printf increases measured clock cycles.

// neorv32_uart0_printf("----------------\n");
neorv32_uart0_printf("---- First method:\n");
// ------------------------
res = 0;
iterations = 40;
start_time();
for (i=0; i<iterations; i++) {
res += neorv32_cfu_r3_instr(0b0000000, 0b000, rs1, rs2);
}
stop_time();
t0 = get_time();
neorv32_uart0_printf("Iterations: %d, Cycles: %d\n", iterations, t0);
// ------------------------

NikLeberg · 2024-01-24T17:38:50Z

Hi @mahdi259 interesting problem!
Have you enabled the cache(s)? Please share your full generics settings of your neorv32 VHDL instance.
I learned the hard way that re shuffling the code order (or commenting something out) can have seemingly very strange impact on code execution when a cache is involved.
Otherwise I'd recommend you to compare the generated assembly from GCC that is output with make asm for both versions.
I hope that helps,
Nik

mahdi259 · 2024-01-24T17:50:00Z

Hi @NikLeberg,
I don't use caches in my design. The enabled memories are 20KB IMEM and 8KB DMEM.
The top module is attached to my reply. I will try make asm. Thanks
topmodule.zip

mahdi259 · 2024-01-24T18:05:51Z

I checked make asm and there is no difference.
The benchmark with extra printf has an additional code block for it(before start_time()).

There is one asm code that calls start_time.
With extra printf:
1f6:576000ef jal ra,76c <start_time>
Without extra printf:
204:574000ef jal ra,778 <start_time>

stnolting · 2024-01-24T22:10:02Z

Hey @mahdi259

Which counters are you using to benchmark your code? The standard CPU counters cycle and instret, the hardware performance monitors (HPMs) or the machine timer (MTIME)? How do you setup these counters?

Could show the code for your start_time(), stop_time() and get_time() functions?

mahdi259 · 2024-01-25T02:52:10Z

Hi
I use functions of core_portme.c from coremark directory. I think they use hardware performance monitor counters.


void start_time(void) {
    GETMYTIME(&start_time_val);
    neorv32_cpu_csr_write(CSR_MCOUNTINHIBIT, 0); // start all counters
}

void stop_time(void) {
    neorv32_cpu_csr_write(CSR_MCOUNTINHIBIT, -1); // stop all counters
    GETMYTIME(&stop_time_val);
}

CORE_TICKS get_time(void) {
    CORE_TICKS elapsed
        = (CORE_TICKS)(MYTIMEDIFF(stop_time_val, start_time_val));
    return elapsed;
}

By the way, does get_time() return milliseconds?

stnolting · 2024-01-25T20:34:55Z

I use functions of core_portme.c from coremark directory.

Ah I see. These functions use the CPU-internal cycle counter cycle.

Do you need the cycle counter for anything else than benchmarking? E.g. for some delays?

I would recommend to use a HPM here so there is no risk that this specific counter might be overridden by some other software part. This is what I would do:

  // enable ALL counters (including HPMs)
  neorv32_cpu_csr_write(CSR_MCOUNTINHIBIT, 0);

  // configure HPM3 to count clock ticks
  neorv32_cpu_csr_write(CSR_MHPMEVENT3, 1 << HPMCNT_EVENT_CY);


  // clear HPM3 (low word only), could be put into a 'start_time' function
  neorv32_cpu_csr_write(CSR_MHPMCOUNTER3, 0):

  ...your application code you want to profile...

  // read current HPM3 value (low word only), could be put into a `get_time` function
  uint32_t elapsed_cycles = neorv32_cpu_csr_read(CSR_MHPMCOUNTER3);

If your application codes runs for a "long" time you might also need to use the high-word of HPM3.

mahdi259 · 2024-01-26T08:44:53Z

Unfortunately the problem persists. Just try the following code snippet by commenting/uncommenting the specified printf:

neorv32_cpu_csr_write(CSR_MCOUNTINHIBIT, -1);// Enable CPU counters

//  neorv32_uart0_printf("----------------\n"); // **Here**
neorv32_uart0_printf("---- First method:\n");


// ------------------------
neorv32_cpu_csr_write(CSR_MCOUNTINHIBIT, 0);// Enable CPU counters
neorv32_cpu_csr_write(CSR_MHPMEVENT3, 1 << HPMCNT_EVENT_CY); 
neorv32_cpu_csr_write(CSR_MHPMCOUNTER3, 0);
for (i=0; i<40; i++) {
asm("nop");
}
stop_time_val = neorv32_cpu_csr_read(CSR_MHPMCOUNTER3);
neorv32_uart0_printf("Iterations: %d, Cycles: %d\n", iterations, stop_time_val);

stnolting · 2024-01-26T09:27:40Z

I've tested both version. The one without the extra printf:

----ULPPACK 1-bit:
Iterations: 0, Cycles: 438

And the one with the extra printf:

----------------
----ULPPACK 1-bit:
Iterations: 0, Cycles: 438

In both cases the HPM reading is identical. I am using a simple rv32i_zicsr_zifencei CPU + HPMs but without any caches. Are you sure that caches are disabled in your setup? Is the file your have attached above the actual top entity?

You could check the caches with this piece of code:

  if (NEORV32_SYSINFO->SOC & (1 << SYSINFO_SOC_ICACHE)) { neorv32_uart0_printf("ICACHE enabled\n"); }
  else { neorv32_uart0_printf("ICACHE disabled\n"); }

  if (NEORV32_SYSINFO->SOC & (1 << SYSINFO_SOC_DCACHE)) { neorv32_uart0_printf("DCACHE enabled\n"); }
  else { neorv32_uart0_printf("DCACHE disabled\n"); }

mahdi259 · 2024-01-26T12:12:27Z

You are right. Seems that the variation occurs when compiling code with MARCH=rv32imc.
I'm sure that caches are not implemented. The top module is exactly what I sent you.

MARCH=rv32imc:

ICACHE disabled
DCACHE disabled
----ULPPACK 1-bit:
Iterations: 0, Cycles: 517

CFU demo program completed.

ICACHE disabled
DCACHE disabled
----------------
----ULPPACK 1-bit:
Iterations: 0, Cycles: 400

CFU demo program completed.

MARCH=rv32i:

ICACHE disabled
DCACHE disabled
----ULPPACK 1-bit:
Iterations: 0, Cycles: 438

CFU demo program completed.

ICACHE disabled
DCACHE disabled
----------------
----ULPPACK 1-bit:
Iterations: 0, Cycles: 438

CFU demo program completed.

stnolting · 2024-01-26T20:16:37Z

I have re-run my simulations also using a rv32i and a rv32ic compiler configuration. And I can reproduce your findings.

The generated assembly code is identical - at least the loop that is used for profiling. The general outcome of the code is also correct. So there is no "actual" bug in the core (phew 😅).

However, this is a really interesting scenario. I've looked through some waveforms and finally found the reason for this strange behavior:

The CPU's font-end keeps fetching new instructions independently of the actual instruction execution. As soon as a branch instruction is executed (like at the end of your loop) the front-end is reset to start fetching instruction from the branch destination.

The code version running for 517 the instruction fetch was quite fast and the entire instruction prefetch buffer is full. The front-ent is waiting do for some free space in that buffer before it can continue fetching. At this point the end-of-loop branch kicks in. Unfortunately, the front-end is not reset (yet) because it first waits for free space in the prefetch buffer.

Long story short, this is the problem:

neorv32/rtl/core/neorv32_cpu_control.vhd

Lines 379 to 385 in 5281589

    
           when IF_REQUEST => -- request next 32-bit-aligned instruction word 
        
           -- ------------------------------------------------------------ 
        
             if (ipb.free = "11") then -- wait for free IPB space 
        
               fetch_engine.state <= IF_PENDING; 
        
             elsif (execute_engine.state = SLEEP) then -- halt request (sleep)? 
        
               fetch_engine.state <= IF_PARKED; 
        
             end if;

In this state the front-end reset should also be evaluated to allow a faster restart. So it should be something like this:

 when IF_REQUEST => -- request next 32-bit-aligned instruction word 
 -- ------------------------------------------------------------ 
   if (ipb.free = "11") then -- wait for free IPB space 
     fetch_engine.state <= IF_PENDING;
   elsif (fetch_engine.restart = '1') or (fetch_engine.reset = '1') then -- restart request due to branch
      fetch_engine.state <= IF_RESTART;
   elsif (execute_engine.state = SLEEP) then -- halt request (sleep)? 
     fetch_engine.state <= IF_PARKED; 
   end if;

I will do some more tests and if everything works fine I'll open a PR to update the front-end logic.

Thanks for finding this ~~bug~~ bottleneck! 😉

mahdi259 · 2024-01-27T03:11:03Z

Thanks

stnolting · 2024-01-28T18:26:41Z

The variations issue should be fixed now - at least when no caches are implemented 😉

mahdi259 · 2024-01-28T18:48:24Z

Yes. There is no variation in the discussed code now. Thanks for your effort.
I also learned from your replies that the C extension potentially makes variations in the clock cycles so I disabled it to get more stable results. Thanks

mahdi259 · 2024-01-28T18:52:42Z

Actually I got some further variations in my benchmark experimenting CFU custom instructions. Finally I removed C extension in compilation and got stable results. Thanks again for your attention. Best regards for you...

stnolting · 2024-01-28T19:11:01Z

The C extension can cause minor variations as it also supports unaligned 32-bit instructions words (that are 16-bit aligned) requiring an additional bus access to fetch the "second half of the instruction".

In large programs this variation should not be noticeable. However, you could try to increase the size of the instruction prefetch buffer:

neorv32/rtl/core/neorv32_package.vhd

Lines 51 to 52 in c95e039

    
           -- instruction prefetch buffer depth -- 
        
           constant ipb_depth_c : natural := 2; -- hast to be a power of two, min 2, default 2

This might reduce this kind of variations. Furthermore, large blocks of linear code can benefit from this is terms of increased execution speed.

mahdi259 · 2024-01-29T03:40:47Z

Thanks for your detailed description.

stnolting added the troubleshooting Something is not working as expected label Jan 24, 2024

stnolting added enhancement New feature or request HW hardware-related and removed troubleshooting Something is not working as expected labels Jan 26, 2024

stnolting mentioned this issue Jan 27, 2024

[rtl] improve CPU front end #777

Merged

stnolting closed this as completed in #777 Jan 27, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Variation in counter values #773

Variation in counter values #773

mahdi259 commented Jan 24, 2024 •

edited

Loading

NikLeberg commented Jan 24, 2024

mahdi259 commented Jan 24, 2024

mahdi259 commented Jan 24, 2024 •

edited

Loading

stnolting commented Jan 24, 2024

mahdi259 commented Jan 25, 2024 •

edited

Loading

stnolting commented Jan 25, 2024

mahdi259 commented Jan 26, 2024 •

edited

Loading

stnolting commented Jan 26, 2024

mahdi259 commented Jan 26, 2024 •

edited

Loading

stnolting commented Jan 26, 2024

mahdi259 commented Jan 27, 2024

stnolting commented Jan 28, 2024

mahdi259 commented Jan 28, 2024

mahdi259 commented Jan 28, 2024

stnolting commented Jan 28, 2024

mahdi259 commented Jan 29, 2024

Variation in counter values #773

Variation in counter values #773

Comments

mahdi259 commented Jan 24, 2024 • edited Loading

NikLeberg commented Jan 24, 2024

mahdi259 commented Jan 24, 2024

mahdi259 commented Jan 24, 2024 • edited Loading

stnolting commented Jan 24, 2024

mahdi259 commented Jan 25, 2024 • edited Loading

stnolting commented Jan 25, 2024

mahdi259 commented Jan 26, 2024 • edited Loading

stnolting commented Jan 26, 2024

mahdi259 commented Jan 26, 2024 • edited Loading

stnolting commented Jan 26, 2024

mahdi259 commented Jan 27, 2024

stnolting commented Jan 28, 2024

mahdi259 commented Jan 28, 2024

mahdi259 commented Jan 28, 2024

stnolting commented Jan 28, 2024

mahdi259 commented Jan 29, 2024

mahdi259 commented Jan 24, 2024 •

edited

Loading

mahdi259 commented Jan 24, 2024 •

edited

Loading

mahdi259 commented Jan 25, 2024 •

edited

Loading

mahdi259 commented Jan 26, 2024 •

edited

Loading

mahdi259 commented Jan 26, 2024 •

edited

Loading