Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Variation in counter values #773

Closed
mahdi259 opened this issue Jan 24, 2024 · 16 comments · Fixed by #777
Closed

Variation in counter values #773

mahdi259 opened this issue Jan 24, 2024 · 16 comments · Fixed by #777
Labels
enhancement New feature or request HW hardware-related

Comments

@mahdi259
Copy link

mahdi259 commented Jan 24, 2024

Describe the bug
I use neorv32 to determine cycle counts that my custom instruction takes in a loop. My intention is to show the benefits that it apply to a certain application instead of using the conventional method.

The problem is that when I add some neorv32_uart0_printf before my experiment block, the total clock cycles changes. It looks storage because start_time() should work independently after printing.

Expected behavior
The total cycle count of a piece of code be constant, regardless of other codes of the program.

Screenshots
If applicable, add screenshots to help explain your problem.

Hardware:
Extensions: M, C, zicntr, zihpm, zxcfu.

As pictures show, removing one printf increases measured clock cycles.

// neorv32_uart0_printf("----------------\n");
neorv32_uart0_printf("---- First method:\n");
// ------------------------
res = 0;
iterations = 40;
start_time();
for (i=0; i<iterations; i++) {
res += neorv32_cfu_r3_instr(0b0000000, 0b000, rs1, rs2);
}
stop_time();
t0 = get_time();
neorv32_uart0_printf("Iterations: %d, Cycles: %d\n", iterations, t0);
// ------------------------

@NikLeberg
Copy link
Collaborator

Hi @mahdi259 interesting problem!
Have you enabled the cache(s)? Please share your full generics settings of your neorv32 VHDL instance.
I learned the hard way that re shuffling the code order (or commenting something out) can have seemingly very strange impact on code execution when a cache is involved.
Otherwise I'd recommend you to compare the generated assembly from GCC that is output with make asm for both versions.
I hope that helps,
Nik

@mahdi259
Copy link
Author

Hi @NikLeberg,
I don't use caches in my design. The enabled memories are 20KB IMEM and 8KB DMEM.
The top module is attached to my reply. I will try make asm. Thanks
topmodule.zip

@mahdi259
Copy link
Author

mahdi259 commented Jan 24, 2024

I checked make asm and there is no difference.
The benchmark with extra printf has an additional code block for it(before start_time()).

There is one asm code that calls start_time.
With extra printf:
1f6:576000ef jal ra,76c <start_time>
Without extra printf:
204:574000ef jal ra,778 <start_time>

@stnolting
Copy link
Owner

Hey @mahdi259

Which counters are you using to benchmark your code? The standard CPU counters cycle and instret, the hardware performance monitors (HPMs) or the machine timer (MTIME)? How do you setup these counters?

Could show the code for your start_time(), stop_time() and get_time() functions?

@stnolting stnolting added the troubleshooting Something is not working as expected label Jan 24, 2024
@mahdi259
Copy link
Author

mahdi259 commented Jan 25, 2024

Hi
I use functions of core_portme.c from coremark directory. I think they use hardware performance monitor counters.


void start_time(void) {
    GETMYTIME(&start_time_val);
    neorv32_cpu_csr_write(CSR_MCOUNTINHIBIT, 0); // start all counters
}

void stop_time(void) {
    neorv32_cpu_csr_write(CSR_MCOUNTINHIBIT, -1); // stop all counters
    GETMYTIME(&stop_time_val);
}

CORE_TICKS get_time(void) {
    CORE_TICKS elapsed
        = (CORE_TICKS)(MYTIMEDIFF(stop_time_val, start_time_val));
    return elapsed;
}

By the way, does get_time() return milliseconds?

@stnolting
Copy link
Owner

I use functions of core_portme.c from coremark directory.

Ah I see. These functions use the CPU-internal cycle counter cycle.

Do you need the cycle counter for anything else than benchmarking? E.g. for some delays?

I would recommend to use a HPM here so there is no risk that this specific counter might be overridden by some other software part. This is what I would do:

  // enable ALL counters (including HPMs)
  neorv32_cpu_csr_write(CSR_MCOUNTINHIBIT, 0);

  // configure HPM3 to count clock ticks
  neorv32_cpu_csr_write(CSR_MHPMEVENT3, 1 << HPMCNT_EVENT_CY);


  // clear HPM3 (low word only), could be put into a 'start_time' function
  neorv32_cpu_csr_write(CSR_MHPMCOUNTER3, 0):

  ...your application code you want to profile...

  // read current HPM3 value (low word only), could be put into a `get_time` function
  uint32_t elapsed_cycles = neorv32_cpu_csr_read(CSR_MHPMCOUNTER3); 

If your application codes runs for a "long" time you might also need to use the high-word of HPM3.

@mahdi259
Copy link
Author

mahdi259 commented Jan 26, 2024

Unfortunately the problem persists. Just try the following code snippet by commenting/uncommenting the specified printf:

neorv32_cpu_csr_write(CSR_MCOUNTINHIBIT, -1);// Enable CPU counters

//  neorv32_uart0_printf("----------------\n"); // **Here**
neorv32_uart0_printf("---- First method:\n");


// ------------------------
neorv32_cpu_csr_write(CSR_MCOUNTINHIBIT, 0);// Enable CPU counters
neorv32_cpu_csr_write(CSR_MHPMEVENT3, 1 << HPMCNT_EVENT_CY); 
neorv32_cpu_csr_write(CSR_MHPMCOUNTER3, 0);
for (i=0; i<40; i++) {
asm("nop");
}
stop_time_val = neorv32_cpu_csr_read(CSR_MHPMCOUNTER3);
neorv32_uart0_printf("Iterations: %d, Cycles: %d\n", iterations, stop_time_val);

@stnolting
Copy link
Owner

I've tested both version. The one without the extra printf:

----ULPPACK 1-bit:
Iterations: 0, Cycles: 438

And the one with the extra printf:

----------------
----ULPPACK 1-bit:
Iterations: 0, Cycles: 438

In both cases the HPM reading is identical. I am using a simple rv32i_zicsr_zifencei CPU + HPMs but without any caches. Are you sure that caches are disabled in your setup? Is the file your have attached above the actual top entity?

You could check the caches with this piece of code:

  if (NEORV32_SYSINFO->SOC & (1 << SYSINFO_SOC_ICACHE)) { neorv32_uart0_printf("ICACHE enabled\n"); }
  else { neorv32_uart0_printf("ICACHE disabled\n"); }

  if (NEORV32_SYSINFO->SOC & (1 << SYSINFO_SOC_DCACHE)) { neorv32_uart0_printf("DCACHE enabled\n"); }
  else { neorv32_uart0_printf("DCACHE disabled\n"); }

@mahdi259
Copy link
Author

mahdi259 commented Jan 26, 2024

You are right. Seems that the variation occurs when compiling code with MARCH=rv32imc.
I'm sure that caches are not implemented. The top module is exactly what I sent you.

MARCH=rv32imc:

ICACHE disabled
DCACHE disabled
----ULPPACK 1-bit:
Iterations: 0, Cycles: 517

CFU demo program completed.
ICACHE disabled
DCACHE disabled
----------------
----ULPPACK 1-bit:
Iterations: 0, Cycles: 400

CFU demo program completed.

MARCH=rv32i:

ICACHE disabled
DCACHE disabled
----ULPPACK 1-bit:
Iterations: 0, Cycles: 438

CFU demo program completed.
ICACHE disabled
DCACHE disabled
----------------
----ULPPACK 1-bit:
Iterations: 0, Cycles: 438

CFU demo program completed.

@stnolting
Copy link
Owner

I have re-run my simulations also using a rv32i and a rv32ic compiler configuration. And I can reproduce your findings.

The generated assembly code is identical - at least the loop that is used for profiling. The general outcome of the code is also correct. So there is no "actual" bug in the core (phew 😅).

However, this is a really interesting scenario. I've looked through some waveforms and finally found the reason for this strange behavior:

The CPU's font-end keeps fetching new instructions independently of the actual instruction execution. As soon as a branch instruction is executed (like at the end of your loop) the front-end is reset to start fetching instruction from the branch destination.

The code version running for 517 the instruction fetch was quite fast and the entire instruction prefetch buffer is full. The front-ent is waiting do for some free space in that buffer before it can continue fetching. At this point the end-of-loop branch kicks in. Unfortunately, the front-end is not reset (yet) because it first waits for free space in the prefetch buffer.

Long story short, this is the problem:

when IF_REQUEST => -- request next 32-bit-aligned instruction word
-- ------------------------------------------------------------
if (ipb.free = "11") then -- wait for free IPB space
fetch_engine.state <= IF_PENDING;
elsif (execute_engine.state = SLEEP) then -- halt request (sleep)?
fetch_engine.state <= IF_PARKED;
end if;

In this state the front-end reset should also be evaluated to allow a faster restart. So it should be something like this:

 when IF_REQUEST => -- request next 32-bit-aligned instruction word 
 -- ------------------------------------------------------------ 
   if (ipb.free = "11") then -- wait for free IPB space 
     fetch_engine.state <= IF_PENDING;
   elsif (fetch_engine.restart = '1') or (fetch_engine.reset = '1') then -- restart request due to branch
      fetch_engine.state <= IF_RESTART;
   elsif (execute_engine.state = SLEEP) then -- halt request (sleep)? 
     fetch_engine.state <= IF_PARKED; 
   end if; 

I will do some more tests and if everything works fine I'll open a PR to update the front-end logic.

Thanks for finding this bug bottleneck! 😉

@stnolting stnolting added enhancement New feature or request HW hardware-related and removed troubleshooting Something is not working as expected labels Jan 26, 2024
@mahdi259
Copy link
Author

Thanks

@stnolting
Copy link
Owner

The variations issue should be fixed now - at least when no caches are implemented 😉

@mahdi259
Copy link
Author

Yes. There is no variation in the discussed code now. Thanks for your effort.
I also learned from your replies that the C extension potentially makes variations in the clock cycles so I disabled it to get more stable results. Thanks

@mahdi259
Copy link
Author

Actually I got some further variations in my benchmark experimenting CFU custom instructions. Finally I removed C extension in compilation and got stable results. Thanks again for your attention. Best regards for you...

@stnolting
Copy link
Owner

The C extension can cause minor variations as it also supports unaligned 32-bit instructions words (that are 16-bit aligned) requiring an additional bus access to fetch the "second half of the instruction".

In large programs this variation should not be noticeable. However, you could try to increase the size of the instruction prefetch buffer:

-- instruction prefetch buffer depth --
constant ipb_depth_c : natural := 2; -- hast to be a power of two, min 2, default 2

This might reduce this kind of variations. Furthermore, large blocks of linear code can benefit from this is terms of increased execution speed.

@mahdi259
Copy link
Author

Thanks for your detailed description.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request HW hardware-related
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants