Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Possible issue with FIRQ pending interrupt clearing #818

Closed
robhancocksed opened this issue Feb 20, 2024 · 14 comments · Fixed by #821 or #829
Closed

Possible issue with FIRQ pending interrupt clearing #818

robhancocksed opened this issue Feb 20, 2024 · 14 comments · Fixed by #821 or #829
Assignees
Labels
bug Something isn't working HW hardware-related

Comments

@robhancocksed
Copy link
Contributor

Describe the bug
I am working on some changes to add support for the NEORV32 XIRQ interrupt controller to Zephyr, and have a setup that is mostly working. (I am planning to submit it upstream once it's debugged.) However, I'm running into an issue where XIRQ interrupts seem to stop being processed, and have narrowed it down what I think is an issue with how the CSRC instruction is being handled. I added some debug probes into the synthesized design in Vivado to try to see what is going on, and I suspect the CPU is not handling the FIRQ pending interrupt clearing properly.

The PC value shown corresponds to the following instructions which are part of the XIRQ interrupt handler:

35774: 010007b7 lui a5,0x1000
35778: 3447b073 csrc mip,a5

which should be clearing bit 24 in the MIP register, corresponding to FIRQ 8 for XIRQ, as far as I can tell. However, the logic capture seems to be showing that this results in ALL pending FIRQs being cleared out. So I suspect what is happening is that when another FIRQ handler runs, such as for the UART, it can sometimes end up clearing an XIRQ interrupt before it has a chance to be processed, resulting in an XIRQ interrupt hang.

image

In the Vivado ILA capture above, the neorv32_SystemTop_ax_0/U0/neorv32_top_inst/core_complex.neorv32_cpu_inst/neorv32_cpu_control_inst/csr[mip_firq_nclr][8]_i_1_n_0 signal corresponds (I think) to the state of the mip_firq_nclr value for FIRQ 8 which this instruction should be clearing, and it does, but you can see that mip_firq_nclr is also being pulsed low for the other used FIRQs (0, 2, 3, 12, 15) as well, and presumably shouldn't be.

Environment:

  • Zephyr RTOS (main branch with extra patches)
  • Zephyr SDK 0.16.5 with RISC-V GCC 12.2.0

Hardware:

  • NEORV32 version: Git commit beab81e from Feb. 16
  • Using top-level AXI Lite instantiation
  • Extensions A, B, C, M, Zicntr enabled, fast multiply, fast shift
  • GPTMR, MTIME, TRNG, UART0, WDT, on-chip debug enabled
  • XIRQ enabled with 8 channels
  • Building with AMD Vivado 2023.2 for a Kintex-7 FPGA
@robhancocksed
Copy link
Contributor Author

I suspect this code, in neorv32_cpu_control.vhd inside csr_write_data, may not be doing what was intended:


    -- tiny ALU to compute CSR write data --
    case execute_engine.ir(instr_funct3_msb_c-1 downto instr_funct3_lsb_c) is
      when "10"   => csr.wdata <= csr.rdata or tmp_v; -- set
      when "11"   => csr.wdata <= csr.rdata and (not tmp_v); -- clear
      when others => csr.wdata <= tmp_v; -- write
    end case;

It seems csr.rdata is set to 0 on cycles when no CSR read operation is in progress, so it's possible that it does not have the right data when this is executing. The symptoms would match a value of 0 being used rather than the actual MIP register content, causing the entire register to be zeroed.

@robhancocksed
Copy link
Contributor Author

It looks like doing a csr_write for the MIP register with the bit for the FIRQ being cleared as 0 and all other bits 1, rather than using csr_clear, can work around the problem in software, since writing 1 to bits in MIP has no effect. With this change I am only seeing the mip_firq_nclr bit for the desired interrupt going to 0 rather than all of them.

@stnolting
Copy link
Owner

Hey @robhancocksed

I am working on some changes to add support for the NEORV32 XIRQ interrupt controller to Zephyr, and have a setup that is mostly working. (I am planning to submit it upstream once it's debugged.)

That sounds great - thank you very much!

which should be clearing bit 24 in the MIP register, corresponding to FIRQ 8 for XIRQ, as far as I can tell. However, the logic capture seems to be showing that this results in ALL pending FIRQs being cleared out.

I will have a closer look at this, but I think you might have found a severe bug here 🙈

@stnolting
Copy link
Owner

This is really a mean bug... Thanks for identifying this!

@stnolting stnolting self-assigned this Feb 20, 2024
@stnolting stnolting added bug Something isn't working HW hardware-related labels Feb 20, 2024
@stnolting
Copy link
Owner

This issue should be fixed now.

Thanks again for finding this bug!

@robhancocksed
Copy link
Contributor Author

@stnolting It looks like the problem is not solved unfortunately - if I remove my previous workaround to use csr_write rather than csr_clear, and update the core, the same XIRQ hang eventually occurred, whereas with the workaround the system survived a test run for a significant period of time. And the change would actually break the workaround I was using as well, since writing 1 bits into MIP would now actually trigger those interrupts..

@stnolting stnolting reopened this Feb 22, 2024
@stnolting
Copy link
Owner

stnolting commented Feb 22, 2024

I've just tested the read/write/clear/set capabilities of mip's FIRQ bits and that seems to work fine.

Maybe this is just a software issue?

CSR write

neorv32_cpu_csr_write(CSR_MIP, data):

mip <= data;

This will override all writable bits of mip with data potentially clearing all pending FIRQs (if the according bits of data are zero).

CSR clear

neorv32_cpu_csr_clr(CSR_MIP, data):

mip <= mip & (~data);

This will clear bit i of mip if the according bit of data is 1. This can be used to clear specific pending FIRQs.

CSR set

neorv32_cpu_csr_set(CSR_MIP, data):

mip <= mip | data;

This will set bit i of mip if the according bit of data is 1.

⚠️ Non-atomic CSR access

uint32_t tmp;
tmp = neorv32_cpu_csr_read(CSR_MIP);
tmp &= ~(1 << some_firq_bit); // clear pending FIRQ
neorv32_cpu_csr_write(CSR_MIP, tmp);

Performing such a "non-atomic" access here is no good idea as mip might be updated with new pending IRQs "in background" (by the hardware). So the final write access here (neorv32_cpu_csr_write) could potentially override these background changes.

@robhancocksed
Copy link
Contributor Author

The code was using the csr_clear function in Zephyr which generates the assembly:

35774: 010007b7 lui a5,0x1000
35778: 3447b073 csrc mip,a5

which should be correct in terms of using an atomic clear instruction. It seems like the problem might be fairly timing dependent however, so a basic sequential test might not pick up whatever is happening?

@stnolting
Copy link
Owner

35778: 3447b073 csrc mip,a5

This seems fine as this is an atomic CSR access.

Even if the according FIRQ source is firing right in that moment where mip is updated by software the FIRQ request should still be registered:

if (csr.mip_firq_we = '1') then -- write access to MIP(.FIRQ) CSR
trap_ctrl.irq_pnd(irq_firq_0_c+i) <= firq_i(i) or csr.mip_firq_wdata(i); -- keep buffering incoming FIRQs
else

What is the actual "symptom" of your use case? Does the CPU miss an incoming FIRQ trigger?

@stnolting
Copy link
Owner

I made some more tests and now I can confirm that a FIRQ trigger gets lost when it arrives exactly 1 or 2 cycles right before doing any mip write/set/clear access. This is because a CSR clear/set instruction has 2 cycles latency (read CSR -> modify the data -> write back CSR).

I'll try to come up with another (better) fix 😅

I'll also check all further CSRw that have implicit updates (by the hardware).

@robhancocksed
Copy link
Contributor Author

@stnolting Basically yes, it appears that an XIRQ interrupt gets raised very close to the moment that a CSR clear to clear a different interrupt is happening, and that write incorrectly clears the XIRQ interrupt as well. That means the XIRQ ISR never runs, and so because the first interrupt source is never cleared in XIRQ, it never raises any more interrupts. I think the latest issue you discovered likely explains what is going on there.

@stnolting
Copy link
Owner

stnolting commented Feb 23, 2024

This was harder to fix then I have expected... 🙈 But I think mip works now as expected. Btw, mip is now clear-only again - so writing ones has no effect.

-> #829

@robhancocksed
Copy link
Contributor Author

My stress test ran over the weekend with no errors without the software workaround in place, so I think it's safe to say this is now resolved.

@stnolting
Copy link
Owner

Thanks for the feedback! I'm very glad to hear that 😅

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working HW hardware-related
Projects
None yet
2 participants