
Unexpected external Interrupt/ XIRQ behavior in Indirect Boot from SPI Flash #893

Closed
ucycg opened this issue May 3, 2024 · 8 comments
Labels
troubleshooting Something is not working as expected


@ucycg
Contributor

ucycg commented May 3, 2024

The Problem

I don't know where to fit this information in the issue, so let me put it first, before the long description of my project and the related problem. During the problematic simulation described in detail below, the NeoRV32 RTE prints the following:

```
Store address misaligned @ PC=0x000007D6, MTINST=0x04C4A221, MTVAL=0x00000045
```
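The trap values reported by the RTE can be decoded by hand to see which store failed. The sketch below assumes MTINST follows the RISC-V privileged spec's transformed-instruction convention, in which a compressed instruction is reported in its 32-bit form with bit 1 cleared; the helper names are hypothetical. Under that assumption, 0x04C4A221 decodes to `sw a2, 68(s1)` (originally a compressed `c.sw`), and since MTVAL holds the faulting address 0x45, the base register s1 held 0x45 - 68 = 0x01 at the time of the trap. A base pointer with the value 1 would hint at a corrupted or uninitialized pointer rather than at the interrupt logic itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Fields of a RISC-V S-type (store) instruction. */
typedef struct {
    bool     was_compressed; /* bit 1 == 0 in a transformed MTINST (assumption) */
    uint32_t funct3;         /* 0 = SB, 1 = SH, 2 = SW */
    uint32_t rs1;            /* base address register */
    uint32_t rs2;            /* register holding the data to store */
    int32_t  imm;            /* sign-extended 12-bit byte offset */
} store_fields_t;

static store_fields_t decode_store(uint32_t inst)
{
    store_fields_t f;
    f.was_compressed = ((inst >> 1) & 1u) == 0u;
    inst |= 0x3u;                    /* restore opcode bits [1:0] = 0b11 */
    assert((inst & 0x7Fu) == 0x23u); /* must be the STORE major opcode */
    f.funct3 = (inst >> 12) & 0x7u;
    f.rs1    = (inst >> 15) & 0x1Fu;
    f.rs2    = (inst >> 20) & 0x1Fu;
    /* imm[11:5] = inst[31:25], imm[4:0] = inst[11:7] */
    uint32_t raw = ((inst >> 25) << 5) | ((inst >> 7) & 0x1Fu);
    f.imm = (raw & 0x800u) ? (int32_t)raw - 4096 : (int32_t)raw;
    return f;
}

/* MTVAL holds the faulting address, so the base register held MTVAL - imm. */
static uint32_t store_base(uint32_t mtval, const store_fields_t *f)
{
    return mtval - (uint32_t)f->imm;
}

/* A store of 2^funct3 bytes needs the low funct3 address bits to be zero. */
static bool store_misaligned(uint32_t addr, const store_fields_t *f)
{
    return (addr & ((1u << f->funct3) - 1u)) != 0u;
}
```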

To start, here is an overview of the problem. When I run a software simulation with a testbench in Direct Boot, it works as intended. However, when I run the same executable after loading it from an external flash memory via the bootloader in autoboot mode, it no longer works as intended. Somehow the same external interrupt signals that I use to trigger different behaviors of the internal software state machine are not detected correctly by the NeoRV32 in the Indirect Boot scenario.

So I have this bigger project with the NeoRV32 where I am trying to implement an ASIC with the processor. As part of that, I also created test software that emulates the later functionality of the chip in the project.

In the project, the NeoRV32 should collect 16 bytes of data via the SLINK interface and, depending on some other signals, either add a 4-byte EvtID received via SDI to that data package or not. In either case, the newly created data package is then sent on via SDI.

The test software covers three test cases/variations of that process. In the first, both the 16-byte data package and the EvtID package are sent. In the second, only the data package is sent, without the EvtID. In the last, only an EvtID is sent, without a data package.
[Image]

The Hit signal, which is connected via the external interrupt controller, signals that a new data package is arriving via SLINK.
The EvtID is only added to the data package when a TrigAck signal is also received, via another external interrupt, within the Hit frame (Hit == HIGH).
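A minimal sketch of how the outgoing package for the three test cases could be assembled. It assumes the 4-byte EvtID is appended after the 16 data bytes (the issue does not state the byte order); `build_package` and the size macros are hypothetical names, not part of the NEORV32 software framework.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DATA_BYTES  16u /* payload collected via SLINK */
#define EVTID_BYTES  4u /* event ID received via SDI (assumed appended last) */

/* Hypothetical helper: assemble the outgoing SDI package.
   data == NULL   -> no data package (test case 3)
   evt_id == NULL -> no EvtID         (test case 2)
   Returns the number of bytes written to out (at most 20). */
static size_t build_package(uint8_t out[DATA_BYTES + EVTID_BYTES],
                            const uint8_t *data,
                            const uint8_t *evt_id)
{
    size_t len = 0;
    assert(out != NULL);
    if (data != NULL) {
        memcpy(out + len, data, DATA_BYTES);
        len += DATA_BYTES;
    }
    if (evt_id != NULL) {
        memcpy(out + len, evt_id, EVTID_BYTES);
        len += EVTID_BYTES;
    }
    return len;
}
```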

To Reproduce
Run the neoipe_test_setup_bootloader_full_tb.vhd testbench together with the corresponding processor configuration neoipe_test_setup_bootloader_autoboot.vhd. Use spi_flash_mem_emu.sv as the module that simulates the upload of the executable.
The executable itself is stored in the neorv32_exe.mem file, in the hex format required by the flash module for simulation.

In contrast, the expected behavior occurs with the neoipe_test_setup_approm_full_tb.vhd testbench and the corresponding processor configuration
neoipe_approm_comp2_test.vhd. For that setup, neorv32_application_image.vhd is also provided.

Expected behavior
I would expect the same executable to behave the same way regardless of whether it is uploaded from an SPI flash (Indirect Boot) or stored directly in the IMEM (Direct Boot). In detail, this means that the three test cases from the testbench should produce the expected behavior from the processor, which is:

  1. When the Hit signal arrives together with a TrigAck signal within it, the CPU should create a package with Data+EvtID and send it via SDI.
  2. When only the Hit signal arrives, the processor should create and forward a package with data only.
  3. When only the TrigAck signal arrives, the processor should create and forward a package with the EvtID only.
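The three expected cases map onto a small decision function. In this sketch (handler, flag and type names are hypothetical), the XIRQ handlers only set flags, and the flags are declared `volatile` because they are written in interrupt context and read from the main loop; without `volatile` the compiler may keep a stale copy in a register.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { PCK_NONE, PCK_DATA_EVTID, PCK_DATA_ONLY, PCK_EVTID_ONLY } pck_kind_t;

/* Written from interrupt context, read by the main loop: must be volatile. */
static volatile bool hit_flag      = false;
static volatile bool trig_ack_flag = false;

/* Hypothetical XIRQ handlers, one channel each (rising-edge events). */
static void hit_irq_handler(void)      { hit_flag = true; }
static void trig_ack_irq_handler(void) { trig_ack_flag = true; }

/* Map the observed events to the package that should be sent. */
static pck_kind_t classify(bool hit, bool trig_ack)
{
    if (hit && trig_ack) return PCK_DATA_EVTID; /* case 1 */
    if (hit)             return PCK_DATA_ONLY;  /* case 2 */
    if (trig_ack)        return PCK_EVTID_ONLY; /* case 3 */
    return PCK_NONE;                            /* nothing happened yet */
}
```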

This screenshot shows the simulation of the Direct Boot testbench. As expected, the three test cases lead to different data output on the w_spi_miso signal line.
[Screenshot: Direct Boot simulation waveform]

Screenshots
Here the problematic behavior becomes visible. In contrast to Direct Boot, in the Indirect Boot simulation the software state machine transitions in the DataPackage + EvtID case from the Wait_Evt state directly into the Gen_Empty_Pck state, even though the external interrupt from the Hit signal comes first, which would normally prevent this. (This is visible in the waveform on the right; I output the current state as "debug information" via the GPIOs.)
[Screenshot: Indirect Boot simulation waveform, first test case]
This results in an empty package without data being sent, which should only happen when a TrigAck signal arrives without a Hit signal.

In the next test case, no EvtID is sent and no TrigAck signal is generated. Nevertheless, the state machine transitions into the Wait_Evt_Id state as if it had actually received a TrigAck interrupt. The testbench therefore gets stuck, because no EvtID is sent in this test case and the state machine remains in that state.
[Screenshot: Indirect Boot simulation waveform, second test case]
See the application software in main.c for the full code.

Environment:

  • Simulator: Xcelium 22.03
  • OS: Ubuntu 22.04 / 18.04 on simulation server
  • GCC version (RISC-V and native): gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0; I just used the standard make commands to create the .bin files

Hardware:

  • Hardware version (1.9.5.4)
  • CPU extensions: A, B, C, M, U, Zfinx, Zicntr, Zicond, Zihpm, Fast_Mul, Fast_Shift
  • CPU peripherals: GPIO, MTIME, UART0, SPI, SDI, SLINK, XIRQ, OCD, Bootloader

Additional context

I have tested the flash module used in the simulation separately in #878, with another testbench that I also provide together with a corresponding blink_led_neorv32.mem. There, the upload of the executable works and the GPIO shows the expected counting behavior of the blink_led example software.

Just to be sure, I also simulated the Direct Boot version with a UART module, and there the NEORV32 RTE does not print anything via UART, so the mysterious store address misaligned exception only occurs in the Indirect Boot scenario.

I also wonder whether this issue with the NeoRV32 RTE is somehow related to #626, as I can't find the reported instruction in the main.asm file.

Full Simulation Files
ProblemFiles.zip

@ucycg ucycg changed the title Unexpected external Interrupt/ XIRQ behavior in Indirect Boot from SPI Flash Unexpected external Interrupt/ XIRQ behavior in Indirect Boot from SPI Flash label: troubleshooting May 3, 2024
@ucycg ucycg changed the title Unexpected external Interrupt/ XIRQ behavior in Indirect Boot from SPI Flash label: troubleshooting Unexpected external Interrupt/ XIRQ behavior in Indirect Boot from SPI Flash May 3, 2024
@stnolting
Owner

Hey @ucycg!

Oh wow, this is a big one... 😅 Thanks for the detailed description!

> So I have this bigger project with the NeoRV32 where I try to implement an ASIC with the processor.

Now that sounds interesting! Can you tell us a bit more about it? 😉

> I would expect that the same executable results in the same behavior no matter if it's uploaded via an SPI flash in Indirect Boot or in the Direct Boot where it's stored directly in the IMEM.

You are right, there should be no difference at all. I use both mechanisms all the time and have never seen any difference. However, there might be a difference in the timing of the processor-external logic; or more precisely, in when your XIRQ interrupt fires.

> Hardware version (1.9.5.4)

I see that you are using an older version of the core. We identified a bug in the handling of the FIRQs where some interrupt requests were simply lost. See #818.

We tried to fix that in #821 and #829, but ultimately we reworked the entire FIRQ trigger mechanism (#858, #859, #860 and especially #864). Long story short, the FIRQs are now triggered by a high level rather than by a single edge. With this modification, a FIRQ remains pending as long as the corresponding interrupt signal is high, making it "impossible" to lose a triggering event.

I have not looked into your sources yet, but does this sound like it might be the reason for your issue?

@stnolting stnolting added the troubleshooting Something is not working as expected label May 4, 2024
@stnolting stnolting self-assigned this May 4, 2024
@ucycg
Contributor Author

ucycg commented May 6, 2024

Hey Stephan,

thank you for your quick reply. I will try my simulations with the latest NeoRV32 version and report back when I'm done. I'm not sure that this is the cause of my problem, but I think it's a good idea to update my project anyway. I had actually noticed the updates to the FIRQs, but I either didn't understand or forgot that they were due to a bug!

One of the reasons I delayed the update is that, as far as I understand it, a software functionality I implemented is no longer supported/possible with the latest CPU version. I use the SDI_CTRL_IRQ_TX_EMPTY interrupt, but I effectively turn it into an edge-based interrupt by disabling it in my own interrupt handler and clearing mip. I think I can just drop the interrupt for now and instead read the SDI_CTRL register in an extra state of my state machine.
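Replacing the interrupt with polling could look like the following sketch. The register is a plain variable standing in for SDI_CTRL, and the bit position `MOCK_SDI_TX_EMPTY_BIT` as well as the state names are hypothetical; the real register layout should be taken from the NEORV32 SDI driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the SDI control/status register. */
#define MOCK_SDI_TX_EMPTY_BIT 0u
static volatile uint32_t mock_sdi_ctrl = 0;

typedef enum { ST_WAIT_TX_EMPTY, ST_SEND_NEXT } poll_state_t;

/* One step of the extra polling state: remain in it until the
   TX FIFO reports empty, then move on to sending the next package. */
static poll_state_t step_wait_tx_empty(void)
{
    bool tx_empty = ((mock_sdi_ctrl >> MOCK_SDI_TX_EMPTY_BIT) & 1u) != 0u;
    return tx_empty ? ST_SEND_NEXT : ST_WAIT_TX_EMPTY;
}
```

Polling trades interrupt latency for simplicity: the state is re-evaluated on every pass of the main loop, so no pending bit ever has to be cleared by software.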

@ucycg
Contributor Author

ucycg commented May 6, 2024

> Now that sounds interesting! Can you tell us a bit more about it? 😉

Thank you for your interest. I might do that in the future, but right now I don't feel comfortable doing so yet :)

@ucycg
Contributor Author

ucycg commented May 6, 2024

Hello Stephan,

so I managed to update my project to the latest hardware version 1.9.9.1, but the behavior hasn't changed at all. I'm actually a bit surprised that in the Direct Boot scenario my old code works without any changes. As far as I understood, write access to mip isn't possible anymore. Maybe I'm confused about my code, but currently it seems that I can still execute this:

```c
neorv32_cpu_csr_clr(CSR_MIP, 1 << CSR_MIP_FIRQ11P);
```

Anyway, the XIRQ signal inside the FIRQ has changed as expected with the processor update, and as far as I can see right now, all interrupt handlers are actually executed. I will continue debugging and see if I can solve or narrow down the problem.

@stnolting
Owner

> now a functionality in sw that I implemented isn't supported/possible in the latest CPU version anymore. I use the SDI_CTRL_IRQ_TX_EMPTY Interrupt, but I'm sort of transforming it into an edge based interrupt by turning it off with my own interrupt handler and clearing the mip

This interrupt (TX FIFO empty) is still available, but it will keep firing until the interrupt-causing condition is no longer met (i.e. until data is written to the TX FIFO). You can still turn it into a single-shot interrupt within your interrupt handler by clearing SDI_CTRL_IRQ_TX_EMPTY until your program has written data to TX.
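The single-shot pattern can be illustrated with a toy model (not NEORV32 code; all names are hypothetical): the interrupt is pending for as long as it is enabled and the TX FIFO is empty, so the handler masks it, and the main program re-enables it after writing new data.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a level-triggered "TX FIFO empty" interrupt. */
typedef struct {
    bool     irq_tx_empty_en; /* stand-in for the SDI_CTRL_IRQ_TX_EMPTY enable bit */
    bool     tx_empty;        /* FIFO status, driven by "hardware" */
    unsigned handled;         /* number of handler invocations */
} sdi_model_t;

/* Level semantics: pending as long as enabled AND the condition holds. */
static bool irq_pending(const sdi_model_t *s)
{
    return s->irq_tx_empty_en && s->tx_empty;
}

/* Handler: do the work once, then mask the source so it stops re-firing. */
static void tx_empty_handler(sdi_model_t *s)
{
    s->handled++;
    s->irq_tx_empty_en = false;
}

/* Main program: writing to TX clears "empty", then the IRQ is re-armed. */
static void write_tx(sdi_model_t *s)
{
    s->tx_empty = false;
    s->irq_tx_empty_en = true;
}

/* The "CPU" takes the interrupt while it is pending (bounded loop). */
static void run_cpu(sdi_model_t *s, unsigned max_iter)
{
    while (max_iter-- > 0 && irq_pending(s)) {
        tx_empty_handler(s);
    }
}
```

Without the `irq_tx_empty_en = false` line, `run_cpu` would call the handler on every iteration until `max_iter` runs out, which is the "keeps firing" behavior of a level-triggered source.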

Maybe this does not perfectly fit your application, but the main goal of this entire FIRQ rework was to make all internal interrupts more consistent (and more RISC-V alike). 😉

> so I managed to update my project with the latest Hardware Version 1.9.9.1, but the behavior hasn't changed at all.

Sorry to hear that.

The executables for both boot scenarios are completely identical, right? Or in other words: how did you compile each of these executables? Are both using the same memory layout configuration?

Are you using the NEORV32 Runtime Environment for handling your interrupts or do you do this "bare metal"?

> Maybe I have a confusion in my code, but currently it seems that I can still execute this:
> `neorv32_cpu_csr_clr(CSR_MIP, 1 << CSR_MIP_FIRQ11P);`

You can still write to mip without raising an exception as the CSR is read-write according to the RISC-V spec. However, all bits are read-only so writing any data to them will have no effect.
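A toy model of that behavior (hypothetical names, not the CPU's implementation): the pending bits simply mirror the interrupt lines, so a clear is accepted but changes nothing. Bit 27 is only an example and assumes the fast-interrupt pending bits start at mip bit 16, so FIRQ11 would sit at 16 + 11.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of a CSR whose bits are all read-only: writes are accepted
   (no illegal-instruction trap) but discarded, and reads return the
   current state of the interrupt request lines. */
typedef struct {
    uint32_t irq_lines; /* driven by the interrupt sources, not by software */
} mip_model_t;

static uint32_t mip_read(const mip_model_t *m)
{
    return m->irq_lines;
}

static void mip_clear_bits(mip_model_t *m, uint32_t mask)
{
    (void)m;    /* read-only bits: the write has no effect */
    (void)mask;
}
```

The only way to make such a bit go low is to remove the cause at the source, e.g. by acknowledging the request in the peripheral.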

@ucycg
Contributor Author

ucycg commented May 8, 2024

So I still don't know what exactly the issue with my old code is, but I have found a solution for now.
I changed the code of my main.c application. For the state transition from Wait_Evt to Wait_TrigAck, I normally checked interrupt flags that are set by rising-edge XIRQs. Although from my point of view this triple-checks the condition I want, which seems like overkill, the following change made the code work as expected.

Old Code:

```c
if (hitFlag == true)
{
    /* ... */
    state = Wait_Trig_Ack;
}
else if (trigAckFlag == true && hitFlag == false)
{
    /* ... */
    state = Gen_Empty_Pck;
}
```

New Code:

```c
if (hitFlag == true)
{
    /* ... */
    state = Wait_Trig_Ack;
}
else if (trigAckFlag == true && hitFlag == false && clearHitFlag == false)
{
    /* ... */
    state = Gen_Empty_Pck;
}
```

It's weird, because in theory, according to my understanding, the hitFlag is set by an interrupt with higher priority than the trigAckFlag, and I think I can see in the simulation that the corresponding interrupt handler is also called first, so it should work without the extra clearHitFlag condition.
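Priority only decides which request is served first when several are pending at the same time. The sketch below assumes the lowest pending channel wins, with Hit on channel 0 and TrigAck on channel 1; the channel assignment and function names are hypothetical. If the TrigAck edge is latched before the Hit edge arrives, the TrigAck request is served on its own, no matter how the channels are prioritized.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical dispatcher: among all pending channels, pick the one with
   the lowest index (assumed to be the highest priority). */
static int next_channel(uint32_t pending)
{
    for (int ch = 0; ch < 32; ch++) {
        if (pending & (1u << ch)) {
            return ch;
        }
    }
    return -1; /* nothing pending */
}

/* Serve everything currently pending and record the order (max 32). */
static int serve_all(uint32_t pending, int order[32])
{
    int n = 0;
    int ch;
    while ((ch = next_channel(pending)) >= 0) {
        assert(n < 32);
        order[n++] = ch;
        pending &= ~(1u << ch); /* acknowledge the served channel */
    }
    return n;
}
```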

I made this whiteboard drawing to hopefully clarify my reasoning:
[Image: whiteboard drawing of cases A) and B)]

@ucycg
Contributor Author

ucycg commented May 8, 2024

I am trying to tell apart the two cases A) and B). With the "bug", my state machine behaved as if case B) had occurred, although the input signals were as in case A).

@ucycg ucycg closed this as completed May 8, 2024
@stnolting
Owner

> It's weird, because in theory the hitFlag according to my understanding is set by an interrupt with higher priority than the trigAckFlag

Are both flags updated by interrupts? If so, are they both updated by the XIRQ interrupt?
Maybe there is a problem with the XIRQ prioritization (inside the interrupt handler?) resulting in some kind of race condition... 🤔
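One way to rule out this kind of race is to take a consistent snapshot of both flags before deciding, so that no handler can run between the individual reads. In the sketch below, `irq_disable()`/`irq_enable()` are hypothetical stubs; on the NEORV32 they would typically clear and set the global interrupt enable `mstatus.MIE`. The decision then uses only the local copies, so `hitFlag` is never re-read as in the original `else if`.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { WAIT_EVT, WAIT_TRIG_ACK, GEN_EMPTY_PCK } state_t;

static volatile bool hit_flag      = false; /* set by the Hit XIRQ handler */
static volatile bool trig_ack_flag = false; /* set by the TrigAck XIRQ handler */

/* Hypothetical stubs: on hardware these would mask/unmask interrupts
   globally so that both flags are read as one consistent snapshot. */
static void irq_disable(void) {}
static void irq_enable(void)  {}

static state_t wait_evt_step(void)
{
    irq_disable();
    bool hit      = hit_flag;      /* read each flag exactly once ... */
    bool trig_ack = trig_ack_flag; /* ... while no handler can run    */
    irq_enable();

    if (hit) {
        return WAIT_TRIG_ACK;
    }
    if (trig_ack) {
        return GEN_EMPTY_PCK;      /* TrigAck seen without any Hit */
    }
    return WAIT_EVT;
}
```

This does not by itself explain why the behavior differs between the two boot modes; it only removes one possible source of inconsistent flag reads.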
