Added test case in sw/example to demonstrate floating point normalize… #528

mikaelsky · 2023-03-02T02:02:10Z

…r time out bug. Fixed the bug by adding a float exponent trigger for the float to int and addsub sections. The fix detects whether the normalization step is larger than the destination width and resolve to a default if so

mikaelsky · 2023-03-02T15:49:51Z

One of the core safety features, the multi-cycle instruction time-out, is set to 128 cycles. The root cause was determined by increasing the time to 256 cycles making FCVT.W*.S instructions pass and 512 cycles making FADD.S/FSUB.S instructions pass.

The issue was discovered with various FPGA builds where the FPU instructions timed out running the floating point test. The issue was seen on build server machines running RHEL and Ubuntu with Vivado 2020.2, 2022.2, 2022.2.1. FPGAs built in a non-server machine using Ubuntu with Vivado 2022.2 does not expose this issue.

The issue was root caused and determined to be a real issue. Applying the fix causes all FPGA build to pass.

The corner test case demonstrates the issue for the afflicted instructions.

For FCVT.W*.S the normalizer shift time for a positive exponent is equal to the size of the exponent. The result is a worst case shift time of 127 cycles (max. positive exponent). From the corner test the instructions fail from a positive exponent of 123 (0x250) and up. Which when adding the additional cycle delays in the FPU results in 128 cycles.
The resolution is to add an additional trigger that checks if the positive exponent is larger than XLEN+1. If it is larger than XLEN+1 then the normalization will always result in an overflow. When the condition is detected the overflow status is set and the state machine goes to the next stage.
Note: The +1 might not be needed as a '>' compare is used. It is mostly a precaution.
Result: The FCVT.W*.S instruction cycle time is now hard capped at XLEN + 1 + 5 (round trip delay), which for a 32 bit core is 38 cycles. Its a slight improvement from the reported cycle count of 47/48 cycles.

For FADD.S and FSUB.S the cross normalizer shift time is equivalent to the difference between the two exponents. This results in a worst case shift time of 127 - (-126) or 253 cycles. This far exceeds the 128 cycle limit.
The resolution is to add a trigger that determines if the difference between the large_exp and small_exp is more than the size of the mantissa 24 bits (23+hidden 1) + the 3 underflow bits GRS, or 27. If the difference in exponent is larger than 27 then the small_mant will always be 0 as it will shifted out. When the condition is detected we set the small_mant to 0 and the ext_s bit to 1, as this is the result. We then set exp_cnt to large_exp to trigger the machine to go to the next stage.
Note: The value could be 26 as that last shift will just set ext_s to 1.
Note: As opposed to the integer operations we really have no XLEN. So the fix will only work for single precision float numbers.
Result: The FADD.S/FSUB.S instructions are now hard capped at 27 + FPU overhead, so mid 30s. A side effect of the fix as that the cycle count for FADD.S/FSUB.S is down to 1/3 of the current reported 110 cycles.

stnolting · 2023-03-02T16:43:19Z

Hey @mikaelsky!

Thanks for this PR! Just let me try to recap this:

the default FPU raises an illegal instruction exception due to a co-processor timeout (processing stakes longer than 128 cycles); seems like this is a scenario I did not properly test 🙈
your modifications fixes this by improving the FPU itself (so it requires less cycles to complete processing)

Correct so far? 😅

mikaelsky · 2023-03-02T16:51:06Z

Correct. Sorry for my wordy comment :)
Yeah the two small changes fixes the issue in question. The code example just demonstrates the issue.

stnolting · 2023-03-02T17:03:53Z

This is really great!
I just tested your corner case program and I can confirm the bug in the FPU. I will just do some minor edits to your PR (we can discuss that if you like) before merging this.

Thank you very much for fixing! 👍

stnolting self-assigned this Mar 2, 2023

stnolting added the HW hardware-related label Mar 2, 2023

stnolting added 2 commits March 2, 2023 17:54

[sw/example] add linebreaks + credits

a5ab870

[sw/example] add link to according PR

a8e9eb5

stnolting added bug Something isn't working optimization Make things faster, smaller and more efficient labels Mar 2, 2023

update version to v1.8.1.4

bcfc830

stnolting approved these changes Mar 2, 2023

View reviewed changes

stnolting merged commit 878de68 into stnolting:main Mar 2, 2023

mikaelsky deleted the floating_point_normalizer_issue branch March 2, 2023 22:13

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Added test case in sw/example to demonstrate floating point normalize… #528

Added test case in sw/example to demonstrate floating point normalize… #528

mikaelsky commented Mar 2, 2023

mikaelsky commented Mar 2, 2023

stnolting commented Mar 2, 2023

mikaelsky commented Mar 2, 2023

stnolting commented Mar 2, 2023

Added test case in sw/example to demonstrate floating point normalize… #528

Added test case in sw/example to demonstrate floating point normalize… #528

Conversation

mikaelsky commented Mar 2, 2023

mikaelsky commented Mar 2, 2023

stnolting commented Mar 2, 2023

mikaelsky commented Mar 2, 2023

stnolting commented Mar 2, 2023