Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added test case in sw/example to demonstrate floating point normalize… #528

Merged
merged 4 commits into from
Mar 2, 2023

Conversation

mikaelsky
Copy link
Collaborator

…r time out bug. Fixed the bug by adding a float exponent trigger for the float to int and addsub sections. The fix detects whether the normalization step is larger than the destination width and resolve to a default if so

…r time out bug. Fixed the bug by adding a float exponent trigger for the float to int and addsub sections. The fix detects whether the normalization step is larger than the destination width and resolve to a default if so
@mikaelsky
Copy link
Collaborator Author

One of the core safety features, the multi-cycle instruction time-out, is set to 128 cycles. The root cause was determined by increasing the time to 256 cycles making FCVT.W*.S instructions pass and 512 cycles making FADD.S/FSUB.S instructions pass.

The issue was discovered with various FPGA builds where the FPU instructions timed out running the floating point test. The issue was seen on build server machines running RHEL and Ubuntu with Vivado 2020.2, 2022.2, 2022.2.1. FPGAs built in a non-server machine using Ubuntu with Vivado 2022.2 does not expose this issue.

The issue was root caused and determined to be a real issue. Applying the fix causes all FPGA build to pass.

The corner test case demonstrates the issue for the afflicted instructions.

For FCVT.W*.S the normalizer shift time for a positive exponent is equal to the size of the exponent. The result is a worst case shift time of 127 cycles (max. positive exponent). From the corner test the instructions fail from a positive exponent of 123 (0x250) and up. Which when adding the additional cycle delays in the FPU results in 128 cycles.
The resolution is to add an additional trigger that checks if the positive exponent is larger than XLEN+1. If it is larger than XLEN+1 then the normalization will always result in an overflow. When the condition is detected the overflow status is set and the state machine goes to the next stage.
Note: The +1 might not be needed as a '>' compare is used. It is mostly a precaution.
Result: The FCVT.W*.S instruction cycle time is now hard capped at XLEN + 1 + 5 (round trip delay), which for a 32 bit core is 38 cycles. Its a slight improvement from the reported cycle count of 47/48 cycles.

For FADD.S and FSUB.S the cross normalizer shift time is equivalent to the difference between the two exponents. This results in a worst case shift time of 127 - (-126) or 253 cycles. This far exceeds the 128 cycle limit.
The resolution is to add a trigger that determines if the difference between the large_exp and small_exp is more than the size of the mantissa 24 bits (23+hidden 1) + the 3 underflow bits GRS, or 27. If the difference in exponent is larger than 27 then the small_mant will always be 0 as it will shifted out. When the condition is detected we set the small_mant to 0 and the ext_s bit to 1, as this is the result. We then set exp_cnt to large_exp to trigger the machine to go to the next stage.
Note: The value could be 26 as that last shift will just set ext_s to 1.
Note: As opposed to the integer operations we really have no XLEN. So the fix will only work for single precision float numbers.
Result: The FADD.S/FSUB.S instructions are now hard capped at 27 + FPU overhead, so mid 30s. A side effect of the fix as that the cycle count for FADD.S/FSUB.S is down to 1/3 of the current reported 110 cycles.

@stnolting
Copy link
Owner

Hey @mikaelsky!

Thanks for this PR! Just let me try to recap this:

  • the default FPU raises an illegal instruction exception due to a co-processor timeout (processing stakes longer than 128 cycles); seems like this is a scenario I did not properly test 🙈
  • your modifications fixes this by improving the FPU itself (so it requires less cycles to complete processing)

Correct so far? 😅

@stnolting stnolting self-assigned this Mar 2, 2023
@stnolting stnolting added the HW hardware-related label Mar 2, 2023
@mikaelsky
Copy link
Collaborator Author

Correct. Sorry for my wordy comment :)
Yeah the two small changes fixes the issue in question. The code example just demonstrates the issue.

@stnolting
Copy link
Owner

This is really great!
I just tested your corner case program and I can confirm the bug in the FPU. I will just do some minor edits to your PR (we can discuss that if you like) before merging this.

Thank you very much for fixing! 👍

@stnolting stnolting added bug Something isn't working optimization Make things faster, smaller and more efficient labels Mar 2, 2023
@stnolting stnolting merged commit 878de68 into stnolting:main Mar 2, 2023
@mikaelsky mikaelsky deleted the floating_point_normalizer_issue branch March 2, 2023 22:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working HW hardware-related optimization Make things faster, smaller and more efficient
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants