
ref::quant_dot for int8 computes wrong result in test validation #1815

Closed
jerryyin opened this issue Jun 6, 2023 · 6 comments · Fixed by #1910
Labels: bug (Something isn't working)

Comments

jerryyin (Member) commented Jun 6, 2023

Refer to branch: bug-reference-dot-incorrect-result
The unit test is here: 1a6495a

I wrote this unit test so that it functions exactly the same as the original int8_quantization test.

Reproducing instructions:

```shell
$ make -j test_gpu_quantization && MIGRAPHX_ENABLE_MLIR=0 MIGRAPHX_TRACE_EVAL=2 ./bin/test_gpu_quantization &> compile.log
```

The failure looks like:

```
ref_result: 2.16254 0.625949 -1.01545 0.423965 1.68476 -0.632285 -1.48114 -0.31126 0.0567301 -2.37877 2.09053 0.0468488 -0.190883 0.875132 0.370431 0.881467 -1.69853 0.0260982 -0.874492 -0.846999 -0.261098 0.905008 -0.264992 0.496156 -1.76788 -0.878503 1.12251 0.69105 1.27282 -0.503538 2.07489 0.756498 -0.486856 0.412165 0.868273 0.486856 0.49354 0.11596 2.20812 0.532949
gpu_result: 2.16499 0.625949 -1.01097 0.406759 1.68476 -0.632285 -1.48114 -0.31126 0.0596945 -2.37877 2.09843 0.047837 -0.190883 0.875131 0.370431 0.881467 -1.70196 0.0260982 -0.876003 -0.853393 -0.261098 0.905008 -0.264992 0.496156 -1.7738 -0.878503 1.11315 0.696978 1.27282 -0.503538 2.07489 0.756498 -0.487321 0.412165 0.861879 0.495284 0.49354 0.11596 2.20811 0.532949
void int8_quantization_bug()
/root/AMDMIGraphX/test/gpu/quantization.cpp:225:
FAILED: migraphx::verify_range(ref_result, gpu_result) [ 0 ]
[ FAILED ] int8_quantization_bug: Test failure
```

Full log is here: compile.log

Analysis

Both the CPU and the GPU are computing the exact same GEMM problem:

  • Input 1 of the gemm [5 x 16]:

59, 102, 25, 34, 34, 127, -102, 51, 25, 119, -42, 59, 0, 93, 42, 127, 127, -110, 127, 85, -127, 93, -34, -76, -102, -76, 25, 34, 119, -102, 51, -85, 102, 51, -34, 51, -8, -119, -59, -76, 85, -51, -127, -68, 51, -25, -59, 102, -51, 25, -68, 85, -8, 0, 76, -34, 85, 17, 0, 34, 110, 25, -102, -102, -17, 8, -68, 25, -8, 93, 51, -127, 59, 76, 119, 42, 25, -17, -8, -119

  • Input 2 of the gemm [16 x 8]:

56, 95, 24, 32, 32, 119, -95, 48, 24, 111, -40, 56, 0, 87, 40, 119, 119, -103, 119, 79, -119, 87, -32, -71, -95, -71, 24, 32, 111, -95, 48, -79, 95, 48, -32, 48, -8, -111, -56, -71, 79, -48, -119, -64, 48, -24, -56, 95, -48, 24, -64, 79, -8, 0, 71, -32, 79, 16, 0, 32, 103, 24, -95, -95, -16, 8, -64, 24, -8, 87, 48, -119, 56, 71, 111, 40, 24, -16, -8, -111, 71, 8, 95, 64, 79, 103, 119, -79, -32, -8, 79, 8, 40, -48, 24, -32, 24, -103, 71, 87, 95, -24, 111, 8, 48, -16, 0, 103, 71, -103, 0, 103, 64, 87, -119, 111, -24, -87, 40, -111, 8, -40, -127, -64, 56, 0, -16, -111

  • cpu result [5 x 8]:

37205, 10769, -17470, 7294, 28985, -10878, -25482, -5355, 976, -40925, 35966, 806, -3284, 15056, 6373, 15165, -29222, 449, -15045, -14572, -4492, 15570, -4559, 8536, -30415, -15114, 19312, 11889, 21898, -8663, 35697, 13015, -8376, 7091, 14938, 8376, 8491, 1995, 37989, 9169

  • gpu result [5 x 8]:

37247, 10769, -17393, 6998, 28985, -10878, -25482, -5355, 1027, -40925, 36102, 823, -3284, 15056, 6373, 15165, -29281, 449, -15071, -14682, -4492, 15570, -4559, 8536, -30517, -15114, 19151, 11991, 21898, -8663, 35697, 13015, -8384, 7091, 14828, 8521, 8491, 1995, 37989, 9169

  • Summary

The CPU reference result (computed by MIGraphX) is wrong, and the GPU result (computed by rocBLAS) is correct. I validated this by computing the product with a third-party matrix multiplier.
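The claim can be checked independently with exact integer arithmetic. A minimal sketch using NumPy (my choice of tool, not necessarily the third-party multiplier used in the original report), feeding in the int8 operands listed above:

```python
# Recompute the 5x16 * 16x8 int8 GEMM above with exact int32 accumulation
# to see which of the two listed results matches.
import numpy as np

a = np.array([59, 102, 25, 34, 34, 127, -102, 51, 25, 119, -42, 59, 0, 93, 42, 127,
              127, -110, 127, 85, -127, 93, -34, -76, -102, -76, 25, 34, 119, -102, 51, -85,
              102, 51, -34, 51, -8, -119, -59, -76, 85, -51, -127, -68, 51, -25, -59, 102,
              -51, 25, -68, 85, -8, 0, 76, -34, 85, 17, 0, 34, 110, 25, -102, -102,
              -17, 8, -68, 25, -8, 93, 51, -127, 59, 76, 119, 42, 25, -17, -8, -119],
             dtype=np.int32).reshape(5, 16)
b = np.array([56, 95, 24, 32, 32, 119, -95, 48, 24, 111, -40, 56, 0, 87, 40, 119,
              119, -103, 119, 79, -119, 87, -32, -71, -95, -71, 24, 32, 111, -95, 48, -79,
              95, 48, -32, 48, -8, -111, -56, -71, 79, -48, -119, -64, 48, -24, -56, 95,
              -48, 24, -64, 79, -8, 0, 71, -32, 79, 16, 0, 32, 103, 24, -95, -95,
              -16, 8, -64, 24, -8, 87, 48, -119, 56, 71, 111, 40, 24, -16, -8, -111,
              71, 8, 95, 64, 79, 103, 119, -79, -32, -8, 79, 8, 40, -48, 24, -32,
              24, -103, 71, 87, 95, -24, 111, 8, 48, -16, 0, 103, 71, -103, 0, 103,
              64, 87, -119, 111, -24, -87, 40, -111, 8, -40, -127, -64, 56, 0, -16, -111],
             dtype=np.int32).reshape(16, 8)

c = a @ b  # exact integer accumulation, no rounding involved
print(c[0, 0])  # 37247: matches the GPU (rocBLAS) result, not the CPU ref's 37205
```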

Conclusion

  • The CPU (ref) dot on the int8 type computes a wrong result, causing the new unit test to fail.
  • The int8_quantization test, which should also be failing, instead passes.
jerryyin added the bug (Something isn't working) label Jun 6, 2023
kahmed10 (Collaborator) commented
I'm still working on reproducing the test from that commit, but in the meantime I wrote the following ref quant_dot test and printed the first 5 elements as a sanity check:

```cpp
TEST_CASE(quant_dot_2d_test)
{
    migraphx::program p;

    auto* mm = p.get_main_module();
    std::vector<int8_t> a = {59, 102, 25, 34, 34, 127, -102, 51, 25, 119, -42, 59, 0, 93, 42, 127, 127, -110, 127, 85, -127, 93, -34, -76, -102, -76, 25, 34, 119, -102, 51, -85, 102, 51, -34, 51, -8, -119, -59, -76, 85, -51, -127, -68, 51, -25, -59, 102, -51, 25, -68, 85, -8, 0, 76, -34, 85, 17, 0, 34, 110, 25, -102, -102, -17, 8, -68, 25, -8, 93, 51, -127, 59, 76, 119, 42, 25, -17, -8, -119};
    std::vector<int8_t> b = {56, 95, 24, 32, 32, 119, -95, 48, 24, 111, -40, 56, 0, 87, 40, 119, 119, -103, 119, 79, -119, 87, -32, -71, -95, -71, 24, 32, 111, -95, 48, -79, 95, 48, -32, 48, -8, -111, -56, -71, 79, -48, -119, -64, 48, -24, -56, 95, -48, 24, -64, 79, -8, 0, 71, -32, 79, 16, 0, 32, 103, 24, -95, -95, -16, 8, -64, 24, -8, 87, 48, -119, 56, 71, 111, 40, 24, -16, -8, -111, 71, 8, 95, 64, 79, 103, 119, -79, -32, -8, 79, 8, 40, -48, 24, -32, 24, -103, 71, 87, 95, -24, 111, 8, 48, -16, 0, 103, 71, -103, 0, 103, 64, 87, -119, 111, -24, -87, 40, -111, 8, -40, -127, -64, 56, 0, -16, -111};

    migraphx::shape a_shape{migraphx::shape::int8_type, {5, 16}};
    auto al = mm->add_literal(migraphx::literal{a_shape, a});
    migraphx::shape b_shape{migraphx::shape::int8_type, {16, 8}};
    auto bl = mm->add_literal(migraphx::literal{b_shape, b});
    mm->add_instruction(migraphx::make_op("quant_dot"), al, bl);
    p.compile(migraphx::make_target("ref"));
    auto result = p.eval({}).back();
    std::vector<int32_t> results_vector;
    // Expected values: the GPU (rocBLAS) results listed above.
    std::vector<int32_t> c = {37247, 10769, -17393, 6998, 28985, -10878, -25482, -5355, 1027, -40925, 36102, 823, -3284, 15056, 6373, 15165, -29281, 449, -15071, -14682, -4492, 15570, -4559, 8536, -30517, -15114, 19151, 11991, 21898, -8663, 35697, 13015, -8384, 7091, 14828, 8521, 8491, 1995, 37989, 9169};

    result.visit([&](auto output) { results_vector.assign(output.begin(), output.end()); });
    for(auto i = 0; i < 5; i++)
    {
        std::cout << results_vector[i] << std::endl;
    }
    EXPECT(migraphx::verify_range(c, results_vector));
}
```

This test passes on develop branch. The first 5 elements are the same as the GPU results printed above.

jerryyin (Member, Author) commented
@kahmed10 The important part of the puzzle is why the reference test gives different results in different scenarios. I admit it can yield the correct result in the MLIR dedicated test too, but this one is very particular to the quantization test. Please let me know if you can reproduce it in the branch I posted.

pfultz2 (Collaborator) commented Jun 21, 2023

> This test passes on develop branch. The first 5 elements are the same as the GPU results printed above.

This should probably use EXPECT(c == results_vector) and not verify_range.
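For illustration, using the one mismatching element reported above: a tolerance-based check in the style of verify_range can pass where exact integer equality fails (the 1% tolerance here is illustrative, not MIGraphX's actual default):

```python
# Sketch: a tolerant comparison can mask small integer discrepancies
# that an exact EXPECT(c == results_vector) check would catch.
ref = [37205, 10769]  # element [0][0] is off by 42 (~0.11% relative error)
gpu = [37247, 10769]

rel_err = max(abs(r - g) / abs(r) for r, g in zip(ref, gpu))
tolerant_pass = rel_err < 0.01  # loose relative tolerance: passes
exact_pass = ref == gpu         # exact equality: fails

print(tolerant_pass, exact_pass)  # True False
```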

kahmed10 (Collaborator) commented Jun 21, 2023

> @kahmed10 The important part of the puzzle is why the reference test gives different results in different scenarios. I admit it can yield the correct result in the MLIR dedicated test too, but this one is very particular to the quantization test. Please let me know if you can reproduce it in the branch I posted.

Yes, I can reproduce the difference in results (the failure) from your branch. I will need to investigate further to find what is causing it.

kahmed10 (Collaborator) commented
> This test passes on develop branch. The first 5 elements are the same as the GPU results printed above.
>
> This should probably use EXPECT(c == results_vector) and not verify_range.

It still passes when checking c == results_vector instead, meaning the ref GEMM is still giving the exact correct result.

jerryyin assigned kahmed10 and unassigned causten Jun 22, 2023
kahmed10 (Collaborator) commented
After further inspection, the difference between the CPU and GPU results is caused by unintended rounding on the GPU codegen side. We have a fix in place and will open a PR soon.
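The thread doesn't show the offending codegen itself, but as a hypothetical illustration of this class of bug (not the actual MIGraphX fix): converting values through a float path with the wrong rounding mode can shift individual elements by one, and the shift then propagates through the integer accumulation.

```python
# Hypothetical sketch: truncation vs round-to-nearest when converting a
# scaled float to an integer shifts an element by 1, which changes the
# downstream int32 dot product.
vals = [2.6, -1.4, 0.5]
weights = [10, 10, 10]

trunc = [int(v) for v in vals]      # truncation toward zero  -> [2, -1, 0]
nearest = [round(v) for v in vals]  # round-half-to-even       -> [3, -1, 0]

dot_trunc = sum(t * w for t, w in zip(trunc, weights))      # 10
dot_nearest = sum(n * w for n, w in zip(nearest, weights))  # 20
```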

kahmed10 linked a pull request Jun 30, 2023 that will close this issue
causten pushed a commit that referenced this issue Jul 5, 2023
Fixes the failing test case in #1815. Added a test that would otherwise fail without the change.