
ref::quant_dot for int8 computes wrong result in test validation #1815

Closed
jerryyin opened this issue Jun 6, 2023 · 6 comments · Fixed by #1910
Labels: bug (Something isn't working)

Comments

jerryyin (Member) commented Jun 6, 2023

Refer to branch: bug-reference-dot-incorrect-result
The unit test is here: 1a6495a

I wrote this unit test so that it functions exactly the same as the original int8_quantization test.

Reproducing instructions:

```shell
$ make -j test_gpu_quantization && MIGRAPHX_ENABLE_MLIR=0 MIGRAPHX_TRACE_EVAL=2 ./bin/test_gpu_quantization &> compile.log
```

The failure looks like:

```
ref_result: 2.16254 0.625949 -1.01545 0.423965 1.68476 -0.632285 -1.48114 -0.31126 0.0567301 -2.37877 2.09053 0.0468488 -0.190883 0.875132 0.370431 0.881467 -1.69853 0.0260982 -0.874492 -0.846999 -0.261098 0.905008 -0.264992 0.496156 -1.76788 -0.878503 1.12251 0.69105 1.27282 -0.503538 2.07489 0.756498 -0.486856 0.412165 0.868273 0.486856 0.49354 0.11596 2.20812 0.532949
gpu_result: 2.16499 0.625949 -1.01097 0.406759 1.68476 -0.632285 -1.48114 -0.31126 0.0596945 -2.37877 2.09843 0.047837 -0.190883 0.875131 0.370431 0.881467 -1.70196 0.0260982 -0.876003 -0.853393 -0.261098 0.905008 -0.264992 0.496156 -1.7738 -0.878503 1.11315 0.696978 1.27282 -0.503538 2.07489 0.756498 -0.487321 0.412165 0.861879 0.495284 0.49354 0.11596 2.20811 0.532949
void int8_quantization_bug()
/root/AMDMIGraphX/test/gpu/quantization.cpp:225:
FAILED: migraphx::verify_range(ref_result, gpu_result) [ 0 ]
[ FAILED ] int8_quantization_bug: Test failure
```

Full log is here: compile.log

Analysis

Both the CPU and the GPU are computing the exact same GEMM problem:

  • Input 1 of the gemm [5 x 16]:

59, 102, 25, 34, 34, 127, -102, 51, 25, 119, -42, 59, 0, 93, 42, 127, 127, -110, 127, 85, -127, 93, -34, -76, -102, -76, 25, 34, 119, -102, 51, -85, 102, 51, -34, 51, -8, -119, -59, -76, 85, -51, -127, -68, 51, -25, -59, 102, -51, 25, -68, 85, -8, 0, 76, -34, 85, 17, 0, 34, 110, 25, -102, -102, -17, 8, -68, 25, -8, 93, 51, -127, 59, 76, 119, 42, 25, -17, -8, -119

  • Input 2 of the gemm [16 x 8]:

56, 95, 24, 32, 32, 119, -95, 48, 24, 111, -40, 56, 0, 87, 40, 119, 119, -103, 119, 79, -119, 87, -32, -71, -95, -71, 24, 32, 111, -95, 48, -79, 95, 48, -32, 48, -8, -111, -56, -71, 79, -48, -119, -64, 48, -24, -56, 95, -48, 24, -64, 79, -8, 0, 71, -32, 79, 16, 0, 32, 103, 24, -95, -95, -16, 8, -64, 24, -8, 87, 48, -119, 56, 71, 111, 40, 24, -16, -8, -111, 71, 8, 95, 64, 79, 103, 119, -79, -32, -8, 79, 8, 40, -48, 24, -32, 24, -103, 71, 87, 95, -24, 111, 8, 48, -16, 0, 103, 71, -103, 0, 103, 64, 87, -119, 111, -24, -87, 40, -111, 8, -40, -127, -64, 56, 0, -16, -111

  • cpu result [5 x 8]:

37205, 10769, -17470, 7294, 28985, -10878, -25482, -5355, 976, -40925, 35966, 806, -3284, 15056, 6373, 15165, -29222, 449, -15045, -14572, -4492, 15570, -4559, 8536, -30415, -15114, 19312, 11889, 21898, -8663, 35697, 13015, -8376, 7091, 14938, 8376, 8491, 1995, 37989, 9169

  • gpu result [5 x 8]:

37247, 10769, -17393, 6998, 28985, -10878, -25482, -5355, 1027, -40925, 36102, 823, -3284, 15056, 6373, 15165, -29281, 449, -15071, -14682, -4492, 15570, -4559, 8536, -30517, -15114, 19151, 11991, 21898, -8663, 35697, 13015, -8384, 7091, 14828, 8521, 8491, 1995, 37989, 9169

  • Summary

The CPU reference result (computed by MIGraphX) is wrong, and the GPU result (computed by rocBLAS) is correct. I validated this by computing the product with a third-party matrix multiplier.
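The claim can be checked independently with exact integer arithmetic. A minimal sketch using NumPy (my choice of tool, not necessarily the third-party multiplier used in the original report), feeding in the int8 operands listed above:

```python
# Recompute the 5x16 * 16x8 int8 GEMM above with exact int32 accumulation
# to see which of the two listed results matches.
import numpy as np

a = np.array([59, 102, 25, 34, 34, 127, -102, 51, 25, 119, -42, 59, 0, 93, 42, 127,
              127, -110, 127, 85, -127, 93, -34, -76, -102, -76, 25, 34, 119, -102, 51, -85,
              102, 51, -34, 51, -8, -119, -59, -76, 85, -51, -127, -68, 51, -25, -59, 102,
              -51, 25, -68, 85, -8, 0, 76, -34, 85, 17, 0, 34, 110, 25, -102, -102,
              -17, 8, -68, 25, -8, 93, 51, -127, 59, 76, 119, 42, 25, -17, -8, -119],
             dtype=np.int32).reshape(5, 16)
b = np.array([56, 95, 24, 32, 32, 119, -95, 48, 24, 111, -40, 56, 0, 87, 40, 119,
              119, -103, 119, 79, -119, 87, -32, -71, -95, -71, 24, 32, 111, -95, 48, -79,
              95, 48, -32, 48, -8, -111, -56, -71, 79, -48, -119, -64, 48, -24, -56, 95,
              -48, 24, -64, 79, -8, 0, 71, -32, 79, 16, 0, 32, 103, 24, -95, -95,
              -16, 8, -64, 24, -8, 87, 48, -119, 56, 71, 111, 40, 24, -16, -8, -111,
              71, 8, 95, 64, 79, 103, 119, -79, -32, -8, 79, 8, 40, -48, 24, -32,
              24, -103, 71, 87, 95, -24, 111, 8, 48, -16, 0, 103, 71, -103, 0, 103,
              64, 87, -119, 111, -24, -87, 40, -111, 8, -40, -127, -64, 56, 0, -16, -111],
             dtype=np.int32).reshape(16, 8)

c = a @ b  # exact integer accumulation, no rounding involved
print(c[0, 0])  # 37247: matches the GPU (rocBLAS) result, not the CPU ref's 37205
```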

Conclusion

  • The CPU (ref) dot on the int8 type computes a wrong result, causing the new unit test to fail.
  • The int8_quantization test, which should also be failing, instead passes.
jerryyin added the bug (Something isn't working) label Jun 6, 2023
kahmed10 (Collaborator) commented
I'm still working on reproducing the test from that commit, but in the meantime I wrote the following ref quant_dot test and printed the first 5 elements as a sanity check:

```cpp
TEST_CASE(quant_dot_2d_test)
{
    migraphx::program p;

    auto* mm = p.get_main_module();
    std::vector<int8_t> a = {59, 102, 25, 34, 34, 127, -102, 51, 25, 119, -42, 59, 0, 93, 42, 127, 127, -110, 127, 85, -127, 93, -34, -76, -102, -76, 25, 34, 119, -102, 51, -85, 102, 51, -34, 51, -8, -119, -59, -76, 85, -51, -127, -68, 51, -25, -59, 102, -51, 25, -68, 85, -8, 0, 76, -34, 85, 17, 0, 34, 110, 25, -102, -102, -17, 8, -68, 25, -8, 93, 51, -127, 59, 76, 119, 42, 25, -17, -8, -119};
    std::vector<int8_t> b = {56, 95, 24, 32, 32, 119, -95, 48, 24, 111, -40, 56, 0, 87, 40, 119, 119, -103, 119, 79, -119, 87, -32, -71, -95, -71, 24, 32, 111, -95, 48, -79, 95, 48, -32, 48, -8, -111, -56, -71, 79, -48, -119, -64, 48, -24, -56, 95, -48, 24, -64, 79, -8, 0, 71, -32, 79, 16, 0, 32, 103, 24, -95, -95, -16, 8, -64, 24, -8, 87, 48, -119, 56, 71, 111, 40, 24, -16, -8, -111, 71, 8, 95, 64, 79, 103, 119, -79, -32, -8, 79, 8, 40, -48, 24, -32, 24, -103, 71, 87, 95, -24, 111, 8, 48, -16, 0, 103, 71, -103, 0, 103, 64, 87, -119, 111, -24, -87, 40, -111, 8, -40, -127, -64, 56, 0, -16, -111};

    migraphx::shape a_shape{migraphx::shape::int8_type, {5, 16}};
    auto al = mm->add_literal(migraphx::literal{a_shape, a});
    migraphx::shape b_shape{migraphx::shape::int8_type, {16, 8}};
    auto bl = mm->add_literal(migraphx::literal{b_shape, b});
    mm->add_instruction(migraphx::make_op("quant_dot"), al, bl);
    p.compile(migraphx::make_target("ref"));
    auto result = p.eval({}).back();
    std::vector<int32_t> results_vector;
    // Expected values: the GPU (rocBLAS) results listed above.
    std::vector<int32_t> c = {37247, 10769, -17393, 6998, 28985, -10878, -25482, -5355, 1027, -40925, 36102, 823, -3284, 15056, 6373, 15165, -29281, 449, -15071, -14682, -4492, 15570, -4559, 8536, -30517, -15114, 19151, 11991, 21898, -8663, 35697, 13015, -8384, 7091, 14828, 8521, 8491, 1995, 37989, 9169};

    result.visit([&](auto output) { results_vector.assign(output.begin(), output.end()); });
    for(auto i = 0; i < 5; i++)
    {
        std::cout << results_vector[i] << std::endl;
    }
    EXPECT(migraphx::verify_range(c, results_vector));
}
```

This test passes on develop branch. The first 5 elements are the same as the GPU results printed above.

jerryyin (Member, Author) commented
@kahmed10 The important part of the puzzle is why the reference test gives different results in different scenarios. I admit it can yield the correct result in the MLIR dedicated test too, but this one is very particular to the quantization test. Please let me know if you can reproduce it in the branch I posted.

pfultz2 (Collaborator) commented Jun 21, 2023

> This test passes on develop branch. The first 5 elements are the same as the GPU results printed above.

This should probably use EXPECT(c == results_vector) and not verify_range.
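For illustration, using the one mismatching element reported above: a tolerance-based check in the style of verify_range can pass where exact integer equality fails (the 1% tolerance here is illustrative, not MIGraphX's actual default):

```python
# Sketch: a tolerant comparison can mask small integer discrepancies
# that an exact EXPECT(c == results_vector) check would catch.
ref = [37205, 10769]  # element [0][0] is off by 42 (~0.11% relative error)
gpu = [37247, 10769]

rel_err = max(abs(r - g) / abs(r) for r, g in zip(ref, gpu))
tolerant_pass = rel_err < 0.01  # loose relative tolerance: passes
exact_pass = ref == gpu         # exact equality: fails

print(tolerant_pass, exact_pass)  # True False
```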

kahmed10 (Collaborator) commented Jun 21, 2023

> @kahmed10 The important part of the puzzle is why the reference test gives different results in different scenarios. I admit it can yield the correct result in the MLIR dedicated test too, but this one is very particular to the quantization test. Please let me know if you can reproduce it in the branch I posted.

Yes, I can reproduce the difference in results (the failure) from your branch. I will need to investigate further to find what is causing it.

kahmed10 (Collaborator) commented
> This test passes on develop branch. The first 5 elements are the same as the GPU results printed above.
>
> This should probably use EXPECT(c == results_vector) and not verify_range.

It still passes when checking c == results_vector instead, meaning the ref GEMM is still giving the exact correct result.

jerryyin assigned kahmed10 and unassigned causten Jun 22, 2023
kahmed10 (Collaborator) commented
After further inspection, the difference between the CPU and GPU results is caused by unintended rounding on the GPU codegen side. We have a fix in place and will open a PR soon.
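The thread doesn't show the offending codegen itself, but as a hypothetical illustration of this class of bug (not the actual MIGraphX fix): converting values through a float path with the wrong rounding mode can shift individual elements by one, and the shift then propagates through the integer accumulation.

```python
# Hypothetical sketch: truncation vs round-to-nearest when converting a
# scaled float to an integer shifts an element by 1, which changes the
# downstream int32 dot product.
vals = [2.6, -1.4, 0.5]
weights = [10, 10, 10]

trunc = [int(v) for v in vals]      # truncation toward zero  -> [2, -1, 0]
nearest = [round(v) for v in vals]  # round-half-to-even       -> [3, -1, 0]

dot_trunc = sum(t * w for t, w in zip(trunc, weights))      # 10
dot_nearest = sum(n * w for n, w in zip(nearest, weights))  # 20
```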

kahmed10 linked a pull request Jun 30, 2023 that will close this issue
causten pushed a commit that referenced this issue Jul 5, 2023
Fixes the failing test case in #1815. Added a test that would otherwise fail without the change.