
MIGRAPHX_TRACE_EVAL=1 with DLM test_parity_gelu changes behavior. #1877

Closed · TedThemistokleous opened this issue Jun 23, 2023 · 16 comments

Labels: bugfix (Fixes a bug found in the code.) · onnxruntime (PR changes interaction between MIGraphX and Onnxruntime)

@TedThemistokleous (Collaborator) commented Jun 23, 2023

Ref: SWDEV-401749

The DLM test failures were related to issues with Python package updates for the newer UIF container.

Currently the last failing test is test_parity_gelu, found in onnxruntime and invoked via the DLM tag migx_onnxrt_gelu_benchmarks.

The test seems to pass or fail depending on whether MIGRAPHX_TRACE_EVAL=1 is set on Navi21.

I've debugged this in DLM, manually built everything from the target branch release/rocm-rel-5.6.0, and am able to reproduce the failure.

TedThemistokleous self-assigned this on Jun 23, 2023
@TedThemistokleous (Collaborator, Author)

Performing a git bisect to determine the point where this stops passing.

Originally I thought this was an error introduced by #1835, but that change is not in the target release/rocm-rel-5.6.0 branch.

@TedThemistokleous (Collaborator, Author) commented Jul 9, 2023

Spent the last few days rebuilding and rolling back containers and getting consistently nonsensical results when running things. Got into the weeds with compile options, thinking maybe I needed to tweak the config for hipRTC/MLIR. That wasn't the case.

Took a step back after watching a football game (yay Ticats). Although the code between the 5.5 and 5.6 ROCm releases is identical, there must be some underlying part of the code that was broken then too, and we're likely running into an odd race condition here.

Realized that we have the following in the trace path:

       ret = generic_eval(*this,
                           ctx,
                           std::move(params),
                           with_check_context([&](auto& ins, auto f, auto&& check_context) {
                               ctx.finish();
                               std::cout << "Run instruction: " << ins_out.at(ins) << std::endl;
                               timer t{};
                               auto result = check_context(f);
                               double t1   = t.record<milliseconds>();
                               ctx.finish();
                               double t2 = t.record<milliseconds>();
                               std::cout << "Time: " << t1 << "ms, " << t2 << "ms" << std::endl;
                               if(trace_level > 1 and ins->name().front() != '@' and
                                  ins->name() != "load" and not result.empty())
                               {
                                   target tgt  = make_target(this->impl->target_name);
                                   auto buffer = tgt.copy_from(result);
                                   if(trace_level == 2)
                                   {
                                       std::cout << "Output has "
                                                 << to_string_range(classify_argument(buffer))
                                                 << std::endl;
                                       std::cout << "Output: ";
                                       preview_argument(std::cout, buffer);
                                       std::cout << std::endl;
                                   }
                                   else
                                   {
                                       std::cout << "Output: " << buffer << std::endl;
                                   }
                               }
                               return result;
                           }));

but on the non-tracing side, generic_eval() merely runs

    return check_context(f);

Originally this was done because it was assumed that, since we're tracing and pulling data back from the card, we'll need to read the results, thus requiring the synchronization.

I've instead added the following and had success with the gelu test:

    ctx.finish();
    auto result = check_context(f);
    ctx.finish();
    return result;

I think I'll need to remove this when stream sync is enabled, so we're not always performing a stream wait underneath async calls and thereby blocking. But I think we still require synchronization here in the non-async cases; from what I've seen, it is potentially not implicit when calling through onnxruntime.
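
To illustrate the gating I have in mind, here's a minimal self-contained sketch. It is not MIGraphX code; the context type and the enable_stream_sync flag are made up for illustration only:

    #include <functional>
    #include <iostream>

    // Hypothetical stand-in for the execution context; enable_stream_sync is an
    // assumed flag, not an existing MIGraphX option.
    struct context
    {
        bool enable_stream_sync = false;
        void finish() const { std::cout << "finish (blocking wait)\n"; }
    };

    int run_instruction(context& ctx, const std::function<int()>& f)
    {
        const bool do_sync = not ctx.enable_stream_sync;
        if(do_sync)
            ctx.finish();  // wait for previously queued work before running
        int result = f();  // run the instruction
        if(do_sync)
            ctx.finish();  // wait again so the result is safe to read
        return result;
    }

    int main()
    {
        context ctx{};
        std::cout << run_instruction(ctx, [] { return 42; }) << "\n";
        return 0;
    }

The idea is simply that the blocking finish() calls only happen when nothing else is managing stream synchronization for us.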

I'm retesting this in the UIF container before I update the ticket in the morning, but I've gotten this to work with the default build options and am seeing if I can repeat the success with the change before I refine something tomorrow morning for review.

@TedThemistokleous (Collaborator, Author) commented Jul 9, 2023

Seeing the same behavior mirrored on the remote system using the UIF container build as well.

It appears we're unable to properly use async streams right now on Navi21 without this throwing an error.

Adding an async check around this yields the same behavior, since the ORT driver is running async.

@TedThemistokleous (Collaborator, Author)

Ripped out the syncs in the trace path just to see what would happen. The test now consistently fails with trace=1 and trace=0, but passes with trace=2.

Interestingly enough, I see a hipDeviceSync instead of a hipStreamSync, which precedes a memcpy before the first launched kernel finishes.

Image

@TedThemistokleous (Collaborator, Author)

I'm assuming this insertion of the hipDeviceSync is because of the copy_from call in the trace section of the code. Regardless, it looks like the other two kernels happen after a hipMemcpy occurs.

Taking a look at the MI250, I see the following output:

Image

This also shows the kernel executing before the hipStreamWait, and then the first hipMemcpy occurs.

@TedThemistokleous (Collaborator, Author)

Ping @pfultz2: shouldn't we be waiting on the first event before this memcpy occurs here?

@TedThemistokleous (Collaborator, Author)

Looking at more HIP command output between the MI250 and Navi using ltrace. I'm seeing variations in which library makes which HIP calls: on Navi both libc10 and torch make calls, whereas on the MI250 only libtorch is used. It seems the async calls occur only on the MI250 and are made by libtorch_hip.so, but again this appears to be a red herring right now. No idea why we have the additional calls, but the run seems to finish on time compared to Navi.

Got curious about the test itself and decided to increase the hidden layer size in the test on the Navi21 system (2x, 5x, 10x, 20x, 100x). With a 100x larger hidden layer this parity test passes on Navi21, compared to the stock case that fails. Grabbing a capture of this to compare against the failing case along with the ltrace calls.

This is looking like a data race now, since I can change the outcome based on how much data is used.

@TedThemistokleous (Collaborator, Author) commented Jul 15, 2023

@TedThemistokleous (Collaborator, Author)

image

It appears the failing case is missing a hipMalloc before the hipMemcpy that precedes the initial kernels, which are followed by a set of hipMemcpyWithStream and hipStreamCreateWithFlags calls.

Not sure if this is related to data output or not.

Will post the ltraces once captured. The above is the first discrepancy between the two cases.

@TedThemistokleous (Collaborator, Author)

It appears I'm correct in thinking this is a data race.

image

It appears that in the failing case with the stock hidden size (768) we attempt to perform a copyDeviceToHost() before the kernel has finished producing the result. This isn't the case if we increase the workload, as the hipMemcpy calls then appear to stall out the run, so kernel completion and the copy happen later.

What's interesting is why we're not seeing the hipMalloc on the stock-size run. When I inflate the hidden layer in this gelu test, there's a 4 kB malloc at the beginning of the run.

Need to look more into the ltrace output, although under ltrace the failing run now "passes", most likely due to the added delay from tracking the HIP calls.
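
To make the ordering concrete, here is a minimal standalone HIP sketch of the suspected race; the kernel and buffer size are illustrative only, not the actual gelu test or the ORT code:

    #include <hip/hip_runtime.h>
    #include <vector>

    // Toy element-wise kernel standing in for the gelu computation.
    __global__ void gelu_like(float* x, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if(i < n)
            x[i] = 0.5f * x[i] * (1.0f + tanhf(0.7978845608f * (x[i] + 0.044715f * x[i] * x[i] * x[i])));
    }

    int main()
    {
        const int n = 768; // stock hidden size from the test
        std::vector<float> host(n, 1.0f);
        float* dev = nullptr;
        hipMalloc(&dev, n * sizeof(float));
        hipStream_t stream;
        hipStreamCreateWithFlags(&stream, hipStreamNonBlocking);

        hipMemcpyAsync(dev, host.data(), n * sizeof(float), hipMemcpyHostToDevice, stream);
        hipLaunchKernelGGL(gelu_like, dim3((n + 255) / 256), dim3(256), 0, stream, dev, n);

        // Racy: a blocking hipMemcpy runs on the null stream, which is not ordered
        // against a non-blocking stream, so it can read dev before the kernel is done.
        // hipMemcpy(host.data(), dev, n * sizeof(float), hipMemcpyDeviceToHost);

        // Ordered: queue the copy on the same stream and wait before reading on the host.
        hipMemcpyAsync(host.data(), dev, n * sizeof(float), hipMemcpyDeviceToHost, stream);
        hipStreamSynchronize(stream);

        hipStreamDestroy(stream);
        hipFree(dev);
        return 0;
    }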

@TedThemistokleous (Collaborator, Author) commented Jul 17, 2023

Varied the hidden layer size on the MI250, taking it all the way down to a single hidden layer. I don't see any large change in how the data is handled based on the size.

image

Seems like stream waits and syncs work as intended here. I can't get MI250 to fail like the Navi21 system by reducing data size.

Zipping this since the full run log is 25 MB:

debug_mi250_1_hidden_layer_pass.log.tar.gz

@TedThemistokleous (Collaborator, Author)

This is looking like it's not an MIGraphX issue; we're doing everything as intended, it seems.

@TedThemistokleous (Collaborator, Author) commented Jul 21, 2023

Found the fix upstream in ORT, related to how we perform CopyTensorAsync() in onnxruntime.

Changed the hipMemcpy() to a hipMemcpyAsync() while syncing to the correct event and stream.

Image

has become:

Image

Verified for both SWDEV-408437 and SWDEV-408438. Moving this to the review state, but this is entirely outside the scope of MIGraphX now.
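
The shape of that change is roughly the following. This is a sketch of the pattern only, not the literal onnxruntime diff; the function and parameter names here are made up:

    #include <hip/hip_runtime.h>
    #include <cstddef>

    // Illustrative helper: replace a blocking hipMemcpy on the null stream with an
    // async copy that is explicitly ordered against the compute stream via an event.
    void copy_after_compute(void* dst, const void* src, std::size_t bytes,
                            hipStream_t compute_stream, hipStream_t copy_stream)
    {
        hipEvent_t done;
        hipEventCreateWithFlags(&done, hipEventDisableTiming);
        hipEventRecord(done, compute_stream);      // mark the end of the queued compute work
        hipStreamWaitEvent(copy_stream, done, 0);  // copy stream waits for that point
        hipMemcpyAsync(dst, src, bytes, hipMemcpyDeviceToHost, copy_stream);
        hipStreamSynchronize(copy_stream);         // host waits before reading dst
        hipEventDestroy(done);
    }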

TedThemistokleous added the bugfix and onnxruntime labels on Jul 21, 2023
@TedThemistokleous (Collaborator, Author)

The PR is now in ORT and will be part of this week's update.
