
MIGRAPHX_TRACE_EVAL=1 with DLM test_parity_gelu changes behavior. #1877

Closed · TedThemistokleous opened this issue Jun 23, 2023 · 16 comments

Labels: bugfix (Fixes a bug found in the code.) · onnxruntime (PR changes interaction between MIGraphX and Onnxruntime)

@TedThemistokleous (Collaborator) commented Jun 23, 2023

Ref: SWDEV-401749

The DLM test failures were related to issues with Python package updates for the newer UIF container.

Currently the last failing test is test_parity_gelu, found in onnxruntime and invoked via the DLM tag migx_onnxrt_gelu_benchmarks.

The test seems to pass or fail depending on whether MIGRAPHX_TRACE_EVAL=1 is set on Navi21.

I've debugged this in DLM, manually built everything from the target branch release/rocm-rel-5.6.0, and am able to reproduce the failure.

TedThemistokleous self-assigned this on Jun 23, 2023
@TedThemistokleous (Collaborator, Author)

Performing a git bisect to determine the point where this stops passing.

Originally I thought this was an error introduced by #1835, but that change is not in the target release/rocm-rel-5.6.0 branch.

@TedThemistokleous (Collaborator, Author) commented Jul 9, 2023

Spent the last few days rebuilding and rolling back containers and getting consistently nonsensical results when running things. Got into the weeds with compile options, thinking maybe I needed to tweak the config for hipRTC/MLIR. That wasn't the case.

Took a step back after watching a football game (yay Ticats). Although the code between the 5.5 and 5.6 ROCm releases is identical, there must be some underlying part of the code that was broken then too, and we're likely running into an odd race condition here.

Realized that we have the following in the trace path:

       ret = generic_eval(*this,
                           ctx,
                           std::move(params),
                           with_check_context([&](auto& ins, auto f, auto&& check_context) {
                               ctx.finish();
                               std::cout << "Run instruction: " << ins_out.at(ins) << std::endl;
                               timer t{};
                               auto result = check_context(f);
                               double t1   = t.record<milliseconds>();
                               ctx.finish();
                               double t2 = t.record<milliseconds>();
                               std::cout << "Time: " << t1 << "ms, " << t2 << "ms" << std::endl;
                               if(trace_level > 1 and ins->name().front() != '@' and
                                  ins->name() != "load" and not result.empty())
                               {
                                   target tgt  = make_target(this->impl->target_name);
                                   auto buffer = tgt.copy_from(result);
                                   if(trace_level == 2)
                                   {
                                       std::cout << "Output has "
                                                 << to_string_range(classify_argument(buffer))
                                                 << std::endl;
                                       std::cout << "Output: ";
                                       preview_argument(std::cout, buffer);
                                       std::cout << std::endl;
                                   }
                                   else
                                   {
                                       std::cout << "Output: " << buffer << std::endl;
                                   }
                               }
                               return result;
                           }));

but on the non-tracing side, generic_eval() merely runs

    return check_context(f);

Originally this was done because it was assumed that, since we're tracing and pulling data back from the card, we'll need to read the results, thus requiring the synchronization.

I've instead added the following and had success with the gelu test:

    ctx.finish();
    auto result = check_context(f);
    ctx.finish();
    return result;

I think I'll need to remove this when stream sync is enabled, so we're not always performing a stream wait underneath async calls and thereby blocking. But I think we still require synchronization here in the non-async cases; from what I've seen, it is potentially not implicit when calling through onnxruntime.
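
To illustrate the gating I have in mind, here's a minimal self-contained sketch. It is not MIGraphX code; the context type and the enable_stream_sync flag are made up for illustration only:

    #include <functional>
    #include <iostream>

    // Hypothetical stand-in for the execution context; enable_stream_sync is an
    // assumed flag, not an existing MIGraphX option.
    struct context
    {
        bool enable_stream_sync = false;
        void finish() const { std::cout << "finish (blocking wait)\n"; }
    };

    int run_instruction(context& ctx, const std::function<int()>& f)
    {
        const bool do_sync = not ctx.enable_stream_sync;
        if(do_sync)
            ctx.finish();  // wait for previously queued work before running
        int result = f();  // run the instruction
        if(do_sync)
            ctx.finish();  // wait again so the result is safe to read
        return result;
    }

    int main()
    {
        context ctx{};
        std::cout << run_instruction(ctx, [] { return 42; }) << "\n";
        return 0;
    }

The idea is simply that the blocking finish() calls only happen when nothing else is managing stream synchronization for us.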

I'm retesting this in the UIF container before I update the ticket in the morning, but I've gotten this to work with the default build options and am seeing if I can repeat the success with the change before I refine something tomorrow morning for review.

@TedThemistokleous (Collaborator, Author) commented Jul 9, 2023

Seeing the same behavior mirrored on the remote system using the UIF container build as well.

It appears we're unable to properly use async streams right now on Navi21 without this throwing an error.

Adding an async check around this yields the same behavior, since the ORT driver is running async.

@TedThemistokleous (Collaborator, Author)

Ripped out the syncs in the trace path just to see what would happen. The test now consistently fails with trace=1 and trace=0, but passes with trace=2.

Interestingly enough, I see a hipDeviceSync instead of a hipStreamSync, which precedes a memcpy before the first launched kernel finishes.

Image

@TedThemistokleous (Collaborator, Author)

I'm assuming this insertion of the hipDeviceSync is because of the copy_from call in the trace section of the code. Regardless, it looks like the other two kernels happen after a hipMemcpy occurs.

Taking a look at the MI250, I see the following output:

Image

This also shows the kernel executing before the hipStreamWait, and then the first hipMemcpy occurs.

@TedThemistokleous (Collaborator, Author)

Ping @pfultz2: shouldn't we be waiting on the first event before this memcpy occurs here?

@TedThemistokleous (Collaborator, Author)

Looking at more HIP command output between the MI250 and Navi using ltrace. I'm seeing variations in which library makes which HIP calls: on Navi both libc10 and torch make calls, whereas on the MI250 only libtorch is used. It seems the async calls occur only on the MI250 and are made by libtorch_hip.so, but again this appears to be a red herring right now. No idea why we have the additional calls, but the run seems to finish on time compared to Navi.

Got curious about the test itself and decided to increase the hidden layer size in the test on the Navi21 system (2x, 5x, 10x, 20x, 100x). With a 100x larger hidden layer this parity test passes on Navi21, compared to the stock case that fails. Grabbing a capture of this to compare against the failing case along with the ltrace calls.

This is looking like a data race now, since I can change the outcome based on how much data is used.

@TedThemistokleous (Collaborator, Author) commented Jul 15, 2023

@TedThemistokleous (Collaborator, Author)

image

It appears the failing case is missing a hipMalloc before the hipMemcpy that precedes the initial kernels, which are followed by a set of hipMemcpyWithStream and hipStreamCreateWithFlags calls.

Not sure if this is related to data output or not.

Will post the ltraces once captured. The above is the first discrepancy between the two cases.

@TedThemistokleous (Collaborator, Author)

It appears I'm correct in thinking this is a data race.

image

It appears that in the failing case with the stock hidden size (768) we attempt to perform a copyDeviceToHost() before the kernel has finished producing the result. This isn't the case if we increase the workload, as the hipMemcpy calls then appear to stall out the run, so kernel completion and the copy happen later.

What's interesting is why we're not seeing the hipMalloc on the stock-size run. When I inflate the hidden layer in this gelu test, there's a 4 kB malloc at the beginning of the run.

Need to look more into the ltrace output, although under ltrace the failing run now "passes", most likely due to the added delay from tracking the HIP calls.
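
To make the ordering concrete, here is a minimal standalone HIP sketch of the suspected race; the kernel and buffer size are illustrative only, not the actual gelu test or the ORT code:

    #include <hip/hip_runtime.h>
    #include <vector>

    // Toy element-wise kernel standing in for the gelu computation.
    __global__ void gelu_like(float* x, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if(i < n)
            x[i] = 0.5f * x[i] * (1.0f + tanhf(0.7978845608f * (x[i] + 0.044715f * x[i] * x[i] * x[i])));
    }

    int main()
    {
        const int n = 768; // stock hidden size from the test
        std::vector<float> host(n, 1.0f);
        float* dev = nullptr;
        hipMalloc(&dev, n * sizeof(float));
        hipStream_t stream;
        hipStreamCreateWithFlags(&stream, hipStreamNonBlocking);

        hipMemcpyAsync(dev, host.data(), n * sizeof(float), hipMemcpyHostToDevice, stream);
        hipLaunchKernelGGL(gelu_like, dim3((n + 255) / 256), dim3(256), 0, stream, dev, n);

        // Racy: a blocking hipMemcpy runs on the null stream, which is not ordered
        // against a non-blocking stream, so it can read dev before the kernel is done.
        // hipMemcpy(host.data(), dev, n * sizeof(float), hipMemcpyDeviceToHost);

        // Ordered: queue the copy on the same stream and wait before reading on the host.
        hipMemcpyAsync(host.data(), dev, n * sizeof(float), hipMemcpyDeviceToHost, stream);
        hipStreamSynchronize(stream);

        hipStreamDestroy(stream);
        hipFree(dev);
        return 0;
    }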

@TedThemistokleous (Collaborator, Author) commented Jul 17, 2023

Varied the hidden layer size on the MI250, taking it all the way down to a single hidden layer. I don't see any large change in how the data is handled based on the size.

image

Seems like stream waits and syncs work as intended here. I can't get MI250 to fail like the Navi21 system by reducing data size.

Zipping this since the full run log is 25 MB:

debug_mi250_1_hidden_layer_pass.log.tar.gz

@TedThemistokleous (Collaborator, Author)

This is looking like it's not an MIGraphX issue; we're doing everything as intended, it seems.

@TedThemistokleous (Collaborator, Author) commented Jul 21, 2023

Found the fix upstream in ORT, related to how we perform CopyTensorAsync() in onnxruntime.

Changed the hipMemcpy() to a hipMemcpyAsync() while syncing to the correct event and stream.

Image

has become:

Image

Verified for both SWDEV-408437 and SWDEV-408438. Moving this to the review state, but this is entirely outside the scope of MIGraphX now.
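
The shape of that change is roughly the following. This is a sketch of the pattern only, not the literal onnxruntime diff; the function and parameter names here are made up:

    #include <hip/hip_runtime.h>
    #include <cstddef>

    // Illustrative helper: replace a blocking hipMemcpy on the null stream with an
    // async copy that is explicitly ordered against the compute stream via an event.
    void copy_after_compute(void* dst, const void* src, std::size_t bytes,
                            hipStream_t compute_stream, hipStream_t copy_stream)
    {
        hipEvent_t done;
        hipEventCreateWithFlags(&done, hipEventDisableTiming);
        hipEventRecord(done, compute_stream);      // mark the end of the queued compute work
        hipStreamWaitEvent(copy_stream, done, 0);  // copy stream waits for that point
        hipMemcpyAsync(dst, src, bytes, hipMemcpyDeviceToHost, copy_stream);
        hipStreamSynchronize(copy_stream);         // host waits before reading dst
        hipEventDestroy(done);
    }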

TedThemistokleous added the bugfix and onnxruntime labels on Jul 21, 2023
@TedThemistokleous (Collaborator, Author)

The PR is now in ORT and will be part of this week's update.
