Accuracy Errors for Resnext50 #1698
Just a quick update here with a log dump, as I've seen weird behavior between the Navi system and the remote MI250. This looks related to the interaction between topk and gathernd: I believe the added transpose is causing an odd error which flips our lens dimensions, causing us to index outside the proper vector output range and throw an error.
I had to use a bunch of the MIGRAPHX_GPU_DEBUG flags to get this, and leveraged rocgdb to get a proper look at the internals of the kernel to see what the heck was going on. A full backtrace seemed to dump what I needed to see what is happening here. I tried this with random, fill1, and fill0 inputs, and it appears the fill mode doesn't affect the output. The key was to set MIGRAPHX_GPU_OPTIMIZE=0 so the args don't all get optimized out as we compile said kernels.
I've added a simple verify test called test_gathernd_1d.cpp for now to debug this without having to run the entire network, to see if I can quickly replicate this and find what triggers this error/state. I think I'll have to add a new matcher sooner rather than later for topk->transpose->gathernd.
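As a minimal sketch of the suspected failure mode (numpy illustration, not MIGraphX code; the shapes and indices here are made up for the example), a transpose inserted between the topk indices and the gathernd can flip the lens dimensions so the same numbers decode into out-of-range indices:

```python
import numpy as np

def gather_nd(params, indices):
    # Reference gather_nd: the last axis of `indices` addresses params.
    return params[tuple(indices[..., i] for i in range(indices.shape[-1]))]

data = np.arange(6.0).reshape(2, 3)
idx = np.array([[0, 2], [1, 0]])   # picks data[0, 2] and data[1, 0]
out = gather_nd(data, idx)         # [2.0, 3.0]

# A transpose flips the index tensor; the same numbers now read as the
# pairs (0, 1) and (2, 0), and the leading 2 exceeds dim 0 (size 2) --
# the same kind of out-of-bounds indexing the GPU assert flags.
flipped = idx.T
try:
    gather_nd(data, flipped)
except IndexError:
    print("out-of-range index after transpose")
```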
|
Having issues trying to reproduce this output in a test of the block that seems to throw the assert before the gathernd, primarily because reshape says the shape is invalid when I try to mirror what I'm seeing in that section of code. Interestingly, that's not 1:1 with what netron sees; it's as if the second block isn't read in correctly at all for the reshape. When this compiles, it looks like it's doing the following for this branch (taken from my gdb dump and cherrypicked):
It appears we aren't parsing that branch in correctly then, unless we're doing some other functionality with reshapes? |
So, a few updates on this. It isn't an issue with reshapes after talking with @pfultz2; that part is in fact correct. Data-driven ops seem to be doing something incorrect here: nonzero, gathernd, and topk are using larger vectors than needed. The GPU assert is related to this, with nonzero->transpose->gathernd not giving the correct index and going out of bounds.
I tried an accuracy test, since the output size of 300 is valid and baked into the network when viewing this on netron. From the looks of it, onnxruntime is cutting down the vector size of the data, resulting in only 5 outputs. After relaxing the length condition and checking the first 5 values of the output, I'm seeing accuracy failures between the CPU EP and MIGraphX.
From the previous initial run, topk was also a large amount of the overhead, which I've found is related to the issues with the other data-driven ops. The value K being used is the largest vector size, so topk ends up just performing a sort on the entire vector of data, similar to the transpose->gatherND issue we're seeing due to nonzero not cutting down the size. I think the next step here is to combine these ops in a pass/matcher to handle the correct sizes at run time. |
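The padding problem described above can be sketched in numpy (illustrative numbers only; the real model uses 300 slots, not 8): with static shapes, nonzero must allocate for the worst case, the padded index slots all point back at element 0, and topk with K set to the full length degenerates into a full sort over padding:

```python
import numpy as np

# Illustrative scores; 5 of the 8 entries are nonzero.
scores = np.array([0.9, 0.0, 0.7, 0.0, 0.8, 0.6, 0.0, 0.5])
valid = np.nonzero(scores)[0]                 # true count is 5
padded = np.zeros(scores.size, dtype=np.int64)
padded[:valid.size] = valid                   # [0, 2, 4, 5, 7, 0, 0, 0]

# topk with K = the static max is just a full sort, and the padding
# indices re-gather scores[0], duplicating 0.9 into the result.
gathered = scores[padded]
topk_vals = np.sort(gathered)[::-1]           # [0.9, 0.9, 0.9, 0.9, 0.8, ...]

true_top5 = np.sort(scores[valid])[::-1][:5]  # [0.9, 0.8, 0.7, 0.6, 0.5]
# Even after relaxing the length condition, the first 5 values disagree,
# mirroring the CPU EP vs MIGraphX accuracy failures.
assert not np.allclose(topk_vals[:5], true_top5)
```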
After talking this over with @CharlieL7, it appears the only way out of this is to use dynamic shapes. This is due to the following block with respect to handling topk->gather. In SSD ResNet, for example, we have a similar configuration of blocks in which topk outputs to two separate gathers. In the resnext50 case, one of these gathers is fed by a concat of all subsequent topk outputs. This is an issue because nonzero sets the value of K for all topk ops, so with static shapes K becomes the largest possible shape, resulting in padding at the end of each topk output. The resulting buffer before we do a gather is then
thus after concat, we attempt to gather on:
One "solution" discussed was to move the concat back and combine the topk branches for inference, but this then defeats the purpose of retinanet parallelizing these. If we want to move forward with this and dynamic shapes
or
|
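The padded topk->concat->gather situation described in that comment can be sketched in numpy (toy sizes; the real K and valid counts come from the network): each branch pads its output out to the static worst-case K, so after the concat the real values are no longer contiguous and a positional gather picks up padding:

```python
import numpy as np

k_static = 4  # static worst-case K forced on every topk branch
branch_a = np.array([0.9, 0.8, 0.0, 0.0])  # 2 real values + padding
branch_b = np.array([0.7, 0.0, 0.0, 0.0])  # 1 real value + padding
buf = np.concatenate([branch_a, branch_b])  # 8 slots, only 3 valid

# Gathering the first entries by position now mixes padding into the
# result: the third slot is a padded 0.0, not branch_b's 0.7.
first_three = buf[:3]
print(first_three.tolist())  # [0.9, 0.8, 0.0]
```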
Set to backlog until I can get dynamic shape support. |
Related to #1886 |
Since we got resnext50 running via #1283, there are still a few issues to sort out:
Model mirror link: https://zenodo.org/record/6617879/files/resnext50_32x4d_fpn.onnx
Current speculation is that there's some odd interaction between topk and Gather/GatherND.
The solution to test this appears to be adding a reshape between these operators, via a matcher, to pare down the data vector shape, as the onnxruntime CPU EP seems to indicate an expected output of 5 items vs. the 300 we still see.
Further investigation is needed for the fp16 issue as well as accuracy testing.
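The proposed workaround, paring the padded vector down to the length the CPU EP reports before the gather runs, amounts to a slice/reshape like the following numpy sketch (the buffer contents and the length of 5 are illustrative, not taken from the model):

```python
import numpy as np

# A 300-slot padded buffer is modeled here with a short array; only the
# leading entries are real, the tail is static-shape padding.
padded_out = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.0, 0.0, 0.0])
expected_len = 5  # count the ONNX Runtime CPU EP reports

# A matcher-inserted reshape/slice would trim the vector so downstream
# Gather/GatherND only ever sees valid entries.
trimmed = padded_out[:expected_len]
print(trimmed.tolist())  # [0.9, 0.8, 0.7, 0.6, 0.5]
```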