[BUG] Windows failing CI tests #2797
I'm looking into "3 - oiiotool (Timeout)"
A few tips that may help in working our way through these tests:

First, when CI fails, our scripts are set up to create an "artifact", downloadable from the GitHub page that gives the status summary of the CI run. The artifact is a zip file containing the output of any individual tests that failed. Downloading this zip file and unpacking it locally is a good way to inspect exactly what the failed test output looked like (especially if you are investigating a test that seems to fail only during CI but that you can't reproduce locally). The actual test output will be in (say)

Second, many of these "failures" (not the timeouts, but perhaps many of the others) are actually running correctly, but simply failed to exactly match the reference output. This can happen for a number of benign reasons (for example, decoding some lossy file formats may produce occasional single-bit differences depending on the platform and the version of the decoding software). In this case, the "fix" is merely updating the reference output so that there is a match. But of course the trick is that you don't want to simply replace the existing reference output with the new Windows one, because then it may not match on Linux and those tests will start failing. So the solution to the fact that reference output may differ across platforms is that, the way our test scripts are written, the test passes if the output matches any of the files found in the

When it's images that are the output, we're using OIIO's
A few more small improvements: #2798
Some remarks about the timeouts: I pinned down where the deadlock is, but at this moment I don't really know why. I've seen it happen in various OIIO operations (idiff, iconvert, oiiotool, ...) but always when interacting with OpenEXR files. The deadlock happens during the destructor of the IlmThread::ThreadPool. The DefaultThreadPoolProvider is syncing some semaphores and somehow this goes haywire. Commenting out the content of

So either this is a bug in the OpenEXR Windows threading logic or in the way we interact with it. To be continued in 2021 ;)
So it never happens with CTEST_PARALLEL_LEVEL=1 (when the number of threads would be chosen to be equal to the number of cores), but only when multiple tests are scheduled at the same time. It seems that there are only 2 cores per GHA instance, and we are using CTEST_PARALLEL_LEVEL=4 by default, and each of those will see that it's running on a machine with 2 cores. So basically, 8 threads are probably running on a 2-core virtual machine -- that's a pretty "over-threaded" situation and may be revealing subtle bugs (perhaps in OpenEXR's thread pool?) that would never show up on a real user machine under ordinary conditions. I don't know.

As a temporary workaround, I would recommend that just for the Windows tests, maybe we turn off parallel testing, e.g. in ci.yml in the

Another experiment to try is to set CTEST_PARALLEL_LEVEL even higher and see if that triggers problems on Linux or Mac.

Another experiment: if the problem is specifically in OpenEXR's ThreadPool destructor (which is called only at our application shut-down?), I wonder if we can circumvent it by pre-emptively calling set_exr_threads (to... 1?) as part of our shutdown procedures?

As an ultimate solution, if we never see the problem other than on Windows, we could just avoid (on that platform only?) ever calling set_exr_threads. That would prevent some fine control over exr threading that is nice to have, but I suppose that's better than deadlocks.
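A sketch of what that workaround might look like in the workflow file (hypothetical step names and structure; the real ci.yml may differ):

```yaml
# Hypothetical fragment: serialize the testsuite only on the Windows runner.
- name: Testsuite (Windows)
  shell: bash
  run: |
    export CTEST_PARALLEL_LEVEL=1   # avoid over-threading the 2-core VM
    ctest -C Release --output-on-failure
```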
I tried changing CTEST_PARALLEL_LEVEL to 2 (instead of 4) for the Windows tests, and instead of 6 tests failing, only 1 test fails (though not a consistent one, so definitely still a problem). I'm going to submit a small PR that sets CTEST_PARALLEL_LEVEL=1 for the Windows CI tests; that will get us unstuck for now to fix up the rest of the Windows CI and get it passing, and then we can return later to investigate a more permanent solution to this problem.
I'm going to merge my other Windows CI fix as well, just to reduce the number of failures a bit more.
I tried CTEST_PARALLEL_LEVEL=1 and still get occasional timeout failures (not always the same ones). So I think the multiple parallel tests exacerbate the problem but are not the cause. I also tried bumping the version of OpenEXR used for the Windows CI tests from 2.4.1 to 2.5.4; that also did not help. With my quick fix ideas not panning out, I'll leave the deadlock investigation for you to pick up when you get a chance. Don't spin your wheels for too long on this -- I think it's also ok to avoid the problem for now by simply disabling the adjustment of exr_threads (just for Windows) until we get it all sorted out.
#2805 fixes several more things here, including all the Python tests (unless they are unlucky enough to time out)
I figured out the deadlock. It all boils down to the use of C-style

The OpenEXR thread pool needs either different synchronization or an "are my threads still alive" check (just like the OIIO thread pool :) ). Either way, it is something I will bring up with the OpenEXR project (as I can trigger it using 4 lines of non-OIIO code), though it remains a bit of "do not do that" territory imho. On how to continue getting the tests stable, we can roughly go three ways about it:
a workaround like

Obviously we can also combine some of the options, or do something completely different.
Wow, fantastic detective work!

(1) is something we're working toward anyway, but note that some tests are specifically designed to make sure we do the right thing for certain errors, so this may not solve all the problems.

(2) I like this, but I need to look more at how we can do that. Let me take a look at the oiiotool code in particular and see if there's a simple way to avoid the exit() that happens when an error is encountered.

(3) I'm not thrilled about this one. Why wouldn't it happen to users if they encounter errors? Making it nice only for CI tests doesn't seem like a robust solution. And disabling exr threading for Windows is also not ideal; we get a lot of improved perf because of the exr threading.

I do think we should report this to OpenEXR; maybe there is a way to fix it on that side of the fence (though OIIO is still built against old OpenEXR by many users, and we test against it, so exr-side changes are a long-term fix). That seems to mostly leave choice (2) as the only comprehensive solution, as far as I can see.
Are you able to get a stack trace and figure out exactly which 'exit' calls are the ones that are having the trouble?
I have seen it trigger from
And on a related fun (?) note: I'm not able to trigger my repro case anymore when using the latest OpenEXR master. Trying at the moment to see what happens if I run OIIO with latest OpenEXR master.
One thing I don't understand is... why aren't all the Windows users complaining about this?
Not sure; I guess because it is rather rare to trigger? The chances that any of the OpenEXR worker threads have not run yet when calling exit are rather small? I'll see if I can do some additional tests to clarify it. I also opened an issue with OpenEXR, as the repro case is rather trivial.
I also found out why `131 - unit_imagebuf` fails. The reason that test was failing is that it failed to delete a TIFF file that was opened for reading. I patched the libtiff file IO here: debaetsd@ee50a3a
I think we should suggest this means of opening the file to libtiff, but not put all this Windows-specific code in our tiffinput.cpp. (Hint that it's an ugly approach: we may need to do it separately for a bunch of different formats.)

Our use of remove is problematic -- because we previously read the file with an ImageBuf, it's possible that it's backed by ImageCache, which is holding an open ImageInput for that file. It's unusual and ill-advised to try to remove a file that your own process is using. The right solution is to fix the test, by making sure to tell the cache to close the file before removing it. Let me test a quick fix and if it works I'll submit a PR.
Oh yeah, it is definitely something that should be fixed in libtiff; I was more experimenting with it in that commit. Forcing the files to be closed in that test is rather trivial (just move the ImageBuf objects into local C++ scopes and rely on the d'tor to clean up). That was actually my first fix, but then I figured that the point of that test could have been to see if it was detected that the file was changed (on disk). Though perhaps remove is a different use case than simply updated.
ImageBuf is not going to correctly (whatever that means) handle a file that changed during the IB's continuing use of it. The best we can hope for is that if the IB goes away, and then a second IB references the (now changed) file, the second one should get the new pixels, not the old pixels that were in the ImageCache from the first time around. I believe that's what that section of the test is trying to verify.
I've made some progress on the timeout situation and a few other things. Will try to get to a PR tomorrow.
#2821 makes a bunch more tests work on Windows. I think I've eliminated all the exit() calls in oiiotool, idiff, and iconvert, but I still get some of those hangs, so I'm not sure what to do next. I did use the ctest command line arguments to retry tests that have a timeout, so that makes the CI often pass those tests, but that of course does not solve the real problem.
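For reference, a retry setup along those lines might look something like this in a CI step (a hypothetical fragment; the exact step and values in the real workflow may differ, and `ctest --repeat` requires CMake 3.17 or newer):

```yaml
# Hypothetical CI step: per-test timeout plus automatic retries.
# This masks intermittent deadlocks rather than fixing them.
- name: Testsuite
  shell: bash
  run: |
    ctest -C Release --timeout 120 --repeat until-pass:5 --output-on-failure
```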
Still seeing those deadlocks confuses me more than I'm willing to admit.
I've been knocking off failures (unrelated to the deadlocks) one by one as I got free time over the last few weeks: #2821 #2826 #2835 #2836 #2838

I'm pleased to report that I just submitted #2840, which makes the Windows CI pass! Caveats:
|
Even with the new OpenEXR 2.5.5 that incorporates the fix to their thread pool, we're still seeing these deadlocks inside OpenEXR during certain OIIO tests. We usually "pass" the Windows CI overall, but only because every test in our testsuite imposes a 2-minute timeout and repeats up to 5 times before giving up with a failure. I'm not quite sure what to try next. Without a Windows machine, it's hard for me to "catch it in the act" and understand where it's getting jammed up. @debaetsd Can you try this again with either OpenEXR master or the 2.5.5 release (both of which have the "fixed, we thought" thread pool) and see if a stack trace at the point where it's wedged reveals any clues about where and how it's getting stuck?
I'll try to give it a go later this week.
Finally got around to testing it some more (a newborn baby is not good for productivity :) )
Note that only the main thread is still running at this point. I also did some digging into OpenEXR itself and I think I found a solution (both local runs and GHA CI seem to confirm it)
I haven't looked into this for a couple of months, but I'll see if I can get it back up and running again.
Describe the bug
This issue is to keep track of failing tests when run inside the Windows CI.
This list is found by running the CI with #2796.
The current list of failing tests is:
A full CI log (including more detailed errors on why some tests are failing) is attached to this issue:
oiio-log.txt
This issue is mainly to keep track of which tests are passing and/or being worked on.
Please reply here if you are working on a specific test and/or have a solution.
To Reproduce
Run the specific test on Windows and see it fail.
When running the CI locally, make sure to launch it from a "Git for Windows" bash shell (to mimic the GitHub Actions shell).
Expected behavior
All tests pass.