Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test SingleSequenceRecord is not thread safe and fails in AMD card #196

Closed
lmreia opened this issue Mar 28, 2021 · 13 comments · Fixed by #199
Closed

Test SingleSequenceRecord is not thread safe and fails in AMD card #196

lmreia opened this issue Mar 28, 2021 · 13 comments · Fixed by #199

Comments

@lmreia
Copy link

lmreia commented Mar 28, 2021

Hello. I am running the tests and got the following failure in the TEST(TestMultipleAlgoExecutions, SingleSequenceRecord):

error: Expected equality of these values:
  tensorA->vector()
    Which is: { 1, 1, 1 }
  std::vector<float>({ 3, 3, 3 })
    Which is: { 3, 3, 3 }

This is the sequence of commands of the test:

        mgr.sequence()
          ->record<kp::OpTensorSyncDevice>({ tensorA })
          ->record<kp::OpAlgoDispatch>(mgr.algorithm({ tensorA }, spirv))
          ->record<kp::OpAlgoDispatch>(mgr.algorithm({ tensorA }, spirv))
          ->record<kp::OpAlgoDispatch>(mgr.algorithm({ tensorA }, spirv))
          ->record<kp::OpTensorSyncLocal>({ tensorA })
          ->eval();

If I add kp::OpTensorSyncDevice between the dispatches, it also fails. However, if I add eval() or kp::OpTensorSyncLocal between the dispatches, it passes.

@axsaucedo
Copy link
Member

@lmreia thank you for reporting this, which version of Kompute are you using, is this from master? Also what is your GPU? Can you share the output of vulkaninfo?

@lmreia
Copy link
Author

lmreia commented Mar 28, 2021

I just downloaded Kompute from the master, commit b0df305. I am on Windows 10, with MSVC compiler.

Here is my gpu info:

gpu-info

And here is the output from vulkaninfo:

vulkaninfo-output.txt

@axsaucedo
Copy link
Member

Thank you for sharing @lmreia, I can confirm now that the issue is not with your graphics card, and your card+driver does seem to support all required components. Looking at the test, it seems like the actual shader does not fully ensure thread-safe operations, which is why we can see that with the mutliple eval() it would not show these issues. This is mainly as the eval actually has fences that control these. The two ways to solve this would actually be by modifying the shader to use the atomicAdd(...) function, or alternatively by adding an OpMemoryBarrier in between of each of the dispatch operations.

I will update this test to ensure it's thread-safe by adding an OpMemoryBarrier. Are there any other tests that are failing for you or is this the only one?

@axsaucedo axsaucedo changed the title Test SingleSequenceRecord fails Test SingleSequenceRecord is not thread safe and fails in AMD card Mar 28, 2021
@lmreia
Copy link
Author

lmreia commented Mar 28, 2021

The tests from TestAsyncOperations (TestManagerParallelExecution and TestManagerAsyncExecution) are giving me an alert from AMD Bug Report Tool saying that a driver timeout has ocurred on my system, and the tests never finish running.

I think this is caused by a Windows configuration parameter that limits the response time of the driver to a maximum of ~2 seconds, so Windows kills the process if it takes too long. However I have not searched about this yet so I am not really sure.

All the other tests have passed.

@axsaucedo
Copy link
Member

axsaucedo commented Mar 28, 2021

@lmreia I can't really replicate locally with my NVIDIA card, would you be able to test the following fix to the test? Basically replace the section of the section where the manager runs the section.

    {
        std::shared_ptr<kp::OpMemoryBarrier> shaderBarrier{
                  new kp::OpMemoryBarrier({ tensorA },
                  vk::AccessFlagBits::eTransferRead,
                  vk::AccessFlagBits::eShaderWrite,
                  vk::PipelineStageFlagBits::eComputeShader,
                  vk::PipelineStageFlagBits::eComputeShader)
        };

        mgr.sequence()
          ->record<kp::OpTensorSyncDevice>({ tensorA })
          ->record<kp::OpAlgoDispatch>(mgr.algorithm({ tensorA }, spirv))
          ->record(shaderBarrier)
          ->record<kp::OpAlgoDispatch>(mgr.algorithm({ tensorA }, spirv))
          ->record(shaderBarrier)
          ->record<kp::OpAlgoDispatch>(mgr.algorithm({ tensorA }, spirv))
          ->record<kp::OpTensorSyncLocal>({ tensorA })
          ->eval();
    }

@axsaucedo
Copy link
Member

The tests from TestAsyncOperations (TestManagerParallelExecution and TestManagerAsyncExecution) are giving me an alert from AMD Bug Report Tool saying that a driver timeout has ocurred on my system, and the tests never finish running.

Ok thank you for the update, the TestManagerParallelExecution is currently not run as part of the CI, only run locally as it makes the assumption on the indexes for the familyQueues - it could be configured to work with your AMD card but would require knowing what are the concurrent-enabled familyQueues of your card (you can read more about it here).

In regards to the TestManagerAsyncExecution I don't see why it would fail as there is nothing specific in that test - if you run only that test does it fail? You can run tests with a filter - the makefile shows an example:

mk_run_tests: mk_build_tests
	./build/test/test_kompute --gtest_filter=$(FILTER_TESTS)

@lmreia
Copy link
Author

lmreia commented Mar 28, 2021

The modification you proposed to the test SingleSequenceRecord worked. Now it passes.

About the test TestManagerAsyncExecution, I still get the same alert even if I don't run any other test. Debugging step by step shows me that this line is giving me the driver timeout:

sq1->evalAsync<kp::OpAlgoDispatch>(algo1);

@axsaucedo
Copy link
Member

The modification you proposed to the test SingleSequenceRecord worked. Now it passes.

Great - would you be able to open a PR with the fix? Otherwise I can just open a PR - let me know.

About the test TestManagerAsyncExecution, I still get the same alert even if I don't run any other test.

Ok that's strange, mainly because when you can eval it always calls evalAsync and evalAwait under the hood - it just doesn't add a fence to make the CPU wait until the GPU operation is complete:

https://github.com/EthicalML/vulkan-kompute/blob/b0df305daf7439e495ffb9bf1268de8c9e1e39f9/src/Sequence.cpp#L96

I am not really sure how the driver timeout is in the Async step and not in the await step... it would make sense if the timeout happened in evalAwait - mainly as you can provide the amount of time to wait for timeout. Having said that, by default the number is very high, so I also can't really understand how that can be an issue:

https://github.com/EthicalML/vulkan-kompute/blob/b0df305daf7439e495ffb9bf1268de8c9e1e39f9/src/include/kompute/Sequence.hpp#L202

Do you know which line/command inside that function is where the driver timeout gets called?

@lmreia
Copy link
Author

lmreia commented Mar 28, 2021

I would be glad if you could open the PR. I have never opened one before.

The timeout happens before the execution enters the evalAwait function. The furthest/deepest I could go before the timeout is this line:

https://github.com/EthicalML/vulkan-kompute/blob/b0df305daf7439e495ffb9bf1268de8c9e1e39f9/src/Sequence.cpp#L133

The alert appears right after I run this line, before the return of the function evalAsync, after calling sq1->evalAsync<kp::OpAlgoDispatch>(algo1);.

I may try something like this later:
Disabling Timeout Recovery for Display Drivers
GPU drivers crash with long computations (TDR crash)

@axsaucedo
Copy link
Member

I would be glad if you could open the PR. I have never opened one before.

Ok no worries will do

The alert appears right after I run this line, before the return of the function evalAsync...

Hmm ok I see, that's strange - I wonder whether AMD driver automatically waits on the fence upon submit. But that would be a bug in the driver.

I may try something like this later:...

Ok let me know if it does fix it as AMD cards do seem to have some different nuances that don't appear on NVIDIA / Swiftshader drivers, so would be useful.

@lmreia
Copy link
Author

lmreia commented Mar 28, 2021

I followed the tutorial on GPU drivers crash with long computations (TDR crash) and set both TdrDdiDelay and TdrDelay to 10 seconds. The test TestManagerAsyncExecution passes now. Maybe you could add this as a note in the docs.

When I run the tests in debug mode, sometimes this line is printed on the terminal, but I guess it must have something to do with spdlog.

[*** LOG ERROR #0001 ***] [2021-03-28 05:23:54] [] {string pointer is null}

@axsaucedo
Copy link
Member

@lmreia great, thank you for trying this.

I followed the tutorial on GPU drivers crash with long computations (TDR crash) and set both TdrDdiDelay and TdrDelay to 10 seconds. The test TestManagerAsyncExecution passes now. Maybe you could add this as a note in the docs.

Wow this is quite something - I think it may be best to add it as a comment, I will add a note at the top of this test. Thank you for testing it.

When I run the tests in debug mode, sometimes this line is printed on the terminal, but I guess it must have something to do with spdlog.

This is indeed from SPDLOG, the error is when a blank log is passed, probably something that should be addressed but doesn't seem like an issue related to this.

@axsaucedo
Copy link
Member

Ok I've now added further documentation via 9111158

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants