Test SingleSequenceRecord is not thread safe and fails in AMD card #196

lmreia · 2021-03-28T05:47:26Z

Hello. I am running the tests and got the following failure in the TEST(TestMultipleAlgoExecutions, SingleSequenceRecord):

error: Expected equality of these values:
  tensorA->vector()
    Which is: { 1, 1, 1 }
  std::vector<float>({ 3, 3, 3 })
    Which is: { 3, 3, 3 }

This is the sequence of commands of the test:

        mgr.sequence()
          ->record<kp::OpTensorSyncDevice>({ tensorA })
          ->record<kp::OpAlgoDispatch>(mgr.algorithm({ tensorA }, spirv))
          ->record<kp::OpAlgoDispatch>(mgr.algorithm({ tensorA }, spirv))
          ->record<kp::OpAlgoDispatch>(mgr.algorithm({ tensorA }, spirv))
          ->record<kp::OpTensorSyncLocal>({ tensorA })
          ->eval();

If I add kp::OpTensorSyncDevice between the dispatches, it also fails. However, if I add eval() or kp::OpTensorSyncLocal between the dispatches, it passes.

The text was updated successfully, but these errors were encountered:

axsaucedo · 2021-03-28T05:56:20Z

@lmreia thank you for reporting this, which version of Kompute are you using, is this from master? Also what is your GPU? Can you share the output of vulkaninfo?

lmreia · 2021-03-28T06:27:09Z

I just downloaded Kompute from the master, commit b0df305. I am on Windows 10, with MSVC compiler.

Here is my gpu info:

And here is the output from vulkaninfo:

vulkaninfo-output.txt

axsaucedo · 2021-03-28T06:45:19Z

Thank you for sharing @lmreia, I can confirm now that the issue is not with your graphics card, and your card+driver does seem to support all required components. Looking at the test, it seems like the actual shader does not fully ensure thread-safe operations, which is why we can see that with the mutliple eval() it would not show these issues. This is mainly as the eval actually has fences that control these. The two ways to solve this would actually be by modifying the shader to use the atomicAdd(...) function, or alternatively by adding an OpMemoryBarrier in between of each of the dispatch operations.

I will update this test to ensure it's thread-safe by adding an OpMemoryBarrier. Are there any other tests that are failing for you or is this the only one?

lmreia · 2021-03-28T06:58:07Z

The tests from TestAsyncOperations (TestManagerParallelExecution and TestManagerAsyncExecution) are giving me an alert from AMD Bug Report Tool saying that a driver timeout has ocurred on my system, and the tests never finish running.

I think this is caused by a Windows configuration parameter that limits the response time of the driver to a maximum of ~2 seconds, so Windows kills the process if it takes too long. However I have not searched about this yet so I am not really sure.

All the other tests have passed.

axsaucedo · 2021-03-28T06:58:35Z

@lmreia I can't really replicate locally with my NVIDIA card, would you be able to test the following fix to the test? Basically replace the section of the section where the manager runs the section.

    {
        std::shared_ptr<kp::OpMemoryBarrier> shaderBarrier{
                  new kp::OpMemoryBarrier({ tensorA },
                  vk::AccessFlagBits::eTransferRead,
                  vk::AccessFlagBits::eShaderWrite,
                  vk::PipelineStageFlagBits::eComputeShader,
                  vk::PipelineStageFlagBits::eComputeShader)
        };

        mgr.sequence()
          ->record<kp::OpTensorSyncDevice>({ tensorA })
          ->record<kp::OpAlgoDispatch>(mgr.algorithm({ tensorA }, spirv))
          ->record(shaderBarrier)
          ->record<kp::OpAlgoDispatch>(mgr.algorithm({ tensorA }, spirv))
          ->record(shaderBarrier)
          ->record<kp::OpAlgoDispatch>(mgr.algorithm({ tensorA }, spirv))
          ->record<kp::OpTensorSyncLocal>({ tensorA })
          ->eval();
    }

axsaucedo · 2021-03-28T07:02:05Z

The tests from TestAsyncOperations (TestManagerParallelExecution and TestManagerAsyncExecution) are giving me an alert from AMD Bug Report Tool saying that a driver timeout has ocurred on my system, and the tests never finish running.

Ok thank you for the update, the TestManagerParallelExecution is currently not run as part of the CI, only run locally as it makes the assumption on the indexes for the familyQueues - it could be configured to work with your AMD card but would require knowing what are the concurrent-enabled familyQueues of your card (you can read more about it here).

In regards to the TestManagerAsyncExecution I don't see why it would fail as there is nothing specific in that test - if you run only that test does it fail? You can run tests with a filter - the makefile shows an example:

mk_run_tests: mk_build_tests
	./build/test/test_kompute --gtest_filter=$(FILTER_TESTS)

lmreia · 2021-03-28T07:09:56Z

The modification you proposed to the test SingleSequenceRecord worked. Now it passes.

About the test TestManagerAsyncExecution, I still get the same alert even if I don't run any other test. Debugging step by step shows me that this line is giving me the driver timeout:

sq1->evalAsync<kp::OpAlgoDispatch>(algo1);

axsaucedo · 2021-03-28T07:17:12Z

The modification you proposed to the test SingleSequenceRecord worked. Now it passes.

Great - would you be able to open a PR with the fix? Otherwise I can just open a PR - let me know.

About the test TestManagerAsyncExecution, I still get the same alert even if I don't run any other test.

Ok that's strange, mainly because when you can eval it always calls evalAsync and evalAwait under the hood - it just doesn't add a fence to make the CPU wait until the GPU operation is complete:

https://github.com/EthicalML/vulkan-kompute/blob/b0df305daf7439e495ffb9bf1268de8c9e1e39f9/src/Sequence.cpp#L96

I am not really sure how the driver timeout is in the Async step and not in the await step... it would make sense if the timeout happened in evalAwait - mainly as you can provide the amount of time to wait for timeout. Having said that, by default the number is very high, so I also can't really understand how that can be an issue:

https://github.com/EthicalML/vulkan-kompute/blob/b0df305daf7439e495ffb9bf1268de8c9e1e39f9/src/include/kompute/Sequence.hpp#L202

Do you know which line/command inside that function is where the driver timeout gets called?

lmreia · 2021-03-28T07:55:38Z

I would be glad if you could open the PR. I have never opened one before.

The timeout happens before the execution enters the evalAwait function. The furthest/deepest I could go before the timeout is this line:

https://github.com/EthicalML/vulkan-kompute/blob/b0df305daf7439e495ffb9bf1268de8c9e1e39f9/src/Sequence.cpp#L133

The alert appears right after I run this line, before the return of the function evalAsync, after calling sq1->evalAsync<kp::OpAlgoDispatch>(algo1);.

I may try something like this later:
Disabling Timeout Recovery for Display Drivers
GPU drivers crash with long computations (TDR crash)

axsaucedo · 2021-03-28T08:09:16Z

I would be glad if you could open the PR. I have never opened one before.

Ok no worries will do

The alert appears right after I run this line, before the return of the function evalAsync...

Hmm ok I see, that's strange - I wonder whether AMD driver automatically waits on the fence upon submit. But that would be a bug in the driver.

I may try something like this later:...

Ok let me know if it does fix it as AMD cards do seem to have some different nuances that don't appear on NVIDIA / Swiftshader drivers, so would be useful.

lmreia · 2021-03-28T08:42:24Z

I followed the tutorial on GPU drivers crash with long computations (TDR crash) and set both TdrDdiDelay and TdrDelay to 10 seconds. The test TestManagerAsyncExecution passes now. Maybe you could add this as a note in the docs.

When I run the tests in debug mode, sometimes this line is printed on the terminal, but I guess it must have something to do with spdlog.

[*** LOG ERROR #0001 ***] [2021-03-28 05:23:54] [] {string pointer is null}

axsaucedo · 2021-03-28T08:54:26Z

@lmreia great, thank you for trying this.

I followed the tutorial on GPU drivers crash with long computations (TDR crash) and set both TdrDdiDelay and TdrDelay to 10 seconds. The test TestManagerAsyncExecution passes now. Maybe you could add this as a note in the docs.

Wow this is quite something - I think it may be best to add it as a comment, I will add a note at the top of this test. Thank you for testing it.

When I run the tests in debug mode, sometimes this line is printed on the terminal, but I guess it must have something to do with spdlog.

This is indeed from SPDLOG, the error is when a blank log is passed, probably something that should be addressed but doesn't seem like an issue related to this.

axsaucedo · 2021-03-28T08:57:16Z

Ok I've now added further documentation via 9111158

axsaucedo changed the title ~~Test SingleSequenceRecord fails~~ Test SingleSequenceRecord is not thread safe and fails in AMD card Mar 28, 2021

axsaucedo mentioned this issue Mar 28, 2021

Added memory barrier on test #199

Merged

axsaucedo closed this as completed in #199 Mar 28, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Test SingleSequenceRecord is not thread safe and fails in AMD card #196

Test SingleSequenceRecord is not thread safe and fails in AMD card #196

lmreia commented Mar 28, 2021

axsaucedo commented Mar 28, 2021

lmreia commented Mar 28, 2021 •

edited

Loading

axsaucedo commented Mar 28, 2021

lmreia commented Mar 28, 2021 •

edited

Loading

axsaucedo commented Mar 28, 2021 •

edited

Loading

axsaucedo commented Mar 28, 2021

lmreia commented Mar 28, 2021

axsaucedo commented Mar 28, 2021

lmreia commented Mar 28, 2021 •

edited

Loading

axsaucedo commented Mar 28, 2021

lmreia commented Mar 28, 2021

axsaucedo commented Mar 28, 2021

axsaucedo commented Mar 28, 2021

Test SingleSequenceRecord is not thread safe and fails in AMD card #196

Test SingleSequenceRecord is not thread safe and fails in AMD card #196

Comments

lmreia commented Mar 28, 2021

axsaucedo commented Mar 28, 2021

lmreia commented Mar 28, 2021 • edited Loading

axsaucedo commented Mar 28, 2021

lmreia commented Mar 28, 2021 • edited Loading

axsaucedo commented Mar 28, 2021 • edited Loading

axsaucedo commented Mar 28, 2021

lmreia commented Mar 28, 2021

axsaucedo commented Mar 28, 2021

lmreia commented Mar 28, 2021 • edited Loading

axsaucedo commented Mar 28, 2021

lmreia commented Mar 28, 2021

axsaucedo commented Mar 28, 2021

axsaucedo commented Mar 28, 2021

lmreia commented Mar 28, 2021 •

edited

Loading

lmreia commented Mar 28, 2021 •

edited

Loading

axsaucedo commented Mar 28, 2021 •

edited

Loading

lmreia commented Mar 28, 2021 •

edited

Loading