Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dns: fix crash while setting server during query #14891

Closed
wants to merge 5 commits into from

Conversation

XadillaX
Copy link
Contributor

@XadillaX XadillaX commented Aug 17, 2017

Fix this issue follow these two points:

  1. Keep track of how many queries are currently open. If setServers()
    is called while there are open queries, error out.
  2. For Resolver instances, use option 1. For dns.setServers(), just
    create a fresh new default channel every time it is called, and then
    set its servers list.

Fixes: #14734

Checklist
  • make -j4 test passes
  • tests and/or benchmarks are included
  • commit message follows commit guidelines
Affected core subsystem(s)

dns, cares

@nodejs-github-bot nodejs-github-bot added c++ Issues and PRs that require attention from people who are familiar with C++. cares Issues and PRs related to the c-ares dependency or the cares_wrap binding. labels Aug 17, 2017
@XadillaX
Copy link
Contributor Author

XadillaX commented Aug 17, 2017

Occasional crash in this PR:

/Users/xadillax/Workspace/cpp/master-node/out/Release/node[20134]: ../src/cares_wrap.cc:351:void node::cares_wrap::(anonymous namespace)::ares_sockstate_cb(void *, ares_socket_t, int, int): Assertion `task && "When an ares socket is closed we should have a handle for it"' failed.
 1: node::Abort() [/Users/xadillax/Workspace/cpp/master-node/./node]
 2: node::MakeCallback(v8::Isolate*, v8::Local<v8::Object>, char const*, int, v8::Local<v8::Value>*, node::async_context) [/Users/xadillax/Workspace/cpp/master-node/./node]
 3: node::cares_wrap::(anonymous namespace)::ares_sockstate_cb(void*, int, int, int) [/Users/xadillax/Workspace/cpp/master-node/./node]
 4: ares__close_sockets [/Users/xadillax/Workspace/cpp/master-node/./node]
 5: ares__destroy_servers_state [/Users/xadillax/Workspace/cpp/master-node/./node]
 6: ares_destroy [/Users/xadillax/Workspace/cpp/master-node/./node]
 7: node::cares_wrap::(anonymous namespace)::ExecuteDestroyChannel(uv_work_s*) [/Users/xadillax/Workspace/cpp/master-node/./node]
 8: worker [/Users/xadillax/Workspace/cpp/master-node/./node]
 9: _pthread_body [/usr/lib/system/libsystem_pthread.dylib]
10: _pthread_body [/usr/lib/system/libsystem_pthread.dylib]
11: thread_start [/usr/lib/system/libsystem_pthread.dylib]
[1]    20134 abort      ./node test/internet/test-dns-setserver-in-callback-of-resolve4.js

@addaleax Do you have any idea with this crash?

@tniessen
Copy link
Member

I have got some questions before doing a full review, I hope you don't mind:

  1. Is it a reasonable assumption that there are no outstanding queries (referring to this check)? To me, this sounds like something that could be fixed upstream, but I could be wrong. However, if it is a reasonable assumption, we could consider to throw an error instead.
  2. What does this imply for Resolvers? We effectively hot-swap the channel of a resolver? Are we sure this won't cause problems?
  3. If we do, we need to patch Cancel() to cancel requests in all channels which were created for that Resolver, right? Assuming they have not been destroyed yet.

@addaleax
Copy link
Member

Heads up: This is going to conflict with #14826

@addaleax
Copy link
Member

addaleax commented Aug 17, 2017

I think an easier fix would be to wait with actually carrying out setServers() until all current queries have been handled (and queueing up queries until that point)?

@tniessen
Copy link
Member

I think an easier fix would be to wait with actually carrying out setServers() until all current queries have been handled (and queueing up queries until that point)?

Easier, yes, and probably less error-prone, but that can cause unexpected delays, especially when using the default resolver. I am still in favor of permitting this behavior upstream if there is no good reason to disallow it. And if there is a good reason to disallow it, we should probably do the same instead of working around the actual problem like this.

@XadillaX
Copy link
Contributor Author

XadillaX commented Aug 18, 2017

@tniessen

Is it a reasonable assumption that there are no outstanding queries (referring to this check)? To me, this sounds like something that could be fixed upstream, but I could be wrong. However, if it is a reasonable assumption, we could consider to throw an error instead.

C-ARES is not designed for libuv. If we use it in multi-thread, some locks are needed.

Imagine this situation in Node.js:

dns.resolve4('google.com', common.mustCall(function(/* err, nameServers */) {
  dns.setServers([ '8.8.8.8' ]);
}));

dns.resolve4('google.com', common.mustCall(function() {
  // do nothing...
}));

If the first resolve4() done, and second one is querying in another thread and not finished yet. Now the main event loop do the dns.setServer() - the assert will be trigged and Node.js crashes.

@XadillaX
Copy link
Contributor Author

I think an easier fix would be to wait with actually carrying out setServers() until all current queries have been handled (and queueing up queries until that point)?

Doesn't it block the event loop?

@tniessen
Copy link
Member

Is it a reasonable assumption that there are no outstanding queries (referring to this check)? To me, this sounds like something that could be fixed upstream, but I could be wrong. However, if it is a reasonable assumption, we could consider to throw an error instead.

C-ARES is not designed for libuv. If we use it in multi-thread, some locks are needed.

If the first resolve4() done, and second one is querying in another thread and not finished yet.

Sorry, I don't really get your point here. How is this related to threads? (c-ares does not use threads internally as far as I know, but I might be wrong.)

@XadillaX
Copy link
Contributor Author

XadillaX commented Aug 18, 2017

@tniessen But Node.js do dns.resolveA() in another thread in libuv (via AsyncWrap) or it will be blocked. That's the problem. C-ARES thinks you only have one thread and if you're setting a DNS server that you must finished all your queries.

@BridgeAR
Copy link
Member

BridgeAR commented Sep 2, 2017

Where do we stand here?

@XadillaX
Copy link
Contributor Author

XadillaX commented Sep 3, 2017

Where do we stand here?

@BridgeAR Sorry but I don't quite understand what you mean.

@BridgeAR
Copy link
Member

BridgeAR commented Sep 3, 2017

@XadillaX you posted that this PR this has an issue, there was no proper review yet and the last update was more then two weeks before my post. Therefore my question is how this is further progressing. Did you look into the issue again?

@XadillaX
Copy link
Contributor Author

XadillaX commented Sep 5, 2017

@BridgeAR I've updated the code based on #14826.

/cc @addaleax

@tniessen
Copy link
Member

tniessen commented Sep 8, 2017

By the way, I just noticed the documentation explicitely prohibits this behavior:

The dns.setServers() method must not be called while a DNS query is in progress.

@addaleax
Copy link
Member

addaleax commented Sep 8, 2017

@XadillaX I am not sure – can you explain what the job of the added threading in this PR is? Previously, everything was single-threaded with no synchronization needed…

@XadillaX
Copy link
Contributor Author

@addaleax You see this example:

dns.resolve4('google.com', common.mustCall(function(/* err, nameServers */) {
    dns.setServer([ '8.8.8.8' ]);
});

dns.resolve4('google.com', common.mustCall(function() {
  // DO NOTHING
}));

Sometime this example will crash because when we are doing the second dns.resolve4, it may execute setServer in the main eventloop. And dns.setServer will assert there's no active query in C-ARES. But the result is, the second dns.resolve4 actives the query channel and occurs the crash.

@addaleax
Copy link
Member

@XadillaX Yes, I understand the problem. What I don’t understand so much is why using multi-threading is helpful to solve it.

@XadillaX
Copy link
Contributor Author

XadillaX commented Sep 10, 2017

@addaleax

  1. add read write lock to setServer and resolve.
  2. when resolve (query), create a queue work and do some lock operation.
  3. when setServer, replace the original channel to a new one and wait for unlocking the write lock. after unlocking, delete the original channel.

If I do not use multithread, the main eventloop will be blocked by lock.

@addaleax
Copy link
Member

@XadillaX I see! There is at least one problem, though: If you look up the docs for pthread_rwlock_unlock(), which is what libuv uses for its own rwlocks, you’ll find that “results are undefined if the read-write lock rwlock is not held by the calling thread”.

Also, I think this adds a lot of complexity for behaviour that is arguably not a bug, because this is explicitly forbidden by our documentation.

I could think of these alternative approaches:

  1. Keep track of how many queries are currently open. If setServers() is called while there are open queries, error out.
  2. Keep track of how many queries are currently open. If setServers() is called while there are open queries, store the servers structure that we create, and add a hook that calls ares_set_servers_ports() once that count drops to zero.
  3. For Resolver instances, use option 1. For dns.setServers(), just create a fresh new default channel every time it is called, and then set its servers list.

What do you think?

@XadillaX
Copy link
Contributor Author

XadillaX commented Sep 11, 2017

@addaleax

  1. This is awesome but not for the default Resolver. Maybe we can provide another API activityCount.
  2. If we call resolve before another one is finished, and call a new resolve before the one we mentioned before is finished, the loop won't break and the new servers won't be set forever.
  3. It's suitable for option 1, but I wonder the performance.

@addaleax
Copy link
Member

@XadillaX Yes, I don’t like 2 either. For 1 and 3, I think either is fine, and I wouldn’t worry about performance too much as I can’t imagine anybody is calling .setServers() more than once a minute or so in practice (and if they are, they probably shouldn’t).

@XadillaX
Copy link
Contributor Author

XadillaX commented Sep 11, 2017

@addaleax Sorry I read it wrong. I read it as creating a new channel everytime resolve and setServer.

Now I agree with 1 and 3. And I'll continue working for it.

@BridgeAR
Copy link
Member

Ping @XadillaX you still want to follow up on this and rework the PR, right?

@BridgeAR
Copy link
Member

Ping @XadillaX

@addaleax
Copy link
Member

@XadillaX Yes – I think why this works a lot of the time is that setServers([]) usually, depending on the contents of unitialized memory, results in queries failing with ENOMEM due to the miscalculation that gives malloc(-8).

I think for now you should be fine moving the added tests from this PR into a new file, don’t worry about setServers([]).

@BridgeAR BridgeAR added the wip Issues and PRs that are still a work in progress. label Nov 22, 2017
@XadillaX
Copy link
Contributor Author

XadillaX commented Dec 5, 2017

@XadillaX
Copy link
Contributor Author

XadillaX commented Dec 5, 2017

/cc @addaleax I've updated the test case.

@BridgeAR BridgeAR removed the wip Issues and PRs that are still a work in progress. label Dec 6, 2017
addresses.INET4_HOST,
common.mustCall(function() {
// DO NOTHING
}));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tiny nit

A must call does not require a noop function.

}, common.expectsError({
code: 'ERR_DNS_SET_SERVERS_FAILED',
message: /^c-ares failed to set servers: "There are pending queries\." \[.+\]$/g
}));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tiny nit

You could use common.expectsError(throwingFn, errorObj) instead of assert.throws(throwingFn, common.expectsError(errorObj)).


dns.resolve('localhost', function(err/*, ret*/) {
// nothing
});
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe use common.mustCall for the noop functions in this test? Or are they actually not called at all? In that case a mustNotCall could be used.

@XadillaX
Copy link
Contributor Author

XadillaX commented Dec 6, 2017

@addaleax
Copy link
Member

addaleax commented Dec 6, 2017

Landed in b3678df … finally. ;) Thanks for sticking with this PR and pushing it through!

@addaleax addaleax closed this Dec 6, 2017
addaleax pushed a commit that referenced this pull request Dec 6, 2017
Fix this issue follow these two points:

1. Keep track of how many queries are currently open. If `setServers()`
   is called while there are open queries, error out.
2. For `Resolver` instances, use option 1. For dns.setServers(), just
   create a fresh new default channel every time it is called, and then
   set its servers list.

PR-URL: #14891
Fixes: #14734
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: Tobias Nießen <tniessen@tnie.de>
@XadillaX XadillaX mentioned this pull request Dec 8, 2017
MylesBorins pushed a commit that referenced this pull request Dec 12, 2017
Fix this issue follow these two points:

1. Keep track of how many queries are currently open. If `setServers()`
   is called while there are open queries, error out.
2. For `Resolver` instances, use option 1. For dns.setServers(), just
   create a fresh new default channel every time it is called, and then
   set its servers list.

PR-URL: #14891
Fixes: #14734
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: Tobias Nießen <tniessen@tnie.de>
MylesBorins pushed a commit that referenced this pull request Dec 12, 2017
Fix this issue follow these two points:

1. Keep track of how many queries are currently open. If `setServers()`
   is called while there are open queries, error out.
2. For `Resolver` instances, use option 1. For dns.setServers(), just
   create a fresh new default channel every time it is called, and then
   set its servers list.

PR-URL: #14891
Fixes: #14734
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: Tobias Nießen <tniessen@tnie.de>
@MylesBorins MylesBorins mentioned this pull request Dec 12, 2017
@gibfahn gibfahn added lts-watch-v8.x baking-for-lts PRs that need to wait before landing in a LTS release. labels Dec 20, 2017
gibfahn pushed a commit that referenced this pull request Feb 19, 2018
Fix this issue follow these two points:

1. Keep track of how many queries are currently open. If `setServers()`
   is called while there are open queries, error out.
2. For `Resolver` instances, use option 1. For dns.setServers(), just
   create a fresh new default channel every time it is called, and then
   set its servers list.

Fixes: #14734
PR-URL: #14891
Backport-PR-URL: #17778
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: Tobias Nießen <tniessen@tnie.de>
@MylesBorins MylesBorins removed the baking-for-lts PRs that need to wait before landing in a LTS release. label Aug 17, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
c++ Issues and PRs that require attention from people who are familiar with C++. cares Issues and PRs related to the c-ares dependency or the cares_wrap binding.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

dns: occasionally crashing in resolve4 and setServers
9 participants