Re-resolution loop when service config is not supported #3199

Closed
kazegusuri opened this issue Nov 20, 2019 · 2 comments · Fixed by #3201

kazegusuri (Contributor) commented Nov 20, 2019

What version of gRPC are you using?

v1.25.0 or later

What version of Go are you using (go version)?

go version go1.13.4 darwin/amd64

What operating system (Linux, Windows, …) and version?

Mac, Linux

What did you do?

If possible, provide a recipe for reproducing the error.

What did you expect to see?

For the DNS resolver, the default re-resolution frequency is 30 minutes, so DNS resolution should happen every 30 minutes.

What did you see instead?

For the DNS resolver, the minimum interval between re-resolutions is set to 30 seconds by minDNSResRate. When the service config is invalid, DNS resolution happens every 30 seconds. This happens even when the service config feature is not used at all, in which case the TXT lookup returns an error.
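
For reference, the throttle behaves roughly like this sketch; minDNSResRate is the real constant, but the watcher shape and the other names are simplified stand-ins, not the actual grpc-go code:

```go
package main

import (
	"fmt"
	"time"
)

const minDNSResRate = 30 * time.Second // floor between consecutive lookups

// watcher is a simplified stand-in for the DNS resolver's watcher loop:
// it re-resolves on a timer (freq, default 30 minutes) or when ResolveNow
// fires rn, but never more often than once per minDNSResRate.
func watcher(rn <-chan struct{}, freq time.Duration, lookup func()) {
	t := time.NewTimer(freq)
	defer t.Stop()
	for {
		select {
		case <-t.C: // periodic re-resolution
		case <-rn: // ResolveNow requested an immediate re-resolution
		}
		lookup()
		t.Reset(freq)
		time.Sleep(minDNSResRate) // throttle: at most one lookup per 30s
	}
}

func main() {
	rn := make(chan struct{}, 1)
	go watcher(rn, 30*time.Minute, func() { fmt.Println("lookup") })
	rn <- struct{}{} // simulate a ResolveNow call
	time.Sleep(100 * time.Millisecond)
}
```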

This is because of the polling introduced in v1.25.0 for error handling. NewServiceConfig is called from the resolver when the service config is updated; in the DNS resolver, the watcher is what calls NewServiceConfig. The polling calls ResolveNow to trigger an immediate re-resolution, so a DNS resolution request is queued in the DNS resolver. That request in turn calls NewServiceConfig again, so it becomes an infinite loop.

If a resolver calls NewServiceConfig as a result of ResolveNow, the resolver can end up looping. I'm not sure which part is at fault, but NewServiceConfig should not be called when service config is disabled, and it would be better to have backoff for re-resolution.
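
To make the cycle concrete, here is a minimal self-contained sketch; fakeResolver and fakeClientConn are hypothetical stand-ins for the real types, and the loop is capped so the program terminates (the actual bug keeps spinning every 30 seconds):

```go
package main

import "fmt"

type fakeClientConn struct {
	r     *fakeResolver
	polls int
}

type fakeResolver struct{ cc *fakeClientConn }

// ResolveNow re-runs the lookup; the TXT lookup still fails (or is still
// disabled), so the resolver reports the same unusable service config.
func (r *fakeResolver) ResolveNow() {
	r.cc.NewServiceConfig("") // empty string: not valid JSON
}

// NewServiceConfig mimics the v1.25.0 polling: an unparsable config makes
// the ClientConn call ResolveNow again, which reports the config again...
func (cc *fakeClientConn) NewServiceConfig(sc string) {
	if sc == "" { // stand-in for "failed to parse as JSON"
		cc.polls++
		if cc.polls > 3 { // cap so this sketch terminates
			fmt.Println("gave up after", cc.polls, "polls; the real loop never stops")
			return
		}
		cc.r.ResolveNow()
	}
}

func main() {
	cc := &fakeClientConn{}
	cc.r = &fakeResolver{cc: cc}
	cc.r.ResolveNow()
}
```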

@kazegusuri kazegusuri changed the title Infinite loop of resolver when service config is not supported Re-resolution loop when service config is not supported Nov 20, 2019
dfawley (Member) commented Nov 20, 2019

Sounds like two things may be wrong:

  1. If there is no service config, that is not supposed to be an error.
  2. If there is an actual error fetching the service config, there should be an exponential backoff applied to the resolver's polling rate.

@dfawley dfawley self-assigned this Nov 20, 2019
@dfawley dfawley added the P1 label Nov 20, 2019
dfawley (Member) commented Nov 20, 2019

OK, I see what's happening. When service config lookups are disabled, the resolver is returning the empty string as the service config, which is not valid JSON, so the ClientConn treats that as an error and asks for another update. I will make the DNS resolver stop reporting any service config when it's disabled, or when it receives an error during the TXT lookup. This will go back to the previous behavior. I will also make the ClientConn ignore the service config, regardless of whether there is an error, when service config lookups are disabled.
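
A minimal sketch of that direction, using illustrative stand-in types rather than grpc-go's real ClientConn: when lookups are disabled, the reported config is ignored outright, so an empty or invalid string never triggers another poll:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// clientConn is an illustrative stand-in, not grpc-go's ClientConn.
type clientConn struct {
	disableServiceConfig bool
	applied              string
}

func (cc *clientConn) updateServiceConfig(sc string) {
	// Fix: if service config lookups are disabled, ignore whatever the
	// resolver reports, valid or not, instead of treating it as an error.
	if cc.disableServiceConfig {
		return
	}
	if !json.Valid([]byte(sc)) {
		fmt.Println("invalid service config: would poll the resolver (with backoff)")
		return
	}
	cc.applied = sc
}

func main() {
	cc := &clientConn{disableServiceConfig: true}
	cc.updateServiceConfig("") // previously: parse error -> re-resolution loop
	fmt.Printf("applied: %q (no polling triggered)\n", cc.applied)
}
```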

I also have a PR to update the DNS resolver to use the V2 API, which implements the newer-style error handling requirements: #3165, in progress and currently blocked by #3186. Once that is done, an error during the TXT lookup will return an error to the ClientConn and result in polling until no error is reported. This polling follows an exponential backoff starting at 1 second and eventually reaching 2 minutes. Because of the additional throttling in the DNS resolver, we will only allow it to refresh at most once every 30 seconds, but after the backoff reaches its maximum it will settle at polling every 2 minutes.
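
Assuming a plain doubling backoff from 1 second capped at 2 minutes (the real backoff also adds jitter, which is omitted here), combining it with the 30-second DNS throttle yields a schedule like the one this sketch prints:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	const (
		base    = 1 * time.Second
		maxWait = 2 * time.Minute
		dnsMin  = 30 * time.Second // minDNSResRate throttle in the DNS resolver
	)
	wait := base
	for i := 0; i < 9; i++ {
		effective := wait
		if effective < dnsMin {
			effective = dnsMin // the resolver refreshes at most once per 30s
		}
		fmt.Printf("attempt %d: backoff %v, effective interval %v\n", i, wait, effective)
		wait *= 2
		if wait > maxWait {
			wait = maxWait // backoff eventually settles at 2 minutes
		}
	}
}
```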

@lock lock bot locked as resolved and limited conversation to collaborators May 20, 2020