Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use Echo method to send inactivity probe #368

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

pperiyasamy
Copy link
Contributor

This commit moves sendEcho call into goroutine to avoid chances for it gets blocked when nbdb leader is accidently disconnected, Otherwise there is no way to disconnect existing client connection and make it to connect with new leader.

@pperiyasamy
Copy link
Contributor Author

/cc @jcaamano @dcbw

@pperiyasamy
Copy link
Contributor Author

/hold

We may have to make send echo to be a time bound call as it currently holding rpc mutex's RLock for longer period as it takes almost 12 mins to return unexpected EOF error when nbdb leader is accidentally gone away.
During this period reconnect is not possible because Disconnect() seems waiting to acquire rpc mutex's Lock because of above mentioned reason.

@pperiyasamy pperiyasamy force-pushed the send-echo-async-for-inactivity-check branch 2 times, most recently from b50fa7f to 8cbc0ed Compare September 25, 2023 12:55
@pperiyasamy pperiyasamy changed the title Move sendEcho into goroutine for inactivity probe Use Echo method to send inactivity probe Sep 25, 2023
@pperiyasamy
Copy link
Contributor Author

We may have to make send echo to be a time bound call as it currently holding rpc mutex's RLock for longer period as it takes almost 12 mins to return unexpected EOF error when nbdb leader is accidentally gone away.
During this period reconnect is not possible because Disconnect() seems waiting to acquire rpc mutex's Lock because of above mentioned reason.

Using client's Echo method directly to fix the issue. Please take a look now.

client/client.go Outdated
o.Disconnect()
return
}
lastEcho = ""
case <-time.After(o.options.inactivityTimeout):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is better to re-use the timer instead of using a different one on each for loop, otherwise we can end up with a lot of concurrent running timers if we are constantly seeing traffic

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't the timer eligible for garbage collection when select receives signal from trafficSeen channel ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is but if we spawn them at a greater rate than they are garbage collected (which requires the timer to complete first) then we have a problem. Seems too much of a risk to me when we could just re-use the same timer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok @jcaamano , added a code now to reuse the same timer.

client/client.go Outdated
// Can't use o.Echo() because it blocks; we need the Call object direct from o.rpcClient.Go()
call := o.sendEcho(args, &reply)
if call == nil {
ctx, cancel := context.WithTimeout(context.Background(), o.options.timeout)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't this going to block ongoing transacts because we are not consuming the trafficSeen channel in the meantime?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, It blocks ongoing transactions. Now calling Echo in a separate goroutine.

@pperiyasamy pperiyasamy force-pushed the send-echo-async-for-inactivity-check branch 2 times, most recently from b865ddb to 49be92a Compare September 25, 2023 19:28
// Otherwise stuff thisEcho into the echoReplied channel
echoReplied <- thisEcho
}
ctx, cancel := context.WithTimeout(context.Background(), o.options.timeout)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

two nuances:

  • Do we make any checks for o.options.timeoutto be smaller than o.options.inactivityTimeout? Or should we expect multiple instances of this goroutine to be running simultaneously?
  • We might see traffic while the echo is in progress and still disconnect if the echo itself fails.

I don't think any of these are necessarily a problem but can be disconcerting if we end up seeing logs and troubleshooting scenarios where any of that is happening.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Do we make any checks for o.options.timeoutto be smaller than o.options.inactivityTimeout? Or should we expect multiple instances of this goroutine to be running simultaneously?
  • We might see traffic while the echo is in progress and still disconnect if the echo itself fails.

The first concern is fixed with commit e0bed3f.

For the 2nd one, Do you think echo fails whereas transaction would succeed. Does that really happen (both goes via same tcp channel) ? Do you want to have some retries before going for a disconnect ? If this is not really a concern for now, we can skip doing it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I am just happy if we check that o.options.timeout is smaller than o.options.inactivityTimeout. I would drop everything else from e0bed3f

Copy link
Contributor Author

@pperiyasamy pperiyasamy Sep 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok @jcaamano, done with commit c287d43 itself.
Hope you're still fine with cf04722 for go bump.

@pperiyasamy pperiyasamy force-pushed the send-echo-async-for-inactivity-check branch 3 times, most recently from 464e26b to a1d471d Compare September 26, 2023 13:14
This commit uses client's Echo method to send inactivity probe as it can be
made as a time bound call so that rpc read lock can not be held indefinitely.
This would make disconnect happens immediately when current ovsdb leader is
accidentally gone away (this happens in a very rare scenario in which case
sendEcho method returns with unexpected EOF error after 12 mins only) and
reconnects with new ovsdb leader.

Signed-off-by: Periyasamy Palanisamy <pepalani@redhat.com>
This bump up go into 1.19.6 and fixes lint errors generated
in CI after upgrading go version.

Signed-off-by: Periyasamy Palanisamy <pepalani@redhat.com>
@pperiyasamy pperiyasamy force-pushed the send-echo-async-for-inactivity-check branch from a1d471d to cf04722 Compare September 27, 2023 12:33
@pperiyasamy pperiyasamy force-pushed the send-echo-async-for-inactivity-check branch 2 times, most recently from e719ea9 to d569373 Compare November 8, 2023 06:15
if reflect.TypeOf(o.conn).String() == "*net.TCPConn" {
tcpConn := o.conn.(*net.TCPConn)
if o.options.timeout > 0 {
err := tcpConn.SetDeadline(time.Now().Add(o.options.timeout * 3))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to SetDeadline for each call? If so, there requires a mutex.

BTW: How does this affect rpcClient.CallWithContext? Does rpcClient discard the timeout context?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, The SetDeadline looks to be needed for every read/write on tcp channel. This method is always invoked after o.rpcMutex is acquired. so we're safe here.

correct, The rpcClient.CallWithContext method is not fully timebound call, so fixing it here based on feedback from a PR raised there.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found a related issue in grpc-go maybe you can take a look.

issue: grpc/grpc#15889
pr: https://github.com/grpc/grpc-go/pull/2307/files

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @halfcrazy , yes, it's the same issue that we are trying to fix in libovsdb client as well.
added a commit 3867774 to set up tcp user timeout.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pperiyasamy have you considered @halfcrazy concerns about needing a mutex?

This sets up tcp connection deadline for write and read requests
to fail after a deadline has been exceeded.

Signed-off-by: Periyasamy Palanisamy <pepalani@redhat.com>
@pperiyasamy pperiyasamy force-pushed the send-echo-async-for-inactivity-check branch from d569373 to 304a102 Compare November 9, 2023 07:18
This commit sets TCP_USER_TIMEOUT socket option for tcp connection
so that channel write doesn't block indefinitely on network disconnect.

Signed-off-by: Periyasamy Palanisamy <pepalani@redhat.com>
@pperiyasamy pperiyasamy force-pushed the send-echo-async-for-inactivity-check branch from 304a102 to 3867774 Compare November 9, 2023 07:28
Copy link
Collaborator

@jcaamano jcaamano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pperiyasamy Isn't the TCP user timeout an alternative to setDeadline, so we wouldn't need both? If not, are you able to reproduce a problem where one of the approaches is not enough and the other one is needed? I would stick with the TCP user timeout otherwise, I think.

// a rpc request to the server.
if tcpConn, ok := o.conn.(*net.TCPConn); ok {
if o.options.timeout > 0 {
err := tcpConn.SetDeadline(time.Now().Add(o.options.timeout * 3))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why * 3? Wouldn't it make sense to use inactivityTimeout for this?

if reflect.TypeOf(o.conn).String() == "*net.TCPConn" {
tcpConn := o.conn.(*net.TCPConn)
if o.options.timeout > 0 {
err := tcpConn.SetDeadline(time.Now().Add(o.options.timeout * 3))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pperiyasamy have you considered @halfcrazy concerns about needing a mutex?

// channel write doesn't block indefinitely on network disconnect.
var userTimeout time.Duration
if o.options.timeout > 0 {
userTimeout = o.options.timeout * 3
Copy link
Collaborator

@jcaamano jcaamano Nov 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here, why * 3? Wouldn't it make sense to use inactivityTimeout for this as well?

if o.options.timeout > 0 {
userTimeout = o.options.timeout * 3
} else {
userTimeout = defaultTimeout
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that to start with we shouldn't have a default, and just set it to inactivityTimeout when that is set.

@coveralls
Copy link

coveralls commented Jan 16, 2024

Pull Request Test Coverage Report for Build 6325970834

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 29 of 40 (72.5%) changed or added relevant lines in 4 files are covered.
  • 203 unchanged lines in 2 files lost coverage.
  • Overall coverage increased (+0.09%) to 76.07%

Changes Missing Coverage Covered Lines Changed/Added Lines %
ovsdb/schema.go 0 1 0.0%
client/options.go 1 3 33.33%
modelgen/generator.go 0 2 0.0%
client/client.go 28 34 82.35%
Files with Coverage Reduction New Missed Lines %
client/options.go 4 75.61%
client/client.go 199 74.58%
Totals Coverage Status
Change from base Build 6159565928: 0.09%
Covered Lines: 5280
Relevant Lines: 6941

💛 - Coveralls

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants