Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[sdk#994] Heal deadline #995

Merged

Conversation

Bolodya1997
Copy link

@Bolodya1997 Bolodya1997 commented Jun 24, 2021

Description

  1. Adds timeout to heal requests.
  2. Fixes interpose not to mark itself as "clear" on failed heal request.

Issue link

Closes #994.

How Has This Been Tested?

  • Added unit testing to cover Covered by existing unit testing
  • Tested manually
  • Tested by integration testing
  • Have not tested

Types of changes

  • Bug fix
  • New functionallity
  • Documentation
  • Refactoring
  • CI

Vladimir Popov added 3 commits June 24, 2021 11:11
Signed-off-by: Vladimir Popov <vladimir.popov@xored.com>
Signed-off-by: Vladimir Popov <vladimir.popov@xored.com>
Signed-off-by: Vladimir Popov <vladimir.popov@xored.com>
Signed-off-by: Vladimir Popov <vladimir.popov@xored.com>
@Bolodya1997 Bolodya1997 force-pushed the sdk#994/heal-deadline branch 3 times, most recently from 438765f to ab3248e Compare June 24, 2021 11:12
Signed-off-by: Vladimir Popov <vladimir.popov@xored.com>
@Bolodya1997 Bolodya1997 marked this pull request as ready for review June 24, 2021 15:03
@@ -108,8 +108,6 @@ func (l *interposeServer) Request(ctx context.Context, request *networkservice.N
return result, nil
}

l.activeConnection.Delete(activeConnID)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to delete this?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When heal fails its first Request, it cleans up active connection - it results in Close failing with no active connection found because for interpose it looks like trying to close not existing connection. heal stands in chain after interpose, so client cannot stop healing in such case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm... As I can see with this fix activeConnection map will not delete connections.
Am I get it correctly that we got a leak of active connections?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, missed this.
Fixed in other way with setting clientURL = nil an invoking next.Close in Close in case of no active connection exists.

nseCtxCancel()
time.Sleep(100 * time.Millisecond)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to add sleep here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test validates that heal can be stopped after it has been started - so we need to wait for some time to make sure that heal is really running at the moment when we are trying to stop it.
Without this sleep test is also working, but it tests preventive heal stop - and so it is not the thing that should be tested here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there some other way we can make sure that heal has started?
Recently we found out that on CI 100ms sometimes isn't enough for thread synchronization. And the test where we had issues with timeout is much simpler, so complex heal tests likely require even more time to be sure that another thread started working.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be problematic.
Waiting for 100ms doesn't guarantee that heal has been started, but it does mean that in most test runs it will be so.

@Bolodya1997 Bolodya1997 marked this pull request as draft June 29, 2021 03:52
Signed-off-by: Vladimir Popov <vladimir.popov@xored.com>
@Bolodya1997 Bolodya1997 marked this pull request as ready for review June 29, 2021 06:17
@denis-tingaikin denis-tingaikin merged commit 40c33e0 into networkservicemesh:main Jun 29, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Heal blocks all other Requests/Closes
4 participants