-
Notifications
You must be signed in to change notification settings - Fork 349
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: return error from websocket handler if socket id is taken #8877
fix: return error from websocket handler if socket id is taken #8877
Conversation
✅ Deploy Preview for determined-ui ready!
To edit notification comments on pull requests, go to your Netlify site configuration. |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #8877 +/- ##
==========================================
+ Coverage 47.53% 47.70% +0.17%
==========================================
Files 1066 1066
Lines 170433 170434 +1
Branches 2237 2235 -2
==========================================
+ Hits 81018 81311 +293
+ Misses 89257 88965 -292
Partials 158 158
Flags with carried forward coverage won't be shown. Click here to find out more.
|
7c47a4a
to
5cf7305
Compare
… already taken, instead of stopping the agent with a matching id Intends to resolve [DET-7940]
…agent does not yet exist
9f906cb
to
f86ca6c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems good to me, but I'll invoke @stoksc to verify
Also, I saw the ticket ID in the branch name, but could you add it to the description or the PR title?
a.stop(err) | ||
return | ||
return nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: we could omit this return and just fall through to the last return nil case
…e same id try to connect
Thanks! The ticket ID is at the bottom of the description, under the Checklist. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, awesome work with this!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm, just some non-blocking so to chat over
a.mu.Lock() | ||
defer a.mu.Unlock() | ||
|
||
err := a.handleWebsocketConnection(msg) | ||
if err != nil { | ||
if errors.Is(err, errWebsocketAlreadyConnected) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think the reason for this behavior is to cover the case where 'agent restarted and tried to reconnect but we haven't noticed yet but we wanna recover faster'. even in that case though, this (your change) is still correct, eventually (prob immediately) the master will lose the socket and the agent will be able to reconnect (unless this was covering up a bug that didn't happen, ha.. so make sure to test it locally to get a feel for what the behavior is).
"gotest.tools/assert" | ||
) | ||
|
||
func TestAgentConnection(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you just put this in a file beside the agents file? i'm kinda over this 'tests so far from code' pattern. even if it is basically an e2e.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also name could prob be more specific (TestDuplicateAgentConnectionsDontBoot .. ? idk, hard one)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll give it a try -- this started out as a unit test, drifted over to integration tests because mimicking agent behavior seemed difficult, then ended up setting up a fake agent anyway. It probably can move back to unit tests; thanks for pointing that out :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh i just meant the name. Integration tests dont have to live in test/integration, that was the old pattern.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also integration == needs db or elastic, really, right now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, that's really great to know -- thanks!
AgentId: testAgentID, | ||
}) | ||
assert.NilError(t, err, "could not connect to master api to get agent") | ||
assert.Assert(t, apiResp.GetAgent().GetEnabled()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we've been using require mostly instead of assert. i feel like the first failure is always the most useful and the rest is noise, so i'm happy to crash tests early. curious what you think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Happy to switch over; I don't feel that strongly about any testing framework, just went for the one that happened to be in the other master integration tests really.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
either way is fine with me
* rm: have agent.HandleWebsocketConnection return error if socket id is already taken, instead of stopping the agent with a matching id Intends to resolve [DET-7940] * add release notes * fix(?) release notes formatting for linter * capture and raise new error from handling websocket connections when agent does not yet exist * attempt again to fix rst lint issues * edit and rstfmt release note * add integration regression test for scenario where two agents with the same id try to connect * move agent duplicate conn id test to internal/rm/agentrm, misc PR cleanup
Description
I'm taking this on as a spin-up task and still learning; please give this one appropriate scrutiny.
My understanding is that the handler for incoming websocket connections would mark an agent as terminated if any errors occurred while handling new connections on the socket id (based on agent id), including the error in the case that there's already a working connection on that socket. This would cause both agents to fail, instead of leaving the original up and just terminating the dupe.
I think this still leaves open the possibility that an agent with a duplicate ID manages to connect while the original's connection is briefly unstable, and the dupe becomes the "original", but since the previous state of affairs was that both agents would die no matter what, I think this is a net improvement and addresses the main issue stated on Tim's bug report.
Test Plan
Pass new integration test.
Commentary (optional)
The error message still isn't especially clear to whoever launches the second agent ("connection refused" is very generic); I'd be interested in suggestions on how the error condition could be communicated and propagated more clearly, or if we think it's fine as-is.
Following the repro steps described on the Jira ticket, I configured a local
devcluster
with two agents that both haveagent_id: agent1
. On start, without any additional input, the first agent starts up and stays up, and the second agent stops and stays stopped. Then I connected two more agents with the command:The first came up and stayed up; the second failed quickly with an error message:
(I also did this before making changes and observed the same issues Tim did.)
Checklist
docs/release-notes/
.See Release Note for details.
Ticket
DET-7940