fix: return error from websocket handler if socket id is taken #8877

jesse-amano-hpe · 2024-02-23T19:15:45Z

Description

I'm taking this on as a spin-up task and still learning; please give this one appropriate scrutiny.

My understanding is that the handler for incoming websocket connections would mark an agent as terminated if any errors occurred while handling new connections on the socket id (based on agent id), including the error in the case that there's already a working connection on that socket. This would cause both agents to fail, instead of leaving the original up and just terminating the dupe.

I think this still leaves open the possibility that an agent with a duplicate ID manages to connect while the original's connection is briefly unstable, and the dupe becomes the "original", but since the previous state of affairs was that both agents would die no matter what, I think this is a net improvement and addresses the main issue stated on Tim's bug report.

Test Plan

Pass new integration test.

Commentary (optional)

The error message still isn't especially clear to whoever launches the second agent ("connection refused" is very generic); I'd be interested in suggestions on how the error condition could be communicated and propagated more clearly, or if we think it's fine as-is.

Following the repro steps described on the Jira ticket, I configured a local devcluster with two agents that both have agent_id: agent1. On start, without any additional input, the first agent starts up and stays up, and the second agent stops and stays stopped. Then I connected two more agents with the command:

docker run --rm -v /var/run/docker.sock:/var/run/docker.sock -e DET_AGENT_ID="test.id" -e DET_MASTER_HOST=127.0.0.1 -e DET_MASTER_PORT=8080 --net=host determinedai/determined-agent:latest

The first came up and stayed up; the second failed quickly with an error message:

FATA[2024-02-23T18:45:50Z] long disconnected and unable to reconnect to master: initial connection to master failed: error dialing master: dial tcp 127.0.0.1:8080: connect: connection refused

(I also did this before making changes and observed the same issues Tim did.)

Checklist

Changes have been manually QA'd
User-facing API changes need the "User-facing API Change" label.
Release notes should be added as a separate file under docs/release-notes/.
See Release Note for details.
Licenses should be included for new code which was copied and/or modified from any external code.

Ticket

DET-7940

netlify · 2024-02-23T19:16:07Z

✅ Deploy Preview for determined-ui ready!

Name	Link
🔨 Latest commit	`9a7c179`
🔍 Latest deploy log	https://app.netlify.com/sites/determined-ui/deploys/65de5a53534373000727d2c5
😎 Deploy Preview	https://deploy-preview-8877--determined-ui.netlify.app
📱 Preview on mobile	Toggle QR Code... Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

codecov · 2024-02-23T19:56:26Z

Codecov Report

Attention: Patch coverage is 85.71429% with 1 lines in your changes are missing coverage. Please review.

Project coverage is 47.70%. Comparing base (d24b19a) to head (9a7c179).
Report is 1 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #8877      +/-   ##
==========================================
+ Coverage   47.53%   47.70%   +0.17%     
==========================================
  Files        1066     1066              
  Lines      170433   170434       +1     
  Branches     2237     2235       -2     
==========================================
+ Hits        81018    81311     +293     
+ Misses      89257    88965     -292     
  Partials      158      158

Flag	Coverage Δ
backend	`44.15% <85.71%> (+0.79%)`	⬆️
harness	`63.69% <ø> (ø)`
web	`42.53% <ø> (ø)`

Flags with carried forward coverage won't be shown. Click here to find out more.

Files	Coverage Δ
master/internal/rm/agentrm/agents.go	`63.30% <100.00%> (+30.87%)`	⬆️
master/internal/rm/agentrm/agent.go	`15.57% <80.00%> (+15.57%)`	⬆️

... and 3 files with indirect coverage changes

docs/release-notes/7940-socket-error.rst

… already taken, instead of stopping the agent with a matching id Intends to resolve [DET-7940]

…agent does not yet exist

eecsliu

seems good to me, but I'll invoke @stoksc to verify

Also, I saw the ticket ID in the branch name, but could you add it to the description or the PR title?

eecsliu · 2024-02-26T21:37:14Z

master/internal/rm/agentrm/agent.go

 		a.stop(err)
-		return
+		return nil


nit: we could omit this return and just fall through to the last return nil case

…e same id try to connect

jesse-amano-hpe · 2024-02-26T23:05:47Z

seems good to me, but I'll invoke @stoksc to verify

Also, I saw the ticket ID in the branch name, but could you add it to the description or the PR title?

Thanks!

The ticket ID is at the bottom of the description, under the Checklist.

master/test/integration/rm/agent_test.go

amandavialva01

LGTM, awesome work with this!

stoksc

lgtm, just some non-blocking so to chat over

stoksc · 2024-02-27T14:52:49Z

master/internal/rm/agentrm/agent.go

 	a.mu.Lock()
 	defer a.mu.Unlock()

 	err := a.handleWebsocketConnection(msg)
 	if err != nil {
+		if errors.Is(err, errWebsocketAlreadyConnected) {


i think the reason for this behavior is to cover the case where 'agent restarted and tried to reconnect but we haven't noticed yet but we wanna recover faster'. even in that case though, this (your change) is still correct, eventually (prob immediately) the master will lose the socket and the agent will be able to reconnect (unless this was covering up a bug that didn't happen, ha.. so make sure to test it locally to get a feel for what the behavior is).

stoksc · 2024-02-27T14:56:27Z

master/test/integration/rm/agent_test.go

+	"gotest.tools/assert"
+)
+
+func TestAgentConnection(t *testing.T) {


can you just put this in a file beside the agents file? i'm kinda over this 'tests so far from code' pattern. even if it is basically an e2e.

also name could prob be more specific (TestDuplicateAgentConnectionsDontBoot .. ? idk, hard one)

I'll give it a try -- this started out as a unit test, drifted over to integration tests because mimicking agent behavior seemed difficult, then ended up setting up a fake agent anyway. It probably can move back to unit tests; thanks for pointing that out :)

Oh i just meant the name. Integration tests dont have to live in test/integration, that was the old pattern.

also integration == needs db or elastic, really, right now

Ah, that's really great to know -- thanks!

stoksc · 2024-02-27T14:58:58Z

master/test/integration/rm/agent_test.go

+		AgentId: testAgentID,
+	})
+	assert.NilError(t, err, "could not connect to master api to get agent")
+	assert.Assert(t, apiResp.GetAgent().GetEnabled())


we've been using require mostly instead of assert. i feel like the first failure is always the most useful and the rest is noise, so i'm happy to crash tests early. curious what you think.

Happy to switch over; I don't feel that strongly about any testing framework, just went for the one that happened to be in the other master integration tests really.

either way is fine with me

…anup

…ame-id-causes-panic

* rm: have agent.HandleWebsocketConnection return error if socket id is already taken, instead of stopping the agent with a matching id Intends to resolve [DET-7940] * add release notes * fix(?) release notes formatting for linter * capture and raise new error from handling websocket connections when agent does not yet exist * attempt again to fix rst lint issues * edit and rstfmt release note * add integration regression test for scenario where two agents with the same id try to connect * move agent duplicate conn id test to internal/rm/agentrm, misc PR cleanup

jesse-amano-hpe requested a review from a team as a code owner February 23, 2024 19:15

jesse-amano-hpe requested a review from eecsliu February 23, 2024 19:15

cla-bot bot added the cla-signed label Feb 23, 2024

determined-ci requested a review from a team February 23, 2024 19:15

determined-ci added the documentation Improvements or additions to documentation label Feb 23, 2024

jesse-amano-hpe requested review from amandavialva01 and removed request for a team February 23, 2024 19:16

determined-ci requested a review from a team February 23, 2024 19:32

amandavialva01 reviewed Feb 23, 2024

View reviewed changes

docs/release-notes/7940-socket-error.rst Outdated Show resolved Hide resolved

jesse-amano-hpe force-pushed the DET-7940-Agent-with-same-id-causes-panic branch from 7c47a4a to 5cf7305 Compare February 23, 2024 20:56

jesse-amano-hpe requested a review from a team as a code owner February 23, 2024 22:25

jesse-amano-hpe requested a review from MikhailKardash February 23, 2024 22:25

jesse-amano-hpe added 6 commits February 23, 2024 14:45

rm: have agent.HandleWebsocketConnection return error if socket id is…

4862063

… already taken, instead of stopping the agent with a matching id Intends to resolve [DET-7940]

add release notes

95b01ee

fix(?) release notes formatting for linter

8309dbb

capture and raise new error from handling websocket connections when …

841be62

…agent does not yet exist

attempt again to fix rst lint issues

fe335c3

edit and rstfmt release note

f86ca6c

jesse-amano-hpe force-pushed the DET-7940-Agent-with-same-id-causes-panic branch from 9f906cb to f86ca6c Compare February 23, 2024 22:45

jesse-amano-hpe removed request for a team and MikhailKardash February 23, 2024 22:46

eecsliu reviewed Feb 26, 2024

View reviewed changes

eecsliu requested a review from stoksc February 26, 2024 22:37

add integration regression test for scenario where two agents with th…

294496a

…e same id try to connect

amandavialva01 reviewed Feb 26, 2024

View reviewed changes

master/test/integration/rm/agent_test.go Outdated Show resolved Hide resolved

amandavialva01 reviewed Feb 26, 2024

View reviewed changes

master/test/integration/rm/agent_test.go Outdated Show resolved Hide resolved

amandavialva01 approved these changes Feb 26, 2024

View reviewed changes

stoksc approved these changes Feb 27, 2024

View reviewed changes

jesse-amano-hpe added 2 commits February 27, 2024 13:54

move agent duplicate conn id test to internal/rm/agentrm, misc PR cle…

1fa41c4

…anup

Merge remote-tracking branch 'origin/main' into DET-7940-Agent-with-s…

9a7c179

…ame-id-causes-panic

jesse-amano-hpe merged commit ad94c17 into main Feb 27, 2024
72 of 84 checks passed

jesse-amano-hpe deleted the DET-7940-Agent-with-same-id-causes-panic branch February 27, 2024 23:32

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: return error from websocket handler if socket id is taken #8877

fix: return error from websocket handler if socket id is taken #8877

jesse-amano-hpe commented Feb 23, 2024 •

edited

Loading

netlify bot commented Feb 23, 2024 •

edited

Loading

codecov bot commented Feb 23, 2024 •

edited

Loading

eecsliu left a comment

eecsliu Feb 26, 2024

jesse-amano-hpe commented Feb 26, 2024

amandavialva01 left a comment

stoksc left a comment

stoksc Feb 27, 2024 •

edited

Loading

stoksc Feb 27, 2024

stoksc Feb 27, 2024

jesse-amano-hpe Feb 27, 2024

stoksc Feb 27, 2024

stoksc Feb 27, 2024

jesse-amano-hpe Feb 27, 2024

stoksc Feb 27, 2024

jesse-amano-hpe Feb 27, 2024

stoksc Feb 27, 2024

fix: return error from websocket handler if socket id is taken #8877

fix: return error from websocket handler if socket id is taken #8877

Conversation

jesse-amano-hpe commented Feb 23, 2024 • edited Loading

Description

Test Plan

Commentary (optional)

Checklist

Ticket

netlify bot commented Feb 23, 2024 • edited Loading

✅ Deploy Preview for determined-ui ready!

codecov bot commented Feb 23, 2024 • edited Loading

Codecov Report

eecsliu left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jesse-amano-hpe commented Feb 26, 2024

amandavialva01 left a comment

Choose a reason for hiding this comment

stoksc left a comment

Choose a reason for hiding this comment

stoksc Feb 27, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jesse-amano-hpe commented Feb 23, 2024 •

edited

Loading

netlify bot commented Feb 23, 2024 •

edited

Loading

codecov bot commented Feb 23, 2024 •

edited

Loading

stoksc Feb 27, 2024 •

edited

Loading