Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: return error from websocket handler if socket id is taken #8877

Merged
merged 9 commits into from
Feb 27, 2024

Conversation

jesse-amano-hpe
Copy link
Contributor

@jesse-amano-hpe jesse-amano-hpe commented Feb 23, 2024

Description

I'm taking this on as a spin-up task and still learning; please give this one appropriate scrutiny.

My understanding is that the handler for incoming websocket connections would mark an agent as terminated if any errors occurred while handling new connections on the socket id (based on agent id), including the error in the case that there's already a working connection on that socket. This would cause both agents to fail, instead of leaving the original up and just terminating the dupe.

I think this still leaves open the possibility that an agent with a duplicate ID manages to connect while the original's connection is briefly unstable, and the dupe becomes the "original", but since the previous state of affairs was that both agents would die no matter what, I think this is a net improvement and addresses the main issue stated on Tim's bug report.

Test Plan

Pass new integration test.

Commentary (optional)

The error message still isn't especially clear to whoever launches the second agent ("connection refused" is very generic); I'd be interested in suggestions on how the error condition could be communicated and propagated more clearly, or if we think it's fine as-is.

image

Following the repro steps described on the Jira ticket, I configured a local devcluster with two agents that both have agent_id: agent1. On start, without any additional input, the first agent starts up and stays up, and the second agent stops and stays stopped. Then I connected two more agents with the command:

docker run --rm -v /var/run/docker.sock:/var/run/docker.sock -e DET_AGENT_ID="test.id" -e DET_MASTER_HOST=127.0.0.1 -e DET_MASTER_PORT=8080 --net=host determinedai/determined-agent:latest

The first came up and stayed up; the second failed quickly with an error message:

FATA[2024-02-23T18:45:50Z] long disconnected and unable to reconnect to master: initial connection to master failed: error dialing master: dial tcp 127.0.0.1:8080: connect: connection refused 

(I also did this before making changes and observed the same issues Tim did.)

Checklist

  • Changes have been manually QA'd
  • User-facing API changes need the "User-facing API Change" label.
  • Release notes should be added as a separate file under docs/release-notes/.
    See Release Note for details.
  • Licenses should be included for new code which was copied and/or modified from any external code.

Ticket

DET-7940

@jesse-amano-hpe jesse-amano-hpe requested a review from a team as a code owner February 23, 2024 19:15
@cla-bot cla-bot bot added the cla-signed label Feb 23, 2024
@determined-ci determined-ci requested a review from a team February 23, 2024 19:15
@determined-ci determined-ci added the documentation Improvements or additions to documentation label Feb 23, 2024
@jesse-amano-hpe jesse-amano-hpe requested review from amandavialva01 and removed request for a team February 23, 2024 19:16
Copy link

netlify bot commented Feb 23, 2024

Deploy Preview for determined-ui ready!

Name Link
🔨 Latest commit 9a7c179
🔍 Latest deploy log https://app.netlify.com/sites/determined-ui/deploys/65de5a53534373000727d2c5
😎 Deploy Preview https://deploy-preview-8877--determined-ui.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

@determined-ci determined-ci requested a review from a team February 23, 2024 19:32
Copy link

codecov bot commented Feb 23, 2024

Codecov Report

Attention: Patch coverage is 85.71429% with 1 lines in your changes are missing coverage. Please review.

Project coverage is 47.70%. Comparing base (d24b19a) to head (9a7c179).
Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8877      +/-   ##
==========================================
+ Coverage   47.53%   47.70%   +0.17%     
==========================================
  Files        1066     1066              
  Lines      170433   170434       +1     
  Branches     2237     2235       -2     
==========================================
+ Hits        81018    81311     +293     
+ Misses      89257    88965     -292     
  Partials      158      158              
Flag Coverage Δ
backend 44.15% <85.71%> (+0.79%) ⬆️
harness 63.69% <ø> (ø)
web 42.53% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
master/internal/rm/agentrm/agents.go 63.30% <100.00%> (+30.87%) ⬆️
master/internal/rm/agentrm/agent.go 15.57% <80.00%> (+15.57%) ⬆️

... and 3 files with indirect coverage changes

@jesse-amano-hpe jesse-amano-hpe force-pushed the DET-7940-Agent-with-same-id-causes-panic branch from 9f906cb to f86ca6c Compare February 23, 2024 22:45
Copy link
Contributor

@eecsliu eecsliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems good to me, but I'll invoke @stoksc to verify

Also, I saw the ticket ID in the branch name, but could you add it to the description or the PR title?

a.stop(err)
return
return nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we could omit this return and just fall through to the last return nil case

@eecsliu eecsliu requested a review from stoksc February 26, 2024 22:37
@jesse-amano-hpe
Copy link
Contributor Author

seems good to me, but I'll invoke @stoksc to verify

Also, I saw the ticket ID in the branch name, but could you add it to the description or the PR title?

Thanks!

The ticket ID is at the bottom of the description, under the Checklist.

Copy link
Contributor

@amandavialva01 amandavialva01 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, awesome work with this!

Copy link
Contributor

@stoksc stoksc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, just some non-blocking so to chat over

a.mu.Lock()
defer a.mu.Unlock()

err := a.handleWebsocketConnection(msg)
if err != nil {
if errors.Is(err, errWebsocketAlreadyConnected) {
Copy link
Contributor

@stoksc stoksc Feb 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think the reason for this behavior is to cover the case where 'agent restarted and tried to reconnect but we haven't noticed yet but we wanna recover faster'. even in that case though, this (your change) is still correct, eventually (prob immediately) the master will lose the socket and the agent will be able to reconnect (unless this was covering up a bug that didn't happen, ha.. so make sure to test it locally to get a feel for what the behavior is).

"gotest.tools/assert"
)

func TestAgentConnection(t *testing.T) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you just put this in a file beside the agents file? i'm kinda over this 'tests so far from code' pattern. even if it is basically an e2e.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also name could prob be more specific (TestDuplicateAgentConnectionsDontBoot .. ? idk, hard one)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll give it a try -- this started out as a unit test, drifted over to integration tests because mimicking agent behavior seemed difficult, then ended up setting up a fake agent anyway. It probably can move back to unit tests; thanks for pointing that out :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh i just meant the name. Integration tests dont have to live in test/integration, that was the old pattern.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also integration == needs db or elastic, really, right now

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, that's really great to know -- thanks!

AgentId: testAgentID,
})
assert.NilError(t, err, "could not connect to master api to get agent")
assert.Assert(t, apiResp.GetAgent().GetEnabled())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we've been using require mostly instead of assert. i feel like the first failure is always the most useful and the rest is noise, so i'm happy to crash tests early. curious what you think.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy to switch over; I don't feel that strongly about any testing framework, just went for the one that happened to be in the other master integration tests really.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

either way is fine with me

@jesse-amano-hpe jesse-amano-hpe merged commit ad94c17 into main Feb 27, 2024
72 of 84 checks passed
@jesse-amano-hpe jesse-amano-hpe deleted the DET-7940-Agent-with-same-id-causes-panic branch February 27, 2024 23:32
maxrussell pushed a commit that referenced this pull request Mar 21, 2024
* rm: have agent.HandleWebsocketConnection return error if socket id is already taken, instead of stopping the agent with a matching id

Intends to resolve [DET-7940]

* add release notes

* fix(?) release notes formatting for linter

* capture and raise new error from handling websocket connections when agent does not yet exist

* attempt again to fix rst lint issues

* edit and rstfmt release note

* add integration regression test for scenario where two agents with the same id try to connect

* move agent duplicate conn id test to internal/rm/agentrm, misc PR cleanup
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla-signed documentation Improvements or additions to documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants