Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: don't access agentState when it may be nil #8921

Merged
merged 4 commits into from
Mar 6, 2024

Conversation

stoksc
Copy link
Contributor

@stoksc stoksc commented Feb 28, 2024

Description

AgentState is only set on the agent master-side whenever it is "started", which is when we've made it through all the initialization messages. But an agent can crash while it is starting, and then this access into AgentState causes a NPE, which probably ends up being a convincing red herring when debugging.

Test Plan

I'll follow up to add tests in a bit.

Commentary (optional)

Yep.

Checklist

  • Changes have been manually QA'd
  • User-facing API changes need the "User-facing API Change" label.
  • Release notes should be added as a separate file under docs/release-notes/.
    See Release Note for details.
  • Licenses should be included for new code which was copied and/or modified from any external code.

Ticket

RM-53

@stoksc stoksc requested a review from a team as a code owner February 28, 2024 22:15
@cla-bot cla-bot bot added the cla-signed label Feb 28, 2024
Copy link

codecov bot commented Feb 28, 2024

Codecov Report

Attention: Patch coverage is 22.22222% with 7 lines in your changes are missing coverage. Please review.

Project coverage is 47.31%. Comparing base (a5b425a) to head (1632c06).
Report is 5 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8921      +/-   ##
==========================================
+ Coverage   47.27%   47.31%   +0.04%     
==========================================
  Files        1162     1162              
  Lines      176114   176115       +1     
  Branches     2235     2237       +2     
==========================================
+ Hits        83259    83336      +77     
+ Misses      92697    92621      -76     
  Partials      158      158              
Flag Coverage Δ
backend 42.52% <22.22%> (+0.18%) ⬆️
harness 63.94% <ø> (ø)
web 42.53% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
master/internal/rm/agentrm/agent.go 25.42% <22.22%> (+9.84%) ⬆️

... and 4 files with indirect coverage changes

Copy link

netlify bot commented Feb 28, 2024

Deploy Preview for determined-ui canceled.

Name Link
🔨 Latest commit 1632c06
🔍 Latest deploy log https://app.netlify.com/sites/determined-ui/deploys/65e7a4b529430800082570cd

nil,
func() {},
)
require.NotPanics(t, func() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

such a simple test, yet if you revert the commit, it fails.

@erikwilson
Copy link
Contributor

I am having a hard time following the stack trace, but it looks like there might be a cause here: https://github.com/determined-ai/determined/pull/8921/files#diff-592e76f1ff2b91e03db4c9d3c585051753ad4584904f64ea7e0275356907474bL881 ?

The other time I have seen this is when the agent config that we pass from the server is bad the agent will early disconnect. In this case I think the agent was just i/o blocked from too many containers, possibly restarted or reconnected, but was unable to complete the network i/o operations quickly enough on restart.

@stoksc
Copy link
Contributor Author

stoksc commented Feb 28, 2024

Ah yeah, you're right. It's 2 panics. Once initially and one in a defer. I think that's why it looks so weird.

@stoksc
Copy link
Contributor Author

stoksc commented Feb 28, 2024

If you run

func TestAgentFastFailAfterFirstConnect2(t *testing.T) {
	a := newAgent(
		"test",
		queue.New[agentUpdatedEvent](),
		"default",
		&config.ResourcePoolConfig{},
		&aproto.MasterSetAgentOptions{},
		nil,
		func() {},
	)
	require.NotPanics(t, func() {
		a.HandleWebsocketDisconnect()
	})
}

you'll see the exact stack trace. With the other test you'll see all but the first 2 lines.

@erikwilson
Copy link
Contributor

Also, I am slow and just realized we default to t2.xlarge for our aux deployments. My suggestion would be to stay away from the burstable instance types for these deployments as we are more likely to see these i/o issues.

@stoksc
Copy link
Contributor Author

stoksc commented Feb 28, 2024

OK, fixed that and added a more involved but probably better test.

@stoksc stoksc force-pushed the stoksc/fix/npe-on-agent-crash branch 2 times, most recently from d580bc9 to d4a02f0 Compare February 28, 2024 23:25
@stoksc
Copy link
Contributor Author

stoksc commented Feb 28, 2024

Before I land I'm gonna carefully for other agentState NPEs, too.

@stoksc stoksc force-pushed the stoksc/fix/npe-on-agent-crash branch from d4a02f0 to 3b1fb13 Compare February 29, 2024 17:20
@stoksc stoksc requested a review from a team as a code owner February 29, 2024 17:20
@stoksc stoksc requested a review from hamidzr February 29, 2024 17:20
@stoksc stoksc force-pushed the stoksc/fix/npe-on-agent-crash branch from 3b1fb13 to c811f51 Compare March 4, 2024 20:15
@stoksc stoksc enabled auto-merge (squash) March 5, 2024 23:08
@stoksc stoksc disabled auto-merge March 6, 2024 01:48
@stoksc
Copy link
Contributor Author

stoksc commented Mar 6, 2024

Merging over code cov. I added an if around 5 lines of untested code, but I tested the meat of the change pretty effectively I think.

@stoksc stoksc merged commit 191a144 into main Mar 6, 2024
69 of 82 checks passed
@stoksc stoksc deleted the stoksc/fix/npe-on-agent-crash branch March 6, 2024 01:49
maxrussell pushed a commit that referenced this pull request Mar 21, 2024
AgentState is only set on the agent master-side whenever it is "started", which
is when we've made it through all the initialization messages. But an agent can
crash while it is starting, and then this access into AgentState causes a NPE,
which probably ends up being a convincing red herring when debugging.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants