-
Notifications
You must be signed in to change notification settings - Fork 349
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: don't access agentState when it may be nil #8921
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #8921 +/- ##
==========================================
+ Coverage 47.27% 47.31% +0.04%
==========================================
Files 1162 1162
Lines 176114 176115 +1
Branches 2235 2237 +2
==========================================
+ Hits 83259 83336 +77
+ Misses 92697 92621 -76
Partials 158 158
Flags with carried forward coverage won't be shown. Click here to find out more.
|
✅ Deploy Preview for determined-ui canceled.
|
nil, | ||
func() {}, | ||
) | ||
require.NotPanics(t, func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
such a simple test, yet if you revert the commit, it fails.
I am having a hard time following the stack trace, but it looks like there might be a cause here: https://github.com/determined-ai/determined/pull/8921/files#diff-592e76f1ff2b91e03db4c9d3c585051753ad4584904f64ea7e0275356907474bL881 ? The other time I have seen this is when the agent config that we pass from the server is bad the agent will early disconnect. In this case I think the agent was just i/o blocked from too many containers, possibly restarted or reconnected, but was unable to complete the network i/o operations quickly enough on restart. |
Ah yeah, you're right. It's 2 panics. Once initially and one in a defer. I think that's why it looks so weird. |
If you run
you'll see the exact stack trace. With the other test you'll see all but the first 2 lines. |
Also, I am slow and just realized we default to |
OK, fixed that and added a more involved but probably better test. |
d580bc9
to
d4a02f0
Compare
Before I land I'm gonna carefully for other agentState NPEs, too. |
d4a02f0
to
3b1fb13
Compare
3b1fb13
to
c811f51
Compare
Merging over code cov. I added an if around 5 lines of untested code, but I tested the meat of the change pretty effectively I think. |
AgentState is only set on the agent master-side whenever it is "started", which is when we've made it through all the initialization messages. But an agent can crash while it is starting, and then this access into AgentState causes a NPE, which probably ends up being a convincing red herring when debugging.
Description
AgentState is only set on the agent master-side whenever it is "started", which is when we've made it through all the initialization messages. But an agent can crash while it is starting, and then this access into AgentState causes a NPE, which probably ends up being a convincing red herring when debugging.
Test Plan
I'll follow up to add tests in a bit.
Commentary (optional)
Yep.
Checklist
docs/release-notes/
.See Release Note for details.
Ticket
RM-53