Recover faster after network outage #49

Open
bajtos opened this issue Dec 14, 2023 · 3 comments
Labels
good first issue 🤗 Good for newcomers

Comments

@bajtos
Member

bajtos commented Dec 14, 2023

#47 increased the delay between retrievals to ~60 seconds. We are now waiting for 60 seconds before we try to connect to spark-api after being offline.

Workaround: restart the Station after coming online.

Proposed fix:

  • detect whether we are online (see ActivityState.#healthy)
  • if we are offline, reduce the delay between iterations to something like 3-5 seconds
@bajtos bajtos added the good first issue 🤗 Good for newcomers label Dec 14, 2023
@bajtos
Member Author

bajtos commented Dec 14, 2023

@juliangruber
Member

> #47 increased the delay between retrievals to ~60 seconds. We are now waiting for 60 seconds before we try to connect to spark-api after being offline.
>
> Workaround: restart the Station after coming online.

I want to make sure I understand the problem statement. Why is it a problem to wait 60 seconds after having been offline? Isn't it ok to be offline, then wait 60 seconds, then try again? And why does restarting Station fix this?

@bajtos
Member Author

bajtos commented Jan 8, 2024

Here is what I observed:

  • Sometimes, when I wake my computer from sleep and connect to the network, I see that the Station icon in the tray/menubar indicates offline status.
  • I know my computer is online because I can browse the web, but the Station still stays offline.
  • When I restart the Station, it comes online almost immediately.

This behaviour creates an impression that the Station cannot correctly detect the transition of the computer from offline to online. (Personally, I perceive such behaviour as the app developers' sloppiness, and I don't want to perceive myself as a sloppy person.)

> why does restarting Station fix this?

IIUC, the Station decides whether we are offline or online based on the outcome of a SPARK iteration. The Station goes offline when SPARK cannot fetch round details or submit the measurement. When we are offline, and SPARK reports that it was able to fetch round details, we go back online.
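The transitions described above can be sketched as a small state tracker. This is a hypothetical illustration, not the actual implementation in the repository: only the names `ActivityState`, `#healthy`, `onHealthy` and `onError` come from this thread, while the `isHealthy` getter and the log messages are assumptions made for the example.

```javascript
// Hypothetical sketch of the online/offline tracking described above.
// Only the class name, the #healthy field and the onHealthy/onError
// method names come from the thread; everything else is assumed.
class ActivityState {
  // null = unknown (no iteration has finished yet),
  // true = last iteration succeeded, false = last iteration failed
  #healthy = null

  // Called after an iteration that fetched round details and
  // submitted the measurement successfully.
  onHealthy () {
    if (this.#healthy === true) return // no transition, nothing to report
    this.#healthy = true
    console.log('SPARK is online')
  }

  // Called when an iteration failed, e.g. spark-api was unreachable.
  onError () {
    if (this.#healthy === false) return // already offline
    this.#healthy = false
    console.log('SPARK is offline')
  }

  // Assumed accessor: lets the main loop read the private state.
  get isHealthy () {
    return this.#healthy === true
  }
}
```

Because the state only changes when an iteration finishes, the time it takes to notice that the network is back is bounded by the delay between iterations.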

This worked well when the delay between jobs was ~10 seconds. It no longer works with the current ~60-second delay because it can take up to 60 seconds before Station/SPARK can detect that we are back online.

When I restart the Station, SPARK starts the next job immediately and therefore the Station quickly transitions to the online status.

Here is the main SPARK loop:

spark/lib/spark.js

Lines 165 to 187 in fc756cf

  async run () {
    while (true) {
      const started = Date.now()
      try {
        await this.nextRetrieval()
        this.#activity.onHealthy()
      } catch (err) {
        if (err.statusCode === 400 && err.serverMessage === 'OUTDATED CLIENT') {
          this.#activity.onOutdatedClient()
        } else {
          this.#activity.onError()
        }
        console.error(err)
      }
      const duration = Date.now() - started
      const baseDelay = APPROX_ROUND_LENGTH_IN_MS / this.#maxTasksPerNode
      const delay = baseDelay - duration
      if (delay > 0) {
        console.log('Sleeping for %s seconds before starting the next task...', Math.round(delay / 1000))
        await sleep(delay)
      }
    }
  }

I propose modifying the following line to calculate different delays based on whether we are in a healthy (online) state.

const delay = baseDelay - duration
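One possible shape for that change, as a minimal sketch rather than a finished patch: the helper `calculateDelay` and the constant `OFFLINE_RETRY_DELAY_IN_MS` are hypothetical names, and the 5-second value is just one pick from the "3-5 seconds" range suggested earlier in this issue.

```javascript
// Assumed value: the thread only suggests "something like 3-5 seconds".
const OFFLINE_RETRY_DELAY_IN_MS = 5000

/**
 * Hypothetical helper: how long to sleep before the next iteration.
 * @param {object} args
 * @param {number} args.baseDelay Regular delay between tasks, in milliseconds
 * @param {number} args.duration How long the last iteration took, in milliseconds
 * @param {boolean} args.healthy Whether the last iteration succeeded
 * @returns {number} Delay in milliseconds, never negative
 */
function calculateDelay ({ baseDelay, duration, healthy }) {
  // While offline, retry sooner, but never wait longer than the regular delay.
  const targetDelay = healthy
    ? baseDelay
    : Math.min(baseDelay, OFFLINE_RETRY_DELAY_IN_MS)
  // Subtract the time already spent in the iteration, as the current loop does.
  return Math.max(0, targetDelay - duration)
}
```

In the loop, the line above would then become something like `const delay = calculateDelay({ baseDelay, duration, healthy })`. Since `#healthy` is private to `ActivityState`, the loop would need some way to read it, for example a new getter; that accessor is an assumption and does not appear in the excerpt above.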

Projects
Status: 🗃 backlog
Development

No branches or pull requests

2 participants