This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

Inform the user when jobs status change #5337

Open
2 tasks
suiguoxin opened this issue Mar 2, 2021 · 3 comments

@suiguoxin
Member

suiguoxin commented Mar 2, 2021

Motivation

Some jobs may fail unexpectedly.
If users are informed when their jobs fail, they can handle the issue in time.
This saves them from constantly checking job status.

The same applies to other status changes.

Background:

  • This feature should be configured per job rather than per user
  • The trigger event can be
    • Start Running
    • Failed
    • Succeeded
    • WaitingTooLong
  • Notifications can be sent to users by email or through the webportal, and the method should be configurable
    • some notification methods may be unavailable if the admin has not enabled them

Design

Workflow:

  • Part 1: Job / User configuration
  • Part 2: monitor & trigger corresponding alerts
  • Part 3: alerts handling

Part 1: Job / User configuration

  • What alerts to send is configured by job:
    • enable this feature in the job protocol, in the field extras -> jobStatusChangeNotification
    • support further modification after jobs get submitted
extras:
  com.microsoft.pai.runtimeplugin:
    - plugin: ssh
      parameters:
        jobssh: true
  hivedScheduler:
    taskRoles:
      taskrole:
        skuNum: 1
        skuType: GENERIC-WORKER
  jobStatusChangeNotification: 
    running: false
    succeeded: true
    stopped: false
    failed: true
    retried: false
  • How to send alerts is configured by user: in the user-profile page, the user selects from the available actions:
    • webportal notification
    • email notification: this action is only available when 1) the user's email is not empty and 2) the email-user action is available in alert-handler
{
    "username": "gusui",
    "email": "gusui@microsoft.com",
    "extension": {
        "sshKeys": [],
        "getJobStatusChangeNotificationBy": {
            "email": true,
            "webportal": true
        }
    }
}
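
To make the split between the two configurations concrete, here is a minimal Python sketch of how the per-job switches and the per-user delivery preferences might be combined. The helper name `channels_for_event` and the `admin_enabled` argument are hypothetical; only the field names `jobStatusChangeNotification` and `getJobStatusChangeNotificationBy` come from the examples above.

```python
from typing import Any, Dict, List, Set


def channels_for_event(
    job_extras: Dict[str, Any],
    user_profile: Dict[str, Any],
    event: str,
    admin_enabled: Set[str],
) -> List[str]:
    """Return the channels through which `event` should be delivered.

    Hypothetical helper: an event the job did not opt into yields no
    channels. Otherwise the user's choices are filtered by what the admin
    enabled, and email additionally requires a non-empty address.
    """
    job_switches = job_extras.get("jobStatusChangeNotification", {})
    if not job_switches.get(event, False):
        return []

    extension = user_profile.get("extension", {})
    wanted = extension.get("getJobStatusChangeNotificationBy", {})

    channels = []
    for channel in ("webportal", "email"):
        if not wanted.get(channel, False) or channel not in admin_enabled:
            continue
        if channel == "email" and not user_profile.get("email"):
            continue
        channels.append(channel)
    return channels
```

With the job config and user profile shown above, a `failed` event would go to both channels if the admin enabled both, while a `running` event would go nowhere because the job sets `running: false`.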

Part 2: monitor & trigger corresponding alerts

design with DB

  • add the following columns to the framework table in DB:
notificationAtRunning | BOOLEAN
notifiedAtRunning | BOOLEAN
notificationAtSucceeded | BOOLEAN
notifiedAtSucceeded | BOOLEAN
notificationAtFailed | BOOLEAN
notifiedAtFailed | BOOLEAN
notificationAtRetried | BOOLEAN
notifiedAtRetried | INTEGER (the Nth retry has been notified)

These columns store the job's notification config and the state of the alerts already sent.

  • add a container framework-status-notification-poller in alert-manager, which
    • watches the framework table in the DB
    • sends an alert when the notification is enabled in the job config and the alert has not been sent yet
    • updates the framework table after the alert has been sent to alert-manager successfully
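
The poller's decision logic could be sketched as below. This is an illustration under assumptions, not the actual implementation: a row is modelled as a plain dict keyed by the proposed column names, and the `occurred` and `retry_count` arguments are hypothetical stand-ins for whatever the framework table exposes about the job's current state.

```python
from typing import Any, Dict, List, Set

# Events whose "notified" column is a BOOLEAN in the proposed schema.
BOOLEAN_EVENTS = ("Running", "Succeeded", "Failed")


def pending_notifications(
    row: Dict[str, Any], occurred: Set[str], retry_count: int = 0
) -> List[str]:
    """List events that happened, are enabled for the job, and are not yet sent."""
    pending = []
    for event in BOOLEAN_EVENTS:
        enabled = row.get(f"notificationAt{event}", False)
        sent = row.get(f"notifiedAt{event}", False)
        if event in occurred and enabled and not sent:
            pending.append(event)
    # notifiedAtRetried is an integer: the last retry that was notified.
    if row.get("notificationAtRetried") and retry_count > (row.get("notifiedAtRetried") or 0):
        pending.append("Retried")
    return pending


def mark_notified(row: Dict[str, Any], event: str, retry_count: int = 0) -> Dict[str, Any]:
    """Return a copy of the row updated after an alert was sent successfully."""
    updated = dict(row)
    if event == "Retried":
        updated["notifiedAtRetried"] = retry_count
    else:
        updated[f"notifiedAt{event}"] = True
    return updated
```

Calling `mark_notified` only after the alert has been accepted means a failed send leaves the row untouched, so the next polling round retries it.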

Part 3: alerts handling

  • src/alert-manager/deploy/alert-manager-configmap.yaml: add a new receiver and a new route
  • alert-handler: add an email template inform-user-job-status-change
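
For the new route to match, the poller has to send alerts with recognizable labels. Below is a minimal sketch of such a payload, shaped as a list of alerts with `labels`, `annotations` and `startsAt`, which is the form Alertmanager's v2 alerts API accepts. The alert name `PAIJobStatusChange` and the label names are assumptions chosen for illustration, not names taken from the configmap.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List, Optional


def build_job_status_alert(
    job_name: str,
    username: str,
    status: str,
    now: Optional[datetime] = None,
) -> List[Dict[str, Any]]:
    """Build a one-element alert list describing a job status change."""
    now = now or datetime.now(timezone.utc)
    return [
        {
            "labels": {
                "alertname": "PAIJobStatusChange",  # hypothetical name for the route to match
                "job_name": job_name,
                "username": username,
                "status": status,
                "severity": "info",
            },
            "annotations": {
                "summary": f"Job {job_name} of user {username} is now {status}.",
            },
            # RFC 3339 timestamp of when the status change was observed.
            "startsAt": now.isoformat(),
        }
    ]
```

The email template could then read `job_name`, `username` and `status` from the labels when rendering the message.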

Archive

Problem with watching the k8s Framework object: the watch is not stable and may miss certain status changes.

Proposal 1

  • add a container framework-status-notification-poller in alert-manager, which
    • watches frameworks through the k8s API
    • sends an alert when a framework fails and this feature is enabled

Proposal 2

  • Job Exporter:

    • add a container which monitors Framework status and exports the following metric:
      • job_status(job_name="demo_job", username="demo_user",virtual_cluster="nni", status="running", pai_service_name="job-exporter", notification_status=["succeed", "failed"])
      • value: 0/1/2/3 (waiting/running/succeed/failed)
      • export value only at job status changes instead of exporting with a fixed frequency
  • Benefits: useful for averageWaitingTime, failingRate, & other statistics

  • Prometheus:

- alert: PAIJobSucceed
  # assumes notification_status is exported as one string label, e.g. "succeed,failed"
  expr: max by (job_name) (max_over_time(job_status{notification_status=~".*succeed.*"}[1m])) == 2
  labels:
    severity: warn
# - alert: PAIJobFailed
#   expr: changes(job_status{failureNotification="true"}[1m]) > 0 and job_status == 3
#   labels: 
#     severity: warn
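
The export-on-change behaviour proposed for the job exporter in Proposal 2 could be sketched as follows. The value mapping is the one listed above (0/1/2/3 for waiting/running/succeed/failed); the function name `status_changes` and the input shape are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

# Metric values as listed in Proposal 2.
STATUS_VALUES = {"waiting": 0, "running": 1, "succeed": 2, "failed": 3}


def status_changes(observations: Iterable[Tuple[str, str]]) -> List[Tuple[str, int]]:
    """Turn (job_name, status) observations into samples emitted only on change."""
    last_seen: Dict[str, str] = {}
    samples = []
    for job_name, status in observations:
        if last_seen.get(job_name) == status:
            continue  # status unchanged since the last observation: export nothing
        last_seen[job_name] = status
        samples.append((job_name, STATUS_VALUES[status]))
    return samples
```

Repeated observations of the same status produce no sample, so the exported series only contains the transitions that an alert rule would care about.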
@suiguoxin suiguoxin changed the title Inform the user when job fails Inform the user when jobs fail Mar 2, 2021
@suiguoxin suiguoxin self-assigned this Mar 11, 2021
@fanyangCS
Contributor

fanyangCS commented Mar 11, 2021

The notification is also useful when the job succeeds. Maybe the feature could be rephrased as: notifying the user when a job completes.

@fanyangCS
Contributor

related to #2235 #3640

@suiguoxin suiguoxin changed the title Inform the user when jobs fail Inform the user when jobs status change Mar 11, 2021
@yiyione yiyione mentioned this issue Apr 26, 2021
16 tasks
@suiguoxin
Member Author

suiguoxin commented May 12, 2021

Work Items

Part 1: Job / User configuration

Part 2: monitor & trigger corresponding alerts P0

Part 3: alerts handling P0 #5492

  • src/alert-manager/deploy/alert-manager-configmap.yaml: add a new receiver and a new route
  • alert-handler: add an email template job-status-change-alert

Doc
