Roadmap

OpenPAI Roadmap

We typically look 6 to 12 months ahead and establish the topics we want to work on. As we go, we learn, and our assessment of some of the listed topics changes. Thus, we may add or drop topics along the way.

We describe some initiatives as "investigations", which means our goal in the next few months is to better understand the problem and potential solutions before scheduling actual feature work. Once an investigation is done, we will update our plan, either deferring the initiative or committing to it.

The up-to-date backlog is tracked in the Backlog issue. Iteration plans can be found under Iteration Plans.

As always, we will listen to your feedback and adapt our plans if needed.

Themes

Our roadmap covers the following themes:

User themes

Easy to use: A simple, just-enough UX for data scientists, researchers, and students
Easy to debug: An easy-to-understand, easy-to-use experience for debugging training jobs

Admin and Ops themes

Easy to manage: A smooth installation, upgrade, and maintenance experience for IT admins
Easy to operate: Easy-to-use metrics that help Operations understand resource usage

Easy to use

Provide filters for jobs list page #302 @Gerhut

Refine Job detail page
- First round of refinement #2211 @sunqinzheng @qfyin
- Tensorboard direct link option in job page
Make left navigation tree items configurable for IT admins

Instead of having users set GPU, memory, and cores manually and separately, PAI exposes simple SKUs that are available in this PAI instance for users to select #2062 @debuggy @qfyin
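
The SKU idea can be illustrated with a minimal sketch. It is not OpenPAI's implementation: the `Sku` class, the SKU names, the resource sizes, and the `resources_for` helper are all hypothetical and chosen only to show how one named selection could replace three separate resource settings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sku:
    """Hypothetical resource bundle; field names are illustrative only."""
    gpus: int
    cpu_cores: int
    memory_mb: int


# Hypothetical SKUs an admin might publish for one PAI instance.
AVAILABLE_SKUS = {
    "small": Sku(gpus=1, cpu_cores=4, memory_mb=16384),
    "large": Sku(gpus=4, cpu_cores=16, memory_mb=65536),
}


def resources_for(sku_name: str, count: int = 1) -> dict:
    """Expand a SKU choice into the values a user would otherwise set by hand."""
    if sku_name not in AVAILABLE_SKUS:
        raise ValueError(
            f"unknown SKU {sku_name!r}; choose one of {sorted(AVAILABLE_SKUS)}"
        )
    if count < 1:
        raise ValueError("count must be at least 1")
    sku = AVAILABLE_SKUS[sku_name]
    return {
        "gpus": sku.gpus * count,
        "cpu_cores": sku.cpu_cores * count,
        "memory_mb": sku.memory_mb * count,
    }
```

With this shape, a user would ask for `resources_for("small", count=2)` rather than entering GPU, core, and memory numbers individually, and a request for a SKU that the instance does not offer fails early with the list of valid choices.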

Improve user onboarding efficiency #2349:
- Provide better admin setup guidance and user guides in the UI/command.
- Provide better documentation and best practices for admin and users about how to use NFS as PAI's storage
- data/code/model: NFS and HDFS usage in the portal
- job submission: Clear and simple job workflow navigation
- experiment results: Simple and easy-to-find User Logs for experiments
- diagnostic: Clear and useful information for diagnostics (self-service)
Accurate Job Status, Failure Reason (Category), and useful Resolutions #2326
A user home page that provides a consolidated view of all of 'my' own work.
As a PAI user, I need to better understand how resources will/are/were allocated for my jobs, so that I know how to better compile my job scripts. Related Issues: #2062, #1943, #1989, #1777, #1819, #1904, #1995, #1968
Job list view shows GPU and Task counts
As an operator, I need to know the best practices for using PAI VCs, and I need better manageability of VCs. Related issues: #2073, #906
Provide better documentation, examples, and best practices for checkpoints, so that a job can retry from its most recent progress
PAIShare and Marketplace scenarios

Easy to debug

Better Job Debugging Experience for End Users #2210 @ydye
- Job debugging reservation when a job fails due to user error. #2213 @ydye
- An option for users to decide whether or not to enable debugging reservation for a job. #2214 @ydye
- An approach to collect information about the container that is reserved for job debugging. #2215
- Display the debugging reservation status of a job in the webportal #2216
- An approach to notify users when their jobs are in debugging reservation #2217
- Provide detailed information when the job container exits. #2218

Easy to manage

Team-wise storage management support #2204 @ydye @wangdian
Backward compatible upgrading #2212 @hao1939

Role-based access control
- User account integration with AAD #1663
A complete story for storage support
Persistent logs for jobs
GPU scheduling with priority
A complete story about VC bonus and per-job preemption choice. #2340
As a running service, we should not expose too much info to unknown users before login.
PAI everywhere
- Run PAI on an existing Kubernetes cluster
Support allocating resources for VCs by quantity instead of percentage
HA support for OpenPAI
Cluster/Machine auto maintenance

Easy to operate

Detect and alert on unhealthy GPUs #2192 @mzmssg @xudifsd
Provide the ability to query all the jobs on a node in the PAI Web Portal #2128 @xudifsd
Display resource utilization metrics per VC/queue in Grafana #2208 @xudifsd

Ability to generate reports for cluster/vc/resources/users/jobs usage #2127
Detect and alert on low-utilization jobs (https://github.com/Microsoft/pai/issues/2127)
GPU status summary
List the utilization of all GPUs on one machine
As an operator, I need the capability to batch create/update/delete user accounts and share them with users through their email. Related Issues: #2078, #2085, #921

Better foundation

In addition to the above themes, there are fundamental architecture improvements that need to be made to support all the great features and experiences:

End to end job event tracking/logging support
Support automatic hyperparameter tuning, or running a job as an (NNI) experiment with sub-jobs
Easy-to-customize Webportal

If you have any questions or concerns about this wiki, please open an OpenPAI issue directly.
