This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

Support aml #2615

Merged
merged 109 commits into from
Jul 1, 2020
Changes from 85 commits
dcd2ffd
Merge pull request #251 from microsoft/master
SparkSnail May 29, 2020
a738331
init changes
squirrelsc Jun 2, 2020
3177aeb
Merge remote-tracking branch 'official/master' into 2391-reuse-job
squirrelsc Jun 2, 2020
2aafac1
refactors
squirrelsc Jun 3, 2020
0435b7f
refactoring
squirrelsc Jun 3, 2020
2e5ef51
minor fix, and take some review comments.
squirrelsc Jun 3, 2020
6d7bc62
move reuse to upper level
squirrelsc Jun 3, 2020
c67b162
support multi nodes
Jun 4, 2020
e13a620
fix eslint errors
Jun 4, 2020
59d4a71
support multi environments better
Jun 5, 2020
eae0540
Merge remote-tracking branch 'official/master' into 2391-reuse-job
Jun 5, 2020
81c49cf
code refactor
Jun 5, 2020
92cab3a
fix openpai yaml format
Jun 5, 2020
0674d88
fix k8s yaml schema
Jun 5, 2020
e5b9665
rename forward training service
Jun 5, 2020
67ef648
Merge remote-tracking branch 'official/master' into 2391-reuse-job
Jun 5, 2020
3b8b6fb
Merge pull request #252 from microsoft/master
SparkSnail Jun 7, 2020
1e626fd
add trialService
Jun 9, 2020
c6b6061
not send stop for single node
Jun 9, 2020
916e444
Merge pull request #253 from microsoft/master
SparkSnail Jun 15, 2020
b8e47be
rename environmentManager to trialDispatcher
Jun 16, 2020
0ee933a
support no central storage service
Jun 16, 2020
c7973be
init
SparkSnail Jun 16, 2020
c094057
improve delopment support
Jun 16, 2020
c2735d3
Merge remote-tracking branch 'official/master' into 2391-reuse-job
Jun 16, 2020
d0b2504
use latest storage component
Jun 16, 2020
c8d4696
add gpu info
Jun 16, 2020
1a9f19f
work version
SparkSnail Jun 16, 2020
3f4c177
separate channel and add gpu collector in runner
Jun 16, 2020
648e0bb
Merge branch '2391-reuse-job' of https://github.com/squirrelsc/nni in…
SparkSnail Jun 16, 2020
2fa4a77
init
SparkSnail Jun 16, 2020
caeffb8
Merge pull request #254 from microsoft/master
SparkSnail Jun 17, 2020
d0768b0
add more GPU information, and improve debugging.
squirrelsc Jun 17, 2020
8dff16f
fix GPU info collector
Jun 17, 2020
bea8ed6
Merge branch '2391-reuse-job' of https://github.com/squirrelsc/nni in…
SparkSnail Jun 18, 2020
e297aa5
update
SparkSnail Jun 18, 2020
500c1cb
channel support single file
Jun 18, 2020
d880512
refine code, and implement command channel
Jun 19, 2020
5c33d11
update
SparkSnail Jun 20, 2020
45424e8
support concurrent trials in runner.
Jun 22, 2020
9ca3444
implement web channel
Jun 24, 2020
5018039
fix eslint errors, and rename rest to web
Jun 24, 2020
283bceb
remove trial service, as it's replaced by channel.
Jun 24, 2020
0c67c5c
Merge remote-tracking branch 'official/master' into 2391-reuse-job
Jun 24, 2020
671f5d8
fix merged problem, and small refine for ut.
Jun 24, 2020
a65a810
fix pylint errors
Jun 24, 2020
6d36ae5
fix lint error
Jun 26, 2020
5e352f7
init
SparkSnail Jun 26, 2020
b9d1aa5
fix conflict
SparkSnail Jun 28, 2020
a3a91d8
format
SparkSnail Jun 28, 2020
57c300e
Merge pull request #255 from microsoft/master
SparkSnail Jun 28, 2020
69a5170
remove useless deferred.
Jun 29, 2020
edc4608
fix package
SparkSnail Jun 29, 2020
c1f0239
fix incorrect check logic
Jun 29, 2020
af97bb1
make license header consistent
Jun 29, 2020
10feb6a
Merge remote-tracking branch 'official/master' into 2391-reuse-job
Jun 29, 2020
c00cd31
add missed await.
Jun 29, 2020
78f1386
add doc and example
SparkSnail Jun 29, 2020
586d6ac
support log level in UT
Jun 29, 2020
2db8ff8
refine interface to support aml better.
Jun 29, 2020
f631e4c
fix runtime error on exit
Jun 29, 2020
5982fb3
Merge remote-tracking branch 'official/master' into 2391-reuse-job
Jun 29, 2020
f687a6e
fix eslint error
Jun 29, 2020
476ffec
send metric data from channel
Jun 29, 2020
0f2367c
support version check
Jun 29, 2020
9d7bd3c
fix pylint errors
Jun 29, 2020
130ed27
fix non-local failed ITs
Jun 29, 2020
ab86080
fix comments
SparkSnail Jun 30, 2020
4b11a53
fix conflict
SparkSnail Jun 30, 2020
15ee064
fix conflict
SparkSnail Jun 30, 2020
7c48610
format
SparkSnail Jun 30, 2020
c0c7d96
format code
SparkSnail Jun 30, 2020
93eefb2
format code
SparkSnail Jun 30, 2020
53cea0f
remove unused code
SparkSnail Jun 30, 2020
34d9351
format code
SparkSnail Jun 30, 2020
25a9dab
fix comments
SparkSnail Jun 30, 2020
cada76a
fix comments
SparkSnail Jun 30, 2020
de7dc7c
fix comments
SparkSnail Jun 30, 2020
428dc3d
add blank line
SparkSnail Jun 30, 2020
2e9c70e
fix comments
SparkSnail Jun 30, 2020
8cf8583
fix comments
SparkSnail Jun 30, 2020
fd5fd9e
fix build
SparkSnail Jun 30, 2020
54a22af
fix comments
SparkSnail Jun 30, 2020
525b961
fix channel async calls
Jun 30, 2020
8ec5e7d
fix comments
SparkSnail Jun 30, 2020
bdd3840
fix comments
SparkSnail Jun 30, 2020
b341dce
fix comments
SparkSnail Jun 30, 2020
ce81c51
Merge remote-tracking branch 'snail/dev-aml' into 2391-improve
Jun 30, 2020
e66dc23
fix comments
SparkSnail Jun 30, 2020
9cf6744
merge code logic
Jun 30, 2020
bd77f5c
Merge remote-tracking branch 'snail/dev-aml' into 2391-improve
Jun 30, 2020
ddfb0cc
Merge branch 'master' of https://github.com/microsoft/nni into dev-aml
SparkSnail Jun 30, 2020
65660e6
Merge pull request #257 from microsoft/master
SparkSnail Jun 30, 2020
fec8a67
Merge branch 'master' of https://github.com/SparkSnail/nni into dev-aml
SparkSnail Jun 30, 2020
5200a3a
Merge remote-tracking branch 'snail/dev-aml' into 2391-improve
Jun 30, 2020
51befa5
fix eslint errors
Jun 30, 2020
478629f
add run fo messages
Jun 30, 2020
c299ce1
Merge pull request #256 from squirrelsc/2391-improve
SparkSnail Jun 30, 2020
0517e13
fix comments
SparkSnail Jun 30, 2020
fc4b978
sort class
SparkSnail Jun 30, 2020
e527743
fix eslint
SparkSnail Jun 30, 2020
b047681
fix eslint
SparkSnail Jun 30, 2020
4acc7e8
fix annotation
SparkSnail Jun 30, 2020
ec1475a
fix import aml
SparkSnail Jun 30, 2020
8eaeebf
fix comments
SparkSnail Jun 30, 2020
56b6818
fix doc build
SparkSnail Jul 1, 2020
e09ff79
fix trial_runner import
SparkSnail Jul 1, 2020
ecf615d
fix doc
SparkSnail Jul 1, 2020
a7a3baf
fix pylint
SparkSnail Jul 1, 2020
66 changes: 66 additions & 0 deletions docs/en_US/TrainingService/AMLMode.md
@@ -0,0 +1,66 @@
**Run an Experiment on Azure Machine Learning**
===
NNI supports running an experiment on [AML](https://azure.microsoft.com/en-us/services/machine-learning/), which is called aml mode.

## Setup environment
Step 1. Install NNI by following the [install guide](../Tutorial/QuickStart.md).

Step 2. Create an AML account by following [this document](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-manage-workspace-cli).

Step 3. Get your account information.
![](../../img/aml_account.png)

Step 4. Install the AML package environment.
```
python3 -m pip install azureml --user
python3 -m pip install azureml-sdk --user
```
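
After installing, it can help to confirm that the packages are visible to the interpreter that will launch NNI. The following is a minimal sketch, not part of this PR; the helper name `is_installed` is hypothetical, and it only checks that a top-level module can be found, not that its version is compatible.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located by the import system."""
    return importlib.util.find_spec(module_name) is not None


# "azureml" is the top-level package name used by the imports in amlUtil.py.
for name in ("azureml",):
    print(name, "installed" if is_installed(name) else "missing")
```

If the module is reported as missing, check that `python3 -m pip` and the `python3` used to run the trial refer to the same interpreter.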

## Run an experiment
Use `examples/trials/mnist-tfv1` as an example. The NNI config YAML file looks like this:

```yaml
authorName: default
experimentName: example_mnist
trialConcurrency: 1
maxExecDuration: 1h
maxTrialNum: 10
trainingServicePlatform: aml
searchSpacePath: search_space.json
#choice: true, false
useAnnotation: false
tuner:
  #choice: TPE, Random, Anneal, Evolution, BatchTuner, MetisTuner, GPTuner
  #SMAC (SMAC should be installed through nnictl)
  builtinTunerName: TPE
  classArgs:
    #choice: maximize, minimize
    optimize_mode: maximize
trial:
  command: python3 mnist.py
  codeDir: .
  computeTarget: ussc40rscl
  # Review comment (Contributor): Replace it with a placeholder
  # Reply (Contributor Author): fixed.
  nodeCount: 1
  # Review comment (Contributor): Where is docker image?
  # Reply (Contributor Author): fixed, missed this variable in doc.
amlConfig:
  subscriptionId: ${replace_to_your_subscriptionId}
  resourceGroup: ${replace_to_your_resourceGroup}
  workspaceName: ${replace_to_your_workspaceName}

```

Note: Set `trainingServicePlatform: aml` in the NNI config YAML file to start an experiment in aml mode.

Compared with [LocalMode](LocalMode.md), the trial configuration in aml mode has these additional keys:
* computeTarget
    * required key. The name of the compute cluster you want to use in your AML workspace.
* nodeCount
    * required key. The number of nodes to use for one run. See the [RunConfiguration reference](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.runconfiguration?view=azure-ml-py#variables).

Review comment (Contributor): I think nodeCount can default to 1 because multi-machine runs are seldom used.

Reply (Contributor Author): Yes, hiding this variable is probably better; it has been removed.

amlConfig:
* subscriptionId
    * the subscriptionId of your account
* resourceGroup
    * the resourceGroup of your account
* workspaceName
    * the workspaceName of your account
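
The key list above can be checked before launching an experiment. The following is a minimal sketch under stated assumptions: it operates on a config that has already been parsed into a `dict`, the function name `find_missing_keys` is hypothetical, and treating `command` and `codeDir` as required is an assumption carried over from the example config rather than something this PR states. `nodeCount` is left out because the review thread indicates it was removed.

```python
# Assumed required keys, taken from the example config in this document.
REQUIRED_TRIAL_KEYS = ("command", "codeDir", "computeTarget")
REQUIRED_AML_KEYS = ("subscriptionId", "resourceGroup", "workspaceName")


def find_missing_keys(config):
    """Return dotted names of the aml-mode keys missing from a parsed config."""
    missing = []
    if config.get("trainingServicePlatform") != "aml":
        missing.append("trainingServicePlatform")
    trial = config.get("trial") or {}
    missing += ["trial." + key for key in REQUIRED_TRIAL_KEYS if not trial.get(key)]
    aml = config.get("amlConfig") or {}
    missing += ["amlConfig." + key for key in REQUIRED_AML_KEYS if not aml.get(key)]
    return missing


example = {
    "trainingServicePlatform": "aml",
    "trial": {"command": "python3 mnist.py", "codeDir": "."},
    "amlConfig": {"subscriptionId": "sub", "resourceGroup": "rg"},
}
print(find_missing_keys(example))
```

An empty list means every key named in this section is present; the check says nothing about whether the values are valid in your Azure account.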

Binary file added docs/img/aml_account.png
25 changes: 25 additions & 0 deletions examples/trials/mnist-pytorch/config_aml.yml
@@ -0,0 +1,25 @@
authorName: default
experimentName: example_mnist_pytorch
trialConcurrency: 1
maxExecDuration: 1h
maxTrialNum: 10
trainingServicePlatform: aml
searchSpacePath: search_space.json
#choice: true, false
useAnnotation: false
tuner:
  #choice: TPE, Random, Anneal, Evolution, BatchTuner, MetisTuner, GPTuner
  #SMAC (SMAC should be installed through nnictl)
  builtinTunerName: TPE
  classArgs:
    #choice: maximize, minimize
    optimize_mode: maximize
trial:
  command: python3 mnist.py
  codeDir: .
  computeTarget: ussc40rscl
  nodeCount: 1
  # Review comment (Contributor): Each trial will use one node, i.e., all 8 GPUs?
  # Reply (Contributor Author): removed.
amlConfig:
  subscriptionId: ${replace_to_your_subscriptionId}
  resourceGroup: ${replace_to_your_resourceGroup}
  workspaceName: ${replace_to_your_workspaceName}
25 changes: 25 additions & 0 deletions examples/trials/mnist-tfv1/config_aml.yml
@@ -0,0 +1,25 @@
authorName: default
# squirrelsc marked this conversation as resolved.
experimentName: example_mnist
trialConcurrency: 1
maxExecDuration: 1h
maxTrialNum: 10
trainingServicePlatform: aml
searchSpacePath: search_space.json
#choice: true, false
useAnnotation: false
tuner:
  #choice: TPE, Random, Anneal, Evolution, BatchTuner, MetisTuner, GPTuner
  #SMAC (SMAC should be installed through nnictl)
  builtinTunerName: TPE
  classArgs:
    #choice: maximize, minimize
    optimize_mode: maximize
trial:
  command: python3 mnist.py
  codeDir: .
  computeTarget: ussc40rscl
  nodeCount: 1
amlConfig:
  subscriptionId: ${replace_to_your_subscriptionId}
  resourceGroup: ${replace_to_your_resourceGroup}
  workspaceName: ${replace_to_your_workspaceName}
57 changes: 57 additions & 0 deletions src/nni_manager/config/aml/amlUtil.py
@@ -0,0 +1,57 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT license.

import os
# Review comment (Contributor): missing license
# Reply (Contributor Author): fixed.
import sys
import time
import json
from argparse import ArgumentParser
from azureml.core import Experiment, RunConfiguration, ScriptRunConfig
from azureml.core.compute import ComputeTarget
from azureml.core.run import RUNNING_STATES, RunStatus, Run
from azureml.core import Workspace
from azureml.core.conda_dependencies import CondaDependencies

if __name__ == "__main__":
    parser = ArgumentParser()
    parser.add_argument('--subscription_id', help='the subscription id of aml')
    parser.add_argument('--resource_group', help='the resource group of aml')
    parser.add_argument('--workspace_name', help='the workspace name of aml')
    parser.add_argument('--compute_target', help='the compute cluster name of aml')
    parser.add_argument('--docker_image', help='the docker image of job')
    parser.add_argument('--experiment_name', help='the experiment name')
    parser.add_argument('--script_dir', help='script directory')
    parser.add_argument('--script_name', help='script name')
    # Parse as int so that run_config.node_count receives a number, not a string.
    parser.add_argument('--node_count', type=int, help='node count of run')
    args = parser.parse_args()

    # Connect to the workspace and look up the compute cluster.
    ws = Workspace(args.subscription_id, args.resource_group, args.workspace_name)
    compute_target = ComputeTarget(workspace=ws, name=args.compute_target)
    experiment = Experiment(ws, args.experiment_name)
    # Build a Docker-based run configuration with the AML packages installed.
    run_config = RunConfiguration()
    dependencies = CondaDependencies()
    dependencies.add_pip_package("azureml-sdk")
    dependencies.add_pip_package("azureml")
    run_config.environment.python.conda_dependencies = dependencies
    run_config.environment.docker.enabled = True
    run_config.environment.docker.base_image = args.docker_image
    run_config.target = compute_target
    run_config.node_count = args.node_count
    config = ScriptRunConfig(source_directory=args.script_dir, script=args.script_name, run_config=run_config)
    run = experiment.submit(config)
    # The first line printed is the run id; the caller reads it as the submit result.
    print(run.get_details()["runId"])
    # Serve line-based commands from stdin until 'stop' arrives or stdin is closed.
    while True:
        raw = sys.stdin.readline()
        if not raw:
            # EOF: the parent process closed stdin, so stop polling instead of spinning.
            break
        line = raw.rstrip()
        if line == 'update_status':
            print('status:' + run.get_status())
        elif line == 'tracking_url':
            print('tracking_url:' + run.get_portal_url())
        elif line == 'stop':
            run.cancel()
            sys.exit(0)
        elif line == 'receive':
            print('receive:' + json.dumps(run.get_metrics()))
        elif line:
            items = line.split(':')
            if items[0] == 'command':
                # Strip the 8-character 'command:' prefix and log the rest to the run.
                run.log('nni_manager', line[8:])
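
The stdin loop in `amlUtil.py` amounts to a small line-based protocol: each input line is a command, and some commands print a `prefix:payload` reply. The following is a minimal sketch of that dispatch logic with the Azure dependency replaced by a stand-in, so it can be exercised locally. `FakeRun` and `handle_line` are hypothetical names introduced for illustration, and the values `FakeRun` returns are made up; only the command names and reply prefixes come from the script above.

```python
import json


class FakeRun:
    """Hypothetical stand-in for the AML run object; returns canned values."""

    def __init__(self):
        self.cancelled = False
        self.logged = []

    def get_status(self):
        return "Running"

    def get_portal_url(self):
        return "https://example.invalid/run/1"

    def get_metrics(self):
        return {"trial_runner": ["hello"]}

    def cancel(self):
        self.cancelled = True

    def log(self, name, value):
        self.logged.append((name, value))


def handle_line(run, line):
    """Handle one stdin line; return (reply_or_None, keep_running)."""
    line = line.rstrip()
    if line == "update_status":
        return "status:" + run.get_status(), True
    if line == "tracking_url":
        return "tracking_url:" + run.get_portal_url(), True
    if line == "stop":
        run.cancel()
        return None, False
    if line == "receive":
        return "receive:" + json.dumps(run.get_metrics()), True
    if line.startswith("command:"):
        # Forward everything after the prefix to the run's log channel.
        run.log("nni_manager", line[len("command:"):])
    return None, True


run = FakeRun()
print(handle_line(run, "update_status\n"))
print(handle_line(run, "command:kill trial 3\n"))
print(run.logged)
```

Separating the dispatch from the I/O loop like this is a design choice of the sketch, not of the PR; it makes the protocol testable without a workspace.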
8 changes: 6 additions & 2 deletions src/nni_manager/main.ts
@@ -65,6 +65,10 @@ async function initContainer(foreground: boolean, platformMode: string, logFileN
         Container.bind(TrainingService)
             .to(DLTSTrainingService)
             .scope(Scope.Singleton);
+    } else if (platformMode === 'aml') {
+        Container.bind(TrainingService)
+            .to(RouterTrainingService)
+            .scope(Scope.Singleton);
     } else {
         throw new Error(`Error: unsupported mode: ${platformMode}`);
     }
@@ -93,7 +97,7 @@ async function initContainer(foreground: boolean, platformMode: string, logFileN
 
 function usage(): void {
     console.info('usage: node main.js --port <port> --mode \
-    <local/remote/pai/kubeflow/frameworkcontroller/paiYarn> --start_mode <new/resume> --experiment_id <id> --foreground <true/false>');
+    <local/remote/pai/kubeflow/frameworkcontroller/paiYarn/aml> --start_mode <new/resume> --experiment_id <id> --foreground <true/false>');
 }
 
 const strPort: string = parseArg(['--port', '-p']);
@@ -113,7 +117,7 @@ const foreground: boolean = foregroundArg.toLowerCase() === 'true' ? true : fals
 const port: number = parseInt(strPort, 10);
 
 const mode: string = parseArg(['--mode', '-m']);
-if (!['local', 'remote', 'pai', 'kubeflow', 'frameworkcontroller', 'paiYarn', 'dlts'].includes(mode)) {
+if (!['local', 'remote', 'pai', 'kubeflow', 'frameworkcontroller', 'paiYarn', 'dlts', 'aml'].includes(mode)) {
     console.log(`FATAL: unknown mode: ${mode}`);
     usage();
     process.exit(1);
1 change: 1 addition & 0 deletions src/nni_manager/package.json
@@ -19,6 +19,7 @@
     "ignore": "^5.1.4",
     "js-base64": "^2.4.9",
     "kubernetes-client": "^6.5.0",
+    "python-shell": "^2.0.1",
     "rx": "^4.1.0",
     "sqlite3": "^4.0.2",
     "ssh2": "^0.6.1",
7 changes: 7 additions & 0 deletions src/nni_manager/rest_server/restValidationSchemas.ts
Original file line number Diff line number Diff line change
Expand Up @@ -39,6 +39,8 @@ export namespace ValidationSchemas {
nniManagerNFSMountPath: joi.string().min(1),
containerNFSMountPath: joi.string().min(1),
paiConfigPath: joi.string(),
computeTarget: joi.string(),
nodeCount: joi.number(),
paiStorageConfigName: joi.string().min(1),
nasMode: joi.string().valid('classic_mode', 'enas_mode', 'oneshot_mode', 'darts_mode'),
portList: joi.array().items(joi.object({
Expand Down Expand Up @@ -150,6 +152,11 @@ export namespace ValidationSchemas {
email: joi.string().min(1),
password: joi.string().min(1)
}),
aml_config: joi.object({
subscriptionId: joi.string().min(1),
resourceGroup: joi.string().min(1),
workspaceName: joi.string().min(1)
}),
nni_manager_ip: joi.object({ // eslint-disable-line @typescript-eslint/camelcase
nniManagerIp: joi.string().min(1)
})
Expand Down
@@ -19,6 +19,7 @@ export enum TrialConfigMetadataKey {
     NNI_MANAGER_IP = 'nni_manager_ip',
     FRAMEWORKCONTROLLER_CLUSTER_CONFIG = 'frameworkcontroller_config',
     DLTS_CLUSTER_CONFIG = 'dlts_config',
+    AML_CLUSTER_CONFIG = 'aml_config',
     VERSION_CHECK = 'version_check',
     LOG_COLLECTION = 'log_collection'
 }
132 changes: 132 additions & 0 deletions src/nni_manager/training_service/reusable/aml/amlClient.ts
@@ -0,0 +1,132 @@
// Copyright (c) Microsoft Corporation.
// Licensed under the MIT license.

'use strict';

import * as fs from 'fs';
import * as request from 'request';
import * as path from 'path';
import { Deferred } from 'ts-deferred';
import { PythonShell } from 'python-shell';

export class AMLClient {
    public subscriptionId: string;
    public resourceGroup: string;
    public workspaceName: string;
    public experimentId: string;
    public image: string;
    public scriptName: string;
    public pythonShellClient: undefined | PythonShell;
    public codeDir: string;
    public nodeCount: number;
    public computeTarget: string;

    constructor(
        subscriptionId: string,
        resourceGroup: string,
        workspaceName: string,
        experimentId: string,
        computeTarget: string,
        nodeCount: number,
        image: string,
        scriptName: string,
        codeDir: string,
    ) {
        this.subscriptionId = subscriptionId;
        this.resourceGroup = resourceGroup;
        this.workspaceName = workspaceName;
        this.experimentId = experimentId;
        this.image = image;
        this.nodeCount = nodeCount;
        this.scriptName = scriptName;
        this.codeDir = codeDir;
        this.computeTarget = computeTarget;
    }

    public submit(): Promise<string> {
        const deferred: Deferred<string> = new Deferred<string>();
        this.pythonShellClient = new PythonShell('amlUtil.py', {
            scriptPath: './config/aml',
            pythonOptions: ['-u'], // get print results in real-time
            args: [
                '--subscription_id', this.subscriptionId,
                '--resource_group', this.resourceGroup,
                '--workspace_name', this.workspaceName,
                '--compute_target', this.computeTarget,
                '--docker_image', this.image,
                '--experiment_name', `nni_exp_${this.experimentId}`,
                '--script_dir', this.codeDir,
                '--script_name', this.scriptName,
                '--node_count', this.nodeCount.toString()
            ]
        });
        this.pythonShellClient.on('message', function (envId: any) {
            // received a message sent from the Python script (a simple "print" statement)
            deferred.resolve(envId);
        });
        return deferred.promise;
    }

    public stop(): void {
        if (this.pythonShellClient === undefined) {
            throw Error('python shell client not initialized!');
        }
        this.pythonShellClient.send('stop');
        // Review comment (Contributor): Is this.pythonShellClient.send a synchronous method?
        // Reply (Contributor Author): no
    }

    public getTrackingUrl(): Promise<string> {
        const deferred: Deferred<string> = new Deferred<string>();
        if (this.pythonShellClient === undefined) {
            throw Error('python shell client not initialized!');
        }
        this.pythonShellClient.send('tracking_url');
        let trackingUrl = '';
        this.pythonShellClient.on('message', function (status: any) {
            const items = status.split(':');
            if (items[0] === 'tracking_url') {
                // Rejoin with ':' so that the colons inside the URL are preserved.
                trackingUrl = items.slice(1).join(':');
            }
            deferred.resolve(trackingUrl);
        });
        return deferred.promise;
    }

    public updateStatus(oldStatus: string): Promise<string> {
        const deferred: Deferred<string> = new Deferred<string>();
        if (this.pythonShellClient === undefined) {
            throw Error('python shell client not initialized!');
        }
        let newStatus = oldStatus;
        this.pythonShellClient.send('update_status');
        this.pythonShellClient.on('message', function (status: any) {
            const items = status.split(':');
            if (items[0] === 'status') {
                // Rejoin with ':' so that a payload containing colons is not altered.
                newStatus = items.slice(1).join(':');
            }
            deferred.resolve(newStatus);
        });
        return deferred.promise;
    }

    public sendCommand(message: string): void {
        if (this.pythonShellClient === undefined) {
            throw Error('python shell client not initialized!');
        }
        this.pythonShellClient.send(`command:${message}`);
    }

    public receiveCommand(): Promise<any> {
        const deferred: Deferred<any> = new Deferred<any>();
        if (this.pythonShellClient === undefined) {
            throw Error('python shell client not initialized!');
        }
        this.pythonShellClient.send('receive');
        this.pythonShellClient.on('message', function (command: any) {
            const items = command.split(':');
            if (items[0] === 'receive') {
                // Drop the 8-character 'receive:' prefix before parsing the JSON payload.
                deferred.resolve(JSON.parse(command.slice(8)));
            }
        });
        return deferred.promise;
    }
}
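
The replies that `AMLClient` reads have the form `prefix:payload`, and a payload such as a tracking URL or a JSON document can itself contain colons. The following is a minimal sketch, written in Python to match the helper script, of splitting on the first colon only. `parse_message` is a hypothetical helper, not a function from this PR, and the URL is a made-up example.

```python
def parse_message(message):
    """Split 'prefix:payload' on the first colon; return (prefix, payload).

    Returns (None, message) when the line has no colon at all.
    """
    prefix, separator, payload = message.partition(":")
    if not separator:
        return None, message
    return prefix, payload


reply = "tracking_url:https://example.invalid/runs/1?wsid=abc"
print(parse_message(reply))

# For comparison: splitting on every colon and joining with '' drops the
# colon after 'https', which corrupts the URL.
print("".join(reply.split(":")[1:]))
```

Splitting once keeps the payload byte-for-byte identical to what the script printed, which matters for URLs and for JSON that is parsed afterwards.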