Add cluster awareness and decommission docs (#2438)
* Add cluster awareness and decommission docs

Signed-off-by: Naarcha-AWS <naarcha@amazon.com>

* Update _api-reference/cluster-awareness.md

Co-authored-by: Bukhtawar Khan <bukhtawar7152@gmail.com>

* Edit technical feedback

Signed-off-by: Naarcha-AWS <naarcha@amazon.com>

* Add new cluster awareness examples

Signed-off-by: Naarcha-AWS <naarcha@amazon.com>

* Add technical feedback

Signed-off-by: Naarcha-AWS <naarcha@amazon.com>

* Update _api-reference/cluster-awareness.md

Co-authored-by: Alice Williams <88908598+alicejw-aws@users.noreply.github.com>

* Update _api-reference/cluster-awareness.md

Co-authored-by: Alice Williams <88908598+alicejw-aws@users.noreply.github.com>

* Add Caroline's feedback

Signed-off-by: Naarcha-AWS <naarcha@amazon.com>

* Add one more tweak

Signed-off-by: Naarcha-AWS <naarcha@amazon.com>

* Update _ml-commons-plugin/cluster-settings.md

Co-authored-by: Heather Halter <HDHALTER@AMAZON.COM>

* Update _ml-commons-plugin/cluster-settings.md

Co-authored-by: Heather Halter <HDHALTER@AMAZON.COM>

* Update _api-reference/cluster-awareness.md

Co-authored-by: Nate Bower <nbower@amazon.com>

* Update _api-reference/cluster-awareness.md

Co-authored-by: Nate Bower <nbower@amazon.com>

* Update _api-reference/cluster-awareness.md

Co-authored-by: Nate Bower <nbower@amazon.com>

* Update _ml-commons-plugin/cluster-settings.md

Co-authored-by: Nate Bower <nbower@amazon.com>

* Update _api-reference/cluster-awareness.md

Co-authored-by: Nate Bower <nbower@amazon.com>

* Update _api-reference/cluster-awareness.md

Co-authored-by: Nate Bower <nbower@amazon.com>

* Update _api-reference/cluster-awareness.md

Co-authored-by: Nate Bower <nbower@amazon.com>

* Update _api-reference/cluster-awareness.md

Co-authored-by: Nate Bower <nbower@amazon.com>

* Update _api-reference/cluster-awareness.md

Co-authored-by: Nate Bower <nbower@amazon.com>

* Update _ml-commons-plugin/cluster-settings.md

Co-authored-by: Nate Bower <nbower@amazon.com>

* Update _api-reference/cluster-awareness.md

Co-authored-by: Nate Bower <nbower@amazon.com>

* Update _api-reference/cluster-awareness.md

Co-authored-by: Nate Bower <nbower@amazon.com>

* Update _api-reference/cluster-decommission.md

Co-authored-by: Nate Bower <nbower@amazon.com>

* Update _api-reference/cluster-awareness.md

Co-authored-by: Nate Bower <nbower@amazon.com>

* Update _api-reference/cluster-decommission.md

Co-authored-by: Nate Bower <nbower@amazon.com>

* Add editorial feedback

Signed-off-by: Naarcha-AWS <naarcha@amazon.com>

* Fix typos

Signed-off-by: Naarcha-AWS <naarcha@amazon.com>

* Final editorial note

Signed-off-by: Naarcha-AWS <naarcha@amazon.com>

Signed-off-by: Naarcha-AWS <naarcha@amazon.com>
Co-authored-by: Bukhtawar Khan <bukhtawar7152@gmail.com>
Co-authored-by: Alice Williams <88908598+alicejw-aws@users.noreply.github.com>
Co-authored-by: Heather Halter <HDHALTER@AMAZON.COM>
Co-authored-by: Nate Bower <nbower@amazon.com>
5 people authored and vagimeli committed Jan 25, 2023
1 parent c5cf556 commit 891be1f
Showing 6 changed files with 245 additions and 9 deletions.
121 changes: 121 additions & 0 deletions _api-reference/cluster-awareness.md
@@ -0,0 +1,121 @@
---
layout: default
title: Cluster routing and awareness
nav_order: 16
---

# Cluster routing and awareness

You can assign weights per awareness attribute value to control the distribution of search or HTTP traffic across zones. This is commonly used for zonal deployments, heterogeneous instances, and routing traffic away from zones during zonal failure.

## HTTP and path methods

```
PUT /_cluster/routing/awareness/<attribute>/weights
GET /_cluster/routing/awareness/<attribute>/weights?local
GET /_cluster/routing/awareness/<attribute>/weights
DELETE /_cluster/routing/awareness/<attribute>/weights
```

## Path parameters

Parameter | Type | Description
:--- | :--- | :---
attribute | String | The name of the awareness attribute, usually `zone`. The attribute name must match the values listed in the request body when assigning weights to zones.

## Request body parameters

Parameter | Type | Description
:--- | :--- | :---
weights | JSON object | Assigns weights to attribute values, such as zones, within the request body of the PUT request. Weights can be set in any ratio, for example, 2:3:5. In a 2:3:5 ratio across three zones, for every 100 search requests sent to the cluster, the zones receive approximately 20, 30, and 50 requests, respectively, in random order. A zone assigned a weight of `0` does not receive any search traffic.
_version | String | Implements optimistic concurrency control (OCC) through versioning. The parameter uses simple versioning, such as `1`, and increments upward based on each subsequent modification. This allows any servers from which a request originates to validate whether or not a zone has been modified.


In the following example request body, `zone_1` and `zone_2` each receive 50 of every 100 search requests, whereas `zone_3` is prevented from receiving requests:

```
{
  "weights":
  {
    "zone_1": "5",
    "zone_2": "5",
    "zone_3": "0"
  },
  "_version": 1
}
```
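
To make the ratio arithmetic concrete, the following Python sketch computes the expected share of requests per zone from a `weights` object. The helper name `expected_share` is hypothetical and is not part of OpenSearch. It only illustrates the proportions described above and does not model the random ordering of individual requests.

```python
from fractions import Fraction


def expected_share(weights, total_requests=100):
    """Return the expected number of requests per zone for the given weights.

    Hypothetical helper for illustration only; OpenSearch performs the
    actual routing internally.
    """
    # Weights are strings in the request body, so parse them first.
    parsed = {zone: Fraction(value) for zone, value in weights.items()}
    total_weight = sum(parsed.values())
    if total_weight == 0:
        # This helper cannot split traffic when every weight is zero.
        raise ValueError("at least one zone needs a nonzero weight")
    return {
        zone: float(weight / total_weight * total_requests)
        for zone, weight in parsed.items()
    }


shares = expected_share({"zone_1": "5", "zone_2": "5", "zone_3": "0"})
print(shares)
```

Because only the ratio matters, weights of `5:5:0` and `1:1:0` produce the same expected split.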

## Example: Weighted round robin search

The following example request distributes search traffic across `zone_1` and `zone_2` in round robin fashion by assigning them equal weights and sends no search traffic to `zone_3`:

### Request

PUT /_cluster/routing/awareness/zone/weights
{
  "weights":
  {
    "zone_1": "1",
    "zone_2": "1",
    "zone_3": "0"
  },
  "_version": 1
}

### Response

```
{
"acknowledged": true
}
```
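
As a rough sketch, a client could assemble the same request with Python's standard library. The base URL `http://localhost:9200` and the helper name `build_weights_request` are assumptions made for illustration. The code only builds the request object and does not send it.

```python
import json
import urllib.request

# Assumption: a local, unsecured cluster. Replace with your own endpoint.
BASE_URL = "http://localhost:9200"


def build_weights_request(attribute, weights, version):
    """Build (but do not send) a PUT request that assigns zone weights."""
    body = {"weights": weights, "_version": version}
    return urllib.request.Request(
        url=f"{BASE_URL}/_cluster/routing/awareness/{attribute}/weights",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


request = build_weights_request(
    "zone", {"zone_1": "1", "zone_2": "1", "zone_3": "0"}, 1
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing the request to `urllib.request.urlopen` would send it. A secured cluster would also need authentication and TLS settings, which are omitted here.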


## Example: Getting weights for all zones

The following example request gets weights for all zones.

### Request

```
GET /_cluster/routing/awareness/zone/weights
```

### Response

OpenSearch responds with the weight of each zone:

```json
{
  "weights":
  {
    "zone_1": "1.0",
    "zone_2": "1.0",
    "zone_3": "0.0"
  },
  "_version": 1
}
```
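
The `_version` field supports optimistic concurrency control. The following toy Python model shows the general idea: a write is accepted only when the caller's version matches the stored version, and each accepted write increments the version. The class `WeightStore`, the starting version of `1`, and the rejection behavior are assumptions for illustration. This is not OpenSearch's implementation.

```python
class WeightStore:
    """Toy in-memory model of version-checked updates (not OpenSearch code)."""

    def __init__(self):
        self.weights = {}
        self.version = 1  # Assumed starting version for this sketch.

    def put(self, weights, version):
        """Apply the update only if the caller saw the latest version."""
        if version != self.version:
            # Stale write: another client modified the weights first.
            return False
        self.weights = dict(weights)
        self.version += 1
        return True


store = WeightStore()
first = store.put({"zone_1": "1", "zone_2": "1"}, version=1)
stale = store.put({"zone_1": "0", "zone_2": "1"}, version=1)
print(first, stale, store.version)
```

In this model, the second writer would need to read the current version again before retrying.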

## Example: Deleting weights

You can remove your weight ratio for each zone using the `DELETE` method.

### Request

```
DELETE /_cluster/routing/awareness/zone/weights
```

### Response

```json
{
"_version":1
}
```

## Next steps

- For more information about zone decommissioning, see [Cluster decommission]({{site.url}}{{site.baseurl}}/api-reference/cluster-decommission/).
- For more information about allocation awareness, see [Cluster formation]({{site.url}}{{site.baseurl}}/opensearch/cluster/#advanced-step-6-configure-shard-allocation-awareness-or-forced-awareness).
80 changes: 80 additions & 0 deletions _api-reference/cluster-decommission.md
@@ -0,0 +1,80 @@
---
layout: default
title: Cluster decommission
nav_order: 20
---

# Cluster decommission

The cluster decommission operation adds support for decommissioning based on awareness. It greatly benefits multi-zone deployments, where awareness attributes, such as `zone`, can aid in applying new upgrades to a cluster in a controlled fashion. This is especially useful during outages, when you can decommission the unhealthy zone to prevent replication requests from stalling and prevent your request backlog from becoming too large.

For more information about allocation awareness, see [Shard allocation awareness]({{site.url}}{{site.baseurl}}/opensearch/cluster/#shard-allocation-awareness).


## HTTP and path methods

```
PUT /_cluster/decommission/awareness/{awareness_attribute_name}/{awareness_attribute_value}
GET /_cluster/decommission/awareness/{awareness_attribute_name}/_status
DELETE /_cluster/decommission/awareness
```

## URL parameters

Parameter | Type | Description
:--- | :--- | :---
awareness_attribute_name | String | The name of the awareness attribute, usually `zone`.
awareness_attribute_value | String | The value of the awareness attribute. For example, if you have shards allocated in two different zones, you can give each zone a value of `zone-a` or `zone-b`. The cluster decommission operation decommissions the zone listed in the method.


## Example: Decommissioning and recommissioning a zone

You can use the following example requests to decommission and recommission a zone:

### Request

The following example request decommissions `zone-a`:

```
PUT /_cluster/decommission/awareness/zone/zone-a
```

If you want to recommission a decommissioned zone, you can use the `DELETE` method:

```
DELETE /_cluster/decommission/awareness
```

### Response


```json
{
"acknowledged": true
}
```
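
The three endpoints differ only in their path segments, which the following Python sketch makes explicit. The helper names `decommission_path` and `status_path` are hypothetical. The paths themselves come from the methods listed above.

```python
from urllib.parse import quote

# Recommissioning takes no attribute name or value in the path.
RECOMMISSION_PATH = "/_cluster/decommission/awareness"


def decommission_path(attribute_name, attribute_value):
    """Path for the PUT request that decommissions one attribute value."""
    name = quote(attribute_name, safe="")
    value = quote(attribute_value, safe="")
    return f"/_cluster/decommission/awareness/{name}/{value}"


def status_path(attribute_name):
    """Path for the GET request that returns the decommission status."""
    return f"/_cluster/decommission/awareness/{quote(attribute_name, safe='')}/_status"


print("PUT", decommission_path("zone", "zone-a"))
print("GET", status_path("zone"))
print("DELETE", RECOMMISSION_PATH)
```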

## Example: Getting zone decommission status

The following example request returns the decommission status of all zones.

### Request

```
GET /_cluster/decommission/awareness/zone/_status
```


### Response

```json
{
"zone-1": "INIT | DRAINING | IN_PROGRESS | SUCCESSFUL | FAILED"
}
```
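
A client that polls this endpoint needs to decide when to stop. The following Python sketch classifies each zone's status, assuming that `SUCCESSFUL` and `FAILED` are the final states and that the other listed values mean the operation is still underway. That grouping and the helper name `summarize_status` are assumptions for illustration.

```python
# Status values as listed in the example response above.
KNOWN_STATUSES = {"INIT", "DRAINING", "IN_PROGRESS", "SUCCESSFUL", "FAILED"}
# Assumption: these two values mean that no further progress is expected.
TERMINAL_STATUSES = {"SUCCESSFUL", "FAILED"}


def summarize_status(response):
    """Map each zone to 'done' or 'pending' based on its reported status."""
    summary = {}
    for zone, status in response.items():
        if status not in KNOWN_STATUSES:
            raise ValueError(f"unexpected status for {zone}: {status}")
        summary[zone] = "done" if status in TERMINAL_STATUSES else "pending"
    return summary


print(summarize_status({"zone-1": "DRAINING"}))
print(summarize_status({"zone-1": "SUCCESSFUL"}))
```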


## Next steps

- For more information about zone awareness and weight, see [Cluster awareness]({{site.url}}{{site.baseurl}}/api-reference/cluster-awareness/).
- For more information about allocation awareness, see [Cluster formation]({{site.url}}{{site.baseurl}}/opensearch/cluster/#advanced-step-6-configure-shard-allocation-awareness-or-forced-awareness).
3 changes: 2 additions & 1 deletion _api-reference/cluster-health.md
@@ -1,7 +1,7 @@
---
layout: default
title: Cluster health
nav_order: 16
nav_order: 17
---

# Cluster health
@@ -47,6 +47,7 @@ wait_for_events | Enum | Wait until all currently queued events with the given p
wait_for_no_relocating_shards | Boolean | Whether to wait until there are no relocating shards in the cluster. Default is false.
wait_for_no_initializing_shards | Boolean | Whether to wait until there are no initializing shards in the cluster. Default is false.
wait_for_status | Enum | Wait until the cluster health reaches the specified status or better. Supported values are `green`, `yellow`, and `red`.
weights | JSON object | Assigns weights to attributes within the request body of the PUT request. Weights can be set in any ratio, for example, 2:3:5. In a 2:3:5 ratio with three zones, for every 100 requests sent to the cluster, each zone would receive either 20, 30, or 50 search requests in a random order. When assigned a weight of `0`, the zone does not receive any search traffic.

#### Sample request

2 changes: 1 addition & 1 deletion _api-reference/cluster-settings.md
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
---
layout: default
title: Cluster settings
nav_order: 17
nav_order: 18
---

# Cluster settings
2 changes: 1 addition & 1 deletion _api-reference/count.md
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
---
layout: default
title: Count
nav_order: 20
nav_order: 21
---

# Count
46 changes: 40 additions & 6 deletions _ml-commons-plugin/cluster-settings.md
@@ -12,7 +12,7 @@ To enhance and customize your OpenSearch cluster for machine learning (ML), you

## Run tasks and models on ML nodes only

If `true`, ML Commons tasks and models run machine learning (ML) tasks on ML nodes only. If `false`, tasks and models run on ML nodes first. If no ML nodes exist, tasks and models run on data nodes. Don't set as `false` on a production cluster.
If `true`, ML Commons runs machine learning (ML) tasks and models on ML nodes only. If `false`, tasks and models run on ML nodes first. If no ML nodes exist, tasks and models run on data nodes. We recommend that you do not set this value to `false` on production clusters.

### Setting

@@ -27,7 +27,7 @@ plugins.ml_commons.only_run_on_ml_node: true

## Dispatch tasks to ML node

`round_robin` dispatches ML tasks to ML nodes using round robin routing. `least_load` gathers all ML nodes' runtime information, such as JVM heap memory usage and running tasks, then dispatches tasks to the ML node with the least load.
`round_robin` dispatches ML tasks to ML nodes using round robin routing. `least_load` gathers runtime information from all ML nodes, like JVM heap memory usage and running tasks, and then dispatches the tasks to the ML node with the lowest load.
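
The difference between the two policies can be sketched in a few lines of Python. This is a toy model, not ML Commons code. The node names are made up, and the single numeric load score per node is an assumption that stands in for the runtime information, such as JVM heap usage and running tasks.

```python
from itertools import cycle


def round_robin(nodes):
    """Return an iterator that hands out nodes in a fixed rotation."""
    return cycle(nodes)


def least_load(node_loads):
    """Return the node with the lowest load score.

    Assumption: each node's runtime information has already been
    reduced to one comparable number.
    """
    return min(node_loads, key=node_loads.get)


rotation = round_robin(["ml-node-1", "ml-node-2"])
picks = [next(rotation) for _ in range(3)]
print(picks)
print(least_load({"ml-node-1": 0.7, "ml-node-2": 0.2}))
```

Round robin ignores how busy each node is, whereas the least-load policy needs fresh runtime information before every dispatch.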


### Setting
@@ -43,7 +43,9 @@ plugins.ml_commons.task_dispatch_policy: round_robin
- Value range: `round_robin` or `least_load`


## Set sync up job intervals
## Set sync job intervals

When returning runtime information with the [profile API]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api#profile), ML Commons will run a regular job to sync newly loaded or unloaded models on each node. When set to `0`, ML Commons immediately stops sync up jobs.

When returning runtime information with the [profile API]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api#profile), ML Commons will run a regular sync up job to sync up newly loaded or unloaded models on each node. When set to `0`, ML Commons immediately stops sync up jobs.

@@ -60,7 +62,7 @@ plugins.ml_commons.sync_up_job_interval_in_seconds: 10

## Predict monitoring requests

Controls how many predict requests are monitored on one node. If set to `0`, OpenSearch clears all monitoring predict requests in the node's cache, and does not monitor predict requests from that point forward.
Controls how many upload model tasks can run in parallel on one node. If set to `0`, you cannot upload models to any node.

### Setting

@@ -92,7 +94,7 @@ plugins.ml_commons.max_upload_model_tasks_per_node: 10

## Load model tasks per node

Controls how many load model tasks can run in parallel on one node. If set to `0`, you cannot load models to any node.
Controls how many load model tasks can run in parallel on one node. If set to `0`, you cannot load models to any node.

### Setting

@@ -107,7 +109,7 @@ plugins.ml_commons.max_load_model_tasks_per_node: 10

## Add trusted URL

The default value allows uploading a model file from any `http`, `https`, `ftp`, or local file. You can change this value to restrict trusted model URL.
The default value allows you to upload a model file from any `http`, `https`, `ftp`, or local file URL. You can change this value to restrict trusted model URLs.


### Setting
@@ -120,3 +122,35 @@ plugins.ml_commons.trusted_url_regex: ^(https?\|ftp\|file)://[-a-zA-Z0-9+&@#/%?=

- Default value: `^(https?\|ftp\|file)://[-a-zA-Z0-9+&@#/%?=~_\|!:,.;]*[-a-zA-Z0-9+&@#/%=~_\|]`
- Value range: Java regular expression (regex) string
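
To see which URLs the default pattern accepts, the following Python sketch applies it to a few sample URLs. It makes two assumptions: the `\|` sequences in the documented value are Markdown escapes for a plain `|`, and the whole URL must match the pattern. Python's `re` module handles this particular pattern the same way that Java does, but the exact matching mode that ML Commons uses is not confirmed here. The sample URLs are made up.

```python
import re

# Default value with the Markdown pipe escapes removed (assumption).
TRUSTED_URL_REGEX = (
    r"^(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]"
)
pattern = re.compile(TRUSTED_URL_REGEX)


def is_trusted(url):
    """Return True if the entire URL matches the trusted URL pattern."""
    return pattern.fullmatch(url) is not None


for url in (
    "https://example.com/models/model.zip",
    "file:///tmp/model.zip",
    "s3://bucket/model.zip",
):
    print(url, is_trusted(url))
```

A stricter pattern, for example one that allows only `https` URLs from a single host, would narrow the set of accepted sources.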

## Assign task timeout

Sets how long, in seconds, an ML task can run. After the timeout, the task fails.

### Setting

```
plugins.ml_commons.ml_task_timeout_in_seconds: 600
```

### Values

- Default value: 600
- Value range: [1, 86400]
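
The following Python sketch validates a timeout against the documented range and builds a request body for the cluster settings API (`PUT /_cluster/settings`). It assumes that this setting can be updated dynamically through that API. If it cannot, the same key and value belong in `opensearch.yml` instead. The helper name `timeout_settings_body` is hypothetical.

```python
import json

SETTING = "plugins.ml_commons.ml_task_timeout_in_seconds"
MIN_SECONDS, MAX_SECONDS = 1, 86400  # Documented value range.


def timeout_settings_body(seconds):
    """Return a JSON body for PUT /_cluster/settings, or raise if out of range."""
    if not MIN_SECONDS <= seconds <= MAX_SECONDS:
        raise ValueError(
            f"timeout must be between {MIN_SECONDS} and {MAX_SECONDS} seconds"
        )
    # Assumption: the setting is applied as a persistent cluster setting.
    return json.dumps({"persistent": {SETTING: seconds}})


body = timeout_settings_body(600)
print(body)
```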

## Set native memory threshold

Sets a circuit breaker that checks system memory usage before running an ML task. If native memory usage exceeds the threshold, OpenSearch throws an exception and does not run the ML task.

Values are based on the percentage of memory available. When set to `0`, no ML tasks will run. When set to `100`, the circuit breaker closes and no threshold exists.

### Setting

```
plugins.ml_commons.native_memory_threshold: 90
```

### Values

- Default value: 90
- Value range: [0, 100]
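
The threshold semantics described above can be sketched as a small Python function. This is a toy model of the decision only, not the ML Commons circuit breaker. The function name `can_run_ml_task` and the way memory usage is passed in as a percentage are assumptions.

```python
def can_run_ml_task(native_memory_used_percent, threshold=90):
    """Decide whether an ML task may start under the given threshold."""
    if not 0 <= threshold <= 100:
        raise ValueError("threshold must be between 0 and 100")
    if threshold == 0:
        # Documented behavior: no ML tasks run.
        return False
    if threshold == 100:
        # Documented behavior: no threshold exists.
        return True
    # The task is blocked only when usage exceeds the threshold.
    return native_memory_used_percent <= threshold


print(can_run_ml_task(85.0))
print(can_run_ml_task(95.0))
print(can_run_ml_task(10.0, threshold=0))
print(can_run_ml_task(99.9, threshold=100))
```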
