Add cluster awareness and decommission docs #2438

Merged
merged 30 commits into from
Jan 23, 2023
30 commits
ac6879c
Add cluster awareness and decommission docs
Naarcha-AWS Jan 19, 2023
f1a1656
Update _api-reference/cluster-awareness.md
Naarcha-AWS Jan 19, 2023
2009556
Edit technical feedback
Naarcha-AWS Jan 19, 2023
9ab64cd
Add new cluster awareness examples
Naarcha-AWS Jan 20, 2023
470ea01
Add technical feedback
Naarcha-AWS Jan 23, 2023
4855b78
Update _api-reference/cluster-awareness.md
Naarcha-AWS Jan 23, 2023
430c8f5
Update _api-reference/cluster-awareness.md
Naarcha-AWS Jan 23, 2023
d17cf00
Add Caroline's feedback
Naarcha-AWS Jan 23, 2023
fcefe54
Add one more tweak
Naarcha-AWS Jan 23, 2023
6ae0497
Update _ml-commons-plugin/cluster-settings.md
Naarcha-AWS Jan 23, 2023
d225427
Update _ml-commons-plugin/cluster-settings.md
Naarcha-AWS Jan 23, 2023
ab9c864
Update _api-reference/cluster-awareness.md
Naarcha-AWS Jan 23, 2023
78b17a3
Update _api-reference/cluster-awareness.md
Naarcha-AWS Jan 23, 2023
dfbd089
Update _api-reference/cluster-awareness.md
Naarcha-AWS Jan 23, 2023
8834026
Update _ml-commons-plugin/cluster-settings.md
Naarcha-AWS Jan 23, 2023
f1647f6
Update _api-reference/cluster-awareness.md
Naarcha-AWS Jan 23, 2023
9c528ec
Update _api-reference/cluster-awareness.md
Naarcha-AWS Jan 23, 2023
ddce7a0
Update _api-reference/cluster-awareness.md
Naarcha-AWS Jan 23, 2023
164ab0c
Update _api-reference/cluster-awareness.md
Naarcha-AWS Jan 23, 2023
d571148
Update _api-reference/cluster-awareness.md
Naarcha-AWS Jan 23, 2023
f8f6a82
Update _ml-commons-plugin/cluster-settings.md
Naarcha-AWS Jan 23, 2023
69859c7
Update _api-reference/cluster-awareness.md
Naarcha-AWS Jan 23, 2023
17d7c97
Update _api-reference/cluster-awareness.md
Naarcha-AWS Jan 23, 2023
0ed2074
Update _api-reference/cluster-decommission.md
Naarcha-AWS Jan 23, 2023
88183c8
Update _api-reference/cluster-awareness.md
Naarcha-AWS Jan 23, 2023
306d242
Update _api-reference/cluster-decommission.md
Naarcha-AWS Jan 23, 2023
b8eb313
Add editoiral feedback
Naarcha-AWS Jan 23, 2023
be4b3c4
Fix typos
Naarcha-AWS Jan 23, 2023
94c4a10
Final editorial note
Naarcha-AWS Jan 23, 2023
7254b92
Fix merge conflicts
Naarcha-AWS Jan 23, 2023
121 changes: 121 additions & 0 deletions _api-reference/cluster-awareness.md
@@ -0,0 +1,121 @@
---
layout: default
title: Cluster routing and awareness
nav_order: 16
---

# Cluster routing and awareness

You can use weights per awareness attribute to control the distribution of search or HTTP traffic across zones. This is commonly used for zonal deployments, heterogeneous instances, and routing traffic away from zones during zonal failure.

## HTTP and path methods

```
PUT /_cluster/routing/awareness/<attribute>/weights
GET /_cluster/routing/awareness/<attribute>/weights?local
GET /_cluster/routing/awareness/<attribute>/weights
DELETE /_cluster/routing/awareness/<attribute>/weights
```

## Path parameters

Parameter | Type | Description
:--- | :--- | :---
attribute | String | The name of the awareness attribute, usually `zone`. The attribute name must match the values listed in the request body when assigning weights to zones.

## Request body parameters

Parameter | Type | Description
:--- | :--- | :---
weights | JSON object | Assigns weights to attributes within the request body of the PUT request. Weights can be set in any ratio, for example, 2:3:5. In a 2:3:5 ratio with 3 zones, for every 100 requests sent to the cluster, each zone would receive either 20, 30, or 50 search requests in a random order. When assigned a weight of `0`, the zone does not receive any search traffic.
_version | String | Implements optimistic concurrency control (OCC) through versioning. The parameter uses simple versioning, such as `1`, and increments upward based on each subsequent modification. This allows any servers from which a request originates to validate whether or not a zone has been modified.
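
As a rough illustration of how a weight ratio translates into traffic shares, the following Python sketch computes the expected number of requests per zone. The function name `expected_requests` is hypothetical, and the sketch only models the arithmetic described in the table, not the actual routing logic in OpenSearch.

```python
def expected_requests(weights, total_requests=100):
    """Return the expected number of requests per zone for a weight ratio."""
    total_weight = sum(weights.values())
    if total_weight == 0:
        raise ValueError("At least one zone must have a nonzero weight")
    # Each zone receives a share proportional to its weight.
    return {
        zone: total_requests * weight / total_weight
        for zone, weight in weights.items()
    }


shares = expected_requests({"zone_1": 2, "zone_2": 3, "zone_3": 5})
print(shares)
```

A zone with a weight of `0` gets an expected share of zero requests, which matches the behavior described for the `weights` parameter.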


In the following example request body, `zone_1` and `zone_2` each receive 50 of every 100 requests, whereas `zone_3` is prevented from receiving requests:

```
{
  "weights":
  {
    "zone_1": "5",
    "zone_2": "5",
    "zone_3": "0"
  },
  "_version" : 1
}
```

## Example: Weighted round robin search

The following example request creates a round robin shard allocation for search traffic by assigning equal weights to `zone_1` and `zone_2` and a weight of `0` to `zone_3`:

### Request

```
PUT /_cluster/routing/awareness/zone/weights
{
  "weights":
  {
    "zone_1": "1",
    "zone_2": "1",
    "zone_3": "0"
  },
  "_version" : 1
}
```

### Response

```
{
"acknowledged": true
}
```
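
For readers calling the API from a script, here is a minimal sketch that builds the same `PUT` request with Python's standard library. The helper name `build_put_weights_request` and the base URL `http://localhost:9200` are assumptions for illustration; authentication and TLS settings are omitted.

```python
import json
import urllib.request


def build_put_weights_request(base_url, attribute, weights, version):
    """Build (but do not send) a PUT request that sets zone weights."""
    url = f"{base_url}/_cluster/routing/awareness/{attribute}/weights"
    body = {"weights": weights, "_version": version}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Hypothetical local cluster address; replace with your own endpoint.
request = build_put_weights_request(
    "http://localhost:9200",
    "zone",
    {"zone_1": "1", "zone_2": "1", "zone_3": "0"},
    1,
)
print(request.get_method(), request.full_url)
```

To send the request against a real cluster, pass `request` to `urllib.request.urlopen`.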


## Example: Getting weights for all zones

The following example request gets weights for all zones.

### Request

```
GET /_cluster/routing/awareness/zone/weights
```

### Response

OpenSearch responds with the weight of each zone:

```json
{
"weights":
{

"zone_1": "1.0",
"zone_2": "1.0",
"zone_3": "0.0"
},
"_version":1
}
```
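
The response can be inspected programmatically. The following Python sketch parses a response shaped like the one above and lists the zones that currently receive no search traffic. The function name `zones_without_traffic` is hypothetical.

```python
import json

# Sample response text copied from the example response above.
response_text = (
    '{"weights": {"zone_1": "1.0", "zone_2": "1.0", "zone_3": "0.0"}, '
    '"_version": 1}'
)


def zones_without_traffic(response_body):
    """Return the sorted names of zones whose weight is zero."""
    weights = json.loads(response_body)["weights"]
    # Weights are returned as strings, so convert before comparing.
    return sorted(zone for zone, weight in weights.items() if float(weight) == 0.0)


idle_zones = zones_without_traffic(response_text)
print(idle_zones)
```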

## Example: Deleting weights

You can remove your weight ratio for each zone using the `DELETE` method.

### Request

```
DELETE /_cluster/routing/awareness/zone/weights
```

### Response

```json
{
"_version":1
}
```

## Next steps

- For more information about zone decommissioning, see [Cluster decommission]({{site.url}}{{site.baseurl}}/api-reference/cluster-decommission/).
- For more information about allocation awareness, see [Cluster formation]({{site.url}}{{site.baseurl}}/opensearch/cluster/#advanced-step-6-configure-shard-allocation-awareness-or-forced-awareness).
80 changes: 80 additions & 0 deletions _api-reference/cluster-decommission.md
@@ -0,0 +1,80 @@
---
layout: default
title: Cluster decommission
nav_order: 20
---

# Cluster decommission

The cluster decommission operation adds support for decommissioning based on awareness. It greatly benefits multi-zone deployments, where awareness attributes, such as `zone`, can aid in applying new upgrades to a cluster in a controlled fashion. This is especially useful during outages, in which case you can decommission the unhealthy zone to prevent replication requests from stalling and prevent your request backlog from becoming too large.

For more information about allocation awareness, see [Shard allocation awareness]({{site.url}}{{site.baseurl}}/opensearch/cluster/#shard-allocation-awareness).


## HTTP and path methods

```
PUT /_cluster/decommission/awareness/{awareness_attribute_name}/{awareness_attribute_value}
GET /_cluster/decommission/awareness/{awareness_attribute_name}/_status
DELETE /_cluster/decommission/awareness
```

## URL parameters

Parameter | Type | Description
:--- | :--- | :---
awareness_attribute_name | String | The name of the awareness attribute, usually `zone`.
awareness_attribute_value | String | The value of the awareness attribute. For example, if you have shards allocated in two different zones, you can give each zone a value of `zone-a` or `zone-b`. The cluster decommission operation decommissions the zone listed in the method.
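
To show how the two URL parameters slot into the paths listed earlier, here is a small Python sketch that assembles them. The helper names `decommission_path` and `decommission_status_path` are hypothetical, and percent-encoding the values is a defensive assumption rather than a documented requirement.

```python
from urllib.parse import quote


def decommission_path(attribute_name, attribute_value):
    """Build the path used to decommission one attribute value, such as a zone."""
    return "/_cluster/decommission/awareness/{}/{}".format(
        quote(attribute_name, safe=""), quote(attribute_value, safe="")
    )


def decommission_status_path(attribute_name):
    """Build the path used to read the decommission status for an attribute."""
    return "/_cluster/decommission/awareness/{}/_status".format(
        quote(attribute_name, safe="")
    )


print(decommission_path("zone", "zone-a"))
print(decommission_status_path("zone"))
```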


## Example: Decommissioning and recommissioning a zone

You can use the following example requests to decommission and recommission a zone:

### Request

The following example request decommissions `zone-a`:

```
PUT /_cluster/decommission/awareness/zone/zone-a
```

If you want to recommission a decommissioned zone, you can use the `DELETE` method:

```
DELETE /_cluster/decommission/awareness
```

### Response


```json
{
"acknowledged": true
}
```

## Example: Getting zone decommission status

The following example request returns the decommission status of all zones.

### Request

```
GET /_cluster/decommission/awareness/zone/_status
```


### Response

```json
{
"zone-1": "INIT | DRAINING | IN_PROGRESS | SUCCESSFUL | FAILED"
}
```
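
A script that waits for a decommission to complete could check the returned status as in the following Python sketch. Treating `SUCCESSFUL` and `FAILED` as the only terminal states is an assumption inferred from the status names above, and the function name `decommission_finished` is hypothetical.

```python
import json

# Assumption: these two states mean the operation is no longer running.
TERMINAL_STATES = {"SUCCESSFUL", "FAILED"}


def decommission_finished(status_body):
    """Return True when every reported zone is in a terminal state."""
    statuses = json.loads(status_body)
    return all(state in TERMINAL_STATES for state in statuses.values())


print(decommission_finished('{"zone-a": "DRAINING"}'))
print(decommission_finished('{"zone-a": "SUCCESSFUL"}'))
```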


## Next steps

- For more information about zone awareness and weight, see [Cluster awareness]({{site.url}}{{site.baseurl}}/api-reference/cluster-awareness/).
- For more information about allocation awareness, see [Cluster formation]({{site.url}}{{site.baseurl}}/opensearch/cluster/#advanced-step-6-configure-shard-allocation-awareness-or-forced-awareness).
3 changes: 2 additions & 1 deletion _api-reference/cluster-health.md
@@ -1,7 +1,7 @@
---
layout: default
title: Cluster health
nav_order: 16
nav_order: 17
---

# Cluster health
@@ -47,6 +47,7 @@ wait_for_events | Enum | Wait until all currently queued events with the given p
wait_for_no_relocating_shards | Boolean | Whether to wait until there are no relocating shards in the cluster. Default is false.
wait_for_no_initializing_shards | Boolean | Whether to wait until there are no initializing shards in the cluster. Default is false.
wait_for_status | Enum | Wait until the cluster health reaches the specified status or better. Supported values are `green`, `yellow`, and `red`.
weights | JSON object | Assigns weights to attributes within the request body of the PUT request. Weights can be set in any ratio, for example, 2:3:5. In a 2:3:5 ratio with three zones, for every 100 requests sent to the cluster, each zone would receive either 20, 30, or 50 search requests in a random order. When assigned a weight of `0`, the zone does not receive any search traffic.

#### Sample request

2 changes: 1 addition & 1 deletion _api-reference/cluster-settings.md
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
---
layout: default
title: Cluster settings
nav_order: 17
nav_order: 18
---

# Cluster settings
2 changes: 1 addition & 1 deletion _api-reference/count.md
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
---
layout: default
title: Count
nav_order: 20
nav_order: 21
---

# Count
46 changes: 40 additions & 6 deletions _ml-commons-plugin/cluster-settings.md
@@ -12,7 +12,7 @@ To enhance and customize your OpenSearch cluster for machine learning (ML), you

## Run tasks and models on ML nodes only

If `true`, ML Commons tasks and models run machine learning (ML) tasks on ML nodes only. If `false`, tasks and models run on ML nodes first. If no ML nodes exist, tasks and models run on data nodes. Don't set as `false` on a production cluster.
If `true`, ML Commons tasks and models run machine learning (ML) tasks on ML nodes only. If `false`, tasks and models run on ML nodes first. If no ML nodes exist, tasks and models run on data nodes. We recommend that you do not set this value to "false" on production clusters.

### Setting

@@ -27,7 +27,7 @@ plugins.ml_commons.only_run_on_ml_node: true

## Dispatch tasks to ML node

`round_robin` dispatches ML tasks to ML nodes using round robin routing. `least_load` gathers all ML nodes' runtime information, such as JVM heap memory usage and running tasks, then dispatches tasks to the ML node with the least load.
`round_robin` dispatches ML tasks to ML nodes using round robin routing. `least_load` gathers runtime information from all ML nodes, like JVM heap memory usage and running tasks, and then dispatches the tasks to the ML node with the lowest load.


### Setting
@@ -43,7 +43,9 @@ plugins.ml_commons.task_dispatch_policy: round_robin
- Value range: `round_robin` or `least_load`
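
The difference between the two dispatch policies can be sketched in a few lines of Python. This is a toy model, not the ML Commons implementation: the node names, the single numeric load score, and the function names `round_robin_dispatcher` and `least_load_node` are all assumptions for illustration.

```python
from itertools import cycle


def round_robin_dispatcher(nodes):
    """Return a function that hands out nodes in a fixed rotating order."""
    rotation = cycle(nodes)
    return lambda: next(rotation)


def least_load_node(node_loads):
    """Pick the node with the lowest load score (lower means less loaded)."""
    return min(node_loads, key=node_loads.get)


# round_robin: each task goes to the next node in turn, regardless of load.
next_node = round_robin_dispatcher(["ml-1", "ml-2", "ml-3"])
order = [next_node() for _ in range(4)]

# least_load: the task goes to whichever node currently reports the lowest load.
chosen = least_load_node({"ml-1": 0.7, "ml-2": 0.2, "ml-3": 0.5})

print(order, chosen)
```

Round robin needs no runtime information, whereas least load depends on collecting current metrics from every ML node before each dispatch.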


## Set sync up job intervals
## Set sync job intervals

When returning runtime information with the [profile API]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api#profile), ML Commons will run a regular job to sync newly loaded or unloaded models on each node. When set to `0`, ML Commons immediately stops sync up jobs.

When returning runtime information with the [profile API]({{site.url}}{{site.baseurl}}/ml-commons-plugin/api#profile), ML Commons will run a regular sync up job to sync up newly loaded or unloaded models on each node. When set to `0`, ML Commons immediately stops sync up jobs.

@@ -60,7 +62,7 @@ plugins.ml_commons.sync_up_job_interval_in_seconds: 10

## Predict monitoring requests

Controls how many predict requests are monitored on one node. If set to `0`, OpenSearch clears all monitoring predict requests in the node's cache, and does not monitor predict requests from that point forward.
Controls how many upload model tasks can run in parallel on one node. If set to `0`, you cannot upload models to any node.

### Setting

@@ -92,7 +94,7 @@ plugins.ml_commons.max_upload_model_tasks_per_node: 10

## Load model tasks per node

Controls how many load model tasks can run in parallel on one node. If set to `0`, you cannot load models to any node.
Controls how many load model tasks can run in parallel on one node. If set to 0, you cannot load models to any node.

### Setting

@@ -107,7 +109,7 @@ plugins.ml_commons.max_load_model_tasks_per_node: 10

## Add trusted URL

The default value allows uploading a model file from any `http`, `https`, `ftp`, or local file. You can change this value to restrict trusted model URL.
The default value allows you to upload a model file from any http/https/ftp/local file. You can change this value to restrict trusted model URLs.


### Setting
@@ -120,3 +122,35 @@ plugins.ml_commons.trusted_url_regex: ^(https?\|ftp\|file)://[-a-zA-Z0-9+&@#/%?=

- Default value: `^(https?\|ftp\|file)://[-a-zA-Z0-9+&@#/%?=~_\|!:,.;]*[-a-zA-Z0-9+&@#/%=~_\|]`
- Value range: Java regular expression (regex) string

## Assign task timeout

Assigns how long in seconds an ML task will live. After the timeout, the task will fail.

### Setting

```
plugins.ml_commons.ml_task_timeout_in_seconds: 600
```

### Values

- Default value: 600
- Value range: [1, 86400]

## Set native memory threshold

Sets a circuit breaker that checks all system memory usage before running an ML task. If the native memory exceeds the threshold, OpenSearch throws an exception and stops running any ML task.

Values are based on the percentage of memory available. When set to `0`, no ML tasks will run. When set to `100`, the circuit breaker closes and no threshold exists.

### Setting

```
plugins.ml_commons.native_memory_threshold: 90
```

### Values

- Default value: 90
- Value range: [0, 100]
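
The threshold behavior described above can be modeled with a short Python sketch. This is a simplified illustration under stated assumptions, not the ML Commons circuit breaker: the function name `ml_task_allowed` is hypothetical, and memory usage is passed in as a percentage rather than measured.

```python
def ml_task_allowed(used_memory_percent, threshold=90):
    """Return True if an ML task may run under the given memory threshold."""
    if not 0 <= threshold <= 100:
        raise ValueError("threshold must be in the range [0, 100]")
    if threshold == 100:
        return True  # No effective threshold exists.
    if threshold == 0:
        return False  # No ML tasks run.
    # Tasks are rejected only when usage exceeds the threshold.
    return used_memory_percent <= threshold


print(ml_task_allowed(50))
print(ml_task_allowed(95))
```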