[Searchable Snapshot] Propose API #2922

Closed
Tracked by #2919
andrross opened this issue Apr 15, 2022 · 9 comments
Labels
distributed framework enhancement Enhancement or improvement to existing feature or request


@andrross
Member

andrross commented Apr 15, 2022

Cluster Configuration

A "remote searcher" is a concept that should have broad applicability beyond searching remote snapshots: it defines a node capable of searching (and caching) indexes hosted in a remote store, potentially in any number of formats beyond a snapshot. This issue proposes creating a new node role to define such a capability.

Node Configuration

A new node role will be introduced: remote_searcher. This role indicates that the node is capable of hosting "remote" shards where the data is authoritatively stored in a remote store. The index data will not be permanently stored on the local instance storage. The local disk can be used as a cache. The cache size can be configured with the following setting in opensearch.yml:

node.remote_searcher.cache.size: 100GB
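
As a rough illustration of how a value such as 100GB might be interpreted, here is a minimal Python sketch. The function name parse_cache_size is hypothetical, and the use of binary multiples (1 GB = 1024^3 bytes) is an assumption based on how OpenSearch byte-size settings generally behave; the proposal itself does not specify the parsing rules.

```python
import re

# Binary multiples (assumption: matches the usual OpenSearch byte-size convention).
_UNITS = {
    "b": 1,
    "kb": 1024,
    "mb": 1024**2,
    "gb": 1024**3,
    "tb": 1024**4,
    "pb": 1024**5,
}


def parse_cache_size(value: str) -> int:
    """Convert a size string such as '100GB' into a number of bytes."""
    match = re.fullmatch(
        r"\s*(\d+)\s*(b|kb|mb|gb|tb|pb)\s*", value, flags=re.IGNORECASE
    )
    if match is None:
        raise ValueError(f"unrecognized size: {value!r}")
    amount, unit = match.groups()
    return int(amount) * _UNITS[unit.lower()]


cache_bytes = parse_cache_size("100GB")
```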

Index creation API

Several options exist for creating a remote index. This section details some of them.

Proposal 1: Extend Snapshot Restore

A new parameter will be introduced in the snapshot restore API: storage_type.

Setting | Description
--- | ---
storage_type | Must be one of local or remote_snapshot. local is the default if not specified, and indicates that all snapshot metadata and index data will be downloaded to local instance storage. remote_snapshot indicates that snapshot metadata will be downloaded to the cluster but the remote repository will remain the authoritative store of the index data. Data will be downloaded and cached as necessary to service queries. At least one node in the cluster must be configured for the remote_searcher role in order to restore a snapshot of type remote_snapshot.

For example:

POST _snapshot/my-repository/2/_restore
{
  "indices": "my-index*",

  "storage_type": "remote_snapshot",  <-- NEW PARAMETER

  "ignore_unavailable": true,
  "include_global_state": false,
  "include_aliases": false,
  "partial": false,
  "rename_pattern": "opensearch-dashboards(.+)",
  "rename_replacement": "restored-opensearch-dashboards$1",
  "index_settings": {
    "index.blocks.read_only": false
  },
  "ignore_index_settings": [
    "index.refresh_interval"
  ]
}
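
The rules described for storage_type can be sketched client-side as follows. This is only an illustration of the proposal, not an existing client API: build_restore_request and its cluster_node_roles parameter are hypothetical names, and the check for a remote_searcher node would in practice be enforced by the cluster rather than the caller.

```python
import json

# The two values allowed by the proposal; "local" is the default.
ALLOWED_STORAGE_TYPES = ("local", "remote_snapshot")


def build_restore_request(repository, snapshot, indices,
                          storage_type="local", cluster_node_roles=()):
    """Return (method, path, body) for a snapshot restore call.

    cluster_node_roles is a hypothetical input: one collection of role
    names per node, used here only to mimic the proposed precondition.
    """
    if storage_type not in ALLOWED_STORAGE_TYPES:
        raise ValueError(f"storage_type must be one of {ALLOWED_STORAGE_TYPES}")
    if storage_type == "remote_snapshot":
        # The proposal requires at least one remote_searcher node.
        if not any("remote_searcher" in roles for roles in cluster_node_roles):
            raise ValueError("no node with the remote_searcher role")
    path = f"_snapshot/{repository}/{snapshot}/_restore"
    body = {"indices": indices, "storage_type": storage_type}
    return "POST", path, json.dumps(body)


method, path, body = build_restore_request(
    "my-repository", "2", "my-index*",
    storage_type="remote_snapshot",
    cluster_node_roles=[{"data"}, {"remote_searcher"}],
)
```

Omitting storage_type yields a body with "storage_type": "local", which matches the default behaviour described in the table.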

Proposal 2: Extend Create Index

New settings will be introduced to the create index API under the index.remote namespace.

Setting | Description
--- | ---
index.remote | Defaults to false, indicating the index will exclusively use local instance storage. If true, the index.remote.datastore property must be specified.
index.remote.datastore.type | Type of remote store. Only snapshot is supported initially.
index.remote.datastore.snapshot.id | Identifier of the snapshot to use as the source for this remote index.

PUT /my-index-1

{
  "settings": {
    "index.remote": true,
    "index.remote.datastore.type": "snapshot",
    "index.remote.datastore.snapshot.id": "my-snapshot-repo/1"
  }
}
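
A minimal sketch of the validation these settings imply, assuming the rules in the table above. The function names are hypothetical. Two points are assumptions rather than part of the proposal: that a snapshot id is mandatory when the type is snapshot, and that an id such as "my-snapshot-repo/1" splits into a repository name and a snapshot name.

```python
def validate_remote_settings(settings: dict) -> None:
    """Raise ValueError if the index.remote.* settings are inconsistent."""
    if not settings.get("index.remote", False):
        return  # Default: index uses local instance storage only.
    store_type = settings.get("index.remote.datastore.type")
    if store_type is None:
        raise ValueError("index.remote.datastore.type is required when index.remote is true")
    if store_type != "snapshot":
        raise ValueError("only the snapshot datastore type is supported initially")
    # Assumption: a snapshot-backed index cannot be created without a source id.
    if not settings.get("index.remote.datastore.snapshot.id"):
        raise ValueError("index.remote.datastore.snapshot.id is required")


def split_snapshot_id(snapshot_id: str) -> tuple:
    """Split 'repository/snapshot' into its two parts (assumed format)."""
    repository, separator, snapshot = snapshot_id.partition("/")
    if not separator or not repository or not snapshot:
        raise ValueError(f"expected 'repository/snapshot', got {snapshot_id!r}")
    return repository, snapshot


settings = {
    "index.remote": True,
    "index.remote.datastore.type": "snapshot",
    "index.remote.datastore.snapshot.id": "my-snapshot-repo/1",
}
validate_remote_settings(settings)
repository, snapshot = split_snapshot_id(settings["index.remote.datastore.snapshot.id"])
```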
@reta
Collaborator

reta commented May 11, 2022

@andrross thinking about APIs, would it make sense to consider a new attach / detach concept for the index? I believe one of the primary use cases will not be creating an index that is known ahead of time to be remote, but shelving existing indices (e.g. by data age or other compliance requirements) into remote ones.

An explicit attach API might be helpful when the index in question needs to be fully available for a prolonged period of time, as opposed to being queried in an ad-hoc fashion (which the cache would address).

Curious to hear what you think.

@andrross
Member Author

Listing out scenarios, with some rough API sketches:

1. Index stored only on instance storage

This is the status quo. Included here for completeness, but "false" is the implicit default.

PUT /my-index-1
{
  "settings": {
    "index.remote": false
  }
}

2. Index stored on instance storage and is backed up to remote

This is the work in flight. All index data is still completely stored on instance storage and queries are handled the same as they are today.

PUT /my-index-1
{
  "settings": {
    "index.remote": true,
    "index.remote.datastore.type": "remote_backup",
    "index.remote.datastore.repository": "my-snapshot-repo"
  }
}

3. Index stored in remote storage, but is queryable and writeable

This is an evolution of the case above, where the index data is stored in and queried from the remote store, with a local cache for performance. This scenario would require remote_searcher nodes to exist in the cluster.

PUT /my-index-1
{
  "settings": {
    "index.remote": true,
    "index.remote.datastore.type": "remote",
    "index.remote.datastore.repository": "my-snapshot-repo"
  }
}

4. Index stored in a snapshot

In this case the data is stored in a snapshot, in the snapshot format that exists today. The index is not writeable. This is really meant as an intermediate goal as we ultimately build towards implementing scenario number 3.

PUT /my-index-1
{
  "settings": {
    "index.remote": true,
    "index.remote.datastore.type": "snapshot",
    "index.remote.datastore.snapshot.id": "my-snapshot-repo/1"
  }
}

In the fullness of time it should be possible to migrate indexes between scenarios 1, 2, and 3 as efficiently as possible in order to solve the "shelving existing indices" problem you describe. Scenario 4 may go away in the long term if the current snapshotting mechanism is ultimately replaced by these new remote capabilities, but it is being included as an incremental feature that can add value along the way. @reta What do you think? Did you have something specific in mind with attach/detach APIs?
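
To make the four scenarios concrete, here is a rough Python sketch that maps the setting values used in the API sketches above onto a scenario. The enum and function names are hypothetical, and the setting values themselves are only the rough proposals from this comment, not a finalized API.

```python
from enum import Enum


class Scenario(Enum):
    LOCAL_ONLY = 1                # instance storage only (status quo)
    LOCAL_WITH_REMOTE_BACKUP = 2  # instance storage, backed up to remote
    REMOTE_WRITABLE = 3           # stored in remote store, queryable and writeable
    SNAPSHOT_READ_ONLY = 4        # stored in a snapshot, not writeable


# Values of index.remote.datastore.type as used in the sketches above.
_TYPE_TO_SCENARIO = {
    "remote_backup": Scenario.LOCAL_WITH_REMOTE_BACKUP,
    "remote": Scenario.REMOTE_WRITABLE,
    "snapshot": Scenario.SNAPSHOT_READ_ONLY,
}


def classify(settings: dict) -> Scenario:
    """Map index settings to one of the four scenarios."""
    if not settings.get("index.remote", False):
        return Scenario.LOCAL_ONLY  # "false" is the implicit default
    store_type = settings.get("index.remote.datastore.type")
    if store_type not in _TYPE_TO_SCENARIO:
        raise ValueError(f"unknown datastore type: {store_type!r}")
    return _TYPE_TO_SCENARIO[store_type]


def is_writable(scenario: Scenario) -> bool:
    """Only the snapshot-backed scenario is described as not writeable."""
    return scenario is not Scenario.SNAPSHOT_READ_ONLY


scenario = classify({
    "index.remote": True,
    "index.remote.datastore.type": "snapshot",
    "index.remote.datastore.snapshot.id": "my-snapshot-repo/1",
})
```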

@reta
Collaborator

reta commented May 11, 2022

Thanks @andrross, I think my question could be rephrased like this: is it worth having an explicit API (which I called attach / detach for lack of a better name for now) versus manipulating index settings? Besides cleaner semantics, explicit APIs would allow us to introduce dedicated security checks, for example.

@dblock
Member

dblock commented May 11, 2022

Do we think we'd want to permission attaching a remote index differently? Feels like we are manipulating index properties (size, location, type, id), no?

If we go with attach/detach, that should work for all indexes.

@andrross
Member Author

> Feels like we are manipulating index properties

Yeah, that's what led me to propose it this way. Also, in any case we'd want the ability to get the current properties of an index, so index settings made sense for that. If we had imperative-style APIs (attach/detach) we'd also need APIs to describe the current attached or detached state, so it's not obvious to me that these properties are different from regular index settings.

@anasalkouz
Member

I think what customers need is to create an index backed by an existing snapshot, for which they would use the Create Index API?
But why do we need to extend the Snapshot Restore API? What is the use case for this?

@andrross
Member Author

> I think what customers need is to create an index backed by an existing snapshot, for which they would use the Create Index API? But why do we need to extend the Snapshot Restore API? What is the use case for this?

Snapshot restore is essentially an API for creating an index. Extending snapshot restore to create a searchable snapshot-type index makes sense, since many of the options in the API are still relevant (such as the renaming options and restoring partial snapshots). The unresolved question here is whether a single API can work with regular snapshots as well as the more forward-looking option of creating a remote index backed by the new remote store-type indexes that are being introduced.
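
As an example of why the renaming options remain useful, the following Python sketch approximates what rename_pattern and rename_replacement from the restore example earlier in this issue do to an index name. The function name is hypothetical, and this only emulates the regex replacement client-side: the restore API uses Java-style $1 group references, which are translated here to Python's syntax.

```python
import re


def preview_rename(index_name: str, rename_pattern: str, rename_replacement: str) -> str:
    """Approximate the restored index name for a given rename rule."""
    # Translate Java-style group references ($1) to Python's (\g<1>).
    python_replacement = re.sub(r"\$(\d+)", r"\\g<\1>", rename_replacement)
    return re.sub(rename_pattern, python_replacement, index_name)


restored = preview_rename(
    "opensearch-dashboards-1",
    "opensearch-dashboards(.+)",
    "restored-opensearch-dashboards$1",
)
unchanged = preview_rename(
    "my-index-1",
    "opensearch-dashboards(.+)",
    "restored-opensearch-dashboards$1",
)
```

Index names that do not match the pattern are left as they are, so the same rule would apply whether the restored index is local or of type remote_snapshot.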

@dblock
Member

dblock commented Jun 28, 2022

@andrross what is your strong opinion about what to go with?

andrross changed the title from "[Searchable Remote Index] Propose API" to "[Searchable Snapshot] Propose API" on Jun 29, 2022
@andrross
Member Author

andrross commented Aug 5, 2022

For the searchable snapshot phase of the storage roadmap, I think adding an option to the existing snapshot restore API is the right way to go for the reasons listed above. We're going to start with that approach for the initial development, but we can always change it based on feedback.

There are still many open questions about the API and user experience for later phases in the storage roadmap (#3739) but I intend to continue that discussion in separate issues. For the purposes of the searchable snapshot API though, I'm going to go ahead and close this issue with the intent to get an experimental version behind a feature flag that implements "proposal 1".

andrross closed this as completed on Aug 5, 2022