[RFC] Shallow Snapshots V2 - Scaling Snapshot Operations #15083

Closed
anshu1106 opened this issue Aug 2, 2024 · 2 comments
Labels: enhancement, Roadmap:Cost/Performance/Scale, Storage:Snapshots

Comments


anshu1106 commented Aug 2, 2024

Objective

Shallow copy snapshots are currently taken for remote store enabled clusters, capturing references to data stored in a remote store rather than copying all the data. This approach makes the snapshot process faster and more efficient, as it doesn’t require transferring large amounts of data. In a shallow copy snapshot, a reference to the remote store metadata file is stored in the snapshot shard metadata.
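As a minimal illustration of what such a reference might look like (the type and field names below are assumptions for illustration, not the actual snapshot metadata schema), a shallow shard snapshot entry stores a pointer to the remote store segment metadata file rather than copies of the segment files themselves:

```java
// Illustrative shape of a shallow shard snapshot entry; field names are
// assumptions, not the actual snapshot metadata schema. Instead of copying
// segment files, the entry records a pointer to the remote store segment
// metadata file that already describes them.
public record ShallowShardSnapshotEntry(
    String indexName,
    int shardId,
    String remoteSegmentsMetadataFileName,  // reference into the remote store
    long primaryTerm,
    long commitGeneration
) {}
```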

This RFC covers changes to the shallow snapshot flow using Timestamp Pinning; Timestamp Pinning itself is described in a separate RFC.

In this RFC we discuss the shortcomings of the current snapshot flow and propose mechanisms to make snapshot operations scale independently of the number of shards in the cluster.

Issues with the current snapshot flow

  1. Communication overhead

For create snapshot, there are two communication channels between the cluster manager and all other nodes:

  • The cluster manager updates the ClusterState object by adding, removing, or altering the contents of its custom entry SnapshotsInProgress. All nodes consume the state of SnapshotsInProgress and start or terminate the relevant shard snapshot tasks accordingly.
  • Nodes executing shard snapshot tasks report either success or failure of their snapshot tasks by submitting an UpdateIndexShardSnapshotStatusRequest to the cluster manager node, which updates the snapshot’s entry in the ClusterState object accordingly. For each shard status, a cluster state update task is created with priority set to NORMAL.

The following image depicts the interaction between the cluster manager node, the data nodes, and the ClusterState object described above.
[Image: communication channels between the cluster manager node, the data nodes, and the ClusterState object]

This two-way communication overhead grows with the number of shards; a rough model of it, together with the locking overhead described below, is sketched after this list.

  2. Non-deterministic snapshot creation duration - The cluster state update tasks for shard status updates can be delayed by higher-priority tasks, making the snapshot creation duration unpredictable.

  3. Excessive locking overhead

    • A lock file is created per segment metadata file to prevent deletion of data referenced by shallow snapshots.
    • For a cluster with 100K shards, each snapshot creates 100K lock files, even if only 100 shards were actively receiving indexing throughput.
    • This increases remote store calls proportionally to the number of shards during snapshot creation and deletion.
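As a rough, purely illustrative model of the two overheads above (the class and method names below are placeholders, not OpenSearch internals), both the number of cluster state update tasks and the number of lock files scale linearly with the shard count in the current flow:

```java
// Back-of-the-envelope model of the two V1 overheads described above. The class
// and method names are illustrative placeholders, not OpenSearch internals.
public final class ShallowSnapshotV1OverheadModel {

    record V1Overhead(long clusterStateUpdateTasks, long remoteStoreLockFiles) {}

    static V1Overhead estimate(int shardCount) {
        // Each shard reports its snapshot status back to the cluster manager, and
        // each report becomes its own NORMAL-priority cluster state update task;
        // the initial SnapshotsInProgress entry and final cleanup add two more.
        long clusterStateUpdateTasks = (long) shardCount + 2;
        // One lock file is written per shard's segment metadata file, whether or
        // not the shard received any new writes since the previous snapshot.
        long lockFiles = shardCount;
        return new V1Overhead(clusterStateUpdateTasks, lockFiles);
    }

    public static void main(String[] args) {
        V1Overhead overhead = estimate(100_000);
        System.out.println("Cluster state update tasks: " + overhead.clusterStateUpdateTasks());
        System.out.println("Remote store lock files:    " + overhead.remoteStoreLockFiles());
    }
}
```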

Requirements

  • Make snapshot operations more deterministic.
  • Minimize the number of cluster state updates.
  • Scale snapshot operations independently of the number of shards.

Proposed Solution

Centralize Snapshot Creation with Timestamp pinning

Since the shallow snapshot keeps a reference to remote store segment metadata in the snapshot metadata and doesn’t require data nodes to upload segment files, we propose to move snapshot checkpointing to the cluster manager. This eliminates all the communication overhead between data nodes and the cluster manager node. It can also remove the need for per-shard snapshot status to be maintained in the cluster state.

Timestamp Pinning - Implement Timestamp pinning to replace the explicit locking mechanism. With Timestamp Pinning, there is no need to create a lock file for each shard. Instead, timestamps are used to implicitly lock segment files referenced by the snapshot and prevent deletion during GC. This approach reduces the number of remote store calls for locks during snapshot creation and deletion.
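A minimal sketch of this implicit locking, assuming a simplified remote store layout in which each metadata file carries a timestamp (the names and file layout are illustrative, not the actual remote store format):

```java
// Minimal sketch of implicit locking via pinned timestamps. Names and file
// layout are assumptions for illustration, not the actual remote store format.
import java.util.Comparator;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;
import java.util.stream.Collectors;

public class PinnedTimestampGc {

    record MetadataFile(String name, long timestampMillis) {}

    // A metadata file is implicitly locked if it is the newest file at or before
    // some pinned timestamp, i.e. it describes the state a snapshot refers to.
    static boolean isImplicitlyLocked(MetadataFile file, List<MetadataFile> allFiles, NavigableSet<Long> pinnedTimestamps) {
        for (long pinned : pinnedTimestamps) {
            MetadataFile newestAtPin = allFiles.stream()
                .filter(f -> f.timestampMillis() <= pinned)
                .max(Comparator.comparingLong(MetadataFile::timestampMillis))
                .orElse(null);
            if (file.equals(newestAtPin)) {
                return true;
            }
        }
        return false;
    }

    // Simplified GC: delete everything except implicitly locked files and the
    // latest live metadata file (which must always be retained).
    static List<MetadataFile> filesEligibleForDeletion(List<MetadataFile> allFiles, NavigableSet<Long> pinnedTimestamps) {
        MetadataFile latest = allFiles.stream()
            .max(Comparator.comparingLong(MetadataFile::timestampMillis))
            .orElse(null);
        return allFiles.stream()
            .filter(f -> !f.equals(latest))
            .filter(f -> !isImplicitlyLocked(f, allFiles, pinnedTimestamps))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<MetadataFile> files = List.of(
            new MetadataFile("metadata__1", 100),
            new MetadataFile("metadata__2", 200),
            new MetadataFile("metadata__3", 300));
        NavigableSet<Long> pins = new TreeSet<>(List.of(250L));
        // metadata__2 is the newest file at or before the pin (250) and is kept;
        // metadata__3 is the latest live file and is kept; only metadata__1 is deletable.
        System.out.println(filesEligibleForDeletion(files, pins));
    }
}
```

The key point of the design is that one pinned timestamp per snapshot replaces one lock file per shard, so the number of remote store calls no longer grows with the shard count.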

High Level Design Details

Create Snapshot Operation

  1. The snapshot creation operation is executed entirely on the cluster manager node.
  2. Perform all existing validations (snapshot name, UUID creation, repository checks, etc.).
  3. Evaluate the timestamp to be pinned.
  4. Prepare an in-memory list of all the indices to be snapshotted and the cluster metadata (if include_global_state is set to true in the create snapshot request).
  5. Finalize the snapshot (can be parallelized)
    1. Count total shards for snapshot info details (shard failures are always set to 0).
    2. Write global metadata, IndexMetadata, and snapshot info to the snapshot repo.
    3. Update repository data with the new snapshot.
    4. Update the snapshot repository generation.
  6. Write the pinned timestamp for implicit locking.
  7. If timestamp pinning fails after retries, asynchronously delete the files uploaded to the snapshot repo.
  8. Fail the snapshot if the cluster manager switched since the start of the snapshot operation.
  9. Send success/failure in the response.
    [Image: create snapshot sequence diagram]
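A hedged sketch of steps 5-9 above as they might run on the cluster manager; all types and calls below are placeholders introduced for illustration, not OpenSearch APIs:

```java
// Hedged sketch of steps 5-9 above as they might run on the cluster manager. All
// types and calls are placeholders introduced for illustration, not OpenSearch APIs.
public final class CreateSnapshotV2Flow {

    interface SnapshotRepository {
        void writeGlobalMetadata(String snapshotUuid);
        void writeIndexMetadata(String snapshotUuid);
        void writeSnapshotInfo(String snapshotUuid, int totalShards);
        void updateRepositoryData(String snapshotUuid);
        void deleteSnapshotFilesAsync(String snapshotUuid);  // cleanup on failure
    }

    interface PinnedTimestampService {
        boolean pinWithRetries(String snapshotUuid, long timestampMillis);
    }

    static boolean createSnapshot(String snapshotUuid,
                                  long pinnedTimestamp,
                                  int totalShards,
                                  SnapshotRepository repo,
                                  PinnedTimestampService pins,
                                  long termAtStart,
                                  long termNow) {
        // Step 5: finalize the snapshot by writing metadata and repository data.
        repo.writeGlobalMetadata(snapshotUuid);
        repo.writeIndexMetadata(snapshotUuid);
        repo.writeSnapshotInfo(snapshotUuid, totalShards);   // shard failures always 0
        repo.updateRepositoryData(snapshotUuid);

        // Steps 6-7: pin the timestamp; on failure, clean up asynchronously and fail.
        if (!pins.pinWithRetries(snapshotUuid, pinnedTimestamp)) {
            repo.deleteSnapshotFilesAsync(snapshotUuid);
            return false;
        }

        // Step 8: fail if the cluster manager switched while the snapshot was running.
        if (termNow != termAtStart) {
            return false;
        }

        // Step 9: acknowledge success to the caller.
        return true;
    }
}
```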

For all operations, a create snapshot is considered successful only if a corresponding pinned timestamp is found for the snapshot. GET operations require a filter that checks for the presence of a pinned timestamp for shallow snapshots V2 before marking the snapshot status as SUCCESSFUL.
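As a minimal illustration (the helper below is hypothetical, not an existing API), the filter reduces to a presence check on the pinned timestamp entry for the snapshot:

```java
// Minimal illustration (hypothetical helper, not an existing API): a shallow V2
// snapshot is reported as SUCCESSFUL only if a pinned timestamp entry exists for it.
import java.util.Set;

final class V2SnapshotStatusFilter {
    static boolean isSuccessful(String snapshotUuid, Set<String> snapshotUuidsWithPinnedTimestamp) {
        return snapshotUuidsWithPinnedTimestamp.contains(snapshotUuid);
    }
}
```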

Benefits of the proposed approach

  • Snapshot creation time can be more deterministic and independent of the number of shards.
  • Automated snapshot creation can never be partial, since we can account for the last successful segment file for checkpointing.
  • Create snapshot can succeed even if a few indices are red; it will snapshot their last successful state.
  • We don’t require a flush to be triggered per index before snapshotting. To ensure that data is not lost for indices with a higher refresh interval, the remote segments plus the remote translog provide the complete state of successfully indexed data at any given time. Garbage collection of both segments and translog will skip deletion of metadata files that match a pinned timestamp.

Additional Considerations

Timestamp pinning to be done after uploading snapshot metadata files

If a cluster manager switch happens after timestamp pinning but before uploading the snapshot metadata files, we would need a mechanism to let the new cluster manager know about the state. To avoid the overhead of additional state management, we plan to update the pinned timestamp after snapshot finalization. If the pinned timestamp update fails, we can delete the metadata files uploaded to the snapshot repo and fail the snapshot.

We don’t need to write a snapshot shard-level metadata file

  • Deferring resolution of the snapshot timestamp to segments and translog helps make the create operation faster.
  • The LIST operation for scanning through segments and translogs for timestamp resolution is costly and can be deferred to restore.
  • Delete will become faster as it no longer needs to delete a per-shard snapshot metadata file.
  • Details stored in the per-shard metadata file can be calculated on demand from segment metadata in the remote store, as sketched below.
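A sketch of this on-demand, restore-time resolution, under the assumption that each shard's segment metadata files can be listed along with their timestamps (the names are illustrative, not the actual remote store metadata format):

```java
// Sketch of on-demand, restore-time resolution (assumed names, not the actual
// remote store metadata format): for each shard, pick the newest segment metadata
// file whose timestamp is at or before the snapshot's pinned timestamp.
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

final class PinnedTimestampResolver {

    record SegmentMetadata(String fileName, long timestampMillis) {}

    static Optional<SegmentMetadata> resolveForShard(List<SegmentMetadata> shardMetadataFiles, long pinnedTimestampMillis) {
        return shardMetadataFiles.stream()
            .filter(m -> m.timestampMillis() <= pinnedTimestampMillis)
            .max(Comparator.comparingLong(SegmentMetadata::timestampMillis));
    }
}
```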

Handling Cluster Manager failure

The snapshot request is acknowledged only after completion and verification that no cluster manager switch occurred during the process. If the cluster manager term and version change during the snapshot operation, the request fails.
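A minimal sketch of this guard, with illustrative types only; it mirrors the wording above of comparing the cluster manager term and version captured at the start of the operation against their values at the end:

```java
// Minimal sketch of the guard described above, with illustrative types only.
public final class ClusterManagerChangeGuard {

    record TermAndVersion(long term, long version) {}

    // The create request is acknowledged only if nothing changed in between;
    // otherwise it is failed, since a cluster manager switch may have occurred.
    static boolean safeToAcknowledge(TermAndVersion atStart, TermAndVersion atEnd) {
        return atStart.term() == atEnd.term() && atStart.version() == atEnd.version();
    }
}
```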

@harishbhakuni (Contributor) commented:

Thanks @anshu1106 for the RFC. I have a few comments/suggestions:

  1. Why are we not using a cluster state entry for this?

> For all operations, a create snapshot is considered successful only if a corresponding pinned timestamp is found for the snapshot. GET operations require a filter that checks for the presence of a pinned timestamp for shallow snapshots V2 before marking the snapshot status as SUCCESSFUL.

Some of the GET requests today directly read the files stored in the repository. With this, we will be adding the overhead of checking for the presence of a pinned timestamp as well, which would mean an additional GET or LIST call on the remote store repository.

> If the pinned timestamp update fails, we can delete the metadata files uploaded to the snapshot repo and fail the snapshot.

This would mean that, in case of failure, we will have to delete (number of indices + 2) metadata files and also update the repository data of the snapshot repository. If we do this asynchronously and the repository data update fails and another snapshot gets triggered, the repository data can end up having stale snapshot entries. How will we take care of removing those entries?

We can avoid all this by introducing a cluster state entry for the snapshot, which would handle master swap issues; that way we can update the repository data and other metadata files at the end. As the operations happen only on the master node and at the index level, it should not be that costly.

  2. Details about other snapshot operations are not mentioned here. Are we planning to add those separately in another doc? Mainly curious how snapshot deletion will work and what all it will need to do.

@anshu1106 (Contributor, Author) commented:

> Thanks @anshu1106 for the RFC. I have a few comments/suggestions:
>
>   1. Why are we not using a cluster state entry for this?
>
>> For all operations, a create snapshot is considered successful only if a corresponding pinned timestamp is found for the snapshot. GET operations require a filter that checks for the presence of a pinned timestamp for shallow snapshots V2 before marking the snapshot status as SUCCESSFUL.
>
> Some of the GET requests today directly read the files stored in the repository. With this, we will be adding the overhead of checking for the presence of a pinned timestamp as well, which would mean an additional GET or LIST call on the remote store repository.

Yeah, that's correct. The rationale behind this is to make the create operation faster, since it can be executed more frequently and can run in an automated manner.

>> If the pinned timestamp update fails, we can delete the metadata files uploaded to the snapshot repo and fail the snapshot.
>
> This would mean that, in case of failure, we will have to delete (number of indices + 2) metadata files and also update the repository data of the snapshot repository. If we do this asynchronously and the repository data update fails and another snapshot gets triggered, the repository data can end up having stale snapshot entries. How will we take care of removing those entries?
>
> We can avoid all this by introducing a cluster state entry for the snapshot, which would handle master swap issues; that way we can update the repository data and other metadata files at the end. As the operations happen only on the master node and at the index level, it should not be that costly.

Correct, we did spend some time on this. Going without a cluster state entry would reduce the response time of the create snapshot operation, since a cluster state update may have to compete with other higher-priority tasks. But it looks like, without a cluster state entry, we may lose some out-of-the-box concurrency checks that snapshots currently support. It makes sense to have a cluster state entry to avoid conflicting concurrent operations, and, as you mentioned, it will be helpful during cluster manager switch scenarios.

>   2. Details about other snapshot operations are not mentioned here. Are we planning to add those separately in another doc? Mainly curious how snapshot deletion will work and what all it will need to do.

Will add the details.
