Feature Proposal : Pluggable Translog #1319

Open
Bukhtawar opened this issue Sep 30, 2021 · 11 comments
Labels
enhancement Enhancement or improvement to existing feature or request Indexing:Replication Issues and PRs related to core replication framework eg segrep RFC Issues requesting major changes v3.0.0 Issues and PRs related to version 3.0.0

Comments

@Bukhtawar
Collaborator

Bukhtawar commented Sep 30, 2021

Overview

The purpose of this document is to propose various mechanisms for achieving durability of uncommitted indexing operations and to outline the advantages of making the implementation pluggable. This is not a complete design document; it is a proposal seeking community feedback.

Motivation

  1. Provide flexible, user-controlled durability options
  2. Support for highly durable translogs

Background

What is the role of a translog?
All operations that affect an OpenSearch index (i.e. adding, updating, or removing documents) are added to an in-memory buffer, which is periodically flushed to disk as segments during a Lucene commit. These operations also need to be durably written to a write-ahead transaction log (aka translog) before they are acknowledged back to the client, since Lucene commits are too expensive to perform per request. In the event of a process crash, recent operations that have been acknowledged but not yet included in the last Lucene commit are recovered from the translog.
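To make the write-ahead contract concrete, here is a minimal sketch of the general principle in plain Java NIO: every operation is appended and fsynced before the caller is acknowledged, so it can be replayed after a crash. This is a simplification of the idea, not the actual OpenSearch Translog implementation.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Illustrative write-ahead log: append + fsync before acknowledging a write.
public class SimpleWriteAheadLog implements AutoCloseable {
    private final FileChannel channel;
    private long seqNo = 0;

    public SimpleWriteAheadLog(Path file) throws IOException {
        this.channel = FileChannel.open(file, StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    /** Durably record an operation and return its sequence number; only then may the client be acked. */
    public synchronized long append(String operation) throws IOException {
        long assigned = seqNo++;
        byte[] payload = (assigned + "|" + operation + "\n").getBytes(StandardCharsets.UTF_8);
        channel.write(ByteBuffer.wrap(payload));
        channel.force(true); // fsync: from here on the operation survives a process crash
        return assigned;
    }

    @Override
    public void close() throws IOException {
        channel.close();
    }
}
```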

The engine interacts with the translog through the following mechanisms:

  1. Ingestion Interactions
    For efficiency, translog operations are first buffered in an in-memory queue so that not every indexing thread has to perform the costly fsync to disk. One of the indexing threads acquires a lock, dequeues the buffered translog operations, and fsyncs them to disk. Once the fsync completes, for operation consistency, OpenSearch also writes a checkpoint to disk with the last sequence number written to the translog, which helps prevent translog corruption. Once all translog operations for a given request have been durably persisted to disk, OpenSearch sends an acknowledgment back to the client. (A simplified sketch of this buffer-and-group-fsync pattern appears after this list.)

    When the translog grows beyond a configurable size, the OpenSearch engine triggers a flush in the background to prevent recoveries from taking too long. An OpenSearch flush performs a Lucene commit, moving segments to disk, and also rolls over to a new translog file.

    (Sequence diagram: translog ingestion interactions)

  2. Shard Recovery
    In the event of a crash, existing segment files are first recovered from the local disk. Translog files are then opened, and recent operations that have not been committed to Lucene are recovered by replaying all translog operations beyond the sequence number of the last Lucene commit checkpoint. A flush is then triggered, creating Lucene segments on disk and purging all unreferenced translog files.
    Recovery mechanisms

    1. Local recovery / primary failover : In case of recovery from an existing store (one that has in-sync translog and segments locally on disk), all uncommitted operations are replayed from the local translog on disk.
    2. Peer recovery : The translog doesn't play a role in peer recovery. Instead, the peer shard copy (primary-primary or primary-replica) is orchestrated via shard retention leases by replaying indexing operations that are indexed into Lucene and delete operations from the preserved soft-deletes on the peer primary.
  3. Cross cluster replication
    Cross Cluster Replication in OpenSearch uses a pull-based replication model to pull translog operations into the follower index. Once the bootstrap from committed segments is complete, the translog on the leader cluster is pulled incrementally, from the current sequence number of the follower shard up to the global checkpoint of the leader shard being replicated from.
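The buffer-and-group-fsync pattern referenced in the ingestion item above can be sketched roughly as follows. This is an illustrative simplification, not the actual OpenSearch translog writer: many indexing threads enqueue cheaply, one thread pays for the fsync and then records the last durable sequence number in a checkpoint file.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.ConcurrentLinkedQueue;

// Illustrative group commit: indexing threads buffer operations, one thread drains the
// queue, fsyncs once for the whole batch, then persists a checkpoint with the last seqNo.
public class BufferedTranslogWriter {
    private final ConcurrentLinkedQueue<String> buffer = new ConcurrentLinkedQueue<>();
    private final FileChannel translog;
    private final FileChannel checkpoint;
    private final Object syncLock = new Object();
    private volatile long lastSyncedSeqNo = -1;
    private long nextSeqNo = 0;

    public BufferedTranslogWriter(Path translogFile, Path checkpointFile) throws IOException {
        translog = FileChannel.open(translogFile, StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        checkpoint = FileChannel.open(checkpointFile, StandardOpenOption.CREATE, StandardOpenOption.WRITE);
    }

    /** Called by indexing threads: cheap, no fsync here. */
    public synchronized long add(String operation) {
        long seqNo = nextSeqNo++;
        buffer.add(seqNo + "|" + operation + "\n");
        return seqNo;
    }

    /** Called before acknowledging a request whose operations must be durable up to maxSeqNo. */
    public void ensureSynced(long maxSeqNo) throws IOException {
        if (lastSyncedSeqNo >= maxSeqNo) {
            return; // another thread already fsynced past this point
        }
        synchronized (syncLock) { // only one thread pays the fsync cost at a time
            if (lastSyncedSeqNo >= maxSeqNo) {
                return;
            }
            long highest = lastSyncedSeqNo;
            String op;
            while ((op = buffer.poll()) != null) {
                translog.write(ByteBuffer.wrap(op.getBytes(StandardCharsets.UTF_8)));
                highest = Long.parseLong(op.substring(0, op.indexOf('|')));
            }
            translog.force(true); // a single fsync covers the whole drained batch
            // The checkpoint records the last durable seqNo so recovery knows how far the log is consistent.
            ByteBuffer cp = ByteBuffer.allocate(Long.BYTES);
            cp.putLong(highest);
            cp.flip();
            checkpoint.write(cp, 0);
            checkpoint.force(true);
            lastSyncedSeqNo = highest;
        }
    }
}
```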

Downsides of current translog implementation

  1. Less flexible : The engine is tightly coupled to local disk-based translogs and forces users to rely heavily on them for durability.
  2. Fewer durability options : The only option currently supported for translog persistence is the local disk, which makes full recovery impossible, even with faster snapshots, if a single node is lost in a non-replicated setup.
  3. Slow ingestion : The present primary-backup replication model requires translogs to be synchronously written to all shard copies in the replication group before acknowledging a write request, so a single degraded node can stall ingestion. With a remote store and a segment-based replication model, we do not need to replicate operations to all copies synchronously.
  4. Costly : Users achieve high durability by replicating translogs to additional shard copies, even if those copies aren't needed for read availability.

Use Cases

The pluggable translog support is meant to cater to a broad spectrum of users who would want to customise their durability options based on specific needs.

  1. Translogs are essential for durability, but they are not the only way to achieve it. Some users might choose not to use a translog at all, because they want to self-manage durability by periodically checkpointing indexing requests and making calls to commit Lucene segments to disk; this is what Yelp does today. In the event of a primary node crash, all uncommitted changes can be re-driven from the previous checkpoint. The commits can alternatively be made at the end of a batch of requests controlled by the user. (A rough sketch of this pattern appears after this list.)
  2. Users might want a different durability semantic based on document search-ability (refresh), so that changes since the last refresh, which haven't yet been made visible to search, are also not guaranteed to persist.
  3. Users might need high durability guarantees on their data, which can be achieved by configuring a remote store for persisting translogs in addition to keeping segments.
  4. In the future we might want to support streaming APIs, which would trigger an explicit commit on events like connection close.
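As referenced in use case 1, a rough sketch of application-managed durability: the application records its own source checkpoint and forces a Lucene commit via the flush API at batch boundaries, so nothing depends on translog replay. This assumes the OpenSearch Java high-level REST client (package names may differ across versions); the checkpoint persistence itself is a hypothetical application-side concern.

```java
import java.io.IOException;
import java.util.Map;
import org.opensearch.action.admin.indices.flush.FlushRequest;
import org.opensearch.action.index.IndexRequest;
import org.opensearch.client.RequestOptions;
import org.opensearch.client.RestHighLevelClient;

// Sketch: self-managed durability without relying on the translog for replay.
// The application checkpoints its own input offset and forces a Lucene commit (flush)
// per batch; after a crash it re-drives everything beyond the last saved checkpoint.
public class CheckpointingIngester {
    private final RestHighLevelClient client;
    private final String index;

    public CheckpointingIngester(RestHighLevelClient client, String index) {
        this.client = client;
        this.index = index;
    }

    public void ingestBatch(Iterable<Map<String, Object>> docs, long batchEndOffset) throws IOException {
        for (Map<String, Object> doc : docs) {
            client.index(new IndexRequest(index).source(doc), RequestOptions.DEFAULT);
        }
        // Force a Lucene commit so the batch survives a node crash without translog replay.
        client.indices().flush(new FlushRequest(index), RequestOptions.DEFAULT);
        saveCheckpoint(batchEndOffset); // hypothetical: persist the committed source offset durably
    }

    private void saveCheckpoint(long offset) {
        // application-specific: e.g. write the offset to a database or object store
    }
}
```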

High Level Proposal

We propose decoupling the translog from the engine by abstracting the translog code out of the server code base and moving it into a separate module. The engine would provide knobs to choose the translog option, including whether or not translogs need to be maintained by the system at all. The translog module would further expose extension points for remote stores, implemented as separate extended plugins. The translog module would use the local store as the default so that the code is fully backward compatible, while giving users the option to customize their durability by installing specific durability extension plugins. All IO interactions with the different stores would be checksummed to detect or prevent translog corruption.
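To make the extension point more concrete, a hypothetical shape for the durability extension is sketched below. The name matches the TranslogDurabilityExtension referenced in the low-level diagram, but the methods and signatures here are illustrative assumptions, not the actual proposed interface.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.List;

// Hypothetical extension point for pluggable translog stores; the real interface in the
// low-level design may differ. A plugin (e.g. a remote object-store implementation) would
// provide these primitives and the engine would stay agnostic of where the log lives.
public interface TranslogDurabilityExtension {

    /** Durably persist a translog generation and its checkpoint metadata. */
    void upload(long generation, InputStream translogData, InputStream checkpointData) throws IOException;

    /** Open a previously persisted generation for replay during recovery. */
    InputStream download(long generation) throws IOException;

    /** List the generations currently retained by the store, oldest first. */
    List<Long> listGenerations() throws IOException;

    /** Delete generations whose operations are already covered by a Lucene commit. */
    void trimUpTo(long minRequiredGeneration) throws IOException;
}
```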

Translog durability

  1. Local : This is the same durability option we have today. Translogs are co-located on the local disk with the segments and behave the same for both segment- and document-based replication, i.e. every indexing operation requires the translog to be replicated synchronously within the shard replication group. For a non-replicated index, in case of a node failure, shards cannot be recovered on a different node.

  2. Remote : Translogs can be stored on a remote store provided by a durability extension plugin, in which case we don't need translogs to be redundantly written on replicas. With segment-based replication, the indexing operations need not even be replicated to the replica, since the primary is responsible for both segment creation and translog persistence to a durable store.

    The local recovery process will need to pull translogs from the remote store and replay the needed translog operations once segments have been restored.
    For a non-replicated index, in case of a node failure, shards can be recovered on a different node if segments are also present on the remote store, by first restoring the segments and then pulling the translogs and replaying the missing operations.

  3. No translog : Users can choose to call commit periodically to make recent changes durable on disk. In the event of a primary node failure, all changes since the previous Lucene commit would be lost and would need to be re-driven.
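For reference, the per-index settings below already control when the local translog is fsynced and when it rolls over; a pluggable implementation would add a way to select the store. The first three keys are existing settings, while the store-selection key is a hypothetical placeholder, as is the index name.

```java
import org.opensearch.client.indices.CreateIndexRequest;
import org.opensearch.common.settings.Settings;

public class TranslogSettingsExample {
    // Existing translog-related index settings (real today), plus a hypothetical key
    // ("index.translog.store") that a pluggable implementation could use to pick a store.
    public static CreateIndexRequest newIndexRequest(String indexName) {
        Settings settings = Settings.builder()
            .put("index.translog.durability", "request")          // fsync and ack per request (current default)
            .put("index.translog.sync_interval", "5s")            // fsync cadence when durability is "async"
            .put("index.translog.flush_threshold_size", "512mb")  // background flush / roll-over threshold
            .put("index.translog.store", "remote-s3")             // hypothetical: choose the durability extension
            .build();
        return new CreateIndexRequest(indexName).settings(settings);
    }
}
```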

Low-level proposal

(Low-level design diagram: TranslogDurabilityExtension and store interactions)

Future Work

  1. Support for request checkpointing : For users who don't want to use translogs for durability, we would need to support additional durability semantics based on checkpoints. We would need to add support for checkpointing requests, which might require (backward-compatible) API changes to return a monotonically increasing seq no corresponding to the last indexing operation (per shard). The same seqno would be made available via APIs like flush, to return the seqno corresponding to the last Lucene commit. Indexing requests up to and including the last successful commit checkpoint can then be safely purged. (A hypothetical client-side sketch of this follows the list.)
  2. Performance benchmarks with remote stores : The remote translog durability options could have an indexing performance impact. We will benchmark indexing performance with the supported durability extensions.
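As referenced in item 1, a hypothetical client-side view of request checkpointing could look like the sketch below: the indexing response exposes the per-shard sequence number, a flush-style API reports the last committed sequence number, and anything at or below that commit point can be purged from the client's replay buffer. None of these method names exist today; this only illustrates the proposed semantics.

```java
// Hypothetical API surface for request checkpointing; these methods do not exist today.
public final class CheckpointingClientSketch {

    interface CheckpointAwareClient {
        /** Index a document and return the shard-local sequence number assigned to it. */
        long indexWithSeqNo(String index, String jsonDoc);

        /** Force a Lucene commit and return the highest sequence number it covers. */
        long flushAndGetCommittedSeqNo(String index);
    }

    interface ReplayBuffer {
        void retain(long seqNo, String doc);
        void purgeUpTo(long committedSeqNo);
    }

    static void ingest(CheckpointAwareClient client, Iterable<String> docs, ReplayBuffer buffer) {
        for (String doc : docs) {
            long seqNo = client.indexWithSeqNo("my-index", doc);
            buffer.retain(seqNo, doc);               // keep until covered by a commit
        }
        long committed = client.flushAndGetCommittedSeqNo("my-index");
        buffer.purgeUpTo(committed);                 // safe: these ops survive a crash without a translog
    }
}
```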

FAQs

What is a Lucene commit?
A Lucene commit is the process of flushing all pending changes (added and deleted documents, segment merges, added indexes, etc.) to the index and syncing all referenced index files. The data on disk is then ready to be used for searching, and the index updates will survive an OS or machine crash or power loss.
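A minimal standalone example using the plain Lucene API (index path is arbitrary): the commit fsyncs the new segment files and the updated segments_N file, so documents added before it survive a crash.

```java
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class LuceneCommitExample {
    public static void main(String[] args) throws Exception {
        try (FSDirectory dir = FSDirectory.open(Paths.get("/tmp/lucene-index"));
             IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            doc.add(new TextField("body", "hello translog", Field.Store.YES));
            writer.addDocument(doc); // buffered in memory / uncommitted segment until a commit
            writer.commit();         // fsyncs segment files + segments_N: durable across a crash
        }
    }
}
```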

What are shard retention leases?
Shard retention leases are a mechanism used for peer recoveries, aimed at preventing shard recoveries from having to fall back to expensive file-copy operations when shard history is available from a certain sequence number. A lease is acquired by a recovering replica copy to prevent operations on the primary beyond this sequence number from being merged away.

What is a soft-delete?
Lucene has functionality to keep deleted documents alive in the index based on time or other constraints. The soft-deletes merge policy controls when soft-deleted documents are reclaimed by merges.
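A short Lucene example of configuring soft deletes with a retention merge policy; the field name and the match-all retention query here are illustrative choices, not what OpenSearch uses internally.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.SoftDeletesRetentionMergePolicy;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TieredMergePolicy;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.store.ByteBuffersDirectory;

public class SoftDeletesExample {
    public static void main(String[] args) throws Exception {
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer())
            .setSoftDeletesField("__soft_deletes")
            // The retention query decides which soft-deleted docs merges must keep; match-all keeps everything.
            .setMergePolicy(new SoftDeletesRetentionMergePolicy(
                "__soft_deletes", MatchAllDocsQuery::new, new TieredMergePolicy()));

        try (IndexWriter writer = new IndexWriter(new ByteBuffersDirectory(), config)) {
            Document v1 = new Document();
            v1.add(new StringField("id", "1", Field.Store.YES));
            writer.addDocument(v1);

            Document v2 = new Document();
            v2.add(new StringField("id", "1", Field.Store.YES));
            // Soft-update: the old version is marked via the soft-deletes field instead of being hard-deleted.
            writer.softUpdateDocument(new Term("id", "1"), v2,
                new NumericDocValuesField("__soft_deletes", 1));
            writer.commit();
        }
    }
}
```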

@Bukhtawar Bukhtawar added the enhancement Enhancement or improvement to existing feature or request label Sep 30, 2021
@Bukhtawar Bukhtawar changed the title Pluggable Translog [RFC] Pluggable Translog Sep 30, 2021
@nknize nknize added the RFC Issues requesting major changes label Oct 7, 2021
@Bukhtawar Bukhtawar changed the title [RFC] Pluggable Translog Feature Proposal : Pluggable Translog Jan 24, 2022
@nknize
Collaborator

nknize commented Jan 25, 2022

3. No translog : Users can choose to call commit periodically to make recent changes durable on disk. In the event of a primary node failure, all changes since the previous Lucene commit would be lost and would need to be re-driven.

I'm curious what other folks think about this. I know some are strongly opposed to giving the option to disable this level of durability. I'm not in that camp. I like the idea of giving the option to completely disable the xlog and save that processing and storage time for simple fast/unreliable use cases. Similar to folks choosing UDP over TCP and not caring if certain packets are lost.

@jayeshathila
Member

jayeshathila commented Jan 30, 2022

As of now, translog.add(..) happens in the same flow as indexing the doc - here.

Do we want to provide control over when to add to the translog? If yes, we can provide functionality to register custom listeners which will be called in the index method, and it's up to the listener whether it wants to add to the translog sync or async.
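For illustration only, a listener hook along the lines of this suggestion might look like the sketch below; the interface and its registration point are hypothetical and do not exist in OpenSearch today.

```java
// Hypothetical listener hook sketched from the suggestion above.
public interface TranslogOpListener {

    enum SyncMode { SYNC, ASYNC, SKIP }

    /** Called from the index path for each operation; the listener decides how (or whether) it is logged. */
    SyncMode onOperation(String index, long seqNo, byte[] serializedOperation);
}
```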

@Bukhtawar
Collaborator Author

@jayeshathila I have updated the sequence diagram for translog interactions, for a better understanding of the existing system; I would recommend you go through the same.

@jayeshathila
Member

Will this Path mentioned in TranslogDurabilityExtension (LLD diagram above) work generically across different stores? Say, do we want to use the same Path for S3 files as well?

I know this Path has a method to map a URI, public static Path of(URI uri), but I am not sure it will work for all the remote filesystems.

@jayeshathila
Member

For my understanding, who (which class) will be the caller of loadExtensions(Class<T>)?

@redcape

redcape commented Feb 26, 2022

I'm pretty interested in this feature primarily for the pluggable transport option. I want the capability to encrypt the translog byte stream with different keys (different indexes must have different encryption keys). The different keys per index requirement is what pushes me away from using EBS; mounting different EBS volumes per index doesn't make sense for my use-case either due to volume and variability of indexes/keys.

@Bukhtawar Looking at the diagram and description, it seems like AbstractStore and the remote store idea is more about a backing store that is used to copy the data locally then read it from the local filesystem when needed. Is that a correct interpretation? Can we ensure the interfaces provided for the plugin are able to be used to layer read/write operations from the store? For example: remote-store-to-memory or data encryption purpose (plugin would prevent unencrypted translog data from hitting disk)

@elfisher

@Bukhtawar I saw this didn't have the 2.1 tag. Is this still targeting 2.1?

@samuel-oci
Contributor

samuel-oci commented Dec 13, 2022

+1 to that.
I am also interested in extending the translog and potentially filtering all of the input/output file system byte streams.
It would be great if we had the same level of abstraction over input/output byte streams as we have for FSDirectory in Lucene, for example.

I'm pretty interested in this feature primarily for the pluggable transport option. I want the capability to encrypt the translog byte stream with different keys (different indexes must have different encryption keys). The different keys per index requirement is what pushes me away from using EBS; mounting different EBS volumes per index doesn't make sense for my use-case either due to volume and variability of indexes/keys.

@Bukhtawar Looking at the diagram and description, it seems like AbstractStore and the remote store idea is more about a backing store that is used to copy the data locally then read it from the local filesystem when needed. Is that a correct interpretation? Can we ensure the interfaces provided for the plugin are able to be used to layer read/write operations from the store? For example: remote-store-to-memory or data encryption purpose (plugin would prevent unencrypted translog data from hitting disk)

@Jeevananthan-23

The translog is really helpful for maintaining the linearizability of atomic commits in OpenSearch clusters, which may lead to Raft consensus. @Bukhtawar, your thoughts on this? #6369

@anasalkouz
Member

@Bukhtawar @rramachand21 Is this tracking green for 2.9?

@anasalkouz anasalkouz added Indexing:Replication Issues and PRs related to core replication framework eg segrep and removed Indexing & Search distributed framework labels Sep 19, 2023