Feature Proposal : Pluggable Translog #1319

Open
Bukhtawar opened this issue Sep 30, 2021 · 11 comments
Labels
enhancement Enhancement or improvement to existing feature or request Indexing:Replication Issues and PRs related to core replication framework eg segrep RFC Issues requesting major changes v3.0.0 Issues and PRs related to version 3.0.0

Comments

@Bukhtawar
Collaborator

Bukhtawar commented Sep 30, 2021

Overview

The purpose of this document is to propose various mechanisms for achieving durability of uncommitted indexing operations and to outline the advantages of making the implementation pluggable. This is not a complete design document; it is a proposal seeking community feedback.

Motivation

  1. Provide flexible, user-controlled durability options
  2. Support for highly durable translogs

Background

What is the role of a translog?
All operations that affect an OpenSearch index (i.e. adding, updating, or removing documents) are added to an in-memory buffer, which is periodically flushed to disk as segments during a Lucene commit. These operations also need to be durably written to a write-ahead transaction log (aka translog) before they are acknowledged back to the client, since Lucene commits are too expensive to perform per request. In the event of a process crash, recent operations that have been acknowledged but not yet included in the last Lucene commit are recovered from the translog.
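To make the write-ahead contract concrete, here is a minimal sketch of the general principle in plain Java NIO: every operation is appended and fsynced before the caller is acknowledged, so it can be replayed after a crash. This is a simplification of the idea, not the actual OpenSearch Translog implementation.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Illustrative write-ahead log: append + fsync before acknowledging a write.
public class SimpleWriteAheadLog implements AutoCloseable {
    private final FileChannel channel;
    private long seqNo = 0;

    public SimpleWriteAheadLog(Path file) throws IOException {
        this.channel = FileChannel.open(file, StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    /** Durably record an operation and return its sequence number; only then may the client be acked. */
    public synchronized long append(String operation) throws IOException {
        long assigned = seqNo++;
        byte[] payload = (assigned + "|" + operation + "\n").getBytes(StandardCharsets.UTF_8);
        channel.write(ByteBuffer.wrap(payload));
        channel.force(true); // fsync: from here on the operation survives a process crash
        return assigned;
    }

    @Override
    public void close() throws IOException {
        channel.close();
    }
}
```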

The engine interacts with the translog through the following mechanisms:

  1. Ingestion Interactions
    For efficiency, translog operations are first buffered in an in-memory queue so that not every indexing thread has to perform the costly fsync to disk. One of the indexing threads acquires a lock, dequeues the buffered translog operations, and fsyncs them to disk. Once the fsync completes, for operation consistency, OpenSearch also writes a checkpoint to disk with the last sequence number written to the translog, which helps prevent translog corruption. Once all translog operations for a given request have been durably persisted to disk, OpenSearch sends an acknowledgment back to the client. (A simplified sketch of this buffer-and-group-fsync pattern appears after this list.)

    When the translog grows beyond a configurable size, the OpenSearch engine triggers a flush in the background to prevent recoveries from taking too long. An OpenSearch flush performs a Lucene commit, moving segments to disk, and also rolls over to a new translog file.

    (Sequence diagram: translog ingestion interactions)

  2. Shard Recovery
    In the event of a crash, existing segment files are first recovered from the local disk. Translog files are then opened, and recent operations that have not been committed to Lucene are recovered by replaying all translog operations beyond the sequence number of the last Lucene commit checkpoint. A flush is then triggered, creating Lucene segments on disk and purging all unreferenced translog files.
    Recovery mechanisms

    1. Local recovery / primary failover : In case of recovery from an existing store (one that has in-sync translog and segments locally on disk), all uncommitted operations are replayed from the local translog on disk.
    2. Peer recovery : The translog doesn't play a role in peer recovery. Instead, the peer shard copy (primary-primary or primary-replica) is orchestrated via shard retention leases by replaying indexing operations that are indexed into Lucene and delete operations from the preserved soft-deletes on the peer primary.
  3. Cross cluster replication
    Cross Cluster Replication in OpenSearch uses a pull-based replication model to pull translog operations into the follower index. Once the bootstrap from committed segments is complete, the translog on the leader cluster is pulled incrementally, from the current sequence number of the follower shard up to the global checkpoint of the leader shard being replicated from.
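The buffer-and-group-fsync pattern referenced in the ingestion item above can be sketched roughly as follows. This is an illustrative simplification, not the actual OpenSearch translog writer: many indexing threads enqueue cheaply, one thread pays for the fsync and then records the last durable sequence number in a checkpoint file.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.ConcurrentLinkedQueue;

// Illustrative group commit: indexing threads buffer operations, one thread drains the
// queue, fsyncs once for the whole batch, then persists a checkpoint with the last seqNo.
public class BufferedTranslogWriter {
    private final ConcurrentLinkedQueue<String> buffer = new ConcurrentLinkedQueue<>();
    private final FileChannel translog;
    private final FileChannel checkpoint;
    private final Object syncLock = new Object();
    private volatile long lastSyncedSeqNo = -1;
    private long nextSeqNo = 0;

    public BufferedTranslogWriter(Path translogFile, Path checkpointFile) throws IOException {
        translog = FileChannel.open(translogFile, StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        checkpoint = FileChannel.open(checkpointFile, StandardOpenOption.CREATE, StandardOpenOption.WRITE);
    }

    /** Called by indexing threads: cheap, no fsync here. */
    public synchronized long add(String operation) {
        long seqNo = nextSeqNo++;
        buffer.add(seqNo + "|" + operation + "\n");
        return seqNo;
    }

    /** Called before acknowledging a request whose operations must be durable up to maxSeqNo. */
    public void ensureSynced(long maxSeqNo) throws IOException {
        if (lastSyncedSeqNo >= maxSeqNo) {
            return; // another thread already fsynced past this point
        }
        synchronized (syncLock) { // only one thread pays the fsync cost at a time
            if (lastSyncedSeqNo >= maxSeqNo) {
                return;
            }
            long highest = lastSyncedSeqNo;
            String op;
            while ((op = buffer.poll()) != null) {
                translog.write(ByteBuffer.wrap(op.getBytes(StandardCharsets.UTF_8)));
                highest = Long.parseLong(op.substring(0, op.indexOf('|')));
            }
            translog.force(true); // a single fsync covers the whole drained batch
            // The checkpoint records the last durable seqNo so recovery knows how far the log is consistent.
            ByteBuffer cp = ByteBuffer.allocate(Long.BYTES);
            cp.putLong(highest);
            cp.flip();
            checkpoint.write(cp, 0);
            checkpoint.force(true);
            lastSyncedSeqNo = highest;
        }
    }
}
```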

Downsides of current translog implementation

  1. Less flexible : The engine is tightly coupled to local disk-based translogs and forces users to rely heavily on them for durability.
  2. Fewer durability options : The only option currently supported for translog persistence is the local disk, which makes full recovery impossible, even with faster snapshots, if a single node is lost in a non-replicated setup.
  3. Slow ingestion : The present primary-backup replication model requires translogs to be synchronously written to all shard copies in the replication group before acknowledging a write request, so a single degraded node can stall ingestion. With a remote store and a segment-based replication model, we do not need to replicate operations to all copies synchronously.
  4. Costly : Users achieve high durability by replicating translogs to additional shard copies, even if those copies aren't needed for read availability.

Use Cases

The pluggable translog support is meant to cater to a broad spectrum of users who would want to customise their durability options based on specific needs.

  1. Translogs are essential for durability, but they are not the only way to achieve it. Some users might choose not to use a translog at all, because they want to self-manage durability by periodically checkpointing indexing requests and making calls to commit Lucene segments to disk; this is what Yelp does today. In the event of a primary node crash, all uncommitted changes can be re-driven from the previous checkpoint. The commits can alternatively be made at the end of a batch of requests controlled by the user. (A rough sketch of this pattern appears after this list.)
  2. Users might want a different durability semantic based on document search-ability (refresh), so that changes since the last refresh, which haven't yet been made visible to search, are also not guaranteed to persist.
  3. Users might need high durability guarantees on their data, which can be achieved by configuring a remote store for persisting translogs in addition to keeping segments.
  4. In the future we might want to support streaming APIs, which would trigger an explicit commit on events like connection close.
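As referenced in use case 1, a rough sketch of application-managed durability: the application records its own source checkpoint and forces a Lucene commit via the flush API at batch boundaries, so nothing depends on translog replay. This assumes the OpenSearch Java high-level REST client (package names may differ across versions); the checkpoint persistence itself is a hypothetical application-side concern.

```java
import java.io.IOException;
import java.util.Map;
import org.opensearch.action.admin.indices.flush.FlushRequest;
import org.opensearch.action.index.IndexRequest;
import org.opensearch.client.RequestOptions;
import org.opensearch.client.RestHighLevelClient;

// Sketch: self-managed durability without relying on the translog for replay.
// The application checkpoints its own input offset and forces a Lucene commit (flush)
// per batch; after a crash it re-drives everything beyond the last saved checkpoint.
public class CheckpointingIngester {
    private final RestHighLevelClient client;
    private final String index;

    public CheckpointingIngester(RestHighLevelClient client, String index) {
        this.client = client;
        this.index = index;
    }

    public void ingestBatch(Iterable<Map<String, Object>> docs, long batchEndOffset) throws IOException {
        for (Map<String, Object> doc : docs) {
            client.index(new IndexRequest(index).source(doc), RequestOptions.DEFAULT);
        }
        // Force a Lucene commit so the batch survives a node crash without translog replay.
        client.indices().flush(new FlushRequest(index), RequestOptions.DEFAULT);
        saveCheckpoint(batchEndOffset); // hypothetical: persist the committed source offset durably
    }

    private void saveCheckpoint(long offset) {
        // application-specific: e.g. write the offset to a database or object store
    }
}
```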

High Level Proposal

We propose decoupling the translog from the engine by abstracting the translog code out of the server code base and moving it into a separate module. The engine would provide knobs to choose the translog option, including whether or not translogs need to be maintained by the system at all. The translog module would further expose extension points for remote stores, implemented as separate extended plugins. The translog module would use the local store as the default so that the code is fully backward compatible, while giving users the option to customize their durability by installing specific durability extension plugins. All IO interactions with the different stores would be checksummed to detect or prevent translog corruption.
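To make the extension point more concrete, a hypothetical shape for the durability extension is sketched below. The name matches the TranslogDurabilityExtension referenced in the low-level diagram, but the methods and signatures here are illustrative assumptions, not the actual proposed interface.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.List;

// Hypothetical extension point for pluggable translog stores; the real interface in the
// low-level design may differ. A plugin (e.g. a remote object-store implementation) would
// provide these primitives and the engine would stay agnostic of where the log lives.
public interface TranslogDurabilityExtension {

    /** Durably persist a translog generation and its checkpoint metadata. */
    void upload(long generation, InputStream translogData, InputStream checkpointData) throws IOException;

    /** Open a previously persisted generation for replay during recovery. */
    InputStream download(long generation) throws IOException;

    /** List the generations currently retained by the store, oldest first. */
    List<Long> listGenerations() throws IOException;

    /** Delete generations whose operations are already covered by a Lucene commit. */
    void trimUpTo(long minRequiredGeneration) throws IOException;
}
```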

Translog durability

  1. Local : This is the same durability option we have today. Translogs are co-located on the local disk with the segments and behave the same for both segment- and document-based replication, i.e. every indexing operation requires the translog to be replicated synchronously within the shard replication group. For a non-replicated index, in case of a node failure, shards cannot be recovered on a different node.

  2. Remote : Translogs can be stored on a remote store provided by a durability extension plugin, in which case we don't need translogs to be redundantly written on replicas. With segment-based replication, the indexing operations need not even be replicated to the replica, since the primary is responsible for both segment creation and translog persistence to a durable store.

    The local recovery process will need to pull translogs from the remote store and replay the needed translog operations once segments have been restored.
    For a non-replicated index, in case of a node failure, shards can be recovered on a different node if segments are also present on the remote store, by first restoring the segments and then pulling the translogs and replaying the missing operations.

  3. No translog : Users can choose to call commit periodically to make recent changes durable on disk. In the event of a primary node failure, all changes since the previous Lucene commit would be lost and would need to be re-driven.
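For reference, the per-index settings below already control when the local translog is fsynced and when it rolls over; a pluggable implementation would add a way to select the store. The first three keys are existing settings, while the store-selection key is a hypothetical placeholder, as is the index name.

```java
import org.opensearch.client.indices.CreateIndexRequest;
import org.opensearch.common.settings.Settings;

public class TranslogSettingsExample {
    // Existing translog-related index settings (real today), plus a hypothetical key
    // ("index.translog.store") that a pluggable implementation could use to pick a store.
    public static CreateIndexRequest newIndexRequest(String indexName) {
        Settings settings = Settings.builder()
            .put("index.translog.durability", "request")          // fsync and ack per request (current default)
            .put("index.translog.sync_interval", "5s")            // fsync cadence when durability is "async"
            .put("index.translog.flush_threshold_size", "512mb")  // background flush / roll-over threshold
            .put("index.translog.store", "remote-s3")             // hypothetical: choose the durability extension
            .build();
        return new CreateIndexRequest(indexName).settings(settings);
    }
}
```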

Low-level proposal

(Low-level design diagram: TranslogDurabilityExtension and store interactions)

Future Work

  1. Support for request checkpointing : For users who don't want to use translogs for durability, we would need to support additional durability semantics based on checkpoints. We would need to add support for checkpointing requests, which might require (backward-compatible) API changes to return a monotonically increasing seq no corresponding to the last indexing operation (per shard). The same seqno would be made available via APIs like flush, to return the seqno corresponding to the last Lucene commit. Indexing requests up to and including the last successful commit checkpoint can then be safely purged. (A hypothetical client-side sketch of this follows the list.)
  2. Performance benchmarks with remote stores : The remote translog durability options could have an indexing performance impact. We will benchmark indexing performance with the supported durability extensions.
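As referenced in item 1, a hypothetical client-side view of request checkpointing could look like the sketch below: the indexing response exposes the per-shard sequence number, a flush-style API reports the last committed sequence number, and anything at or below that commit point can be purged from the client's replay buffer. None of these method names exist today; this only illustrates the proposed semantics.

```java
// Hypothetical API surface for request checkpointing; these methods do not exist today.
public final class CheckpointingClientSketch {

    interface CheckpointAwareClient {
        /** Index a document and return the shard-local sequence number assigned to it. */
        long indexWithSeqNo(String index, String jsonDoc);

        /** Force a Lucene commit and return the highest sequence number it covers. */
        long flushAndGetCommittedSeqNo(String index);
    }

    interface ReplayBuffer {
        void retain(long seqNo, String doc);
        void purgeUpTo(long committedSeqNo);
    }

    static void ingest(CheckpointAwareClient client, Iterable<String> docs, ReplayBuffer buffer) {
        for (String doc : docs) {
            long seqNo = client.indexWithSeqNo("my-index", doc);
            buffer.retain(seqNo, doc);               // keep until covered by a commit
        }
        long committed = client.flushAndGetCommittedSeqNo("my-index");
        buffer.purgeUpTo(committed);                 // safe: these ops survive a crash without a translog
    }
}
```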

FAQs

What is a Lucene commit?
A Lucene commit is the process of flushing all pending changes (added and deleted documents, segment merges, added indexes, etc.) to the index and syncing all referenced index files. The data on disk is then ready to be used for searching, and the index updates will survive an OS or machine crash or power loss.
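A minimal standalone example using the plain Lucene API (index path is arbitrary): the commit fsyncs the new segment files and the updated segments_N file, so documents added before it survive a crash.

```java
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class LuceneCommitExample {
    public static void main(String[] args) throws Exception {
        try (FSDirectory dir = FSDirectory.open(Paths.get("/tmp/lucene-index"));
             IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            doc.add(new TextField("body", "hello translog", Field.Store.YES));
            writer.addDocument(doc); // buffered in memory / uncommitted segment until a commit
            writer.commit();         // fsyncs segment files + segments_N: durable across a crash
        }
    }
}
```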

What are shard retention leases?
Shard retention leases are a mechanism used for peer recoveries, aimed at preventing shard recoveries from having to fall back to expensive file-copy operations when shard history is available from a certain sequence number. A lease is acquired by a recovering replica copy to prevent operations on the primary beyond this sequence number from being merged away.

What is a soft-delete?
Lucene has functionality to keep deleted documents alive in the index based on time or other constraints. The soft-deletes merge policy controls when soft-deleted documents are reclaimed by merges.
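A short Lucene example of configuring soft deletes with a retention merge policy; the field name and the match-all retention query here are illustrative choices, not what OpenSearch uses internally.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.SoftDeletesRetentionMergePolicy;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TieredMergePolicy;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.store.ByteBuffersDirectory;

public class SoftDeletesExample {
    public static void main(String[] args) throws Exception {
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer())
            .setSoftDeletesField("__soft_deletes")
            // The retention query decides which soft-deleted docs merges must keep; match-all keeps everything.
            .setMergePolicy(new SoftDeletesRetentionMergePolicy(
                "__soft_deletes", MatchAllDocsQuery::new, new TieredMergePolicy()));

        try (IndexWriter writer = new IndexWriter(new ByteBuffersDirectory(), config)) {
            Document v1 = new Document();
            v1.add(new StringField("id", "1", Field.Store.YES));
            writer.addDocument(v1);

            Document v2 = new Document();
            v2.add(new StringField("id", "1", Field.Store.YES));
            // Soft-update: the old version is marked via the soft-deletes field instead of being hard-deleted.
            writer.softUpdateDocument(new Term("id", "1"), v2,
                new NumericDocValuesField("__soft_deletes", 1));
            writer.commit();
        }
    }
}
```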

@Bukhtawar Bukhtawar added the enhancement Enhancement or improvement to existing feature or request label Sep 30, 2021
@Bukhtawar Bukhtawar changed the title Pluggable Translog [RFC] Pluggable Translog Sep 30, 2021
@nknize nknize added the RFC Issues requesting major changes label Oct 7, 2021
@Bukhtawar Bukhtawar changed the title [RFC] Pluggable Translog Feature Proposal : Pluggable Translog Jan 24, 2022
@nknize
Collaborator

nknize commented Jan 25, 2022

3. No translog : Users can choose to call commit periodically to make recent changes durable on disk. In the event of a primary node failure, all changes since the previous Lucene commit would be lost and would need to be re-driven.

I'm curious what other folks think about this. I know some are strongly opposed to giving the option to disable this level of durability. I'm not in that camp. I like the idea of giving the option to completely disable the xlog and save that processing and storage time for simple fast/unreliable use cases. Similar to folks choosing UDP over TCP and not caring if certain packets are lost.

@jayeshathila
Member

jayeshathila commented Jan 30, 2022

As of now, translog.add(..) happens in the same flow as indexing the doc - here.

Do we want to provide control over when to add to the translog? If yes, we can provide functionality to register custom listeners which will be called in the index method, and it's up to the listener whether it wants to add to the translog sync or async.
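For illustration only, a listener hook along the lines of this suggestion might look like the sketch below; the interface and its registration point are hypothetical and do not exist in OpenSearch today.

```java
// Hypothetical listener hook sketched from the suggestion above.
public interface TranslogOpListener {

    enum SyncMode { SYNC, ASYNC, SKIP }

    /** Called from the index path for each operation; the listener decides how (or whether) it is logged. */
    SyncMode onOperation(String index, long seqNo, byte[] serializedOperation);
}
```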

@Bukhtawar
Collaborator Author

@jayeshathila I have updated the sequence diagram for translog interactions, for a better understanding of the existing system; I would recommend you go through the same.

@jayeshathila
Member

Will this Path mentioned in TranslogDurabilityExtension (LLD diagram above) work generically across different stores? Say, do we want to use the same Path for S3 files as well?

I know this Path has a method to map a URI, public static Path of(URI uri), but I am not sure it will work for all the remote filesystems.

@jayeshathila
Member

For my understanding, who (which class) will be the caller of loadExtensions(Class<T>)?

@redcape

redcape commented Feb 26, 2022

I'm pretty interested in this feature primarily for the pluggable transport option. I want the capability to encrypt the translog byte stream with different keys (different indexes must have different encryption keys). The different keys per index requirement is what pushes me away from using EBS; mounting different EBS volumes per index doesn't make sense for my use-case either due to volume and variability of indexes/keys.

@Bukhtawar Looking at the diagram and description, it seems like AbstractStore and the remote store idea is more about a backing store that is used to copy the data locally then read it from the local filesystem when needed. Is that a correct interpretation? Can we ensure the interfaces provided for the plugin are able to be used to layer read/write operations from the store? For example: remote-store-to-memory or data encryption purpose (plugin would prevent unencrypted translog data from hitting disk)

@elfisher

@Bukhtawar I saw this didn't have the 2.1 tag. Is this still targeting 2.1?

@samuel-oci
Contributor

samuel-oci commented Dec 13, 2022

+1 to that.
I am also interested in extending the translog and potentially filtering all of the input/output file system byte streams.
It would be great if we had the same level of abstraction over input/output byte streams as we have for FSDirectory in Lucene, for example.

I'm pretty interested in this feature primarily for the pluggable transport option. I want the capability to encrypt the translog byte stream with different keys (different indexes must have different encryption keys). The different keys per index requirement is what pushes me away from using EBS; mounting different EBS volumes per index doesn't make sense for my use-case either due to volume and variability of indexes/keys.

@Bukhtawar Looking at the diagram and description, it seems like AbstractStore and the remote store idea is more about a backing store that is used to copy the data locally then read it from the local filesystem when needed. Is that a correct interpretation? Can we ensure the interfaces provided for the plugin are able to be used to layer read/write operations from the store? For example: remote-store-to-memory or data encryption purpose (plugin would prevent unencrypted translog data from hitting disk)

@Jeevananthan-23

The translog is really helpful for maintaining the linearizability of atomic commits in OpenSearch clusters, which may lead to Raft consensus. @Bukhtawar, your thoughts on this? #6369

@anasalkouz
Member

@Bukhtawar @rramachand21 Is this tracking green for 2.9?

@anasalkouz anasalkouz added Indexing:Replication Issues and PRs related to core replication framework eg segrep and removed Indexing & Search distributed framework labels Sep 19, 2023