RFC : Point In Time Search #1147

rajkthakur · 2021-08-24T08:24:40Z

Is your feature request related to a problem? Please describe.

Today, in OpenSearch, if you run different queries against the same data set, you will likely get different results because the data is constantly changing. However, in real-world scenarios, when analyzing data or providing a consistent experience to your end users, you may want the results of a query to stay the same while the context remains the same, and to control when changes appear in the result set. You want to be able to query the same data set and paginate through it with consistent results. This is not possible using the options currently available in OpenSearch.

OpenSearch currently supports the following options for pagination, each with a limitation:

Scroll API : The Scroll API cannot share its point in time context with other queries. Moreover, it only allows moving forward (to the next page) in the search. If the client requests a page but fails to receive the response, a subsequent retry skips the page being retried and returns the next page in the scroll.
Search After : The search_after mechanism doesn't preserve the state of the data as of when the search was issued. One can paginate using the key (search_after) and fetch subsequent pages, but those pages may reflect changes made after the search was issued.
From To : This mechanism (the from and size parameters) does not support deep pagination, since every page request requires each shard to process all previous results and then filter out the requested page, which becomes more taxing the deeper the pagination goes.
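
To make the deep pagination cost concrete, here is a rough back-of-the-envelope sketch. It assumes the usual behaviour where each shard has to collect the top from + size hits and the coordinating node merges them; the function name is made up for illustration and is not an OpenSearch API.

```python
def from_size_cost(from_, size, shards):
    """Estimate hits handled for one from/size page request.

    Assumption: every shard collects its top (from_ + size) hits and the
    coordinating node merges the candidates from all shards.
    """
    per_shard = from_ + size
    merged_on_coordinator = per_shard * shards
    return per_shard, merged_on_coordinator


# Page 1 versus a page 10,000 hits deep, on a 5-shard index.
shallow = from_size_cost(0, 10, 5)
deep = from_size_cost(10_000, 10, 5)
print(shallow, deep)
```

The page size is the same in both calls, but the work per request grows with the offset, which is why the deeper pages get progressively more expensive.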

Describe the solution you'd like

Point in Time allows users to run different queries against the same data set, fixed at a moment in time. A Point in Time only takes into account data that existed up to the moment it was created. Hence, none of the resources required to return the data from the initial request are modified or deleted. Segments are retained even if they have already been merged away and are no longer needed for the live data set. In short, Point in Time Search allows users to maintain a state that can be re-used by different queries in order to achieve consistent results.
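
The difference between paginating over live data and over a fixed view can be illustrated with a toy model. This is only a sketch: plain Python lists stand in for an index, a copied list stands in for a Point in Time, and `page_after` is a hypothetical helper, not how OpenSearch implements either mechanism.

```python
from bisect import bisect_right


def page_after(sorted_keys, search_after, size):
    """Return up to `size` sort keys strictly greater than `search_after`."""
    start = 0 if search_after is None else bisect_right(sorted_keys, search_after)
    return sorted_keys[start:start + size]


live = [10, 20, 30, 40]   # sort keys currently visible in the index
snapshot = list(live)     # frozen copy, standing in for a Point in Time

first_live = page_after(live, None, 2)
first_pit = page_after(snapshot, None, 2)

# A write lands between the first and second page requests.
live.append(25)
live.sort()

second_live = page_after(live, first_live[-1], 2)     # sees the new key 25
second_pit = page_after(snapshot, first_pit[-1], 2)   # unaffected by the write
print(second_live, second_pit)
```

Against the live list the second page picks up the newly written key, while the frozen copy keeps returning the pages that existed when the view was taken.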

Key goals:

Optimize resource consumption compared to a scroll by providing a consistent, shareable view of the data set across queries. Otherwise, each query retains its own segments, which means more file handles, more disk, and more heap to hold segment metadata.
Resilient to
1. Network failures : allows searches to move forward with a search_after parameter
2. Shard failures for read-only data : allows retries on other shard copies that share the same segments (Phase - II)
Replaces the scroll API as a more comprehensive solution for deep pagination when used with search_after
Point in Time will be supported by Asynchronous Search and Cross Cluster searches

APIs

Create Point In Time API

Unlike a Scroll, by creating a dedicated Point in Time, we decouple the context from a single query and make it re-usable across arbitrary search requests by passing the Point in Time Id. We can achieve this by using the Create Point in Time API.

POST <index>/_point_in_time?keep_alive=1m
{
   "id" : "s9O9QAIFaW5kZXgWOFVaMXFTc3pTV3lLMGE4VU42dmo4dwAWekthUVBmYnRUWk9XVzh4WW56TG5lZwAAAAAAAAAAARZQd3JkNlE4WlJicXRuS0M1VzNDaHV3BWluZGV4FjhVWjFxU3N6U1d5SzBhOFVONnZqOHcBFnpLYVFQZmJ0VFpPV1c4eFluekxuZWcAAAAAAAAAAAIWUHdyZDZROFpSYnF0bktDNVczQ2h1dwEWOFVaMXFTc3pTV3lLMGE4VU42dmo4dwAA",
   "created_time" : 1632727466283,
   "end_time" : 1632727526283
}
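
As a small sketch of how a client might read this response: the snippet below assumes that created_time and end_time are epoch milliseconds, which is consistent with the example values differing by exactly the requested keep_alive of 1m. The ID is replaced with a placeholder and the helper name is hypothetical.

```python
import json
from datetime import datetime, timezone

# Response body shaped like the example above (ID shortened to a placeholder).
response_text = (
    '{"id": "EXAMPLE_PIT_ID", '
    '"created_time": 1632727466283, "end_time": 1632727526283}'
)


def pit_lifetime_seconds(create_response):
    """Lifetime granted to the Point in Time, assuming epoch-millisecond fields."""
    return (create_response["end_time"] - create_response["created_time"]) / 1000


pit = json.loads(response_text)
lifetime = pit_lifetime_seconds(pit)
expires_at = datetime.fromtimestamp(pit["end_time"] / 1000, tz=timezone.utc)
print(pit["id"], lifetime, expires_at.isoformat())
```

A client can use the computed expiry to decide when to extend the keep alive or to delete the Point in Time early.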

Delete Point In Time API

Point-in-times are automatically closed when the keep_alive has elapsed. However, keeping point-in-times open has a cost; hence, point-in-times should be closed as soon as they are no longer used in search requests. We may also delete a Point in Time and free its resources before its keep_alive elapses by using the Delete Point in Time API.

DELETE /_point_in_time/<id>

List All Active Point In Time API

A useful admin API to have is to list all active Points in Time and their keep-alives.

GET /_point_in_time

[
    {
        "id" : "point_in_time_id_1",
        "created_time" : 1632727466283,
        "end_time" : 1632727526283
    },
    {
        "id" : "point_in_time_id_2",
        "created_time" : 1674662833272,
        "end_time" : 1632727526283
    }
    ...
    ...
]

Using a Point in Time in a search request:

In the search request, we pass the Point in Time ID and (optionally) a keep alive to extend the Point In Time. (Passing a PIT ID in a search request is supported in OpenSearch.)
A search request with a PIT ID will not accept indices, preference, routing, or indices options, as these are already specified when the Point In Time is created.

GET /_search
{
  "pit": {
        "id":  "ID_RETURNED_FROM_CREATE_POINT_IN_TIME_REQUEST", 
        "keep_alive": "1m" //optional to extend a Point In Time
  },
  "sort": [
    {
      "name.keyword": {
        "order": "desc"
      }
    }
  ],
  "search_after" : ["Opensearch", 1] //optional to fetch further results 
}
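
A minimal client-side sketch of paginating with a PIT ID and search_after follows. It assumes the standard search response shape, where each hit carries its sort values under "sort". `build_pit_search`, `paginate`, and `fake_send` are hypothetical names; `fake_send` is an in-memory stand-in for an HTTP call to /_search so that the example runs without a cluster.

```python
def build_pit_search(pit_id, sort, size, search_after=None, keep_alive=None):
    """Build a search body that targets a Point in Time instead of an index."""
    pit = {"id": pit_id}
    if keep_alive is not None:
        pit["keep_alive"] = keep_alive          # optional: extends the PIT
    body = {"size": size, "pit": pit, "sort": sort}
    if search_after is not None:
        body["search_after"] = search_after     # optional: resume after this key
    return body


def paginate(send, pit_id, sort, size):
    """Yield hits page by page; `send` posts a body and returns parsed JSON."""
    search_after = None
    while True:
        body = build_pit_search(pit_id, sort, size, search_after)
        hits = send(body)["hits"]["hits"]
        if not hits:
            return
        yield from hits
        search_after = hits[-1]["sort"]         # sort values of the last hit


def fake_send(body):
    """In-memory stand-in for the cluster: five documents sorted ascending."""
    docs = [{"_id": str(i), "sort": [i]} for i in range(5)]
    after = body.get("search_after")
    remaining = [d for d in docs if after is None or d["sort"] > after]
    return {"hits": {"hits": remaining[:body["size"]]}}


ids = [hit["_id"] for hit in paginate(fake_send, "EXAMPLE_PIT_ID", [{"n": "asc"}], 2)]
print(ids)
```

Because each request is fully described by the PIT ID and the last sort values, resending the same body after a lost response fetches the same page rather than skipping ahead.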

stockholmux · 2021-08-25T21:27:28Z

@rajkthakur Can you make this issue a little more... fleshed out? I'm just seeing the dummy text.

anasalkouz · 2021-10-04T23:22:50Z

Hi @rajkthakur, are you actively working on this issue? If so, could you please assign it to yourself and add a comment?

eirsep · 2021-10-05T05:33:03Z

@anasalkouz I am actively working on this issue.

stockholmux · 2021-10-07T16:16:49Z

Can we mark this as a proposal?

eirsep · 2021-10-19T11:33:17Z

I have done a small POC to check feasibility of the proposed APIs

We will not be able to provide a List All Points In Time API, as a point in time is not tied to any coordinator node.
Rather, the Point In Time ID is simply a Base64-encoded hash of a list of reader-context-to-node mappings and a UUID.
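
For illustration only, decoding the first few characters of the example ID from the Create Point In Time response shows that it is an opaque binary structure rather than a server-side handle. The sketch below decodes just a 16-character prefix so that no padding is needed; it does not attempt to parse the real ID layout, which is not specified in this thread.

```python
import base64

# First 16 characters of the example ID in the Create Point In Time response.
id_prefix = "s9O9QAIFaW5kZXgW"

# 16 Base64 characters decode to exactly 12 bytes, so no padding is required.
# This prefix contains no '+', '/', '-' or '_', so the standard and URL-safe
# alphabets give the same result here.
decoded = base64.b64decode(id_prefix)

# The index name used when the PIT was created appears inside the payload.
contains_index_name = b"index" in decoded
print(len(decoded), contains_index_name)
```

Clients should still treat the ID as opaque and pass it back unchanged.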

eirsep · 2021-10-19T11:36:01Z

We will provide new information in the Nodes stats API:
a nested object in the search section reporting the number of currently active Point In Time contexts and the total number of Point In Time contexts.

This will help keep track of point in time related statistics.

eirsep · 2021-10-19T11:37:03Z

We will provide settings to:

restrict the max open point in time contexts on a node
restrict the maximum keep alive allowed for point in times.
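
A rough sketch of how such limits could be checked when a Point In Time is created is shown below. The class, method, and parameter names are hypothetical and no actual setting keys are implied; the time parser only handles simple values such as "30s", "1m" or "2h".

```python
import re

_UNIT_TO_MS = {"ms": 1, "s": 1_000, "m": 60_000, "h": 3_600_000, "d": 86_400_000}


def parse_time_value(text):
    """Convert a simple time value such as '1m' into milliseconds."""
    match = re.fullmatch(r"(\d+)(ms|s|m|h|d)", text)
    if match is None:
        raise ValueError(f"unsupported time value: {text!r}")
    return int(match.group(1)) * _UNIT_TO_MS[match.group(2)]


class PitLimits:
    """Hypothetical per-node limits for Point In Time contexts."""

    def __init__(self, max_open_contexts, max_keep_alive):
        self.max_open_contexts = max_open_contexts
        self.max_keep_alive_ms = parse_time_value(max_keep_alive)

    def check_create(self, open_contexts, keep_alive):
        """Raise if creating one more PIT would violate either limit."""
        if parse_time_value(keep_alive) > self.max_keep_alive_ms:
            raise ValueError("keep_alive exceeds the configured maximum")
        if open_contexts >= self.max_open_contexts:
            raise RuntimeError("too many open point in time contexts on this node")


limits = PitLimits(max_open_contexts=2, max_keep_alive="1h")
limits.check_create(open_contexts=0, keep_alive="30m")   # within both limits
```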

eirsep · 2021-10-19T11:38:11Z

Similar to scroll, we will support deleting all points in time by passing _all in the path of the Delete Point In Time API.

eirsep · 2021-11-02T06:11:29Z

Currently doing a POC to add an API for PIT disk utilization i.e. segments retained by PIT Ids. I am trying to put out stats similar to cat segments, but for Points In Time.

nknize · 2021-12-03T17:25:29Z

Segments are retained

This can get expensive. I see the objective is to optimize resource consumption ✔️ . I think this is a specialized use case for archival or time based analysis use cases and not the normal search use case so this should be configured through an Index Scoped setting.

Segment replication will heavily change this design as it will push a lot of the PIT burden on the storage layer.

eirsep · 2021-12-03T18:00:39Z

Segments are retained

This can get expensive. I see the objective is to optimize resource consumption ✔️ . I think this is a specialized use case for archival or time based analysis use cases and not the normal search use case so this should be configured through an Index Scoped setting.

@nknize
I agree that segment retention is expensive, but this is exactly what scrolls do today. The resource consumption is being optimised only in comparison to Scrolls(i.e. being able to share PIT Id across queries while scrolls create different contexts per scroll).

(Among other things) PIT would be a replacement for scrolls and hence be the pagination solution for Opensearch, which is also an important use case to keep in mind. We would limit the number of PIT contexts that can be opened via a setting and would also provide an API to provide info about PITs' disk consumption

Will look into how-to and benefits of configuring PIT through Index Scoped setting.

Segment replication will heavily change this design as it will push a lot of the PIT burden on the storage layer.

Segments retained by PIT/Scrolls will not be replicated I think. Can you plz elaborate the caveats you see?

nknize · 2021-12-03T22:29:22Z

PIT would be a replacement for scrolls

👍

Can you plz elaborate the caveats you see?

A storage engine w/ versioned backups (e.g., S3 buckets w/ versioning) can be used to restore files from backups. I think this along w/ Lucene sequence IDs enables this feature without having to retain as many historic segments. A storage engine w/o versioning (e.g., local NFS, or smb) could possibly use the segment retention logic provided by this feature.

eirsep · 2021-12-06T14:16:37Z

Can you plz elaborate the caveats you see?

A storage engine w/ versioned backups (e.g., S3 buckets w/ versioning) can be used to restore files from backups. I think this along w/ Lucene sequence IDs enables this feature without having to retain as many historic segments. A storage engine w/o versioning (e.g., local NFS, or smb) could possibly use the segment retention logic provided by this feature.

Ack.
Thanks for elaborating! So, the segment replication feature would have to handle how Scrolls, which currently rely on segment retention, would function (probably by using versioned backups as you've mentioned). Hence, by extension, PITs will get handled too, as PITs are simply re-using that idea.

eirsep · 2022-02-23T10:11:11Z

@CEHENKLE Can you please create a feature branch feature/point-in-time which will be used to run full test suite?

CEHENKLE · 2022-02-24T23:32:12Z

https://github.com/opensearch-project/OpenSearch/tree/feature/point-in-time created

elfisher · 2022-04-05T16:42:58Z

@rramachand21 is this still aiming for 2.0?

andrross · 2022-04-08T17:30:18Z

I think this is a specialized use case for archival or time based analysis use cases and not the normal search use case

I'm reading this feature as primarily a replacement/improvement over the scroll API. It seems like the primary use case is for pagination (which is currently solved by the scroll API but does have limitations). This essentially generalizes it a bit and gives semantics similar to snapshot isolation in a traditional database where you can do multiple queries within a transaction and observe a consistent view of the data. @nknize do you have any major concerns moving forward with this feature?

I do have a nitpick about the name, though, particularly the /_point_in_time API. "Point in time" is likely a phrase to be overloaded in the future, first and most obvious to me is something like "point in time restore". There also may in fact be other features to solve archival or time-based analysis requirements that would require the ability to do searches from points far in the past. I'd love to hear opinions from other folks, but given that this feature's scope is pretty narrowly focused on "establish a point in time now and allow me to do some searches against it for a very limited duration" it might be a good idea to give it a name that won't conflict/confuse with future point-in-time-related features.

andrross · 2022-04-08T17:45:10Z

Scroll API cannot share point in time context with other queries

The inability to share point-in-time contexts is mentioned several times, but what is the use case for sharing these contexts? The pagination use case makes total sense but I don't think that generally requires sharing the context. The ability to restrict maximum keep-alive duration also makes a lot of sense to put a cap on worst-case resource consumption. However, that limit is likely to be in tension with the usefulness of sharing the contexts if they are short-lived, so I'm curious about the use cases that are motivating the share-ability requirement.

Bukhtawar · 2022-04-08T18:39:20Z

I think the major use case for share-ability is the ability to execute different types of queries and derive better insights from the same consistent view of the data. Some queries might also fail or time out for various reasons. Once a PIT is created, retries become simpler, as it allows the user to resume queries on the same view.

"Point in time" is likely a phrase to be overloaded in the future, first and most obvious to me is something like "point in time restore".

+1 on the thought, @andrross does /_search/pit or /_search/_point_in_time sound better?

andrross · 2022-04-08T19:13:35Z

@Bukhtawar I definitely like /_search/_point_in_time better than just /_point_in_time.

One last naming nitpick, I do prefer spelling it out as opposed to using the acronym "pit", but either way we should be consistent. If we stick with "point_in_time" then that should be used in the search request as well.

loretoparisi · 2022-08-30T18:05:04Z

@rajkthakur I'm getting this error { Message: "Your request: '/_pit' is not allowed." } when querying AWS OpenSearch, which should support releases 1.3, 1.2, 1.1, 1.0. According to the latest AWS announcement they have deployed OS 1.3, but there aren't any details about PiT support.
Which OS version has this PR?

dhruv16dhr · 2022-09-01T15:24:12Z

@rajkthakur I'm getting this error { Message: "Your request: '/_pit' is not allowed." } when querying AWS OpenSearch, which should support releases 1.3, 1.2, 1.1, 1.0. According to the latest AWS announcement they have deployed OS 1.3, but there aren't any details about PiT support. Which OS version has this PR?

@loretoparisi AWS OpenSearch will support point in time search in an upcoming release. It is not supported in AWS OpenSearch 1.3. Point-in-time search will be supported in the open-source OpenSearch 2.3.0 release.

dreamer-89 · 2022-09-07T18:30:58Z

@rajkthakur: I see this issue is labeled for v2.3.0 release, which has code freeze today i.e. Sep 7. I see open backport PRs to 2.x. Can you please prioritize review/merge.

stephen-crawford · 2022-10-26T13:15:17Z

Hi @rajkthakur, just checking in from the security team to see if there is anything you need finalized from us before the upcoming 2.4 freeze.

anasalkouz · 2022-10-28T21:07:21Z

@rajkthakur do you still track this for 2.4 release? code freeze on 11/3
Is there anything pending? otherwise, feel free to close it.

dhruv16dhr · 2022-10-29T04:56:12Z

@anasalkouz Yes we are tracking this for 2.4 release. Documentation is pending, we will be closing it before 11/3.
@bharath-techie Please check with @rajkthakur and close this by next week

bharath-techie · 2022-10-29T04:58:59Z

Documentation PR - opensearch-project/documentation-website#1753

Opensearch does not currently appear to support `_shard_doc` as part of Point In Time search, so remove references to it from the documentation. Further details: - I don't see any reference to `_shard_doc` in the code on [Opensearch's](https://github.com/opensearch-project/OpenSearch) main branch at time of proposing the change. - ElasticSearch added `_shard_doc` in [7.12](elastic/elasticsearch-net#5337) and it looks like it was not added as part of Opensearch's [Point In Time work](opensearch-project/OpenSearch#1147).

loretoparisi · 2024-04-22T21:08:51Z

supported

@dhruv16dhr I've recently attempted again to use PIT using AWS OpenSearch / Kibana 2.5, but I'm getting

{ Message: "Your request: '/_pit' is not allowed." }

the currently installed version of AWS OS is

"version" : {
    "number" : "7.10.2",
    "build_snapshot" : false,
    "lucene_version" : "9.4.2",
    "minimum_wire_compatibility_version" : "7.10.0",
    "minimum_index_compatibility_version" : "7.0.0"
  }

rajkthakur added the enhancement Enhancement or improvement to existing feature or request label Aug 24, 2021

anasalkouz added the Indexing & Search label Sep 29, 2021

rajkthakur changed the title ~~Point In Time Search~~ RFC : Point In Time Search Oct 18, 2021

eirsep mentioned this issue Dec 3, 2021

Point In Time Search POC #1652

Closed

4 tasks

eirsep mentioned this issue Feb 21, 2022

Support for integration with Point In Time Search opensearch-project/OpenSearch-Dashboards#1268

Open

andrross mentioned this issue Apr 7, 2022

Create PIT API #2745

Merged

5 tasks

nknize mentioned this issue Apr 12, 2022

[Segment Replication] Support shard promotion. #2212

Closed

4 tasks

Bukhtawar added roadmap v2.3.0 'Issues and PRs related to version v2.3.0' labels Aug 1, 2022

This was referenced Aug 1, 2022

Add changes for Create PIT and Delete PIT rest layer and rest high level client #4064

Merged

Add changes to Point in time segments API service layer #4105

Merged

ajaymovva mentioned this issue Aug 3, 2022

Added Point In Time Node Stats API ServiceLayer Changes #4030

Merged

5 tasks

ajaymovva mentioned this issue Aug 15, 2022

Added RestLayer Changes for PIT stats #4217

Merged

5 tasks

This was referenced Sep 2, 2022

Added rest layer changes for List all PITs and PIT segments #4388

Merged

[Backport 2.x] [Point in time] Backport point in time changes #4406

Closed

dreamer-89 added v2.4.0 'Issues and PRs related to version v2.4.0' and removed v2.3.0 'Issues and PRs related to version v2.3.0' labels Sep 8, 2022

bharath-techie mentioned this issue Sep 28, 2022

[Backport 2.x] [Point in time] Backport point in time changes #4616

Merged

6 tasks

rajkthakur closed this as completed Nov 3, 2022

SingingTree mentioned this issue Apr 19, 2023

Remove references to _shard_doc opensearch-project/documentation-website#3808

Merged

bharath-techie mentioned this issue May 31, 2023

Point in time based searches should provide an option to ignore missing indices #7418

Open
