Write to multiple indexes #6240

Closed
soulmerge opened this issue May 20, 2014 · 14 comments

@soulmerge

It is possible to create an index as an alias to one or more indexes. When such an alias-index points to multiple real indexes, it is not possible to write to that alias. It would be nice to have write access to the alias.

The rest of this request contains some thoughts on this matter. I will use the following terms:

  • "index-a": A normal index (not an alias)
  • "index-b": Another index (not an alias)
  • "alias-index": An alias index pointing to both "index-a" and "index-b"

When writing to the "alias-index", there are generally two options:

  • Write to both of them simultaneously.
  • Write to either "index-a" or "index-b".

We would need to declare which of the two options - let's call it the "target-definition" - we want. Again, there are two options:

  • The "target-definition" is part of the alias definition.
  • The "target-definition" is provided as an option to the index operation.

It is further possible to combine both approaches, having a "target-definition" in the alias configuration and the option to override that definition when indexing a document.

In my opinion, making the "alias-index" behave like a real index will support the principle of least surprise and thus increase users' trust. So I think it would be best if the "target-definition" were part of the "alias-index".

I also think that it is not really necessary to be able to change the "target-definition" per write operation: If the application already knows where it wants to write to, it can write to the specific index directly. The only advantage of such a feature would be that it could reduce network traffic if someone wanted to write to five indexes out of seven, for example. But then it would be more efficient to create another write index spanning the five indexes. Looks like I talked myself out of the second option for the "target-definition" :-)

Considering a writable alias, another question arises: what is the default behaviour? We have two options again:

  • All index operations insert into both/all indexes.
  • Writing is disabled by default.

The first option would further enforce the principle of least surprise, whereas the second option could prevent accidental inserts into the wrong index. I think this is just a matter of priorities (although I would prefer inserts to all indexes).
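
To make the two defaults concrete, here is a minimal in-memory sketch of an alias that carries its own "target-definition". It is not Elasticsearch code: `WritePolicy`, `Alias` and `resolve_write_targets` are hypothetical names invented for this illustration, and the rule that a single-index alias is always writable is an assumption based on how such aliases already behave in this discussion.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class WritePolicy(Enum):
    """Hypothetical "target-definition" stored on the alias itself."""
    ALL = "all"            # every write goes to all member indexes
    DISABLED = "disabled"  # writes through a multi-index alias are rejected


@dataclass
class Alias:
    name: str
    indexes: List[str]
    write_policy: WritePolicy = WritePolicy.DISABLED


def resolve_write_targets(alias: Alias) -> List[str]:
    """Return the indexes that a write through `alias` should reach."""
    if len(alias.indexes) == 1:
        # Assumption: a single-index alias is unambiguous, so it stays writable.
        return list(alias.indexes)
    if alias.write_policy is WritePolicy.ALL:
        return list(alias.indexes)
    raise PermissionError(f"alias {alias.name!r} is not writable")
```

With `WritePolicy.ALL` the alias behaves like a real index that fans out to "index-a" and "index-b"; with `WritePolicy.DISABLED` an accidental write fails loudly instead of landing in the wrong index.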

Thanks for reading all the way through the long text :-)

@yeroc

yeroc commented Nov 5, 2014

Has this request been considered at all? It would be helpful for our use case where we need to be able to perform periodic re-indexing without going offline. See http://www.mavengineering.com/blog/2014/02/12/seamless-elasticsearch-reindexing/ for a nice write-up but note how they have to query the alias and separately index into each of the underlying indices. If this request were implemented this would be more efficient.

@clintongormley

Hi @soulmerge

We've spent time today discussing this ticket. In fact I opened one that was very similar several years ago (#2309).

I think that there are too many different data models and ways of indexing to be able to support them all with more complex APIs. The append-only use case is different from the update use case. Multiple indices that exist because you're transitioning between mappings are different from multiple indices that reflect time periods.

Instead, we provide the primitives that allow you to do exactly what you need application side. You can use the primitives to implement your exact model (eg using one alias for search across multiple indices, and one alias for the current write index).

Thanks for opening the issue, but we've decided against doing anything more here for now.
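
As a rough illustration of the two-alias pattern mentioned above (one alias for searching across several indices, one alias for the current write index), the sketch below only builds the JSON body that would be sent to the Elasticsearch `_aliases` endpoint as a list of `add` actions. The helper name `two_alias_actions` and the alias and index names are invented for the example, and nothing is sent over the network.

```python
import json
from typing import Dict, List


def two_alias_actions(search_alias: str, write_alias: str,
                      all_indices: List[str], write_index: str) -> Dict:
    """Build an `_aliases` request body: the search alias covers every index,
    the write alias points at exactly one index."""
    actions = [{"add": {"index": name, "alias": search_alias}}
               for name in all_indices]
    actions.append({"add": {"index": write_index, "alias": write_alias}})
    return {"actions": actions}


if __name__ == "__main__":
    # Hypothetical names for a time-partitioned data set.
    body = two_alias_actions(
        search_alias="logs-search",
        write_alias="logs-write",
        all_indices=["logs-2014-10", "logs-2014-11"],
        write_index="logs-2014-11",
    )
    print(json.dumps(body, indent=2))
```

The application then searches through the first alias and indexes through the second, so only the alias definitions change when a new index is added.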

@yeroc

yeroc commented Nov 12, 2014

@clintongormley I agree that the proposal above was too complex and could be easily satisfied with multiple aliases (no need for the routing etc.) but that still doesn't address the situation where you need to index into multiple indexes while transitioning to a new index. It sounds like you've only considered the case where users only need to index into a single index (eg. for time-based partitioning of indices).

The other use case is where we actually need to write updates into two (or more) indices during index rebuilds. Right now, we have to query a write-alias for its underlying indices and write to each one individually. This leaves us open to race conditions: how often do we query the write-alias to get the current list of indices? If we query too often, we hurt performance; if we query too rarely, we end up writing to an index we shouldn't be writing to. So I respectfully disagree with the statement that the required primitives are in place today for this use case.
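
The trade-off described here can be sketched as a small cache in front of the alias lookup. This is an illustration only: `CachedAliasResolver` is a hypothetical class, `fetch_indices` stands in for whatever call the application uses to read the alias, and the injectable clock exists purely so the staleness window can be demonstrated.

```python
import time
from typing import Callable, List, Optional


class CachedAliasResolver:
    """Caches the index list behind a write-alias for `ttl_seconds`."""

    def __init__(self, fetch_indices: Callable[[], List[str]],
                 ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch_indices      # hypothetical alias lookup
        self._ttl = ttl_seconds
        self._clock = clock
        self._cached: Optional[List[str]] = None
        self._fetched_at = 0.0

    def indices(self) -> List[str]:
        now = self._clock()
        if self._cached is None or now - self._fetched_at >= self._ttl:
            # Cache miss or expired entry: pay for one alias lookup.
            self._cached = list(self._fetch())
            self._fetched_at = now
        # Within the TTL this may be stale: an alias change made after the
        # last lookup is not visible yet, which is the race described above.
        return list(self._cached)
```

A short TTL narrows the window in which writes go to an outdated index list but costs more lookups; a long TTL does the opposite. No TTL value removes the race entirely.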

@clintongormley

@yeroc no, I did consider that case as well, and I agree with you that the primitives are not yet in place to handle this easily. However, the multi-index alias is not the right solution for this as it comes with its own finicky problems. Instead, the changes API, which we are working on, should be a better solution.

@soulmerge
Author

@clintongormley Is my understanding correct, that the underlying use-case of this bug-report (creating a new index for some database while still using the old index for reads and writes) will be solved with the upcoming changes API?

If this is the case and the changes API provides more or less the same features as the changes API in CouchDB (let's say "push notifications for changes in data"), I fail to understand how my use case can be implemented with the help of these new features.

Does this mean that writing to multiple indexes should be implemented using this API? This would just imply that the complexity of the multi-write implementation is moved from within the ES server (write to both indexes at once) to the clients (update the second index whenever the first index changes) and frankly does not add any value to the current scenario (write the same data to both indexes), as the race condition remains in both cases.

I would love to hear a clarification if I misunderstood your suggestion.

Cheers

@clintongormley

@soulmerge the changes API would indeed be like the CouchDB change log. The idea is that you would have an alias pointing to the old index. Your application talks only to the alias.

You set up a new index and reindex your data from old to new using the changes API. At some point you switch the alias from old to new. Your application continues talking to the same alias, but now it is using the new index instead.

The only missing part is this: at the moment we switch the alias, there may still be some changes in the old index which have not yet made it into the new index. Somehow we need to pause writes to the old index until all transactions have been replicated. This could obviously be done in the application, but I'm still trying to figure out how we could do this natively so that it will be transparent to the user.
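
One way to picture the "pause writes, then switch" step on the application side is the in-memory sketch below. It is a simplification under stated assumptions: the two indices are plain dictionaries, `AliasedStore` is a hypothetical class, and a single lock plays the role of pausing writers while outstanding changes are copied across.

```python
import threading
from typing import Any, Dict, List, Tuple


class AliasedStore:
    """In-memory stand-in for one alias that points at exactly one index."""

    def __init__(self, old_index: Dict[str, Any], new_index: Dict[str, Any]):
        self._old = old_index
        self._new = new_index
        self._target = old_index
        # Changes written to the old index but not yet copied to the new one.
        self._pending: List[Tuple[str, Any]] = []
        self._lock = threading.Lock()

    def write(self, doc_id: str, doc: Any) -> None:
        with self._lock:
            self._target[doc_id] = doc
            if self._target is self._old:
                self._pending.append((doc_id, doc))

    def cut_over(self) -> None:
        # Holding the lock blocks writers, i.e. writes are paused here.
        with self._lock:
            for doc_id, doc in self._pending:
                self._new[doc_id] = doc   # drain the outstanding changes
            self._pending.clear()
            self._target = self._new      # switch the "alias"
```

Because the drain and the switch happen under the same lock, no write can land in the old index after the last change has been copied. Doing the same thing transparently inside the server is the open question raised above.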

@mfn
Contributor

mfn commented Apr 26, 2015

Has there been any progress on this? I'm now in the midst of my third ES implementation and this is the third time this issue came up :)

Is #1242 related? It bears the name "Changes API" but seems to resemble something else (notifications for outside ES?).

Currently I'm planning to have an alias for writing pointing to multiple indices but doing the writes manually by:

  • read the aliases to figure out to which indices it points
  • write to all these indices in a loop

The nasty things here are:

  • every time I update a document I've to read the aliases
  • handle logic if one indexing fails and the other doesn't (or, there could be even more indices, etc.)
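
The manual loop and its failure handling might look roughly like the sketch below. It assumes the caller has already resolved the alias to a list of index names; `write_to_all` is a hypothetical helper, and `index_doc` is a placeholder for whatever client call actually indexes one document into one index.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple


def write_to_all(indices: Iterable[str], doc_id: str, doc: Any,
                 index_doc: Callable[[str, str, Any], None]
                 ) -> Tuple[List[str], Dict[str, Exception]]:
    """Write one document to every index and report partial failures.

    Returns (succeeded index names, {failed index name: exception}).
    """
    succeeded: List[str] = []
    failed: Dict[str, Exception] = {}
    for name in indices:
        try:
            index_doc(name, doc_id, doc)
            succeeded.append(name)
        except Exception as exc:  # keep going so one bad index does not block the rest
            failed[name] = exc
    return succeeded, failed
```

The caller still has to decide what a non-empty `failed` map means (retry, roll back the successful writes, or mark the index as out of sync), which is exactly the logic the comment above calls nasty.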

@clintongormley

@mfn The Changes API is definitely related, but a few bits of functionality need to be built to support this:

  • Sequence IDs (Add Sequence Numbers to write operations #10708) so that we know which documents have been safely replicated to the replica shards
  • Changes API which will allow us to follow all new documents in an index, sorted by sequence ID
  • Reindex API which can follow the changes API and index the same (or an updated version of the) document to a new index

We are working on all of these parts, but they are big changes to get right.
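
How the three pieces could fit together can be sketched in memory. This is not the planned Elasticsearch API: `SourceIndex`, `changes_since` and `follow` are hypothetical names, the sequence number is simply a counter per write, and the target index is a plain dictionary.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass(frozen=True)
class Change:
    seq: int       # sequence ID assigned at write time
    doc_id: str
    doc: Any


class SourceIndex:
    """Toy index that records every write in a sequence-ordered log."""

    def __init__(self) -> None:
        self.docs: Dict[str, Any] = {}
        self._log: List[Change] = []

    def index(self, doc_id: str, doc: Any) -> int:
        seq = len(self._log) + 1
        self.docs[doc_id] = doc
        self._log.append(Change(seq, doc_id, doc))
        return seq

    def changes_since(self, seq: int) -> List[Change]:
        """All changes with a sequence ID greater than `seq`, in order."""
        return self._log[seq:]


def follow(source: SourceIndex, target: Dict[str, Any], checkpoint: int,
           transform: Callable[[Any], Any] = lambda doc: doc) -> int:
    """Replay changes after `checkpoint` into `target`; return the new checkpoint."""
    for change in source.changes_since(checkpoint):
        target[change.doc_id] = transform(change.doc)
        checkpoint = change.seq
    return checkpoint
```

Calling `follow` repeatedly with the returned checkpoint lets a reindex job catch up with new writes without rescanning the whole source, and `transform` is where an updated version of the document would be produced.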

@alecbz

alecbz commented Feb 18, 2020

Any update on this? What's the current "best practice" for how to handle this flow with elastic?

Just to fully explain the use-case I'm imagining (and I believe the commenters above were referring to the same thing):

  • We have an elastic index that contains derived data built from some primary source.
  • Mutations get dual-written to both the primary source and the elastic index.
  • Periodically, we re-build the index from scratch, to make sure things are in-sync.

The solution I'm imagining is:

  1. The rebuilder job creates a new index.
  2. Live writes begin dual-writing to both the old and new index (as well as the primary data source).
  3. The rebuilder job goes through and writes all data from the primary source to the new index.
  4. Once the rebuilder is done, reads start going to the new index.

It's unclear how to best implement step 2 -- how do live writers know to write to the second, new index?
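
One possible shape for steps 1 to 4 is sketched below, entirely in memory. Everything in it is an assumption made for illustration: the shared `WriteTargets` registry is one hypothetical answer to how live writers learn about the new index, indices are dictionaries mapping a document ID to a `(version, document)` pair, and the version check is what stops the rebuilder from overwriting a newer live write.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

Index = Dict[str, Tuple[int, Any]]  # doc_id -> (version, document)


class WriteTargets:
    """Hypothetical shared registry that live writers consult on every write."""

    def __init__(self, *names: str) -> None:
        self._names: List[str] = list(names)

    def add(self, name: str) -> None:
        if name not in self._names:
            self._names.append(name)

    def current(self) -> List[str]:
        return list(self._names)


def put_if_newer(index: Index, doc_id: str, doc: Any, version: int) -> None:
    existing = index.get(doc_id)
    if existing is None or existing[0] < version:
        index[doc_id] = (version, doc)


def live_write(indexes: Dict[str, Index], targets: WriteTargets,
               doc_id: str, doc: Any, version: int) -> None:
    for name in targets.current():            # step 2: dual-write
        put_if_newer(indexes[name], doc_id, doc, version)


def rebuild(primary: Index, indexes: Dict[str, Index], targets: WriteTargets,
            new_name: str,
            during_rebuild: Optional[Callable[[], None]] = None) -> str:
    indexes[new_name] = {}                    # step 1: create the new index
    targets.add(new_name)                     # step 2: announce it to writers
    snapshot = dict(primary)                  # rebuilder reads a snapshot
    if during_rebuild is not None:
        during_rebuild()                      # hook to simulate concurrent writes
    for doc_id, (version, doc) in snapshot.items():   # step 3: backfill
        put_if_newer(indexes[new_name], doc_id, doc, version)
    return new_name                           # step 4: caller points reads here
```

The sketch moves the open question into `WriteTargets`: in a real deployment that registry has to live somewhere every writer can see, which brings back the alias-lookup and staleness concerns raised earlier in this thread.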

@chenyiping111

I have the same problem. Are there any useful suggestions? Our project has more than 100 APIs that use ES. My new index changes the mapping (it uses many multi-fields), but I can't change all the APIs quickly. Too many of the APIs modify data, so I must keep new_index and old_index synchronized in real time.

@chenyiping111

I guess maybe a pipeline can do this, but I don't know how.

@chenyiping111

> Has there been any progress on this? I'm now in the midst of my third ES implementation and this is the third time this issue came up :)
>
> Is #1242 related? It bears the name "Changes API" but seems to resemble something else (notifications for outside ES?).
>
> Currently I'm planning to have an alias for writing pointing to multiple indices but doing the writes manually by:
>
>   • read the aliases to figure out to which indices it points
>   • write to all these indices in a loop
>
> The nasty things here are:
>
>   • every time I update a document I've to read the aliases
>   • handle logic if one indexing fails and the other doesn't (or, there could be even more indices, etc.)

I have the same problem. Do you have any suggestions?

@yeroc

yeroc commented Oct 23, 2020

@chenyiping111 I still don't see a good solution for this. The Changes API enhancement that @clintongormley mentions as an alternative has remained open since 2011 with seemingly little to no traction. It would be great if Elastic would reconsider this ticket, as having multiple writable indexes against a single alias seems like a much more achievable solution to this issue. Sadly, I believe Elastic are much more focused on the APM space than on the core search product.

@hauntingEcho

I've opened a more-specific ticket around the reindexing use case in #68003 which hopefully might get more traction
