Elasticsearch fails to start with error: "Failed to find metadata for index" on every restart #47276

Closed
redbaron4 opened this issue Sep 30, 2019 · 6 comments · Fixed by #47285
Labels: >bug, :Distributed/Cluster Coordination

Comments

redbaron4 commented Sep 30, 2019

Elasticsearch version: 7.3.2

Plugins installed: []

JVM version (java -version): 1.8.0

OS version: Centos-7.4

We have been running an Elasticsearch cluster consisting of 5 nodes for quite some time now. After upgrading to v7, we have noticed that our nodes frequently refuse to start with an error like nested: IOException[failed to find metadata for existing index XXX].

The first time I encountered this error, I searched the discuss board and found this, which talks of the stronger startup checks enforced by ES 7.x and points to the data directory getting corrupted due to external factors. Thinking it might be the same problem, I duly took the node offline and ran a disk check, which reported no errors. So I deleted the data directory, started the node, and that was that.

However, the next time I did a rolling upgrade of my cluster, a different node failed with a similar error (the index name was different). I followed the same emergency procedure (delete the data directory and restart the node) and the cluster was fixed.

Now after every rolling upgrade I seem to run into this error on at least one of my nodes. The index name always points to a closed index. The error occurs only on restart (never while Elasticsearch is running).

I find it hard to believe that all 5 of my nodes have a disk problem because:

  • I have run fsck every time this error has occurred and no errors have been reported.
  • Elasticsearch runs without a problem for days on end (a disk error or other programs corrupting the data would cause a running Elasticsearch to crash, as happened on one of my nodes about a year back).

Yesterday we had a power issue at the data center which led to all nodes getting power-cycled. Upon restart, 4 out of 5 nodes failed to start with the same errors. On all 4 nodes, the names of the indices were different (the indices in question were "closed"). I had no option but to delete all data on those 4 nodes (thus losing about 80% of our Elasticsearch data).

The errors seen were

[2019-09-30T10:36:58,205][WARN ][o.e.b.ElasticsearchUncaughtExceptionHandler] [esnode3] uncaught exception in thread [main]
org.elasticsearch.bootstrap.StartupException: ElasticsearchException[failed to bind service]; nested: IOException[failed to find metadata for existing index ssl-2019.09.20 [location: GmslGWkHTLGQowmMHFut7A, generation: 11]];
        at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:163) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.bootstrap.Elasticsearch.execute(Elasticsearch.java:150) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:86) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:124) ~[elasticsearch-cli-7.3.2.jar:7.3.2]
        at org.elasticsearch.cli.Command.main(Command.java:90) ~[elasticsearch-cli-7.3.2.jar:7.3.2]
        at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:115) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:92) ~[elasticsearch-7.3.2.jar:7.3.2]
Caused by: org.elasticsearch.ElasticsearchException: failed to bind service
        at org.elasticsearch.node.Node.<init>(Node.java:617) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.node.Node.<init>(Node.java:258) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:221) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:221) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:349) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:159) ~[elasticsearch-7.3.2.jar:7.3.2]
        ... 6 more
Caused by: java.io.IOException: failed to find metadata for existing index ssl-2019.09.20 [location: GmslGWkHTLGQowmMHFut7A, generation: 11]
        at org.elasticsearch.gateway.MetaStateService.loadFullState(MetaStateService.java:99) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.gateway.GatewayMetaState.upgradeMetaData(GatewayMetaState.java:141) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.gateway.GatewayMetaState.<init>(GatewayMetaState.java:95) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.node.Node.<init>(Node.java:492) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.node.Node.<init>(Node.java:258) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:221) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:221) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:349) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:159) ~[elasticsearch-7.3.2.jar:7.3.2]
        ... 6 more
[2019-09-30T10:36:58,210][INFO ][o.e.x.m.p.NativeController] [esnode3] Native controller process has stopped - no new native processes can be started

and

[2019-09-30T10:39:59,737][WARN ][o.e.b.ElasticsearchUncaughtExceptionHandler] [esnode2] uncaught exception in thread [main]
org.elasticsearch.bootstrap.StartupException: ElasticsearchException[failed to bind service]; nested: IOException[failed to find metadata for existing index dns-2019.09.22 [location: ZMenLry9Qxe5-2-XNrWj2A, generation: 15]];
        at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:163) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.bootstrap.Elasticsearch.execute(Elasticsearch.java:150) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:86) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:124) ~[elasticsearch-cli-7.3.2.jar:7.3.2]
        at org.elasticsearch.cli.Command.main(Command.java:90) ~[elasticsearch-cli-7.3.2.jar:7.3.2]
        at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:115) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:92) ~[elasticsearch-7.3.2.jar:7.3.2]
Caused by: org.elasticsearch.ElasticsearchException: failed to bind service
        at org.elasticsearch.node.Node.<init>(Node.java:617) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.node.Node.<init>(Node.java:258) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:221) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:221) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:349) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:159) ~[elasticsearch-7.3.2.jar:7.3.2]
        ... 6 more
Caused by: java.io.IOException: failed to find metadata for existing index dns-2019.09.22 [location: ZMenLry9Qxe5-2-XNrWj2A, generation: 15]
        at org.elasticsearch.gateway.MetaStateService.loadFullState(MetaStateService.java:99) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.gateway.GatewayMetaState.upgradeMetaData(GatewayMetaState.java:141) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.gateway.GatewayMetaState.<init>(GatewayMetaState.java:95) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.node.Node.<init>(Node.java:492) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.node.Node.<init>(Node.java:258) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:221) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:221) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:349) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:159) ~[elasticsearch-7.3.2.jar:7.3.2]
        ... 6 more

Is it possible that data of closed indices is not being persisted properly (leading to issues at restart)? Can this be mitigated somehow (maybe by rolling back to less strict consistency checks)?

@redbaron4 redbaron4 changed the title from "Failed to find metadata for closed index on every restart" to "Elasticsearch fails to start with error: "Failed to find metadata for index" on every restart" Sep 30, 2019
@ywelsch ywelsch added the :Distributed/Cluster Coordination label Sep 30, 2019
@elasticmachine (Collaborator)

Pinging @elastic/es-distributed


ywelsch commented Sep 30, 2019

Hi @redbaron4. Thanks for reporting this. A few questions:

  • Are these master-eligible nodes?
  • Are these closed replicated indices (i.e. are there allocated shards for these indices)? Have these indices been closed in 6.x or 7.x?


ywelsch commented Sep 30, 2019

I have now managed to reproduce this with data-only nodes:

public void testRelocatedClosedIndexIssue() throws Exception {
    final String indexName = "closed-index";
    List<String> dataNodes = internalCluster().startDataOnlyNodes(2);
    // allocate shard to first data node
    createIndex(indexName, Settings.builder()
        .put(IndexMetaData.SETTING_NUMBER_OF_SHARDS, 1)
        .put(IndexMetaData.SETTING_NUMBER_OF_REPLICAS, 0)
        .put("index.routing.allocation.include._name", String.join(",", dataNodes.get(0)))
        .build());
    indexRandom(randomBoolean(), randomBoolean(), randomBoolean(), IntStream.range(0, randomIntBetween(0, 50))
        .mapToObj(n -> client().prepareIndex(indexName, "_doc").setSource("num", n)).collect(toList()));
    assertAcked(client().admin().indices().prepareClose(indexName));
    // move single shard to second node
    client().admin().indices().prepareUpdateSettings(indexName).setSettings(Settings.builder()
        .put("index.routing.allocation.include._name", String.join(",", dataNodes.get(1)))).get();
    ensureGreen(indexName);
    // give first node a bit of time to clean up the index (including index metadata)
    Thread.sleep(100);

    internalCluster().fullRestart(); // fails as it can't find the index metadata on the first node
}

The same should not happen with master-eligible data nodes though. Can you clarify that point for us? If this only affects data-only nodes we might be able to provide instructions on how to get the node running again, without losing data.


redbaron4 commented Sep 30, 2019

@ywelsch Thanks for looking at this

  • These are not master-eligible nodes. There are 3 other master-eligible nodes in the cluster (which don't store any data). While doing a rolling upgrade, the script processes all the master-eligible nodes first and then moves on to the data nodes.

All data nodes (the kind that show the failure) have a config similar to the one given below:

cluster.name: nemo-cluster 

node.name: esnode1

bootstrap.memory_lock: true
network.host: _site:ipv4_
discovery.zen.ping.unicast.hosts: 
    - 10.44.0.43
    - 10.44.0.44
    - 10.44.0.45
    
discovery.zen.minimum_master_nodes: 1

# Fix 9201 for intra cluster comm
transport.port: 9201

path.logs: /var/log/elasticsearch
path.data: /data/elasticsearch

# Settings related to node

node.master: false
node.data: true
node.ingest: true

# Monitoring settings (6.3+)
xpack.monitoring.enabled: true

The master-eligible nodes are 10.44.0.43, .44, and .45, which don't show any failure.

  • Closed indices are mostly created on 7.x. For the latest incident, the indices mentioned in the logs above were created on 7.x.


ywelsch commented Sep 30, 2019

Ok, this confirms my findings. First of all, this is a bug related to how "closed replicated indices" (introduced in 7.2) interact with the index metadata storage mechanism, which has special handling for closed indices. On non-master-eligible data nodes, it's possible for the node's manifest file (which tracks the relevant metadata state that the node should persist) to become out of sync with what's actually stored on disk, leading to an inconsistency that is then detected at startup and prevents the node from starting. We will immediately start working on a bug fix for this.
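
To make the failure mode concrete, here is a simplified, hypothetical sketch of the check that produces the error in the logs above; the names and structure are illustrative only, not the actual MetaStateService code. The node reads its manifest, then tries to load the metadata of every index the manifest references, and refuses to start if any of it is missing:

import java.io.IOException;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified sketch of the startup consistency check described above;
// names and types are illustrative, not the actual Elasticsearch implementation.
class MetaStateLoadSketch {

    // The manifest lists every index whose metadata the node is expected to have
    // persisted; startup walks that list and fails if any entry is missing on disk.
    static void loadFullState(Set<String> indicesInManifest, Map<String, Object> indexMetadataOnDisk)
            throws IOException {
        for (String index : indicesInManifest) {
            Object metadata = indexMetadataOnDisk.get(index); // stand-in for reading the index state file
            if (metadata == null) {
                // the failure path seen in the logs above
                throw new IOException("failed to find metadata for existing index " + index);
            }
        }
        // otherwise startup proceeds with the loaded metadata
    }
}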

In the meantime, the following workaround can be used to get the node running again. It should not lead to any data loss. However, great care must be taken before applying it; preferably back up the data folder on the node before undertaking the following low-level surgery:

  • The first step is to make sure that the node is indeed a data-only node. Doing the following on a master-eligible node can and will put the whole cluster at risk of data loss.
  • For every path that is referenced in path.data, remove the file named manifest-N.st (where N is some number), which can be found under nodes/0/_state/ in the respective data path. After removing this file, the node should start up properly again (see the sketch after this list for an illustration).
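
For illustration only, here is a minimal sketch of the second step as a small Java program; in practice a plain file removal on the node achieves the same thing. The class name and argument handling are hypothetical, and the node's path.data entries must be passed in explicitly:

import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Illustrative helper, not an official Elasticsearch tool: removes manifest-N.st from each
// given data path, mirroring the manual workaround above. Only run this on a data-only
// node, with Elasticsearch stopped, and after backing up the data folder.
public class RemoveManifestSketch {
    public static void main(String[] args) throws IOException {
        // pass the node's path.data entries as program arguments
        for (String dataPath : args) {
            Path stateDir = Paths.get(dataPath, "nodes", "0", "_state");
            if (Files.isDirectory(stateDir) == false) {
                continue;
            }
            try (DirectoryStream<Path> manifests = Files.newDirectoryStream(stateDir, "manifest-*.st")) {
                for (Path manifest : manifests) {
                    System.out.println("removing " + manifest);
                    Files.delete(manifest);
                }
            }
        }
    }
}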

@redbaron4 (Author)

Thanks for the workaround. I'll try it the next time we face this situation. I almost removed the manifest file on impulse, but did not. I first tried to remove the offending index entries from the manifest file, which led to a consistency-check failure. So I restored the manifest file and desisted from any more tinkering with it :)

I hope the bug gets fixed soon.

ywelsch added a commit that referenced this issue Sep 30, 2019
…-only node (#47285)

Fixes a bug related to how "closed replicated indices" (introduced in 7.2) interact with the index
metadata storage mechanism, which has special handling for closed indices (but incorrectly
handles replicated closed indices). On non-master-eligible data nodes, it's possible for the
node's manifest file (which tracks the relevant metadata state that the node should persist) to
become out of sync with what's actually stored on disk, leading to an inconsistency that is then
detected at startup, refusing for the node to start up.

Closes #47276
ywelsch added a commit that referenced this issue Sep 30, 2019
…-only node (#47285)
ywelsch added a commit that referenced this issue Sep 30, 2019
…-only node (#47285)
@ywelsch ywelsch added the >bug label Sep 30, 2019