
[BUG] Close refresh listeners during primary relocation of remote enabled indexes #11320

Closed
ashking94 opened this issue Nov 24, 2023 · 0 comments · Fixed by #11330
Labels: bug, Storage:Durability, Storage:Remote, v2.12.0


Describe the bug
During peer recovery for a relocating primary, we close the Closeable internal refresh listeners to drain all in-flight refreshes before the primary hand-off. However, a code bug prevents the listeners from actually being closed:

// Ensures all in-flight remote store operations drain, before we perform the handoff.
internalRefreshListener.stream()
    .filter(refreshListener -> refreshListener instanceof Closeable)
    .map(refreshListener -> (Closeable) refreshListener)
    .close();
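
Here Stream.close() closes the Stream itself (java.util.stream.Stream is AutoCloseable), not the Closeable elements inside it, so the listeners are never closed and the in-flight refreshes are never drained. A minimal sketch of the intended behavior (the element type ReferenceManager.RefreshListener and the exception handling are assumptions; the actual fix landed in #11330):

// Sketch only: close each Closeable listener rather than the Stream.
for (ReferenceManager.RefreshListener listener : internalRefreshListener) {
    if (listener instanceof Closeable) {
        // Closing the listener blocks until its ongoing refreshes drain.
        // close() may throw IOException, which the enclosing method must
        // handle or propagate.
        ((Closeable) listener).close();
    }
}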

The bug was surfaced by the following check, which runs during shard creation:

public static void verifyNoMultipleWriters(List<String> mdFiles, Function<String, Tuple<String, String>> fn) {
    // fn maps a metadata file name to (primary term + generation, node id).
    Map<String, String> nodesByPrimaryTermAndGen = new HashMap<>();
    mdFiles.forEach(mdFile -> {
        Tuple<String, String> nodeIdByPrimaryTermAndGen = fn.apply(mdFile);
        if (nodeIdByPrimaryTermAndGen != null) {
            if (nodesByPrimaryTermAndGen.containsKey(nodeIdByPrimaryTermAndGen.v1())
                && (!nodesByPrimaryTermAndGen.get(nodeIdByPrimaryTermAndGen.v1()).equals(nodeIdByPrimaryTermAndGen.v2()))) {
                // Two different node ids wrote metadata for the same primary term and generation.
                throw new IllegalStateException(
                    "Multiple metadata files from different nodes "
                        + "having same primary term and generation "
                        + nodeIdByPrimaryTermAndGen.v1()
                        + " detected"
                );
            }
            nodesByPrimaryTermAndGen.put(nodeIdByPrimaryTermAndGen.v1(), nodeIdByPrimaryTermAndGen.v2());
        }
    });
}
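
For illustration, a hypothetical invocation that trips the check. The file-name format "<primaryTerm>_<generation>_<nodeId>" and the parsing lambda are made up for this sketch, and Tuple is assumed to be OpenSearch's org.opensearch.common.collect.Tuple:

List<String> mdFiles = List.of("5_12_nodeA", "5_12_nodeB");
verifyNoMultipleWriters(mdFiles, name -> {
    String[] parts = name.split("_");
    // v1 = primary term + generation, v2 = node id
    return Tuple.tuple(parts[0] + "_" + parts[1], parts[2]);
});
// Throws IllegalStateException: nodeA and nodeB both wrote metadata for
// primary term 5, generation 12, which a clean hand-off should prevent.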

To Reproduce
Simulate conditions under which a retry is in flight in RemoteStoreRefreshListener during relocation. Because the listener is never closed, the retried upload can succeed after the new primary has already started uploading segment and translog files.

Expected behavior
The old primary should not upload any segment or translog files once the hand-off has completed.

