Step Metadata Update on Index Rollover Timeout #1174

Merged

@@ -56,6 +56,7 @@ abstract class Step(val name: String, val isSafeToDisableOn: Boolean = true) {
         CONDITION_NOT_MET("condition_not_met"),
         FAILED("failed"),
         COMPLETED("completed"),
+        TIMED_OUT("timed_out"),
Member

I think a new field could cause problems when the cluster is running both old and new versions of the code during an upgrade. Maybe just using the "failed" state is good enough.

Contributor

@sarthakaggarwal97 May 28, 2024

Thanks @bowenlan-amzn for taking a look. I'm not sure I fully understand this. In what scenario could adding a new step state fail? I'm asking in case we need to add a new step state in the future.

PS: I'm okay with "failed" as well for this case; the failure message would then say that the action timed out.

Member

Here's my thinking: this new step status value is saved in the metadata document of the ISM system index. When a user calls the explain API on an old node running the old software, that node cannot understand the new value, so the call will probably fail.

So the impact is small: it only matters during an upgrade, while the cluster is running a mix of old and new software.

I just found that this potential problem is already guarded against: the new software won't run if the old software still exists in the cluster.
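To make the concern concrete, here is a minimal sketch of what could go wrong on an old node. It assumes the old build resolves the stored status string with a strict lookup; `OldStepStatus`, `NewStepStatus`, `parseOld`, and `parseOldLenient` are hypothetical names for illustration and are not taken from the plugin.

```kotlin
// Hypothetical stand-ins for the status enum as seen by an old and a new build.
enum class OldStepStatus(val status: String) {
    CONDITION_NOT_MET("condition_not_met"),
    FAILED("failed"),
    COMPLETED("completed"),
}

enum class NewStepStatus(val status: String) {
    CONDITION_NOT_MET("condition_not_met"),
    FAILED("failed"),
    COMPLETED("completed"),
    TIMED_OUT("timed_out"),
}

// Assumed strict lookup: an unknown stored value raises an exception.
fun parseOld(stored: String): OldStepStatus =
    OldStepStatus.values().firstOrNull { it.status == stored }
        ?: throw IllegalArgumentException("Unknown step status: $stored")

// One possible lenient alternative: map unknown values onto FAILED.
fun parseOldLenient(stored: String): OldStepStatus =
    OldStepStatus.values().firstOrNull { it.status == stored } ?: OldStepStatus.FAILED
```

Under these assumptions, `parseOld(NewStepStatus.TIMED_OUT.status)` throws on the old build, while `parseOldLenient` degrades to `FAILED`. Whether the real parser is strict or lenient is not shown in this thread.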

if (skipExecFlag.flag) {
    logger.info("Cluster still has nodes running old version ISM plugin, skip execution on new nodes until all nodes upgraded")
    return@launch
}

So no problem now 😅
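As a rough illustration of this kind of guard, the sketch below sets a flag whenever the nodes in a cluster report more than one plugin version, and skips the job while the flag is set. Only the `flag` property name comes from the snippet above; `SkipExecutionGuard`, `refresh`, `runJob`, and the version map are assumptions made for the example, and the real plugin may detect mixed versions differently.

```kotlin
// Hypothetical guard: everything except the `flag` name is an assumption.
class SkipExecutionGuard {
    @Volatile
    var flag: Boolean = false
        private set

    // Recompute the flag from a node-id -> plugin-version map,
    // for example whenever cluster membership changes.
    fun refresh(pluginVersionsByNode: Map<String, String>) {
        flag = pluginVersionsByNode.values.toSet().size > 1
    }
}

// Runs the job only when the cluster is on a single plugin version.
// Returns true if the job ran, false if it was skipped.
fun runJob(guard: SkipExecutionGuard, job: () -> Unit): Boolean {
    if (guard.flag) return false
    job()
    return true
}
```

With a guard like this, metadata containing a new status value would not be written until every node understands it, which is why the mixed-version concern does not apply here.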

Contributor

this is awesome @bowenlan-amzn
thanks for checking

;

override fun toString(): String {
@@ -91,6 +91,7 @@ import org.opensearch.indexmanagement.spi.indexstatemanagement.model.ManagedInde
 import org.opensearch.indexmanagement.spi.indexstatemanagement.model.PolicyRetryInfoMetaData
 import org.opensearch.indexmanagement.spi.indexstatemanagement.model.StateMetaData
 import org.opensearch.indexmanagement.spi.indexstatemanagement.model.StepContext
+import org.opensearch.indexmanagement.spi.indexstatemanagement.model.StepMetaData
 import org.opensearch.jobscheduler.spi.JobExecutionContext
 import org.opensearch.jobscheduler.spi.LockModel
 import org.opensearch.jobscheduler.spi.ScheduledJobParameter
@@ -330,14 +331,18 @@ object ManagedIndexRunner :
 if (action?.hasTimedOut(currentActionMetaData) == true) {
     val info = mapOf("message" to "Action timed out")
     logger.error("Action=${action.type} has timed out")
-    val updated =
-        updateManagedIndexMetaData(
-            managedIndexMetaData
-                .copy(actionMetaData = currentActionMetaData?.copy(failed = true), info = info),
-        )
+
+    val updatedIndexMetaData = managedIndexMetaData.copy(
+        actionMetaData = currentActionMetaData?.copy(failed = true),
+        stepMetaData = step?.let { StepMetaData(it.name, System.currentTimeMillis(), Step.StepStatus.TIMED_OUT) },
+        info = info,
+    )
+
+    val updated = updateManagedIndexMetaData(updatedIndexMetaData)
+
     if (updated.metadataSaved) {
         disableManagedIndexConfig(managedIndexConfig)
-        publishErrorNotification(policy, managedIndexMetaData)
+        publishErrorNotification(policy, updatedIndexMetaData)
     }
     return
 }
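The shape of this change can be tried in isolation with a small, self-contained sketch. The types below (`SketchStepStatus`, `StepMeta`, `ActionMeta`, `IndexMeta`) and the `markTimedOut` function are simplified, hypothetical stand-ins for the plugin's metadata classes; only the copy-then-reuse pattern mirrors the hunk above.

```kotlin
// Simplified, hypothetical models; the real metadata classes carry more fields.
enum class SketchStepStatus { FAILED, COMPLETED, TIMED_OUT }

data class StepMeta(val name: String, val startTime: Long, val status: SketchStepStatus)
data class ActionMeta(val name: String, val failed: Boolean)
data class IndexMeta(val action: ActionMeta?, val step: StepMeta?, val info: Map<String, Any>?)

// Builds one updated copy that marks the action as failed and the step as timed out.
// The same copy can then be persisted and passed to the error notification.
fun markTimedOut(current: IndexMeta, stepName: String?, now: Long): IndexMeta =
    current.copy(
        action = current.action?.copy(failed = true),
        step = stepName?.let { StepMeta(it, now, SketchStepStatus.TIMED_OUT) },
        info = mapOf("message" to "Action timed out"),
    )
```

Building the updated copy once and reusing it means the notification sees the same failed action and timed-out step that were saved, rather than the metadata from before the timeout.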
@@ -8,8 +8,10 @@ package org.opensearch.indexmanagement.indexstatemanagement.action
 import org.opensearch.indexmanagement.indexstatemanagement.IndexStateManagementRestTestCase
 import org.opensearch.indexmanagement.indexstatemanagement.step.open.AttemptOpenStep
 import org.opensearch.indexmanagement.indexstatemanagement.step.rollover.AttemptRolloverStep
+import org.opensearch.indexmanagement.spi.indexstatemanagement.Step
 import org.opensearch.indexmanagement.spi.indexstatemanagement.model.ActionMetaData
 import org.opensearch.indexmanagement.spi.indexstatemanagement.model.ManagedIndexMetaData
+import org.opensearch.indexmanagement.spi.indexstatemanagement.model.StepMetaData
 import org.opensearch.indexmanagement.waitFor
 import java.time.Instant
 import java.util.Locale
@@ -22,9 +24,9 @@ class ActionTimeoutIT : IndexStateManagementRestTestCase() {
 val policyID = "${testIndexName}_testPolicyName_1"
 val testPolicy =
     """
-    {"policy":{"description":"Default policy","default_state":"rolloverstate","states":[
-    {"name":"rolloverstate","actions":[{"timeout":"1s","rollover":{"min_doc_count":100}}],
-    "transitions":[]}]}}
+        {"policy":{"description":"Default policy","default_state":"rolloverstate","states":[
+        {"name":"rolloverstate","actions":[{"timeout":"1s","rollover":{"min_doc_count":100}}],
+        "transitions":[]}]}}
     """.trimIndent()
 
 createPolicyJson(testPolicy, policyID)
@@ -60,11 +62,24 @@ class ActionTimeoutIT : IndexStateManagementRestTestCase() {
             fun(actionMetaDataMap: Any?): Boolean =
                 assertActionEquals(
                     ActionMetaData(
-                        name = RolloverAction.name, startTime = Instant.now().toEpochMilli(), index = 0,
-                        failed = true, consumedRetries = 0, lastRetryTime = null, actionProperties = null,
+                        name = RolloverAction.name,
+                        startTime = Instant.now().toEpochMilli(),
+                        index = 0,
+                        failed = true,
+                        consumedRetries = 0,
+                        lastRetryTime = null,
+                        actionProperties = null,
                     ),
                     actionMetaDataMap,
                 ),
+        StepMetaData.STEP to
+            fun(stepMetaDataMap: Any?): Boolean =
+                assertStepEquals(
+                    StepMetaData(
+                        "attempt_rollover", Instant.now().toEpochMilli(), Step.StepStatus.TIMED_OUT,
+                    ),
+                    stepMetaDataMap,
+                ),
     ),
 ),
 getExplainMap(indexName),
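The expected `StepMetaData` above is built with `Instant.now()`, so an exact match on the start time cannot be what the assertion relies on. The sketch below shows one way such a comparison could work, assuming it checks the name and status exactly and only requires the recorded start time to be no later than the expected one. `ExpectedStep`, `stepMatches`, and the map keys `name`, `step_status`, and `start_time` are assumptions for illustration; the real `assertStepEquals` is not shown in this diff and may behave differently.

```kotlin
// Hypothetical expected value; field names are assumptions for this sketch.
data class ExpectedStep(val name: String, val startTime: Long, val status: String)

// Compares name and status exactly and treats the start time as an upper bound.
// A missing or non-numeric start time counts as a mismatch.
fun stepMatches(expected: ExpectedStep, actual: Map<String, Any?>): Boolean =
    actual["name"] == expected.name &&
        actual["step_status"] == expected.status &&
        ((actual["start_time"] as? Number)?.toLong() ?: Long.MAX_VALUE) <= expected.startTime
```

A comparison along these lines lets the test pin down the step name and the `timed_out` status without depending on the exact millisecond at which the runner recorded the timeout.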