Better readability of failures in parallel fail fast pipelines #283

Merged
merged 1 commit into from
Jan 4, 2022
12 changes: 10 additions & 2 deletions README.md
@@ -112,7 +112,7 @@ In addition, if the backends were configured then there will be an environment v

#### Attributes

-##### Pipeline run root span
+##### Pipeline, freestyle, and matrix project build spans

| Attribute | Description | Type |
|----------------------------------|--------------|------|
@@ -138,7 +138,14 @@ In addition, if the backends were configured then there will be an environment v
| ci.pipeline.parameter.name | Name of the parameters | String[] |
| ci.pipeline.parameter.value | Value of the parameters. "Sensitive" values are redacted | String[] |

-##### Spans
+##### Pipeline step spans

+| Status Code | Status Description | Description |
+|-------------|--------------------|-------------|
+| OK | | Step or build succeeded |
+| UNSET | Machine-readable status such as `FlowInterruptedException: FailFastCause: Failed in branch failingBranch` | Step interrupted by a fail-fast parallel pipeline interruption, by a newer build superseding the pipeline build, or by a user cancelling the pipeline build. The span status is set to `UNSET` rather than `ERROR` for readability |
+| ERROR | Machine-readable status such as `FlowInterruptedException: ExceededTimeout: Timeout has been exceeded` | Any other cause of step failure |
+
+
| Attribute | Description | Type |
|----------------------------------|--------------|------|
@@ -148,6 +155,7 @@ In addition, if the backends were configured then there will be an environment v
| jenkins.pipeline.step.plugin.name | Jenkins plugin for that particular step | String |
| jenkins.pipeline.step.plugin.version| Jenkins plugin version | String |
| jenkins.pipeline.step.agent.label | Labels attached to the agent | String |
+| jenkins.pipeline.step.interruption.causes | List of machine-readable causes of the interruption of the step, such as `FailFastCause: Failed in branch failingBranch`. <p/>Common causes of interruption: `CanceledCause: Superseded by my-pipeline#123`, `ExceededTimeout: Timeout has been exceeded`, `FailFastCause: Failed in branch the-failing-branch`, `UserInterruption: Aborted by a-user` | String[] |
| git.branch | Git branch name | String |
| git.repository | Git repository | String |
| git.username | Git user | String |
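The status description format documented in the README hunks above can be sketched in plain Java. This is only an illustration under stated assumptions, not the plugin's code: `StatusDescriptionSketch` and its nested `Cause` class are hypothetical stand-ins for `FlowInterruptedException` and `CauseOfInterruption`, so that the example runs without Jenkins on the classpath.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch: none of these types belong to the plugin or to Jenkins.
final class StatusDescriptionSketch {

    /** Stand-in for jenkins.model.CauseOfInterruption (assumption, for illustration only). */
    static final class Cause {
        final String simpleName;       // e.g. "FailFastCause"
        final String shortDescription; // e.g. "Failed in branch failingBranch"

        Cause(String simpleName, String shortDescription) {
            this.simpleName = simpleName;
            this.shortDescription = shortDescription;
        }
    }

    /** Builds one "<SimpleName>: <short description>" entry per cause. */
    static List<String> describeCauses(List<Cause> causes) {
        return causes.stream()
            .map(cause -> cause.simpleName + ": " + cause.shortDescription)
            .collect(Collectors.toList());
    }

    /** Prefixes the comma-joined cause descriptions with the exception's simple name. */
    static String statusDescription(String exceptionSimpleName, List<String> causeDescriptions) {
        return exceptionSimpleName + ": " + String.join(", ", causeDescriptions);
    }

    public static void main(String[] args) {
        List<String> descriptions = describeCauses(Arrays.asList(
            new Cause("FailFastCause", "Failed in branch failingBranch")));
        System.out.println(descriptions);
        System.out.println(statusDescription("FlowInterruptedException", descriptions));
    }
}
```

The list returned by `describeCauses` corresponds to the `jenkins.pipeline.step.interruption.causes` attribute, and the string returned by `statusDescription` corresponds to the span status description.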
@@ -18,6 +18,7 @@
import io.jenkins.plugins.opentelemetry.semconv.JenkinsOtelSemanticAttributes;
import io.jenkins.plugins.opentelemetry.semconv.OTelEnvironmentVariablesConventions;
import io.opentelemetry.sdk.resources.Resource;
+import jenkins.model.CauseOfInterruption;
import jenkins.model.GlobalConfiguration;
import jenkins.model.Jenkins;
import net.sf.json.JSONObject;
@@ -29,6 +30,7 @@
import org.jenkinsci.plugins.workflow.cps.nodes.StepStartNode;
import org.jenkinsci.plugins.workflow.graph.FlowNode;
import org.jenkinsci.plugins.workflow.steps.CoreStep;
+import org.jenkinsci.plugins.workflow.support.steps.StageStepExecution;
import org.kohsuke.stapler.DataBoundConstructor;
import org.kohsuke.stapler.DataBoundSetter;
import org.kohsuke.stapler.QueryParameter;
@@ -42,15 +44,7 @@
import javax.inject.Inject;
import java.io.IOException;
import java.io.StringReader;
-import java.util.ArrayList;
-import java.util.Collections;
-import java.util.HashMap;
-import java.util.List;
-import java.util.Map;
-import java.util.Objects;
-import java.util.Optional;
-import java.util.Properties;
-import java.util.Set;
+import java.util.*;
Review comment (Member):

NIT: list all the imports explicitly rather than using the `java.util.*` wildcard.

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.logging.Level;
@@ -95,6 +89,19 @@ public class JenkinsOpenTelemetryPluginConfiguration extends GlobalConfiguration

     private String serviceNamespace;

+    /**
+     * Interruption causes for which the span status is set to {@code UNSET} instead of {@code ERROR}, because the step was interrupted externally rather than failing by itself.
+     *
+     * TODO make this list configurable and accessible through {@link io.opentelemetry.sdk.autoconfigure.spi.ConfigProperties#getList(String)}
+     * @see CauseOfInterruption
+     * @see org.jenkinsci.plugins.workflow.steps.FlowInterruptedException
+     */
+    private List<String> statusUnsetCausesOfInterruption = Arrays.asList(
+        "org.jenkinsci.plugins.workflow.cps.steps.ParallelStep$FailFastCause",
+        StageStepExecution.CanceledCause.class.getName(),
+        CauseOfInterruption.UserInterruption.class.getName()
+    );
+
     /**
      * The previously used configuration. Kept in memory to prevent unneeded reconfigurations.
      */
@@ -242,6 +249,10 @@ public void setIgnoredSteps(String ignoredSteps) {
         this.ignoredSteps = ignoredSteps;
     }

+    public List<String> getStatusUnsetCausesOfInterruption() {
+        return statusUnsetCausesOfInterruption;
+    }
+
     public String getDisabledResourceProviders() {
         return disabledResourceProviders;
     }
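The TODO in the Javadoc above proposes making the list of causes configurable. One possible shape is sketched below, assuming the setting arrives as a single comma-separated string. The class `UnsetCausesConfigSketch` and the idea of falling back to the built-in defaults when the setting is blank are assumptions for illustration; the sketch does not use `ConfigProperties` and is not part of the PR.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch of a configurable list of "status UNSET" interruption causes.
final class UnsetCausesConfigSketch {

    /** Defaults taken from the class names listed in the diff above. */
    static final List<String> DEFAULT_CAUSES = Collections.unmodifiableList(Arrays.asList(
        "org.jenkinsci.plugins.workflow.cps.steps.ParallelStep$FailFastCause",
        "org.jenkinsci.plugins.workflow.support.steps.StageStepExecution$CanceledCause",
        "jenkins.model.CauseOfInterruption$UserInterruption"));

    /**
     * Parses a comma-separated list of class names.
     * Blank entries are dropped; a null or blank setting yields the defaults.
     */
    static List<String> parse(String commaSeparatedClassNames) {
        if (commaSeparatedClassNames == null) {
            return DEFAULT_CAUSES;
        }
        List<String> parsed = Arrays.stream(commaSeparatedClassNames.split(","))
            .map(String::trim)
            .filter(name -> !name.isEmpty())
            .collect(Collectors.toList());
        return parsed.isEmpty() ? DEFAULT_CAUSES : parsed;
    }

    public static void main(String[] args) {
        System.out.println(parse(null));
        System.out.println(parse(" com.example.CauseA , com.example.CauseB "));
    }
}
```

Keeping the defaults in one immutable list means a missing or empty setting leaves the behaviour introduced by this PR unchanged.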
@@ -30,6 +30,7 @@
import io.opentelemetry.context.Context;
import io.opentelemetry.context.Scope;
import io.opentelemetry.semconv.resource.attributes.ResourceAttributes;
+import jenkins.model.CauseOfInterruption;
import org.apache.commons.compress.utils.Sets;
import org.jenkinsci.plugins.structs.SymbolLookup;
import org.jenkinsci.plugins.structs.describable.UninstantiatedDescribable;
@@ -42,6 +43,7 @@
import org.jenkinsci.plugins.workflow.graph.FlowNode;
import org.jenkinsci.plugins.workflow.job.WorkflowRun;
import org.jenkinsci.plugins.workflow.steps.CoreStep;
+import org.jenkinsci.plugins.workflow.steps.FlowInterruptedException;
import org.jenkinsci.plugins.workflow.steps.Step;
import org.jenkinsci.plugins.workflow.steps.StepContext;
import org.jenkinsci.plugins.workflow.steps.StepDescriptor;
@@ -54,13 +56,15 @@
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
+import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.function.Supplier;
import java.util.logging.Level;
import java.util.logging.Logger;
+import java.util.stream.Collectors;

import static com.google.common.base.Verify.verifyNotNull;

@@ -74,9 +78,16 @@ public class MonitoringPipelineListener extends AbstractPipelineListener impleme
     private Set<String> ignoredSteps;
     private List<StepHandler> stepHandlers;

+    /**
+     * Class names of the interruption causes for which the span status is set to {@code UNSET} instead of {@code ERROR}.
+     */
+    Set<String> statusUnsetCausesOfInterruption;
+
     @PostConstruct
     public void postConstruct() {
-        this.ignoredSteps = Sets.newHashSet(JenkinsOpenTelemetryPluginConfiguration.get().getIgnoredSteps().split(","));
+        final JenkinsOpenTelemetryPluginConfiguration jenkinsOpenTelemetryPluginConfiguration = JenkinsOpenTelemetryPluginConfiguration.get();
+        this.ignoredSteps = Sets.newHashSet(jenkinsOpenTelemetryPluginConfiguration.getIgnoredSteps().split(","));
+        this.statusUnsetCausesOfInterruption = new HashSet<>(jenkinsOpenTelemetryPluginConfiguration.getStatusUnsetCausesOfInterruption());
     }

     @Override
@@ -292,8 +303,32 @@ private void endCurrentSpan(FlowNode node, WorkflowRun run) {
                 span.setStatus(StatusCode.OK);
             } else {
                 Throwable throwable = errorAction.getError();
-                span.recordException(throwable);
-                span.setStatus(StatusCode.ERROR, throwable.getMessage());
+                if (throwable instanceof FlowInterruptedException) {
+                    FlowInterruptedException interruptedException = (FlowInterruptedException) throwable;
+                    List<CauseOfInterruption> causesOfInterruption = interruptedException.getCauses();
+
+                    List<String> causeDescriptions = causesOfInterruption.stream().map(cause -> cause.getClass().getSimpleName() + ": " + cause.getShortDescription()).collect(Collectors.toList());
+                    span.setAttribute(JenkinsOtelSemanticAttributes.JENKINS_STEP_INTERRUPTION_CAUSES, causeDescriptions);
+
+                    String statusDescription = throwable.getClass().getSimpleName() + ": " + String.join(", ", causeDescriptions);
+
+                    boolean suppressSpanStatusCodeError = false;
+                    for (CauseOfInterruption causeOfInterruption : causesOfInterruption) {
+                        if (statusUnsetCausesOfInterruption.contains(causeOfInterruption.getClass().getName())) {
+                            suppressSpanStatusCodeError = true;
+                            break;
+                        }
+                    }
+                    if (suppressSpanStatusCodeError) {
+                        span.setStatus(StatusCode.UNSET, statusDescription);
+                    } else {
+                        span.recordException(throwable);
+                        span.setStatus(StatusCode.ERROR, statusDescription);
+                    }
+                } else {
+                    span.recordException(throwable);
+                    span.setStatus(StatusCode.ERROR, throwable.getMessage());
+                }
             }
             span.end();
             LOGGER.log(Level.FINE, () -> run.getFullDisplayName() + " - < " + node.getDisplayFunctionName() + " - end " + OtelUtils.toDebugString(span));
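The decision made in `endCurrentSpan` above (one matching cause is enough to report `UNSET` instead of `ERROR`) can be isolated into a small, dependency-free sketch. The names below are hypothetical: `InterruptionStatusSketch` and its `Status` enum only mirror the OpenTelemetry `StatusCode` names, causes are represented by their class names as strings, and `com.example.SomeOtherCause` is a made-up cause that is not in the list.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the UNSET-versus-ERROR decision for interrupted steps.
final class InterruptionStatusSketch {

    /** Mirrors the names of io.opentelemetry.api.trace.StatusCode (assumption, for illustration). */
    enum Status { OK, UNSET, ERROR }

    /** Class names copied from the list added to the plugin configuration in this PR. */
    static final Set<String> STATUS_UNSET_CAUSES = new HashSet<>(Arrays.asList(
        "org.jenkinsci.plugins.workflow.cps.steps.ParallelStep$FailFastCause",
        "org.jenkinsci.plugins.workflow.support.steps.StageStepExecution$CanceledCause",
        "jenkins.model.CauseOfInterruption$UserInterruption"));

    /** Returns UNSET if at least one cause is in the list, ERROR otherwise. */
    static Status statusFor(List<String> causeClassNames) {
        boolean suppressError = causeClassNames.stream().anyMatch(STATUS_UNSET_CAUSES::contains);
        return suppressError ? Status.UNSET : Status.ERROR;
    }

    public static void main(String[] args) {
        System.out.println(statusFor(Collections.singletonList(
            "org.jenkinsci.plugins.workflow.cps.steps.ParallelStep$FailFastCause")));
        System.out.println(statusFor(Collections.singletonList("com.example.SomeOtherCause")));
    }
}
```

Note that an empty list of causes yields `ERROR` here, which matches the loop in the diff: with no cause to match, the error status is not suppressed.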
@@ -327,13 +327,13 @@ public void _onFinalized(@NonNull Run run) {
             parentSpan.setAttribute(JenkinsOtelSemanticAttributes.CI_PIPELINE_RUN_RESULT, Objects.toString(runResult, null));

             if (Result.SUCCESS.equals(runResult)) {
-                parentSpan.setStatus(StatusCode.OK);
+                parentSpan.setStatus(StatusCode.OK, runResult.toString());
             } else if (Result.FAILURE.equals(runResult) || Result.UNSTABLE.equals(runResult)){
                 parentSpan.setAttribute(SemanticAttributes.EXCEPTION_TYPE, "PIPELINE_" + runResult);
                 parentSpan.setAttribute(SemanticAttributes.EXCEPTION_MESSAGE, "PIPELINE_" + runResult);
-                parentSpan.setStatus(StatusCode.ERROR);
+                parentSpan.setStatus(StatusCode.ERROR, runResult.toString());
             } else if (Result.ABORTED.equals(runResult) || Result.NOT_BUILT.equals(runResult)) {
-                parentSpan.setStatus(StatusCode.UNSET);
+                parentSpan.setStatus(StatusCode.UNSET, runResult.toString());
             }
         }
         // NODE
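The mapping from build result to span status in `_onFinalized` above can be summarised as a lookup. The sketch below is hypothetical: `RunResultStatusSketch` defines its own `Result` and `Status` enums in place of `hudson.model.Result` and the OpenTelemetry `StatusCode`, and it returns `UNSET` for a missing result on the assumption that a span whose status is never set stays at its default.

```java
// Hypothetical sketch of the build result to span status mapping.
final class RunResultStatusSketch {

    /** Stand-in for hudson.model.Result (assumption, for illustration). */
    enum Result { SUCCESS, UNSTABLE, FAILURE, NOT_BUILT, ABORTED }

    /** Mirrors the names of io.opentelemetry.api.trace.StatusCode. */
    enum Status { OK, UNSET, ERROR }

    static Status statusFor(Result runResult) {
        if (runResult == null) {
            // The diff sets no status in this case; the sketch reports the default.
            return Status.UNSET;
        }
        switch (runResult) {
            case SUCCESS:
                return Status.OK;
            case FAILURE:
            case UNSTABLE:
                return Status.ERROR;
            case ABORTED:
            case NOT_BUILT:
            default:
                return Status.UNSET;
        }
    }

    public static void main(String[] args) {
        for (Result result : Result.values()) {
            System.out.println(result + " -> " + statusFor(result));
        }
    }
}
```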
@@ -85,6 +85,8 @@ public final class JenkinsOtelSemanticAttributes {

     public static final AttributeKey<String> JENKINS_STEP_AGENT_LABEL = AttributeKey.stringKey("jenkins.pipeline.step.agent.label");

+    public static final AttributeKey<List<String>> JENKINS_STEP_INTERRUPTION_CAUSES = AttributeKey.stringArrayKey("jenkins.pipeline.step.interruption.causes");
+
     public static final String JENKINS = "jenkins";

     /**
@@ -14,11 +14,14 @@
import io.jenkins.plugins.opentelemetry.semconv.JenkinsOtelSemanticAttributes;
import io.jenkins.plugins.opentelemetry.semconv.JenkinsSemanticMetrics;
import io.opentelemetry.api.common.Attributes;
+import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.sdk.metrics.data.LongPointData;
import io.opentelemetry.sdk.metrics.data.MetricData;
import io.opentelemetry.sdk.metrics.data.MetricDataType;
import io.opentelemetry.sdk.testing.exporter.InMemoryMetricExporterProvider;
import io.opentelemetry.sdk.testing.exporter.InMemoryMetricExporterUtils;
+import io.opentelemetry.sdk.trace.data.SpanData;
+import io.opentelemetry.sdk.trace.data.StatusData;
import org.apache.commons.lang3.SystemUtils;
import org.hamcrest.CoreMatchers;
import org.hamcrest.MatcherAssert;
@@ -29,6 +32,7 @@
import org.junit.Test;
import org.jvnet.hudson.test.recipes.WithPlugin;

+import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.Map;
@@ -462,4 +466,39 @@ public void testPipelineWithoutCheckoutShallowSteps() throws Exception {
         MatcherAssert.assertThat(attributes.get(JenkinsOtelSemanticAttributes.GIT_CLONE_SHALLOW), CoreMatchers.is(false));
         MatcherAssert.assertThat(attributes.get(JenkinsOtelSemanticAttributes.GIT_CLONE_DEPTH), CoreMatchers.is(0L));
     }
+
+    @Test
+    public void testFailFastParallelScriptedPipelineWithException() throws Exception {
+        assumeFalse(SystemUtils.IS_OS_WINDOWS);
+        String jobName = "fail-fast-parallel-scripted-pipeline-with-failure" + jobNameSuffix.incrementAndGet();
+
+        String pipelineScript = "node() {\n" +
+            " stage('ze-parallel-stage') {\n" +
+            " parallel failingBranch: {\n" +
+            " error 'the failure that will cause the interruption of other branches'\n" +
+            " }, branchThatWillBeInterrupted: {\n" +
+            " sleep 5\n" +
+            " }, failFast:true\n" +
+            " }\n" +
+            "}";
+        Node agent = jenkinsRule.createOnlineSlave();
+        WorkflowJob pipeline = jenkinsRule.createProject(WorkflowJob.class, jobName);
+        pipeline.setDefinition(new CpsFlowDefinition(pipelineScript, true));
+        WorkflowRun build = jenkinsRule.assertBuildStatus(Result.FAILURE, pipeline.scheduleBuild2(0));
+
+        Tree<SpanDataWrapper> spans = getGeneratedSpans();
+        checkChainOfSpans(spans, "sleep", "Parallel branch: branchThatWillBeInterrupted", "Stage: ze-parallel-stage", JenkinsOtelSemanticAttributes.AGENT_UI, "Phase: Run");
+
+        SpanData sleepSpanData = spans.breadthFirstSearchNodes(node -> "sleep".equals(node.getData().spanData.getName())).get().getData().spanData;
+        MatcherAssert.assertThat(sleepSpanData.getStatus().getStatusCode(), CoreMatchers.is(StatusCode.UNSET));
+
+        SpanData branchThatWillBeInterruptedSpanData = spans.breadthFirstSearchNodes(node -> "Parallel branch: branchThatWillBeInterrupted".equals(node.getData().spanData.getName())).get().getData().spanData;
+        MatcherAssert.assertThat(branchThatWillBeInterruptedSpanData.getStatus().getStatusCode(), CoreMatchers.is(StatusCode.UNSET));
+        MatcherAssert.assertThat(branchThatWillBeInterruptedSpanData.getStatus().getDescription(), CoreMatchers.is("FlowInterruptedException: FailFastCause: Failed in branch failingBranch"));
+        MatcherAssert.assertThat(branchThatWillBeInterruptedSpanData.getAttributes().get(JenkinsOtelSemanticAttributes.JENKINS_STEP_INTERRUPTION_CAUSES), CoreMatchers.is(Arrays.asList("FailFastCause: Failed in branch failingBranch")));
+
+        SpanData failingBranchSpanData = spans.breadthFirstSearchNodes(node -> "Parallel branch: failingBranch".equals(node.getData().spanData.getName())).get().getData().spanData;
+        MatcherAssert.assertThat(failingBranchSpanData.getStatus().getStatusCode(), CoreMatchers.is(StatusCode.ERROR));
+        MatcherAssert.assertThat(failingBranchSpanData.getStatus().getDescription(), CoreMatchers.is("the failure that will cause the interruption of other branches"));
+    }
}
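The test above locates spans by name with a breadth-first search over the tree of generated spans. The sketch below shows what such a lookup could look like in isolation. It is a hypothetical illustration: `SpanTreeSearchSketch` and its `Node` class are not the `Tree` and `SpanDataWrapper` types used by the test suite, and only the span names from the test are reused.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

// Hypothetical sketch of a breadth-first lookup in a tree of named spans.
final class SpanTreeSearchSketch {

    /** Minimal tree node holding a span name and its child nodes. */
    static final class Node {
        final String name;
        final List<Node> children = new ArrayList<>();

        Node(String name) {
            this.name = name;
        }

        /** Adds a child and returns this node, so that trees can be built fluently. */
        Node add(Node child) {
            children.add(child);
            return this;
        }
    }

    /** Returns the first node, in breadth-first order, that matches the predicate. */
    static Optional<Node> breadthFirstSearch(Node root, Predicate<Node> predicate) {
        Deque<Node> queue = new ArrayDeque<>();
        queue.add(root);
        while (!queue.isEmpty()) {
            Node current = queue.remove();
            if (predicate.test(current)) {
                return Optional.of(current);
            }
            queue.addAll(current.children);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Node root = new Node("Stage: ze-parallel-stage")
            .add(new Node("Parallel branch: failingBranch"))
            .add(new Node("Parallel branch: branchThatWillBeInterrupted").add(new Node("sleep")));
        System.out.println(breadthFirstSearch(root, node -> "sleep".equals(node.name)).isPresent());
    }
}
```

Returning an `Optional` keeps the "span not found" case explicit; the test above calls `.get()` directly, which fails the test with an exception if the expected span is missing.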