
Trace and debug logging is extremely expensive (x32 slower) for large resource diffs #721

Closed
lawrencejones opened this issue Apr 11, 2023 · 7 comments

@lawrencejones

Hey team 👋

Was unsure if this would come under bug or enhancement, but given that this is preventing my provider from creating a plan, I went for bug.

tl;dr: we have a resource that accepts a large and deeply nested object. Trace and debug logs produced by the SDK around setting defaults, state modifications, etc. are so verbose that planning becomes much more expensive (16s -> 527s, so ~32x).

Having profiled the provider as it produces the plan, I can see the majority of the cost comes from the context-based logger. Logging this much is generally expensive, but particularly so with reflection-based runtime loggers that call context.Value frequently, as that time adds up quickly.
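
To illustrate the failure mode (this is a standalone sketch, not provider code): context.Value walks the chain of wrapped contexts and compares keys one by one, so lookups against a deeply nested context, repeated on every log call, dominate quickly. The runtime.efaceeq and runtime.strequal entries in the profile below are those key comparisons.

```go
// Standalone sketch: context.Value cost grows with context nesting depth.
package main

import (
	"context"
	"fmt"
	"time"
)

type ctxKey string

func main() {
	ctx := context.Background()
	// Simulate a deeply wrapped request context, as a contextual logger
	// builds one up over the course of a request.
	for i := 0; i < 50; i++ {
		ctx = context.WithValue(ctx, ctxKey(fmt.Sprintf("key-%d", i)), i)
	}

	start := time.Now()
	for i := 0; i < 1_000_000; i++ {
		// Worst case: the first key added is found last, after walking
		// and comparing keys through all 50 wrapper contexts.
		_ = ctx.Value(ctxKey("key-0"))
	}
	fmt.Println("1e6 lookups took", time.Since(start))
}
```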

Attaching the profile below, where you can see that of the 28.52s sample, we spent 22.38s in context.(*valueCtx).Value as part of the logging library, and most of the rest in logging-adjacent functions.

(pprof) top 100
Showing nodes accounting for 28.08s, 98.46% of 28.52s total
Dropped 105 nodes (cum <= 0.14s)
      flat  flat%   sum%        cum   cum%
    15.34s 53.79% 53.79%     19.21s 67.36%  runtime.efaceeq
     3.76s 13.18% 66.97%      3.77s 13.22%  runtime.strequal
     3.49s 12.24% 79.21%      3.49s 12.24%  syscall.syscall
     3.16s 11.08% 90.29%     22.38s 78.47%  context.value
     1.50s  5.26% 95.55%      1.50s  5.26%  runtime.madvise
     0.66s  2.31% 97.86%      0.66s  2.31%  runtime.memclrNoHeapPointers
     0.16s  0.56% 98.42%      0.20s   0.7%  runtime.scanobject
     0.01s 0.035% 98.46%      0.76s  2.66%  runtime.mallocgc
         0     0% 98.46%     22.38s 78.47%  context.(*valueCtx).Value
         0     0% 98.46%      4.21s 14.76%  github.com/hashicorp/go-hclog.(*intLogger).Trace
         0     0% 98.46%      4.29s 15.04%  github.com/hashicorp/go-hclog.(*intLogger).log
         0     0% 98.46%      0.80s  2.81%  github.com/hashicorp/go-hclog.(*intLogger).logJSON
         0     0% 98.46%      3.49s 12.24%  github.com/hashicorp/go-hclog.(*writer).Flush
         0     0% 98.46%      0.65s  2.28%  github.com/hashicorp/go-hclog.intLogger.jsonMapEntry
         0     0% 98.46%     26.76s 93.83%  github.com/hashicorp/terraform-plugin-framework/internal/fwserver.(*Server).PlanResourceChange
         0     0% 98.46%     26.76s 93.83%  github.com/hashicorp/terraform-plugin-framework/internal/fwserver.MarkComputedNilsAsUnknown.func1
         0     0% 98.46%      0.67s  2.35%  github.com/hashicorp/terraform-plugin-framework/internal/logging.FrameworkDebug (inline)
         0     0% 98.46%     26.07s 91.41%  github.com/hashicorp/terraform-plugin-framework/internal/logging.FrameworkTrace (inline)
         0     0% 98.46%     26.76s 93.83%  github.com/hashicorp/terraform-plugin-framework/internal/proto6server.(*Server).PlanResourceChange
         0     0% 98.46%     26.76s 93.83%  github.com/hashicorp/terraform-plugin-go/tfprotov6/internal/tfplugin6._Provider_PlanResourceChange_Handler
         0     0% 98.46%     26.76s 93.83%  github.com/hashicorp/terraform-plugin-go/tfprotov6/tf6server.(*server).PlanResourceChange
         0     0% 98.46%     26.76s 93.83%  github.com/hashicorp/terraform-plugin-go/tftypes.Transform (inline)
         0     0% 98.46%     26.76s 93.83%  github.com/hashicorp/terraform-plugin-go/tftypes.transform
         0     0% 98.46%     22.39s 78.51%  github.com/hashicorp/terraform-plugin-log/internal/logging.GetSDKSubsystemLogger
         0     0% 98.46%      0.67s  2.35%  github.com/hashicorp/terraform-plugin-log/tfsdklog.SubsystemDebug
         0     0% 98.46%     26.07s 91.41%  github.com/hashicorp/terraform-plugin-log/tfsdklog.SubsystemTrace
         0     0% 98.46%     26.76s 93.83%  google.golang.org/grpc.(*Server).handleStream
         0     0% 98.46%     26.76s 93.83%  google.golang.org/grpc.(*Server).processUnaryRPC
         0     0% 98.46%     26.76s 93.83%  google.golang.org/grpc.(*Server).serveStreams.func1.2
         0     0% 98.46%      3.49s 12.24%  internal/poll.(*FD).Write
         0     0% 98.46%      3.49s 12.24%  internal/poll.ignoringEINTRIO (inline)
         0     0% 98.46%      3.49s 12.24%  os.(*File).Write
         0     0% 98.46%      3.49s 12.24%  os.(*File).write (inline)
         0     0% 98.46%      1.54s  5.40%  runtime.(*mheap).alloc.func1
         0     0% 98.46%      1.54s  5.40%  runtime.(*mheap).allocSpan
         0     0% 98.46%      0.20s   0.7%  runtime.gcBgMarkWorker
         0     0% 98.46%      0.20s   0.7%  runtime.gcBgMarkWorker.func2
         0     0% 98.46%      0.20s   0.7%  runtime.gcDrain
         0     0% 98.46%      0.62s  2.17%  runtime.mapassign_faststr
         0     0% 98.46%      0.59s  2.07%  runtime.newobject (inline)
         0     0% 98.46%      1.50s  5.26%  runtime.sysUsed (inline)
         0     0% 98.46%      1.50s  5.26%  runtime.sysUsedOS (inline)
         0     0% 98.46%      1.75s  6.14%  runtime.systemstack
         0     0% 98.46%      3.49s 12.24%  syscall.Write (inline)
         0     0% 98.46%      3.49s 12.24%  syscall.write
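
For anyone wanting to reproduce a profile like this, a CPU profile can be captured by wrapping the provider's main with runtime/pprof. This is a sketch; serveProvider is a placeholder for your provider's real entrypoint:

```go
// Sketch: capture a CPU profile for the lifetime of the provider process.
package main

import (
	"log"
	"os"
	"runtime/pprof"
)

func main() {
	f, err := os.Create("plan.pprof")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if err := pprof.StartCPUProfile(f); err != nil {
		log.Fatal(err)
	}
	defer pprof.StopCPUProfile()

	serveProvider() // stands in for the real provider serve call
}

// serveProvider is a placeholder for the provider's usual entrypoint.
func serveProvider() {}
```

Running a plan against the instrumented binary and then `go tool pprof plan.pprof` with `top 100` gives output like the above.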

Module version

github.com/hashicorp/terraform-plugin-framework v1.2.0

Definitely running the latest, as I've pulled the repo locally to try investigating this.

Relevant provider source code

PlanResourceChange (https://github.com/hashicorp/terraform-plugin-framework/blob/main/internal/fwserver/server_planresourcechange.go#L44) is where we're spending all our time, due to the sheer number of log calls made as the framework walks the planned value.
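
The log volume scales with the size of the resource data because the framework visits every value in the plan via tftypes.Transform, logging per visit. A standalone sketch of that walk, counting visits instead of logging:

```go
package main

import (
	"fmt"

	"github.com/hashicorp/terraform-plugin-go/tftypes"
)

func main() {
	val := tftypes.NewValue(
		tftypes.List{ElementType: tftypes.String},
		[]tftypes.Value{
			tftypes.NewValue(tftypes.String, "a"),
			tftypes.NewValue(tftypes.String, "b"),
		},
	)

	visits := 0
	_, err := tftypes.Transform(val, func(path *tftypes.AttributePath, v tftypes.Value) (tftypes.Value, error) {
		visits++ // the framework emits a TRACE log at each visit like this one
		return v, nil
	})
	if err != nil {
		panic(err)
	}
	fmt.Println("visited", visits, "values") // 3: two elements plus the list itself
}
```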

Applying this patch is what gives the 32x improvement in plan speed:

diff --git a/internal/logging/framework.go b/internal/logging/framework.go
index eb3e8da..f704972 100644
--- a/internal/logging/framework.go
+++ b/internal/logging/framework.go
@@ -13,6 +13,8 @@ const (
 
 // FrameworkDebug emits a framework subsystem log at DEBUG level.
 func FrameworkDebug(ctx context.Context, msg string, additionalFields ...map[string]interface{}) {
+       return
+
        tfsdklog.SubsystemDebug(ctx, SubsystemFramework, msg, additionalFields...)
 }
 
@@ -23,6 +25,8 @@ func FrameworkError(ctx context.Context, msg string, additionalFields ...map[str
 
 // FrameworkTrace emits a framework subsystem log at TRACE level.
 func FrameworkTrace(ctx context.Context, msg string, additionalFields ...map[string]interface{}) {
+       return
+
        tfsdklog.SubsystemTrace(ctx, SubsystemFramework, msg, additionalFields...)
 }

Terraform Configuration Files

I've got a big nested entries attribute on my resource that looks like this:

resource "incident_catalog_entries" "services" {
  id = incident_catalog_type.service.id

  entries = {
    for _, entry in local.catalog :
    # Map from our external ID => value for each of the attributes.
    entry["id"] => {
      name = entry["type_data"]["name"],

      attribute_values = {
        // lots of values here
      }
    }
  }
}

I can share the provider, but I think the only part that's probably relevant is the schema:

func (r *IncidentCatalogEntriesResource) Schema(ctx context.Context, req resource.SchemaRequest, resp *resource.SchemaResponse) {
	resp.Schema = schema.Schema{
		MarkdownDescription: apischema.TagDocstring("Catalog V2"),
		Attributes: map[string]schema.Attribute{
			"id": schema.StringAttribute{
				MarkdownDescription: apischema.Docstring("CatalogEntryV2ResponseBody", "catalog_type_id"),
				PlanModifiers: []planmodifier.String{
					stringplanmodifier.RequiresReplace(),
				},
				Required: true,
			},
			"entries": schema.MapNestedAttribute{
				Required: true,
				NestedObject: schema.NestedAttributeObject{
					Attributes: map[string]schema.Attribute{
						"id": schema.StringAttribute{
							MarkdownDescription: apischema.Docstring("CatalogEntryV2ResponseBody", "id"),
							Computed:            true,
							PlanModifiers: []planmodifier.String{
								stringplanmodifier.UseStateForUnknown(),
							},
						},
						"name": schema.StringAttribute{
							MarkdownDescription: apischema.Docstring("CatalogEntryV2ResponseBody", "name"),
							Required:            true,
						},
						"alias": schema.StringAttribute{
							MarkdownDescription: apischema.Docstring("CatalogEntryV2ResponseBody", "alias"),
							Optional:            true,
						},
						"rank": schema.Int64Attribute{
							MarkdownDescription: apischema.Docstring("CatalogEntryV2ResponseBody", "rank"),
							Optional:            true,
							Computed:            true,
							Default:             int64default.StaticInt64(0),
						},
						"attribute_values": schema.MapNestedAttribute{
							Required: true,
							NestedObject: schema.NestedAttributeObject{
								Attributes: map[string]schema.Attribute{
									"value": schema.StringAttribute{
										Description: `The value of this attribute, in a format suitable for this attribute type.`,
										Optional:    true,
									},
									"array_value": schema.ListAttribute{
										ElementType: types.StringType,
										Description: `The value of this element of the array, in a format suitable for this attribute type.`,
										Optional:    true,
									},
								},
							},
						},
					},
				},
			},
		},
	}
}

Debug Output

Debug and trace logs are kind of the problem here, so I'll point people at the profile output above.

Expected Behavior

My plan executes in a normal time (<30s).

Actual Behavior

My plan takes so long to execute that most people would give up (30s+; it actually only finishes ~10 minutes later).

Steps to Reproduce

Build a highly nested object with many entries and attempt a plan.

References

None that I'm aware of. I did search for logging and performance in this repo and a few others, but couldn't find anything.

@lawrencejones lawrencejones added the bug Something isn't working label Apr 11, 2023
@lawrencejones (Author)

There are a couple of possible fixes that come to mind:

  1. Set the log-level at process boot inside the internal/logging package and use the cached level to short-circuit calls into the log library (sketched below)
  2. Cache the log-level for a period of time (1s?) to help shortcut calls

I'd defer to people more familiar with the project on this, though.
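
A minimal sketch of option (1), assuming the level can be resolved once from TF_LOG_SDK_FRAMEWORK at init. The names mirror but simplify the framework's internal logging package, and the real level resolution has more fallbacks:

```go
// Simplified stand-in for the framework's internal logging package;
// the init-time level cache is the proposed addition.
package logging

import (
	"context"
	"os"
	"strings"

	"github.com/hashicorp/terraform-plugin-log/tfsdklog"
)

const SubsystemFramework = "framework"

var traceEnabled bool

func init() {
	// TF_LOG_SDK_FRAMEWORK drives the framework subsystem level; this
	// parsing is deliberately crude for the sketch.
	traceEnabled = strings.ToUpper(os.Getenv("TF_LOG_SDK_FRAMEWORK")) == "TRACE"
}

// FrameworkTrace emits a framework subsystem log at TRACE level, but
// only pays the context-logger cost when TRACE is actually active.
func FrameworkTrace(ctx context.Context, msg string, additionalFields ...map[string]interface{}) {
	if !traceEnabled {
		return
	}
	tfsdklog.SubsystemTrace(ctx, SubsystemFramework, msg, additionalFields...)
}
```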

lawrencejones added a commit to incident-io/terraform-plugin-framework that referenced this issue Apr 11, 2023
This commit addresses this issue:
hashicorp#721

When working with large terraform resource attributes, logging can
become the primary cost of running plan/apply. This is because the
context logger used by tfsdklog is very expensive, the SDK framework
itself is very verbose, and even when the log level means Debug/Trace
logs won't be emitted, we still build the log.

As we log via the internal logging package anyway, we can avoid this
performance issue by maintaining a package-local log-level which is
initialised at the same time as we init the contextual logger. Our
logging package can then avoid calling into tfsdklog (which is where we
pay the performance penalty) unless it believes the level is active,
which in most cases avoids 99% of the log cost.
@lawrencejones (Author)

This PR avoids this issue when trace/debug levels are not active (most production situations) by maintaining a log-level guard inside the framework code: #722

We're going to run with this for now, but I'm very interested in people's thoughts more generally. If I had more time, I'd probably be inclined to (1) get this merged and (2) investigate tfsdklog's performance later, as I expect something weird is going on around context nesting and key length that makes usage really expensive.

lawrencejones added a commit to incident-io/terraform-provider-incident that referenced this issue Apr 11, 2023
Introduce a resource that can manage entries in bulk, which:

1. Solves terraform performance issues when dealing with a lot of
   entries
2. Reduces pressure on our API when working with entries

To achieve this we've forked the terraform sdk framework to improve
performance when logging, which is necessary to avoid massive plan times
on large nested objects.

Relevant links are:
- PR for logging change
  (hashicorp/terraform-plugin-framework#722)
- Issue in upstream
  (hashicorp/terraform-plugin-framework#721)
bflad added a commit to hashicorp/terraform-plugin-log that referenced this issue Apr 19, 2023
Reference: hashicorp/terraform-plugin-framework#721

This change introduces log level checks outside the `context.Context` of a request to improve the performance of logging functions when logs would not be emitted at the configured level. Almost exclusively, logging levels are not expected to change during provider (and therefore SDK) runtime since they are environment variable driven. Even if they did, this level check will either immediately constrict logging to match an updated higher level or cause additional `context.Context` reads with an updated lower level, which is no different than the current behavior.

The following benchmarks were run prior to the introduction of the level checks and mutexes:

```
BenchmarkSubsystemTraceDisabled100-10            4616656               258.0 ns/op
BenchmarkSubsystemTraceEnabled100-10              936865              1138 ns/op
BenchmarkSubsystemTraceDisabled1000-10           4552549               261.1 ns/op
BenchmarkSubsystemTraceEnabled1000-10             956137              1126 ns/op
BenchmarkSubsystemTraceDisabled10000-10          4475106               261.4 ns/op
BenchmarkSubsystemTraceEnabled10000-10            943819              1121 ns/op
```

After the addition of level checks and mutexes:

```
BenchmarkSubsystemTraceDisabled100-10           86043471                13.78 ns/op
BenchmarkSubsystemTraceEnabled100-10              906649              1137 ns/op
BenchmarkSubsystemTraceDisabled1000-10          83207169                13.81 ns/op
BenchmarkSubsystemTraceEnabled1000-10             935115              1122 ns/op
BenchmarkSubsystemTraceDisabled10000-10         83775966                13.77 ns/op
BenchmarkSubsystemTraceEnabled10000-10            909601              1134 ns/op
```

This enhancement could also be considered for provider loggers; however, SDK logging tends to be more prevalent in practice, so this only targets the `tfsdklog` package handling.
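
A sketch of the pattern that commit describes, with hypothetical names (the real implementation lives in terraform-plugin-log's internal logging package): cache each subsystem's level in package state behind a mutex and check it before touching the context at all.

```go
package logging

import (
	"sync"

	"github.com/hashicorp/go-hclog"
)

var (
	subsystemLevelsMutex sync.RWMutex
	subsystemLevels      = map[string]hclog.Level{}
)

// SetSubsystemLevel records a subsystem's level once, when its logger
// is constructed from the environment.
func SetSubsystemLevel(subsystem string, level hclog.Level) {
	subsystemLevelsMutex.Lock()
	defer subsystemLevelsMutex.Unlock()
	subsystemLevels[subsystem] = level
}

// WouldLog reports whether a message at msgLevel would be emitted for
// the subsystem, without reading anything out of a context.Context.
func WouldLog(subsystem string, msgLevel hclog.Level) bool {
	subsystemLevelsMutex.RLock()
	defer subsystemLevelsMutex.RUnlock()

	level, ok := subsystemLevels[subsystem]
	if !ok {
		// Unknown subsystem: fall back to the old context-based path.
		return true
	}
	// hclog levels increase with severity, so TRACE < DEBUG < INFO ...
	return msgLevel >= level
}
```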

bflad commented Apr 19, 2023

Hi @lawrencejones 👋 Thank you for reporting this issue and sorry for the frustrating behavior.

Over in hashicorp/terraform-plugin-log#149 it appears we can introduce some level caching and checks to avoid unnecessary context.Context handling. The unit testing benchmarks are promising, although it would be great if you could also try it out.

@bflad bflad self-assigned this Apr 19, 2023
@bflad bflad added this to the v1.3.0 milestone Apr 19, 2023
@lawrencejones (Author)

Thank you a lot for putting this together! I'm at KubeCon this week so things are a bit hectic, but I'll be happy to try this out when I get some time.

We have a terraform codebase that is un-runnable without this type of patch, so it should be easy to confirm whether it works.


bflad commented May 15, 2023

Hi again, @lawrencejones 👋 Did you happen to have an opportunity to try out the terraform-plugin-log change? No worries if not.

@bflad bflad added the waiting-response Issues or pull requests waiting for an external response label May 15, 2023
@github-actions github-actions bot removed the waiting-response Issues or pull requests waiting for an external response label May 15, 2023
@bflad bflad added the waiting-response Issues or pull requests waiting for an external response label May 15, 2023
bflad added a commit to hashicorp/terraform-plugin-log that referenced this issue May 31, 2023 (#149)

bflad commented Jun 1, 2023

The framework now pulls in some upstream terraform-plugin-log changes via #744, so we can roll this out across the other SDKs as well. If we still find performance issues, we can determine next steps in a fresh issue. 👍

@bflad bflad closed this as completed Jun 1, 2023
@github-actions github-actions bot removed the waiting-response Issues or pull requests waiting for an external response label Jun 1, 2023

github-actions bot commented Jul 2, 2023

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jul 2, 2023