
Trace and debug logging is extremely expensive (x32 slower) for large resource diffs #721

Closed
lawrencejones opened this issue Apr 11, 2023 · 7 comments

@lawrencejones

Hey team 👋

Was unsure if this would come under bug or enhancement, but given that this is preventing my provider from creating a plan, I went for bug.

tl;dr: we have a resource that accepts a large and deeply nested object. Trace and debug logs produced by the SDK around setting defaults, state modifications, etc. are so verbose that planning becomes much more expensive (16s -> 527s, so ~32x).

Having profiled the provider as it produces the plan, I can see the majority of the cost comes from the context-based logger. Logging this much is generally expensive, but particularly so with reflection-based runtime loggers that call context.Value frequently, as that time adds up quickly.
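
To illustrate the failure mode (this is a standalone sketch, not provider code): context.Value walks the chain of wrapped contexts and compares keys one by one, so lookups against a deeply nested context, repeated on every log call, dominate quickly. The runtime.efaceeq and runtime.strequal entries in the profile below are those key comparisons.

```go
// Standalone sketch: context.Value cost grows with context nesting depth.
package main

import (
	"context"
	"fmt"
	"time"
)

type ctxKey string

func main() {
	ctx := context.Background()
	// Simulate a deeply wrapped request context, as a contextual logger
	// builds one up over the course of a request.
	for i := 0; i < 50; i++ {
		ctx = context.WithValue(ctx, ctxKey(fmt.Sprintf("key-%d", i)), i)
	}

	start := time.Now()
	for i := 0; i < 1_000_000; i++ {
		// Worst case: the first key added is found last, after walking
		// and comparing keys through all 50 wrapper contexts.
		_ = ctx.Value(ctxKey("key-0"))
	}
	fmt.Println("1e6 lookups took", time.Since(start))
}
```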

Attaching the profile below, where you can see that of the 28.52s sample, we spent 22.38s in context.(*valueCtx).Value as part of the logging library, and most of the rest in logging-adjacent functions.

(pprof) top 100
Showing nodes accounting for 28.08s, 98.46% of 28.52s total
Dropped 105 nodes (cum <= 0.14s)
      flat  flat%   sum%        cum   cum%
    15.34s 53.79% 53.79%     19.21s 67.36%  runtime.efaceeq
     3.76s 13.18% 66.97%      3.77s 13.22%  runtime.strequal
     3.49s 12.24% 79.21%      3.49s 12.24%  syscall.syscall
     3.16s 11.08% 90.29%     22.38s 78.47%  context.value
     1.50s  5.26% 95.55%      1.50s  5.26%  runtime.madvise
     0.66s  2.31% 97.86%      0.66s  2.31%  runtime.memclrNoHeapPointers
     0.16s  0.56% 98.42%      0.20s   0.7%  runtime.scanobject
     0.01s 0.035% 98.46%      0.76s  2.66%  runtime.mallocgc
         0     0% 98.46%     22.38s 78.47%  context.(*valueCtx).Value
         0     0% 98.46%      4.21s 14.76%  github.com/hashicorp/go-hclog.(*intLogger).Trace
         0     0% 98.46%      4.29s 15.04%  github.com/hashicorp/go-hclog.(*intLogger).log
         0     0% 98.46%      0.80s  2.81%  github.com/hashicorp/go-hclog.(*intLogger).logJSON
         0     0% 98.46%      3.49s 12.24%  github.com/hashicorp/go-hclog.(*writer).Flush
         0     0% 98.46%      0.65s  2.28%  github.com/hashicorp/go-hclog.intLogger.jsonMapEntry
         0     0% 98.46%     26.76s 93.83%  github.com/hashicorp/terraform-plugin-framework/internal/fwserver.(*Server).PlanResourceChange
         0     0% 98.46%     26.76s 93.83%  github.com/hashicorp/terraform-plugin-framework/internal/fwserver.MarkComputedNilsAsUnknown.func1
         0     0% 98.46%      0.67s  2.35%  github.com/hashicorp/terraform-plugin-framework/internal/logging.FrameworkDebug (inline)
         0     0% 98.46%     26.07s 91.41%  github.com/hashicorp/terraform-plugin-framework/internal/logging.FrameworkTrace (inline)
         0     0% 98.46%     26.76s 93.83%  github.com/hashicorp/terraform-plugin-framework/internal/proto6server.(*Server).PlanResourceChange
         0     0% 98.46%     26.76s 93.83%  github.com/hashicorp/terraform-plugin-go/tfprotov6/internal/tfplugin6._Provider_PlanResourceChange_Handler
         0     0% 98.46%     26.76s 93.83%  github.com/hashicorp/terraform-plugin-go/tfprotov6/tf6server.(*server).PlanResourceChange
         0     0% 98.46%     26.76s 93.83%  github.com/hashicorp/terraform-plugin-go/tftypes.Transform (inline)
         0     0% 98.46%     26.76s 93.83%  github.com/hashicorp/terraform-plugin-go/tftypes.transform
         0     0% 98.46%     22.39s 78.51%  github.com/hashicorp/terraform-plugin-log/internal/logging.GetSDKSubsystemLogger
         0     0% 98.46%      0.67s  2.35%  github.com/hashicorp/terraform-plugin-log/tfsdklog.SubsystemDebug
         0     0% 98.46%     26.07s 91.41%  github.com/hashicorp/terraform-plugin-log/tfsdklog.SubsystemTrace
         0     0% 98.46%     26.76s 93.83%  google.golang.org/grpc.(*Server).handleStream
         0     0% 98.46%     26.76s 93.83%  google.golang.org/grpc.(*Server).processUnaryRPC
         0     0% 98.46%     26.76s 93.83%  google.golang.org/grpc.(*Server).serveStreams.func1.2
         0     0% 98.46%      3.49s 12.24%  internal/poll.(*FD).Write
         0     0% 98.46%      3.49s 12.24%  internal/poll.ignoringEINTRIO (inline)
         0     0% 98.46%      3.49s 12.24%  os.(*File).Write
         0     0% 98.46%      3.49s 12.24%  os.(*File).write (inline)
         0     0% 98.46%      1.54s  5.40%  runtime.(*mheap).alloc.func1
         0     0% 98.46%      1.54s  5.40%  runtime.(*mheap).allocSpan
         0     0% 98.46%      0.20s   0.7%  runtime.gcBgMarkWorker
         0     0% 98.46%      0.20s   0.7%  runtime.gcBgMarkWorker.func2
         0     0% 98.46%      0.20s   0.7%  runtime.gcDrain
         0     0% 98.46%      0.62s  2.17%  runtime.mapassign_faststr
         0     0% 98.46%      0.59s  2.07%  runtime.newobject (inline)
         0     0% 98.46%      1.50s  5.26%  runtime.sysUsed (inline)
         0     0% 98.46%      1.50s  5.26%  runtime.sysUsedOS (inline)
         0     0% 98.46%      1.75s  6.14%  runtime.systemstack
         0     0% 98.46%      3.49s 12.24%  syscall.Write (inline)
         0     0% 98.46%      3.49s 12.24%  syscall.write
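
For anyone wanting to reproduce a profile like this, a CPU profile can be captured by wrapping the provider's main with runtime/pprof. This is a sketch; serveProvider is a placeholder for your provider's real entrypoint:

```go
// Sketch: capture a CPU profile for the lifetime of the provider process.
package main

import (
	"log"
	"os"
	"runtime/pprof"
)

func main() {
	f, err := os.Create("plan.pprof")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if err := pprof.StartCPUProfile(f); err != nil {
		log.Fatal(err)
	}
	defer pprof.StopCPUProfile()

	serveProvider() // stands in for the real provider serve call
}

// serveProvider is a placeholder for the provider's usual entrypoint.
func serveProvider() {}
```

Running a plan against the instrumented binary and then `go tool pprof plan.pprof` with `top 100` gives output like the above.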

Module version

github.com/hashicorp/terraform-plugin-framework v1.2.0

Definitely running the latest, as I've pulled the repo locally to try investigating this.

Relevant provider source code

PlanResourceChange (https://github.com/hashicorp/terraform-plugin-framework/blob/main/internal/fwserver/server_planresourcechange.go#L44) is where we're spending all our time, due to the sheer number of log calls made as the framework walks the planned value.
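
The log volume scales with the size of the resource data because the framework visits every value in the plan via tftypes.Transform, logging per visit. A standalone sketch of that walk, counting visits instead of logging:

```go
package main

import (
	"fmt"

	"github.com/hashicorp/terraform-plugin-go/tftypes"
)

func main() {
	val := tftypes.NewValue(
		tftypes.List{ElementType: tftypes.String},
		[]tftypes.Value{
			tftypes.NewValue(tftypes.String, "a"),
			tftypes.NewValue(tftypes.String, "b"),
		},
	)

	visits := 0
	_, err := tftypes.Transform(val, func(path *tftypes.AttributePath, v tftypes.Value) (tftypes.Value, error) {
		visits++ // the framework emits a TRACE log at each visit like this one
		return v, nil
	})
	if err != nil {
		panic(err)
	}
	fmt.Println("visited", visits, "values") // 3: two elements plus the list itself
}
```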

Applying this patch is what gives the 32x improvement in plan speed:

diff --git a/internal/logging/framework.go b/internal/logging/framework.go
index eb3e8da..f704972 100644
--- a/internal/logging/framework.go
+++ b/internal/logging/framework.go
@@ -13,6 +13,8 @@ const (
 
 // FrameworkDebug emits a framework subsystem log at DEBUG level.
 func FrameworkDebug(ctx context.Context, msg string, additionalFields ...map[string]interface{}) {
+       return
+
        tfsdklog.SubsystemDebug(ctx, SubsystemFramework, msg, additionalFields...)
 }
 
@@ -23,6 +25,8 @@ func FrameworkError(ctx context.Context, msg string, additionalFields ...map[str
 
 // FrameworkTrace emits a framework subsystem log at TRACE level.
 func FrameworkTrace(ctx context.Context, msg string, additionalFields ...map[string]interface{}) {
+       return
+
        tfsdklog.SubsystemTrace(ctx, SubsystemFramework, msg, additionalFields...)
 }

Terraform Configuration Files

I've got a big nested entries attribute on my resource that looks like this:

resource "incident_catalog_entries" "services" {
  id = incident_catalog_type.service.id

  entries = {
    for _, entry in local.catalog :
    # Map from our external ID => value for each of the attributes.
    entry["id"] => {
      name = entry["type_data"]["name"],

      attribute_values = {
        // lots of values here
      }
    }
  }
}

I can share the provider, but I think the only part that's probably relevant is the schema:

func (r *IncidentCatalogEntriesResource) Schema(ctx context.Context, req resource.SchemaRequest, resp *resource.SchemaResponse) {
	resp.Schema = schema.Schema{
		MarkdownDescription: apischema.TagDocstring("Catalog V2"),
		Attributes: map[string]schema.Attribute{
			"id": schema.StringAttribute{
				MarkdownDescription: apischema.Docstring("CatalogEntryV2ResponseBody", "catalog_type_id"),
				PlanModifiers: []planmodifier.String{
					stringplanmodifier.RequiresReplace(),
				},
				Required: true,
			},
			"entries": schema.MapNestedAttribute{
				Required: true,
				NestedObject: schema.NestedAttributeObject{
					Attributes: map[string]schema.Attribute{
						"id": schema.StringAttribute{
							MarkdownDescription: apischema.Docstring("CatalogEntryV2ResponseBody", "id"),
							Computed:            true,
							PlanModifiers: []planmodifier.String{
								stringplanmodifier.UseStateForUnknown(),
							},
						},
						"name": schema.StringAttribute{
							MarkdownDescription: apischema.Docstring("CatalogEntryV2ResponseBody", "name"),
							Required:            true,
						},
						"alias": schema.StringAttribute{
							MarkdownDescription: apischema.Docstring("CatalogEntryV2ResponseBody", "alias"),
							Optional:            true,
						},
						"rank": schema.Int64Attribute{
							MarkdownDescription: apischema.Docstring("CatalogEntryV2ResponseBody", "rank"),
							Optional:            true,
							Computed:            true,
							Default:             int64default.StaticInt64(0),
						},
						"attribute_values": schema.MapNestedAttribute{
							Required: true,
							NestedObject: schema.NestedAttributeObject{
								Attributes: map[string]schema.Attribute{
									"value": schema.StringAttribute{
										Description: `The value of this attribute, in a format suitable for this attribute type.`,
										Optional:    true,
									},
									"array_value": schema.ListAttribute{
										ElementType: types.StringType,
										Description: `The value of this element of the array, in a format suitable for this attribute type.`,
										Optional:    true,
									},
								},
							},
						},
					},
				},
			},
		},
	}
}

Debug Output

Debug and trace logs are kind of the problem here, so I'll point people at the profile output above.

Expected Behavior

My plan executes in a normal time (<30s).

Actual Behavior

My plan takes so long to execute that most people would give up (30s+; it actually only finishes ~10 minutes later).

Steps to Reproduce

Build a highly nested object with many entries and attempt a plan.

References

None that I'm aware of. I did search for logging and performance in this repo and a few others, but couldn't find anything.

@lawrencejones lawrencejones added the bug Something isn't working label Apr 11, 2023
@lawrencejones (Author)

There are a couple of possible fixes that come to mind:

  1. Set the log-level at process boot inside the internal/logging package and use the cached level to short-circuit calls into the log library (sketched below)
  2. Cache the log-level for a period of time (1s?) to help shortcut calls

I'd defer to people more familiar with the project on this, though.
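
A minimal sketch of option (1), assuming the level can be resolved once from TF_LOG_SDK_FRAMEWORK at init. The names mirror but simplify the framework's internal logging package, and the real level resolution has more fallbacks:

```go
// Simplified stand-in for the framework's internal logging package;
// the init-time level cache is the proposed addition.
package logging

import (
	"context"
	"os"
	"strings"

	"github.com/hashicorp/terraform-plugin-log/tfsdklog"
)

const SubsystemFramework = "framework"

var traceEnabled bool

func init() {
	// TF_LOG_SDK_FRAMEWORK drives the framework subsystem level; this
	// parsing is deliberately crude for the sketch.
	traceEnabled = strings.ToUpper(os.Getenv("TF_LOG_SDK_FRAMEWORK")) == "TRACE"
}

// FrameworkTrace emits a framework subsystem log at TRACE level, but
// only pays the context-logger cost when TRACE is actually active.
func FrameworkTrace(ctx context.Context, msg string, additionalFields ...map[string]interface{}) {
	if !traceEnabled {
		return
	}
	tfsdklog.SubsystemTrace(ctx, SubsystemFramework, msg, additionalFields...)
}
```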

lawrencejones added a commit to incident-io/terraform-plugin-framework that referenced this issue Apr 11, 2023
This commit addresses this issue:
hashicorp#721

When working with large terraform resource attributes, logging can
become the primary cost of running plan/apply. This is because the
context logger used by tfsdklog is very expensive, the SDK framework
itself is very verbose, and even when the log level means Debug/Trace
logs won't be emitted, we still build the log.

As we log via the internal logging package anyway, we can avoid this
performance issue by maintaining a package-local log-level which is
initialised at the same time as we init the contextual logger. Our
logging package can then avoid calling into tfsdklog (which is where we
pay the performance penalty) unless it believes the level is active,
which in most cases avoids 99% of the log cost.
@lawrencejones (Author)

This PR avoids this issue when trace/debug levels are not active (most production situations) by maintaining a log-level guard inside the framework code: #722

We're going to run with this for now, but I'm very interested in people's thoughts more generally. If I had more time, I'd probably be inclined to (1) get this merged and (2) investigate tfsdklog's performance later, as I expect something weird is going on around context nesting and key length that makes usage really expensive.

lawrencejones added a commit to incident-io/terraform-provider-incident that referenced this issue Apr 11, 2023
Introduce a resource that can manage entries in bulk, which:

1. Solves terraform performance issues when dealing with a lot of
   entries
2. Reduces pressure on our API when working with entries

To achieve this we've forked the terraform sdk framework to improve
performance when logging, which is necessary to avoid massive plan times
on large nested objects.

Relevant links are:
- PR for logging change
  (hashicorp/terraform-plugin-framework#722)
- Issue in upstream
  (hashicorp/terraform-plugin-framework#721)
bflad added a commit to hashicorp/terraform-plugin-log that referenced this issue Apr 19, 2023
Reference: hashicorp/terraform-plugin-framework#721

This change introduces log level checks outside the `context.Context` of a request to improve the performance of logging functions when logs would not be emitted at the configured level. Almost exclusively, logging levels are not expected to change during provider (and therefore SDK) runtime since they are environment variable driven. Even if they did, this level check will either immediately constrict logging to match an updated higher level or cause additional `context.Context` reads with an updated lower level, which is no different than the current behavior.

The following benchmarks were run prior to the introduction of the level checks and mutexes:

```
BenchmarkSubsystemTraceDisabled100-10            4616656               258.0 ns/op
BenchmarkSubsystemTraceEnabled100-10              936865              1138 ns/op
BenchmarkSubsystemTraceDisabled1000-10           4552549               261.1 ns/op
BenchmarkSubsystemTraceEnabled1000-10             956137              1126 ns/op
BenchmarkSubsystemTraceDisabled10000-10          4475106               261.4 ns/op
BenchmarkSubsystemTraceEnabled10000-10            943819              1121 ns/op
```

After the addition of level checks and mutexes:

```
BenchmarkSubsystemTraceDisabled100-10           86043471                13.78 ns/op
BenchmarkSubsystemTraceEnabled100-10              906649              1137 ns/op
BenchmarkSubsystemTraceDisabled1000-10          83207169                13.81 ns/op
BenchmarkSubsystemTraceEnabled1000-10             935115              1122 ns/op
BenchmarkSubsystemTraceDisabled10000-10         83775966                13.77 ns/op
BenchmarkSubsystemTraceEnabled10000-10            909601              1134 ns/op
```

This enhancement could also be considered for provider loggers; however, SDK logging tends to be more prevalent in practice, so this only targets the `tfsdklog` package handling.
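
A sketch of the pattern that commit describes, with hypothetical names (the real implementation lives in terraform-plugin-log's internal logging package): cache each subsystem's level in package state behind a mutex and check it before touching the context at all.

```go
package logging

import (
	"sync"

	"github.com/hashicorp/go-hclog"
)

var (
	subsystemLevelsMutex sync.RWMutex
	subsystemLevels      = map[string]hclog.Level{}
)

// SetSubsystemLevel records a subsystem's level once, when its logger
// is constructed from the environment.
func SetSubsystemLevel(subsystem string, level hclog.Level) {
	subsystemLevelsMutex.Lock()
	defer subsystemLevelsMutex.Unlock()
	subsystemLevels[subsystem] = level
}

// WouldLog reports whether a message at msgLevel would be emitted for
// the subsystem, without reading anything out of a context.Context.
func WouldLog(subsystem string, msgLevel hclog.Level) bool {
	subsystemLevelsMutex.RLock()
	defer subsystemLevelsMutex.RUnlock()

	level, ok := subsystemLevels[subsystem]
	if !ok {
		// Unknown subsystem: fall back to the old context-based path.
		return true
	}
	// hclog levels increase with severity, so TRACE < DEBUG < INFO ...
	return msgLevel >= level
}
```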

bflad commented Apr 19, 2023

Hi @lawrencejones 👋 Thank you for reporting this issue and sorry for the frustrating behavior.

Over in hashicorp/terraform-plugin-log#149 it appears we can introduce some level caching and checks to avoid unnecessary context.Context handling. The unit testing benchmarks are promising, although it would be great if you could also try it out.

@bflad bflad self-assigned this Apr 19, 2023
@bflad bflad added this to the v1.3.0 milestone Apr 19, 2023
@lawrencejones (Author)

Thank you a lot for putting this together! I'm at KubeCon this week so things are a bit hectic, but I'll be happy to try this out when I get some time.

We have a terraform codebase that is un-runnable without this type of patch, so it should be easy to confirm whether it works.


bflad commented May 15, 2023

Hi again, @lawrencejones 👋 Did you happen to have an opportunity to try out the terraform-plugin-log change? No worries if not.

@bflad bflad added the waiting-response Issues or pull requests waiting for an external response label May 15, 2023
@github-actions github-actions bot removed the waiting-response Issues or pull requests waiting for an external response label May 15, 2023
@bflad bflad added the waiting-response Issues or pull requests waiting for an external response label May 15, 2023
bflad added a commit to hashicorp/terraform-plugin-log that referenced this issue May 31, 2023 (#149)

bflad commented Jun 1, 2023

The framework now pulls in some upstream terraform-plugin-log changes via #744, so we can roll this out across the other SDKs as well. If we still find performance issues, we can determine next steps in a fresh issue. 👍

@bflad bflad closed this as completed Jun 1, 2023
@github-actions github-actions bot removed the waiting-response Issues or pull requests waiting for an external response label Jun 1, 2023

github-actions bot commented Jul 2, 2023

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jul 2, 2023