Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SDTEST-409] Send telemetry events in background thread to remove blocking HTTP calls from critical path #3718

Merged

Conversation

anmarchenko
Copy link
Member

@anmarchenko anmarchenko commented Jun 13, 2024

What does this PR do?
Moves all telemetry events sending to the background thread. Event "app-started" is now being sent when the Telemetry::Worker (only one successful event dispatch per application run), no additional logic to determine when to call telemetry.started! is required.

Notable changes include:

  • Telemetry::Client is renamed to Telemetry::Component to better reflect its purpose: it is no longer dispatches HTTP requests, but provides a public API to enqueue them.
  • Telemetry::Heartbeat is renamed to Telemetry::Worker. Now it includes Core::Workers::Queue
    as well as buffer with limited capacity.
  • I introduced a new type of OnlyOnce util: OnlyOnceSuccessful. It shares most of its functionality with OnlyOnce with important difference: it resets ran_once var if the block returns falsey value such as nil or false. This ensures that app-started event is sent once, successfully (before that, if agent was misconfigured on application start, telemetry would fail to start correctly)
  • Core::Configuration no longer includes custom logic to determine whether to call telemetry.started!: now it is handled internally by Telemetry::Worker, there is no need for Core to handle specific component's lifecycle.

Motivation:
Currently Telemetry::Client class performs HTTP requests to the agent synchronously which can cause delays to the instrumented application.

How to test the change?
Tested using empty rails app, here are the debug logs:

D, [2024-06-17T17:04:27.935501 #83355] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/configuration/settings.rb:712:in `block (3 levels) in <class:Settings>') Development environment detected, disabling Remote Configuration. You can enable it with DD_REMOTE_CONFIGURATION_ENABLED=true.

# ---> Telemetry starts for the first time before Datadog.configure with default agent config, app-started event sending failed
D, [2024-06-17T17:04:27.935963 #83355] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/workers/async.rb:136:in `start_worker') Starting thread for: #<Datadog::Core::Telemetry::Worker:0x000000012d2b5240>
I, [2024-06-17T17:04:27.936216 #83355]  INFO -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/diagnostics/environment_logger.rb:13:in `log_configuration!') DATADOG CONFIGURATION - CORE - {"date":"2024-06-17T15:04:27Z","os_name":"arm64-apple-darwin23","version":"2.1.0","lang":"ruby","lang_version":"3.3.3","env":null,"service":"rails","dd_version":null,"debug":true,"tags":null,"runtime_metrics_enabled":false,"vm":"ruby-3.3.3","health_metrics_enabled":false,"profiling_enabled":false}
D, [2024-06-17T17:04:27.966562 #83355] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/telemetry/http/adapters/net.rb:47:in `rescue in post') Unable to send telemetry event to agent: Failed to open TCP connection to 127.0.0.1:8126 (Connection refused - connect(2) for "127.0.0.1" port 8126)

# --> this log is a bit misleading here, sending was attempted but the result was an error (code: nil)
D, [2024-06-17T17:04:27.966644 #83355] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/telemetry/emitter.rb:29:in `request') Telemetry sent for event `app-started` (code: nil)
D, [2024-06-17T17:04:27.967371 #83355] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/telemetry/worker.rb:107:in `block in started!') Error sending telemetry app-started event, retry after heartbeat interval...
# --> here worker for the first telemetry component was stopped

=> Booting Puma
=> Rails 7.1.3.4 application starting in development 
=> Run `bin/rails server --help` for more startup options

# --> second configuration, on require 'auto_instrument'
D, [2024-06-17T17:04:28.105978 #83355] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/workers/async.rb:136:in `start_worker') Starting thread for: #<Datadog::Core::Telemetry::Worker:0x000000012da77dc0>
I, [2024-06-17T17:04:28.106370 #83355]  INFO -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/diagnostics/environment_logger.rb:13:in `log_configuration!') DATADOG CONFIGURATION - CORE - {"date":"2024-06-17T15:04:28Z","os_name":"arm64-apple-darwin23","version":"2.1.0","lang":"ruby","lang_version":"3.3.3","env":null,"service":"rails","dd_version":null,"debug":true,"tags":null,"runtime_metrics_enabled":false,"vm":"ruby-3.3.3","health_metrics_enabled":false,"profiling_enabled":false}

# --> again agent connection is not configured yet, telemetry failed to send app-started event
D, [2024-06-17T17:04:28.110379 #83355] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/telemetry/http/adapters/net.rb:47:in `rescue in post') Unable to send telemetry event to agent: Failed to open TCP connection to 127.0.0.1:8126 (Connection refused - connect(2) for "127.0.0.1" port 8126)
D, [2024-06-17T17:04:28.110538 #83355] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/telemetry/emitter.rb:29:in `request') Telemetry sent for event `app-started` (code: nil)
D, [2024-06-17T17:04:28.110613 #83355] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/telemetry/worker.rb:107:in `block in started!') Error sending telemetry app-started event, retry after heartbeat interval...

# ---> another reconfiguration
D, [2024-06-17T17:04:28.135660 #83355] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/workers/async.rb:136:in `start_worker') Starting thread for: #<Datadog::Core::Telemetry::Worker:0x000000012f9171e0>
I, [2024-06-17T17:04:28.135971 #83355]  INFO -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/diagnostics/environment_logger.rb:13:in `log_configuration!') DATADOG CONFIGURATION - CORE - {"date":"2024-06-17T15:04:28Z","os_name":"arm64-apple-darwin23","version":"2.1.0","lang":"ruby","lang_version":"3.3.3","env":"local","service":"quotes-rails","dd_version":null,"debug":true,"tags":null,"runtime_metrics_enabled":false,"vm":"ruby-3.3.3","health_metrics_enabled":false,"profiling_enabled":false}

# ---> and another reconfiguration concurrently
D, [2024-06-17T17:04:28.211560 #83355] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/workers/async.rb:136:in `start_worker') Starting thread for: #<Datadog::Core::Telemetry::Worker:0x000000012f6ffbf0>

# ----> this time app-started event was sent
D, [2024-06-17T17:04:28.344994 #83355] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/telemetry/emitter.rb:29:in `request') Telemetry sent for event `app-started` (code: 202)
D, [2024-06-17T17:04:28.345096 #83355] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/telemetry/worker.rb:101:in `block in started!') Telemetry app-started event is successfully sent
# --> immediately after, app-dependencies-loaded event was sent
D, [2024-06-17T17:04:28.444926 #83355] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/telemetry/emitter.rb:29:in `request') Telemetry sent for event `app-dependencies-loaded` (code: 202)

# ---> polling cycle starts
D, [2024-06-17T17:04:28.445225 #83355] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/telemetry/worker.rb:82:in `flush_events') Sending 1 telemetry events
D, [2024-06-17T17:04:28.528851 #83355] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/telemetry/emitter.rb:29:in `request') Telemetry sent for event `app-heartbeat` (code: 202)
D, [2024-06-17T17:04:28.528897 #83355] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/telemetry/worker.rb:82:in `flush_events') Sending 0 telemetry events
D, [2024-06-17T17:04:28.597040 #83355] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/telemetry/emitter.rb:29:in `request') Telemetry sent for event `app-heartbeat` (code: 202)
D, [2024-06-17T17:04:28.597083 #83355] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/telemetry/worker.rb:82:in `flush_events') Sending 0 telemetry events
D, [2024-06-17T17:04:28.627644 #83355] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/telemetry/emitter.rb:29:in `request') Telemetry sent for event `app-integrations-change` (code: 202)
I, [2024-06-17T17:04:28.627892 #83355]  INFO -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/diagnostics/environment_logger.rb:13:in `log_configuration!') DATADOG CONFIGURATION - CORE - {"date":"2024-06-17T15:04:28Z","os_name":"arm64-apple-darwin23","version":"2.1.0","lang":"ruby","lang_version":"3.3.3","env":"local","service":"quotes-rails","dd_version":null,"debug":true,"tags":null,"runtime_metrics_enabled":false,"vm":"ruby-3.3.3","health_metrics_enabled":false,"profiling_enabled":false}
Puma starting in single mode...
* Puma version: 6.4.2 (ruby 3.3.3-p89) ("The Eagle of Durango")
*  Min threads: 5
*  Max threads: 5
*  Environment: development
*          PID: 83355
* Listening on http://127.0.0.1:3000
* Listening on http://[::1]:3000
Use Ctrl-C to stop


# ---> heartbeat after 1 minute and app-integrations-change event from queue
D, [2024-06-17T17:05:28.795742 #83355] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/telemetry/emitter.rb:29:in `request') Telemetry sent for event `app-heartbeat` (code: 202)
D, [2024-06-17T17:05:28.795781 #83355] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/telemetry/worker.rb:82:in `flush_events') Sending 1 telemetry events
D, [2024-06-17T17:05:28.885538 #83355] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/telemetry/emitter.rb:29:in `request') Telemetry sent for event `app-integrations-change` (code: 202)

# ---> next heartbeat after 1 minute
D, [2024-06-17T17:06:29.071874 #83355] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/telemetry/emitter.rb:29:in `request') Telemetry sent for event `app-heartbeat` (code: 202)
D, [2024-06-17T17:06:29.071914 #83355] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/telemetry/worker.rb:82:in `flush_events') Sending 0 telemetry events

# ---> next heartbeat after 1 minute
D, [2024-06-17T17:07:29.242708 #83355] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/telemetry/emitter.rb:29:in `request') Telemetry sent for event `app-heartbeat` (code: 202)
D, [2024-06-17T17:07:29.242747 #83355] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/telemetry/worker.rb:82:in `flush_events') Sending 0 telemetry events

# ---> next heartbeat after 1 minute
D, [2024-06-17T17:08:29.459809 #83355] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/telemetry/emitter.rb:29:in `request') Telemetry sent for event `app-heartbeat` (code: 202)
D, [2024-06-17T17:08:29.459855 #83355] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/telemetry/worker.rb:82:in `flush_events') Sending 0 telemetry events

# --> stop signal
^C- Gracefully stopping, waiting for requests to finish
=== puma shutdown: 2024-06-17 17:09:28 +0200 ===
- Goodbye!
Exiting

# ---> stopping telemetry, flushing app-closing event
D, [2024-06-17T17:09:28.069619 #83355] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/telemetry/worker.rb:82:in `flush_events') Sending 1 telemetry events
D, [2024-06-17T17:09:28.242605 #83355] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/telemetry/emitter.rb:29:in `request') Telemetry sent for event `app-closing` (code: 202)

@github-actions github-actions bot added the core Involves Datadog core libraries label Jun 13, 2024
@anmarchenko anmarchenko changed the title Anmarchenko/telemetry_send_events_in_background_thread [NO-TICKET] Send telemetry events only in background thread Jun 13, 2024
@anmarchenko anmarchenko changed the title [NO-TICKET] Send telemetry events only in background thread [SDTEST-409] Send telemetry events only in background thread Jun 13, 2024
@codecov-commenter
Copy link

codecov-commenter commented Jun 14, 2024

Codecov Report

Attention: Patch coverage is 98.53748% with 8 lines in your changes missing coverage. Please review.

Project coverage is 98.10%. Comparing base (75278d6) to head (3b4e4d4).
Report is 11 commits behind head on master.

Files Patch % Lines
spec/datadog/core/telemetry/component_spec.rb 97.41% 3 Missing ⚠️
spec/datadog/core/telemetry/worker_spec.rb 97.95% 3 Missing ⚠️
lib/datadog/core/telemetry/worker.rb 97.46% 2 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff            @@
##           master    #3718    +/-   ##
========================================
  Coverage   98.10%   98.10%            
========================================
  Files        1225     1227     +2     
  Lines       72866    73014   +148     
  Branches     3493     3508    +15     
========================================
+ Hits        71485    71632   +147     
- Misses       1381     1382     +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@anmarchenko anmarchenko changed the title [SDTEST-409] Send telemetry events only in background thread [SDTEST-409] Send telemetry events in background thread, remove blocking HTTP calls from critical path Jun 14, 2024
@anmarchenko anmarchenko changed the title [SDTEST-409] Send telemetry events in background thread, remove blocking HTTP calls from critical path [SDTEST-409] Send telemetry events in background thread to remove blocking HTTP calls from critical path Jun 14, 2024
end
)
end

# Should only be called the first time components are built
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are able to remove this logic because Telemetry::Worker now handles that app-started is sent once. Note that before it was not guaranteed that app-started event was sent: if agent URL was not configured via environment variable, telemetry would silently fail to start. This is handled correctly now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love removing "special behavior" like this!

@anmarchenko anmarchenko marked this pull request as ready for review June 18, 2024 08:50
@anmarchenko anmarchenko requested review from a team as code owners June 18, 2024 08:50
Copy link
Contributor

@delner delner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a really solid re-work of telemetry! Nice work!

Only have a minor nit (non-blocking), and a question about the behavior of "only once successful" (regarding retries). Otherwise I'm good with this.

end
)
end

# Should only be called the first time components are built
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love removing "special behavior" like this!

lib/datadog/core/telemetry/worker.rb Outdated Show resolved Hide resolved
lib/datadog/core/telemetry/worker.rb Show resolved Hide resolved
Copy link
Contributor

@delner delner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy with this. Nice work!

@ivoanjo
Copy link
Member

ivoanjo commented Jun 24, 2024

Nice!

Here's my original reproducer for the blocking calls issue:

require 'socket'
require 'datadog'
server = TCPServer.new(8126) # fake agent which never replies
puts Datadog::Tracing.enabled?

Here's what I get in master:

$ time DD_INSTRUMENTATION_TELEMETRY_ENABLED=true bundle exec ruby repro-telemetry.rb 
I, [2024-06-24T19:53:39.862601 #110133]  INFO -- datadog: [datadog] DATADOG CONFIGURATION - CORE - {"date":"2024-06-24T18:53:39Z","os_name":"x86_64-pc-linux-gnu","version":"2.1.0","lang":"ruby","lang_version":"3.1.4","env":null,"service":"repro-telemetry","dd_version":null,"debug":false,"tags":null,"runtime_metrics_enabled":false,"vm":"ruby-3.1.4","health_metrics_enabled":false,"profiling_enabled":false}
true

real  1m30.505s
user  0m0.336s
sys 0m0.082s

and here's what I get with this PR:

$ time DD_INSTRUMENTATION_TELEMETRY_ENABLED=true bundle exec ruby repro-telemetry.rb 
I, [2024-06-24T19:55:31.875510 #111270]  INFO -- datadog: [datadog] DATADOG CONFIGURATION - CORE - {"date":"2024-06-24T18:55:31Z","os_name":"x86_64-pc-linux-gnu","version":"2.1.0","lang":"ruby","lang_version":"3.1.4","env":null,"service":"repro-telemetry","dd_version":null,"debug":false,"tags":null,"runtime_metrics_enabled":false,"vm":"ruby-3.1.4","health_metrics_enabled":false,"profiling_enabled":false}
true

real  0m30.444s
user  0m0.330s
sys 0m0.076s

The printing of true is almost immediate, but the shutdown timeout seems too long still.

@anmarchenko
Copy link
Member Author

@ivoanjo thanks for the reproducer! I have a suspicion that workers shutdown might not work properly if worker is blocked by IO. I will play around with shorter HTTP timeout values and reproducer you provided (I think 30 seconds timeout for telemetry is too long anyway).

@anmarchenko
Copy link
Member Author

anmarchenko commented Jun 25, 2024

@ivoanjo this was a good catch, the timeout was indeed too long for telemetry use case, I lowered HTTP timeout to 2 seconds, the result:

$ time DD_INSTRUMENTATION_TELEMETRY_ENABLED=true bundle exec ruby repro-telemetry.rb                                                                                                                                                                                                          rbenv:3.3.0 
I, [2024-06-25T12:18:37.134810 #82675]  INFO -- datadog: [datadog] DATADOG CONFIGURATION - CORE - {"date":"2024-06-25T10:18:37Z","os_name":"arm64-apple-darwin23","version":"2.1.0","lang":"ruby","lang_version":"3.3.0","env":null,"service":"repro-telemetry","dd_version":null,"debug":false,"tags":null,"runtime_metrics_enabled":false,"vm":"ruby-3.3.0","health_metrics_enabled":false,"profiling_enabled":false}
true

0.34s user 0.14s system 18% cpu 2.608 total

@anmarchenko anmarchenko force-pushed the anmarchenko/telemetry_send_events_in_background_thread branch from 24a0409 to 5729921 Compare June 25, 2024 11:59
@anmarchenko anmarchenko merged commit 154f72e into master Jun 25, 2024
166 checks passed
@anmarchenko anmarchenko deleted the anmarchenko/telemetry_send_events_in_background_thread branch June 25, 2024 14:16
@github-actions github-actions bot added this to the 2.2.0 milestone Jun 25, 2024
@TonyCTHsu TonyCTHsu mentioned this pull request Jul 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Involves Datadog core libraries
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants