Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SDTEST-409] fix deadlock issue: don't flush_events in main thread when calling stop, override #work_pending? instead #3743

Merged
merged 3 commits into from
Jun 28, 2024

Conversation

anmarchenko
Copy link
Member

@anmarchenko anmarchenko commented Jun 28, 2024

What does this PR do?
Fix deadlock issue that was happening in telemetry component when stopping the Datadog context right after start.

The reason for the deadlock was this overridden #stop method:

        def stop(force_stop = false, timeout = @shutdown_timeout)
          buffer.close if running?

          flush_events(dequeue) if work_pending?

          super
        end

This method is called by the main thread when Component#shutdown! is called: it tries to acquire TELEMETRY_STARTED_ONCE lock possibly already acquired by a telemetry worker, which with some unfortunate timing could cause deadlock. I added this #flush_events call to ensure that pending events in the queue are flushed before worker is stopped (it was not happening for a reason unknown to me).

It turned out that Worker with Core::Workers::Queue and Core::Workers::Polling indeed does not flush pending events by default because of #work_pending? method being overridden several times:

In Core::Workers::IntervalLoop (included together with Core::Workers::Polling):

        def work_pending?
          run_loop?
        end

In Core::Workers::Queue:

        def work_pending?
          !buffer.empty?
        end

In telemetry case as Polling is included after Queue, Polling wins and work is considered pending only if loop is currently running. This logic completely disregards the fact that there are some events waiting to be flushed.

What we actually want in this case is the combination of 2 approaches: work is pending when loop is running or when there are some events in buffer. This can be solved by overriding this method in Worker class:

        def work_pending?
          run_loop? || !buffer.empty?
        end

It allows us to remove flush_events call from the main thread eliminating (or decreasing?) deadlock probability.

How to test the change?
Confirm that deadlock issue is gone using reproducer by @ivoanjo:

require 'socket'
require 'datadog'
server = TCPServer.new(8126) # fake agent which never replies
puts Datadog::Tracing.enabled?

Run it with:

$ time DD_INSTRUMENTATION_TELEMETRY_ENABLED=true bundle exec ruby repro-telemetry.rb 

I, [2024-06-28T09:31:01.995300 #308]  INFO -- datadog: [datadog] DATADOG CONFIGURATION - CORE - {"date":"2024-06-28T09:31:01Z","os_name":"aarch64-unknown-linux-gnu","version":"2.1.0","lang":"ruby","lang_version":"3.1.2","env":null,"service":"repro","dd_version":null,"debug":false,"tags":null,"runtime_metrics_enabled":false,"vm":"ruby-3.1.2","health_metrics_enabled":false,"profiling_enabled":false}
true

real    0m1.161s
user    0m0.595s
sys     0m0.096s

... 100 times in different Ruby versions to confirm that deadlock exception does not appear.

Confirm that app-closing event is still being sent when stopping the app using empty rails app with DD_TRACE_DEBUG and telemetry enabled:

D, [2024-06-28T11:39:05.341113 #48226] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/configuration/settings.rb:712:in `block (3 levels) in <class:Settings>') Development environment detected, disabling Remote Configuration. You can enable it with DD_REMOTE_CONFIGURATION_ENABLED=true.
D, [2024-06-28T11:39:05.341532 #48226] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/workers/async.rb:136:in `start_worker') Starting thread for: #<Datadog::Core::Telemetry::Worker:0x00000001215f0650>
I, [2024-06-28T11:39:05.341773 #48226]  INFO -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/diagnostics/environment_logger.rb:13:in `log_configuration!') DATADOG CONFIGURATION - CORE - {"date":"2024-06-28T09:39:05Z","os_name":"arm64-apple-darwin23","version":"2.1.0","lang":"ruby","lang_version":"3.3.3","env":null,"service":"rails","dd_version":null,"debug":true,"tags":null,"runtime_metrics_enabled":false,"vm":"ruby-3.3.3","health_metrics_enabled":false,"profiling_enabled":false}
D, [2024-06-28T11:39:05.367022 #48226] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/telemetry/http/adapters/net.rb:47:in `rescue in post') Unable to send telemetry event to agent: Failed to open TCP connection to 127.0.0.1:8126 (Connection refused - connect(2) for "127.0.0.1" port 8126)
D, [2024-06-28T11:39:05.367681 #48226] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/telemetry/emitter.rb:29:in `request') Telemetry sent for event `app-started` (code: nil)
D, [2024-06-28T11:39:05.367922 #48226] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/telemetry/worker.rb:119:in `block in started!') Error sending telemetry app-started event, retry after heartbeat interval...
=> Booting Puma
=> Rails 7.1.3.4 application starting in development 
=> Run `bin/rails server --help` for more startup options
D, [2024-06-28T11:39:05.484789 #48226] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/workers/async.rb:136:in `start_worker') Starting thread for: #<Datadog::Core::Telemetry::Worker:0x0000000123475be8>
I, [2024-06-28T11:39:05.485100 #48226]  INFO -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/diagnostics/environment_logger.rb:13:in `log_configuration!') DATADOG CONFIGURATION - CORE - {"date":"2024-06-28T09:39:05Z","os_name":"arm64-apple-darwin23","version":"2.1.0","lang":"ruby","lang_version":"3.3.3","env":null,"service":"rails","dd_version":null,"debug":true,"tags":null,"runtime_metrics_enabled":false,"vm":"ruby-3.3.3","health_metrics_enabled":false,"profiling_enabled":false}
D, [2024-06-28T11:39:05.488417 #48226] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/telemetry/http/adapters/net.rb:47:in `rescue in post') Unable to send telemetry event to agent: Failed to open TCP connection to 127.0.0.1:8126 (Connection refused - connect(2) for "127.0.0.1" port 8126)
D, [2024-06-28T11:39:05.488472 #48226] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/telemetry/emitter.rb:29:in `request') Telemetry sent for event `app-started` (code: nil)
D, [2024-06-28T11:39:05.488518 #48226] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/telemetry/worker.rb:119:in `block in started!') Error sending telemetry app-started event, retry after heartbeat interval...
D, [2024-06-28T11:39:05.502629 #48226] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/workers/async.rb:136:in `start_worker') Starting thread for: #<Datadog::Core::Telemetry::Worker:0x00000001240d4420>
D, [2024-06-28T11:39:05.505530 #48226] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/telemetry/http/adapters/net.rb:47:in `rescue in post') Unable to send telemetry event to agent: Failed to open TCP connection to 127.0.0.1:8126 (Connection refused - connect(2) for "127.0.0.1" port 8126)
D, [2024-06-28T11:39:05.505556 #48226] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/telemetry/emitter.rb:29:in `request') Telemetry sent for event `app-started` (code: nil)
D, [2024-06-28T11:39:05.505568 #48226] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/telemetry/worker.rb:119:in `block in started!') Error sending telemetry app-started event, retry after heartbeat interval...
I, [2024-06-28T11:39:05.505705 #48226]  INFO -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/diagnostics/environment_logger.rb:13:in `log_configuration!') DATADOG CONFIGURATION - CORE - {"date":"2024-06-28T09:39:05Z","os_name":"arm64-apple-darwin23","version":"2.1.0","lang":"ruby","lang_version":"3.3.3","env":"local","service":"quotes-rails","dd_version":null,"debug":true,"tags":null,"runtime_metrics_enabled":false,"vm":"ruby-3.3.3","health_metrics_enabled":false,"profiling_enabled":false}
D, [2024-06-28T11:39:05.564050 #48226] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/workers/async.rb:136:in `start_worker') Starting thread for: #<Datadog::Core::Telemetry::Worker:0x00000001238d4f68>
D, [2024-06-28T11:39:05.599066 #48226] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/telemetry/emitter.rb:29:in `request') Telemetry sent for event `app-started` (code: 202)
D, [2024-06-28T11:39:05.599174 #48226] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/telemetry/worker.rb:113:in `block in started!') Telemetry app-started event is successfully sent
D, [2024-06-28T11:39:05.721944 #48226] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/telemetry/emitter.rb:29:in `request') Telemetry sent for event `app-dependencies-loaded` (code: 202)
D, [2024-06-28T11:39:05.813240 #48226] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/telemetry/emitter.rb:29:in `request') Telemetry sent for event `app-heartbeat` (code: 202)
D, [2024-06-28T11:39:05.813287 #48226] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/telemetry/worker.rb:88:in `flush_events') Sending 0 telemetry events
D, [2024-06-28T11:39:05.813732 #48226] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/telemetry/emitter.rb:29:in `request') Telemetry sent for event `app-heartbeat` (code: 202)
D, [2024-06-28T11:39:05.813755 #48226] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/telemetry/worker.rb:88:in `flush_events') Sending 0 telemetry events
D, [2024-06-28T11:39:06.139217 #48226] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/telemetry/emitter.rb:29:in `request') Telemetry sent for event `app-heartbeat` (code: 202)
D, [2024-06-28T11:39:06.139277 #48226] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/telemetry/worker.rb:88:in `flush_events') Sending 1 telemetry events
D, [2024-06-28T11:39:06.254879 #48226] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/telemetry/emitter.rb:29:in `request') Telemetry sent for event `app-integrations-change` (code: 202)
I, [2024-06-28T11:39:06.255109 #48226]  INFO -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/diagnostics/environment_logger.rb:13:in `log_configuration!') DATADOG CONFIGURATION - CORE - {"date":"2024-06-28T09:39:06Z","os_name":"arm64-apple-darwin23","version":"2.1.0","lang":"ruby","lang_version":"3.3.3","env":"local","service":"quotes-rails","dd_version":null,"debug":true,"tags":null,"runtime_metrics_enabled":false,"vm":"ruby-3.3.3","health_metrics_enabled":false,"profiling_enabled":false}
Puma starting in single mode...
* Puma version: 6.4.2 (ruby 3.3.3-p89) ("The Eagle of Durango")
*  Min threads: 5
*  Max threads: 5
*  Environment: development
*          PID: 48226
* Listening on http://127.0.0.1:3000
* Listening on http://[::1]:3000
Use Ctrl-C to stop



D, [2024-06-28T11:40:05.930896 #48226] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/telemetry/emitter.rb:29:in `request') Telemetry sent for event `app-heartbeat` (code: 202)
D, [2024-06-28T11:40:05.931120 #48226] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/telemetry/worker.rb:88:in `flush_events') Sending 1 telemetry events
D, [2024-06-28T11:40:06.031016 #48226] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/telemetry/emitter.rb:29:in `request') Telemetry sent for event `app-integrations-change` (code: 202)


^C- Gracefully stopping, waiting for requests to finish
=== puma shutdown: 2024-06-28 11:40:26 +0200 ===
- Goodbye!
Exiting
D, [2024-06-28T11:40:27.062213 #48226] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/telemetry/emitter.rb:29:in `request') Telemetry sent for event `app-heartbeat` (code: 202)
D, [2024-06-28T11:40:27.062340 #48226] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/telemetry/worker.rb:88:in `flush_events') Sending 1 telemetry events

# ---------------> here is the app-closing event that has to be sent on shutdown
D, [2024-06-28T11:40:27.177922 #48226] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/telemetry/emitter.rb:29:in `request') Telemetry sent for event `app-closing` (code: 202)

@github-actions github-actions bot added the core Involves Datadog core libraries label Jun 28, 2024
@anmarchenko anmarchenko marked this pull request as ready for review June 28, 2024 09:59
@anmarchenko anmarchenko requested a review from a team as a code owner June 28, 2024 09:59
@anmarchenko anmarchenko changed the title fix deadlock issue: don't flush_events in main thread when calling stop, override #work_pending? instead [SDTEST-409] fix deadlock issue: don't flush_events in main thread when calling stop, override #work_pending? instead Jun 28, 2024
@anmarchenko anmarchenko force-pushed the anmarchenko/fix_telemetry_threads_deadlock branch from 1fc79e4 to 5e8360f Compare June 28, 2024 10:54
@codecov-commenter
Copy link

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.11%. Comparing base (9a6e1d5) to head (5e8360f).

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #3743   +/-   ##
=======================================
  Coverage   98.10%   98.11%           
=======================================
  Files        1235     1235           
  Lines       73747    73749    +2     
  Branches     3561     3560    -1     
=======================================
+ Hits        72353    72357    +4     
+ Misses       1394     1392    -2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@anmarchenko anmarchenko merged commit 4cd328d into master Jun 28, 2024
166 checks passed
@anmarchenko anmarchenko deleted the anmarchenko/fix_telemetry_threads_deadlock branch June 28, 2024 13:49
@github-actions github-actions bot added this to the 2.2.0 milestone Jun 28, 2024
@TonyCTHsu TonyCTHsu mentioned this pull request Jul 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Involves Datadog core libraries
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants