
Collect Ruby VM internal event metrics (object allocation) #2164

Closed
marcotc opened this issue Jul 19, 2022 · 1 comment · Fixed by #2635
Labels
feature-request A request for a new feature or change to an existing one

Comments


marcotc commented Jul 19, 2022

Motivation

We would like more visibility into the overhead that the Ruby VM's internals add to our application's execution.
It can be useful to know, for example, whether a web request has created many more objects than all other requests in the same application.

How

Ruby has internal events related to the object lifecycle (NEWOBJ, FREEOBJ) that are broadcast to any registered internal tracepoint event listener.

Here are all the internal events available in Ruby 3.1.2:

#define RUBY_INTERNAL_EVENT_NEWOBJ          0x100000 /**< Object allocated. */
#define RUBY_INTERNAL_EVENT_FREEOBJ         0x200000 /**< Object swept. */
#define RUBY_INTERNAL_EVENT_GC_START        0x400000 /**< GC started. */
#define RUBY_INTERNAL_EVENT_GC_END_MARK     0x800000 /**< GC ended mark phase. */
#define RUBY_INTERNAL_EVENT_GC_END_SWEEP   0x1000000 /**< GC ended sweep phase. */
#define RUBY_INTERNAL_EVENT_GC_ENTER       0x2000000 /**< `gc_enter()` is called. */
#define RUBY_INTERNAL_EVENT_GC_EXIT        0x4000000 /**< `gc_exit()` is called. */

Here's how to register an internal tracepoint event listener:

    VALUE tpval = rb_tracepoint_new(0, RUBY_INTERNAL_EVENT_NEWOBJ | RUBY_INTERNAL_EVENT_FREEOBJ |
                                       RUBY_INTERNAL_EVENT_GC_START | RUBY_INTERNAL_EVENT_GC_END_MARK |
                                       RUBY_INTERNAL_EVENT_GC_END_SWEEP,
                                       tracepoint_track_objspace_events_i, &track);

One detail to keep in mind: despite TracePoint being a public Ruby class, it is not allowed to trace RUBY_INTERNAL_EVENT_* events. Some of these events fire at too high a frequency (e.g. NEWOBJ would be too slow to trace with Ruby code) or at sensitive times (GC events don't play well with Ruby objects being allocated during GC), so they are only available to native tracing, not to Ruby callbacks.
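
For concreteness, here is a minimal sketch of what a native listener like the `tracepoint_track_objspace_events_i` referenced above could look like (the function body and counter layout are assumptions for illustration, not the actual ddtrace implementation):

```c
#include <ruby/ruby.h>
#include <ruby/debug.h>

/* Illustrative global counters; a real implementation would store
 * these somewhere smarter (see the per-thread discussion below). */
static unsigned long newobj_seen  = 0;
static unsigned long freeobj_seen = 0;

static void
tracepoint_track_objspace_events_i(VALUE tpval, void *data)
{
    rb_trace_arg_t *tparg = rb_tracearg_from_tracepoint(tpval);

    /* Dispatch on which internal event fired. Do NOT allocate Ruby
     * objects in here: NEWOBJ and GC events fire at sensitive times. */
    switch (rb_tracearg_event_flag(tparg)) {
      case RUBY_INTERNAL_EVENT_NEWOBJ:  newobj_seen++;  break;
      case RUBY_INTERNAL_EVENT_FREEOBJ: freeobj_seen++; break;
      default: break; /* GC phase events would be handled here */
    }

    (void) data; /* the &track payload from the registration above */
}
```

Also note that the tracepoint returned by `rb_tracepoint_new` starts out disabled; it still has to be activated with `rb_tracepoint_enable(tpval)`.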

What should be done

We should create native tracepoints to collect object allocation (RUBY_INTERNAL_EVENT_NEWOBJ). These numbers should be collected on a per-thread basis, to reduce noise and allow measuring allocations within a single trace.
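
One possible shape for that per-thread counting, sketched under the assumption that a C11 thread-local is acceptable (Ruby threads are native threads since Ruby 1.9, and the NEWOBJ callback runs on the thread doing the allocating, so a thread-local counter is effectively per-Ruby-thread; the names are made up):

```c
#include <stdint.h>
#include <ruby/ruby.h>

/* Sketch only: one counter per native thread. */
static _Thread_local uint64_t allocations_this_thread = 0;

/* NEWOBJ-only callback: no event dispatch needed, just count. */
static void
count_newobj_i(VALUE tpval, void *data)
{
    (void) tpval;
    (void) data;
    allocations_this_thread++; /* runs on the allocating thread */
}
```

The tracer could then read this counter when a span starts and again when it finishes, attributing the difference to that span.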

The number of allocations should be stored on the top-level span.
Another option is to store this number on every span: if the overhead is not significant, this provides more value and makes it easier to pinpoint, for example, which span is creating an excessive number of objects.

Other considerations

As shown above, there are internal GC events as well.
As tempting as they may seem to collect, GC runs are global, and the public GC.stat method already gives us enough information to gather GC metrics. GC.stat should be used instead, unless we see value in collecting internal GC details such as mark and sweep events.
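
For reference, the same numbers `GC.stat` exposes in Ruby are also reachable from native code, so no internal GC tracepoints are needed for this. A small sketch (the helper name is made up):

```c
#include <ruby/ruby.h>

/* Hypothetical helper: rb_gc_stat mirrors Ruby's GC.stat. Passing a
 * Symbol key returns that single statistic; :count is the number of
 * GC runs so far. */
static size_t
gc_runs_so_far(void)
{
    return rb_gc_stat(ID2SYM(rb_intern("count")));
}
```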

marcotc added the feature-request label on Jul 19, 2022

delner commented Jul 20, 2022

Nice write-up!

  • My understanding was that TracePoint comes with a significant performance cost, even at the C level. What's your expectation on the overhead of this approach?
  • Regarding where we put allocation numbers: is top-level span useful enough? (Is this too "zoomed out"?) How expensive/plausible would it be to decorate each span instead?
  • How does the tracepoint event mechanically decorate the span in question? How does it retrieve the span? Does it have to access & increment the allocation count repeatedly?

ivoanjo added a commit that referenced this issue Feb 20, 2023
**What does this PR do?**:

This PR adds a new profiler public API:
`Datadog::Profiling.allocation_count`.

The public documentation for this API is as follows:

> Returns an ever-increasing counter of the number of allocations
> observed by the profiler in this thread.
>
> Note 1: This counter may not start from zero on new threads. It
> should only be used to measure how many
> allocations have happened between two calls to this API:
> ```ruby
> allocations_before = Datadog::Profiling.allocation_count
> do_some_work()
> allocations_after = Datadog::Profiling.allocation_count
> puts "Allocations during do_some_work: #{allocations_after - allocations_before}"
> ```
> (This is similar to some OS-based time representations.)
>
> Note 2: All fibers in the same thread will share the same counter
> values.
>
> Only available when the profiler is running, the new CPU Profiling
> 2.0 profiler is in use, and allocation-related
> features are not disabled via configuration.
> For instructions on enabling CPU Profiling 2.0 see the ddtrace
> release notes.

As long as CPU Profiling 2.0 is in use, this API is enabled by
default. To disable it, this PR adds a new setting:

```ruby
Datadog.configure do |c|
  c.profiling.advanced.allocation_counting_enabled = # ...
end
```

**Motivation**:

This feature has long been something we have wanted to provide with
ddtrace; see issues #2164 and #468, as well as PRs #1891, #1805, and #597.

As part of the ongoing work of enabling allocation profiling,
counting the number of allocations comes at very little cost, since
the profiler needs to have a `RUBY_INTERNAL_EVENT_NEWOBJ`
tracepoint anyway -- it's just a matter of also incrementing a
counter inside it.

**Additional Notes**:

Note that this does not yet change any user-visible feature for
ddtrace. I'm counting on @marcotc to pick up the task of using this
API to make some tracing magic :)

**How to test the change?**:

This change includes test coverage.

---

Fixes #2164