Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[elasticsearch] Expand quantization, to cut out all request segments containing numbers #458

Merged
merged 10 commits into from
Jun 18, 2018

Conversation

pawelchcki
Copy link
Contributor

@pawelchcki pawelchcki commented Jun 13, 2018

This PR fixes quantization of URLs in ES client. It does so by replacing any path segment that has numbers in them with '?'.

assert_equal('/my/thing/?/z', Datadog::Contrib::Elasticsearch::Quantize.format_url('/my/thing/1two3/z'))

@pawelchcki pawelchcki added the bug Involves a bug label Jun 13, 2018
@pawelchcki pawelchcki modified the milestones: 0.13.0, 0.12.2 Jun 13, 2018
@pawelchcki pawelchcki added the integrations Involves tracing integrations label Jun 13, 2018
@pawelchcki pawelchcki force-pushed the bugfix/elasticsearch_quantization_fix branch from ddb41b3 to 20eda5d Compare June 13, 2018 13:38
@pawelchcki pawelchcki requested a review from palazzem June 13, 2018 14:44
@pawelchcki pawelchcki force-pushed the bugfix/elasticsearch_quantization_fix branch from a90a367 to cf26198 Compare June 13, 2018 16:19
@palazzem
Copy link
Contributor

Let's ask for @delner and @gbbr reviews.

@palazzem palazzem requested a review from delner June 14, 2018 08:10
@palazzem palazzem removed their assignment Jun 14, 2018
@palazzem palazzem removed their request for review June 14, 2018 08:11
@gbbr
Copy link

gbbr commented Jun 14, 2018

I'm not familiar with any changes like this within the Go tracer so I won't be able to express any productive views by the means of a review. The only thing that comes to mind is any potential performance considerations. But given that previously we were using multiple regexp and now we're just using one it looks better to me 😄

Copy link
Contributor

@delner delner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some questions regarding our quantization strategy and test cases.

INDEX_PLACEHOLDER = '?'.freeze
# Based on regexp from https://github.com/DataDog/dd-trace-java/blob/master/dd-trace-ot/src/main/java/datadog/opentracing/decorators/URLAsResourceName.java#L16
# Matches any path segments with numbers in them.
CAPTURE_PATH_SEGMENTS_WITH_NUMBERS_REGEXP = %r{(?<=/)(?:[^/\d]*[\d]+[^?/]*)}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this replace all path segments which contain both letters and numbers? That might work for UUIDs, but it might be too zealous and quantize other meaningful path names. It's conceivable that there's a segment that has both letters and numbers which is not a UUID.

Would it be possible to just quantize the segment that trails or immediately precedes /_termvector, if present? If that segment contains only letters and numbers within the UUID range?

Also, consider that look-behinds in Ruby can be expensive, so if there's a way to avoid that, that would be better.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree its pretty zealous however, we can't have any way to predict if a string is not id with high cardinality...

Hmm now I thinking we're actually not zealous enough.

Because both type and document id's can actually unique ids with high cardinality, and they might as well be string based like uuid or imgur links https://imgur.com/Gi5tE or addresses like warehouse.mythic.mischievous
Its been working so far because most of document ids are probably only number based.

/_termvector here is only a REST 'action' that can be executed on the document.

The ES Rest api seems pretty well defined. First three segments (/index/type/id) are user settable and we should quantize them one way or the other.

Fun fact ids can have / in them but they need to be URL escaped - luckily. I think its still something we should test for.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If all of these segments are well-known, and reliably appear in the same place, then we should be able to leverage that to target specific segments instead of looking at segment content and guessing what part it is.

By default, id should always be quantized to ? (integer, UUID, or otherwise), and all other parts (such as index, type, document) should not. Its atypical to have high cardinality on those segments, even if it is possible.

For applications that do possess high cardinality in those segments, we should provide users the option to quantize them by configuring their tracer settings. Given our current quantize option, I'd suggest that looks something like:

Datadog.configure do |c|
  c.use :elasticsearch, quantize: {
    # New section defining URL quantization options
    url: {
      # Usually :show by default
      index: :show,
      type: :show,
      document: :show,
      # Usually :hide by default
      id: :hide
    }
    # Refactor existing quantize options into this body section
    body: {
      show: [],
      exclude: []
    }
  }
end

If we decide to go with something like this, it'd be important to ensure backwards compatibility; any use show or exclude keys within quantize at the top level should raise a deprecation warning, and be automatically remapped to the equivalent body option.

Let me know your thoughts.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking of something very similar. I like the configuration option you proposed, its seems cleaner than what I had in mind 👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However since document GET API that "causes" this problem is only one of a few, some more investigation is needed to ensure nothing will break.

@@ -9,7 +9,7 @@ def test_id
assert_equal('/my/thing/?/', Datadog::Contrib::Elasticsearch::Quantize.format_url('/my/thing/1/'))
assert_equal('/my/thing/?/is/cool', Datadog::Contrib::Elasticsearch::Quantize.format_url('/my/thing/1/is/cool'))
assert_equal('/my/thing/??is=cool', Datadog::Contrib::Elasticsearch::Quantize.format_url('/my/thing/1?is=cool'))
assert_equal('/my/thing/1two3/z', Datadog::Contrib::Elasticsearch::Quantize.format_url('/my/thing/1two3/z'))
assert_equal('/my/thing/?/z', Datadog::Contrib::Elasticsearch::Quantize.format_url('/my/thing/1two3/z'))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to see more test cases that try many more different possible combinations of UUIDs (all letters, some letters, no letters, letters out of range, etc), _termvector (present, not present, present with trailing slash, etc), and other edge cases that we found were not working correctly. Right now, I don't trust the tests enough to verify that the bug was properly addressed, or that we're not introducing new ones.

@delner delner modified the milestones: 0.12.2, 0.13.0 Jun 15, 2018
@pawelchcki pawelchcki force-pushed the bugfix/elasticsearch_quantization_fix branch from e16d9db to 62eaec3 Compare June 18, 2018 09:28
@pawelchcki pawelchcki changed the base branch from master to merge_master_into_0.13-dev June 18, 2018 10:51
Add test without leading slash, cleanup a few other testcases
@pawelchcki pawelchcki force-pushed the bugfix/elasticsearch_quantization_fix branch from 19cb32d to 77e1172 Compare June 18, 2018 11:51
Copy link
Contributor

@delner delner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@pawelchcki pawelchcki changed the base branch from merge_master_into_0.13-dev to 0.13-dev June 18, 2018 15:12
@pawelchcki pawelchcki merged commit b6851ca into 0.13-dev Jun 18, 2018
@pawelchcki pawelchcki deleted the bugfix/elasticsearch_quantization_fix branch June 18, 2018 15:15
@delner delner mentioned this pull request Jun 20, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Involves a bug integrations Involves tracing integrations
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants