-
Notifications
You must be signed in to change notification settings - Fork 373
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[elasticsearch] Expand quantization, to cut out all request segments containing numbers #458
Conversation
ddb41b3
to
20eda5d
Compare
a90a367
to
cf26198
Compare
I'm not familiar with any changes like this within the Go tracer so I won't be able to express any productive views by the means of a review. The only thing that comes to mind is any potential performance considerations. But given that previously we were using multiple regexp and now we're just using one it looks better to me 😄 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some questions regarding our quantization strategy and test cases.
INDEX_PLACEHOLDER = '?'.freeze | ||
# Based on regexp from https://github.com/DataDog/dd-trace-java/blob/master/dd-trace-ot/src/main/java/datadog/opentracing/decorators/URLAsResourceName.java#L16 | ||
# Matches any path segments with numbers in them. | ||
CAPTURE_PATH_SEGMENTS_WITH_NUMBERS_REGEXP = %r{(?<=/)(?:[^/\d]*[\d]+[^?/]*)} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this replace all path segments which contain both letters and numbers? That might work for UUIDs, but it might be too zealous and quantize other meaningful path names. It's conceivable that there's a segment that has both letters and numbers which is not a UUID.
Would it be possible to just quantize the segment that trails or immediately precedes /_termvector
, if present? If that segment contains only letters and numbers within the UUID range?
Also, consider that look-behinds in Ruby can be expensive, so if there's a way to avoid that, that would be better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree its pretty zealous however, we can't have any way to predict if a string is not id with high cardinality...
Hmm now I thinking we're actually not zealous enough.
Because both type
and document
id's can actually unique ids with high cardinality, and they might as well be string based like uuid or imgur links https://imgur.com/Gi5tE or addresses like warehouse.mythic.mischievous
Its been working so far because most of document ids are probably only number based.
/_termvector
here is only a REST 'action' that can be executed on the document.
The ES Rest api seems pretty well defined. First three segments (/index/type/id
) are user settable and we should quantize them one way or the other.
Fun fact ids can have /
in them but they need to be URL escaped - luckily. I think its still something we should test for.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If all of these segments are well-known, and reliably appear in the same place, then we should be able to leverage that to target specific segments instead of looking at segment content and guessing what part it is.
By default, id
should always be quantized to ?
(integer, UUID, or otherwise), and all other parts (such as index
, type
, document
) should not. Its atypical to have high cardinality on those segments, even if it is possible.
For applications that do possess high cardinality in those segments, we should provide users the option to quantize them by configuring their tracer settings. Given our current quantize
option, I'd suggest that looks something like:
Datadog.configure do |c|
c.use :elasticsearch, quantize: {
# New section defining URL quantization options
url: {
# Usually :show by default
index: :show,
type: :show,
document: :show,
# Usually :hide by default
id: :hide
}
# Refactor existing quantize options into this body section
body: {
show: [],
exclude: []
}
}
end
If we decide to go with something like this, it'd be important to ensure backwards compatibility; any use show
or exclude
keys within quantize
at the top level should raise a deprecation warning, and be automatically remapped to the equivalent body
option.
Let me know your thoughts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking of something very similar. I like the configuration option you proposed, its seems cleaner than what I had in mind 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
However since document GET API that "causes" this problem is only one of a few, some more investigation is needed to ensure nothing will break.
@@ -9,7 +9,7 @@ def test_id | |||
assert_equal('/my/thing/?/', Datadog::Contrib::Elasticsearch::Quantize.format_url('/my/thing/1/')) | |||
assert_equal('/my/thing/?/is/cool', Datadog::Contrib::Elasticsearch::Quantize.format_url('/my/thing/1/is/cool')) | |||
assert_equal('/my/thing/??is=cool', Datadog::Contrib::Elasticsearch::Quantize.format_url('/my/thing/1?is=cool')) | |||
assert_equal('/my/thing/1two3/z', Datadog::Contrib::Elasticsearch::Quantize.format_url('/my/thing/1two3/z')) | |||
assert_equal('/my/thing/?/z', Datadog::Contrib::Elasticsearch::Quantize.format_url('/my/thing/1two3/z')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to see more test cases that try many more different possible combinations of UUIDs (all letters, some letters, no letters, letters out of range, etc), _termvector
(present, not present, present with trailing slash, etc), and other edge cases that we found were not working correctly. Right now, I don't trust the tests enough to verify that the bug was properly addressed, or that we're not introducing new ones.
- strip all segments with numbers in them - remove the query string
e16d9db
to
62eaec3
Compare
Add test without leading slash, cleanup a few other testcases
19cb32d
to
77e1172
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
This PR fixes quantization of URLs in ES client. It does so by replacing any path segment that has numbers in them with '?'.
dd-trace-rb/test/contrib/elasticsearch/quantize_test.rb
Line 12 in 20eda5d