Feature Request - "Transform" Processor plugin #2667

Closed
PauloAugusto-Asos opened this issue Apr 13, 2017 · 14 comments
Labels
feature request Requests for new plugin and for new features to existing plugins

@PauloAugusto-Asos

Feature Request

Requesting a "Transform" processor plugin.

I am trying to import Web access logs into InfluxDB with Telegraf. However, some of the URL PATHs include identifiers (session IDs, product IDs, etc). Ex:
/products/cars/12345/view
/shoppingBasket/1234567890/view

The URL PATH is being shipped as a Tag Value (obviously). I need to be able to replace those identifiers in the PATH Tag Value before shipping the data to Influx (or whatever other DB), so that they become easily recognizable as the «same» URL PATH for searches and aggregations, and to prevent an explosion of "series" in InfluxDB or Graphite.

Proposal:

[[processors.transformer]]
tagpass = "ApacheLog"
tagname = "path"
matcher = "/products/cars/(\\d+)/view/"
matchertype = "regex" # "literal"
replaceMatchedIndex = 1 # index 0 being the whole match. To replace *only* the ID
replacement = "{CarID}"
tagexclude = "ApacheLog"
[[processors.transformer]]
tagpass = "ApacheLog"
tagname = "path"
matcher = "/shoppingBasket/(\\d+)/view"
matchertype = "regex" # literal
replaceMatchedIndex = 1
replacement = "{SessionID}"
tagexclude = "ApacheLog"

Simpler Proposal:

[[processors.transformer]]
tagpass = "ApacheLog"
tagname = "path"
matcher = "/products/cars/\\d+/view/"
matchertype = "regex" # "literal"
# replaceMatchedIndex = 1
replacement = "/products/cars/{CarID}/view/"
tagexclude = "ApacheLog"

SimplerSimpler Proposal:

[[processors.transformer]]
tagpass = "ApacheLog"
tagname = "path"
replaceDigits = 3 # replace all sequences of X+ digits
replaceGuids = true
replaceTrimmedGuids = true # guids stripped of dashes
tagexclude = "ApacheLog"
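
For the last variant, a minimal Go sketch of how the three switches might behave, assuming replaceDigits = 3 means "runs of three or more digits" and that a GUID is the usual 8-4-4-4-12 hex layout. The function name normalizePath and the {guid} and {id} placeholders are assumptions for illustration only.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// GUIDs with dashes: 8-4-4-4-12 hex digits.
	guidRe = regexp.MustCompile(`[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}`)
	// GUIDs stripped of dashes: 32 hex digits in a row.
	trimmedGuidRe = regexp.MustCompile(`[0-9a-fA-F]{32}`)
	// Runs of 3 or more digits, mirroring replaceDigits = 3.
	digitsRe = regexp.MustCompile(`\d{3,}`)
)

// normalizePath is a hypothetical helper. GUIDs are replaced first because
// they usually contain digit runs that the digit rule would otherwise split.
func normalizePath(p string) string {
	p = guidRe.ReplaceAllString(p, "{guid}")
	p = trimmedGuidRe.ReplaceAllString(p, "{guid}")
	return digitsRe.ReplaceAllString(p, "{id}")
}

func main() {
	fmt.Println(normalizePath("/shoppingBasket/1234567890/view"))
	// prints "/shoppingBasket/{id}/view"
	fmt.Println(normalizePath("/session/123e4567-e89b-12d3-a456-426614174000/view"))
	// prints "/session/{guid}/view"
}
```

The order of the three replacements matters here, which is one argument for keeping explicit patterns in the config rather than fixed switches.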

@danielnelson
Contributor

I like it. First version seems best, I think we need captures. I suggest we change a few names:

  • matchertype -> transformation
  • matcher -> regex_pattern
  • literal -> replace

Some things to think about:

  • It would be nice if we could transform tag names, fieldkeys, and field values. How should we select these?
  • Which transformations should we try to fit into this processor, should it only be regex? I already know we need type conversions and enum matching. We should name the processor based on this decision.

@PauloAugusto-Asos
Author

PauloAugusto-Asos commented May 15, 2017

Hi Danielsan,

[...] transform tag names, fieldkeys, and field values. How should we select these?

We could select which element (field, tag, or?) we want to "transform" with:
tagname = "myField123"
Now that I think of it, "tagname" sounds incorrect, as it hints at InfluxDB "Tags" vs "Fields"...

Which transformations should we try to fit into this processor [...]?

I focused on this particular requirement of having to transform strings, but you're right, there are many other transformation requirements that can be thought of.

You're right - if the only thing this transform plugin does is to transform strings, it should be named something more specific like "string transform" plugin.

Regarding enums, what type of "enum matching" are you thinking of? Could we do a string transform per Enum option? Each transform would try to grab a specific string and if it matches - adds/replaces/inserts the enum string?

Other requirements might be mathematical transformations, like "/60", though that doesn't feel really required (we can / should be able to do transformation math on the queries to the database).


Regarding the terminology (mind you I'm not a native English speaker):

matchertype -> transformation
matcher -> regex_pattern
literal -> replace

If we call the element "transformation" it seems to me unclear that the element is referring to how it's going to match/grab. Erm, I don't know whether we're actually thinking the same thing?

My reasoning was:

tagname = "myField123"
^ meaning what field to try to transform.

matchertype = "regex" or matchertype = "literal"
^ meaning how am I going to try to match if the transform should occur.

matcher = "regex_pattern|or|literal_string"
^ meaning this is what it's going to try to find inside "myField123".

replaceMatchedIndex = 1
^ meaning it will try to replace only regex match group 1, instead of replacing the whole thing (or index 0 to replace the whole thing).

replacement = "{CarID}"
^ meaning to replace whatever was matched with this.

@danielnelson
Contributor

Regarding enums, what type of "enum matching" are you thinking of?

An example is mapping strings to ints "green" -> 0, "yellow" -> 1, "red" -> 2.
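
As a minimal sketch, assuming a plain lookup table is enough, the mapping could look like this in Go. The names statusLevels and mapEnum are hypothetical; the second return value says whether the input was a known enum value, so unknown strings can be left alone.

```go
package main

import "fmt"

// statusLevels holds the example mapping from the comment above.
var statusLevels = map[string]int64{"green": 0, "yellow": 1, "red": 2}

// mapEnum is a hypothetical helper: it returns the integer for a known
// string and false for anything that is not in the table.
func mapEnum(value string) (int64, bool) {
	n, ok := statusLevels[value]
	return n, ok
}

func main() {
	for _, v := range []string{"green", "yellow", "red", "blue"} {
		if n, ok := mapEnum(v); ok {
			fmt.Printf("%s -> %d\n", v, n)
		} else {
			fmt.Printf("%s -> (unmapped)\n", v)
		}
	}
}
```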

@danielnelson
Contributor

I think we should probably scope this to regular expressions, we can create separate processors for enums, type conversions, math, etc.

I'm also flip-flopping on backreferences. It seems like it's not too bad to not have them, or at least I'm not able to come up with a good example to justify the extra config complexity.

We could select tags/fields using subtables. Literal replacements could still be done with regex, so we wouldn't need a type option.

If we go with that an example config could be:

[[processors.regex]]
  namepass = ["apache"]

  [[processors.regex.tags]]
    key = "path"
    pattern = '/products/cars/\d+/view/'
    replacement = "/products/cars/{id}/view/"

  [[processors.regex.fields]]
    key = "path"
    pattern = '/products/cars/\d+/view/'
    replacement = "/products/cars/{id}/view/"

@PauloAugusto-Asos
Author

mapping strings to ints "green" -> 0, "yellow" -> 1, "red" -> 2.

That could be trivial, it seems:
(what type each field is, is specified separately, right?)

[[processors.transformer]]
tagname = "myStringField"
matcher = "^green$"
replacement = "0"

[[processors.transformer]]
tagname = "myStringField"
matcher = "^yellow$"
replacement = "1"

[[processors.transformer]]
tagname = "myStringField"
matcher = "^red$"
replacement = "2"


I think we should probably scope this to regular expressions
Literal replacements could still be done with regex,

^ For simplicity that would probably be the best, I agree. There's nothing we can do with "literal" match-strings that we can't do with Regexes.

I'm also flip-flopping on backreferences

^ Not sure what "backreferences" are...

We could select tags/fields using subtables.

^ Why can't we just search for a «column» regardless of whether it is an Influx "Tag" or "Field"? If we can be abstracted from that, that would be ideal...

an example config could be:
pattern = '/products/cars/\d+/view/'
replacement = "/products/cars/{id}/view/"

^ Could we still have the possibility of specifying the match-group? That would allow us to replace only parts of the original string. Use case example:
#Replace IDs - sequences of 3+ digits
pattern = '.+(\d{3,}).+'
replaceMatchedIndex = 1
replacement = "{id}"

@danielnelson
Contributor

On the enum/case example, this might be somewhat slow and somewhat verbose but perhaps it would meet the requirements. If we stick to string replacements you might need to follow it up with a type conversion, so that you get 0i instead of "0".

Not sure what "backreferences" are...

I was referring to capture groups and the replaceMatchedIndex here, more on that below.

Why can't we just search for a «column» regardless of whether it is an Influx "Tag" or "Field"?

It turns out you can have a tag and field with the same key: foo,value=bar value=42. It is a bad idea though, so maybe we shouldn't worry too much about it. Here is a mention of it in the docs, perhaps we should borrow this syntax? value::tag would specify the tag.

Could we still have the possibility of specifying the match-group?

Yeah I guess we should keep them. Perhaps backreferences in the replacement string could do the job:

pattern = '(.+)\d{3,}(.+)'
replacement = '\1{id}\2'

@PauloAugusto-Asos
Author

If we stick to string replacements you might need to follow it up with a type conversion, so that you get 0i instead of "0".

I honestly don't know this but: wouldn't we need to specify the type anyway, regardless? Meh - just thinking out loud - don't even bother answering me, you know better than me and I'm just raising the question.

backreferences in the replacement string could do the job:
pattern = '(.+)\d{3,}(.+)'
replacement = '\1{id}\2'

Ho, wow, and I thought I knew the gist of everything there was to know about Regexes... I had never heard of backreferences. Live and learn! That sounds really neat and it would solve the trick, indeed - you're right.

The only considerations I have about it are that the replacement would have to also be treated as a Regex (to access the backreferences), in which case we'd need to, for example, also escape the "{".
replacement = '\1\{id\}\2'
This might become a bit confusing/tricky, maybe catch people off guard thinking the replacement was a literal replacement.

Also, I'm wondering whether you can access the backreferences caught in the matching from the replacement in whatever Regex libraries (Go STD?) Telegraf is using.

Apart from those considerations I like the idea of backreferences - really cool feature of Regexes that I was unaware of.

@danielnelson
Contributor

wouldn't we need to specify the type anyway

I've just been thinking about this operating on strings so far. However, it would be possible to add a type option such as type = 'float' which would attempt to apply a conversion after transforming the string. If we do this we will have to consider what happens if the conversion fails.

The replacement string wouldn't be a regex, but would use the https://golang.org/pkg/regexp/#Regexp.ReplaceAll function to expand the replacement. I pasted the wrong syntax above; it looks like the Go format would be '$1{id}$2'. You wouldn't need to escape braces, but to enter a literal $ you would use $$.
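
A short Go sketch of this expansion syntax using the standard regexp package. The helper name expand is made up. One thing the sketch also shows, as an observation rather than anything stated in the thread: with the greedy pattern '(.+)\d{3,}(.+)' the first group swallows all but the last three digits, so a lazy first group '(.+?)' is used to replace the whole ID.

```go
package main

import (
	"fmt"
	"regexp"
)

// expand is a hypothetical convenience wrapper around ReplaceAllString,
// which expands $1, $2, ... in the replacement and turns $$ into a literal $.
func expand(pattern, replacement, s string) string {
	return regexp.MustCompile(pattern).ReplaceAllString(s, replacement)
}

func main() {
	path := "/products/cars/12345/view"

	// Greedy first group: only the last three digits are left for \d{3,}.
	fmt.Println(expand(`(.+)\d{3,}(.+)`, "$1{id}$2", path))
	// prints "/products/cars/12{id}/view"

	// Lazy first group: the whole digit run is replaced.
	fmt.Println(expand(`(.+?)\d{3,}(.+)`, "$1{id}$2", path))
	// prints "/products/cars/{id}/view"

	// $$ yields a literal dollar sign, followed here by group 1.
	fmt.Println(expand(`(\d+) USD`, "$$$1", "costs 5 USD"))
	// prints "costs $5"
}
```

Braces need no escaping because only $ is special in the replacement string. Writing ${1} instead of $1 is the safer form when the group reference is followed directly by letters, digits, or an underscore.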

@tbolon

tbolon commented Aug 7, 2017

This improvement could greatly simplify the tracking of IIS / aspnet apps, since they combine usage of IIS Site Name (text) and IIS Site Id (numeric) as tags, and manual mapping is necessary.

With such a feature, we could replace the IIS Site Id in tags with the IIS Site Name (per server) to ease the correlation of measurements.

@danielnelson
Contributor

@tbolon What plugin are you using to capture these stats? Can you give an example of the current and desired schema?

@tbolon

tbolon commented Aug 7, 2017

Currently win_perf_counters.

Some counters are returned with an internal id. Example:

  [[inputs.win_perf_counters.object]]
    # IIS, ASP.NET Applications
    ObjectName = "ASP.NET Applications"
    Counters = ["Requests/Sec"]
    Instances = ["*"]
    Measurement = "iis_aspnet_app"

And the corresponding output:

C:\Program Files\Telegraf>telegraf.exe --config "d:\telegraf.conf" --test
* Plugin: inputs.win_perf_counters, Collection 1
> iis_aspnet,instance=*,objectname=ASP.NET,dc=azure-westeu,host=frontweb3 Requests_Current=0 1502134454000000000
> iis_aspnet_app,instance=_LM_W3SVC_1_ROOT,objectname=ASP.NET\ Applications,dc=azure-westeu,host=xxx Requests_persec=0 1502134454000000000
> iis_aspnet_app,instance=_LM_W3SVC_14_ROOT,objectname=ASP.NET\ Applications,dc=azure-westeu,host=xxx Requests_persec=0 1502134454000000000
> iis_aspnet_app,instance=_LM_W3SVC_15_ROOT,objectname=ASP.NET\ Applications,dc=azure-westeu,host=xxx Requests_persec=0 1502134454000000000
> iis_aspnet_app,instance=_LM_W3SVC_3_ROOT,objectname=ASP.NET\ Applications,dc=azure-westeu,host=xxx Requests_persec=0 1502134454000000000
> iis_aspnet_app,instance=_LM_W3SVC_4_ROOT,objectname=ASP.NET\ Applications,dc=azure-westeu,host=xxx Requests_persec=0 1502134454000000000
> iis_aspnet_app,instance=_LM_W3SVC_5_ROOT,objectname=ASP.NET\ Applications,dc=azure-westeu,host=xxx Requests_persec=0 1502134454000000000
> iis_aspnet_app,instance=_LM_W3SVC_6_ROOT,objectname=ASP.NET\ Applications,dc=azure-westeu,host=xxx Requests_persec=0 1502134454000000000
> iis_aspnet_app,instance=_LM_W3SVC_7_ROOT,objectname=ASP.NET\ Applications,dc=azure-westeu,host=xxx Requests_persec=0 1502134454000000000
> iis_aspnet_app,instance=_LM_W3SVC_9_ROOT,objectname=ASP.NET\ Applications,host=xxx,dc=azure-westeu Requests_persec=0 1502134454000000000

The "instance" tag value can vary from server to server based on the order in which the websites were created, so, before sending them to InfluxDB, I would prefer to have a way to transform them to use a better name.

I only need a bunch of hardcoded replacements in my telegraf config: "_LM_W3SVC_9_ROOT" => "SomeWebsite", etc.
These IDs will never change (unless you delete/recreate websites).

I can't do such a thing on my dashboard, since the "_LM_W3SVC_9_ROOT" id can map to different sites based on the host.
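
A minimal Go sketch of this kind of per-host hardcoded mapping, assuming the instance values always follow the _LM_W3SVC_<id>_ROOT shape seen in the output above. The names instanceRe, siteNames and renameInstance are hypothetical, and the table would differ on each host.

```go
package main

import (
	"fmt"
	"regexp"
)

// instanceRe captures the numeric IIS site ID from the instance tag value.
var instanceRe = regexp.MustCompile(`^_LM_W3SVC_(\d+)_ROOT$`)

// siteNames is a hypothetical per-host table from site ID to site name.
var siteNames = map[string]string{"9": "SomeWebsite"}

// renameInstance returns the site name for a known ID and leaves every
// other value (unknown IDs, already readable names) unchanged.
func renameInstance(instance string) string {
	m := instanceRe.FindStringSubmatch(instance)
	if m == nil {
		return instance
	}
	if name, ok := siteNames[m[1]]; ok {
		return name
	}
	return instance
}

func main() {
	fmt.Println(renameInstance("_LM_W3SVC_9_ROOT"))  // prints "SomeWebsite"
	fmt.Println(renameInstance("_LM_W3SVC_14_ROOT")) // prints "_LM_W3SVC_14_ROOT"
}
```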

Other performance counters are already using IIS Site name as instance name:

  [[inputs.win_perf_counters.object]]
    # IIS, Web Service
    ObjectName = "Web Service"
    Counters = ["Current Connections"]
    Instances = ["*"]
    Measurement = "iis_websvc"

It will give the following output (mostly redacted):

C:\Program Files\Telegraf>telegraf.exe --config "d:\telegraf.conf" --test
* Plugin: inputs.win_perf_counters, Collection 1
> iis_websvc,instance=MyWebsite1,objectname=Web\ Service,dc=azure-westeu,host=xxx Current_Connections=0 1502134786000000000
> iis_websvc,dc=azure-westeu,host=xxx,instance=MyWebsite\ 2,objectname=Web\ Service Current_Connections=0 1502134786000000000
...

I hope this helps.

@fenneh

fenneh commented Aug 8, 2017

This would also be useful for the .Net Data Provider for SqlServer perf counters, as they include the PID of the .Net process connecting to SQL within the counter. Upon a service restart, you'll get a new PID and lose the link between the metrics.

@MonkeyDo

I am also interested in this feature.
My use case is to strip part of incoming mqtt topics with a regexp in the mqtt_consumer plugin.

44px mentioned this issue Feb 26, 2018
danielnelson added this to the 1.7.0 milestone May 21, 2018
danielnelson added the "feature request" label May 21, 2018
@danielnelson
Contributor

Will be included in 1.7, thanks to @44px!

I encourage everyone to give the regex processor a shot before the release.
