Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Splunk Metrics serializer #4339

Merged
merged 13 commits into from
Sep 11, 2018

Conversation

ronnocol
Copy link
Contributor

This serializer properly formats the metrics data according to the Splunk metrics JSON specification, it can be used with any output that supports serializers, including file or HTTP. I decided that just formatting the data according to the Splunk spec was a better investment than managing the whole output pipeline. Now you can use telegraf without requiring a HEC (by using a file output), or you can use the well tested and maintained HTTP output and just set the extra headers required by the HEC.

Please see the docs in README.md

Required for all PRs:

  • [ x ] Signed CLA.
  • [ x ] Associated README.md updated.
  • [ x ] Has appropriate unit tests.

in a format that can be consumed by a Splunk metrics index.

Can be used with any output including file or HTTP.

Please see the docs in README.md
obj["_value"] = v

dataGroup.Event = "metric"
dataGroup.Time = float64(metric.Time().UnixNano() / int64(s.TimestampUnits))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I see what you're intending to do here, but this won't result in the correct behavior. https://play.golang.org/p/VEDBzbF0j0e

If you want to do math between 2 ints, and get a float result, you need to convert the int to float before the arithmetic.

return "", err
}

metricString = string(metricJson)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Picking on the string vs byte slice thing some more, you're converting a byte slice to a string, and then later on converting that string back into a byte slice (up on line 50). It would use less memory to keep it as a byte slice.

}

for _, m := range objects {
serialized = serialized + m + "\n"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be better if serialized were a byte slice. Every time you append to a string, a new string has to be allocated. If your batches are large (in terms of count or size), this can suck up a lot of memory. With byte slice, an allocation is performed only if the length of the slice exceeds the capacity.
Ditto for other places where this is stored in a string (e.g. line 23).

Also I'm not seeing much point of the objects slice. You should be able to append to serialized directly.

return []byte(serialized), nil
}

func (s *serializer) SerializeBatch(metrics []telegraf.Metric) ([]byte, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SerializeBatch seems like it's just a reinvention of json.Encoder. It might be better to use the existing standard lib functionality. It should also simplify your code a lot.


for _, metric := range metrics {
m, err := s.createObject(metric)
if err == nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error should be logged somewhere when not nil. Otherwise metrics are going to be getting dropped and the user won't have any idea why.

@@ -73,6 +74,8 @@ func NewSerializer(config *Config) (Serializer, error) {
serializer, err = NewGraphiteSerializer(config.Prefix, config.Template, config.GraphiteTagSupport)
case "json":
serializer, err = NewJsonSerializer(config.TimestampUnits)
case "splunkmetric":
serializer, err = NewSplunkmetricSerializer(config.TimestampUnits)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Splunk only supports seconds. The serializer should not be using config.TimestampUnits.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Splunk absolutely support sub second timestamps. From the default datetime.xml:

<define name="_utcepoch" extract="utcepoch, subsecond">
        <!-- update regex before '2017' -->
    <text><![CDATA[((?<=^|[\s#,"=\(\[\|\{])(?:1[012345]|9)\d{8}|^@[\da-fA-F]{16,24})(?:\.?(\d{1,6}))?(?![\d\(])]]></text>
</define>```

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't say it doesn't. It only supports seconds as the unit. It supports more than that as the precision. Unit != precision.


func (s *serializer) createObject(metric telegraf.Metric) (metricString string, err error) {

/* Splunk supports one metric per line and has the following required names:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not entirely accurate. Splunk's http event collector is not line-oriented. It's object oriented. You can shove multiple JSON objects on a single line and it's happy to consume them.
See this example: https://docs.splunk.com/Documentation/Splunk/7.1.1/Data/HTTPEventCollectortokenmanagement#Send_multiple_events_to_HEC_in_one_request

{"event": "Pony 1 has left the barn"}{"event": "Pony 2 has left the barn"}{"event": "Pony 3 has left the barn", "nested": {"key1": "value1"}}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, it's not line oriented, but it doesn't support reading an array of JSON objects as metrics. This is OK:

{"event": "Pony 1 has left the barn"}{"event": "Pony 2 has left the barn"}{"event": "Pony 3 has left the barn", "nested": {"key1": "value1"}}

This is not OK:

[{"event": "Pony 1 has left the barn"}{"event": "Pony 2 has left the barn"}{"event": "Pony 3 has left the barn", "nested": {"key1": "value1"}}]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My point was that developer documentation should be accurate. We shouldn't tell the developers it's "one metric per line" if it's not.

** metric_name: The name of the metric
** _value: The value for the metric
** _time: The timestamp for the metric
** All other index fields become deminsions.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally don't think this is a good structure to follow. This will prevent using this serializer with pretty much every single input plugin, as none of them follow this format. I think the implementation over on #4185 has a much more flexible implementation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment is wrong, it should be time, not _time (I'll fix), but metric_name and _value are required fields for the metrics store. Naming the fields this way means you don't need custom props.conf or transforms.conf. In fact this is the same naming convention that is used in the PR you reference. The point of this serializer is to format the metrics into this format so that it can be used with the generic, well tested, HTTP output. Or the file output if you're running telegraf on a machine that is running a Splunk forwarder.

@phemmer
Copy link
Contributor

phemmer commented Jun 24, 2018

This is now our 3rd splunk output proposal in a few days :-/
(ok, maybe not a "few" days, but within a relatively short time frame)
See also:
#4185
#4300

@phemmer phemmer mentioned this pull request Jun 24, 2018
3 tasks
These updates cleanup a lot of the handling of the data and allows
to configure if the output should be in a HEC compatible format or
a file/stdout format. Additional documentation files also updated.
@ronnocol
Copy link
Contributor Author

@phemmer Thank you for the comprehensive review, I appreciate all of the feedback and believe I have addressed all/most of the issues with this latest commit.

I know that there were some PRs for Splunk HEC outputs, but this was specifically written as a serializer so that you could have Splunk compatible metrics generated for any reasonable output (e.g. file or HTTP.) Furthermore, it will allow you to easily redirect different types of metrics to different metrics indexes, which the output based PRs do not allow for. The output based PRs also require that you have/deploy a HEC to be used to get metrics into the Splunk infrastructure vs. being able to use existing infrastructure (forwarders) to get the data into a metrics index.

@dkulig
Copy link

dkulig commented Jul 19, 2018

Looks like it doesn't work properly with the zookeeper input,

2018-07-19T12:50:41Z D! Attempting connection to output: file
2018-07-19T12:50:41Z D! Successfully connected to output: file
2018-07-19T12:50:41Z I! Starting Telegraf v1.8.0~caf97dc5
2018-07-19T12:50:41Z I! Loaded inputs: inputs.zookeeper
2018-07-19T12:50:41Z I! Loaded aggregators:
2018-07-19T12:50:41Z I! Loaded processors:
2018-07-19T12:50:41Z I! Loaded outputs: file
2018-07-19T12:50:41Z I! Tags enabled: host=mbpdk
2018-07-19T12:50:41Z I! Agent Config: Interval:10s, Quiet:false, Hostname:"mbpdk", Flush Interval:10s
2018-07-19T12:50:51Z D! Output [file] buffer fullness: 1 / 10000 metrics.
2018-07-19T12:50:51Z E! [serializer.splunkmetric] Dropping invalid metric
2018-07-19T12:50:51Z E! Error writing to output [file]: failed to serialize message: can not parse value

my config:

[[outputs.file]]
   files = ["stdout"]
   data_format = "splunkmetric"
   hec_routing = false

[[inputs.zookeeper]]
  servers = [":2181"]

… numeric.

Some inputs return non-numeric values (e.g. zookeeper returns a version string)
Splunk requires that metrics be numeric, but previously the serializer would drop
the entire input vs. just the offending metric. Now just drop the offencing key/value
and improve error logging:
2018-07-20T02:37:00Z E! Can not parse value: 3.4.12-e5259e437540f349646870ea94dc2658c4e44b3b for key: version
@arohter
Copy link

arohter commented Aug 8, 2018

This looks great, and 👍 for leveraging the existing http plugin. Can we get this merged and into an official release?


```toml
[[outputs.http]]
# ## URL is the address to send metrics to
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need for the initial # through line 252


```toml
[[outputs.http]]
# ## URL is the address to send metrics to
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again, leading # isn't necessary

An example configuration of a file based output is:

```toml
# # Send telegraf metrics to file(s)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same #


import (
"encoding/json"
// "errors"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove this commented import

m, err := s.createObject(metric)
if err != nil {
log.Printf("E! [serializer.splunkmetric] Dropping invalid metric: %v [%v]", metric, m)
return []byte(""), err
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pedantic, but maybe return nil, err here

Fixup README files as requested in review
Copy link
Contributor

@danielnelson danielnelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should unify the documentation into just a single place. Let's link from the DATA_FORMATS_OUTPUT documents to the README and sometime before 1.8 is final we can move the other serializers documentation to follow suite.

@ronnocol
Copy link
Contributor Author

ronnocol commented Sep 3, 2018

@danielnelson For the sake of clarity, you want me to remove the text in DATA_FORMATS_OUTPUT and replace it with a link to the README. Possibly merging some of the text from that I put in the DATA_FORMAT_OUTPUT file into the README if additional clarity is required.

@danielnelson
Copy link
Contributor

That's right, thank you

Copy link
Contributor

@danielnelson danielnelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great job on the docs, thanks for taking care of that for me. I took another look over the pull request and noticed a few additional things we should discuss before merging:

plugins/serializers/registry.go Show resolved Hide resolved
## https://github.com/influxdata/telegraf/blob/master/docs/DATA_FORMATS_OUTPUT.md
data_format = "splunkmetric"
## Provides time, index, source overrides for the HEC
hec_routing = true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you rename this variable splunk_hec_routing, due to how the parser variables are injected directly into the plugin we have started prefixing the variables. Long term, these will probably become tables and the parsers will become more independent.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will do

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to keep things consistent, lets call it splunkmetric_hec_routing

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, will do.


m, err := s.createObject(metric)
if err != nil {
log.Printf("D! [serializer.splunkmetric] Dropping invalid metric: %v [%v]", metric, m)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should return this as an error: return nil, fmt.Errorf("Dropping invalid metric: %s", metric.Name())

for _, metric := range metrics {
m, err := s.createObject(metric)
if err != nil {
log.Printf("D! [serializer.splunkmetric] Dropping invalid metric: %v [%v]", metric, m)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should return error. If there are situations where a metric cannot be serialized but it is not an error, have createObject return nil, nil and check if m is nil before appending.

for k, v := range metric.Fields() {

if !verifyValue(v) {
log.Printf("D! Can not parse value: %v for key: %v", v, k)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think here we should just continue on the next field. Otherwise we will never be able to serialize a metric with a string field.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Will include in the other changes being made.


dataGroup := HECTimeSeries{}

for k, v := range metric.Fields() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use metric.FieldList() so that no allocation is performed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Part of the refactoring mentioned below (thank for the suggestion)


dataGroup.Event = "metric"
// Convert ns to float seconds since epoch.
dataGroup.Time = float64(metric.Time().UnixNano()) / float64(1000000000)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

metric.Time().Unix() will give you unix time in seconds.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't want seconds, I want ms but as a float (this is Splunk's spec...) e.g. 1536295485.123

}

func (s *serializer) createObject(metric telegraf.Metric) (metricJson []byte, err error) {

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do a bad job of explaining what is guaranteed in a telegraf.Metric, and actually as I write this I notice some additional checks I need to add.

Here is a brief rundown of things you may or may not need to check:

  • metric name may be an empty string
  • zero or more tags, tag keys are any non-empty string, tag values may be empty strings
  • zero, yes zero, or more fields, field keys are any non-empty string, field values may be any float64 (including NaN, +/-Inf),int64,uint64,string,bool
  • time is any time.Time.

The part about tag/field keys not being empty strings is not true right now, but after writing this I am going to ensure this is the case in 1.8.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this actually caused me to re-examine the the serializer with several different inputs, and I found a case in which metrics were lost (dropped), so I'm refactoring some of the code to deal with that. (Also, it's been a crazy week...so hope to get this done over the next few days.)

@@ -0,0 +1,139 @@
# Splunk Metrics serialzier
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

spelling

@danielnelson
Copy link
Contributor

@ronnocol We are hoping to release 1.8.0-rc1 on Wednesday afternoon, let me know if you think that could be a problem for this pr.

Tested http input to Splunk 7.1.2 (w/ hec routing)
Verified output of file output.
@ronnocol
Copy link
Contributor Author

@danielnelson I believe I have resolved all of your requested changes as well as resolved an issue where metrics might have been dropped. I know that you want to cut an RC1 on Wednesday, I believe that this serializer is ready to be included in that.

@danielnelson danielnelson merged commit c80aab0 into influxdata:master Sep 11, 2018
rgitzel pushed a commit to rgitzel/telegraf that referenced this pull request Oct 17, 2018
otherpirate pushed a commit to otherpirate/telegraf that referenced this pull request Mar 15, 2019
otherpirate pushed a commit to otherpirate/telegraf that referenced this pull request Mar 15, 2019
dupondje pushed a commit to dupondje/telegraf that referenced this pull request Apr 22, 2019
athoune pushed a commit to bearstech/telegraf that referenced this pull request Apr 17, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants