From c058db6b893109430c7dd646ce0ba06f92954c38 Mon Sep 17 00:00:00 2001 From: Daniel Nelson Date: Thu, 23 Aug 2018 10:47:13 -0700 Subject: [PATCH] Remerge data format docs --- docs/DATA_FORMATS_INPUT.md | 269 ++++++++++++++++++------------------- 1 file changed, 133 insertions(+), 136 deletions(-) diff --git a/docs/DATA_FORMATS_INPUT.md b/docs/DATA_FORMATS_INPUT.md index ea6d2f7670a31..7e57d9657aae1 100644 --- a/docs/DATA_FORMATS_INPUT.md +++ b/docs/DATA_FORMATS_INPUT.md @@ -12,7 +12,7 @@ Telegraf is able to parse the following input data formats into metrics: 1. [Grok](#grok) 1. [Logfmt](#logfmt) 1. [Wavefront](#wavefront) -1. [CSV] (#CSV) +1. [CSV](#csv) Telegraf metrics, like InfluxDB [points](https://docs.influxdata.com/influxdb/v0.10/write_protocols/line/), @@ -108,28 +108,28 @@ but can be overridden using the `name_override` config option. #### JSON Configuration: -The JSON data format supports specifying "tag_keys", "string_keys", and "json_query". -If specified, keys in "tag_keys" and "string_keys" will be searched for in the root-level -and any nested lists of the JSON blob. All int and float values are added to fields by default. -If the key(s) exist, they will be applied as tags or fields to the Telegraf metrics. +The JSON data format supports specifying "tag_keys", "string_keys", and "json_query". +If specified, keys in "tag_keys" and "string_keys" will be searched for in the root-level +and any nested lists of the JSON blob. All int and float values are added to fields by default. +If the key(s) exist, they will be applied as tags or fields to the Telegraf metrics. If "string_keys" is specified, the string will be added as a field. -The "json_query" configuration is a gjson path to an JSON object or -list of JSON objects. If this path leads to an array of values or -single data point an error will be thrown. If this configuration +The "json_query" configuration is a gjson path to an JSON object or +list of JSON objects. If this path leads to an array of values or +single data point an error will be thrown. If this configuration is specified, only the result of the query will be parsed and returned as metrics. The "json_name_key" configuration specifies the key of the field whos value will be added as the metric name. -Object paths are specified using gjson path format, which is denoted by object keys -concatenated with "." to go deeper in nested JSON objects. +Object paths are specified using gjson path format, which is denoted by object keys +concatenated with "." to go deeper in nested JSON objects. Additional information on gjson paths can be found here: https://github.com/tidwall/gjson#path-syntax -The JSON data format also supports extracting time values through the -config "json_time_key" and "json_time_format". If "json_time_key" is set, -"json_time_format" must be specified. The "json_time_key" describes the -name of the field containing time information. The "json_time_format" +The JSON data format also supports extracting time values through the +config "json_time_key" and "json_time_format". If "json_time_key" is set, +"json_time_format" must be specified. The "json_time_key" describes the +name of the field containing time information. The "json_time_format" must be a recognized Go time format. If there is no year provided, the metrics will have the current year. 
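For example, a minimal sketch of these time options (the command, field name, and layout below are illustrative, not taken from this document) could look like:

```toml
[[inputs.exec]]
  commands = ["/tmp/test.sh"]
  data_format = "json"

  ## Illustrative: read the timestamp from a "ts" field, giving the layout
  ## as a Go reference-time format (RFC3339 in this case).
  json_time_key = "ts"
  json_time_format = "2006-01-02T15:04:05Z07:00"
```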
More info on time formats can be found here: https://golang.org/pkg/time/#Parse @@ -162,8 +162,8 @@ For example, if you had this configuration: ## List of field names to extract from JSON and add as string fields # json_string_fields = [] - ## gjson query path to specify a specific chunk of JSON to be parsed with - ## the above configuration. If not specified, the whole file will be parsed. + ## gjson query path to specify a specific chunk of JSON to be parsed with + ## the above configuration. If not specified, the whole file will be parsed. ## gjson query paths are described here: https://github.com/tidwall/gjson#path-syntax # json_query = "" @@ -192,8 +192,8 @@ Your Telegraf metrics would get tagged with "my_tag_1" exec_mycollector,my_tag_1=foo a=5,b_c=6 ``` -If the JSON data is an array, then each element of the array is -parsed with the configured settings. Each resulting metric will +If the JSON data is an array, then each element of the array is +parsed with the configured settings. Each resulting metric will be output with the same timestamp. For example, if the following configuration: @@ -221,7 +221,7 @@ For example, if the following configuration: ## List of field names to extract from JSON and add as string fields # string_fields = [] - ## gjson query path to specify a specific chunk of JSON to be parsed with + ## gjson query path to specify a specific chunk of JSON to be parsed with ## the above configuration. If not specified, the whole file will be parsed # json_query = "" @@ -265,7 +265,7 @@ exec_mycollector,my_tag_1=foo,my_tag_2=baz b_c=6 1136387040000000000 exec_mycollector,my_tag_1=bar,my_tag_2=baz b_c=8 1168527840000000000 ``` -If you want to only use a specific portion of your JSON, use the "json_query" +If you want to only use a specific portion of your JSON, use the "json_query" configuration to specify a path to a JSON object. For example, with the following config: @@ -289,7 +289,7 @@ For example, with the following config: ## List of field names to extract from JSON and add as string fields string_fields = ["last"] - ## gjson query path to specify a specific chunk of JSON to be parsed with + ## gjson query path to specify a specific chunk of JSON to be parsed with ## the above configuration. If not specified, the whole file will be parsed json_query = "obj.friends" @@ -787,50 +787,6 @@ The best way to get acquainted with grok patterns is to read the logstash docs, which are available here: https://www.elastic.co/guide/en/logstash/current/plugins-filters-grok.html -#### Grok Configuration: -```toml -[[inputs.file]] - ## Files to parse each interval. - ## These accept standard unix glob matching rules, but with the addition of - ## ** as a "super asterisk". ie: - ## /var/log/**.log -> recursively find all .log files in /var/log - ## /var/log/*/*.log -> find all .log files with a parent dir in /var/log - ## /var/log/apache.log -> only tail the apache log file - files = ["/var/log/apache/access.log"] - - ## The dataformat to be read from files - ## Each data format has its own unique set of configuration options, read - ## more about them here: - ## https://github.com/influxdata/telegraf/blob/master/docs/DATA_FORMATS_INPUT.md - data_format = "grok" - - ## This is a list of patterns to check the given log file(s) for. - ## Note that adding patterns here increases processing time. The most - ## efficient configuration is to have one pattern. 
- ## Other common built-in patterns are: - ## %{COMMON_LOG_FORMAT} (plain apache & nginx access logs) - ## %{COMBINED_LOG_FORMAT} (access logs + referrer & agent) - grok_patterns = ["%{COMBINED_LOG_FORMAT}"] - - ## Full path(s) to custom pattern files. - grok_custom_pattern_files = [] - - ## Custom patterns can also be defined here. Put one pattern per line. - grok_custom_patterns = ''' - ''' - - ## Timezone allows you to provide an override for timestamps that - ## don't already include an offset - ## e.g. 04/06/2016 12:41:45 data one two 5.43µs - ## - ## Default: "" which renders UTC - ## Options are as follows: - ## 1. Local -- interpret based on machine localtime - ## 2. "Canada/Eastern" -- Unix TZ values like those found in https://en.wikipedia.org/wiki/List_of_tz_database_time_zones - ## 3. UTC -- or blank/unspecified, will return timestamp in UTC - grok_timezone = "Canada/Eastern" -``` - The grok parser uses a slightly modified version of logstash "grok" patterns, with the format: @@ -981,30 +937,133 @@ will be processed based on the current machine timezone configuration. Lastly, i timezone from the list of Unix [timezones](https://en.wikipedia.org/wiki/List_of_tz_database_time_zones), grok will offset the timestamp accordingly. +#### TOML Escaping + +When saving patterns to the configuration file, keep in mind the different TOML +[string](https://github.com/toml-lang/toml#string) types and the escaping +rules for each. These escaping rules must be applied in addition to the +escaping required by the grok syntax. Using the Multi-line line literal +syntax with `'''` may be useful. + +The following config examples will parse this input file: + +``` +|42|\uD83D\uDC2F|'telegraf'| +``` + +Since `|` is a special character in the grok language, we must escape it to +get a literal `|`. With a basic TOML string, special characters such as +backslash must be escaped, requiring us to escape the backslash a second time. + +```toml +[[inputs.file]] + grok_patterns = ["\\|%{NUMBER:value:int}\\|%{UNICODE_ESCAPE:escape}\\|'%{WORD:name}'\\|"] + grok_custom_patterns = "UNICODE_ESCAPE (?:\\\\u[0-9A-F]{4})+" +``` + +We cannot use a literal TOML string for the pattern, because we cannot match a +`'` within it. However, it works well for the custom pattern. +```toml +[[inputs.file]] + grok_patterns = ["\\|%{NUMBER:value:int}\\|%{UNICODE_ESCAPE:escape}\\|'%{WORD:name}'\\|"] + grok_custom_patterns = 'UNICODE_ESCAPE (?:\\u[0-9A-F]{4})+' +``` + +A multi-line literal string allows us to encode the pattern: +```toml +[[inputs.file]] + grok_patterns = [''' + \|%{NUMBER:value:int}\|%{UNICODE_ESCAPE:escape}\|'%{WORD:name}'\| + '''] + grok_custom_patterns = 'UNICODE_ESCAPE (?:\\u[0-9A-F]{4})+' +``` + +#### Tips for creating patterns + +Writing complex patterns can be difficult, here is some advice for writing a +new pattern or testing a pattern developed [online](https://grokdebug.herokuapp.com). + +Create a file output that writes to stdout, and disable other outputs while +testing. This will allow you to see the captured metrics. Keep in mind that +the file output will only print once per `flush_interval`. + +```toml +[[outputs.file]] + files = ["stdout"] +``` + +- Start with a file containing only a single line of your input. +- Remove all but the first token or piece of the line. +- Add the section of your pattern to match this piece to your configuration file. +- Verify that the metric is parsed successfully by running Telegraf. +- If successful, add the next token, update the pattern and retest. 
+- Continue one token at a time until the entire line is successfully parsed. + +# Logfmt +This parser implements the logfmt format by extracting and converting key-value pairs from log text in the form `=`. +At the moment, the plugin will produce one metric per line and all keys +are added as fields. +A typical log +``` +method=GET host=influxdata.org ts=2018-07-24T19:43:40.275Z +connect=4ms service=8ms status=200 bytes=1653 +``` +will be converted into +``` +logfmt method="GET",host="influxdata.org",ts="2018-07-24T19:43:40.275Z",connect="4ms",service="8ms",status=200i,bytes=1653i + +``` +Additional information about the logfmt format can be found [here](https://brandur.org/logfmt). + +# Wavefront: + +Wavefront Data Format is metrics are parsed directly into Telegraf metrics. +For more information about the Wavefront Data Format see +[here](https://docs.wavefront.com/wavefront_data_format.html). + +There are no additional configuration options for Wavefront Data Format line-protocol. + +#### Wavefront Configuration: + +```toml +[[inputs.exec]] + ## Commands array + commands = ["/tmp/test.sh", "/usr/bin/mycollector --foo=bar"] + + ## measurement name suffix (for separating different commands) + name_suffix = "_mycollector" + + ## Data format to consume. + ## Each data format has its own unique set of configuration options, read + ## more about them here: + ## https://github.com/influxdata/telegraf/blob/master/docs/DATA_FORMATS_INPUT.md + data_format = "wavefront" +``` + # CSV Parse out metrics from a CSV formatted table. By default, the parser assumes there is no header and will read data from the first line. If `csv_header_row_count` is set to anything besides 0, the parser -will extract column names from the first number of rows. Headers of more than 1 row will have their +will extract column names from the first number of rows. Headers of more than 1 row will have their names concatenated together. Any unnamed columns will be ignored by the parser. The `csv_skip_rows` config indicates the number of rows to skip before looking for header information or data to parse. By default, no rows will be skipped. -The `csv_skip_columns` config indicates the number of columns to be skipped before parsing data. These +The `csv_skip_columns` config indicates the number of columns to be skipped before parsing data. These columns will not be read out of the header. Naming with the `csv_column_names` will begin at the first parsed column after skipping the indicated columns. By default, no columns are skipped. -To assign custom column names, the `csv_column_names` config is available. If the `csv_column_names` -config is used, all columns must be named as additional columns will be ignored. If `csv_header_row_count` +To assign custom column names, the `csv_column_names` config is available. If the `csv_column_names` +config is used, all columns must be named as additional columns will be ignored. If `csv_header_row_count` is set to 0, `csv_column_names` must be specified. Names listed in `csv_column_names` will override names extracted from the header. The `csv_tag_columns` and `csv_field_columns` configs are available to add the column data to the metric. -The name used to specify the column is the name in the header, or if specified, the corresponding +The name used to specify the column is the name in the header, or if specified, the corresponding name assigned in `csv_column_names`. If neither config is specified, no data will be added to the metric. 
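As a quick sketch of these column options (the file path and column names are illustrative, and the exact output may differ), a table with a single header row could be configured as:

```toml
[[inputs.file]]
  files = ["/tmp/metrics.csv"]
  data_format = "csv"

  ## The file is assumed to look like:
  ##   host,region,value
  ##   server01,us-east,42
  csv_header_row_count = 1

  ## "host" and "region" become tags; the remaining "value" column would be
  ## expected to come through as a field.
  csv_tag_columns = ["host", "region"]
```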
-Additional configs are available to dynamically name metrics and set custom timestamps. If the -`csv_column_names` config is specified, the parser will assign the metric name to the value found +Additional configs are available to dynamically name metrics and set custom timestamps. If the +`csv_column_names` config is specified, the parser will assign the metric name to the value found in that column. If the `csv_timestamp_column` is specified, the parser will extract the timestamp from that column. If `csv_timestamp_column` is specified, the `csv_timestamp_format` must also be specified or an error will be thrown. @@ -1013,12 +1072,12 @@ or an error will be thrown. ```toml data_format = "csv" - ## Indicates how many rows to treat as a header. By default, the parser assumes + ## Indicates how many rows to treat as a header. By default, the parser assumes ## there is no header and will parse the first row as data. If set to anything more ## than 1, column names will be concatenated with the name listed in the next header row. ## If `csv_column_names` is specified, the column names in header will be overridden. # csv_header_row_count = 0 - + ## Indicates the number of rows to skip before looking for header information. # csv_skip_rows = 0 @@ -1061,65 +1120,3 @@ or an error will be thrown. ## this must be specified if `csv_timestamp_column` is specified # csv_timestamp_format = "" ``` - -#### Tips for creating patterns - -Writing complex patterns can be difficult, here is some advice for writing a -new pattern or testing a pattern developed [online](https://grokdebug.herokuapp.com). - -Create a file output that writes to stdout, and disable other outputs while -testing. This will allow you to see the captured metrics. Keep in mind that -the file output will only print once per `flush_interval`. - -```toml -[[outputs.file]] - files = ["stdout"] -``` - -- Start with a file containing only a single line of your input. -- Remove all but the first token or piece of the line. -- Add the section of your pattern to match this piece to your configuration file. -- Verify that the metric is parsed successfully by running Telegraf. -- If successful, add the next token, update the pattern and retest. -- Continue one token at a time until the entire line is successfully parsed. - -# Logfmt -This parser implements the logfmt format by extracting and converting key-value pairs from log text in the form `=`. -At the moment, the plugin will produce one metric per line and all keys -are added as fields. -A typical log -``` -method=GET host=influxdata.org ts=2018-07-24T19:43:40.275Z -connect=4ms service=8ms status=200 bytes=1653 -``` -will be converted into -``` -logfmt method="GET",host="influxdata.org",ts="2018-07-24T19:43:40.275Z",connect="4ms",service="8ms",status=200i,bytes=1653i - -``` -Additional information about the logfmt format can be found [here](https://brandur.org/logfmt). - -# Wavefront: - -Wavefront Data Format is metrics are parsed directly into Telegraf metrics. -For more information about the Wavefront Data Format see -[here](https://docs.wavefront.com/wavefront_data_format.html). - -There are no additional configuration options for Wavefront Data Format line-protocol. - -#### Wavefront Configuration: - -```toml -[[inputs.exec]] - ## Commands array - commands = ["/tmp/test.sh", "/usr/bin/mycollector --foo=bar"] - - ## measurement name suffix (for separating different commands) - name_suffix = "_mycollector" - - ## Data format to consume. 
- ## Each data format has its own unique set of configuration options, read - ## more about them here: - ## https://github.com/influxdata/telegraf/blob/master/docs/DATA_FORMATS_INPUT.md - data_format = "wavefront" -```
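Returning to the CSV parser described above, a final sketch combines a timestamp column with its required format (the path and column names are illustrative, and it is assumed here that `csv_timestamp_format` takes a Go reference-time layout):

```toml
[[inputs.file]]
  files = ["/tmp/metrics.csv"]
  data_format = "csv"

  csv_header_row_count = 1

  ## csv_timestamp_format must be set whenever csv_timestamp_column is set;
  ## the layout below is an assumption (Go reference time, RFC3339).
  csv_timestamp_column = "time"
  csv_timestamp_format = "2006-01-02T15:04:05Z07:00"
```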