From c058db6b893109430c7dd646ce0ba06f92954c38 Mon Sep 17 00:00:00 2001 From: Daniel Nelson Date: Thu, 23 Aug 2018 10:47:13 -0700 Subject: [PATCH] Remerge data format docs --- docs/DATA_FORMATS_INPUT.md | 269 ++++++++++++++++++------------------- 1 file changed, 133 insertions(+), 136 deletions(-) diff --git a/docs/DATA_FORMATS_INPUT.md b/docs/DATA_FORMATS_INPUT.md index ea6d2f7670a31..7e57d9657aae1 100644 --- a/docs/DATA_FORMATS_INPUT.md +++ b/docs/DATA_FORMATS_INPUT.md @@ -12,7 +12,7 @@ Telegraf is able to parse the following input data formats into metrics: 1. [Grok](#grok) 1. [Logfmt](#logfmt) 1. [Wavefront](#wavefront) -1. [CSV] (#CSV) +1. [CSV](#csv) Telegraf metrics, like InfluxDB [points](https://docs.influxdata.com/influxdb/v0.10/write_protocols/line/), @@ -108,28 +108,28 @@ but can be overridden using the `name_override` config option. #### JSON Configuration: -The JSON data format supports specifying "tag_keys", "string_keys", and "json_query". -If specified, keys in "tag_keys" and "string_keys" will be searched for in the root-level -and any nested lists of the JSON blob. All int and float values are added to fields by default. -If the key(s) exist, they will be applied as tags or fields to the Telegraf metrics. +The JSON data format supports specifying "tag_keys", "string_keys", and "json_query". +If specified, keys in "tag_keys" and "string_keys" will be searched for in the root-level +and any nested lists of the JSON blob. All int and float values are added to fields by default. +If the key(s) exist, they will be applied as tags or fields to the Telegraf metrics. If "string_keys" is specified, the string will be added as a field. -The "json_query" configuration is a gjson path to an JSON object or -list of JSON objects. If this path leads to an array of values or -single data point an error will be thrown. If this configuration +The "json_query" configuration is a gjson path to an JSON object or +list of JSON objects. If this path leads to an array of values or +single data point an error will be thrown. If this configuration is specified, only the result of the query will be parsed and returned as metrics. The "json_name_key" configuration specifies the key of the field whos value will be added as the metric name. -Object paths are specified using gjson path format, which is denoted by object keys -concatenated with "." to go deeper in nested JSON objects. +Object paths are specified using gjson path format, which is denoted by object keys +concatenated with "." to go deeper in nested JSON objects. Additional information on gjson paths can be found here: https://github.com/tidwall/gjson#path-syntax -The JSON data format also supports extracting time values through the -config "json_time_key" and "json_time_format". If "json_time_key" is set, -"json_time_format" must be specified. The "json_time_key" describes the -name of the field containing time information. The "json_time_format" +The JSON data format also supports extracting time values through the +config "json_time_key" and "json_time_format". If "json_time_key" is set, +"json_time_format" must be specified. The "json_time_key" describes the +name of the field containing time information. The "json_time_format" must be a recognized Go time format. If there is no year provided, the metrics will have the current year. 
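For example, a minimal sketch of these time options (the command, field name, and layout below are illustrative, not taken from this document) could look like:

```toml
[[inputs.exec]]
  commands = ["/tmp/test.sh"]
  data_format = "json"

  ## Illustrative: read the timestamp from a "ts" field, giving the layout
  ## as a Go reference-time format (RFC3339 in this case).
  json_time_key = "ts"
  json_time_format = "2006-01-02T15:04:05Z07:00"
```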
More info on time formats can be found here: https://golang.org/pkg/time/#Parse @@ -162,8 +162,8 @@ For example, if you had this configuration: ## List of field names to extract from JSON and add as string fields # json_string_fields = [] - ## gjson query path to specify a specific chunk of JSON to be parsed with - ## the above configuration. If not specified, the whole file will be parsed. + ## gjson query path to specify a specific chunk of JSON to be parsed with + ## the above configuration. If not specified, the whole file will be parsed. ## gjson query paths are described here: https://github.com/tidwall/gjson#path-syntax # json_query = "" @@ -192,8 +192,8 @@ Your Telegraf metrics would get tagged with "my_tag_1" exec_mycollector,my_tag_1=foo a=5,b_c=6 ``` -If the JSON data is an array, then each element of the array is -parsed with the configured settings. Each resulting metric will +If the JSON data is an array, then each element of the array is +parsed with the configured settings. Each resulting metric will be output with the same timestamp. For example, if the following configuration: @@ -221,7 +221,7 @@ For example, if the following configuration: ## List of field names to extract from JSON and add as string fields # string_fields = [] - ## gjson query path to specify a specific chunk of JSON to be parsed with + ## gjson query path to specify a specific chunk of JSON to be parsed with ## the above configuration. If not specified, the whole file will be parsed # json_query = "" @@ -265,7 +265,7 @@ exec_mycollector,my_tag_1=foo,my_tag_2=baz b_c=6 1136387040000000000 exec_mycollector,my_tag_1=bar,my_tag_2=baz b_c=8 1168527840000000000 ``` -If you want to only use a specific portion of your JSON, use the "json_query" +If you want to only use a specific portion of your JSON, use the "json_query" configuration to specify a path to a JSON object. For example, with the following config: @@ -289,7 +289,7 @@ For example, with the following config: ## List of field names to extract from JSON and add as string fields string_fields = ["last"] - ## gjson query path to specify a specific chunk of JSON to be parsed with + ## gjson query path to specify a specific chunk of JSON to be parsed with ## the above configuration. If not specified, the whole file will be parsed json_query = "obj.friends" @@ -787,50 +787,6 @@ The best way to get acquainted with grok patterns is to read the logstash docs, which are available here: https://www.elastic.co/guide/en/logstash/current/plugins-filters-grok.html -#### Grok Configuration: -```toml -[[inputs.file]] - ## Files to parse each interval. - ## These accept standard unix glob matching rules, but with the addition of - ## ** as a "super asterisk". ie: - ## /var/log/**.log -> recursively find all .log files in /var/log - ## /var/log/*/*.log -> find all .log files with a parent dir in /var/log - ## /var/log/apache.log -> only tail the apache log file - files = ["/var/log/apache/access.log"] - - ## The dataformat to be read from files - ## Each data format has its own unique set of configuration options, read - ## more about them here: - ## https://github.com/influxdata/telegraf/blob/master/docs/DATA_FORMATS_INPUT.md - data_format = "grok" - - ## This is a list of patterns to check the given log file(s) for. - ## Note that adding patterns here increases processing time. The most - ## efficient configuration is to have one pattern. 
- ## Other common built-in patterns are: - ## %{COMMON_LOG_FORMAT} (plain apache & nginx access logs) - ## %{COMBINED_LOG_FORMAT} (access logs + referrer & agent) - grok_patterns = ["%{COMBINED_LOG_FORMAT}"] - - ## Full path(s) to custom pattern files. - grok_custom_pattern_files = [] - - ## Custom patterns can also be defined here. Put one pattern per line. - grok_custom_patterns = ''' - ''' - - ## Timezone allows you to provide an override for timestamps that - ## don't already include an offset - ## e.g. 04/06/2016 12:41:45 data one two 5.43µs - ## - ## Default: "" which renders UTC - ## Options are as follows: - ## 1. Local -- interpret based on machine localtime - ## 2. "Canada/Eastern" -- Unix TZ values like those found in https://en.wikipedia.org/wiki/List_of_tz_database_time_zones - ## 3. UTC -- or blank/unspecified, will return timestamp in UTC - grok_timezone = "Canada/Eastern" -``` - The grok parser uses a slightly modified version of logstash "grok" patterns, with the format: @@ -981,30 +937,133 @@ will be processed based on the current machine timezone configuration. Lastly, i timezone from the list of Unix [timezones](https://en.wikipedia.org/wiki/List_of_tz_database_time_zones), grok will offset the timestamp accordingly. +#### TOML Escaping + +When saving patterns to the configuration file, keep in mind the different TOML +[string](https://github.com/toml-lang/toml#string) types and the escaping +rules for each. These escaping rules must be applied in addition to the +escaping required by the grok syntax. Using the Multi-line line literal +syntax with `'''` may be useful. + +The following config examples will parse this input file: + +``` +|42|\uD83D\uDC2F|'telegraf'| +``` + +Since `|` is a special character in the grok language, we must escape it to +get a literal `|`. With a basic TOML string, special characters such as +backslash must be escaped, requiring us to escape the backslash a second time. + +```toml +[[inputs.file]] + grok_patterns = ["\\|%{NUMBER:value:int}\\|%{UNICODE_ESCAPE:escape}\\|'%{WORD:name}'\\|"] + grok_custom_patterns = "UNICODE_ESCAPE (?:\\\\u[0-9A-F]{4})+" +``` + +We cannot use a literal TOML string for the pattern, because we cannot match a +`'` within it. However, it works well for the custom pattern. +```toml +[[inputs.file]] + grok_patterns = ["\\|%{NUMBER:value:int}\\|%{UNICODE_ESCAPE:escape}\\|'%{WORD:name}'\\|"] + grok_custom_patterns = 'UNICODE_ESCAPE (?:\\u[0-9A-F]{4})+' +``` + +A multi-line literal string allows us to encode the pattern: +```toml +[[inputs.file]] + grok_patterns = [''' + \|%{NUMBER:value:int}\|%{UNICODE_ESCAPE:escape}\|'%{WORD:name}'\| + '''] + grok_custom_patterns = 'UNICODE_ESCAPE (?:\\u[0-9A-F]{4})+' +``` + +#### Tips for creating patterns + +Writing complex patterns can be difficult, here is some advice for writing a +new pattern or testing a pattern developed [online](https://grokdebug.herokuapp.com). + +Create a file output that writes to stdout, and disable other outputs while +testing. This will allow you to see the captured metrics. Keep in mind that +the file output will only print once per `flush_interval`. + +```toml +[[outputs.file]] + files = ["stdout"] +``` + +- Start with a file containing only a single line of your input. +- Remove all but the first token or piece of the line. +- Add the section of your pattern to match this piece to your configuration file. +- Verify that the metric is parsed successfully by running Telegraf. +- If successful, add the next token, update the pattern and retest. 
+- Continue one token at a time until the entire line is successfully parsed. + +# Logfmt +This parser implements the logfmt format by extracting and converting key-value pairs from log text in the form `=`. +At the moment, the plugin will produce one metric per line and all keys +are added as fields. +A typical log +``` +method=GET host=influxdata.org ts=2018-07-24T19:43:40.275Z +connect=4ms service=8ms status=200 bytes=1653 +``` +will be converted into +``` +logfmt method="GET",host="influxdata.org",ts="2018-07-24T19:43:40.275Z",connect="4ms",service="8ms",status=200i,bytes=1653i + +``` +Additional information about the logfmt format can be found [here](https://brandur.org/logfmt). + +# Wavefront: + +Wavefront Data Format is metrics are parsed directly into Telegraf metrics. +For more information about the Wavefront Data Format see +[here](https://docs.wavefront.com/wavefront_data_format.html). + +There are no additional configuration options for Wavefront Data Format line-protocol. + +#### Wavefront Configuration: + +```toml +[[inputs.exec]] + ## Commands array + commands = ["/tmp/test.sh", "/usr/bin/mycollector --foo=bar"] + + ## measurement name suffix (for separating different commands) + name_suffix = "_mycollector" + + ## Data format to consume. + ## Each data format has its own unique set of configuration options, read + ## more about them here: + ## https://github.com/influxdata/telegraf/blob/master/docs/DATA_FORMATS_INPUT.md + data_format = "wavefront" +``` + # CSV Parse out metrics from a CSV formatted table. By default, the parser assumes there is no header and will read data from the first line. If `csv_header_row_count` is set to anything besides 0, the parser -will extract column names from the first number of rows. Headers of more than 1 row will have their +will extract column names from the first number of rows. Headers of more than 1 row will have their names concatenated together. Any unnamed columns will be ignored by the parser. The `csv_skip_rows` config indicates the number of rows to skip before looking for header information or data to parse. By default, no rows will be skipped. -The `csv_skip_columns` config indicates the number of columns to be skipped before parsing data. These +The `csv_skip_columns` config indicates the number of columns to be skipped before parsing data. These columns will not be read out of the header. Naming with the `csv_column_names` will begin at the first parsed column after skipping the indicated columns. By default, no columns are skipped. -To assign custom column names, the `csv_column_names` config is available. If the `csv_column_names` -config is used, all columns must be named as additional columns will be ignored. If `csv_header_row_count` +To assign custom column names, the `csv_column_names` config is available. If the `csv_column_names` +config is used, all columns must be named as additional columns will be ignored. If `csv_header_row_count` is set to 0, `csv_column_names` must be specified. Names listed in `csv_column_names` will override names extracted from the header. The `csv_tag_columns` and `csv_field_columns` configs are available to add the column data to the metric. -The name used to specify the column is the name in the header, or if specified, the corresponding +The name used to specify the column is the name in the header, or if specified, the corresponding name assigned in `csv_column_names`. If neither config is specified, no data will be added to the metric. 
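As a quick sketch of these column options (the file path and column names are illustrative, and the exact output may differ), a table with a single header row could be configured as:

```toml
[[inputs.file]]
  files = ["/tmp/metrics.csv"]
  data_format = "csv"

  ## The file is assumed to look like:
  ##   host,region,value
  ##   server01,us-east,42
  csv_header_row_count = 1

  ## "host" and "region" become tags; the remaining "value" column would be
  ## expected to come through as a field.
  csv_tag_columns = ["host", "region"]
```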
-Additional configs are available to dynamically name metrics and set custom timestamps. If the -`csv_column_names` config is specified, the parser will assign the metric name to the value found +Additional configs are available to dynamically name metrics and set custom timestamps. If the +`csv_column_names` config is specified, the parser will assign the metric name to the value found in that column. If the `csv_timestamp_column` is specified, the parser will extract the timestamp from that column. If `csv_timestamp_column` is specified, the `csv_timestamp_format` must also be specified or an error will be thrown. @@ -1013,12 +1072,12 @@ or an error will be thrown. ```toml data_format = "csv" - ## Indicates how many rows to treat as a header. By default, the parser assumes + ## Indicates how many rows to treat as a header. By default, the parser assumes ## there is no header and will parse the first row as data. If set to anything more ## than 1, column names will be concatenated with the name listed in the next header row. ## If `csv_column_names` is specified, the column names in header will be overridden. # csv_header_row_count = 0 - + ## Indicates the number of rows to skip before looking for header information. # csv_skip_rows = 0 @@ -1061,65 +1120,3 @@ or an error will be thrown. ## this must be specified if `csv_timestamp_column` is specified # csv_timestamp_format = "" ``` - -#### Tips for creating patterns - -Writing complex patterns can be difficult, here is some advice for writing a -new pattern or testing a pattern developed [online](https://grokdebug.herokuapp.com). - -Create a file output that writes to stdout, and disable other outputs while -testing. This will allow you to see the captured metrics. Keep in mind that -the file output will only print once per `flush_interval`. - -```toml -[[outputs.file]] - files = ["stdout"] -``` - -- Start with a file containing only a single line of your input. -- Remove all but the first token or piece of the line. -- Add the section of your pattern to match this piece to your configuration file. -- Verify that the metric is parsed successfully by running Telegraf. -- If successful, add the next token, update the pattern and retest. -- Continue one token at a time until the entire line is successfully parsed. - -# Logfmt -This parser implements the logfmt format by extracting and converting key-value pairs from log text in the form `=`. -At the moment, the plugin will produce one metric per line and all keys -are added as fields. -A typical log -``` -method=GET host=influxdata.org ts=2018-07-24T19:43:40.275Z -connect=4ms service=8ms status=200 bytes=1653 -``` -will be converted into -``` -logfmt method="GET",host="influxdata.org",ts="2018-07-24T19:43:40.275Z",connect="4ms",service="8ms",status=200i,bytes=1653i - -``` -Additional information about the logfmt format can be found [here](https://brandur.org/logfmt). - -# Wavefront: - -Wavefront Data Format is metrics are parsed directly into Telegraf metrics. -For more information about the Wavefront Data Format see -[here](https://docs.wavefront.com/wavefront_data_format.html). - -There are no additional configuration options for Wavefront Data Format line-protocol. - -#### Wavefront Configuration: - -```toml -[[inputs.exec]] - ## Commands array - commands = ["/tmp/test.sh", "/usr/bin/mycollector --foo=bar"] - - ## measurement name suffix (for separating different commands) - name_suffix = "_mycollector" - - ## Data format to consume. 
- ## Each data format has its own unique set of configuration options, read - ## more about them here: - ## https://github.com/influxdata/telegraf/blob/master/docs/DATA_FORMATS_INPUT.md - data_format = "wavefront" -```
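Returning to the CSV parser described above, a final sketch combines a timestamp column with its required format (the path and column names are illustrative, and it is assumed here that `csv_timestamp_format` takes a Go reference-time layout):

```toml
[[inputs.file]]
  files = ["/tmp/metrics.csv"]
  data_format = "csv"

  csv_header_row_count = 1

  ## csv_timestamp_format must be set whenever csv_timestamp_column is set;
  ## the layout below is an assumption (Go reference time, RFC3339).
  csv_timestamp_column = "time"
  csv_timestamp_format = "2006-01-02T15:04:05Z07:00"
```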