Fix logging performance issues in Elasticsearch v2; rsyslog improvements
After upgrading to Elasticsearch v2, some of our logging tests were
consistently failing if they happened to follow a number of other
tests. The issue was that if an earlier test made a bunch of requests,
Elasticsearch wasn't indexing the data quickly enough, causing the
rsyslog queue to get backed up. This resulted in the requests from the
logging tests not getting indexed into Elasticsearch within the
expected time.

There were 2 main performance issues with Elasticsearch v2 that led to
this:

1. Elasticsearch v2 syncs the translog to disk on every request, which
   hurts indexing performance. We've switched things back to v1's mode
   of performing asynchronous syncs periodically (so we're trading some
   safety for speed, but that seems okay for this kind of log data).
2. Updating index mappings in Elasticsearch v2 is more costly. Our
   mapping is mostly static unless you enable logging of the
   "request_query" field, which stores all the request query params as
   a nested object. We have some tests that generate a bunch of unique,
   random query parameters (mainly for cache busting), which leads to a
   deluge of mapping updates, since each newly seen query param name
   requires another mapping update (see the sketch after this list).

   We recently disabled gathering this "request_query" field by
   default, but it was still enabled in our test suite, since some
   existing tests relied on this functionality. So to solve the
   performance issues, the test suite now disables gathering
   "request_query" too, which eliminates all the mapping updates during
   tests.

   The ability to enable "request_query" collection still exists, and
   the existing tests for this functionality have been retained (just
   in a more isolated fashion that won't impact other tests). However,
   given the performance issues enabling it can cause, it might be
   worth getting rid of this functionality altogether.
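
To illustrate the "request_query" mapping churn (the parameter names
here are hypothetical, not taken from an actual test), a request such
as:

    GET /api/example?format=json&cachebust_a8f3=1

gets logged with a dynamic nested object along these lines:

    "request_query": {
      "format": "json",
      "cachebust_a8f3": "1"
    }

Elasticsearch adds a field to the index mapping for each distinct key
under "request_query", so every never-before-seen parameter name (like
the random "cachebust_a8f3") forces another of the costly v2 mapping
updates described above.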

While debugging the performance issues, we've also made a couple of
tweaks and improvements to our rsyslog setup:

- The queue.size wasn't configured, so the memory portion of the queue
  was capped at the default of 1,000 messages. This meant the
  configured highwatermark and lowwatermark sizes weren't actually
  being used. We've set a higher queue.size to resolve this.
- Enabled the impstats plugin to output rsyslog queue stats every
  minute. This seems generally helpful to have in place, so we can see
  what rsyslog is up to and whether anything is becoming congested with
  logging (see the sample output below).
- Added more comments to explain some of the more cryptic rsyslog
  configuration settings.
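
With resetCounters="on", the new stats.log should contain a line per
queue each minute, roughly like this made-up sample (the exact counter
names and formatting vary by rsyslog version):

    rsyslogd-pstats: output-elasticsearch queue: size=4820 enqueued=51100 full=0 discarded.full=0 discarded.nf=0 maxqsize=9100

A steadily growing size, or a maxqsize near queue.highwatermark, is a
sign the output isn't keeping up with the incoming log volume.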
GUI committed Dec 15, 2016
1 parent 0c981a8 commit c3afad9
Showing 9 changed files with 434 additions and 158 deletions.
1 change: 1 addition & 0 deletions build/cmake/rsyslog.cmake
@@ -91,6 +91,7 @@ list(APPEND RSYSLOG_CONFIGURE_CMD --prefix=${INSTALL_PREFIX_EMBEDDED})
list(APPEND RSYSLOG_CONFIGURE_CMD --enable-liblogging-stdlog)
list(APPEND RSYSLOG_CONFIGURE_CMD --disable-libgcrypt)
list(APPEND RSYSLOG_CONFIGURE_CMD --enable-imptcp)
list(APPEND RSYSLOG_CONFIGURE_CMD --enable-impstats)
list(APPEND RSYSLOG_CONFIGURE_CMD --enable-mmjsonparse)
list(APPEND RSYSLOG_CONFIGURE_CMD --enable-mmutf8fix)
list(APPEND RSYSLOG_CONFIGURE_CMD --enable-elasticsearch)
7 changes: 7 additions & 0 deletions config/default.yml
@@ -162,6 +162,13 @@ elasticsearch:
  breaker:
    fielddata:
      limit: 60%
  index:
    translog:
      # Sync the data to disk asynchronously on a fixed interval, rather than
      # for every request. This significantly helps indexing throughput (at
      # the risk of losing a few seconds of data if things crash).
      durability: async
      sync_interval: 10s
analytics:
  adapter: elasticsearch
  timezone: UTC
132 changes: 114 additions & 18 deletions config/elasticsearch_templates.json
@@ -81,66 +81,162 @@
"type": "string",
"index": "not_analyzed"
},
"user_id": {
"backend_response_time": {
"type": "integer"
},
"gatekeeper_denied_code": {
"type": "string",
"index": "not_analyzed"
"analyzer": "keyword_lowercase"
},
"internal_gatekeeper_time": {
"type": "float"
},
"internal_response_time": {
"type": "float"
},
"proxy_overhead": {
"type": "integer"
},
"request_accept": {
"type": "string",
"analyzer": "keyword_lowercase"
},
"request_accept_encoding": {
"type": "string",
"analyzer": "keyword_lowercase"
},
"request_at": {
"type": "date"
},
"request_basic_auth_username": {
"type": "string",
"analyzer": "keyword_lowercase"
},
"request_connection": {
"type": "string",
"analyzer": "keyword_lowercase"
},
"request_content_type": {
"type": "string",
"analyzer": "keyword_lowercase"
},
"request_hierarchy": {
"type": "string",
"analyzer": "keyword_lowercase"
},
"request_host": {
"type": "string",
"analyzer": "keyword_lowercase"
},
"request_ip": {
"type": "string",
"index": "not_analyzed"
},
"request_ip_city": {
"type": "string",
"index": "not_analyzed"
},
"request_ip_country": {
"type": "string",
"index": "not_analyzed"
},
"request_ip_location": {
"type": "geo_point",
"lat_lon": true
},
"request_ip_region": {
"type": "string",
"index": "not_analyzed"
},
"request_ip_city": {
"request_method": {
"type": "string",
"index": "not_analyzed"
"analyzer": "keyword_lowercase"
},
"request_origin": {
"type": "string",
"analyzer": "keyword_lowercase"
},
"request_path": {
"type": "string",
"analyzer": "keyword_lowercase"
},
"request_path_hierarchy": {
"type": "string",
"analyzer": "path_hierarchy_lowercase"
},
"request_referer": {
"type": "string",
"analyzer": "keyword_lowercase"
},
"request_scheme": {
"type": "string",
"analyzer": "keyword_lowercase"
},
"request_size": {
"type": "integer"
},
"response_status": {
"type": "short"
"request_url": {
"type": "string",
"analyzer": "keyword_lowercase"
},
"request_user_agent": {
"type": "string",
"analyzer": "keyword_lowercase"
},
"request_user_agent_family": {
"type": "string",
"analyzer": "keyword_lowercase"
},
"request_user_agent_type": {
"type": "string",
"analyzer": "keyword_lowercase"
},
"response_age": {
"type": "integer"
},
"response_cache": {
"type": "string",
"analyzer": "keyword_lowercase"
},
"response_content_encoding": {
"type": "string",
"analyzer": "keyword_lowercase"
},
"response_content_length": {
"type": "integer"
},
"response_content_type": {
"type": "string",
"analyzer": "keyword_lowercase"
},
"response_server": {
"type": "string",
"analyzer": "keyword_lowercase"
},
"response_size": {
"type": "integer"
},
"response_time": {
"type": "integer"
"response_status": {
"type": "short"
},
"backend_response_time": {
"response_time": {
"type": "integer"
},
"proxy_overhead": {
"type": "integer"
"response_transfer_encoding": {
"type": "string",
"analyzer": "keyword_lowercase"
},
"internal_response_time": {
"type": "float"
"user_email": {
"type": "string",
"analyzer": "keyword_lowercase"
},
"internal_gatekeeper_time": {
"type": "float"
"user_id": {
"type": "string",
"index": "not_analyzed"
},
"request_ip_location": {
"type": "geo_point",
"lat_lon": true
"user_registration_source": {
"type": "string",
"analyzer": "keyword_lowercase"
}
}
}
1 change: 0 additions & 1 deletion config/test.yml
@@ -5,7 +5,6 @@ http_port: 9080
https_port: 9081
analytics:
  timezone: America/Denver
  log_request_url_query_params_separately: true
static_site:
  port: 13013
nginx:
45 changes: 41 additions & 4 deletions templates/etc/rsyslog.conf.mustache
@@ -4,6 +4,16 @@ global(maxMessageSize="32k")

global(workDirectory="{{db_dir}}/rsyslog")

# Output queue statistics periodically so the health of the queue can be
# inspected.
module(
load="impstats"
interval="60"
resetCounters="on"
log.syslog="off"
log.file="{{log_dir}}/rsyslog/stats.log"
)

# Load Modules
module(load="imtcp" MaxSessions="500")
module(load="mmjsonparse")
@@ -160,36 +170,59 @@ template(name="elasticsearch-json-record" type="subtree" subtree="$!usr!es")
# Output to ElasticSearch.
# A disk-assisted memory queue is used for buffering.
local0.info action(
name="output-elasticsearch"
type="omelasticsearch"
server="{{elasticsearch._first_server.host}}"
serverport="{{elasticsearch._first_server.port}}"
searchIndex="elasticsearch-index"
dynSearchIndex="on"
searchType="log"
template="elasticsearch-json-record"

# Enable bulk indexing, so batches of records are sent as a single HTTP
# request.
bulkmode="on"
bulkId="elasticsearch-id"
dynBulkId="on"

# Allow bulk indexing of batches *up to* this size.
#
# Note that as long as Elasticsearch is keeping up and data isn't being
# queued by rsyslog, then rsyslog will send data as quickly as it can to
# Elasticsearch (so the batch sizes might be much smaller). See
# http://www.gossamer-threads.com/lists/rsyslog/users/17550
queue.dequeuebatchsize="5000"

# Require indexing by all replica shards.
asyncrepl="off"

# For the in-memory queue, use a linked-list (so the memory doesn't have to
# be pre-allocated based on a fixed size).
queue.type="LinkedList"

# Set a filename, so the queue is disk assisted. This allows for offloading
# the data from the memory queue to disk if the queue becomes bigger than
# expected.
queue.filename="queue-elasticsearch"

# Set thresholds for when the memory queue is too big and should use the
# disk (note the disk queue size is not bounded by the queue.size, that only
# applies to the memory portion).
queue.size="15000"
queue.highwatermark="10000"
queue.lowwatermark="2000"

# Persist data to disk on this interval (in seconds). We're okay with some
# loss in the event of unexpected failures.
queue.checkpointinterval="10"

# Persist data to disk on graceful shutdowns.
queue.saveonshutdown="on"
# Set thresholds for when the memory queue is too big and should use the
# disk.
queue.highwatermark="10000"
queue.lowwatermark="2000"

# If Elasticsearch is inaccessible, retry on this interval (in seconds)
# indefinitely (so we don't stop logging to Elasticsearch in case it goes
# down for a longer period of time).
action.resumeInterval="30"
action.resumeRetryCount="-1"
)
{{/analytics._output_elasticsearch?}}
@@ -361,6 +394,7 @@ template(name="sql-json-record" type="subtree" subtree="$!usr!sql")
# Output to Kafka.
# A disk-assisted memory queue is used for buffering.
local0.info action(
name="output-kafka"
type="omkafka"
broker=[{{kafka._rsyslog_broker}}]
topic="{{kafka.topic}}"
@@ -370,6 +404,7 @@ local0.info action(
queue.filename="queue-kafka"
queue.checkpointinterval="10"
queue.saveonshutdown="on"
queue.size="15000"
queue.highwatermark="10000"
queue.lowwatermark="2000"
action.resumeRetryCount="-1"
@@ -381,13 +416,15 @@ template(name="all-json-record" type="list") {
property(name="$!raw") constant(value="\n")
}
local0.info action(
name="output-file"
type="omfile"
file="{{log_dir}}/rsyslog/requests.log.gz"
template="all-json-record"
queue.type="LinkedList"
queue.filename="queue-file"
queue.checkpointinterval="10"
queue.saveonshutdown="on"
queue.size="15000"
queue.highwatermark="10000"
queue.lowwatermark="2000"
zipLevel="3"