
Serialize Monitoring Bulk Request Compressed #56410

Conversation

original-brownbear
Member

Even with the changes from #48854 we're still seeing significant (tens to hundreds of MB) buffer usage for bulk exports in some cases, which destabilizes master nodes.
Since we need to know the serialized length of the bulk body, we can't do the serialization in a streaming manner (and it's not easily doable with the HTTP client API we're using anyway).
=> Let's at least serialize on heap in compressed form and decompress as we're streaming to the HTTP connection. For small requests this adds negligible overhead, but for large requests it reduces the size of the payload field by about an order of magnitude (empirically determined), which is a massive reduction when considering O(100MB) bulk requests.
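A minimal sketch of the idea described above, not the actual Elasticsearch code (the class and method names here are made up for illustration): serialize the bulk body gzip-compressed into a heap buffer, remember the uncompressed length, and inflate lazily while the Apache HTTP client streams the request body, so the full uncompressed payload never sits on heap at once.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

import org.apache.http.HttpEntity;
import org.apache.http.entity.ContentType;
import org.apache.http.entity.InputStreamEntity;

/**
 * Illustrative only: hold the bulk body on heap in compressed form and
 * decompress on the fly while the HTTP client consumes the entity.
 */
final class CompressedBulkPayload {

    private final byte[] compressed;        // compressed bulk body kept on heap
    private final long uncompressedLength;  // known length of the serialized body

    CompressedBulkPayload(byte[] bulkBody) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            // In the real exporter the documents would be serialized directly
            // into this stream instead of starting from a ready-made byte[].
            gzip.write(bulkBody);
        }
        this.compressed = buffer.toByteArray();
        this.uncompressedLength = bulkBody.length;
    }

    /**
     * Entity that inflates the payload step by step as the client streams it;
     * only the (much smaller) compressed form stays resident.
     */
    HttpEntity asEntity() throws IOException {
        return new InputStreamEntity(
            new GZIPInputStream(new ByteArrayInputStream(compressed)),
            uncompressedLength,
            ContentType.APPLICATION_JSON);
    }
}
```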

@elasticmachine
Collaborator

Pinging @elastic/es-core-features (:Core/Features/Monitoring)

@elasticmachine added the Team:Data Management label (Meta label for data/management team) May 8, 2020
@jakelandis
Contributor

@original-brownbear the changes here look fine technically; however, I wonder about the practicality. Although bulk and flush are separate methods, very little actually happens between the two calls. This particular workflow does not buffer bulk payloads and flush based on time or size (like other parts of the code base), so here we will compress only to decompress a few milliseconds later.

I agree with the goal of reducing the memory usage of the bulk exporters on master, but I'm not sure this buys us much without the ability to send the compressed bytes. I wonder if gzip-compressing the payload for use with the HTTP entity, so that we don't need to decompress prior to calling out, would be feasible?
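For context, a hypothetical sketch of the alternative being raised here (not what this PR ends up doing): hand the body to the client wrapped in a compressing entity and let the monitoring cluster inflate it. This assumes Apache HttpClient 4.x's GzipCompressingEntity and an endpoint that accepts requests with Content-Encoding: gzip.

```java
import org.apache.http.HttpEntity;
import org.apache.http.client.entity.GzipCompressingEntity;
import org.apache.http.entity.ByteArrayEntity;
import org.apache.http.entity.ContentType;

/**
 * Illustrative only: send the bulk body compressed over the wire instead of
 * decompressing it locally before the request goes out.
 */
final class CompressedOnTheWire {

    static HttpEntity entityFor(byte[] bulkBody) {
        ByteArrayEntity raw = new ByteArrayEntity(bulkBody, ContentType.APPLICATION_JSON);
        // GzipCompressingEntity compresses while streaming and sets the
        // Content-Encoding: gzip header; the request goes out chunked because
        // the compressed length isn't known up front.
        return new GzipCompressingEntity(raw);
    }
}
```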

@original-brownbear
Member Author

original-brownbear commented May 8, 2020

So here we will compress only to de-compress a few milliseconds later.

That still saves us a boatload of memory I think. We're not decompressing into buffers but into a stream that is then consumed step by step by the HTTP client. So at no point will we need the full bulk request on heap uncompressed, and peak memory use is way reduced, isn't it?

Also, even if it did get buffered in full in the client (though I think the Apache HTTP client is way smarter than that ... it looks that way from the code as well), we'd still save an order of magnitude on half the request size :) (Looking at heap dumps of clusters in trouble, I can't see any buffers larger than the ones we create, so I'm pretty confident the client streams properly.)

Contributor

@jakelandis left a comment


LGTM - I trust your assessment of the practicality, and no issues with the implementation.

@original-brownbear
Member Author

Thanks Jake!

@original-brownbear merged commit bbbaee6 into elastic:master May 8, 2020
@original-brownbear deleted the monitoring-serialize-compressed branch May 8, 2020 16:33
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request May 8, 2020
original-brownbear added a commit that referenced this pull request May 8, 2020
@original-brownbear restored the monitoring-serialize-compressed branch August 6, 2020 18:23
@original-brownbear deleted the monitoring-serialize-compressed branch August 12, 2020 09:20
@original-brownbear restored the monitoring-serialize-compressed branch December 6, 2020 18:53