Qryn crash under load - ERR_STREAM_PREMATURE_CLOSE #516

Open
jpsfs opened this issue Jun 13, 2024 · 6 comments

jpsfs commented Jun 13, 2024

Hi!

I'm facing an issue while using PromQL through qryn.

The query is the following:

histogram_quantile(0.50, sum(rate(http_server_request_duration_seconds_bucket{service_namespace=~"$environment",service_name=~"$component",  instance=~"$instance",http_route=~"$route", http_request_method=~"$method"}[$__rate_interval])) by (le))

It translates to this ClickHouse query:

WITH idx AS (
  select `fingerprint` from `qryn`.`time_series_gin` as `time_series_gin`
  where ((((`key` = 'service_namespace') and (match(val, '.+') = 1))
      or ((`key` = 'service_name') and (match(val, '.+') = 1))
      or ((`key` = 'instance') and (match(val, '.+') = 1))
      or ((`key` = 'http_route') and (match(val, '.+') = 1))
      or ((`key` = 'http_request_method') and (match(val, '.+') = 1))
      or ((`key` = '__name__') and (`val` = 'http_server_request_duration_seconds_bucket')))
    and (`date` >= toDate(fromUnixTimestamp(1718310540)))
    and (`date` <= toDate(fromUnixTimestamp(1718312340)))
    and (`type` in (0,0)))
  group by `fingerprint`
  having (groupBitOr(
      bitShiftLeft(((`key` = 'service_namespace') and (match(val, '.+') = 1))::UInt64, 0)
    + bitShiftLeft(((`key` = 'service_name') and (match(val, '.+') = 1))::UInt64, 1)
    + bitShiftLeft(((`key` = 'instance') and (match(val, '.+') = 1))::UInt64, 2)
    + bitShiftLeft(((`key` = 'http_route') and (match(val, '.+') = 1))::UInt64, 3)
    + bitShiftLeft(((`key` = 'http_request_method') and (match(val, '.+') = 1))::UInt64, 4)
    + bitShiftLeft(((`key` = '__name__') and (`val` = 'http_server_request_duration_seconds_bucket'))::UInt64, 5)) = 63)
),
raw AS (
  select argMaxMerge(last) as `value`, `fingerprint`,
         intDiv(timestamp_ns, 15000000000) * 15000 as `timestamp_ms`
  from `metrics_15s` as `metrics_15s`
  where ((`fingerprint` in (idx))
    and (`timestamp_ns` >= 1718310540000000000)
    and (`timestamp_ns` <= 1718312340000000000)
    and (`type` in (0,0)))
  group by `fingerprint`, `timestamp_ms`
  order by `fingerprint`, `timestamp_ms`
),
timeSeries AS (
  select `fingerprint`,
         arraySort(JSONExtractKeysAndValues(labels, 'String')) as `labels`
  from `qryn`.`time_series`
  where ((`fingerprint` in (idx)) and (`type` in (0,0)))
)
select any(labels) as `stream`,
       arraySort(groupArray((raw.timestamp_ms, raw.value))) as `values`
from raw as `raw`
any left join timeSeries as time_series on `time_series`.`fingerprint` = raw.fingerprint
group by `raw`.`fingerprint`
order by `raw`.`fingerprint`

After a few seconds, qryn crashes with the following error:

Error [ERR_STREAM_PREMATURE_CLOSE]: Premature close
    at Gunzip.onclose (node:internal/streams/end-of-stream:154:30)
    at Gunzip.emit (node:events:531:35)
    at emitCloseNT (node:internal/streams/destroy:147:10)
    at process.processTicksAndRejections (node:internal/process/task_queues:81:21)
Emitted 'error' event on Readable instance at:
    at emitErrorNT (node:internal/streams/destroy:169:8)
    at emitErrorCloseNT (node:internal/streams/destroy:128:3)
    at process.processTicksAndRejections (node:internal/process/task_queues:82:21) {
  code: 'ERR_STREAM_PREMATURE_CLOSE'
}
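
For context: ERR_STREAM_PREMATURE_CLOSE is the generic error Node's stream utilities (stream.pipeline / end-of-stream) raise when a piped stream is closed before it has finished. Given the Gunzip frame in the trace, this likely means the compressed response stream was torn down while qryn was still decompressing it. A minimal sketch, not qryn's actual code, that reproduces the same error shape:

    const { pipeline, Readable } = require('node:stream');
    const { createGzip, createGunzip } = require('node:zlib');

    // A readable that stays open until we destroy it explicitly.
    const source = new Readable({ read() {} });
    source.push('some payload');

    pipeline(source, createGzip(), createGunzip(), (err) => {
      // Destroying a stream mid-pipeline typically surfaces here as ERR_STREAM_PREMATURE_CLOSE.
      console.error('pipeline ended with:', err && err.code);
    });

    // Simulate the upstream connection dropping before the data has fully streamed.
    setImmediate(() => source.destroy());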

When run directly against ClickHouse, the query returns in around 150 ms, with a total result size of about 50 MiB.
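
For reference, a sketch of how that direct timing and result size can be checked from the shell (query.sql is assumed to hold the full ClickHouse query above); --time prints the elapsed execution time and wc -c counts the bytes returned:

    # query.sql contains the generated ClickHouse query shown above
    clickhouse-client --time < query.sql | wc -c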
Any pointers on what I should do to overcome this?

Best,
José

@lmangani (Collaborator)

Thanks for the report @jpsfs
Do you see any logs or errors from ClickHouse when this query fails?

jpsfs (Author) commented Jun 13, 2024

Thank you for the follow-up, @lmangani!
Only normal ClickHouse logs; the query itself doesn't seem to fail on ClickHouse, and if I execute it manually it succeeds quite quickly.

I forgot to mention that this was tested on the latest version (released today) as well as on the previous two versions.

If the database is smaller (less data), the query succeeds through qryn as well.

Best,
José

lmangani added the bug (Something isn't working) label Jun 13, 2024
akvlad (Collaborator) commented Jun 14, 2024

@jpsfs do you get a "timeout" error message in Grafana when you request histogram_quantile(0.50, ...)?

jpsfs (Author) commented Jun 15, 2024 via email

akvlad mentioned this issue Jun 20, 2024
lmangani (Collaborator) commented Jun 20, 2024

@jpsfs could you please retest using the latest release and provide any feedback?

EXPERIMENTAL_PROMQL_OPTIMIZE=1

akvlad (Collaborator) commented Jun 20, 2024

Hello @jpsfs. Version 3.2.24 is released.

  • The sum and rate functions were optimized using ClickHouse functions.

Please set the env var EXPERIMENTAL_PROMQL_OPTIMIZE=1 before use.
Please share your experience using the sum and rate functions (as in your histogram_quantile... request) so we can decide whether further optimizations are worth doing.
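
For example, a sketch of enabling the flag (assuming the qxip/qryn Docker image; adapt the tag and keep your existing ClickHouse connection settings for your deployment):

    # pass the experimental flag alongside your usual qryn environment variables
    docker run -e EXPERIMENTAL_PROMQL_OPTIMIZE=1 qxip/qryn:3.2.24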
