premature close when querying logql #533

Closed
KhafRuslan opened this issue Jul 15, 2024 · 9 comments

@KhafRuslan

KhafRuslan commented Jul 15, 2024

When querying in Grafana (version 10) I get inconsistent results. qryn is configured as the data source (Loki type), and the same query returns a different number of rows each time. The same thing happens in the qryn view.

[screenshot: query results]

qryn config (docker-compose):

services:
  otel-collector:
    container_name: otel-collector
    image: some-registry/ghcr.io-cache/metrico/qryn-otel-collector:0.0.3
    command: ["--config=/etc/otel-collector-config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml
    ports:
      - "4317:4317"   # OTLP gRPC receiver
#      - "24224:24224"
#      - "8086:8086"
      - "14250:14250"
      - "6831:6831"
      - "14268:14268"
      - "3200:3200"
      - "3100:3100"
      - "4318:4318"
      - "8888:8888"
#      - "11800:11800"
#      - "12800:12800"
#      - "8088:8088"
      - "5514:5514"
#      - "5514:3100"
      - "9411:9411"
      - "9200:9200"   # loki receiver
#      - "9300:9300"   # Elasticsearch
  qryn:
    image:  some-registry/project-cache/qxip/qryn:3.2.25
    container_name: qryn
    restart: unless-stopped
    expose:
      - 3101
    ports:
      - "80:3100"
    environment:
      CLICKHOUSE_SERVER: "chproxy-server"
      CLICKHOUSE_PORT: "8123"
      CLICKHOUSE_DB: "qryn"
      CLICKHOUSE_AUTH: "somecred"
      CLUSTER_NAME: "qryn_cluster"
      LOG_LEVEL: "debug"
      FASTIFY_BODYLIMIT: "1048576"

Qryn logs:

{"level":30,"time":1721035800909,"pid":19,"hostname":"be0837b9417c","name":"qryn","reqId":"req-1k","req":{"method":"GET","url":"/loki/api/v1/query_range?direction=backward&end=1721035800562000000&limit=10&query=%7Bappname%3D%22acs1-server%22%7D&start=1721014200562000000&step=21600000ms","hostname":"qryn-test.somedomain.group","remoteAddress":"someip","remotePort":47542},"msg":"incoming request"}
{"level":20,"time":1721035800909,"pid":19,"hostname":"be0837b9417c","name":"qryn","reqId":"req-1k","msg":"GET /loki/api/v1/query_range"}
{"level":20,"time":1721035800909,"pid":19,"hostname":"be0837b9417c","name":"qryn","msg":"Scanning Fingerprints..."}
WITH sel_a AS (select `samples`.`string` as `string`,`samples`.`fingerprint` as `fingerprint`,samples.timestamp_ns as `timestamp_ns` from qryn.samples_v3_dist as `samples` where ((`samples`.`timestamp_ns`   between 1721014200561000000 and 1721035800561000000) and (`samples`.`type` in (0,0))) and (`samples`.`fingerprint` in (select `fingerprint` from (select DISTINCT `fingerprint` from qryn.time_series where ((JSONHas(labels, 'appname') = 1) and (JSONExtractString(labels, 'appname') = 'acs1-server'))))) order by `timestamp_ns` desc limit 10) select JSONExtractKeysAndValues(time_series.labels, 'String') as `labels`,sel_a.* from sel_a left any join `qryn`.`time_series_dist` AS time_series on `sel_a`.`fingerprint` = time_series.fingerprint order by `labels` desc,`timestamp_ns` desc
{"level":20,"time":1721035800924,"pid":19,"hostname":"be0837b9417c","name":"qryn","msg":"Scanning Fingerprints..."}
{"level":30,"time":1721035801753,"pid":19,"hostname":"be0837b9417c","name":"qryn","reqId":"req-1k","res":{"statusCode":200},"responseTime":844.1129480004311,"msg":"request completed"}
{"level":30,"time":1721035801841,"pid":19,"hostname":"be0837b9417c","name":"qryn","reqId":"req-1j","res":{"statusCode":200},"msg":"stream closed prematurely"}
{"level":50,"time":1721035801842,"pid":19,"hostname":"be0837b9417c","name":"qryn","reqId":"req-1j","err":"premature close\nError: premature close\n    at onclosenexttick (/app/node_modules/end-of-stream/index.js:54:86)\n    at process.processTicksAndRejections (node:internal/process/task_queues:77:11)","msg":"premature close"}

In ClickHouse, this query always returns 16 rows. (I have redacted the data.)
[screenshot: ClickHouse query result]
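To compare what qryn gets from the distributed table with what each replica actually holds, something like the following can be run in clickhouse-client (a minimal sketch: the local table name qryn.samples_v3 and the use of clusterAllReplicas are assumptions based on the logged query, which only shows samples_v3_dist; adjust names to the actual schema):

-- Per-replica row count for the same time window (hostName() identifies the replica);
-- qryn.samples_v3 is an assumed local table name
SELECT hostName() AS host, count() AS rows
FROM clusterAllReplicas('qryn_cluster', qryn.samples_v3)
WHERE timestamp_ns BETWEEN 1721014200561000000 AND 1721035800561000000
GROUP BY host
ORDER BY host;

-- The same count through the distributed table, i.e. what qryn queries
SELECT count() AS rows
FROM qryn.samples_v3_dist
WHERE timestamp_ns BETWEEN 1721014200561000000 AND 1721035800561000000;

If the per-replica counts disagree, the varying result comes from ClickHouse replication rather than from qryn itself.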

@lmangani
Collaborator

Thanks for the report @KhafRuslan. Have you tried bypassing chproxy to make sure it's not involved?

@KhafRuslan
Author

Same result; bypassing chproxy didn't solve the problem.

@lmangani
Collaborator

OK, please provide anything useful for reproducing this. Since we have no other similar reports, this might be due to something specific to your setup.

@KhafRuslan
Author

The operating system on all servers is CentOS 8. Scheme:
[screenshot: architecture diagram]
Otel config:

receivers:
  tcplog:
    listen_address: "0.0.0.0:5514"
    operators:
    - type: json_parser
      timestamp:
        parse_from: attributes.time
        layout_type: gotime
        layout: '2006-01-02 15:04:05.999'
    - type: move
      from: attributes.newmessage
      to: body
    - type: remove
      id: time_remove
      field: attributes.time

extensions:
  health_check: []
  pprof: []
  zpages: []

processors:
  batch/logs:
    send_batch_size: 500
    timeout: 20s
  memory_limiter/logs:
    check_interval: 2s
    limit_mib: 18000
    spike_limit_mib: 500
  batch/trace:
    send_batch_size: 10000
    timeout: 0s
  memory_limiter/trace:
    check_interval: 1s
    limit_mib: 4000
    spike_limit_mib: 800

exporters:
  qryn:
    dsn: http://chproxy-host.somedomain.group:9000/qryn?username=qryn&password=somepassword
    logs:
      format: raw
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_elapsed_time: 300s
      max_interval: 30s
    sending_queue:
      queue_size: 1200
    timeout: 10s

service:
  extensions: [pprof, zpages, health_check]
  pipelines:
    logs:
      exporters: [qryn]
      processors: [memory_limiter/logs, batch/logs]
      receivers: [tcplog]
  telemetry:
    logs:
      level: "debug"
    metrics:
      address: 0.0.0.0:8888

HAProxy config:

global
    log 127.0.0.1 local2
    maxconn 1000
    chroot      /var/lib/haproxy
    pidfile     /var/run/haproxy.pid
    user   haproxy
    group  haproxy
    daemon
    stats socket /var/lib/haproxy/stats
defaults
    log global
    mode tcp
    retries 2
    timeout connect 4s

listen otel_write
    balance roundrobin
    bind *:9000
    default-server inter 10s fall 3 rise 3 on-marked-down shutdown-sessions
      server qryn-db1 10.80.172.125:9000
      server qryn-db2 10.80.172.126:9000
      server qryn-db3 10.80.172.127:9000
      server qryn-db4 10.80.172.124:9000

Chproxy config:

log_debug: true
hack_me_please: true

caches:
  - name: "default_cache"
    mode: "file_system"
    file_system:
      max_size: 2Gb
      dir: "/data/cache"
    max_payload_size: 2Gb
    shared_with_all_users: true
    expire: 1h

server:
  http:
    listen_addr: "0.0.0.0:8123"

users:
  - name: "qryn"
    password: "somepassword"
    to_cluster: "qryn_cluster"
    to_user: "qryn"
    max_concurrent_queries: 1000
    max_execution_time: 10m
    cache: "default_cache"

clusters:
  - name: "qryn_cluster"
    nodes: ["qryn-db1:8123", "qryn-db2:8123", "qryn-db3:8123", "qryn-db4:8123"]
    users:
      - name: "qryn"
        password: "somepassword"

ClickHouse settings
cluster.xml:

<clickhouse>
     <remote_servers replace="1">
       <qryn_cluster> <!-- give the cluster a name -->
           <!-- Inter-server per-cluster secret for Distributed queries default: no secret (no authentication will be performed) -->
           <secret>SuperSecretValue</secret>
           <shard>
               <internal_replication>true</internal_replication>
               <replica>
                   <host>qryn-db1</host>
                   <port>9000</port>
               </replica>
               <replica>
                   <host>qryn-db2</host>
                   <port>9000</port>
               </replica>
           </shard>
           <shard>
               <internal_replication>true</internal_replication>
               <replica>
                   <host>qryn-db3</host>
                   <port>9000</port>
               </replica>
               <replica>
                   <host>qryn-db4</host>
                   <port>9000</port>
               </replica>
           </shard>
       </qryn_cluster>
    </remote_servers>
</clickhouse>
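As a generic sanity check (not part of the original report), the cluster that the *_dist tables resolve to can be verified on any node via system.clusters:

-- Should list the four replicas across the two shards exactly as defined in cluster.xml
SELECT cluster, shard_num, replica_num, host_name, port, is_local
FROM system.clusters
WHERE cluster = 'qryn_cluster';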

database_atomic_delay_before_drop_table.xml:

<clickhouse>
    <database_atomic_delay_before_drop_table_sec>1</database_atomic_delay_before_drop_table_sec>
</clickhouse>

disable_drop_restrictions.xml:

<clickhouse>
    <max_table_size_to_drop>0</max_table_size_to_drop>
    <max_partition_size_to_drop>0</max_partition_size_to_drop>
</clickhouse>

disable_unused_ports.xml:

<clickhouse>
    <!-- To reduce exposure to your ClickHouse environment, unused ports should be disabled. -->
    <mysql_port remove="1" />
    <postgresql_port remove="1" />
</clickhouse>

listen.xml:

<clickhouse>
    <!--ipv4-->
    <listen_host>0.0.0.0</listen_host>
</clickhouse>

logging.xml:

<clickhouse>
  <logger replace="1">
    <level>information</level>
    <log>/var/log/clickhouse-server/clickhouse-server.log</log>
    <errorlog>/var/log/clickhouse-server/clickhouse-server.err.log</errorlog>
    <size>300M</size>
    <count>5</count>
  </logger>
</clickhouse>

macros.xml:

<clickhouse>
     <!--
        That macros are defined per server,
        and they can be used in DDL, to make the DB schema cluster/server neutral
    -->
    <macros>
        <cluster>qryn_cluster</cluster>
        <shard>01</shard>
        <replica>urt-qryn-db1</replica>
    </macros>
</clickhouse>

zookeeper.xml:

<clickhouse>
     <zookeeper>
        <node>
            <host>qryn-zookeper.somedomain.group</host>
            <port>2181</port>
        </node>
    </zookeeper>
</clickhouse>

ZooKeeper conf:

tickTime=2000
initLimit=20
syncLimit=5
dataDir=/u/zookeeper/data
clientPort=2181
maxClientCnxns=0
autopurge.snapRetainCount=5
autopurge.purgeInterval=1
metricsProvider.className=org.apache.zookeeper.metrics.prometheus.PrometheusMetricsProvider
metricsProvider.httpPort=7000
metricsProvider.exportJvmInfo=true
4lw.commands.whitelist=stat
server.1=qryn-zookeper:2888:3888

@lmangani
Collaborator

Thanks for the detailed overview. Have you noticed any potentially related logs or errors in ClickHouse while this issue manifests?
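One hedged way to check this on the ClickHouse side, assuming query logging is enabled (it is by default, into system.query_log), is to look for exceptions around the timestamp of the failing request (1721035800 is the epoch-seconds value from the qryn log above):

-- Exceptions within ±5 minutes of the failing request
SELECT event_time, type, exception_code, exception, query_duration_ms
FROM system.query_log
WHERE type = 'ExceptionWhileProcessing'
  AND event_time BETWEEN toDateTime(1721035800) - 300 AND toDateTime(1721035800) + 300
ORDER BY event_time DESC
LIMIT 20;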

@KhafRuslan
Author

No

@KhafRuslan
Author

I've been seeing errors like this before:
2024.07.16 14:41:58.926544 [ 15764 ] {c73cd6c6-4513-493b-8f66-0cfbb7c88939} HedgedConnectionsFactory: Connection failed at try №1, reason: Code: 210. DB::NetException: I/O error: Broken pipe, while writing to socket (db3:40594 -> db1:9000). (NETWORK_ERROR) (version 23.12.6.19 (official build))

but it seems to be related to something else
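Note: the HedgedConnectionsFactory message comes from ClickHouse's hedged-requests mechanism for distributed queries. Purely as an isolation experiment (a sketch, not a confirmed fix), it can be switched off to see whether the symptom changes:

-- Disable hedged requests for the current session (or in the qryn user's settings profile)
SET use_hedged_requests = 0;

-- then re-run the failing /loki/api/v1/query_range request and compare behaviour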

@KhafRuslan
Author

KhafRuslan commented Jul 18, 2024

There was a desynchronization between the databases; it was resolved by recreating the databases a few times. I'll see how it behaves over the next 24 hours. I also upgraded to the latest version.
[screenshot]
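If the desynchronization comes back, a minimal sketch for spotting it early is to check replica health on each node (generic system.replicas diagnostics, assuming the qryn tables are ReplicatedMergeTree):

-- Replication lag and queue state for the qryn database on this node
SELECT database, table, is_readonly, is_session_expired,
       absolute_delay, queue_size, inserts_in_queue, merges_in_queue
FROM system.replicas
WHERE database = 'qryn';

A non-zero absolute_delay or a growing queue_size on one replica would explain the distributed query returning different row counts depending on which replica answers.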

@lmangani
Copy link
Collaborator

Thanks for the update @KhafRuslan. This info is particularly useful, as we couldn't reproduce this so far.
