
Best sharding_key for Distributed tables? #198

Closed
raiford opened this issue Aug 24, 2022 · 7 comments
Labels
enhancement New feature or request help wanted Extra attention is needed needs testing

Comments

@raiford

raiford commented Aug 24, 2022

I'm doing some experimentation with replication and sharding and still trying to learn more about the internals of qryn. I don't have any concrete results yet but I started reading some of the CH docs and reviewed the suggested schema for distributed tables:

https://github.com/metrico/qryn/wiki/sharding-replication
#172

I'm wondering if using rand() for the sharding key of the samples_v3 table might be better based on the following:

  • High-volume logs with identical label sets can unbalance the cluster, since they all end up on a single shard.
  • Using a field for sharding can improve performance if you do a lot of JOINs on that field; however, I think this benefit might be negated by the fact that you currently have to set distributed_product_mode to global.
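To illustrate the first bullet: a hash-based sharding key sends every row with the same label set to one shard, while rand() scatters them. A minimal simulation sketch (plain Python; zlib.crc32 stands in for ClickHouse's sharding hash, and the 4-shard cluster size is an assumption):

```python
import random
import zlib

NUM_SHARDS = 4  # hypothetical cluster size

def shard_by_fingerprint(fingerprint: str) -> int:
    # deterministic hash: identical label sets always land on one shard
    return zlib.crc32(fingerprint.encode()) % NUM_SHARDS

def shard_by_rand() -> int:
    # rand()-style key: rows scatter uniformly regardless of labels
    return random.randrange(NUM_SHARDS)

# 10,000 log lines that all share one high-volume label set
rows = ["namespace=logging,component=qryn"] * 10_000

by_fp = {shard_by_fingerprint(r) for r in rows}
by_rand = {shard_by_rand() for _ in rows}

print(len(by_fp))    # 1 -- the whole stream hits a single shard
print(len(by_rand))  # 4 -- spread across the cluster
```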

This makes me wonder whether it would be necessary to make qryn itself aware of the sharding key to optimize queries, but I still need to experiment more.

Also, it might be good to add some notes to the wiki about potentially needing to customize the schema to use ON CLUSTER. I also think some of the CH macros it uses are custom to the Altinity K8s clickhouse-operator, but I might be mistaken.

@lmangani lmangani added enhancement New feature or request help wanted Extra attention is needed needs testing labels Aug 26, 2022
@raiford
Author

raiford commented Aug 26, 2022

Quick update: I'm still experimenting with this, but I'll be leaving on vacation for a couple of weeks, so I won't have any updates for a while.

@lmangani
Collaborator

Thanks @raiford enjoy your holidays and we're looking forward to any future updates on this thread!

@R-omk

R-omk commented Aug 26, 2022

I think this may be related. The rules for sharding and for TTL may be similar.
#158

@R-omk

R-omk commented Aug 26, 2022

As for the choice of strategy: don't worry about network transfers while reading, and especially don't worry about transferring data from the time_series table, since there is too little data there to matter. (However, you should take care to use fingerprint IN (network subquery) instead of JOIN (network subquery).)

What is really worth thinking about is reducing the disk reads/writes involved in writing and reading data.

A granule is the smallest indivisible data set that ClickHouse reads when selecting data

Instead of a random distribution for sharding, you can consider a distribution over time intervals. Such a strategy is much better for reading but worse for writing, since different shards will be loaded at different times.

It seems to me that it is best to use a combination of selected labels and a time interval for the sharding key, where each user can configure the rules that suit their workload.

For example:

{ some match rule by labels, where the first match wins } = {
  labels: [namespace, component],
  timeInterval: 5s
}

For the entry labels[namespace=logging, component=qryn, pod_uid=abcdef123456, hostname=host1] at time=2022-08-27 00:07:49, the shard key can be calculated as:

select xxHash32(concat('logging', 'qryn', toString(toStartOfInterval(toDateTime('2022-08-27 00:07:49'), INTERVAL 5 second))));
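The rule above can be sketched outside ClickHouse as well. A rough Python equivalent (zlib.crc32 standing in for xxHash32; the 4-shard count is an assumption, the 5-second interval comes from the example rule):

```python
import zlib
from datetime import datetime

NUM_SHARDS = 4        # hypothetical cluster size
INTERVAL_SECONDS = 5  # timeInterval from the example rule

def shard_key(labels: dict, selected: list, ts: datetime) -> int:
    # Round the timestamp down to the start of its interval,
    # mirroring toStartOfInterval(..., INTERVAL 5 second).
    epoch = int(ts.timestamp())
    bucket = epoch - epoch % INTERVAL_SECONDS
    # Concatenate the selected label values plus the bucket, then hash.
    key = "".join(labels[name] for name in selected) + str(bucket)
    return zlib.crc32(key.encode()) % NUM_SHARDS

entry = {"namespace": "logging", "component": "qryn",
         "pod_uid": "abcdef123456", "hostname": "host1"}
t = datetime(2022, 8, 27, 0, 7, 49)

# Entries that agree on [namespace, component] within one 5s window
# land on the same shard, regardless of pod_uid/hostname.
other = dict(entry, pod_uid="ffff", hostname="host2")
print(shard_key(entry, ["namespace", "component"], t) ==
      shard_key(other, ["namespace", "component"], t))  # True
```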

A similar approach can be used to choose TTL rules: #158

@coelho

coelho commented Sep 3, 2022

I chose fingerprint as the sharding key to optimize compression. As for query performance, I believe the impact is negligible given an average query's cardinality.

I think the main concern is unbalanced shards, and I'd say that again really depends on your cardinality. Depending on your use case (e.g. logging), your cardinality may be pretty high, so it may be a non-issue. That said, this might be something to optimize at scale, depending on your workload.
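The cardinality point can be made concrete: with a hash-based key, shard balance improves as the number of distinct fingerprints grows. A hedged sketch (crc32 as a stand-in hash; shard count and fingerprint names are invented):

```python
import zlib
from collections import Counter

NUM_SHARDS = 4  # hypothetical cluster size

def balance(num_fingerprints: int) -> float:
    # One row per fingerprint; measure max/min shard load
    # (1.0 means perfectly balanced shards).
    counts = Counter(zlib.crc32(f"fp-{i}".encode()) % NUM_SHARDS
                     for i in range(num_fingerprints))
    return max(counts.values()) / min(counts.values())

# Few distinct label sets tend to give lopsided shards;
# many distinct label sets give a near-uniform spread.
print(balance(8))
print(balance(100_000))
```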

CH macros that it uses are custom to the Altinity K8s clickhouse-operator

Right :) They are currently Altinity K8s macros. Of course, they can be adjusted to whatever you need.

@R-omk

R-omk commented Sep 14, 2022

I chose fingerprint as the sharding key to optimize compression.

Here I described the current situation with compression
#212

@lmangani
Collaborator

lmangani commented Nov 1, 2023

Feel free to reopen if still interested for 3.x

@lmangani lmangani closed this as completed Nov 1, 2023