
Best sharding_key for Distributed tables? #198

Closed
raiford opened this issue Aug 24, 2022 · 7 comments
Labels
enhancement New feature or request help wanted Extra attention is needed needs testing

Comments

@raiford

raiford commented Aug 24, 2022

I'm doing some experimentation with replication and sharding and still trying to learn more about the internals of qryn. I don't have any concrete results yet but I started reading some of the CH docs and reviewed the suggested schema for distributed tables:

https://github.com/metrico/qryn/wiki/sharding-replication
#172

I'm wondering if using rand() for the sharding key of the samples_v3 table might be better based on the following:

  • High-volume logs with identical label sets can unbalance the cluster, since they all end up on a single shard.
  • Using a field for sharding can improve performance if you do a lot of JOINs on that field; however, I think this benefit might be negated by the fact that you currently have to set distributed_product_mode to global.
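To illustrate the first bullet: a hash-based sharding key sends every row with the same label set to one shard, while rand() scatters them. A minimal simulation sketch (plain Python; zlib.crc32 stands in for ClickHouse's sharding hash, and the 4-shard cluster size is an assumption):

```python
import random
import zlib

NUM_SHARDS = 4  # hypothetical cluster size

def shard_by_fingerprint(fingerprint: str) -> int:
    # deterministic hash: identical label sets always land on one shard
    return zlib.crc32(fingerprint.encode()) % NUM_SHARDS

def shard_by_rand() -> int:
    # rand()-style key: rows scatter uniformly regardless of labels
    return random.randrange(NUM_SHARDS)

# 10,000 log lines that all share one high-volume label set
rows = ["namespace=logging,component=qryn"] * 10_000

by_fp = {shard_by_fingerprint(r) for r in rows}
by_rand = {shard_by_rand() for _ in rows}

print(len(by_fp))    # 1 -- the whole stream hits a single shard
print(len(by_rand))  # 4 -- spread across the cluster
```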

This makes me wonder whether it would be necessary to make qryn itself aware of the sharding key to optimize queries, but I still need to experiment more.

Also, it might be good to add some notes to the wiki about potentially needing to customize the schema to use ON CLUSTER. I also think some of the CH macros it uses are custom to the Altinity K8s clickhouse-operator, but I might be mistaken.

@lmangani lmangani added enhancement New feature or request help wanted Extra attention is needed needs testing labels Aug 26, 2022
@raiford
Author

raiford commented Aug 26, 2022

Quick update: I'm still experimenting with this, but I'll be leaving on vacation for a couple of weeks, so I won't have any updates for a while.

@lmangani
Collaborator

Thanks @raiford enjoy your holidays and we're looking forward to any future updates on this thread!

@R-omk

R-omk commented Aug 26, 2022

I think this may be related. The rules for sharding and for TTL may be similar.
#158

@R-omk

R-omk commented Aug 26, 2022

As for the choice of strategy: don't worry about network transfers while reading, and especially don't worry about transferring data from the time_series table, since there is too little data there to matter. (However, you should take care to use fingerprint IN (network subquery) instead of JOIN (network subquery).)

What is really worth thinking about is reducing the disk reads/writes involved in writing and reading data.

A granule is the smallest indivisible data set that ClickHouse reads when selecting data

Instead of a random distribution for sharding, you can consider a distribution over time intervals. Such a strategy is much better for reading but worse for writing, since different shards will be loaded at different times.

It seems to me that it is best to use a combination of selected labels and a time interval for the sharding key, where each user can configure the rules that suit their workload.

For example:

{ some match rule by labels, where the first match wins } = {
  labels: [namespace, component],
  timeInterval: 5s
}

For the entry labels[namespace=logging, component=qryn, pod_uid=abcdef123456, hostname=host1] at time=2022-08-27 00:07:49, the shard key can be calculated as:

select xxHash32(concat('logging', 'qryn', toString(toStartOfInterval(toDateTime('2022-08-27 00:07:49'), INTERVAL 5 second))));
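The rule above can be sketched outside ClickHouse as well. A rough Python equivalent (zlib.crc32 standing in for xxHash32; the 4-shard count is an assumption, the 5-second interval comes from the example rule):

```python
import zlib
from datetime import datetime

NUM_SHARDS = 4        # hypothetical cluster size
INTERVAL_SECONDS = 5  # timeInterval from the example rule

def shard_key(labels: dict, selected: list, ts: datetime) -> int:
    # Round the timestamp down to the start of its interval,
    # mirroring toStartOfInterval(..., INTERVAL 5 second).
    epoch = int(ts.timestamp())
    bucket = epoch - epoch % INTERVAL_SECONDS
    # Concatenate the selected label values plus the bucket, then hash.
    key = "".join(labels[name] for name in selected) + str(bucket)
    return zlib.crc32(key.encode()) % NUM_SHARDS

entry = {"namespace": "logging", "component": "qryn",
         "pod_uid": "abcdef123456", "hostname": "host1"}
t = datetime(2022, 8, 27, 0, 7, 49)

# Entries that agree on [namespace, component] within one 5s window
# land on the same shard, regardless of pod_uid/hostname.
other = dict(entry, pod_uid="ffff", hostname="host2")
print(shard_key(entry, ["namespace", "component"], t) ==
      shard_key(other, ["namespace", "component"], t))  # True
```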

A similar approach can be used to choose TTL rules: #158

@coelho

coelho commented Sep 3, 2022

I chose fingerprint as the sharding key to optimize compression. As for query performance, I believe the impact is negligible given an average query's cardinality.

I think the main concern is unbalanced shards, and I'd say that again really depends on your cardinality. Depending on your use case (e.g. logging), your cardinality may be pretty high, so it may be a non-issue. That said, this might be something to optimize at scale, depending on your workload.
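The cardinality point can be made concrete: with a hash-based key, shard balance improves as the number of distinct fingerprints grows. A hedged sketch (crc32 as a stand-in hash; shard count and fingerprint names are invented):

```python
import zlib
from collections import Counter

NUM_SHARDS = 4  # hypothetical cluster size

def balance(num_fingerprints: int) -> float:
    # One row per fingerprint; measure max/min shard load
    # (1.0 means perfectly balanced shards).
    counts = Counter(zlib.crc32(f"fp-{i}".encode()) % NUM_SHARDS
                     for i in range(num_fingerprints))
    return max(counts.values()) / min(counts.values())

# Few distinct label sets tend to give lopsided shards;
# many distinct label sets give a near-uniform spread.
print(balance(8))
print(balance(100_000))
```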

CH macros that it uses are custom to the Altinity K8s clickhouse-operator

Right :) They are currently Altinity K8s macros. Of course, they can be adjusted to whatever you need.

@R-omk

R-omk commented Sep 14, 2022

I chose fingerprint as the sharding key to optimize compression.

Here I described the current situation with compression
#212

@lmangani
Collaborator

lmangani commented Nov 1, 2023

Feel free to reopen if still interested for 3.x

@lmangani lmangani closed this as completed Nov 1, 2023