Best sharding_key for Distributed tables? #198
Quick update. I'm still experimenting with this, but I will be leaving on vacation for a couple of weeks, so I won't have any updates for a while.
Thanks @raiford, enjoy your holidays! We're looking forward to any future updates on this thread.
I think this may be related: the rules for sharding and the rules for TTL may be similar.
As for the choice of strategy: don't worry about network transfers while reading, and especially don't worry about transferring data from the time_series table; there is too little data there (however, you should take care to use …).

What is really worth thinking about is keeping the number of disk reads/writes involved in writing and reading data low. A granule is the smallest indivisible data set that ClickHouse reads when selecting data.

Instead of random distribution for sharding, you can consider distribution over time intervals. Such a strategy is much better for reading but worse for writing, since different shards will be loaded at different times. It seems to me that it is best to use a combination of the selected labels and the time interval for the sharding key, so that each user can configure rules that suit their own workload, for example routing each entry by a hash of its labels plus its time bucket. A similar approach can be used for choosing TTL rules (#158).
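To make the "labels plus time interval" idea concrete, here is a minimal sketch of what such a sharding key could look like. All names here are assumptions for illustration (`samples_dist`, `my_cluster`, `fingerprint`, `timestamp_ns`), not qryn's actual schema:

```sql
-- Hypothetical Distributed table over a cluster named 'my_cluster'.
-- The sharding key mixes a hash of the label fingerprint with a daily
-- time bucket, so reads for one series/time range touch fewer shards.
CREATE TABLE samples_dist ON CLUSTER my_cluster
AS samples_v3
ENGINE = Distributed(
    'my_cluster',
    default,
    samples_v3,
    -- 86400000000000 ns = 1 day; intDiv gives an integer day bucket
    cityHash64(fingerprint, intDiv(timestamp_ns, 86400000000000))
);
```

The trade-off described above applies: rows for the same labels and day land on the same shard (good for reads), but at any given moment writes concentrate on the shards owning the current time bucket.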
I chose …. I think the main concern is unbalanced shards, and I'd say that really depends on your cardinality. Depending on your use case (for example, logging), your cardinality may be pretty high, so it may be a non-issue. With that said, this might be something to optimize at scale, depending on your workload.
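As a quick way to check whether shards actually end up unbalanced, a query like the following can help (a sketch assuming a hypothetical Distributed table named `samples_dist`; not part of qryn's schema):

```sql
-- Row counts per shard; large spreads between shards suggest the
-- sharding key's cardinality is too low for even distribution.
SELECT shardNum() AS shard, count() AS rows
FROM samples_dist
GROUP BY shard
ORDER BY shard;
```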
Right :) They are currently Altinity K8s macros. Of course, they can be adjusted to whatever.
Here I described the current situation with compression.
Feel free to reopen if still interested for 3.x.
I'm doing some experimentation with replication and sharding, and I'm still trying to learn more about the internals of qryn. I don't have any concrete results yet, but I started reading some of the CH docs and reviewed the suggested schema for distributed tables:
https://github.com/metrico/qryn/wiki/sharding-replication
#172
I'm wondering if using `rand()` for the sharding key of the `samples_v3` table might be better, based on the following: … setting `distributed_product_mode` to `global`.

This makes me start wondering whether it would be necessary to make qryn itself aware of the sharding key to optimize queries, but I still need to experiment more.
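For context, here is a minimal sketch of what random sharding plus the global product mode looks like in practice. Table and column names (`samples_dist`, `time_series_dist`, `fingerprint`) are illustrative assumptions:

```sql
-- Hypothetical Distributed table sharded randomly across the cluster.
CREATE TABLE samples_dist ON CLUSTER my_cluster
AS samples_v3
ENGINE = Distributed('my_cluster', default, samples_v3, rand());

-- With random sharding, a subquery against another distributed table
-- must use GLOBAL semantics so every shard sees the full subquery
-- result; distributed_product_mode = 'global' rewrites IN to GLOBAL IN.
SELECT count()
FROM samples_dist
WHERE fingerprint IN (SELECT fingerprint FROM time_series_dist)
SETTINGS distributed_product_mode = 'global';
```

The cost is that the `GLOBAL IN` set is built once and broadcast to all shards, which is why the sharding-key choice interacts with query patterns.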
Also, it might be good to add some notes to the wiki link about potentially needing to customize the schema to use `ON CLUSTER`. I also think some of the CH macros it uses are custom to the Altinity K8s clickhouse-operator, but I might be mistaken.
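For readers hitting the macro question above, here is a hedged sketch of how operator-style macros typically appear in `ON CLUSTER` DDL. The table definition is illustrative, not qryn's actual schema; the macro names are the conventional ones, but on a hand-rolled cluster you must define them yourself in each server's config (e.g. a `macros` section per node):

```sql
-- Hypothetical replicated table using substitution macros. With the
-- Altinity clickhouse-operator, macros like {cluster}, {shard} and
-- {replica} are injected per pod automatically; elsewhere each server
-- needs its own values configured before this DDL will work.
CREATE TABLE samples_v3 ON CLUSTER '{cluster}'
(
    fingerprint  UInt64,
    timestamp_ns Int64,
    value        Float64
)
ENGINE = ReplicatedMergeTree(
    '/clickhouse/tables/{shard}/samples_v3',
    '{replica}'
)
ORDER BY (fingerprint, timestamp_ns);
```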