feat(query): new implementation of analyze table #14725

Merged
merged 14 commits into from
Feb 27, 2024

Conversation

sundy-li
Member

@sundy-li sundy-li commented Feb 23, 2024

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

  1. The ANALYZE command now merges incremental blocks into the HyperLogLog state stored in the table_statistics file.
  2. Support querying the incremental blocks of a fuse table (rows may be duplicated).
SELECT ...
FROM <fuse_table>
[ AT ( { SNAPSHOT => <snapshot_id> | TIMESTAMP => <timestamp> } ) ] 
[ SINCE ( { SNAPSHOT => <snapshot_id> | TIMESTAMP => <timestamp> } ) ];

e.g.:

databend-local:) insert into abc select * from abc_random limit 3;
3 rows written in 0.031 sec. Processed 3 rows, 3 B (95.5 rows/s, 3.54 KiB/s)

databend-local:) select * from abc since(snapshot => '045e8ed9233245e692b8782039a2a504');
┌────────────────────────────────────────────────────────┐
│      a     │      b      │       c      │       d      │
│ Int32 NULL │ String NULL │ Boolean NULL │ Float64 NULL │
├────────────┼─────────────┼──────────────┼──────────────┤
│ NULL       │ NULL        │ NULL         │ 0.0744611682 │
│ NULL       │ NULL        │ NULL         │ 0.5569072503 │
│ 1621677414 │ NULL        │ true         │ NULL         │
└────────────────────────────────────────────────────────┘
3 rows result in 0.041 sec. Processed 3 rows, 3 B (73.05 rows/s, 2.09 KiB/s)
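
As a rough illustration of why a SINCE query can return duplicated rows, here is a minimal sketch. It assumes the incremental set is simply the blocks referenced by the current snapshot but not by the base snapshot. The block ids and the `incremental_blocks` helper are hypothetical and are not taken from the Databend code base.

```python
def incremental_blocks(base_blocks, current_blocks):
    """Return the block ids present in the current snapshot but absent from the base one."""
    base = set(base_blocks)
    return [block for block in current_blocks if block not in base]


# Hypothetical snapshots: block_2 was rewritten (for example by a mutation)
# and block_3 was appended by an insert.
base_snapshot = ["block_1", "block_2"]
current_snapshot = ["block_1", "block_2_rewritten", "block_3"]

new_blocks = incremental_blocks(base_snapshot, current_snapshot)
print(new_blocks)
```

Under this assumption a rewritten block is reported as new even if most of its rows already existed at the base snapshot, which is one way the duplicated rows mentioned in the summary could arise.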

Todo in future:

  1. To account for mutations, we will introduce a health ratio in the snapshot; if the ratio is too low, it is worth running a full table analyze to override the stats.
  • Fixes #[Link the issue here]

Tests

  • Unit Test
  • Logic Test
  • Benchmark Test
  • No Test - Explain why

Type of change

  • Bug Fix (non-breaking change which fixes an issue)
  • New Feature (non-breaking change which adds functionality)
  • Breaking Change (fix or feature that could cause existing functionality not to work as expected)
  • Documentation Update
  • Refactoring
  • Performance Improvement
  • Other (please describe):


@github-actions github-actions bot added the pr-feature this PR introduces a new feature to the codebase label Feb 23, 2024
@sundy-li sundy-li marked this pull request as ready for review February 24, 2024 08:29
@dantengsky
Member

dantengsky commented Feb 25, 2024

What are the semantics of this query:

SELECT ... FROM <fuse_table> [ AT ( { SNAPSHOT => <snapshot_id> | TIMESTAMP => <timestamp> } ) ] [ SINCE ( { SNAPSHOT => <snapshot_id> | TIMESTAMP => <timestamp> } ) ];

e.g.

will query
select * from abc since(snapshot => '045e8ed9233245e692b8782039a2a504'); return data updated since 045e8ed9233245e692b8782039a2a504 ?

@sundy-li
Member Author

sundy-li commented Feb 25, 2024

will query
select * from abc since(snapshot => '045e8ed9233245e692b8782039a2a504'); return data updated since 045e8ed9233245e692b8782039a2a504 ?

Yes, it's similar to a Stream table with a change type of ChangeType::Insert.

The result is not exact (it does not consider the intersection between blocks).

But for merging HLL, duplicate data has little effect: adding a value that an HLL sketch has already seen leaves its state unchanged.
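
To make the point about duplicates concrete, here is a minimal HyperLogLog-style sketch. It is a toy written for this thread, not the implementation used by Databend: the SHA-256 based hash, the helper names, and the sample values are all assumptions. It only shows that merging is a register-wise maximum, so overlapping rows in the incremental blocks do not change the merged state.

```python
import hashlib

P = 12          # precision quoted in this thread
M = 1 << P      # number of registers


def _hash64(value):
    """Map a value to a 64-bit integer (illustrative hash choice)."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def new_sketch():
    return [0] * M


def add(registers, value):
    h = _hash64(value)
    idx = h >> (64 - P)                       # top P bits pick the register
    rest = h & ((1 << (64 - P)) - 1)          # remaining 52 bits
    rank = (64 - P) - rest.bit_length() + 1   # leading zeros in `rest`, plus one
    registers[idx] = max(registers[idx], rank)


def merge(a, b):
    """Merge two sketches by taking the register-wise maximum."""
    return [max(x, y) for x, y in zip(a, b)]


# State already stored in the statistics file (values 0..999).
existing = new_sketch()
for v in range(1000):
    add(existing, v)

# Incremental blocks that overlap with old data (values 900..1199).
increment = new_sketch()
for v in range(900, 1200):
    add(increment, v)

merged = merge(existing, increment)

# A sketch built directly over the distinct union, for comparison.
direct = new_sketch()
for v in range(1200):
    add(direct, v)

print(merged == direct)
```

Because each register stores a maximum, the merged sketch is identical to one built over the distinct union, regardless of how many rows the two inputs share.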

@lichuang
Contributor

Can you add some data about DISTINCT_ERROR_RATE and the increase in disk size when adding the HLL fields?

@sundy-li
Member Author

sundy-li commented Feb 27, 2024

Can you add some data about DISTINCT_ERROR_RATE and the increase in disk size when adding the HLL fields?

Currently we are using rate = '0.01625', which corresponds to P = 12; its register count is 2**12 = 4k, and the average compressed size could be about 1k per column.

So it will take about 100KB for 100 columns in the statistics file. This file is only generated by the analyze statement, so it does not affect insert queries.
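
The numbers above can be checked with a few lines of arithmetic. This sketch uses the standard HyperLogLog error estimate of 1.04 / sqrt(m) for m registers; whether Databend derives its rate exactly this way is an assumption, and the 1 KB per column figure is the rough average quoted above, not a measurement.

```python
import math

P = 12
registers = 2 ** P                        # 4096 registers
std_error = 1.04 / math.sqrt(registers)   # standard HLL relative error estimate

compressed_kb_per_column = 1              # rough average quoted in this thread
columns = 100
total_kb = compressed_kb_per_column * columns

print(registers)
print(f"{std_error:.5f}")
print(total_kb)
```

With P = 12 the estimate works out to 0.01625, which matches the quoted rate, and 100 columns at roughly 1 KB each give the 100KB figure.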

@sundy-li sundy-li added this pull request to the merge queue Feb 27, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Feb 27, 2024
@BohuTANG BohuTANG merged commit d164a00 into datafuselabs:main Feb 27, 2024
71 checks passed
@zhyass zhyass mentioned this pull request Apr 3, 2024
11 tasks