Tracking: support partial checkpoint #14041

wenym1 · 2023-12-18T08:18:01Z

This RFC involves changes in three layers: batch query, streaming job, and Hummock storage. We are going to implement the whole RFC step by step, from the bottom layer to the upper layer. Since the current global checkpoint is a special case of partial checkpoint, we can keep the existing global checkpoint logic in the upper layers even after some partial checkpoint features have been implemented in the bottom layer, until the corresponding upper-layer logic is implemented.

After the partial checkpoint is supported, we can then work on partial recovery to isolate the failure from different MVs.

The following are some of the changes in each layer:

hummock Storage

First, we will refactor the current code to implement part of the features required by partial checkpoint while keeping the current behavior unchanged. This includes the following:

- Change the Hummock version metadata to maintain a per-table max committed epoch and safe epoch.
- Change the barrier collection process from two separate one-shot RPC calls to a long-lived streaming RPC between CN and meta that reports barrier progress.
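
To illustrate the first item, here is a minimal sketch of version metadata that tracks a max committed epoch and safe epoch per table. It is not the actual `HummockVersion` layout: `TableEpochs`, `VersionMeta`, `commit`, and `epoch_committed_by_all` are hypothetical names chosen for this example, and the rule that epochs must strictly increase per table is an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable


@dataclass
class TableEpochs:
    """Hypothetical per-table epoch pair (not the real metadata layout)."""
    max_committed_epoch: int = 0
    safe_epoch: int = 0


@dataclass
class VersionMeta:
    """Hypothetical version metadata keyed by table id."""
    tables: Dict[int, TableEpochs] = field(default_factory=dict)

    def commit(self, table_ids: Iterable[int], epoch: int) -> None:
        """Advance the committed epoch of only the given tables."""
        table_ids = list(table_ids)
        # Validate first so that a rejected commit leaves the metadata untouched.
        for table_id in table_ids:
            current = self.tables.get(table_id, TableEpochs())
            if epoch <= current.max_committed_epoch:
                raise ValueError(f"epoch {epoch} is not newer for table {table_id}")
        for table_id in table_ids:
            self.tables.setdefault(table_id, TableEpochs()).max_committed_epoch = epoch

    def epoch_committed_by_all(self) -> int:
        """Largest epoch that every tracked table has committed (0 if none)."""
        return min((t.max_committed_epoch for t in self.tables.values()), default=0)
```

Under this sketch, a global checkpoint is simply a `commit` call that names every table, which is why the existing global logic can keep working on top of per-table metadata.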

Meanwhile, we can develop L0 as a log so that we can reuse the data of MV.

streaming job

First, we will implement a new streaming executor, similar to a source executor, that consumes the logs of the upstream MV.

Second, we will implement a partial checkpoint manager that understands the streaming graph, collects the barriers reported from each MV parallelism, and triggers partial checkpoints.
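
A rough sketch of the barrier-collection part of such a manager is shown below. It is only an illustration: `PartialCheckpointManager`, `report_barrier`, and the actor and MV identifiers are hypothetical, and the sketch deliberately ignores upstream/downstream dependencies in the streaming graph, which a real manager would have to respect.

```python
from collections import defaultdict


class PartialCheckpointManager:
    """Hypothetical manager: checkpoints an MV once all its actors report a barrier."""

    def __init__(self, actors_by_mv):
        # MV name -> set of actor ids (one per parallelism) that must report.
        self.actors_by_mv = {mv: set(actors) for mv, actors in actors_by_mv.items()}
        # (mv, epoch) -> actors that have reported the barrier so far.
        self.pending = defaultdict(set)
        # Completed partial checkpoints, in completion order.
        self.checkpointed = []

    def report_barrier(self, mv, actor, epoch):
        """Record one barrier report; return True if it completes a checkpoint."""
        if actor not in self.actors_by_mv[mv]:
            raise KeyError(f"actor {actor} does not belong to {mv}")
        key = (mv, epoch)
        self.pending[key].add(actor)
        if self.pending[key] == self.actors_by_mv[mv]:
            # Every parallelism of this MV has passed the barrier for this epoch.
            del self.pending[key]
            self.checkpointed.append(key)
            return True
        return False
```

In this sketch a slow MV does not delay the checkpoint of another MV, which is the property that distinguishes a partial checkpoint from a global one.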

batch query

Currently, batch queries use the global max committed epoch as the query epoch. We can implement different query consistency levels for batch queries, where a query consistency level is the policy for choosing the query epoch of each state table. The default is to use the global max committed epoch.
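
As a sketch of what such a policy could look like, the example below picks a query epoch for each state table under two policies. Only the global policy comes from the description above; `QueryConsistency`, `PER_TABLE_LATEST`, and `choose_query_epochs` are hypothetical names, and the per-table policy is an assumed alternative shown for contrast.

```python
from enum import Enum


class QueryConsistency(Enum):
    # Default from the description: every table is read at the global epoch.
    GLOBAL = "global"
    # Assumed alternative: each table is read at its own max committed epoch.
    PER_TABLE_LATEST = "per_table_latest"


def choose_query_epochs(policy, table_ids, global_epoch, per_table_epochs):
    """Return a mapping from state table id to the epoch a batch query reads at."""
    if policy is QueryConsistency.GLOBAL:
        return {table_id: global_epoch for table_id in table_ids}
    if policy is QueryConsistency.PER_TABLE_LATEST:
        return {table_id: per_table_epochs[table_id] for table_id in table_ids}
    raise ValueError(f"unknown policy: {policy}")
```

The global policy gives one consistent snapshot across tables, while the per-table variant trades cross-table consistency for freshness on tables that have checkpointed further ahead.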

partial recovery

First, we can support failure isolation among MVs that are not connected.

Second, we can support failure isolation between upstream and downstream MVs.
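
For the first step, one way to reason about which MVs must recover together is to treat the MV dependency graph as undirected and take the connected component of the failed MV. The sketch below assumes this interpretation; `recovery_scope` and the edge representation are hypothetical.

```python
from collections import defaultdict, deque


def recovery_scope(edges, failed_mv):
    """Return the set of MVs connected to failed_mv, including itself.

    edges is an iterable of (upstream_mv, downstream_mv) pairs. Direction is
    ignored here because the first step only isolates MVs with no connection.
    """
    neighbors = defaultdict(set)
    for upstream, downstream in edges:
        neighbors[upstream].add(downstream)
        neighbors[downstream].add(upstream)

    # Breadth-first search over the undirected dependency graph.
    seen = {failed_mv}
    queue = deque([failed_mv])
    while queue:
        mv = queue.popleft()
        for other in neighbors[mv]:
            if other not in seen:
                seen.add(other)
                queue.append(other)
    return seen
```

For example, with edges `a -> b`, `b -> c`, and `d -> e`, a failure in `b` would involve `a`, `b`, and `c` but leave `d` and `e` running. The second step would shrink this scope further by isolating upstream MVs from downstream ones.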

Development progress:

Ongoing:

- batch query
- support different query consistency
- isolate checkpoint between different databases
- support scale in a single barrier
- Support configuration change in a single barrier #18312
- spawn actor in inject barrier

All tasks:

kwannoel · 2024-04-02T14:49:23Z

batch query

By now the batch query will use the global max committed epoch as the query epoch. We can implement different query consistency for batch query. Different query consistency means the policy to choose the query epoch of different state table. The default one is to use the global max committed epoch.

For batch queries, can we use the global checkpoint of only the created streaming jobs, i.e. excluding those that are still being created? That way, streaming jobs under creation will not affect the freshness of batch queries.

wenym1 added the type/feature label Dec 18, 2023

wenym1 self-assigned this Dec 18, 2023

github-actions bot added this to the release-1.6 milestone Dec 18, 2023

wenym1 mentioned this issue Dec 21, 2023

refactor(storage): replace PbHummockVersion with new HummockVersion struct #14101

Merged

9 tasks

wenym1 mentioned this issue Jan 9, 2024

Support partitioned checkpoint #1157

Closed

4 tasks

wenym1 modified the milestones: release-1.6, release-1.7 Jan 9, 2024

wenym1 modified the milestones: release-1.7, release-1.8 Mar 6, 2024

wenym1 modified the milestones: release-1.8, release-1.9 Apr 8, 2024

wenym1 modified the milestones: release-1.9, release-1.10 May 14, 2024

wenym1 modified the milestones: release-1.10, future-release-1.11 Jul 10, 2024

wenym1 modified the milestones: release-2.0, future-release-2.2 Aug 19, 2024
