Discussion: Handling inconsistent streams #14031

Closed

kwannoel opened this issue Dec 18, 2023 · 16 comments

Comments

@kwannoel
Contributor

kwannoel commented Dec 18, 2023

Problems

  • For stream executors, the input stream from upstream does not undergo consistency checks.
  • For example, consider the following input, where v1 is the primary key (two inserts of the same key; see the sketch after this list):
    | Op | v1 | v2 | v3 |
    | +  |  1 |  2 |  3 |
    | +  |  1 |  2 |  3 |
    
  • The executor may only panic once its own internal state / cache becomes inconsistent.
  • The actual source of the inconsistency is upstream.
  • We then have to enumerate all upstream executors to discover the actual root cause.
  • This might be fine for a small stream graph, but when the stream graph is large, we cannot find the root cause quickly.
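
To make the invariant concrete, here is a minimal sketch of the per-key check that the example above violates. The Op enum, the integer keys, and check_chunk are simplified stand-ins for illustration, not RisingWave's actual types:

    use std::collections::HashSet;

    #[derive(Clone, Copy)]
    enum Op {
        Insert, // "+"
        Delete, // "-"
    }

    /// Checks one chunk of (op, primary key) pairs against the set of keys
    /// currently present: a double insert or a delete of a missing key is an
    /// inconsistency.
    fn check_chunk(ops: &[(Op, i64)], present: &mut HashSet<i64>) -> Result<(), String> {
        for &(op, pk) in ops {
            match op {
                Op::Insert => {
                    if !present.insert(pk) {
                        return Err(format!("double insert for pk {pk}"));
                    }
                }
                Op::Delete => {
                    if !present.remove(&pk) {
                        return Err(format!("delete of non-existent pk {pk}"));
                    }
                }
            }
        }
        Ok(())
    }

    fn main() {
        // The example above: two "+" ops for the same primary key v1 = 1.
        let chunk = [(Op::Insert, 1), (Op::Insert, 1)];
        assert!(check_chunk(&chunk, &mut HashSet::new()).is_err());
    }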

Solutions

  1. Introduce a testing feature that adds an extra executor between each pair of executors; it records the state of the stream passing through and checks it for inconsistency (a rough sketch follows below). Use it in fuzzing tests, e.g. sqlsmith, so we can find inconsistency bugs. This cannot cover the production case, though.
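
As a rough illustration of where such a check executor would sit, here is a sketch of a transparent adapter that forwards each chunk unchanged after running a check (such as check_chunk above) on it. Real executors are async streams of messages; the Iterator-based CheckedStream and its names are simplifications for illustration only:

    /// Wraps any chunk-producing stream and validates every chunk before
    /// forwarding it downstream unchanged.
    struct CheckedStream<S, F> {
        inner: S,
        check: F,
    }

    impl<S, C, F> Iterator for CheckedStream<S, F>
    where
        S: Iterator<Item = C>,
        F: FnMut(&C) -> Result<(), String>,
    {
        type Item = C;

        fn next(&mut self) -> Option<C> {
            let chunk = self.inner.next()?;
            if let Err(e) = (self.check)(&chunk) {
                // Panic at the boundary where the stream first became
                // inconsistent, instead of deep inside a downstream
                // executor's state update.
                panic!("inconsistent stream detected: {e}");
            }
            Some(chunk)
        }
    }

During sqlsmith fuzzing, one of these could be interposed on every executor edge, so the first bad chunk fails where it is produced rather than where it later corrupts state.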

Any other ideas?

@github-actions github-actions bot added this to the release-1.6 milestone Dec 18, 2023
@BugenZhao
Member

BTW, the internal state / cache inconsistency could also be caused by data corruption in the storage layer, which is even harder to debug. 😕 The factor can be eliminated by using an in-memory storage backend under the testing feature. However, this also implies that it cannot be covered in production.

@kwannoel
Contributor Author

The factor can be eliminated by using an in-memory storage backend under the testing feature.

Can you elaborate on how this can catch data corruption in the storage layer? Or what kinds of data corruption it can catch?

@BugenZhao
Member

BugenZhao commented Dec 18, 2023

The factor can be eliminated by using an in-memory storage backend under the testing feature.

Can you elaborate on how this can catch data corruption in the storage layer? Or what kinds of data corruption it can catch?

IIRC, we once encountered an issue where the file cache was not correctly invalidated after a node restart, which caused reads to return totally irrelevant results. Since the in-memory state backend is simple enough, I guess we can assume it has no such issue.

@kwannoel
Contributor Author

kwannoel commented Dec 18, 2023

Adding another reason for inconsistent streams: backward incompatibility.

If the user is using stable features, the bug was likely triggered by a backward-compatibility issue after an upgrade.

@kwannoel
Contributor Author

kwannoel commented Dec 20, 2023

Another case occurred recently that caused hash join state to become inconsistent.

A complementary approach was suggested by @fuyufjh: first, tolerate the inconsistency to prevent the cluster from crashing, but emit an error log.

For example, if a row is about to be inserted into the cache but already exists there, it means this row has been seen before, and we should skip it. Since this still should not happen, we should log an error (a sketch follows below).
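
Here is a minimal sketch of that "tolerate but log" behavior for a cache insert, assuming a plain HashMap as the cache; the function name and the eprintln! stand-in for a proper error log are illustrative, not the actual HashJoin code:

    use std::collections::hash_map::Entry;
    use std::collections::HashMap;

    /// Inserts a row into the cache. On a duplicate key it keeps the existing
    /// entry, skips the new row, and logs an error instead of panicking.
    fn insert_row(cache: &mut HashMap<i64, Vec<i64>>, pk: i64, row: Vec<i64>) {
        match cache.entry(pk) {
            Entry::Vacant(e) => {
                e.insert(row);
            }
            Entry::Occupied(_) => {
                // Inconsistent upstream: this row has been seen before.
                eprintln!("ERROR: duplicate insert for pk {pk}, skipping row");
            }
        }
    }

    fn main() {
        let mut cache = HashMap::new();
        insert_row(&mut cache, 1, vec![2, 3]);
        insert_row(&mut cache, 1, vec![2, 3]); // logged and skipped, no panic
        assert_eq!(cache.len(), 1);
    }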

@fuyufjh
Member

fuyufjh commented Dec 20, 2023

+1. I have discussed this with multiple people recently, and at least we all agree that we can hardly do anything when we run into this problem.

@kwannoel
Contributor Author

I will work on the testing part, fuzzing with checks on the data stream.

@kwannoel kwannoel self-assigned this Dec 20, 2023
@fuyufjh
Member

fuyufjh commented Dec 20, 2023

When talking with @stdrc yesterday, we figured we can provide an option like "non-strict mode" to warn about these inconsistency problems instead of panicking. We would always use "strict mode" for testing; for production deployments, we can decide case by case whether to enable it.

Some known places that need to be refactored:

  • HashAgg
    src/stream/src/executor/aggregation/agg_group.rs:320:13: row count should be non-negative
    
  • HashJoin
    src/stream/src/executor/managed_state/join/join_entry_state.rs:43:44 unwrap()
    src/stream/src/executor/managed_state/join/join_entry_state.rs:51:13 pk [...] should be in the cache
    
  • Inside MemTable
    • Search for handle_mem_table_error()

There should be more cases... We need to review the code to find them.
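
A small sketch of what such a strict / non-strict switch could look like: a single helper that panics under strict mode (for testing) and only logs under non-strict mode (for production). The flag, the helper, and the example call site are made-up illustrations, not the actual config item or the code paths listed above:

    use std::sync::atomic::{AtomicBool, Ordering};

    /// Whether inconsistencies should panic (testing) or only be logged (production).
    static STRICT_MODE: AtomicBool = AtomicBool::new(true);

    fn report_inconsistency(msg: &str) {
        if STRICT_MODE.load(Ordering::Relaxed) {
            panic!("stream inconsistency: {msg}");
        } else {
            eprintln!("ERROR: stream inconsistency (tolerated): {msg}");
        }
    }

    /// Example call site, standing in for checks like "row count should be
    /// non-negative" in the aggregation state.
    fn normalize_row_count(row_count: i64) -> i64 {
        if row_count < 0 {
            report_inconsistency(&format!("row count {row_count} should be non-negative"));
            return 0; // best-effort recovery when running in non-strict mode
        }
        row_count
    }

    fn main() {
        STRICT_MODE.store(false, Ordering::Relaxed); // non-strict: log, don't panic
        assert_eq!(normalize_row_count(-1), 0);
    }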

@hzxa21
Collaborator

hzxa21 commented Dec 20, 2023

When talking with @stdrc yesterday, we figured we can provide an option like "non-strict mode" to warn about these inconsistency problems instead of panicking. We would always use "strict mode" for testing; for production deployments, we can decide case by case whether to enable it.

Some known places that need to be refactored:

  • HashAgg
    src/stream/src/executor/aggregation/agg_group.rs:320:13: row count should be non-negative
    
  • HashJoin
    src/stream/src/executor/managed_state/join/join_entry_state.rs:43:44 unwrap()
    src/stream/src/executor/managed_state/join/join_entry_state.rs:51:13 pk [...] should be in the cache
    

There should be more cases... We need to review the code to find them.

Let me share more info about the three occurrences of the panics:

  1. join_entry_state.rs:51:13 pk [...] should be in the cache
    • image: nightly-20231123
    • stateful operators involved (no agg):
      • inner join
      • left outer join
  2. agg_group.rs:320:13: row count should be non-negative
    • image v1.5.0
    • stateful operators involved
      • inner join
      • left outer join
      • temporal filter
      • agg
  3. join_entry_state.rs:43:44 unwrap()
    • image v1.4.0
    • stateful operators involved
      • left outer join
      • agg

1 and 2 seem to be double deletes, while 3 seems to be a double insert. I wonder whether it could be caused by the temporal filter emitting previously emitted rows to downstream under some corner cases.

Updated: there is no temporal filter involved in 1 and 3. The only common stateful operator seems to be left outer join.

@hzxa21
Collaborator

hzxa21 commented Dec 20, 2023

When talking with @stdrc yesterday, we figured we can provide an option like "non-strict mode" to warn about these inconsistency problems instead of panicking. We would always use "strict mode" for testing; for production deployments, we can decide case by case whether to enable it.
Some known places that need to be refactored:

  • HashAgg
    src/stream/src/executor/aggregation/agg_group.rs:320:13: row count should be non-negative
    
  • HashJoin
    src/stream/src/executor/managed_state/join/join_entry_state.rs:43:44 unwrap()
    src/stream/src/executor/managed_state/join/join_entry_state.rs:51:13 pk [...] should be in the cache
    

There should be more cases... We need to review the code to find them.

Let me share more info about the three occurrences of the panics:

  1. join_entry_state.rs:51:13 pk [...] should be in the cache

    • image: nightly-20231123

    • stateful operators involved (no agg):

      • inner join
      • left outer join
      • temporal filter
  2. agg_group.rs:320:13: row count should be non-negative

    • image v1.5.0

    • stateful operators involved

      • inner join
      • left outer join
      • temporal filter
      • agg
  3. join_entry_state.rs:43:44 unwrap()

    • image v1.4.0

    • stateful operators involved

      • left outer join
      • agg

1 and 2 seem to be double deletes, while 3 seems to be a double insert. I wonder whether it could be caused by the temporal filter emitting previously emitted rows to downstream under some corner cases.

Updated: there is no temporal filter involved in 3. The only common stateful operator seems to be left outer join.

Suspicious join related PR introduced since v1.4.0: #13214

@st1page
Contributor

st1page commented Dec 23, 2023

Could be because of #13351. I will do some investigation.

@st1page
Contributor

st1page commented Dec 23, 2023

Could be because of #13351. I will do some investigation.

It doesn't seem so... #14166

@lmatz
Contributor

lmatz commented Dec 26, 2023

one prod case of inconsistency: #14197

@kwannoel
Contributor Author

kwannoel commented Jan 9, 2024

Another idea: we could make the stream consistency check toggleable at runtime.
This means that inconsistency bugs which trigger crash loops can be caught and debugged in the user's cluster.

And it will catch most cases.

It can be supported in the same way as described in the issue description: by adding an additional executor, and using a system variable to toggle it.

Edit:
@fuyufjh raised a good point: the historical data would then need to be stored all the way back to detect the inconsistency. So it seems like it can only be used in testing again..

Edit 2:
From an offline discussion with @st1page, @TennyZhuang proposed storing the most recent operations in a log / on local disk, bounded by a window, e.g. the 10,000 most recent operations, and using that window to check (a sketch follows at the end of this comment).

So it can work, limited to the recent operations.

Edit 3:
Typically these bugs are triggered by very large data volumes.
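
A rough sketch of that bounded-history idea, assuming the window of recent operations is kept in memory (a real version would spill to a log / local disk as proposed); the types and the 10,000-op cap are illustrative only:

    use std::collections::{HashMap, VecDeque};

    #[derive(Clone, Copy)]
    enum Op { Insert, Delete }

    /// Keeps only the most recent `cap` operations and checks new ops against
    /// that window; anything older than the window cannot be judged.
    struct RecentOps {
        cap: usize,
        window: VecDeque<(Op, i64)>,
        present: HashMap<i64, bool>, // pk -> currently inserted (within the window)
    }

    impl RecentOps {
        fn new(cap: usize) -> Self {
            Self { cap, window: VecDeque::new(), present: HashMap::new() }
        }

        /// Returns false if the op is inconsistent with the tracked window.
        fn check_and_record(&mut self, op: Op, pk: i64) -> bool {
            let ok = match (op, self.present.get(&pk)) {
                (Op::Insert, Some(true)) => false,  // double insert
                (Op::Delete, Some(false)) => false, // double delete
                _ => true,                          // unknown or consistent
            };
            self.present.insert(pk, matches!(op, Op::Insert));
            self.window.push_back((op, pk));
            if self.window.len() > self.cap {
                // Evict the oldest op; keep its presence info only if a newer
                // op in the window still refers to the same key.
                if let Some((_, old_pk)) = self.window.pop_front() {
                    if !self.window.iter().any(|&(_, p)| p == old_pk) {
                        self.present.remove(&old_pk);
                    }
                }
            }
            ok
        }
    }

    fn main() {
        let mut ops = RecentOps::new(10_000);
        assert!(ops.check_and_record(Op::Insert, 1));
        assert!(!ops.check_and_record(Op::Insert, 1)); // double insert detected
    }

Operations older than the window simply cannot be judged, which is the "limited to the recent operations" caveat above.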


github-actions bot commented Jul 3, 2024

This issue has been open for 60 days with no activity.

If you think it is still relevant today, and needs to be done in the near future, you can comment to update the status, or just manually remove the no-issue-activity label.

You can also confidently close this issue as not planned to keep our backlog clean.
Don't worry if you think the issue is still valuable to continue in the future.
It's searchable and can be reopened when it's time. 😄

@stdrc
Contributor

stdrc commented Jul 23, 2024

After some work in the previous quarter, we now have a non-strict mode that allows inconsistent streams in our system, and there is a description of the related config item in https://docs.risingwave.com/docs/dev/node-specific-configurations/#streaming-configurations. So I guess we can close this issue as completed for now?

We have to acknowledge, though, that non-strict mode merely ignores inconsistency; we still don't have a better way to identify it. We do have code in several executors that checks the Ops and data in the input chunk, but not all executors are covered, so it is still hard to backtrack from a panic to find the origin in strict mode.

If anyone has other thoughts, please feel free to reopen this issue.

@stdrc stdrc closed this as completed Jul 23, 2024
@stdrc stdrc self-assigned this Jul 23, 2024