[fix](restore) Reset next version for remote table when restore #40118

Merged
merged 3 commits into from
Aug 30, 2024

Conversation

@smallx (Contributor) commented Aug 29, 2024

When restoring a table that does not exist locally, we should reset the next version of every partition of the remote table to its visible version + 1.
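The reset is simple per-partition arithmetic. Below is a minimal Python sketch under a simplified, assumed model of partition version metadata. `Partition`, its fields, and `reset_next_versions` are hypothetical names for illustration, not Doris FE code, and the example numbers are illustrative only.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    """Hypothetical, simplified view of a partition's version metadata."""
    partition_id: int
    visible_version: int  # highest version visible to queries
    next_version: int     # version the next committed transaction would use


def reset_next_versions(partitions: list[Partition]) -> None:
    """Set next_version = visible_version + 1 for every partition, discarding
    any gap inherited from the source cluster's snapshot (assumed model)."""
    for p in partitions:
        p.next_version = p.visible_version + 1


# Example: a restored partition that inherited a gap from the source.
restored = [Partition(partition_id=15027, visible_version=1037596, next_version=1037627)]
reset_next_versions(restored)
print(restored[0].next_version)  # 1037597
```

The reset only applies to tables that did not exist locally before the restore, matching the condition stated above.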

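In high-concurrency insert scenarios, the next version of a CCR (cross-cluster replication) source table can be considerably larger than its visible version. After the target cluster restores the full snapshot and switches to incremental binlog replication, the version used when committing a transaction (taken from the next version) can therefore be considerably larger than the current visible version.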

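For Merge-on-Write (MoW) tables, this leaves a gap in the publish versions, so the incremental binlog can never be published. The transaction stays in the `COMMITTED` state indefinitely, with an ErrMsg similar to `wait for publishing partition 15027 version 1037597. self version: 1037627. table 15025`.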
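The ErrMsg can be read as a version-continuity check. The sketch below assumes a simplified publish rule for MoW tables (a transaction's version may be published only when it equals the partition's visible version + 1) and reuses the numbers from the ErrMsg, assuming the visible version is 1037596 because the publisher is waiting for 1037597. `can_publish` and `missing_versions` are hypothetical helpers, not Doris APIs.

```python
def can_publish(visible_version: int, txn_version: int) -> bool:
    """Assumed MoW publish rule (simplified): a transaction's version can be
    published only if it directly follows the partition's visible version."""
    return txn_version == visible_version + 1


def missing_versions(visible_version: int, txn_version: int) -> range:
    """Versions that would have to become visible before txn_version can publish."""
    return range(visible_version + 1, txn_version)


# Illustrative numbers from the ErrMsg; visible version 1037596 is inferred.
visible = 1037596
stalled_txn = 1037627  # commit version inherited from the source's next version
print(can_publish(visible, stalled_txn))            # False
print(len(missing_versions(visible, stalled_txn)))  # 30

# After the reset, the next commit uses visible + 1.
print(can_publish(visible, visible + 1))            # True
```

Under these assumptions, the transaction at version 1037627 waits for the 30 versions 1037597 through 1037626. Nothing on the target cluster will ever commit those versions, so the transaction stays `COMMITTED`. Resetting the next version to visible version + 1 makes the first incremental commit use 1037597, which passes the check.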

@doris-robot

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR

Since 2024-03-18, the documentation has been moved to doris-website.
See Doris Document.

@smallx (Contributor Author) commented Aug 29, 2024

@w41ter please help review, thank you.

@w41ter (Contributor) left a comment

LGTM

@github-actions bot added the `approved` label (Indicates a PR has been approved by one committer.) Aug 29, 2024

PR approved by at least one committer and no changes requested.

PR approved by anyone and no changes requested.

@dataroaring (Contributor)

run buildall

@dataroaring (Contributor) left a comment

LGTM

@doris-robot

TPC-H: Total hot run time: 38251 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 8748be0b6f9a9dd5ed6952255903c9eb6d95ce4c, data reload: false

------ Round 1 ----------------------------------
q1	13481	4500	4304	4304
q2	1313	181	177	177
q3	7401	1098	1076	1076
q4	6270	818	739	739
q5	4982	2869	2788	2788
q6	236	140	137	137
q7	964	622	612	612
q8	5939	2091	2038	2038
q9	6700	6551	6553	6551
q10	3588	2219	2186	2186
q11	385	244	244	244
q12	400	220	226	220
q13	7067	3028	3046	3028
q14	288	227	241	227
q15	519	480	492	480
q16	593	501	522	501
q17	968	708	781	708
q18	7674	6940	6997	6940
q19	2356	998	1064	998
q20	637	326	340	326
q21	3896	2959	3030	2959
q22	1084	1012	1023	1012
Total cold run time: 76741 ms
Total hot run time: 38251 ms

----- Round 2, with runtime_filter_mode=off -----
q1	4375	4310	4300	4300
q2	379	268	268	268
q3	2925	2689	2673	2673
q4	1922	1663	1694	1663
q5	5441	5432	5409	5409
q6	220	132	134	132
q7	2101	1712	1770	1712
q8	3193	3393	3351	3351
q9	8470	8495	8512	8495
q10	3448	3243	3182	3182
q11	599	493	503	493
q12	784	622	608	608
q13	7303	3047	3015	3015
q14	310	288	274	274
q15	514	479	480	479
q16	643	580	556	556
q17	1803	1494	1486	1486
q18	7784	7441	7409	7409
q19	1675	1524	1365	1365
q20	2056	1817	1821	1817
q21	5432	5179	5233	5179
q22	1140	1020	1036	1020
Total cold run time: 62517 ms
Total hot run time: 54886 ms

@doris-robot

TPC-DS: Total hot run time: 187956 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 8748be0b6f9a9dd5ed6952255903c9eb6d95ce4c, data reload: false

query1	715	376	372	372
query2	5494	1999	1929	1929
query3	5391	220	229	220
query4	30403	23120	23196	23120
query5	3587	502	502	502
query6	260	170	172	170
query7	4578	296	296	296
query8	245	214	199	199
query9	8709	2510	2526	2510
query10	449	278	263	263
query11	17954	15297	15152	15152
query12	145	102	98	98
query13	1627	394	378	378
query14	9683	6960	7125	6960
query15	268	175	177	175
query16	7887	443	436	436
query17	1580	579	568	568
query18	2090	306	296	296
query19	199	154	150	150
query20	118	115	113	113
query21	213	108	103	103
query22	4314	4036	4019	4019
query23	34027	33736	33346	33346
query24	11377	2921	2870	2870
query25	639	388	392	388
query26	1219	164	161	161
query27	2756	289	280	280
query28	7597	2142	2116	2116
query29	840	419	412	412
query30	316	165	159	159
query31	997	778	800	778
query32	97	58	57	57
query33	768	286	289	286
query34	1036	502	496	496
query35	852	739	710	710
query36	1085	956	915	915
query37	166	94	91	91
query38	3920	3865	3883	3865
query39	1452	1398	1390	1390
query40	199	120	119	119
query41	48	46	44	44
query42	116	99	96	96
query43	518	467	481	467
query44	1262	767	761	761
query45	194	168	172	168
query46	1114	798	746	746
query47	1881	1754	1787	1754
query48	366	293	302	293
query49	1073	436	428	428
query50	811	419	413	413
query51	7059	7055	7033	7033
query52	101	87	91	87
query53	257	185	188	185
query54	1113	453	477	453
query55	81	79	80	79
query56	283	253	261	253
query57	1211	1074	1081	1074
query58	238	231	246	231
query59	3004	2839	2975	2839
query60	294	265	270	265
query61	104	99	98	98
query62	832	654	666	654
query63	226	195	190	190
query64	5475	673	723	673
query65	3221	3111	3133	3111
query66	1468	357	352	352
query67	15594	15362	15187	15187
query68	3593	604	593	593
query69	401	277	273	273
query70	1193	1069	1134	1069
query71	339	278	280	278
query72	6382	4027	3922	3922
query73	761	339	341	339
query74	9254	8879	8842	8842
query75	3395	2676	2753	2676
query76	2238	969	945	945
query77	529	310	322	310
query78	9569	9055	8941	8941
query79	1038	545	542	542
query80	692	504	513	504
query81	457	248	238	238
query82	246	150	155	150
query83	175	158	158	158
query84	221	81	76	76
query85	680	289	279	279
query86	331	302	295	295
query87	4407	4269	4276	4269
query88	3185	2360	2370	2360
query89	387	302	285	285
query90	1861	197	198	197
query91	129	107	101	101
query92	63	55	52	52
query93	1043	547	543	543
query94	695	295	288	288
query95	359	259	262	259
query96	590	278	280	278
query97	3207	3040	3055	3040
query98	213	212	199	199
query99	1474	1280	1253	1253
Total cold run time: 282303 ms
Total hot run time: 187956 ms

@doris-robot

ClickBench: Total hot run time: 33.16 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 8748be0b6f9a9dd5ed6952255903c9eb6d95ce4c, data reload: false

query1	0.04	0.04	0.04
query2	0.08	0.04	0.03
query3	0.23	0.06	0.05
query4	1.66	0.09	0.10
query5	0.49	0.48	0.50
query6	1.13	0.72	0.72
query7	0.03	0.02	0.01
query8	0.05	0.04	0.04
query9	0.55	0.48	0.50
query10	0.54	0.55	0.54
query11	0.15	0.12	0.13
query12	0.15	0.12	0.13
query13	0.61	0.59	0.58
query14	2.06	2.12	2.06
query15	0.84	0.82	0.81
query16	0.36	0.38	0.36
query17	1.07	1.05	1.01
query18	0.22	0.20	0.20
query19	1.85	1.71	1.77
query20	0.01	0.01	0.01
query21	15.40	0.69	0.68
query22	4.45	5.55	3.01
query23	18.26	1.44	1.38
query24	2.20	0.23	0.22
query25	0.14	0.09	0.08
query26	0.26	0.17	0.18
query27	0.08	0.08	0.08
query28	13.20	1.01	1.01
query29	12.61	3.31	3.31
query30	0.24	0.05	0.06
query31	2.87	0.39	0.40
query32	3.26	0.49	0.47
query33	2.97	3.00	2.99
query34	17.13	4.45	4.36
query35	4.41	4.41	4.45
query36	0.65	0.51	0.49
query37	0.19	0.16	0.17
query38	0.15	0.15	0.15
query39	0.05	0.05	0.04
query40	0.14	0.12	0.12
query41	0.10	0.06	0.05
query42	0.07	0.04	0.04
query43	0.04	0.04	0.04
Total cold run time: 110.99 s
Total hot run time: 33.16 s

@w41ter merged commit b53b8c9 into apache:master Aug 30, 2024 (27 of 29 checks passed)
w41ter pushed a commit to w41ter/incubator-doris that referenced this pull request Aug 30, 2024
yiguolei pushed a commit that referenced this pull request Aug 30, 2024
#40165)

cherry pick from #40118

Co-authored-by: smallx <e9999e@163.com>
w41ter added a commit that referenced this pull request Aug 30, 2024
#40166)

cherry pick from #40118

Co-authored-by: smallx <e9999e@163.com>
dataroaring pushed a commit that referenced this pull request Sep 3, 2024

5 participants