[fix](hive) Preserve empty text records #64671

Merged
Gabriel39 merged 4 commits into apache:master from Gabriel39:fix-doris-26370-hive-text-empty-line
Jun 25, 2026

@Gabriel39 Gabriel39 commented Jun 22, 2026

Contributor

What problem does this PR solve?

Issue Number: close #xxx

Problem Summary:

When scanning Hive TEXTFILE tables, Doris previously skipped empty physical lines unless read_csv_empty_line_as_null was enabled. This is inconsistent with Hive TEXTFILE semantics, where an empty physical line is still a record. For a single-column text table, the empty line represents one empty field; for a multi-column text table, the missing trailing fields should be filled using the table's null format.

As a result, Doris can return fewer rows than Hive for text files that contain empty lines, especially when the table uses LazySimpleSerDe with a custom or default serialization.null.format.

This PR fixes the behavior by adding a format-level hook for empty-line handling:

  • CSV keeps the existing default behavior and does not treat empty lines as records.
  • Hive TEXT overrides the hook and treats empty physical lines as records.
  • Empty Hive text lines are passed through normal field deserialization so string/null handling stays consistent with null_format.
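
The hook described in the list above can be modelled in a few lines. This is a minimal Python sketch, not the Doris C++ code: the class names, the `read` method, and the way `empty_line_as_null` produces a row of NULLs are assumptions made for illustration. Only the hook name `_empty_line_as_record` and the option name `read_csv_empty_line_as_null` come from this PR's discussion.

```python
from typing import Iterable, List, Optional

Row = List[Optional[str]]


class CsvReader:
    """Toy stand-in for a delimited-text reader (hypothetical, not the Doris class)."""

    def __init__(self, field_delim: str = ",", empty_line_as_null: bool = False):
        self.field_delim = field_delim
        # Models the read_csv_empty_line_as_null option (assumed semantics:
        # an empty line becomes a row of NULLs instead of being skipped).
        self.empty_line_as_null = empty_line_as_null

    def _empty_line_as_record(self) -> bool:
        # Format-level hook: plain CSV does not treat an empty line as a record.
        return False

    def _deserialize(self, line: str) -> Row:
        # Normal field deserialization; "" splits into one empty field.
        return list(line.split(self.field_delim))

    def read(self, lines: Iterable[str], num_cols: int) -> List[Row]:
        rows: List[Row] = []
        for line in lines:
            if line == "" and not self._empty_line_as_record():
                if self.empty_line_as_null:
                    rows.append([None] * num_cols)
                continue  # default CSV behavior: skip the empty line
            # Non-empty lines, and empty lines in formats that override the
            # hook, go through the same deserialization path.
            rows.append(self._deserialize(line))
        return rows


class HiveTextReader(CsvReader):
    """Hive TEXT overrides the hook so empty physical lines stay records."""

    def _empty_line_as_record(self) -> bool:
        return True


csv_rows = CsvReader().read(["a", "", "b"], num_cols=1)
hive_rows = HiveTextReader(field_delim="\x01").read(["a", "", "b"], num_cols=1)
```

With the same three input lines, `csv_rows` has two rows and `hive_rows` has three, the middle one holding a single empty-string field.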

The PR also adds Hive regression coverage for:

  • a single-column text table with custom serialization.null.format;
  • a multi-column text table using the default Hive null marker \N;
  • preservation of empty records and correct NULL/empty-string classification.
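
The NULL versus empty-string classification that these cases check can be illustrated as follows. This is a hedged sketch under stated assumptions: `deserialize_hive_text_line` is a hypothetical helper, the marker `"NULL_MARK"` is an invented custom null format, and the real regression tables may use different values. It assumes string columns only.

```python
from typing import List, Optional


def deserialize_hive_text_line(
    line: str,
    num_cols: int,
    field_delim: str = "\x01",   # Ctrl-A, the LazySimpleSerDe default delimiter
    null_format: str = "\\N",    # default Hive null marker, i.e. backslash + N
) -> List[Optional[str]]:
    """Split one physical line into num_cols string fields (illustrative only)."""
    raw = line.split(field_delim)
    fields: List[Optional[str]] = []
    for i in range(num_cols):
        if i >= len(raw):
            # Missing trailing field: filled as NULL.
            fields.append(None)
        elif raw[i] == null_format:
            # Field text equals the null marker: NULL.
            fields.append(None)
        else:
            # Anything else, including "", is kept as a string value.
            fields.append(raw[i])
    return fields


# Single column, hypothetical custom null marker: empty line is an empty string.
single_custom = deserialize_hive_text_line("", 1, null_format="NULL_MARK")
# Single column, null marker set to the empty string: empty line is NULL.
single_empty_marker = deserialize_hive_text_line("", 1, null_format="")
# Two columns, default marker: first field is "", missing second field is NULL.
multi_default = deserialize_hive_text_line("", 2, field_delim="\t")
```

The point of the sketch is that an empty line always yields exactly one row; whether its fields count as NULL or as empty strings depends only on the null format.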

In addition, the credit-data Hive fixture upload order is made refresh-safe. The Hive regression module refresh may rerun all data/regression setup scripts; crdmm_data now recreates the Hive table before re-uploading its HDFS data so DROP TABLE cannot remove freshly uploaded files.
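
Why the order matters can be seen in a toy model. The sketch below is not the fixture script: `ToyWarehouse` and both refresh functions are hypothetical, and the model assumes that dropping the table also deletes the files under its location, which is the failure mode the paragraph above describes.

```python
from typing import Dict, Set


class ToyWarehouse:
    """In-memory stand-in for Hive metadata plus HDFS files (hypothetical)."""

    def __init__(self) -> None:
        self.tables: Set[str] = set()
        self.files: Dict[str, Set[str]] = {}

    def drop_table(self, name: str) -> None:
        # Assumption: DROP TABLE also removes the data under the table location.
        self.tables.discard(name)
        self.files.pop(name, None)

    def create_table(self, name: str) -> None:
        self.tables.add(name)
        self.files.setdefault(name, set())

    def upload(self, name: str, filename: str) -> None:
        self.files.setdefault(name, set()).add(filename)


def refresh_upload_first(wh: ToyWarehouse, table: str, filename: str) -> None:
    # Unsafe order: the later DROP TABLE wipes the freshly uploaded file.
    wh.upload(table, filename)
    wh.drop_table(table)
    wh.create_table(table)


def refresh_recreate_first(wh: ToyWarehouse, table: str, filename: str) -> None:
    # Refresh-safe order: recreate the table, then upload its data.
    wh.drop_table(table)
    wh.create_table(table)
    wh.upload(table, filename)


unsafe = ToyWarehouse()
refresh_upload_first(unsafe, "crdmm_data", "part-0")

safe = ToyWarehouse()
refresh_recreate_first(safe, "crdmm_data", "part-0")
refresh_recreate_first(safe, "crdmm_data", "part-0")  # rerun on module refresh
```

After these calls the `unsafe` warehouse has a table with no files, while the `safe` one still holds `part-0` even though the setup ran twice.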

Release note

Fix Hive TEXTFILE scans to preserve empty physical lines as records, matching Hive behavior.

Check List (For Author)

  • Test: Regression test
    • Added/updated external_table_p0/hive/test_hive_serde_prop.
    • Ran ./run-regression-test.sh --run -d external_table_p0/hive -s test_hive_serde_prop; local config had enableHiveTest=false, so the Hive test body was skipped.
    • Ran ./run-regression-test.sh --run -d external_table_p0/hive -s test_external_credit_data; local config had enableHiveTest=false, so the Hive test body was skipped.
    • Ran bash -n docker/thirdparties/docker-compose/hive/scripts/data/regression/crdmm_data/run.sh.
    • Ran git diff --check.
  • Behavior changed: Yes. Hive TEXTFILE scans now preserve empty physical lines as records instead of skipping them.
  • Does this need documentation: No

@Gabriel39

Contributor Author

/review

@hello-stephen

Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@github-actions

Contributor

Codex automated review failed and did not complete.

Error: You've hit your usage limit. Visit https://chatgpt.com/codex/settings/usage to purchase more credits or try again at Jun 24th, 2026 10:56 PM.
Workflow run: https://github.com/apache/doris/actions/runs/27933954956

Please inspect the workflow logs and rerun the review after the underlying issue is resolved.

@Gabriel39

Contributor Author

/review

@Gabriel39

Contributor Author

run buildall

@github-actions

Contributor

Codex automated review failed and did not complete.

Error: You've hit your usage limit. Visit https://chatgpt.com/codex/settings/usage to purchase more credits or try again at Jun 24th, 2026 10:56 PM.
Workflow run: https://github.com/apache/doris/actions/runs/27935609698

Please inspect the workflow logs and rerun the review after the underlying issue is resolved.

@zclllyybb

Contributor

/review

@github-actions github-actions Bot left a comment

Contributor

Review result: no blocking issues found in the PR diff.

Critical checkpoint conclusions:

  • Goal/test proof: the change preserves single-column Hive text empty records by routing empty text lines through normal Hive text deserialization, and the added Hive regression checks both row count and null/empty-string classification for Hive 2 and Hive 3 outputs.
  • Scope/focus: the implementation is narrow, limited to CsvReader/TextReader empty-line handling plus regression seed/output updates.
  • Concurrency/lifecycle/config/compatibility: no new concurrency, lifecycle-managed state, config item, protocol field, storage format, or rolling-upgrade compatibility surface was introduced.
  • Parallel paths: CSV behavior is preserved through the default _empty_line_as_record() == false; Hive text overrides it. COUNT pushdown follows the same empty-record predicate, and nullable counted-slot pushdown remains blocked by existing FE checks.
  • Tests/results: the added outputs are deterministic scalar aggregates. I did not run the full external Hive regression locally; current CI shows formatter/checkstyle/compile passing, external/regression TeamCity checks still pending, and the macOS BE UT failure is environmental (JAVA version is 25, it must be JDK-17) before tests run.
  • Observability/transactions/persistence/data writes: not applicable for this reader-side parsing fix.
  • Performance: the added branch only affects zero-length lines and keeps normal non-empty-line processing unchanged.

User focus: no additional user-provided focus points were supplied.

Subagent conclusions: optimizer-rewrite found no candidate findings; tests-session-config found no candidate findings. I independently verified their dismissed points in the reader, line-reader, Hive SerDe property, aggregate pushdown, and regression-output paths. Final convergence round 1 ended with both subagents replying NO_NEW_VALUABLE_FINDINGS for the same ledger/comment set, and no inline comments were proposed.

@Gabriel39 Gabriel39 force-pushed the fix-doris-26370-hive-text-empty-line branch from 32d1e52 to ecdd9f8 Compare June 22, 2026 09:08
@Gabriel39

Contributor Author

run buildall

@hello-stephen

Contributor
TPC-H: Total hot run time: 29493 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit ecdd9f81a1bc5e009cca015db5ecddeded5993a6, data reload: false

------ Round 1 ----------------------------------
============================================
q1	17600	4171	4157	4157
q2	2060	314	190	190
q3	10281	1522	883	883
q4	4685	483	352	352
q5	7523	861	583	583
q6	188	181	141	141
q7	785	858	623	623
q8	9352	1578	1601	1578
q9	5817	4550	4503	4503
q10	6772	1797	1527	1527
q11	463	283	240	240
q12	622	446	293	293
q13	18096	3465	2810	2810
q14	274	254	239	239
q15	q16	780	778	712	712
q17	1000	897	1006	897
q18	7370	5817	5605	5605
q19	1296	1428	1139	1139
q20	502	419	270	270
q21	5843	2648	2449	2449
q22	449	361	302	302
Total cold run time: 101758 ms
Total hot run time: 29493 ms

----- Round 2, with runtime_filter_mode=off -----
============================================
q1	4567	4480	4498	4480
q2	356	382	222	222
q3	4683	4993	4435	4435
q4	2144	2156	1385	1385
q5	4586	4402	4473	4402
q6	250	190	130	130
q7	2143	2110	1632	1632
q8	2826	2335	2368	2335
q9	8356	8336	8186	8186
q10	4875	4765	4325	4325
q11	626	432	431	431
q12	838	784	538	538
q13	3317	3651	2942	2942
q14	298	299	280	280
q15	q16	714	775	661	661
q17	1404	1424	1546	1424
q18	8010	7420	7379	7379
q19	1192	1140	1105	1105
q20	2228	2228	1951	1951
q21	5439	4790	4537	4537
q22	536	493	423	423
Total cold run time: 59388 ms
Total hot run time: 53203 ms

@hello-stephen

Contributor
TPC-DS: Total hot run time: 173623 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit ecdd9f81a1bc5e009cca015db5ecddeded5993a6, data reload: false

query5	4356	641	490	490
query6	439	195	170	170
query7	4967	541	319	319
query8	357	215	192	192
query9	8751	4159	4136	4136
query10	459	315	264	264
query11	5959	2326	2142	2142
query12	157	107	99	99
query13	1236	614	411	411
query14	6321	5447	5076	5076
query14_1	4373	4366	4368	4366
query15	211	205	173	173
query16	985	452	429	429
query17	915	695	565	565
query18	2434	472	339	339
query19	198	188	147	147
query20	121	112	115	112
query21	218	142	120	120
query22	13765	13644	13359	13359
query23	17483	16531	16130	16130
query23_1	16321	16399	16293	16293
query24	7444	1770	1345	1345
query24_1	1343	1344	1293	1293
query25	572	469	405	405
query26	1305	344	177	177
query27	2658	555	358	358
query28	4466	2077	2071	2071
query29	1080	658	505	505
query30	307	240	206	206
query31	1116	1074	978	978
query32	111	65	65	65
query33	538	326	259	259
query34	1253	1144	662	662
query35	767	790	697	697
query36	1361	1388	1229	1229
query37	148	109	94	94
query38	1898	1758	1683	1683
query39	940	919	924	919
query39_1	881	890	869	869
query40	223	128	106	106
query41	71	68	68	68
query42	90	90	89	89
query43	324	330	288	288
query44	1465	788	801	788
query45	201	191	178	178
query46	1078	1228	751	751
query47	2369	2344	2193	2193
query48	367	433	305	305
query49	651	482	365	365
query50	1015	373	282	282
query51	4324	4335	4337	4335
query52	85	84	71	71
query53	251	264	199	199
query54	287	233	208	208
query55	75	71	72	71
query56	258	245	222	222
query57	1438	1408	1338	1338
query58	244	212	213	212
query59	1635	1661	1464	1464
query60	281	242	227	227
query61	154	147	150	147
query62	691	653	582	582
query63	226	192	197	192
query64	2564	758	627	627
query65	4882	4810	4797	4797
query66	1798	454	343	343
query67	29867	29714	29610	29610
query68	3095	1513	905	905
query69	423	303	262	262
query70	1119	980	969	969
query71	282	239	221	221
query72	2831	2658	2340	2340
query73	836	770	439	439
query74	5140	4969	4777	4777
query75	2637	2598	2253	2253
query76	2342	1203	819	819
query77	358	381	308	308
query78	12548	12448	11994	11994
query79	1422	1209	811	811
query80	1279	501	401	401
query81	524	286	241	241
query82	634	164	131	131
query83	351	280	251	251
query84	318	147	118	118
query85	920	527	421	421
query86	452	307	289	289
query87	1847	1833	1775	1775
query88	3717	2802	2814	2802
query89	429	380	336	336
query90	1882	188	192	188
query91	177	164	136	136
query92	65	58	60	58
query93	1572	1428	869	869
query94	700	344	321	321
query95	680	489	337	337
query96	1018	761	360	360
query97	2738	2729	2563	2563
query98	212	211	200	200
query99	1170	1155	1033	1033
Total cold run time: 259286 ms
Total hot run time: 173623 ms

@hello-stephen

Contributor
ClickBench: Total hot run time: 25.33 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit ecdd9f81a1bc5e009cca015db5ecddeded5993a6, data reload: false

query1	0.01	0.01	0.01
query2	0.10	0.06	0.05
query3	0.26	0.13	0.14
query4	1.62	0.14	0.13
query5	0.26	0.23	0.22
query6	1.22	1.12	1.08
query7	0.04	0.01	0.01
query8	0.06	0.04	0.04
query9	0.39	0.32	0.32
query10	0.56	0.54	0.55
query11	0.23	0.15	0.14
query12	0.18	0.14	0.15
query13	0.47	0.48	0.48
query14	1.02	0.99	1.00
query15	0.61	0.59	0.60
query16	0.33	0.36	0.33
query17	1.16	1.13	1.10
query18	0.23	0.21	0.21
query19	2.04	1.98	2.00
query20	0.02	0.01	0.02
query21	15.43	0.22	0.12
query22	4.94	0.05	0.06
query23	16.14	0.29	0.13
query24	3.00	0.44	0.34
query25	0.12	0.05	0.04
query26	0.74	0.20	0.16
query27	0.03	0.03	0.04
query28	3.52	0.88	0.54
query29	12.52	4.32	3.46
query30	0.28	0.15	0.15
query31	2.77	0.61	0.31
query32	3.23	0.60	0.49
query33	3.19	3.17	3.35
query34	15.60	4.20	3.51
query35	3.57	3.54	3.53
query36	0.56	0.44	0.41
query37	0.08	0.07	0.07
query38	0.06	0.04	0.04
query39	0.04	0.03	0.03
query40	0.18	0.16	0.15
query41	0.09	0.03	0.04
query42	0.03	0.03	0.03
query43	0.05	0.04	0.04
Total cold run time: 96.98 s
Total hot run time: 25.33 s

@hello-stephen

Contributor

BE Regression && UT Coverage Report

Increment line coverage 100.00% (16/16) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 74.03% (28375/38329)
Line Coverage 57.93% (309168/533648)
Region Coverage 54.60% (258178/472881)
Branch Coverage 56.00% (112253/200469)

@Gabriel39 Gabriel39 force-pushed the fix-doris-26370-hive-text-empty-line branch 3 times, most recently from 1f8297f to c26e6e6 Compare June 23, 2026 01:43
@Gabriel39

Contributor Author

run buildall

@hello-stephen

Contributor
TPC-H: Total hot run time: 29621 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit c26e6e63f87e9836f1b83777a2c7920359535b32, data reload: false

------ Round 1 ----------------------------------
============================================
q1	17726	4388	4341	4341
q2	2027	321	193	193
q3	10287	1460	869	869
q4	4681	471	338	338
q5	7540	883	571	571
q6	187	181	142	142
q7	831	873	636	636
q8	9336	1609	1600	1600
q9	5942	4529	4550	4529
q10	6768	1811	1537	1537
q11	423	277	245	245
q12	630	424	301	301
q13	18100	3462	2784	2784
q14	268	270	257	257
q15	q16	799	791	716	716
q17	1023	986	1018	986
q18	7033	5843	5554	5554
q19	1259	1230	1023	1023
q20	471	408	266	266
q21	6004	2666	2429	2429
q22	457	369	304	304
Total cold run time: 101792 ms
Total hot run time: 29621 ms

----- Round 2, with runtime_filter_mode=off -----
============================================
q1	4506	4455	4482	4455
q2	346	382	236	236
q3	4660	4970	4411	4411
q4	2119	2211	1397	1397
q5	4569	4388	4400	4388
q6	244	186	136	136
q7	2052	2076	1728	1728
q8	2728	2330	2364	2330
q9	8372	8480	8058	8058
q10	4819	4781	4295	4295
q11	611	448	422	422
q12	774	771	563	563
q13	3333	3713	2877	2877
q14	288	314	276	276
q15	q16	731	741	670	670
q17	1444	1390	1524	1390
q18	8138	7600	7472	7472
q19	1206	1094	1099	1094
q20	2229	2235	1979	1979
q21	5420	4745	4641	4641
q22	536	484	455	455
Total cold run time: 59125 ms
Total hot run time: 53273 ms

@hello-stephen

Contributor
TPC-DS: Total hot run time: 173045 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit c26e6e63f87e9836f1b83777a2c7920359535b32, data reload: false

query5	4324	653	489	489
query6	441	193	172	172
query7	4824	570	313	313
query8	376	222	206	206
query9	8739	4050	4066	4050
query10	434	309	275	275
query11	5936	2327	2106	2106
query12	152	113	99	99
query13	1272	599	472	472
query14	6379	5422	5038	5038
query14_1	4333	4327	4353	4327
query15	199	198	178	178
query16	996	480	414	414
query17	898	678	559	559
query18	2426	475	335	335
query19	194	192	138	138
query20	105	105	103	103
query21	215	135	117	117
query22	13701	13590	13502	13502
query23	17429	16576	16172	16172
query23_1	16364	16259	16311	16259
query24	7461	1796	1310	1310
query24_1	1336	1297	1294	1294
query25	549	431	391	391
query26	1281	319	177	177
query27	2727	588	353	353
query28	4532	2062	2041	2041
query29	1103	636	521	521
query30	314	236	199	199
query31	1109	1066	965	965
query32	100	59	57	57
query33	521	309	243	243
query34	1200	1188	680	680
query35	744	784	675	675
query36	1379	1339	1219	1219
query37	153	102	92	92
query38	1904	1726	1704	1704
query39	933	922	908	908
query39_1	875	866	896	866
query40	215	120	103	103
query41	65	60	61	60
query42	89	85	85	85
query43	315	325	273	273
query44	1421	778	775	775
query45	193	189	179	179
query46	1070	1227	768	768
query47	2351	2330	2245	2245
query48	405	428	280	280
query49	620	457	340	340
query50	996	351	262	262
query51	4414	4265	4215	4215
query52	81	81	68	68
query53	254	269	195	195
query54	266	219	214	214
query55	72	70	69	69
query56	234	234	228	228
query57	1464	1389	1324	1324
query58	244	212	209	209
query59	1569	1637	1437	1437
query60	284	253	230	230
query61	152	155	148	148
query62	692	641	578	578
query63	235	194	189	189
query64	2519	791	613	613
query65	4849	4762	4755	4755
query66	1786	465	345	345
query67	29884	29861	29678	29678
query68	3104	1534	879	879
query69	424	298	265	265
query70	1069	978	961	961
query71	294	230	208	208
query72	2982	2741	2430	2430
query73	839	828	405	405
query74	5091	4983	4786	4786
query75	2648	2629	2254	2254
query76	2319	1200	813	813
query77	356	392	297	297
query78	12388	12320	12096	12096
query79	1231	1171	712	712
query80	547	503	409	409
query81	449	284	241	241
query82	235	164	131	131
query83	286	293	262	262
query84	297	150	118	118
query85	918	606	486	486
query86	345	302	283	283
query87	1827	1842	1780	1780
query88	3697	2884	2771	2771
query89	417	382	330	330
query90	2089	185	175	175
query91	171	160	134	134
query92	63	57	55	55
query93	1421	1419	919	919
query94	537	324	311	311
query95	684	476	339	339
query96	1020	814	378	378
query97	2703	2676	2571	2571
query98	219	212	204	204
query99	1191	1150	1022	1022
Total cold run time: 256981 ms
Total hot run time: 173045 ms

@hello-stephen

Contributor
ClickBench: Total hot run time: 25.32 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit c26e6e63f87e9836f1b83777a2c7920359535b32, data reload: false

query1	0.01	0.01	0.01
query2	0.09	0.05	0.05
query3	0.26	0.13	0.13
query4	1.61	0.14	0.14
query5	0.24	0.22	0.21
query6	1.26	1.10	1.08
query7	0.04	0.01	0.01
query8	0.06	0.04	0.04
query9	0.38	0.32	0.35
query10	0.57	0.57	0.56
query11	0.21	0.14	0.14
query12	0.19	0.14	0.15
query13	0.46	0.48	0.48
query14	1.00	1.00	1.01
query15	0.62	0.60	0.60
query16	0.31	0.33	0.31
query17	1.13	1.13	1.16
query18	0.22	0.21	0.22
query19	1.97	1.98	1.91
query20	0.02	0.01	0.02
query21	15.44	0.22	0.13
query22	4.80	0.05	0.06
query23	16.13	0.32	0.12
query24	3.01	0.45	0.34
query25	0.12	0.05	0.05
query26	0.75	0.22	0.15
query27	0.04	0.04	0.04
query28	3.51	0.91	0.55
query29	12.46	4.38	3.48
query30	0.27	0.16	0.15
query31	2.77	0.60	0.31
query32	3.22	0.61	0.50
query33	3.21	3.31	3.14
query34	15.52	4.20	3.51
query35	3.62	3.52	3.54
query36	0.56	0.43	0.46
query37	0.08	0.06	0.07
query38	0.04	0.04	0.03
query39	0.04	0.03	0.03
query40	0.17	0.16	0.15
query41	0.08	0.03	0.03
query42	0.03	0.03	0.03
query43	0.05	0.04	0.04
Total cold run time: 96.57 s
Total hot run time: 25.32 s

@hello-stephen

Contributor

BE UT Coverage Report

Increment line coverage 0.00% (0/16) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 54.41% (21358/39255)
Line Coverage 38.02% (204062/536785)
Region Coverage 34.02% (160137/470756)
Branch Coverage 35.04% (70151/200226)

@hello-stephen

Contributor

BE Regression && UT Coverage Report

Increment line coverage 100.00% (16/16) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 74.13% (28415/38329)
Line Coverage 58.07% (309863/533633)
Region Coverage 54.84% (259340/472891)
Branch Coverage 56.18% (112633/200477)

@Gabriel39 Gabriel39 force-pushed the fix-doris-26370-hive-text-empty-line branch from c26e6e6 to 448f1d2 Compare June 23, 2026 06:50
@Gabriel39

Contributor Author

run buildall

@hello-stephen

Contributor
TPC-H: Total hot run time: 29272 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 448f1d25cea81cff20d6afa8146721be5e85853a, data reload: false

------ Round 1 ----------------------------------
============================================
q1	17698	4060	4013	4013
q2	2044	309	198	198
q3	10271	1471	868	868
q4	4683	468	344	344
q5	7549	868	610	610
q6	189	181	138	138
q7	804	862	633	633
q8	9337	1560	1711	1560
q9	5635	4557	4543	4543
q10	6776	1822	1538	1538
q11	471	278	241	241
q12	628	417	291	291
q13	18100	3454	2742	2742
q14	268	261	245	245
q15	q16	788	782	722	722
q17	961	980	947	947
q18	6900	5800	5550	5550
q19	1296	1279	1120	1120
q20	515	422	266	266
q21	5927	2666	2403	2403
q22	438	361	300	300
Total cold run time: 101278 ms
Total hot run time: 29272 ms

----- Round 2, with runtime_filter_mode=off -----
============================================
q1	4339	4251	4259	4251
q2	315	341	223	223
q3	4580	4943	4434	4434
q4	2102	2160	1408	1408
q5	4435	4314	4332	4314
q6	236	180	132	132
q7	1726	1637	2071	1637
q8	2647	2194	2210	2194
q9	8332	8548	8101	8101
q10	4828	4772	4343	4343
q11	565	433	391	391
q12	776	762	579	579
q13	3200	3530	2928	2928
q14	309	310	273	273
q15	q16	720	734	650	650
q17	1386	1351	1318	1318
q18	8402	7501	7434	7434
q19	1200	1121	1108	1108
q20	2252	2262	1969	1969
q21	5335	4660	4427	4427
q22	520	457	409	409
Total cold run time: 58205 ms
Total hot run time: 52523 ms

@hello-stephen

Contributor
TPC-DS: Total hot run time: 174281 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 448f1d25cea81cff20d6afa8146721be5e85853a, data reload: false

query5	4298	645	483	483
query6	435	189	182	182
query7	4813	581	313	313
query8	361	215	199	199
query9	8761	4046	4047	4046
query10	450	312	254	254
query11	5745	2336	2159	2159
query12	151	98	95	95
query13	1242	607	417	417
query14	6294	5394	5026	5026
query14_1	4391	4399	4338	4338
query15	204	197	172	172
query16	983	485	433	433
query17	919	689	561	561
query18	2411	478	340	340
query19	194	176	141	141
query20	109	107	105	105
query21	219	141	121	121
query22	13719	13614	13486	13486
query23	17539	16544	16159	16159
query23_1	16262	16349	16363	16349
query24	7558	1813	1334	1334
query24_1	1325	1332	1306	1306
query25	574	477	406	406
query26	1320	327	182	182
query27	2649	584	347	347
query28	4547	2069	2042	2042
query29	1097	634	507	507
query30	311	244	202	202
query31	1127	1084	974	974
query32	103	63	60	60
query33	526	329	267	267
query34	1181	1122	688	688
query35	766	798	681	681
query36	1394	1384	1245	1245
query37	152	108	91	91
query38	1892	1727	1663	1663
query39	921	921	888	888
query39_1	902	885	891	885
query40	218	129	107	107
query41	71	67	67	67
query42	92	88	86	86
query43	321	348	290	290
query44	1456	780	786	780
query45	200	190	216	190
query46	1096	1190	732	732
query47	2396	2351	2207	2207
query48	399	436	304	304
query49	616	469	355	355
query50	971	356	261	261
query51	4456	4390	4309	4309
query52	80	81	69	69
query53	257	269	197	197
query54	255	225	191	191
query55	71	68	64	64
query56	230	221	202	202
query57	1439	1398	1323	1323
query58	247	215	208	208
query59	1564	1634	1473	1473
query60	279	244	232	232
query61	152	151	142	142
query62	696	641	593	593
query63	232	193	198	193
query64	2539	776	588	588
query65	5093	4781	4804	4781
query66	1795	466	344	344
query67	29804	29757	29634	29634
query68	3322	1600	919	919
query69	410	296	259	259
query70	1052	969	978	969
query71	292	251	213	213
query72	2965	2599	2340	2340
query73	819	783	433	433
query74	5127	4990	4751	4751
query75	2634	2590	2223	2223
query76	2334	1219	829	829
query77	352	378	312	312
query78	12469	12310	11901	11901
query79	1239	1149	749	749
query80	515	471	379	379
query81	450	272	245	245
query82	238	156	121	121
query83	276	278	253	253
query84	290	145	112	112
query85	819	524	422	422
query86	336	290	288	288
query87	1824	1831	1785	1785
query88	3693	2796	2741	2741
query89	413	382	341	341
query90	2153	181	174	174
query91	170	165	132	132
query92	60	60	53	53
query93	1437	1426	859	859
query94	539	373	306	306
query95	680	378	342	342
query96	1126	778	382	382
query97	2716	2702	2596	2596
query98	213	205	203	203
query99	1192	1143	1028	1028
Total cold run time: 257183 ms
Total hot run time: 174281 ms

@hello-stephen

Contributor
ClickBench: Total hot run time: 25.19 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 448f1d25cea81cff20d6afa8146721be5e85853a, data reload: false

query1	0.00	0.00	0.01
query2	0.10	0.06	0.05
query3	0.25	0.14	0.14
query4	1.61	0.14	0.13
query5	0.24	0.26	0.22
query6	1.25	1.09	1.02
query7	0.04	0.01	0.00
query8	0.06	0.04	0.04
query9	0.38	0.31	0.31
query10	0.55	0.54	0.54
query11	0.19	0.14	0.14
query12	0.18	0.15	0.14
query13	0.46	0.47	0.47
query14	1.02	1.00	1.01
query15	0.61	0.60	0.59
query16	0.32	0.33	0.33
query17	1.09	1.10	1.09
query18	0.22	0.22	0.21
query19	2.05	1.96	2.03
query20	0.02	0.01	0.02
query21	15.46	0.21	0.13
query22	4.94	0.05	0.05
query23	16.18	0.32	0.12
query24	2.97	0.42	0.31
query25	0.11	0.05	0.04
query26	0.75	0.22	0.14
query27	0.04	0.04	0.03
query28	3.54	0.92	0.52
query29	12.51	4.37	3.49
query30	0.29	0.15	0.16
query31	2.77	0.62	0.31
query32	3.23	0.60	0.48
query33	3.17	3.34	3.17
query34	15.60	4.23	3.52
query35	3.49	3.58	3.54
query36	0.57	0.44	0.45
query37	0.09	0.06	0.07
query38	0.05	0.04	0.03
query39	0.03	0.03	0.03
query40	0.18	0.16	0.15
query41	0.08	0.03	0.04
query42	0.04	0.03	0.03
query43	0.04	0.03	0.03
Total cold run time: 96.77 s
Total hot run time: 25.19 s

@Gabriel39 Gabriel39 force-pushed the fix-doris-26370-hive-text-empty-line branch from 448f1d2 to 090ad5d Compare June 23, 2026 10:20
@hello-stephen

Contributor

BE UT Coverage Report

Increment line coverage 0.00% (0/16) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 54.60% (21435/39258)
Line Coverage 38.06% (204376/537049)
Region Coverage 34.03% (160316/471038)
Branch Coverage 35.03% (70195/200372)

@hello-stephen

Contributor
ClickBench: Total hot run time: 25.15 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit e0a3dc35be2a7d244737fb9b72644071b19f2b73, data reload: false

query1	0.01	0.01	0.01
query2	0.11	0.05	0.05
query3	0.26	0.14	0.14
query4	1.60	0.14	0.14
query5	0.24	0.22	0.22
query6	1.25	1.07	1.03
query7	0.04	0.01	0.01
query8	0.07	0.04	0.03
query9	0.45	0.32	0.33
query10	0.56	0.58	0.53
query11	0.20	0.14	0.15
query12	0.19	0.15	0.15
query13	0.48	0.48	0.48
query14	1.01	1.01	0.99
query15	0.61	0.59	0.59
query16	0.31	0.33	0.32
query17	1.09	1.12	1.12
query18	0.23	0.21	0.20
query19	2.02	1.95	1.93
query20	0.02	0.01	0.01
query21	15.44	0.19	0.13
query22	4.93	0.05	0.06
query23	16.14	0.32	0.12
query24	3.08	0.38	0.32
query25	0.12	0.04	0.05
query26	0.74	0.21	0.16
query27	0.06	0.04	0.04
query28	3.52	0.88	0.55
query29	12.47	4.33	3.46
query30	0.27	0.14	0.17
query31	2.76	0.59	0.31
query32	3.22	0.58	0.48
query33	3.26	3.20	3.23
query34	15.51	4.19	3.46
query35	3.54	3.48	3.52
query36	0.58	0.43	0.42
query37	0.09	0.06	0.07
query38	0.06	0.04	0.04
query39	0.04	0.03	0.03
query40	0.18	0.16	0.15
query41	0.08	0.04	0.03
query42	0.04	0.03	0.03
query43	0.05	0.04	0.04
Total cold run time: 96.93 s
Total hot run time: 25.15 s

@hello-stephen

Contributor

BE Regression && UT Coverage Report

Increment line coverage 100.00% (20/20) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 74.15% (28431/38341)
Line Coverage 58.04% (309929/534028)
Region Coverage 54.77% (259191/473270)
Branch Coverage 56.11% (112604/200689)

@Gabriel39

Contributor Author

run buildall

@hello-stephen

Contributor
TPC-H: Total hot run time: 29321 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit f4c91951c34cb81ff2d0bd00addb4e01ae88e636, data reload: false

------ Round 1 ----------------------------------
============================================
q1	17706	4009	4012	4009
q2	2038	310	207	207
q3	10287	1433	826	826
q4	4677	472	340	340
q5	7549	854	573	573
q6	175	170	135	135
q7	803	857	643	643
q8	9314	1614	1665	1614
q9	5651	4540	4523	4523
q10	6737	1821	1536	1536
q11	430	282	244	244
q12	628	423	297	297
q13	18076	3361	2805	2805
q14	272	261	241	241
q15	q16	786	783	711	711
q17	1015	874	1063	874
q18	7047	5952	5741	5741
q19	1322	1196	1106	1106
q20	472	396	269	269
q21	5900	2680	2321	2321
q22	439	359	306	306
Total cold run time: 101324 ms
Total hot run time: 29321 ms

----- Round 2, with runtime_filter_mode=off -----
============================================
q1	4329	4252	4239	4239
q2	321	346	239	239
q3	4626	4983	4450	4450
q4	2109	2148	1380	1380
q5	4420	4326	4295	4295
q6	233	175	130	130
q7	1702	1626	2074	1626
q8	2601	2281	2209	2209
q9	8359	8570	8047	8047
q10	4815	4773	4320	4320
q11	586	406	379	379
q12	770	751	540	540
q13	3351	3563	2963	2963
q14	307	302	266	266
q15	q16	706	743	630	630
q17	1348	1337	1315	1315
q18	7874	7569	7234	7234
q19	1182	1143	1101	1101
q20	2232	2242	2000	2000
q21	5328	4667	4507	4507
q22	512	451	397	397
Total cold run time: 57711 ms
Total hot run time: 52267 ms

@hello-stephen

Contributor
TPC-DS: Total hot run time: 173213 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit f4c91951c34cb81ff2d0bd00addb4e01ae88e636, data reload: false

query5	4317	630	479	479
query6	425	200	169	169
query7	4830	543	305	305
query8	361	208	201	201
query9	8740	4035	4052	4035
query10	441	332	265	265
query11	5895	2487	2126	2126
query12	166	106	102	102
query13	1260	631	423	423
query14	6540	5431	5052	5052
query14_1	4406	4383	4371	4371
query15	202	199	173	173
query16	1011	475	447	447
query17	1142	746	584	584
query18	2709	507	352	352
query19	245	192	144	144
query20	113	117	107	107
query21	219	143	120	120
query22	13670	13612	13398	13398
query23	17444	16606	16178	16178
query23_1	16308	16129	16336	16129
query24	7548	1767	1349	1349
query24_1	1319	1337	1337	1337
query25	565	467	401	401
query26	1290	307	174	174
query27	2561	533	336	336
query28	4387	2039	2007	2007
query29	1063	649	495	495
query30	315	241	195	195
query31	1125	1085	959	959
query32	98	69	60	60
query33	531	343	299	299
query34	1163	1142	655	655
query35	734	783	669	669
query36	1389	1385	1258	1258
query37	150	104	89	89
query38	1865	1738	1655	1655
query39	925	911	894	894
query39_1	875	877	862	862
query40	210	123	99	99
query41	63	62	63	62
query42	88	85	87	85
query43	326	323	286	286
query44	1427	772	764	764
query45	191	188	177	177
query46	1066	1207	748	748
query47	2407	2417	2253	2253
query48	401	405	296	296
query49	618	471	362	362
query50	973	354	263	263
query51	4467	4378	4303	4303
query52	79	80	69	69
query53	237	258	186	186
query54	261	233	191	191
query55	72	68	65	65
query56	238	237	206	206
query57	1441	1421	1346	1346
query58	234	202	214	202
query59	1538	1606	1374	1374
query60	283	245	229	229
query61	148	143	150	143
query62	692	653	597	597
query63	233	187	187	187
query64	2478	758	605	605
query65	4877	4739	4783	4739
query66	1731	469	334	334
query67	29829	29756	29504	29504
query68	3159	1487	981	981
query69	407	292	287	287
query70	1077	989	968	968
query71	286	219	204	204
query72	2953	2633	2317	2317
query73	864	759	456	456
query74	5072	4954	4768	4768
query75	2618	2590	2210	2210
query76	2316	1214	785	785
query77	349	376	285	285
query78	12535	12543	11783	11783
query79	1351	1171	777	777
query80	1288	471	378	378
query81	511	281	241	241
query82	582	157	124	124
query83	324	276	262	262
query84	266	149	116	116
query85	915	535	420	420
query86	432	301	292	292
query87	1833	1835	1744	1744
query88	3705	2805	2769	2769
query89	430	384	327	327
query90	2020	183	180	180
query91	176	163	135	135
query92	64	65	57	57
query93	1485	1518	916	916
query94	696	334	317	317
query95	693	382	347	347
query96	1072	807	335	335
query97	2759	2730	2610	2610
query98	216	215	204	204
query99	1196	1166	1016	1016
Total cold run time: 258746 ms
Total hot run time: 173213 ms

@hello-stephen

Contributor
ClickBench: Total hot run time: 25.24 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit f4c91951c34cb81ff2d0bd00addb4e01ae88e636, data reload: false

query1	0.00	0.00	0.01
query2	0.10	0.05	0.05
query3	0.25	0.14	0.14
query4	1.61	0.14	0.14
query5	0.24	0.22	0.21
query6	1.25	1.09	1.08
query7	0.03	0.02	0.00
query8	0.06	0.04	0.04
query9	0.38	0.31	0.32
query10	0.58	0.54	0.54
query11	0.21	0.14	0.14
query12	0.18	0.14	0.14
query13	0.46	0.47	0.48
query14	1.03	1.04	1.00
query15	0.60	0.60	0.59
query16	0.34	0.32	0.32
query17	1.11	1.12	1.13
query18	0.23	0.21	0.21
query19	2.04	1.95	1.97
query20	0.02	0.01	0.02
query21	15.43	0.18	0.12
query22	4.99	0.05	0.06
query23	16.12	0.31	0.12
query24	2.98	0.42	0.29
query25	0.11	0.05	0.05
query26	0.73	0.20	0.15
query27	0.04	0.04	0.04
query28	3.60	0.90	0.53
query29	12.48	4.31	3.46
query30	0.28	0.15	0.14
query31	2.77	0.60	0.31
query32	3.22	0.60	0.50
query33	3.21	3.18	3.20
query34	15.83	4.21	3.51
query35	3.49	3.53	3.54
query36	0.55	0.45	0.42
query37	0.09	0.06	0.07
query38	0.05	0.04	0.04
query39	0.04	0.03	0.03
query40	0.18	0.17	0.16
query41	0.08	0.03	0.03
query42	0.04	0.03	0.02
query43	0.05	0.04	0.04
Total cold run time: 97.08 s
Total hot run time: 25.24 s

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: The Hive serde regression fixture added a multi-column text table to verify the default Hive null marker, but the table DDL did not specify a tab field delimiter while the prepared data file used tab-separated rows. Hive LazySimpleSerDe defaults to Ctrl-A as the field delimiter, so Doris did not split rows such as a\tb into two columns and the expected aggregation result mismatched. This change declares the table with ROW FORMAT DELIMITED and FIELDS TERMINATED BY '\t' while leaving serialization.null.format unset, so the case continues to validate the default \N null format.
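
The delimiter mismatch described above can be illustrated with a small Python sketch. This is not Doris or Hive code: `split_fields` is a hypothetical helper that splits one physical line on a delimiter, defaulting to Ctrl-A (`\x01`) as the summary says LazySimpleSerDe does.

```python
# Illustrative only: a hypothetical helper, not Doris or Hive code.
CTRL_A = "\x01"  # default LazySimpleSerDe field delimiter, per the summary above


def split_fields(line, delimiter=CTRL_A):
    """Split one physical text line into fields on the given delimiter."""
    return line.split(delimiter)


row = "a\tb"  # a tab-separated row like the ones in the fixture data file

# Without FIELDS TERMINATED BY '\t', the tab is ordinary data: one field.
fields_default = split_fields(row)

# With the tab delimiter declared, the row splits into two columns.
fields_tab = split_fields(row, "\t")

print(len(fields_default), len(fields_tab))
```

With the default delimiter the whole row lands in the first column, which is why the aggregation over the second column did not match the expected output.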

### Release note

None

### Check List (For Author)

- Test: Regression test
    - Ran ./run-regression-test.sh --run -d external_table_p0/hive -s test_hive_serde_prop; the suite succeeded but the Hive test body was skipped because enableHiveTest=false in the local config.
    - Ran git diff --check.
- Behavior changed: No
- Does this need documentation: No

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: apache#64671

Problem Summary: Refreshing the Hive regression module reruns all data/regression run.sh scripts. The credit-data fixture uploaded files before executing its Hive DDL, but the DDL drops and recreates the managed Hive table. On refresh, DROP TABLE can remove the table location after the files have just been uploaded, leaving crdmm_data empty and causing test_external_credit_data to return no rows. This change runs the DDL first, then recreates the target HDFS directory and uploads the credit-data files into the table location.
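
The ordering problem can be sketched with an in-memory stand-in for HDFS and the Hive metastore. Everything here is hypothetical (the `FakeWarehouse` class, the `/crdmm_data` location, the `part-0` file name); it only models the assumption stated above that dropping a managed table removes its location.

```python
# Illustrative only: an in-memory stand-in, not the real run.sh or HDFS.
class FakeWarehouse:
    def __init__(self):
        self.files = {}  # table location -> list of file names

    def upload(self, location, names):
        """Model uploading data files into a table location."""
        self.files.setdefault(location, []).extend(names)

    def drop_and_create_managed_table(self, location):
        """Model DROP TABLE + CREATE TABLE: the location ends up empty."""
        self.files[location] = []


def refresh_upload_first(wh, location, names):
    # Old order: files are uploaded, then the DDL wipes the location.
    wh.upload(location, names)
    wh.drop_and_create_managed_table(location)


def refresh_ddl_first(wh, location, names):
    # New order: the DDL runs first, then the files are uploaded.
    wh.drop_and_create_managed_table(location)
    wh.upload(location, names)


old = FakeWarehouse()
refresh_upload_first(old, "/crdmm_data", ["part-0"])

new = FakeWarehouse()
refresh_ddl_first(new, "/crdmm_data", ["part-0"])

print(old.files, new.files)
```

Running the DDL first makes the script safe to rerun: each refresh ends with the data files present in the table location.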

### Release note

None

### Check List (For Author)

- Test: Regression test
    - Ran bash -n docker/thirdparties/docker-compose/hive/scripts/data/regression/crdmm_data/run.sh.
    - Ran git diff --check.
    - Ran ./run-regression-test.sh --run -d external_table_p0/hive -s test_external_credit_data; the suite succeeded but the Hive test body was skipped because enableHiveTest=false in the local config.
- Behavior changed: No
- Does this need documentation: No

@Gabriel39

Contributor Author

run buildall

@hello-stephen

Contributor
TPC-H: Total hot run time: 29147 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit c8eed8fa65c99f85cb7319d0e8784e6ce80c0835, data reload: false

------ Round 1 ----------------------------------
============================================
q1	17678	4039	3945	3945
q2	2003	317	184	184
q3	10331	1387	832	832
q4	4683	466	345	345
q5	7504	865	573	573
q6	181	168	141	141
q7	763	832	629	629
q8	9336	1714	1536	1536
q9	5449	4496	4545	4496
q10	6650	1797	1535	1535
q11	439	275	240	240
q12	630	432	290	290
q13	18091	3421	2720	2720
q14	268	264	244	244
q15	q16	789	773	709	709
q17	986	1021	939	939
q18	6951	5773	5595	5595
q19	1310	1212	1103	1103
q20	502	396	261	261
q21	5857	2703	2524	2524
q22	439	367	306	306
Total cold run time: 100840 ms
Total hot run time: 29147 ms

----- Round 2, with runtime_filter_mode=off -----
============================================
q1	4290	4243	4218	4218
q2	311	338	220	220
q3	4670	4975	4428	4428
q4	2063	2145	1378	1378
q5	4412	4267	4283	4267
q6	235	178	129	129
q7	1702	1616	1565	1565
q8	2767	2230	2166	2166
q9	8265	8348	8026	8026
q10	4803	4750	4316	4316
q11	557	413	375	375
q12	763	758	554	554
q13	3262	3644	2959	2959
q14	286	304	262	262
q15	q16	726	725	661	661
q17	1326	1307	1339	1307
q18	7852	7315	7086	7086
q19	1159	1160	1176	1160
q20	2234	2264	1950	1950
q21	5254	4591	4484	4484
q22	512	463	404	404
Total cold run time: 57449 ms
Total hot run time: 51915 ms

@hello-stephen

Contributor
TPC-DS: Total hot run time: 172615 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit c8eed8fa65c99f85cb7319d0e8784e6ce80c0835, data reload: false

query5	4301	621	500	500
query6	427	188	169	169
query7	4877	549	294	294
query8	370	210	194	194
query9	8775	4031	4007	4007
query10	479	309	264	264
query11	5958	2368	2146	2146
query12	158	103	95	95
query13	1254	603	432	432
query14	6386	5384	5081	5081
query14_1	4380	4373	4335	4335
query15	210	196	175	175
query16	1014	451	349	349
query17	1099	679	546	546
query18	2698	456	340	340
query19	191	179	137	137
query20	109	105	106	105
query21	216	142	120	120
query22	13580	13526	13540	13526
query23	17302	16534	16066	16066
query23_1	16238	16265	16251	16251
query24	7398	1782	1305	1305
query24_1	1293	1279	1283	1279
query25	519	432	366	366
query26	1294	322	164	164
query27	2639	550	340	340
query28	4378	2033	2020	2020
query29	1087	626	500	500
query30	321	245	204	204
query31	1108	1111	949	949
query32	109	64	60	60
query33	521	326	256	256
query34	1157	1159	622	622
query35	762	801	674	674
query36	1380	1418	1244	1244
query37	155	107	93	93
query38	1902	1725	1648	1648
query39	947	925	903	903
query39_1	893	890	879	879
query40	216	125	104	104
query41	69	69	65	65
query42	88	87	88	87
query43	315	323	276	276
query44	1419	786	774	774
query45	199	184	178	178
query46	1057	1277	763	763
query47	2417	2373	2311	2311
query48	415	421	309	309
query49	638	486	363	363
query50	982	373	269	269
query51	4485	4341	4309	4309
query52	85	84	72	72
query53	259	265	194	194
query54	277	232	218	218
query55	78	74	68	68
query56	264	230	238	230
query57	1434	1417	1333	1333
query58	249	215	220	215
query59	1535	1585	1405	1405
query60	282	252	248	248
query61	175	175	172	172
query62	718	655	633	633
query63	230	196	195	195
query64	2492	762	593	593
query65	4849	4785	4751	4751
query66	1737	459	340	340
query67	29743	29681	29486	29486
query68	3156	1598	920	920
query69	419	307	261	261
query70	1107	979	953	953
query71	286	233	216	216
query72	3011	2597	2351	2351
query73	873	753	444	444
query74	5090	4952	4761	4761
query75	2616	2585	2234	2234
query76	2370	1171	782	782
query77	355	376	284	284
query78	12577	12604	11908	11908
query79	1426	1142	752	752
query80	638	475	387	387
query81	465	277	244	244
query82	588	155	121	121
query83	323	276	242	242
query84	262	145	116	116
query85	839	504	413	413
query86	362	308	279	279
query87	1831	1828	1763	1763
query88	3691	2804	2767	2767
query89	420	380	337	337
query90	1886	180	175	175
query91	172	158	131	131
query92	64	62	56	56
query93	1548	1487	925	925
query94	560	352	309	309
query95	673	487	335	335
query96	1087	808	340	340
query97	2701	2712	2581	2581
query98	218	204	203	203
query99	1174	1152	1051	1051
Total cold run time: 257811 ms
Total hot run time: 172615 ms

@hello-stephen

Contributor
ClickBench: Total hot run time: 25.35 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit c8eed8fa65c99f85cb7319d0e8784e6ce80c0835, data reload: false

query1	0.01	0.01	0.00
query2	0.10	0.04	0.05
query3	0.25	0.13	0.13
query4	1.61	0.14	0.14
query5	0.24	0.28	0.22
query6	1.30	1.11	1.05
query7	0.04	0.01	0.00
query8	0.06	0.04	0.04
query9	0.38	0.31	0.32
query10	0.55	0.56	0.56
query11	0.20	0.14	0.13
query12	0.19	0.14	0.15
query13	0.47	0.48	0.46
query14	1.02	1.01	1.01
query15	0.62	0.58	0.60
query16	0.30	0.34	0.33
query17	1.10	1.10	1.13
query18	0.22	0.22	0.22
query19	2.06	1.96	1.96
query20	0.01	0.02	0.01
query21	15.45	0.21	0.13
query22	4.93	0.05	0.05
query23	16.13	0.30	0.12
query24	2.96	0.41	0.32
query25	0.11	0.05	0.05
query26	0.75	0.21	0.16
query27	0.04	0.02	0.03
query28	3.44	0.88	0.55
query29	12.47	4.46	3.58
query30	0.28	0.16	0.15
query31	2.77	0.57	0.31
query32	3.22	0.59	0.49
query33	3.15	3.18	3.20
query34	15.63	4.25	3.54
query35	3.50	3.49	3.52
query36	0.56	0.42	0.42
query37	0.09	0.07	0.06
query38	0.04	0.04	0.03
query39	0.04	0.03	0.03
query40	0.18	0.16	0.15
query41	0.09	0.03	0.03
query42	0.04	0.03	0.03
query43	0.04	0.03	0.03
Total cold run time: 96.64 s
Total hot run time: 25.35 s

@hello-stephen

Contributor

BE UT Coverage Report

Increment line coverage 0.00% (0/18) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 54.92% (21573/39283)
Line Coverage 38.41% (206334/537239)
Region Coverage 34.49% (162502/471131)
Branch Coverage 35.48% (71115/200416)

@Gabriel39

Contributor Author

/review

@hello-stephen

Contributor

BE Regression && UT Coverage Report

Increment line coverage 100.00% (18/18) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 74.10% (28411/38343)
Line Coverage 57.97% (309590/534060)
Region Coverage 54.62% (258475/473242)
Branch Coverage 56.01% (112395/200665)

@github-actions github-actions Bot left a comment

Contributor

Reviewed the PR patch for Hive TEXTFILE empty-record preservation and the Hive regression fixture refresh changes. I did not find a substantiated blocking issue to raise inline.

Checkpoint conclusions:

  • Goal/test proof: The BE reader change preserves empty physical TEXTFILE records and the regression suite adds count/value checks for custom and default null formats across hive2/hive3 expected output sections.
  • Scope/focus: The code change is narrowly scoped to CsvReader/TextReader empty-line behavior plus Hive fixture setup.
  • Concurrency/lifecycle/config: No new concurrency, lifecycle ownership, persistent state, or dynamic config behavior is introduced.
  • Parallel paths: Normal materialization and COUNT pushdown paths were both checked for empty-line consistency.
  • Compatibility: CSV default behavior stays unchanged because _empty_line_as_record() defaults to false; the override is TextReader-only.
  • Testing: Added Hive regression coverage targets the changed behavior; local full execution was not rerun because this runner lacks initialized Hive/thirdparty test setup.
  • Style/basic checks: bash -n passed for modified shell scripts; added PR patch lines did not contain trailing whitespace.
  • User focus: No additional user-provided focus was present.
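
The hook pattern named in the compatibility checkpoint can be sketched as follows. This is a Python illustration, not the Doris BE code (which is C++): only the names `CsvReader`, `TextReader`, and `_empty_line_as_record()` come from the review; the `_keep`, `read_rows`, and `count_rows` methods and the class hierarchy are assumptions made for the example.

```python
# Illustrative Python sketch; the real readers are C++ classes in Doris BE.
class CsvReader:
    delimiter = ","

    def _empty_line_as_record(self):
        # Format-level hook: by default an empty line is not a record.
        return False

    def _keep(self, line):
        # Hypothetical helper shared by both read paths.
        return line != "" or self._empty_line_as_record()

    def read_rows(self, lines):
        # Normal materialization path: kept lines go through field splitting.
        return [line.split(self.delimiter) for line in lines if self._keep(line)]

    def count_rows(self, lines):
        # COUNT pushdown path: must agree with read_rows on empty lines.
        return sum(1 for line in lines if self._keep(line))


class TextReader(CsvReader):
    delimiter = "\x01"

    def _empty_line_as_record(self):
        # Hive TEXT overrides the hook: an empty physical line is a record.
        return True


lines = ["a", "", "b"]
csv_rows = CsvReader().read_rows(lines)
text_rows = TextReader().read_rows(lines)
print(len(csv_rows), len(text_rows))
```

Routing both paths through the same check is one way to keep materialized row counts and pushed-down counts consistent, which is the property the parallel-paths checkpoint verifies.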

Subagent conclusions: The optimizer-rewrite subagent reported NO_NEW_VALUABLE_FINDINGS. The tests-session-config subagent reported NO_NEW_VALUABLE_FINDINGS. No subagent candidate became an inline comment, no duplicates were merged, and convergence round 1 ended with all live subagents reporting NO_NEW_VALUABLE_FINDINGS for the current ledger/comment set.

@Gabriel39 Gabriel39 merged commit a1ac4db into apache:master Jun 25, 2026
32 of 33 checks passed
github-actions Bot pushed a commit that referenced this pull request Jun 25, 2026
### What problem does this PR solve?

Issue Number: close #xxx

Problem Summary:

When scanning Hive TEXTFILE tables, Doris previously skipped empty
physical lines unless `read_csv_empty_line_as_null` was enabled. This is
inconsistent with Hive TEXTFILE semantics: an empty physical line is
still a record. For a single-column text table it represents one empty
field, and for multi-column text tables missing trailing fields should
be filled using the table's null format.

This can cause Doris to return fewer rows than Hive for text files
containing empty lines, especially when the table uses `LazySimpleSerDe`
and custom or default `serialization.null.format`.

This PR fixes the behavior by adding a format-level hook for empty-line
handling:

- CSV keeps the existing default behavior and does not treat empty lines
as records.
- Hive TEXT overrides the hook and treats empty physical lines as
records.
- Empty Hive text lines are passed through normal field deserialization
so string/null handling stays consistent with `null_format`.

The PR also adds Hive regression coverage for:

- a single-column text table with custom `serialization.null.format`;
- a multi-column text table using the default Hive null marker `\N`;
- preservation of empty records and correct NULL/empty-string
classification.

In addition, the credit-data Hive fixture upload order is made
refresh-safe. The Hive regression module refresh may rerun all
`data/regression` setup scripts; `crdmm_data` now recreates the Hive
table before re-uploading its HDFS data so `DROP TABLE` cannot remove
freshly uploaded files.
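
The NULL versus empty-string classification described above can be sketched in Python. This is a simplified model, not the Doris deserializer: `deserialize_line` is a hypothetical helper, all columns are assumed to be strings, and a missing trailing field is assumed to read as NULL.

```python
# Illustrative only: a hypothetical helper, not Doris or Hive code.
DEFAULT_NULL_FORMAT = "\\N"  # the two characters backslash and N


def deserialize_line(line, num_columns, delimiter="\t",
                     null_format=DEFAULT_NULL_FORMAT):
    """Turn one physical line into a row of string-or-None values."""
    fields = line.split(delimiter)
    row = []
    for i in range(num_columns):
        if i >= len(fields):
            row.append(None)       # missing trailing field -> NULL (assumed)
        elif fields[i] == null_format:
            row.append(None)       # explicit null marker -> NULL
        else:
            row.append(fields[i])  # any other text, including ""
    return row


single_default = deserialize_line("", 1)                 # empty string field
single_custom = deserialize_line("", 1, null_format="")  # "" is the null marker
multi_default = deserialize_line("", 2)                  # trailing field missing
explicit_null = deserialize_line("a\t\\N", 2)            # default \N marker
print(single_default, single_custom, multi_default, explicit_null)
```

In every case the empty line still yields exactly one row; only the values inside the row depend on the null format.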

### Release note

Fix Hive TEXTFILE scans to preserve empty physical lines as records,
matching Hive behavior.

### Check List (For Author)

- Test: Regression test
    - Added/updated `external_table_p0/hive/test_hive_serde_prop`.
    - Ran `./run-regression-test.sh --run -d external_table_p0/hive -s test_hive_serde_prop`; local config had `enableHiveTest=false`, so the Hive test body was skipped.
    - Ran `./run-regression-test.sh --run -d external_table_p0/hive -s test_external_credit_data`; local config had `enableHiveTest=false`, so the Hive test body was skipped.
    - Ran `bash -n docker/thirdparties/docker-compose/hive/scripts/data/regression/crdmm_data/run.sh`.
    - Ran `git diff --check`.
- Behavior changed: Yes. Hive TEXTFILE scans now preserve empty physical lines as records instead of skipping them.
- Does this need documentation: No

yiguolei pushed a commit that referenced this pull request Jun 25, 2026
Cherry-picked from #64671

Co-authored-by: Gabriel <liwenqiang@selectdb.com>
Gabriel39 added a commit to Gabriel39/incubator-doris that referenced this pull request Jun 26, 2026
### What problem does this PR solve?

Issue Number: None

Related PR: apache#63781, apache#64671

Problem Summary: File scanner v2 did not carry the same fixes as the existing file scanner path. Predicate rows filtered inside v2 file readers were still reported through scanner load counters unless the scanner was a real load source, and Hive TEXTFILE empty physical lines were still skipped unless read_csv_empty_line_as_null was enabled. This change gates v2 load counter reporting with the same FILE_STREAM exception used by FileScanner and adds a delimited text hook so Hive Text v2 treats empty physical lines as records while CSV keeps the old default behavior.

### Release note

Fix file scanner v2 load counter reporting and Hive TEXTFILE empty-line handling.

### Check List (For Author)

- Test: Unit Test / Manual test
    - Added TextV2ReaderTest coverage for Hive TEXTFILE empty line records, single-column empty string fields, and COUNT pushdown.
    - Ran git diff --check.
    - Ran clang-format v16 through build-support/run_clang_format.py for changed files.
    - Attempted ./run-be-ut.sh --run --filter='TextV2ReaderTest.*:FileScannerV2Test.*', but the local run was blocked because the script needed to update/download datasketches-cpp and network access was unavailable; no BE UT binary was already built.
    - Attempted clang-tidy with the available compile_commands.json, but it pointed at a stale /mnt/disk3/gabriel path; the project clang-tidy wrapper also requires bash 4+ while only system bash is available.
- Behavior changed: Yes. File scanner v2 now matches v1 load counter gating and Hive TEXTFILE empty-line semantics.
- Does this need documentation: No

Gabriel39 added a commit that referenced this pull request Jun 26, 2026
### What problem does this PR solve?

Issue Number: None

Related PR: #63781, #64671

Problem Summary: File scanner v2 did not carry the same fixes as the
existing file scanner path. Predicate rows filtered inside v2 file
readers were still reported through scanner load counters unless the
scanner was a real load source, and Hive TEXTFILE empty physical lines
were still skipped unless read_csv_empty_line_as_null was enabled. This
change gates v2 load counter reporting with the same FILE_STREAM
exception used by FileScanner and adds a delimited text hook so Hive
Text v2 treats empty physical lines as records while CSV keeps the old
default behavior.

### Release note

Fix file scanner v2 load counter reporting and Hive TEXTFILE empty-line
handling.

### Check List (For Author)

- Test: Unit Test / Manual test
    - Added TextV2ReaderTest coverage for Hive TEXTFILE empty line records, single-column empty string fields, and COUNT pushdown.
    - Ran git diff --check.
    - Ran clang-format v16 through build-support/run_clang_format.py for changed files.
    - Attempted ./run-be-ut.sh --run --filter='TextV2ReaderTest.*:FileScannerV2Test.*', but the local run was blocked because the script needed to update/download datasketches-cpp and network access was unavailable; no BE UT binary was already built.
    - Attempted clang-tidy with the available compile_commands.json, but it pointed at a stale /mnt/disk3/gabriel path; the project clang-tidy wrapper also requires bash 4+ while only system bash is available.
- Behavior changed: Yes. File scanner v2 now matches v1 load counter gating and Hive TEXTFILE empty-line semantics.
- Does this need documentation: No
