[WIP] backend/external: improve sort speed for external writer#65137

Open
joechenrh wants to merge 5 commits into pingcap:master from joechenrh:csv-test

@joechenrh joechenrh commented Dec 19, 2025

What problem does this PR solve?

Issue Number: close #xxx

This PR is used for testing.

This PR enlarges SliceLocation by 33% (12 bytes -> 16 bytes), which may increase the memory usage of each subtask by at most 77 MiB (for CPU/Mem = 1C/2GiB). Since memory usage in the read step is already an issue, this PR only serves to illustrate the cost of getting the byte slice during sort.

Problem Summary:

What changed and how does it work?

Encapsulate a 4-byte key hint in SliceLocation to speed up sorting without enlarging memory usage.
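
A minimal sketch of the key-hint idea, assuming a location record made of three int32 fields plus a cached 4-byte key prefix. The type and field names (`plainLoc`, `hintedLoc`, `bufIdx`, `offset`, `length`, `hint`) and the helpers are hypothetical and are not taken from this PR's code. It illustrates the 16-byte layout and how the comparator only touches the underlying bytes when two hints tie:

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"sort"
	"unsafe"
)

// plainLoc is a hypothetical 12-byte location: three int32 fields.
type plainLoc struct {
	bufIdx, offset, length int32
}

// hintedLoc is a hypothetical 16-byte location that also caches the
// first four key bytes as a big-endian integer.
type hintedLoc struct {
	bufIdx, offset, length int32
	hint                   uint32
}

// keyHint packs up to the first 4 bytes of key into a big-endian uint32,
// zero-padding short keys, so integer order agrees with byte order
// whenever two hints differ.
func keyHint(key []byte) uint32 {
	var buf [4]byte
	copy(buf[:], key)
	return binary.BigEndian.Uint32(buf[:])
}

// keyAt resolves a location to its key bytes.
func keyAt(l hintedLoc, bufs [][]byte) []byte {
	return bufs[l.bufIdx][l.offset : l.offset+l.length]
}

// buildLocs copies all keys into a single buffer and records their locations.
func buildLocs(keys [][]byte) ([]hintedLoc, [][]byte) {
	var buf []byte
	locs := make([]hintedLoc, 0, len(keys))
	for _, k := range keys {
		locs = append(locs, hintedLoc{
			offset: int32(len(buf)),
			length: int32(len(k)),
			hint:   keyHint(k),
		})
		buf = append(buf, k...)
	}
	return locs, [][]byte{buf}
}

// sortByHint orders locations by key. The full byte comparison runs only
// when the two hints are equal.
func sortByHint(locs []hintedLoc, bufs [][]byte) {
	sort.Slice(locs, func(i, j int) bool {
		if locs[i].hint != locs[j].hint {
			return locs[i].hint < locs[j].hint
		}
		return bytes.Compare(keyAt(locs[i], bufs), keyAt(locs[j], bufs)) < 0
	})
}

// locSizes reports the in-memory size of both layouts.
func locSizes() (plain, hinted uintptr) {
	return unsafe.Sizeof(plainLoc{}), unsafe.Sizeof(hintedLoc{})
}

func main() {
	keys := [][]byte{[]byte("banana"), []byte("ab"), []byte("apple")}
	locs, bufs := buildLocs(keys)
	sortByHint(locs, bufs)
	for _, l := range locs {
		fmt.Printf("%s\n", keyAt(l, bufs))
	}
	p, h := locSizes()
	fmt.Println("plain:", p, "bytes; hinted:", h, "bytes")
}
```

Keys that share the same first four bytes still need the fallback comparison, so the benefit depends on how often prefixes collide.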

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No need to test
    • I checked and no code files have been changed.

Below are the manual test results for encode (read) speed.

Add Index

  • row count:
  • 8B random string: 00:12:15 -> 00:08:29
  • 120B random string: 00:13:15 -> 00:08:41

We also tested some worst cases for this PR to verify that there is no degradation.

Import Into with index

  • Table size: 500GiB
  • 01:00:54 -> 00:55:20

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

@ti-chi-bot ti-chi-bot bot added do-not-merge/needs-linked-issue do-not-merge/needs-tests-checked release-note-none Denotes a PR that doesn't merit a release note. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Dec 19, 2025

tiprow bot commented Dec 19, 2025

Hi @joechenrh. Thanks for your PR.

PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.

I understand the commands that are listed here.

Copilot AI left a comment

Pull request overview

This PR refactors the byteSet implementation in the CSV parser from a value type using bitsets to a pointer-based type using a lookup table. The changes convert the byteSet from an array type to a struct containing both a bit array and a byte lookup table.

Key Changes:

  • Refactored byteSet from array type to struct with bits and table fields
  • Changed makeByteSet to return a pointer and initialize the lookup table
  • Updated CSV parser to use pointer types for byte sets and removed reference operators in function calls

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File | Description
pkg/lightning/mydump/bytes.go | Refactored byteSet from array to struct with lookup table, modified makeByteSet to return pointer, removed contains method, updated IndexAnyByte to use table lookup
pkg/lightning/mydump/csv_parser.go | Changed byteSet field types from value to pointer, updated readUntil calls to pass pointers directly instead of taking references
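
To make the two membership strategies in this overview concrete, here is a rough sketch that puts a 256-bit bitset next to a 256-entry lookup table. The names (`bitSet`, `tableSet`, `indexAnyBit`, `indexAnyTable`) are illustrative assumptions and are not the identifiers used in `pkg/lightning/mydump`:

```go
package main

import "fmt"

// bitSet holds one bit per byte value: 8 words x 32 bits = 256 bits.
type bitSet [8]uint32

func makeBitSet(chars []byte) bitSet {
	var s bitSet
	for _, c := range chars {
		s[c>>5] |= 1 << uint(c&31)
	}
	return s
}

// contains needs a shift and a mask for every lookup.
func (s *bitSet) contains(c byte) bool {
	return s[c>>5]&(1<<uint(c&31)) != 0
}

// tableSet holds one byte per byte value, so a lookup is a single index.
type tableSet struct {
	table [256]byte
}

func makeTableSet(chars []byte) *tableSet {
	s := &tableSet{}
	for _, c := range chars {
		s.table[c] = 1
	}
	return s
}

// indexAnyBit returns the index of the first byte of s that is in set, or -1.
func indexAnyBit(s []byte, set *bitSet) int {
	for i, c := range s {
		if set.contains(c) {
			return i
		}
	}
	return -1
}

// indexAnyTable is the same scan using the table lookup.
func indexAnyTable(s []byte, set *tableSet) int {
	for i, c := range s {
		if set.table[c] != 0 {
			return i
		}
	}
	return -1
}

func main() {
	delims := []byte{',', '"', '\n'}
	bs := makeBitSet(delims)
	ts := makeTableSet(delims)
	line := []byte(`abc,"def"`)
	fmt.Println(indexAnyBit(line, &bs), indexAnyTable(line, ts))
}
```

Both scans return the same index. The table trades 224 extra bytes per set for a lookup without bit arithmetic, and whether that is faster in practice would need a benchmark.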

  c := chars[i]
- as[c>>5] |= 1 << uint(c&31)
+ as.bits[c>>5] |= 1 << uint(c&31)
+ as.table[i] = 1

Copilot AI Dec 19, 2025

The table index should be c (the byte value), not i (the loop index). Currently this sets table[0], table[1], etc., instead of setting the correct positions for the actual byte values in the input. This will cause incorrect behavior when checking if a byte is in the set.

The line should be: as.table[c] = 1

Suggested change
- as.table[i] = 1
+ as.table[c] = 1
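
The effect of the index mix-up can be reproduced with a small stand-alone sketch. The `byteSet` here is a simplified, table-only stand-in, and `makeByteSetBuggy`, `makeByteSetFixed`, and `has` are hypothetical helpers written for this illustration:

```go
package main

import "fmt"

// byteSet is a simplified stand-in with only the lookup table.
type byteSet struct {
	table [256]byte
}

// makeByteSetBuggy indexes the table by loop position, as in the flagged line.
func makeByteSetBuggy(chars []byte) *byteSet {
	s := &byteSet{}
	for i := range chars {
		s.table[i] = 1
	}
	return s
}

// makeByteSetFixed indexes the table by the byte value itself.
func makeByteSetFixed(chars []byte) *byteSet {
	s := &byteSet{}
	for _, c := range chars {
		s.table[c] = 1
	}
	return s
}

func (s *byteSet) has(c byte) bool { return s.table[c] != 0 }

func main() {
	delims := []byte{',', '"'}
	buggy := makeByteSetBuggy(delims)
	fixed := makeByteSetFixed(delims)
	fmt.Println("buggy has ',':", buggy.has(','), "has 0x00:", buggy.has(0))
	fmt.Println("fixed has ',':", fixed.has(','), "has 0x00:", fixed.has(0))
}
```

With the buggy constructor the set contains bytes 0x00 and 0x01 instead of the two delimiters, so a parser relying on it would miss every separator.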

  // given byte value in the set.
- type byteSet [8]uint32
+ type byteSet struct {
+     bits [8]uint32

Copilot AI Dec 19, 2025

The bits field is no longer used after the refactoring to use the table lookup approach. This field should either be removed or the bit-based lookup should be used instead of the table. Keeping unused fields increases memory overhead (32 bytes per byteSet instance) and may confuse future maintainers about the intended implementation.

Comment on lines 12 to 13
// byteSet is a 32-byte value, where each bit represents the presence of a
// given byte value in the set.

Copilot AI Dec 19, 2025

The documentation comment is now inaccurate. The byteSet is no longer a "32-byte value" - it's now a struct containing a 32-byte bits array and a 256-byte table array (288 bytes total). The comment should be updated to reflect the actual structure, or explain the purpose of the lookup table approach if that's the intended design.

Suggested change
- // byteSet is a 32-byte value, where each bit represents the presence of a
- // given byte value in the set.
+ // byteSet represents a set of byte values. It uses a 32-byte bitset (bits)
+ // where each bit indicates the presence of a given byte value, and a 256-byte
+ // lookup table (table) for O(1) membership checks by byte value.
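
The size figures quoted in the review comments (32 bytes for the bitset, 288 bytes for the combined struct) can be checked with `unsafe.Sizeof`. The three type names below are hypothetical and only mirror the layouts under discussion:

```go
package main

import (
	"fmt"
	"unsafe"
)

// bitsOnly mirrors the original array-based layout.
type bitsOnly [8]uint32

// bitsAndTable mirrors the refactored layout that keeps both fields.
type bitsAndTable struct {
	bits  [8]uint32
	table [256]byte
}

// tableOnly is what remains if the unused bits field is dropped.
type tableOnly struct {
	table [256]byte
}

// sizes returns the in-memory size of each layout in bytes.
func sizes() (bitsOnlySize, bothSize, tableOnlySize uintptr) {
	return unsafe.Sizeof(bitsOnly{}), unsafe.Sizeof(bitsAndTable{}), unsafe.Sizeof(tableOnly{})
}

func main() {
	a, b, c := sizes()
	fmt.Println("bits only:", a, "bits+table:", b, "table only:", c)
}
```

Dropping the bitset saves 32 bytes per set, which is small next to the 256-byte table but also removes a field that no lookup reads.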

codecov bot commented Dec 19, 2025

Codecov Report

❌ Patch coverage is 1.51515% with 65 lines in your changes missing coverage. Please review.
✅ Project coverage is 78.9148%. Comparing base (c6b0d30) to head (a3c066e).
⚠️ Report is 39 commits behind head on master.

Additional details and impacted files
@@               Coverage Diff                @@
##             master     #65137        +/-   ##
================================================
+ Coverage   77.8425%   78.9148%   +1.0722%     
================================================
  Files          1972       1915        -57     
  Lines        541074     539550      -1524     
================================================
+ Hits         421186     425785      +4599     
+ Misses       118229     111551      -6678     
- Partials       1659       2214       +555     
Flag Coverage Δ
integration 47.6098% <1.5151%> (-0.5867%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
dumpling 56.7974% <ø> (ø)
parser ∅ <ø> (∅)
br 66.1568% <ø> (+5.0093%) ⬆️

@ti-chi-bot ti-chi-bot bot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Dec 19, 2025
@joechenrh joechenrh force-pushed the csv-test branch 3 times, most recently from 522d5d9 to 2e209e6 Compare December 26, 2025 12:08
@ti-chi-bot ti-chi-bot bot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Dec 26, 2025
@ti-chi-bot ti-chi-bot bot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Dec 26, 2025
@ti-chi-bot ti-chi-bot bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Dec 27, 2025
@ti-chi-bot ti-chi-bot bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Dec 27, 2025
@ti-chi-bot ti-chi-bot bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jan 7, 2026
@joechenrh joechenrh force-pushed the csv-test branch 2 times, most recently from 5933e2c to ddcfb68 Compare January 11, 2026 12:46
@ti-chi-bot ti-chi-bot bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jan 11, 2026
@joechenrh joechenrh force-pushed the csv-test branch 2 times, most recently from 662e05b to cf1b0f9 Compare January 12, 2026 07:19
@ti-chi-bot ti-chi-bot bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jan 12, 2026
@joechenrh joechenrh force-pushed the csv-test branch 3 times, most recently from cabf6f0 to 8fd74e6 Compare January 14, 2026 10:26
@ti-chi-bot ti-chi-bot bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jan 14, 2026
Signed-off-by: Ruihao Chen <joechenrh@gmail.com>
@joechenrh joechenrh force-pushed the csv-test branch 2 times, most recently from 19c1991 to 846407c Compare January 15, 2026 08:50
Signed-off-by: Ruihao Chen <joechenrh@gmail.com>
Signed-off-by: Ruihao Chen <joechenrh@gmail.com>
@joechenrh joechenrh changed the title [DNM] *: test [WIP] backend/external: improve sort speed for external writer Jan 23, 2026
@ti-chi-bot ti-chi-bot bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 23, 2026
Signed-off-by: Ruihao Chen <joechenrh@gmail.com>
Signed-off-by: Ruihao Chen <joechenrh@gmail.com>

ti-chi-bot bot commented Jan 23, 2026

[FORMAT CHECKER NOTIFICATION]

Notice: To remove the do-not-merge/needs-linked-issue label, please provide the linked issue number on one line in the PR body, for example: Issue Number: close #123 or Issue Number: ref #456.

📖 For more info, you can check the "Contribute Code" section in the development guide.

ti-chi-bot bot commented Jan 23, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign gmhdbjd for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ti-chi-bot bot commented Jan 23, 2026

@joechenrh: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-unit-test-next-gen a3c066e link true /test pull-unit-test-next-gen
idc-jenkins-ci-tidb/unit-test a3c066e link true /test unit-test
pull-unit-test-ddlv1 a3c066e link true /test pull-unit-test-ddlv1

Labels

do-not-merge/needs-linked-issue do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
