[WIP] backend/external: improve sort speed for external writer#65137
[WIP] backend/external: improve sort speed for external writer#65137joechenrh wants to merge 5 commits intopingcap:masterfrom
Conversation
|
Hi @joechenrh. Thanks for your PR. PRs from untrusted users cannot be marked as trusted with I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
There was a problem hiding this comment.
Pull request overview
This PR refactors the byteSet implementation in the CSV parser from a value type using bitsets to a pointer-based type using a lookup table. The changes convert the byteSet from an array type to a struct containing both a bit array and a byte lookup table.
Key Changes:
- Refactored
byteSetfrom array type to struct withbitsandtablefields - Changed
makeByteSetto return a pointer and initialize the lookup table - Updated CSV parser to use pointer types for byte sets and removed reference operators in function calls
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| pkg/lightning/mydump/bytes.go | Refactored byteSet from array to struct with lookup table, modified makeByteSet to return pointer, removed contains method, updated IndexAnyByte to use table lookup |
| pkg/lightning/mydump/csv_parser.go | Changed byteSet field types from value to pointer, updated readUntil calls to pass pointers directly instead of taking references |
pkg/lightning/mydump/bytes.go
Outdated
| c := chars[i] | ||
| as[c>>5] |= 1 << uint(c&31) | ||
| as.bits[c>>5] |= 1 << uint(c&31) | ||
| as.table[i] = 1 |
There was a problem hiding this comment.
The table index should be c (the byte value), not i (the loop index). Currently this sets table[0], table[1], etc., instead of setting the correct positions for the actual byte values in the input. This will cause incorrect behavior when checking if a byte is in the set.
The line should be: as.table[c] = 1
| as.table[i] = 1 | |
| as.table[c] = 1 |
pkg/lightning/mydump/bytes.go
Outdated
| // given byte value in the set. | ||
| type byteSet [8]uint32 | ||
| type byteSet struct { | ||
| bits [8]uint32 |
There was a problem hiding this comment.
The bits field is no longer used after the refactoring to use the table lookup approach. This field should either be removed or the bit-based lookup should be used instead of the table. Keeping unused fields increases memory overhead (32 bytes per byteSet instance) and may confuse future maintainers about the intended implementation.
| // byteSet is a 32-byte value, where each bit represents the presence of a | ||
| // given byte value in the set. |
There was a problem hiding this comment.
The documentation comment is now inaccurate. The byteSet is no longer a "32-byte value" - it's now a struct containing a 32-byte bits array and a 256-byte table array (288 bytes total). The comment should be updated to reflect the actual structure, or explain the purpose of the lookup table approach if that's the intended design.
| // byteSet is a 32-byte value, where each bit represents the presence of a | |
| // given byte value in the set. | |
| // byteSet represents a set of byte values. It uses a 32-byte bitset (bits) | |
| // where each bit indicates the presence of a given byte value, and a 256-byte | |
| // lookup table (table) for O(1) membership checks by byte value. |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #65137 +/- ##
================================================
+ Coverage 77.8425% 78.9148% +1.0722%
================================================
Files 1972 1915 -57
Lines 541074 539550 -1524
================================================
+ Hits 421186 425785 +4599
+ Misses 118229 111551 -6678
- Partials 1659 2214 +555
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
522d5d9 to
2e209e6
Compare
5933e2c to
ddcfb68
Compare
662e05b to
cf1b0f9
Compare
cabf6f0 to
8fd74e6
Compare
19c1991 to
846407c
Compare
Signed-off-by: Ruihao Chen <joechenrh@gmail.com>
Signed-off-by: Ruihao Chen <joechenrh@gmail.com>
Signed-off-by: Ruihao Chen <joechenrh@gmail.com>
|
[FORMAT CHECKER NOTIFICATION] Notice: To remove the 📖 For more info, you can check the "Contribute Code" section in the development guide. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
@joechenrh: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
What problem does this PR solve?
Issue Number: close #xxx
This PR is used to do some tests.
This PR enlarge the size ofSliceLocationby 33%(12 bytes -> 16 bytes), which may increase memory usage of each subtask by 77MiB at most (for CPU/Mem = 1C/2GiB). Considering that memory usage of read step is an issue, this PR is just used to illustrate the cost of getting byte slice during sort.Problem Summary:
What changed and how does it work?
Encapsulate 4 bytes key hint in
SliceLocationto speed up sorting without enlarging the memory usage.Check List
Tests
Below are the manual test of encode(read) speed
Add Index
We also test some worst case for this PR, to verify that there are no degradation.
Import Into with index
Side effects
Documentation
Release note
Please refer to Release Notes Language Style Guide to write a quality release note.