[flink] Copy bytes with multiple threads when performing precommit compact for changelogs #4907
Merged
wwj6591812
reviewed
Jan 14, 2025
.intType()
.noDefaultValue()
.withDescription(
        "Maximum number of threads to copy bytes from small changelog files. "
int numThreads =
        options.getOptional(FlinkConnectorOptions.CHANGELOG_PRECOMMIT_COMPACT_THREAD_NUM)
                .orElse(Runtime.getRuntime().availableProcessors());
LOG.info("Creating thread pool of size {} for changelog compaction.", numThreads);
private void readFully() {
    try {
        result = IOUtils.readFully(table.fileIO().newInputStream(path), true);
        table.fileIO().deleteQuietly(path);
Contributor
There was a problem hiding this comment.
If the job fails over after table.fileIO().deleteQuietly(path); but before all files have been copied into the new big file, is there a risk of file loss here?
Contributor
Author
There was a problem hiding this comment.
If the job fails, no changelog will be committed, so there is no risk of file loss.
Contributor
+1
JingsongLi
reviewed
Jan 15, 2025
ThreadPoolUtils.randomlyExecuteSequentialReturn(
        executor,
        t -> {
            // Total lengths of all bytes will not exceed `targetFileSize * 2`,
Contributor
There was a problem hiding this comment.
I feel it is better to use workers, a queue, and a consumer. Even with a maximum of targetFileSize * 2, if the target file size is 1GB, this is still too large.
Workers and a queue are safer.
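For illustration, a minimal sketch of the workers-and-queue idea suggested above. It is not the code from this PR: `BoundedQueueCopySketch`, `Chunk`, and `copy` are hypothetical names, and in-memory byte arrays stand in for the small changelog files. The point is that a bounded queue caps the buffered data at roughly `queueCapacity + numWorkers` chunks, regardless of how many files there are in total.

```java
import java.io.ByteArrayOutputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class BoundedQueueCopySketch {

    /** Bytes read by a worker, tagged with a hypothetical source name. */
    static final class Chunk {
        final String name;
        final byte[] bytes;

        Chunk(String name, byte[] bytes) {
            this.name = name;
            this.bytes = bytes;
        }
    }

    /**
     * Workers put chunks into a bounded queue; the calling thread is the single consumer.
     * Returns the offset at which each source landed in {@code out}.
     */
    static Map<String, Integer> copy(
            Map<String, byte[]> sources,
            int numWorkers,
            int queueCapacity,
            ByteArrayOutputStream out)
            throws InterruptedException {
        BlockingQueue<Chunk> queue = new ArrayBlockingQueue<>(queueCapacity);
        ExecutorService workers = Executors.newFixedThreadPool(numWorkers);
        try {
            for (Map.Entry<String, byte[]> source : sources.entrySet()) {
                workers.execute(() -> {
                    try {
                        // Stand-in for reading one small file.
                        // put() blocks while the queue is full, which bounds memory use.
                        queue.put(new Chunk(source.getKey(), source.getValue().clone()));
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            // Chunks arrive in completion order, so record where each one was written.
            Map<String, Integer> offsets = new LinkedHashMap<>();
            for (int i = 0; i < sources.size(); i++) {
                Chunk chunk = queue.take();
                offsets.put(chunk.name, out.size());
                out.write(chunk.bytes, 0, chunk.bytes.length);
            }
            return offsets;
        } finally {
            workers.shutdownNow();
        }
    }
}
```

Because chunks are written in completion order rather than input order, the sketch returns the offset of each source so that a reader could still locate it in the merged output. Error handling for failed workers is left out for brevity.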
…mpact for changelogs
Contributor
+1
Purpose
In #4380 we introduced pre-commit compact for changelog files. Multiple changelog files from the same partition are merged into one big file within a single parallel instance of the worker operator, which decreases the number of small files.
However, when the number of changelog files to merge is large (while each file itself is small enough), the copying process will be slow, because opening so many files from the filesystem takes a lot of time.
In this PR, we add a thread pool to the worker operator, so that when performing pre-commit compact for changelogs, we can copy the bytes with multiple threads, thus speeding up the process.
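As a rough illustration of the idea (not the actual Paimon implementation), the sketch below reads many small files concurrently on a fixed-size thread pool and then concatenates their bytes in input order. `ParallelCopySketch` and `mergeFiles` are hypothetical names, and plain `java.nio.file` calls stand in for the table's file IO.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelCopySketch {

    /** Reads every file on a pool thread, then concatenates the bytes in input order. */
    static byte[] mergeFiles(List<Path> files, int numThreads)
            throws IOException, InterruptedException {
        ExecutorService executor = Executors.newFixedThreadPool(numThreads);
        try {
            // Submit all reads up front so that opening the files overlaps.
            List<Future<byte[]>> pending = new ArrayList<>();
            for (Path file : files) {
                Callable<byte[]> task = () -> Files.readAllBytes(file);
                pending.add(executor.submit(task));
            }
            // Consume the results sequentially, in the order the files were given.
            ByteArrayOutputStream merged = new ByteArrayOutputStream();
            for (Future<byte[]> future : pending) {
                try {
                    merged.write(future.get());
                } catch (ExecutionException e) {
                    throw new IOException("Failed to read a small file", e.getCause());
                }
            }
            return merged.toByteArray();
        } finally {
            executor.shutdownNow();
        }
    }
}
```

This simple form keeps every pending result in memory until it is written, which is acceptable only when the total size of the inputs is bounded.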
Tests
Existing IT cases should cover this change. This PR also adds a unit test for the coordinator operator.
API and Format
No format changes.
Documentation
The documentation is also updated.