Flink: handle rescale properly and refactor statistics#10457
Conversation
| private final StatisticsType type; | ||
| private final Map<SortKey, Long> keyFrequency; | ||
| private final SortKey[] rangeBounds; | ||
| private final SortKey[] keySamples; |
There was a problem hiding this comment.
this rename is needed so that AggregatedStatistics can be used to store both the complete samples and calculated range bounds.
There was a problem hiding this comment.
Would it be better to have a separate class for global stat, and aggregate stat? It's a bit late here, and it's a bit hard to follow when the keys keySamples are actually range bounds, and when they contain the full data...
Maybe tomorrow it will be easier to follow 😀
There was a problem hiding this comment.
I can definitely agree that it can be confusing. I also thought about it. We need to duplicate the serializer too. That was the main reason that I didn't go that route. I can make the change if you think it is better to separate them out.
There was a problem hiding this comment.
Let's talk about this offline. I don't see all the cons and pros ATM.
There was a problem hiding this comment.
Reading through the code again, I'm more-and-more convinced, that we have 2 different objects here:
- RangeBounds (former global statistics) - key-values of the weights used by the partitioner with hash
- Statistics (former completed statistics) - Sketch or Map without hash, but full of data
I think we are just confusing them because of historical reasons.
There was a problem hiding this comment.
Yes, conceptually we have two types of objects here. for Map statistics, there is no difference btw global statistics and completed statistics, as there is no further reduction in stats size. For sketch statistics, global statistics is a lot smaller with range bounds.
We can introduce two types CompleteStatistics and GlobalStatistics. We can also introduce a base type AggregatedStatistics. I am trying to avoid duplicate the AggregatedStatisticsSerializer as it can work for both types. Maybe generics can enable code reused and solve the most duplications.
There was a problem hiding this comment.
last commit separated out the two types
There was a problem hiding this comment.
I have removed the AggregatedStatistics base class. With the change to the GlobalStatistics on Map key assignment, the base class doesn't make much sense anymore. Now GlobalStatistics is used by partitioner and CompletedStatistics are raw aggregated stats.
| StatisticsUtil.deserializeAggregatedStatistics( | ||
| statisticsEvent.statisticsBytes(), aggregatedStatisticsSerializer); | ||
| checkStatisticsTypeMigration(); | ||
| output.collect(new StreamRecord<>(StatisticsOrRecord.fromStatistics(globalStatistics))); |
There was a problem hiding this comment.
it is a bug previously. we actually don't want to apply the new stats immediately during normal aggregation and propagation phase. switch happens at checkpoint boundary.
applyImmediately flag is added in this PR to distinguish the stats requested during rescale case. In this case, immediate application is desired.
There was a problem hiding this comment.
Why don't apply the new statistics immediately?
There was a problem hiding this comment.
Oh... i remember. Don't want to mess with the ongoing files
There was a problem hiding this comment.
yeah. because Iceberg sink flush and commit at checkpoint boundary, switching at checkpoint boundary allows all subtasks to switch to the new stats for the same checkpoint cycle. Otherwise, some records are shuffled based on old stats and some are shuffled based on new stats.
| // Asynchronously request the latest global statistics calculated with new downstream | ||
| // parallelism. It is possible events may have started flowing before coordinator responds | ||
| // with global statistics. In this case, range partitioner would just blindly shuffle | ||
| // records in round robin fashion. |
There was a problem hiding this comment.
Maybe we could be better off if we use the old ranges with some heuristics?
- If we have higher number of subtasks, just use the old ranges, and leave idle tasks
- If we have lower number of subtasks, then use modulo?
Or the communication is fast enough, that it doesn't worth the complexity?
There was a problem hiding this comment.
great question. I also thought about the fallback heuristic. Initially, I was thinking maybe start with simple and we can improve this part if it turns out to be a problem. communication should be relatively fast (maybe a few to 10 ms) if parallelism/fan-out is not very high.
If we have higher number of subtasks, just use the old ranges, and leave idle tasks
scale-up doesn't require any code change. it works like this already
If we have lower number of subtasks, then use modulo?
I agree modulo could be a sensible strategy
- pro: better clustering than round-robin
- con: uneven distribution. some subtasks may get double the loads than the other subtasks. but if the stats refresh is fast (like less than a dozen of ms). maybe this is not a concern.
Hence, I am in favor of implementing the fallback behavior for rescale
There was a problem hiding this comment.
BTW, we haven't added the SketchRangePartitioner yet. fallback handling would be implemented there. so it will be out of the scope for this PR.
There was a problem hiding this comment.
added the SketchRangePartitioner and RangePartitioner to this PR. so the scope is bigger now
| CHAR_KEYS.get("b"), | ||
| CHAR_KEYS.get("b"), | ||
| CHAR_KEYS.get("a"), | ||
| CHAR_KEYS.get("a"), |
There was a problem hiding this comment.
So the key samples are not ordered, just randomly selected samples from the incoming records?
There was a problem hiding this comment.
I have expected a bit more (based on my tests with sketches over long values), but having arbitrary keys probably doesn't allow better approximation
There was a problem hiding this comment.
Sketch returns the array that captures the reservoir samples. before array is full, samples were added to the array in order.
01ef84a to
42c0113
Compare
|
@stevenzwu: Unrelated question, but come up when I have reading the PR:
|
CDC/Upsert should use existing hash distribution in |
I think we can overlay the hash distribution above the ranges, and we could make it work, but I undestand your reluctance to try to grab too much in one go. |
Not sure if we want that, hash distribution (keyBy) is simple and low overhead. Range distribution requires statistics collection. |
If you partition your orders by time, and need to update the order if it was canceled, the your key/partition is not equally distributed, and hashing is probably not a good option. I would like to see the range partitioning as a precursor for writing ordered files with Flink. If we use similar constructs as the Flink SQL ORDER_BY then we can order the rows before writing them out. If we want to do this for CDC streams then we need to send the records with the same id to the same subtask. Again, not something immediate, but might worth revisiting later. |
I am happy to discuss it here. Agree that it is not sth we want to address in this PR. ORDER_BY is essentially the SortOrder defined in table properties. Note that currently Flink writer doesn't sort rows within a data file. Range partitioner only range split keys across files for better clustering. In the example of orders table partitioned by time (say hourly), the primary keys would be |
39235fe to
e6b3d9c
Compare
| Map<SortKey, Long> keyFrequency, | ||
| SortKey[] rangeBounds) { |
There was a problem hiding this comment.
Don't we just send the rangeBounds back as a GlobalStatistics?
There was a problem hiding this comment.
yes, we do. rangeBounds is also an array of SortKey.
Here is an example from SketchUtil
* To understand how range bounds are used in range partitioning, here is an example for human
* ages with 4 partitions: [15, 32, 60]. The 4 ranges would be
*
* <ul>
* <li>age <= 15
* <li>age > 15 && age <= 32
* <li>age >32 && age <= 60
* <li>age > 60
* </ul>
2bfd489 to
00fed5b
Compare
00fed5b to
163dbd7
Compare
| private Partitioner<RowData> delegatePartitioner(GlobalStatistics statistics) { | ||
| if (statistics.type() == StatisticsType.Map) { | ||
| return new MapRangePartitioner(schema, sortOrder, statistics.mapAssignment()); | ||
| } else if (statistics.type() == StatisticsType.Sketch) { | ||
| return new SketchRangePartitioner(schema, sortOrder, statistics.rangeBounds()); | ||
| } else { | ||
| throw new IllegalArgumentException( | ||
| String.format("Invalid statistics type: %s. Should be Map or Sketch", statistics.type())); | ||
| } | ||
| } |
There was a problem hiding this comment.
I still struggling a bit creating a good mental model for the GlobalStatistics distribution.
I feel that separating out the GlobalStatistics would be good if we could create a data structure which could be used by both MapRangePartitioner and SketchRangePartitioner, but sadly these partitioners require different input data.
Would it make sense to serialize/parametrize the partitioner on the JM side and send it in the records instead of the GlobalStatistics in the StatisticsOrRecord?
Like PartitionerOrRecord, and on the partitioner side we just call the partition and forget it.
Would this be easier to understand for others?
There was a problem hiding this comment.
I feel that separating out the GlobalStatistics would be good if we could create a data structure which could be used by both MapRangePartitioner and SketchRangePartitioner
GlobalStatistics is for that purpose. it contains either map assignment or range bounds.
Would it make sense to serialize/parametrize the partitioner on the JM side and send it in the records instead of the GlobalStatistics in the StatisticsOrRecord?
not sure if there is much benefit to serialize the delegate partitioner on the JM side and ship to the subtasks. that would require Java serialization and make checkpoint state more complex for schema evolution.
There was a problem hiding this comment.
My issue with the GlobalStatistics is that it is not really global. The main payload (mapAssignments and rangeBounds) are different for the different Partitioners. So basically we pretend that we send a statistics, but in reality the data already describes the partitioner. Maybe it would be cleaner to accept this, and mirror this change in the code/class names etc.
Instead of writing a serializer for the GlobalStatistics, we can write a serializer for the respective partitioners. I don't see this as a blocker.
I don't have a very strong opinion on this, and I wanted you to understand why I feel that the code is a bit awkward in this case. What is usually a yellow flag for me, is to have an object where some fields are mutually exclusive. I try to examine those again to understand if we have a single object, or we just merged 2 objects to a single one.
There was a problem hiding this comment.
GlobalStatistics is global in the sense it is aggregated statistics by coordinator. We have a facade/proxy partitioner that would delegate the partition decision to underline partitioner based on the statistics type.
There was a problem hiding this comment.
After our offline discussion I understand your points better. I understand that the concept is
- That we collect the statistics
- Then use the statistics to create a partitioner
and we try to stick to the concept.
I still think it is confusing, that currently we collect 2 types of statistics and each of them are tightly coupled to the 2 types of partitioners we have. (Type.Map is always used by MapRangePartitioner, and Type.Sketch is always used by SketchRangePartitioner). We could create the MapRangePartitioner and the SketchRangePartitioner on the coordinator side, and send them to a PartitionerExecutor which just deserializes them and runs them. So the real logic would be in a single place (DataStatisticsCoordinator).
That said, the improvement is just coding style preference, as the Partitioners still need to serialize the underlying statistics, so the performance would be the same.
So we can move forward with your proposed solution.
Thanks for the discussion!
163dbd7 to
7051b49
Compare
pvary
left a comment
There was a problem hiding this comment.
Hi Steven,
I think we discussed every comment.
Could we run the tests one more time before merging? It was a long time ago when they were running, and it might be good to double check before merging.
let me rebase with the latest main branch |
…atistics to operators
7051b49 to
7446959
Compare
|
thanks @pvary for the review |
…efactoring in smart shuffling
(cherry picked from commit 604b2bb)
(cherry picked from commit 4dbc7f5)
(cherry picked from commit 604b2bb)
(cherry picked from commit 4dbc7f5)
close issue #10441