Flink: handle rescale properly and refactor statistics by stevenzwu · Pull Request #10457 · apache/iceberg

stevenzwu · 2024-06-06T18:18:47Z

Closes issue #10441.

stevenzwu · 2024-06-06T18:19:40Z

  private final StatisticsType type;
  private final Map<SortKey, Long> keyFrequency;
-  private final SortKey[] rangeBounds;
+  private final SortKey[] keySamples;


This rename is needed so that AggregatedStatistics can store both the complete samples and the calculated range bounds.

Would it be better to have separate classes for global statistics and aggregated statistics? It's a bit late here, and it's hard to follow when keySamples are actually range bounds and when they contain the full data...
Maybe tomorrow it will be easier to follow 😀

I definitely agree that it can be confusing. I also thought about it. We would need to duplicate the serializer too, which was the main reason I didn't go that route. I can make the change if you think it is better to separate them out.

Let's talk about this offline. I don't see all the pros and cons at the moment.

Reading through the code again, I'm more and more convinced that we have 2 different objects here:

RangeBounds (former global statistics) - key-values of the weights used by the partitioner with hash

Statistics (former completed statistics) - Sketch or Map without hash, but full of data

I think we are just confusing them because of historical reasons.

Yes, conceptually we have two types of objects here. For Map statistics, there is no difference between global statistics and completed statistics, as there is no further reduction in stats size. For sketch statistics, global statistics are a lot smaller, as they hold range bounds rather than the full samples.

We can introduce two types, CompleteStatistics and GlobalStatistics. We can also introduce a base type AggregatedStatistics. I am trying to avoid duplicating the AggregatedStatisticsSerializer, as it can work for both types. Maybe generics can enable code reuse and remove most of the duplication.

The last commit separated out the two types.

I have removed the AggregatedStatistics base class. With the change to GlobalStatistics for Map key assignment, the base class doesn't make much sense anymore. Now GlobalStatistics is used by the partitioner, and CompletedStatistics holds the raw aggregated stats.
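
To illustrate why the raw samples and the derived range bounds are worth keeping apart: the bounds depend on the downstream parallelism, so they have to be recomputed from the samples after a rescale. The following is only a minimal sketch under simplifying assumptions (plain `int` keys instead of `SortKey`, and a hypothetical `rangeBounds` helper); it is not the code from this PR.

```java
import java.util.Arrays;

public class RangeBoundsSketch {
  /**
   * Hypothetical helper: derive (numPartitions - 1) range bounds from raw samples.
   * The samples play the role of the completed statistics; the returned bounds
   * play the role of the much smaller global statistics.
   */
  public static int[] rangeBounds(int numPartitions, int[] samples) {
    int[] sorted = samples.clone();
    Arrays.sort(sorted);
    // Never ask for more bounds than there are samples.
    int numBounds = Math.max(0, Math.min(numPartitions - 1, sorted.length));
    int[] bounds = new int[numBounds];
    double step = (double) sorted.length / numPartitions;
    for (int i = 0; i < numBounds; i++) {
      // Pick the last sample of each evenly sized slice as the upper bound.
      int position = (int) Math.ceil(step * (i + 1)) - 1;
      bounds[i] = sorted[position];
    }
    return bounds;
  }
}
```

With the same twelve samples, four partitions need three bounds while two partitions need only one, which is why bounds computed for the old parallelism cannot simply be reused.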

stevenzwu · 2024-06-06T18:24:11Z

        StatisticsUtil.deserializeAggregatedStatistics(
            statisticsEvent.statisticsBytes(), aggregatedStatisticsSerializer);
    checkStatisticsTypeMigration();
-    output.collect(new StreamRecord<>(StatisticsOrRecord.fromStatistics(globalStatistics)));


This was a bug previously. We don't want to apply the new stats immediately during the normal aggregation and propagation phase; the switch happens at a checkpoint boundary.

The applyImmediately flag is added in this PR to distinguish stats requested during rescale. In that case, immediate application is desired.

Why not apply the new statistics immediately?

Oh... I remember. We don't want to mess with the ongoing files.

Yeah. Because the Iceberg sink flushes and commits at checkpoint boundaries, switching at a checkpoint boundary allows all subtasks to switch to the new stats in the same checkpoint cycle. Otherwise, some records would be shuffled based on the old stats and some based on the new stats.
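
The two application modes discussed above can be sketched roughly as follows. This is a simplified illustration, not the operator code from the PR: the class and method names are hypothetical, and a `String` stands in for the statistics object.

```java
public class StatsSwitchSketch {
  private String activeStats;  // statistics currently used for shuffling
  private String pendingStats; // statistics received but not yet applied

  public StatsSwitchSketch(String initialStats) {
    this.activeStats = initialStats;
  }

  /** Hypothetical handler for a statistics event sent by the coordinator. */
  public void onStatistics(String stats, boolean applyImmediately) {
    if (applyImmediately) {
      // Rescale case: the stats were explicitly requested, so use them right away.
      activeStats = stats;
      pendingStats = null;
    } else {
      // Normal propagation: hold the stats until the next checkpoint boundary.
      pendingStats = stats;
    }
  }

  /** Hypothetical hook invoked at the checkpoint boundary. */
  public void onCheckpoint() {
    if (pendingStats != null) {
      activeStats = pendingStats;
      pendingStats = null;
    }
  }

  public String activeStats() {
    return activeStats;
  }
}
```

Deferring the switch keeps every record within one checkpoint cycle shuffled by the same statistics, while the immediate path covers the restart after a rescale.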

pvary · 2024-06-06T20:08:07Z

+          // Asynchronously request the latest global statistics calculated with new downstream
+          // parallelism. It is possible events may have started flowing before coordinator responds
+          // with global statistics. In this case, range partitioner would just blindly shuffle
+          // records in round robin fashion.


Maybe we would be better off using the old ranges with some heuristics?

If we have a higher number of subtasks, just use the old ranges and leave the extra tasks idle

If we have a lower number of subtasks, then use modulo?

Or is the communication fast enough that it isn't worth the complexity?

Great question. I also thought about the fallback heuristic. Initially, I was thinking we could start simple and improve this part if it turns out to be a problem. Communication should be relatively fast (maybe a few ms to 10 ms) if parallelism/fan-out is not very high.

> If we have a higher number of subtasks, just use the old ranges and leave the extra tasks idle

Scale-up doesn't require any code change. It works like this already.

> If we have a lower number of subtasks, then use modulo?

I agree modulo could be a sensible strategy.

pro: better clustering than round-robin

con: uneven distribution. Some subtasks may get double the load of other subtasks. But if the stats refresh is fast (less than a dozen ms), maybe this is not a concern.

Hence, I am in favor of implementing the fallback behavior for rescale

BTW, we haven't added the SketchRangePartitioner yet. Fallback handling would be implemented there, so it is out of scope for this PR.

Added the SketchRangePartitioner and RangePartitioner to this PR, so the scope is bigger now.
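
The fallback heuristic discussed in this thread can be sketched as below. It is a hedged illustration only: `adjustPartition` and `loadPerSubtask` are hypothetical names, and the actual partitioners in the PR may handle rescale differently.

```java
public class RescaleFallbackSketch {
  /**
   * Map a partition index computed from stats for the OLD parallelism onto the
   * CURRENT number of downstream subtasks.
   * Scale-up: the index is already in range, so extra subtasks stay idle.
   * Scale-down: the index wraps around via modulo.
   */
  public static int adjustPartition(int partitionFromOldStats, int numPartitions) {
    return partitionFromOldStats % numPartitions;
  }

  /** Count how many old ranges land on each subtask after the adjustment. */
  public static int[] loadPerSubtask(int oldParallelism, int newParallelism) {
    int[] load = new int[newParallelism];
    for (int p = 0; p < oldParallelism; p++) {
      load[adjustPartition(p, newParallelism)]++;
    }
    return load;
  }
}
```

Scaling down from 6 to 4 subtasks gives two subtasks two old ranges each and the other two a single range, which is the "double the load" con mentioned above. Scaling up from 4 to 6 leaves the last two subtasks idle.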

pvary · 2024-06-06T20:12:05Z

+              CHAR_KEYS.get("b"),
+              CHAR_KEYS.get("b"),
+              CHAR_KEYS.get("a"),
+              CHAR_KEYS.get("a"),


So the key samples are not ordered, just randomly selected samples from the incoming records?

I had expected a bit more (based on my tests with sketches over long values), but having arbitrary keys probably doesn't allow a better approximation.

The sketch returns the array that holds the reservoir samples. Before the array is full, samples are added to it in arrival order.
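
The behavior described above is what a textbook reservoir sampler does. The class below is a hand-written illustration of that technique (Algorithm R), not the sketch library used by Iceberg; all names in it are hypothetical.

```java
import java.util.Arrays;
import java.util.Random;

public class ReservoirSketch {
  private final String[] samples;
  private final Random random;
  private int seen = 0; // number of items offered so far

  public ReservoirSketch(int capacity, long seed) {
    this.samples = new String[capacity];
    this.random = new Random(seed);
  }

  public void update(String item) {
    seen++;
    if (seen <= samples.length) {
      // Reservoir not full yet: append in arrival order.
      samples[seen - 1] = item;
    } else {
      // Reservoir full: replace a random slot with probability capacity / seen.
      int slot = random.nextInt(seen);
      if (slot < samples.length) {
        samples[slot] = item;
      }
    }
  }

  /** Returns the retained samples; unsorted, and may contain duplicates. */
  public String[] samples() {
    return Arrays.copyOf(samples, Math.min(seen, samples.length));
  }
}
```

Feeding the keys b, b, a, a into a reservoir with spare capacity returns them in exactly that order, matching the expectation in the test snippet above.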

pvary · 2024-06-10T08:33:09Z

@stevenzwu: Unrelated question, but it came up while I was reading the PR:

What happens in CDC cases, where the partition of the record might end up on multiple ranges. Did we make sure that the records with the same id go to the same subtask, so we can create a positional delete if there are multiple changes for the same id?

stevenzwu · 2024-06-10T14:28:42Z

> What happens in CDC cases, where the partition of the record might end up on multiple ranges. Did we make sure that the records with the same id go to the same subtask, so we can create a positional delete if there are multiple changes for the same id?

CDC/Upsert should use the existing hash distribution in FlinkSink. Range distribution shouldn't be used in this case.

pvary · 2024-06-10T18:02:29Z

> > What happens in CDC cases, where the partition of the record might end up on multiple ranges. Did we make sure that the records with the same id go to the same subtask, so we can create a positional delete if there are multiple changes for the same id?

> CDC/Upsert should use the existing hash distribution in FlinkSink. Range distribution shouldn't be used in this case.

I think we can overlay the hash distribution above the ranges, and we could make it work, but I understand your reluctance to take on too much in one go.

stevenzwu · 2024-06-10T18:12:57Z

> I think we can overlay the hash distribution above the ranges,

Not sure if we want that. Hash distribution (keyBy) is simple and low overhead. Range distribution requires statistics collection.

pvary · 2024-06-10T20:26:52Z

> > I think we can overlay the hash distribution above the ranges,

> Not sure if we want that. Hash distribution (keyBy) is simple and low overhead. Range distribution requires statistics collection.

If you partition your orders by time and need to update an order if it was canceled, then your key/partition is not equally distributed, and hashing is probably not a good option.

I would like to see range partitioning as a precursor to writing ordered files with Flink. If we use constructs similar to the Flink SQL ORDER_BY, then we can order the rows before writing them out. If we want to do this for CDC streams, then we need to send the records with the same id to the same subtask.

Again, not something immediate, but it might be worth revisiting later.

stevenzwu · 2024-06-10T21:03:14Z

> > > I think we can overlay the hash distribution above the ranges,

> > Not sure if we want that. Hash distribution (keyBy) is simple and low overhead. Range distribution requires statistics collection.

> If you partition your orders by time and need to update an order if it was canceled, then your key/partition is not equally distributed, and hashing is probably not a good option.

> I would like to see range partitioning as a precursor to writing ordered files with Flink. If we use constructs similar to the Flink SQL ORDER_BY, then we can order the rows before writing them out. If we want to do this for CDC streams, then we need to send the records with the same id to the same subtask.

> Again, not something immediate, but it might be worth revisiting later.

I am happy to discuss it here. I agree that it is not something we want to address in this PR.

ORDER_BY is essentially the SortOrder defined in table properties. Note that currently the Flink writer doesn't sort rows within a data file. The range partitioner only splits keys by range across files for better clustering.

In the example of an orders table partitioned by time (say hourly), the primary key would be the (hour(ts), order_id) tuple. Hash distribution (keyBy) can work and ensure correctness. You are saying range distribution would work better because of the better clustering of order_id within an hourly partition, correct? With hash distribution, each subtask writes a number of data files equal to the number of hours in a checkpoint cycle. Range distribution can handle the event time skew (recent hours have more data than the distant past). Yes, I agree with that assessment. However, there are some challenges with handling new values (like new hours as time advances). Right now, they are handled by round-robin, as there is no assumption that the SortOrder should be treated as primary keys. Primary keys are better and more safely handled by hash distribution.

pvary · 2024-06-11T09:22:04Z

+      Map<SortKey, Long> keyFrequency,
+      SortKey[] rangeBounds) {


Don't we just send the rangeBounds back as a GlobalStatistics?

Yes, we do. rangeBounds is also an array of SortKey.

Here is an example from SketchUtil:

   * To understand how range bounds are used in range partitioning, here is an example for human
   * ages with 4 partitions: [15, 32, 60]. The 4 ranges would be
   *
   * <ul>
   *   <li>age <= 15
   *   <li>age > 15 && age <= 32
   *   <li>age > 32 && age <= 60
   *   <li>age > 60
   * </ul>
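
The age example can be turned into a small lookup. This is a minimal sketch assuming plain `int` keys and sorted, distinct bounds; the `partition` helper is hypothetical and not the SketchUtil implementation.

```java
import java.util.Arrays;

public class AgeRangeSketch {
  /**
   * Return the index of the range a key falls into, given sorted, distinct
   * range bounds. Each bound is the inclusive upper end of its range, so
   * N bounds define N + 1 ranges.
   */
  public static int partition(int[] rangeBounds, int key) {
    int idx = Arrays.binarySearch(rangeBounds, key);
    // Exact match: the key equals a bound and belongs to that bound's range.
    // No match: binarySearch returns -(insertionPoint) - 1, and the insertion
    // point is the index of the first bound greater than the key.
    return idx >= 0 ? idx : -(idx + 1);
  }
}
```

With the bounds [15, 32, 60], an age of 15 maps to range 0, 16 and 32 map to range 1, 60 maps to range 2, and 61 maps to range 3.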

pvary · 2024-06-17T12:02:27Z

+  private Partitioner<RowData> delegatePartitioner(GlobalStatistics statistics) {
+    if (statistics.type() == StatisticsType.Map) {
+      return new MapRangePartitioner(schema, sortOrder, statistics.mapAssignment());
+    } else if (statistics.type() == StatisticsType.Sketch) {
+      return new SketchRangePartitioner(schema, sortOrder, statistics.rangeBounds());
+    } else {
+      throw new IllegalArgumentException(
+          String.format("Invalid statistics type: %s. Should be Map or Sketch", statistics.type()));
+    }
+  }


I am still struggling a bit to create a good mental model for the GlobalStatistics distribution.

I feel that separating out the GlobalStatistics would be good if we could create a data structure which could be used by both MapRangePartitioner and SketchRangePartitioner, but sadly these partitioners require different input data.

Would it make sense to serialize/parametrize the partitioner on the JM side and send it in the records instead of the GlobalStatistics in the StatisticsOrRecord?
Something like PartitionerOrRecord, where on the partitioner side we just call partition and forget about it.

Would this be easier to understand for others?

> I feel that separating out the GlobalStatistics would be good if we could create a data structure which could be used by both MapRangePartitioner and SketchRangePartitioner

GlobalStatistics is for that purpose. It contains either the map assignment or the range bounds.

> Would it make sense to serialize/parametrize the partitioner on the JM side and send it in the records instead of the GlobalStatistics in the StatisticsOrRecord?

Not sure if there is much benefit in serializing the delegate partitioner on the JM side and shipping it to the subtasks. That would require Java serialization and make the checkpoint state more complex for schema evolution.

My issue with the GlobalStatistics is that it is not really global. The main payloads (mapAssignments and rangeBounds) are different for the different Partitioners. So basically we pretend that we send statistics, but in reality the data already describes the partitioner. Maybe it would be cleaner to accept this and mirror it in the code, class names, etc.

Instead of writing a serializer for the GlobalStatistics, we can write a serializer for the respective partitioners. I don't see this as a blocker.

I don't have a very strong opinion on this, and I wanted you to understand why I feel that the code is a bit awkward in this case. What is usually a yellow flag for me is an object where some fields are mutually exclusive. I try to examine those again to understand whether we have a single object, or we just merged 2 objects into one.

GlobalStatistics is global in the sense that it is the statistics aggregated by the coordinator. We have a facade/proxy partitioner that delegates the partition decision to the underlying partitioner based on the statistics type.

After our offline discussion I understand your points better. I understand that the concept is:

We collect the statistics

Then we use the statistics to create a partitioner

and we try to stick to that concept.

I still think it is confusing that we currently collect 2 types of statistics and each of them is tightly coupled to one of the 2 types of partitioners we have. (Type.Map is always used by MapRangePartitioner, and Type.Sketch is always used by SketchRangePartitioner). We could create the MapRangePartitioner and the SketchRangePartitioner on the coordinator side, and send them to a PartitionerExecutor which just deserializes them and runs them. So the real logic would be in a single place (DataStatisticsCoordinator).

That said, the improvement is just coding style preference, as the Partitioners still need to serialize the underlying statistics, so the performance would be the same.

So we can move forward with your proposed solution.
Thanks for the discussion!

pvary

Hi Steven,
I think we discussed every comment.
Could we run the tests one more time before merging? It was a long time ago when they were running, and it might be good to double check before merging.

stevenzwu · 2024-07-20T22:24:05Z

> Hi Steven, I think we discussed every comment. Could we run the tests one more time before merging? It was a long time ago when they were running, and it might be good to double check before merging.

Let me rebase onto the latest main branch.

stevenzwu · 2024-07-22T20:59:32Z

Thanks @pvary for the review.

stevenzwu requested a review from pvary June 6, 2024 18:18

github-actions Bot added the flink label Jun 6, 2024

stevenzwu commented Jun 6, 2024

pvary reviewed Jun 6, 2024

Comment thread ....19/flink/src/main/java/org/apache/iceberg/flink/sink/shuffle/DataStatisticsCoordinator.java

pvary reviewed Jun 6, 2024

stevenzwu force-pushed the shuffle-rescale-handling branch 3 times, most recently from 01ef84a to 42c0113 Compare June 7, 2024 03:51

stevenzwu force-pushed the shuffle-rescale-handling branch 2 times, most recently from 39235fe to e6b3d9c Compare June 10, 2024 23:43

pvary reviewed Jun 11, 2024

stevenzwu force-pushed the shuffle-rescale-handling branch 3 times, most recently from 2bfd489 to 00fed5b Compare June 11, 2024 23:19

stevenzwu changed the title ~~Flink: handle rescale properly for range bounds in sketch statistics~~ Flink: handle rescale properly and refactor statistics Jun 12, 2024

stevenzwu force-pushed the shuffle-rescale-handling branch from 00fed5b to 163dbd7 Compare June 12, 2024 04:40

pvary reviewed Jun 17, 2024

stevenzwu force-pushed the shuffle-rescale-handling branch from 163dbd7 to 7051b49 Compare June 18, 2024 23:39

pvary approved these changes Jul 19, 2024

stevenzwu added 3 commits July 20, 2024 15:24

Flink: handle rescale properly for range bounds in sketch statistics

8c97890

implement general statistics reconcilation btw operator and coordinat…

7ffd5bb

…or that is more broadly useful

separate out CompletedStatistics and GlobalStatistics

8a94f3d

compute the key assignment in the coordinator before sending GlobalSt…

7446959

…atistics to operators

stevenzwu force-pushed the shuffle-rescale-handling branch from 7051b49 to 7446959 Compare July 20, 2024 22:24

stevenzwu merged commit 604b2bb into apache:main Jul 22, 2024

stevenzwu added a commit to stevenzwu/iceberg that referenced this pull request Jul 23, 2024

Flink: backport PR apache#10457 for handling rescale and statistics r…

cdf4426

…efactoring in smart shuffling

stevenzwu added a commit to stevenzwu/iceberg that referenced this pull request Jul 23, 2024

Flink: backport PR apache#10331 and PR apache#10457

a5d2adc

stevenzwu added a commit that referenced this pull request Jul 26, 2024

Flink: backport PR #10331 and PR #10457 (#10757)

4dbc7f5

jasonf20 pushed a commit to jasonf20/iceberg that referenced this pull request Aug 4, 2024

Flink: handle rescale properly and refactor statistics (apache#10457)

d81b1e4

zachdisc pushed a commit to zachdisc/iceberg that referenced this pull request Dec 23, 2024

Flink: handle rescale properly and refactor statistics (apache#10457)

01338b8

zachdisc pushed a commit to zachdisc/iceberg that referenced this pull request Dec 23, 2024

Flink: backport PR apache#10331 and PR apache#10457 (apache#10757)

9d23bb0

czy006 pushed a commit to czy006/iceberg that referenced this pull request Apr 2, 2025

Flink: handle rescale properly and refactor statistics (apache#10457)

869cdc4

(cherry picked from commit 604b2bb)

czy006 pushed a commit to czy006/iceberg that referenced this pull request Apr 2, 2025

Flink: backport PR apache#10331 and PR apache#10457 (apache#10757)

a8155fa

(cherry picked from commit 4dbc7f5)

czy006 pushed a commit to czy006/iceberg that referenced this pull request Apr 2, 2025

Flink: handle rescale properly and refactor statistics (apache#10457)

fd1c96a

(cherry picked from commit 604b2bb)

czy006 pushed a commit to czy006/iceberg that referenced this pull request Apr 2, 2025

Flink: backport PR apache#10331 and PR apache#10457 (apache#10757)

8464923

(cherry picked from commit 4dbc7f5)
