persist: periodically yield in Replay operator by danhhz · Pull Request #9615 · MaterializeInc/materialize

danhhz · 2021-12-15T20:12:56Z

Instead of emitting it all the first time the operator is scheduled.
This allows downstream operators a chance to process the data it has
emitted, possibly reducing it down and evening out memory usage.

Until the persist Replay operator has finished, it can't downgrade its
output capability. However, Frank has said "I would say that operators
are mostly either 1. stateless and benefit from streaming through, or 2.
arrange."

Even arrange has some opportunity for data reduction. An arrange
downstream of replay would just be buffering the replayed data until the
frontier advances, but it does consolidate within this buffer. Persist
keeps things mostly consolidated, but unsealed data is unconsolidated
and even trace data is not across batches (which could be relevant
depending on the as_of).

Checklist

This PR has adequate test coverage / QA involvement has been duly considered.
[N/A] This PR adds a release note for any user-facing behavior changes.

This change is

aljoscha

The changes look good to me! Do we have any idea if there's a negative (or positive) impact from this? Any benchmarks where this change moves anything?

ruchirK

LGTM!

ruchirK · 2021-12-15T21:51:35Z

                                    debug_assert!(snapshot_since.less_equal(ts));
                                }
                                session.give(x);
+                                if idx + 1 >= outputs_per_yield {


it was not obvious to me that enumerate will restart counting from zero every time we re-enter the operator - maybe worth a comment?

danhhz · 2021-12-16T19:36:13Z

Good call on the benchmarks. I'll wait until we have the dec milestone benchmarks and use those (along with our microbenchmarks and the WIP "feature benchmarks" one that I think Philip has) to eval the perf diff. Will add ruchir's requested comment and then mark as draft in the meantime

ruchirK · 2021-12-16T19:47:33Z

I'm going to test this out locally actually. I'm seeing my pr with compaction enabled for kafka upsert sources be really fast on initial load (comparable to the non-persisted case, query times and mem usage are all as expected), but on restart memory usage spikes up a lot, roughly double the expected (where expected already factors in the doubling due to BlobCache), before settling down to something closer to the steady state.

ruchirK · 2022-01-12T08:06:03Z

The benchmark I'm currently working on no longer restarts Materialize so it won't be relevant in testing this. Have you tried testing with the KafkaRecovery feature benchmark - that seems like the current best test?

danhhz · 2022-01-12T15:23:21Z

The benchmark I'm currently working on no longer restarts Materialize so it won't be relevant in testing this. Have you tried testing with the KafkaRecovery feature benchmark - that seems like the current best test?

No, but I don't expect there to be much difference. KafkaRecovery is over a pretty small amount of data and this has exactly the same behavior as before up to 1m records. One of the things I said in my big comment on your PR is that we need something similar to KafkaRecovery but with a much larger amount of data.

philip-stoev · 2022-01-12T15:29:58Z

KafkaRecovery is 10M records. I plan to make this configurable, but in the meantime you can edit scenarios.py and have it ingest more records.

danhhz · 2022-01-12T16:54:06Z

KafkaRecovery is 10M records. I plan to make this configurable, but in the meantime you can edit scenarios.py and have it ingest more records.

@philip-stoev The context here is that one of our (now overdue) dec goals was a repeatable benchmark for this. As I was mentioning in the other comment, KafkaRecovery is basically perfect, but we need it to be quite a big bigger for this goal. Editing scenarios.py works as a workaround, but the milestone goal isn't so much about having a workaround, but having a declared thing we can easily run. One idea is to fork KafkaRecovery into a new KafkaBigRecovery with more records (1B?) and different timeouts. (Ideally it would also be something more like a chbench schema + key distribution, but maybe we can skip that in the short term.) I worry though about how much different this KafkaBigRecovery would be than the other feature benchmarks, would it be abusing the system?

philip-stoev · 2022-01-12T17:03:52Z

@danhhz understood, let me poke a bit and I will get back to you. I think doing 1B within the existing test infrastructure with a separately-named KafkaRecoveryBig is totally doable, but 1B rows are a lot of rows, recovering 10M rows causes the memory to spike to 7Gb. 1B rows is 100x that . So I guess they can not be 1B unique rows ... if you can point me to e.g. kgen parameters that you think are what you would like to see, I will try to cook the rest.

danhhz · 2022-01-12T17:10:33Z

To be clear, this is a goal for the persist team, so don't feel like you need to be randomized by it. We're also happy to do the work with your design direction.

I'm not familiar with kgen yet, so I'll have to get back to you on that. 1B was just a number I pulled out of the air becuase it's 100x the current test size and 100x the current running time of the test was around the size I'm looking for. As for the memory usage, I think instead of a materialized source, we'd want a non-materialized source with some very simple derived view that reduces down the data a bunch.

ruchirK · 2022-01-12T17:28:05Z

@danhhz understood. We had talked yesterday about how this PR was blocked on the missing large benchmark so I was wondering if there were ways to unblock you quickly more than anything else. I'll push out a more self contained restarts benchmark in a separate pr ASAP and we can go from there.

danhhz · 2022-01-12T17:36:01Z

This PR isn't being blocked isn't an issue, there's no immediate pressure to get it merged (beyond that it should be merged by end of Jan). I'm intentionally holding it in reserve as a test case for our new benchmark infra.

danhhz

we have KafkaRecoveryBig now, so rebased this and will benchmark it using that. most of the conflicts were straightforward, with the exception of the as_of error path's early return not playing well with the done bool. I used an inline closure, but I'm not entirely happy with that. will continue to think on this while the benchmark runs

danhhz · 2022-01-24T16:24:20Z

                                    debug_assert!(snapshot_since.less_equal(ts));
                                }
                                session.give(x);
+                                if idx + 1 >= outputs_per_yield {


danhhz · 2022-01-26T01:01:11Z

Hmm didn't see any change, which is a little surprising. At the very least, I should be able to see memory differences, I'll mull over how to verify that.

NAME                      |    THIS     |    OTHER    |  Regression?  | 'THIS' is:
----------------------------------------------------------------------------------------------------
KafkaRecoveryBig          |     232.562 |     238.432 |      no       | 2.5 pct   faster

Raw data

THIS
Measurement: 330.2072992324829
Measurement: 234.98467683792114
Measurement: 232.56237816810608
Measurement: 234.65807366371155
Measurement: 234.2739589214325
Measurement: 234.28390955924988
Measurement: 232.67034482955933
Measurement: 236.2846715450287
Measurement: 233.5317666530609
Measurement: 234.31296253204346
Measurement: 237.96830582618713
Measurement: 238.04564118385315
Measurement: 235.69454836845398
Measurement: 234.7007234096527
Measurement: 236.71924591064453
Measurement: 235.60178422927856
Measurement: 233.97286939620972
Measurement: 233.72099256515503
Measurement: 235.5216281414032
Measurement: 234.14577746391296
Measurement: 236.52861404418945
Measurement: 237.27555513381958
Measurement: 236.75176739692688
Measurement: 234.7529091835022
Measurement: 233.6336591243744
OTHER
Measurement: 319.0679326057434
Measurement: 242.08777618408203
Measurement: 239.47660517692566
Measurement: 239.54767870903015
Measurement: 240.82523226737976
Measurement: 239.7476623058319
Measurement: 241.9002721309662
Measurement: 244.15596652030945
Measurement: 239.57650589942932
Measurement: 239.7907793521881
Measurement: 241.4216079711914
Measurement: 242.72080755233765
Measurement: 241.24875330924988
Measurement: 240.4369044303894
Measurement: 238.4321653842926
Measurement: 243.7324435710907
Measurement: 239.53394269943237
Measurement: 243.04467511177063
Measurement: 240.1298336982727
Measurement: 238.93053650856018
Measurement: 243.06122875213623
Measurement: 242.3199062347412
Measurement: 244.07579517364502
Measurement: 242.8982322216034
Measurement: 242.10720443725586

aljoscha · 2022-01-26T09:04:07Z

I'm thinking we maybe shouldn't expect differences? With KafkaRecoveryBig, I would expect the unsealeds to be tiny compared to the trace, and maybe the trace data also cannot be compacted. So the additional yielding we do doesn't allow any downstream work and only when we have read everything will the worker start doing other work.

ruchirK · 2022-01-26T16:25:42Z

you can use docker stats to look at the memory footprint over time for now? Also, I'm looking at the benchmark and the actual view that gets rendered is

CREATE MATERIALIZED VIEW s1_is_complete AS SELECT COUNT(*) = 256 FROM s1 WHERE key0 <= '\\x00000000000000ff'

so hypothesis: we don't see a mem difference because data doesn't make it into the holding pen for reduce because it gets filtered out by this extremely restrictive filter? I haven't actually verified but imo you could create a materialized view that tracks e.g. the max mz_offset and asserts that the sum of the max mz_offset by partition = number of records ingested? that view will use all the data and is maybe more likely to see a memory reduction?

Instead of emitting it all the first time the operator is scheduled. This allows downstream operators a chance to process the data it has emitted, possibly reducing it down and evening out memory usage. Until the persist Replay operator has finished, it can't downgrade its output capability. However, Frank has said "I would say that operators are mostly either 1. stateless and benefit from streaming through, or 2. arrange." Even arrange has some opportunity for data reduction. An arrange downstream of replay would just be buffering the replayed data until the frontier advances, but it does consolidate within this buffer. Persist keeps things mostly consolidated, but unsealed data is unconsolidated and even trace data is not across batches (which could be relevant depending on the as_of).

danhhz · 2022-02-15T20:17:46Z

Ran KafkaRecoveryBig hooked up to prometheus to try to investigate and uhhhh....

+++ Benchmark Report
NAME                      |    THIS     |    OTHER    |  Regression?  | 'THIS' is:
----------------------------------------------------------------------------------------------------
KafkaRecoveryBig          |     384.605 |     185.154 |    !!YES!!    | 2.1 TIMES slower

I only ran it the once and the first results looked pretty noisy before so who knows what happened. I'll run it again a few times.

Meanwhile, I couldn't particularly see any differences in the memory usage, but the metrics make it pretty obvious that compaction is happening while we're replaying, so that will certainly contribute to the noise. I think the real answer here is running the KafkaRecoveryBig restarted mz with compaction disabled, but in the short term, maybe it will even out after a few runs?

danhhz · 2022-02-15T21:16:45Z

Okay, running KafkaRecoveryBig with --max-runs 5 resulted in something that looks similar to my previous results (modulo both sides have since been sped up). Memory usage still seems too noisy to draw any strong conclusions, but we can definitely see the effect when looking at timely thread cpu.

+++ Benchmark Report
NAME                      |    THIS     |    OTHER    |  Regression?  | 'THIS' is:
----------------------------------------------------------------------------------------------------
KafkaRecoveryBig          |     159.069 |     161.538 |      no       | 1.5 pct   faster

THIS
Measurement: 345.1443648338318
Measurement: 159.06885743141174
Measurement: 161.4558823108673
Measurement: 163.5795922279358
Measurement: 163.95773577690125
OTHER
Measurement: 192.78956246376038
Measurement: 161.53803634643555
Measurement: 170.65334010124207
Measurement: 179.84094738960266
Measurement: 173.65583276748657

@ruchirK This seems like it might be a small throughput improvement (1%) and it seems to be doing what it's supposed to. I'm inclined to merge it at this point, but curious what your thoughts are. We could also take a swing at a cli flag to disable unsealed_drain and trace_compaction to clean up the benchmark results.

ruchirK · 2022-02-15T21:46:45Z

That all seems fine and I agree its fine to merge

re: cli flags to disable compaction/drain unsealed -- materialize already has a builtin flag for compaction --logical-compaction-window=off. There's no obvious analogue for drain unsealed, although I think you could observe the effect of doing so by setting --timestamp-frequency=100s or something like that? I'm way more confident in the first than the second

danhhz · 2022-02-15T22:00:22Z

Oh that's clever. I'm not sure whether it's too "action at a distance" to check it in to KafkaRecoveryBig like that, but it certainly should be good enough to run a quick experiment on this branch

danhhz · 2022-02-15T22:01:08Z

Unsealed is less of an issue because in practice it seem to only poison the first run of each side

danhhz · 2022-02-15T22:25:53Z

Huh, something isn't working with --logical-compaction-window=off

Measurement: 0.0012362003326416016

danhhz requested review from aljoscha and ruchirK December 15, 2021 20:12

danhhz force-pushed the persist_replay_yield branch from aacda5b to c94e2eb Compare December 15, 2021 21:13

aljoscha approved these changes Dec 15, 2021

View reviewed changes

ruchirK approved these changes Dec 15, 2021

View reviewed changes

danhhz marked this pull request as draft December 16, 2021 19:37

aljoscha mentioned this pull request Dec 17, 2021

persist,dataflow: verify expected as_of when replaying streams #9659

Merged

1 task

danhhz force-pushed the persist_replay_yield branch from c400cb3 to 71b9b40 Compare January 24, 2022 16:23

danhhz commented Jan 24, 2022

View reviewed changes

danhhz marked this pull request as ready for review January 24, 2022 16:27

danhhz force-pushed the persist_replay_yield branch from 71b9b40 to 327024d Compare January 24, 2022 16:50

danhhz force-pushed the persist_replay_yield branch from 327024d to eb4f8a6 Compare February 15, 2022 19:13

danhhz merged commit 4a54b29 into MaterializeInc:main Feb 16, 2022

danhhz deleted the persist_replay_yield branch February 16, 2022 15:26

Conversation

danhhz commented Dec 15, 2021 • edited by umanwizard Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Checklist

Uh oh!

aljoscha left a comment

Choose a reason for hiding this comment

Uh oh!

ruchirK left a comment

Choose a reason for hiding this comment

Uh oh!

ruchirK Dec 15, 2021

Choose a reason for hiding this comment

Uh oh!

danhhz Jan 24, 2022

Choose a reason for hiding this comment

Uh oh!

danhhz commented Dec 16, 2021

Uh oh!

ruchirK commented Dec 16, 2021

Uh oh!

ruchirK commented Jan 12, 2022

Uh oh!

danhhz commented Jan 12, 2022

Uh oh!

philip-stoev commented Jan 12, 2022

Uh oh!

danhhz commented Jan 12, 2022

Uh oh!

philip-stoev commented Jan 12, 2022

Uh oh!

danhhz commented Jan 12, 2022

Uh oh!

ruchirK commented Jan 12, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

danhhz commented Jan 12, 2022

Uh oh!

danhhz left a comment

Choose a reason for hiding this comment

Uh oh!

danhhz Jan 24, 2022

Choose a reason for hiding this comment

Uh oh!

danhhz commented Jan 26, 2022

Uh oh!

aljoscha commented Jan 26, 2022

Uh oh!

ruchirK commented Jan 26, 2022

Uh oh!

danhhz commented Feb 15, 2022

Uh oh!

danhhz commented Feb 15, 2022

Uh oh!

ruchirK commented Feb 15, 2022

Uh oh!

danhhz commented Feb 15, 2022

Uh oh!

danhhz commented Feb 15, 2022

Uh oh!

danhhz commented Feb 15, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

danhhz commented Dec 15, 2021 •

edited by umanwizard

Loading

ruchirK commented Jan 12, 2022 •

edited

Loading