Skip to content

persist: periodically yield in Replay operator#9615

Merged
danhhz merged 1 commit into
MaterializeInc:mainfrom
danhhz:persist_replay_yield
Feb 16, 2022
Merged

persist: periodically yield in Replay operator#9615
danhhz merged 1 commit into
MaterializeInc:mainfrom
danhhz:persist_replay_yield

Conversation

@danhhz
Copy link
Copy Markdown
Contributor

@danhhz danhhz commented Dec 15, 2021

Instead of emitting it all the first time the operator is scheduled.
This allows downstream operators a chance to process the data it has
emitted, possibly reducing it down and evening out memory usage.

Until the persist Replay operator has finished, it can't downgrade its
output capability. However, Frank has said "I would say that operators
are mostly either 1. stateless and benefit from streaming through, or 2.
arrange."

Even arrange has some opportunity for data reduction. An arrange
downstream of replay would just be buffering the replayed data until the
frontier advances, but it does consolidate within this buffer. Persist
keeps things mostly consolidated, but unsealed data is unconsolidated
and even trace data is not across batches (which could be relevant
depending on the as_of).

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered.
  • [N/A] This PR adds a release note for any user-facing behavior changes.

This change is Reviewable

@danhhz danhhz requested review from aljoscha and ruchirK December 15, 2021 20:12
@danhhz danhhz force-pushed the persist_replay_yield branch from aacda5b to c94e2eb Compare December 15, 2021 21:13
Copy link
Copy Markdown
Contributor

@aljoscha aljoscha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes look good to me! Do we have any idea if there's a negative (or positive) impact from this? Any benchmarks where this change moves anything?

Copy link
Copy Markdown
Contributor

@ruchirK ruchirK left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

debug_assert!(snapshot_since.less_equal(ts));
}
session.give(x);
if idx + 1 >= outputs_per_yield {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it was not obvious to me that enumerate will restart counting from zero every time we re-enter the operator - maybe worth a comment?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@danhhz
Copy link
Copy Markdown
Contributor Author

danhhz commented Dec 16, 2021

Good call on the benchmarks. I'll wait until we have the dec milestone benchmarks and use those (along with our microbenchmarks and the WIP "feature benchmarks" one that I think Philip has) to eval the perf diff. Will add ruchir's requested comment and then mark as draft in the meantime

@danhhz danhhz marked this pull request as draft December 16, 2021 19:37
@ruchirK
Copy link
Copy Markdown
Contributor

ruchirK commented Dec 16, 2021

I'm going to test this out locally actually. I'm seeing my pr with compaction enabled for kafka upsert sources be really fast on initial load (comparable to the non-persisted case, query times and mem usage are all as expected), but on restart memory usage spikes up a lot, roughly double the expected (where expected already factors in the doubling due to BlobCache), before settling down to something closer to the steady state.

@ruchirK
Copy link
Copy Markdown
Contributor

ruchirK commented Jan 12, 2022

The benchmark I'm currently working on no longer restarts Materialize so it won't be relevant in testing this. Have you tried testing with the KafkaRecovery feature benchmark - that seems like the current best test?

@danhhz
Copy link
Copy Markdown
Contributor Author

danhhz commented Jan 12, 2022

The benchmark I'm currently working on no longer restarts Materialize so it won't be relevant in testing this. Have you tried testing with the KafkaRecovery feature benchmark - that seems like the current best test?

No, but I don't expect there to be much difference. KafkaRecovery is over a pretty small amount of data and this has exactly the same behavior as before up to 1m records. One of the things I said in my big comment on your PR is that we need something similar to KafkaRecovery but with a much larger amount of data.

@philip-stoev
Copy link
Copy Markdown
Contributor

KafkaRecovery is 10M records. I plan to make this configurable, but in the meantime you can edit scenarios.py and have it ingest more records.

@danhhz
Copy link
Copy Markdown
Contributor Author

danhhz commented Jan 12, 2022

KafkaRecovery is 10M records. I plan to make this configurable, but in the meantime you can edit scenarios.py and have it ingest more records.

@philip-stoev The context here is that one of our (now overdue) dec goals was a repeatable benchmark for this. As I was mentioning in the other comment, KafkaRecovery is basically perfect, but we need it to be quite a big bigger for this goal. Editing scenarios.py works as a workaround, but the milestone goal isn't so much about having a workaround, but having a declared thing we can easily run. One idea is to fork KafkaRecovery into a new KafkaBigRecovery with more records (1B?) and different timeouts. (Ideally it would also be something more like a chbench schema + key distribution, but maybe we can skip that in the short term.) I worry though about how much different this KafkaBigRecovery would be than the other feature benchmarks, would it be abusing the system?

@philip-stoev
Copy link
Copy Markdown
Contributor

@danhhz understood, let me poke a bit and I will get back to you. I think doing 1B within the existing test infrastructure with a separately-named KafkaRecoveryBig is totally doable, but 1B rows are a lot of rows, recovering 10M rows causes the memory to spike to 7Gb. 1B rows is 100x that . So I guess they can not be 1B unique rows ... if you can point me to e.g. kgen parameters that you think are what you would like to see, I will try to cook the rest.

@danhhz
Copy link
Copy Markdown
Contributor Author

danhhz commented Jan 12, 2022

To be clear, this is a goal for the persist team, so don't feel like you need to be randomized by it. We're also happy to do the work with your design direction.

I'm not familiar with kgen yet, so I'll have to get back to you on that. 1B was just a number I pulled out of the air becuase it's 100x the current test size and 100x the current running time of the test was around the size I'm looking for. As for the memory usage, I think instead of a materialized source, we'd want a non-materialized source with some very simple derived view that reduces down the data a bunch.

@ruchirK
Copy link
Copy Markdown
Contributor

ruchirK commented Jan 12, 2022

@danhhz understood. We had talked yesterday about how this PR was blocked on the missing large benchmark so I was wondering if there were ways to unblock you quickly more than anything else. I'll push out a more self contained restarts benchmark in a separate pr ASAP and we can go from there.

@danhhz
Copy link
Copy Markdown
Contributor Author

danhhz commented Jan 12, 2022

This PR isn't being blocked isn't an issue, there's no immediate pressure to get it merged (beyond that it should be merged by end of Jan). I'm intentionally holding it in reserve as a test case for our new benchmark infra.

@danhhz danhhz force-pushed the persist_replay_yield branch from c400cb3 to 71b9b40 Compare January 24, 2022 16:23
Copy link
Copy Markdown
Contributor Author

@danhhz danhhz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we have KafkaRecoveryBig now, so rebased this and will benchmark it using that. most of the conflicts were straightforward, with the exception of the as_of error path's early return not playing well with the done bool. I used an inline closure, but I'm not entirely happy with that. will continue to think on this while the benchmark runs

debug_assert!(snapshot_since.less_equal(ts));
}
session.give(x);
if idx + 1 >= outputs_per_yield {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@danhhz danhhz marked this pull request as ready for review January 24, 2022 16:27
@danhhz danhhz force-pushed the persist_replay_yield branch from 71b9b40 to 327024d Compare January 24, 2022 16:50
@danhhz
Copy link
Copy Markdown
Contributor Author

danhhz commented Jan 26, 2022

Hmm didn't see any change, which is a little surprising. At the very least, I should be able to see memory differences, I'll mull over how to verify that.

NAME                      |    THIS     |    OTHER    |  Regression?  | 'THIS' is:
----------------------------------------------------------------------------------------------------
KafkaRecoveryBig          |     232.562 |     238.432 |      no       | 2.5 pct   faster

Raw data

THIS
Measurement: 330.2072992324829
Measurement: 234.98467683792114
Measurement: 232.56237816810608
Measurement: 234.65807366371155
Measurement: 234.2739589214325
Measurement: 234.28390955924988
Measurement: 232.67034482955933
Measurement: 236.2846715450287
Measurement: 233.5317666530609
Measurement: 234.31296253204346
Measurement: 237.96830582618713
Measurement: 238.04564118385315
Measurement: 235.69454836845398
Measurement: 234.7007234096527
Measurement: 236.71924591064453
Measurement: 235.60178422927856
Measurement: 233.97286939620972
Measurement: 233.72099256515503
Measurement: 235.5216281414032
Measurement: 234.14577746391296
Measurement: 236.52861404418945
Measurement: 237.27555513381958
Measurement: 236.75176739692688
Measurement: 234.7529091835022
Measurement: 233.6336591243744
OTHER
Measurement: 319.0679326057434
Measurement: 242.08777618408203
Measurement: 239.47660517692566
Measurement: 239.54767870903015
Measurement: 240.82523226737976
Measurement: 239.7476623058319
Measurement: 241.9002721309662
Measurement: 244.15596652030945
Measurement: 239.57650589942932
Measurement: 239.7907793521881
Measurement: 241.4216079711914
Measurement: 242.72080755233765
Measurement: 241.24875330924988
Measurement: 240.4369044303894
Measurement: 238.4321653842926
Measurement: 243.7324435710907
Measurement: 239.53394269943237
Measurement: 243.04467511177063
Measurement: 240.1298336982727
Measurement: 238.93053650856018
Measurement: 243.06122875213623
Measurement: 242.3199062347412
Measurement: 244.07579517364502
Measurement: 242.8982322216034
Measurement: 242.10720443725586

@aljoscha
Copy link
Copy Markdown
Contributor

I'm thinking we maybe shouldn't expect differences? With KafkaRecoveryBig, I would expect the unsealeds to be tiny compared to the trace, and maybe the trace data also cannot be compacted. So the additional yielding we do doesn't allow any downstream work and only when we have read everything will the worker start doing other work.

@ruchirK
Copy link
Copy Markdown
Contributor

ruchirK commented Jan 26, 2022

you can use docker stats to look at the memory footprint over time for now? Also, I'm looking at the benchmark and the actual view that gets rendered is

CREATE MATERIALIZED VIEW s1_is_complete AS SELECT COUNT(*) = 256 FROM s1 WHERE key0 <= '\\x00000000000000ff'

so hypothesis: we don't see a mem difference because data doesn't make it into the holding pen for reduce because it gets filtered out by this extremely restrictive filter? I haven't actually verified but imo you could create a materialized view that tracks e.g. the max mz_offset and asserts that the sum of the max mz_offset by partition = number of records ingested? that view will use all the data and is maybe more likely to see a memory reduction?

Instead of emitting it all the first time the operator is scheduled.
This allows downstream operators a chance to process the data it has
emitted, possibly reducing it down and evening out memory usage.

Until the persist Replay operator has finished, it can't downgrade its
output capability. However, Frank has said "I would say that operators
are mostly either 1. stateless and benefit from streaming through, or 2.
arrange."

Even arrange has some opportunity for data reduction. An arrange
downstream of replay would just be buffering the replayed data until the
frontier advances, but it does consolidate within this buffer. Persist
keeps things mostly consolidated, but unsealed data is unconsolidated
and even trace data is not across batches (which could be relevant
depending on the as_of).
@danhhz danhhz force-pushed the persist_replay_yield branch from 327024d to eb4f8a6 Compare February 15, 2022 19:13
@danhhz
Copy link
Copy Markdown
Contributor Author

danhhz commented Feb 15, 2022

Ran KafkaRecoveryBig hooked up to prometheus to try to investigate and uhhhh....

+++ Benchmark Report
NAME                      |    THIS     |    OTHER    |  Regression?  | 'THIS' is:
----------------------------------------------------------------------------------------------------
KafkaRecoveryBig          |     384.605 |     185.154 |    !!YES!!    | 2.1 TIMES slower

I only ran it the once and the first results looked pretty noisy before so who knows what happened. I'll run it again a few times.

Meanwhile, I couldn't particularly see any differences in the memory usage, but the metrics make it pretty obvious that compaction is happening while we're replaying, so that will certainly contribute to the noise. I think the real answer here is running the KafkaRecoveryBig restarted mz with compaction disabled, but in the short term, maybe it will even out after a few runs?

@danhhz
Copy link
Copy Markdown
Contributor Author

danhhz commented Feb 15, 2022

Okay, running KafkaRecoveryBig with --max-runs 5 resulted in something that looks similar to my previous results (modulo both sides have since been sped up). Memory usage still seems too noisy to draw any strong conclusions, but we can definitely see the effect when looking at timely thread cpu.

+++ Benchmark Report
NAME                      |    THIS     |    OTHER    |  Regression?  | 'THIS' is:
----------------------------------------------------------------------------------------------------
KafkaRecoveryBig          |     159.069 |     161.538 |      no       | 1.5 pct   faster

THIS
Measurement: 345.1443648338318
Measurement: 159.06885743141174
Measurement: 161.4558823108673
Measurement: 163.5795922279358
Measurement: 163.95773577690125
OTHER
Measurement: 192.78956246376038
Measurement: 161.53803634643555
Measurement: 170.65334010124207
Measurement: 179.84094738960266
Measurement: 173.65583276748657

@ruchirK This seems like it might be a small throughput improvement (1%) and it seems to be doing what it's supposed to. I'm inclined to merge it at this point, but curious what your thoughts are. We could also take a swing at a cli flag to disable unsealed_drain and trace_compaction to clean up the benchmark results.

@ruchirK
Copy link
Copy Markdown
Contributor

ruchirK commented Feb 15, 2022

That all seems fine and I agree its fine to merge

re: cli flags to disable compaction/drain unsealed -- materialize already has a builtin flag for compaction --logical-compaction-window=off. There's no obvious analogue for drain unsealed, although I think you could observe the effect of doing so by setting --timestamp-frequency=100s or something like that? I'm way more confident in the first than the second

@danhhz
Copy link
Copy Markdown
Contributor Author

danhhz commented Feb 15, 2022

Oh that's clever. I'm not sure whether it's too "action at a distance" to check it in to KafkaRecoveryBig like that, but it certainly should be good enough to run a quick experiment on this branch

@danhhz
Copy link
Copy Markdown
Contributor Author

danhhz commented Feb 15, 2022

Unsealed is less of an issue because in practice it seem to only poison the first run of each side

@danhhz
Copy link
Copy Markdown
Contributor Author

danhhz commented Feb 15, 2022

Huh, something isn't working with --logical-compaction-window=off

Measurement: 0.0012362003326416016

@danhhz danhhz merged commit 4a54b29 into MaterializeInc:main Feb 16, 2022
@danhhz danhhz deleted the persist_replay_yield branch February 16, 2022 15:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants