Improve AAE fullsync by estimating number of keys by jtuple · Pull Request #623 · basho/riak_repl

jtuple · 2014-10-06T22:31:00Z

This pull-request extends AAE fullsync to sample the AAE trees to estimate the number of keys in a given partition.

This estimate is used to properly size the bloom filter, as well as enable a new percentage-based direct send threshold (eg. direct send up to 10% of differing keys).

To support AAE-based key estimation, this pull-request changes the fullsync logic to update all AAE trees before proceeding to the exchange phase. This change is necessary because all trees must be updated and sampled to calculate a correct estimate.

This pull-request also delays the creation of the bloom filter until it is needed. Fullsyncs that manage to send all differences directly will therefore avoid creating a bloom filter.

This work was performed by @lixen, @rsltrifork, and @krestenkrab. I rebased/squashed commits as necessary to make things apply cleanly on top of the current 2.0 branch, as this work was completed in parallel on top of older changes that were accidentally made against the develop branch.

Note: several commits were squashed together/re-ordered to make porting to the 2.0 branch easier. The original branch had numerous bi-directional merges that made things difficult to port. For reference, the original commits can be found in the jdb-miklix_estimate_keys branch.

Sibling pull-requests: basho/riak_core#633, basho/riak_kv#1030

This commit slightly refactors the riak_repl_aae_source code to make state transitions more obvious. This change also eases the porting of commits from a separate feature branch that implemented a slightly different approach to improved AAE syncing.

This commit merges in the separately prototyped buffer-then-send approach to directly sending key differences. The mode used is determined by the `fullsync_direct_mode` application variable, which defaults to `inline`. * inline differences selected for direct repair are sent as they are encountered during an AAE exchange. * buffered differences selected for direct repair are buffered in memory during an exchange. After the exchange, the buffer is sorted and keys are then sent in order.

A deadlock in AAE fullsync was identified during testing. The deadlock occurs when both the AAE fullsync source and sink processes block on TCP send due to TCP backpressure. Since both processes are blocked, neither can perform a receive to unblock the other. This commit addresses the issue by making the AAE sink use a helper process for sending, therefore ensuring the sink never blocks on a send. The sink will therefore always get around to receiving data at some point, unblocking the source if needed. NOTE: the SSL transport wraps the underlying socket in a gen_server, and all socket functions route through this server. As such, the SSL transport is still vulnerable to this deadlock. Fixing this requires either changing the AAE protocol or patching the OTP SSL implementation. One of the two is planned future work.

When performing direct sends, the AAE fullsync source retrieves individual objects in order to send them. Previously, this was accomplished using normal riak_client gets using the n_val=1 option. However, there are no guarantees about which replica the object would come from using such an approach. This commit changes the AAE fullsync source to instead retrieve objects directly from the vnode performing the fullsync. This ensures that the object sent aligns with the AAE hash used to determine differences. This approach may also have better performance.

Although keylisting has a pool for issuing gets, both the AAE and the Keylist strategies serialize writing objects from the sink process. This adds a pool with a default size of 100 shared across all fullsync sinks that make the writes async. No difference in safety properties, neither version checks the response from the put FSM.

…source data" msg. Minor changes for user experience, clarified with _limit to permit a future _percentage version of fullsync_direct. Lowered the "Gathering source data" message displayed every minute to a debug message.

Reviewed-by: jtuple

…x-2.0' Reviewed-by: jtuple

Improved AAE fullsync integration/2.0 pull request Reviewed-by: jonmeredith

…imate-keys Conflicts: src/riak_repl_aae_source.erl

jtuple · 2014-10-07T04:34:10Z

I've cleaned up and rebased/merged all of the key estimation changes on top of the 2.0 branch, as well as finished up reviewing the code across all 3 repositories.

In general, all the code looks good and does what it's supposed to do. Everything continues to pass the AAE replication tests I've been running, as well as the newer tests I've tried out from the krab-aae-fullsync-wip branch of riak_test.

A few questions/comments.

First, I recall @jonmeredith mentioning something about the protocol negotiation (as stored in the source state) not actually being used. Jon, could you elaborate? More importantly, do you have any comments about the changes made in this pull-request related to protocol negotiation (commits c8ec0d9 and a13c865)?

Second, my merge commit in this pull-request is rather substantial, with lots of code massaging. This is the same as the prior merge I did on the jdb-miklix_estimate_keys branch. I've looked over the code many times and believe it's all good, but having another person review my work would be beneficial. I remember someone mentioning on one of our prior meetings that they looked over the previous merge on the jdb-miklix_estimate_keys branch, so perhaps the same person would make sense to double check again on this pull-request.

Third, is there any benefit to having a toggle to disable key estimation? Could there ever be a scenario where we would want to avoid iterating over 1% of the AAE tree's segments to compute the estimate? Eg. heavily loaded cluster?

Finally, are we happy with the log messages in this branch from a user-facing perspective? Example output:

2014-10-06 20:26:59.801 [info] <0.31687.0>@riak_repl_aae_source:init:90 AAE fullsync source worker started for partition 182687704666362864775460604089535377456991567872
2014-10-06 20:26:59.801 [info] <0.31687.0>@riak_repl_aae_source:update_trees:233 Start update for partition,IndexN 182687704666362864775460604089535377456991567872,[{1278813932664540053428224228626747642198940975104,3},{0,3},{182687704666362864775460604089535377456991567872,3}]
2014-10-06 20:27:00.185 [info] <0.31687.0>@riak_repl_aae_source:update_trees:256 EstimatedNrKeys 4097 for partition 182687704666362864775460604089535377456991567872
2014-10-06 20:27:01.019 [info] <0.31696.0>@riak_repl_aae_source:key_exchange:371 Full-sync with site "B"; fullsync difference generator for 182687704666362864775460604089535377456991567872 complete (completed in 0.83 secs)
2014-10-06 20:27:01.822 [info] <0.31718.0>@riak_repl_aae_source:key_exchange:371 Full-sync with site "B"; fullsync difference generator for 182687704666362864775460604089535377456991567872 complete (completed in 0.8 secs)
2014-10-06 20:27:02.645 [info] <0.31729.0>@riak_repl_aae_source:key_exchange:371 Full-sync with site "B"; fullsync difference generator for 182687704666362864775460604089535377456991567872 complete (completed in 0.82 secs)
2014-10-06 20:27:02.647 [info] <0.31687.0>@riak_repl_aae_source:maybe_send_direct:421 Directly sending 1000 differences for partition 182687704666362864775460604089535377456991567872
2014-10-06 20:27:02.861 [info] <0.31687.0>@riak_repl_aae_source:finish_sending_differences:467 Bloom folding over 2690/3690 differences for partition 182687704666362864775460604089535377456991567872 with EstimatedNrKeys 4097

And in cases where we don't do the bloom fold:

2014-10-06 20:27:09.141 [info] <0.31859.0>@riak_repl_aae_source:finish_sending_differences:453 No Bloom folding over 0/0 differences for partition 548063113999088594326381812268606132370974703616 with EstimatedNrKeys 3497

Note: many of these message were the same/similar before this pull-request. So, this is more an open question than something specific to this pull-request itself: should we evaluate the signal/noise/usefulness factor of the AAE log messages before we ship 2.0.2?

krestenkrab · 2014-10-07T10:01:39Z

Joe, you're amazing. Nice work.

I see there are other commits happening and we're moving to so-called 0-based partition names; would be nice to have "partition 3/16" rather than "182687704666362864775460604089535377456991567872" printed, eh?

lixen · 2014-10-14T09:11:53Z

First:
Second: I had a look and can't find any issues. But I guess I might not be the best reviewer of this.
Third: I think the benefit of not creating unnecessary bloom could cover the cost of iterating 1%
Finally: I agree, there is a lot of noise in the logs that only are useful when debugging.

krestenkrab · 2014-10-16T11:12:39Z

Some of those logs are used in the riak_test @rsltrifork and I wrote a while back as far as I remember.

lixen · 2014-10-16T11:17:34Z

ok I will verify that they still work or update them.

Keep needed for testing as debug.

engelsanchez · 2014-12-08T18:59:49Z

@bsparrow435 Could you or another repl battle hardened CSE take a look at the logging that is being silenced by this PR?

lordnull · 2014-12-08T19:37:15Z

If this pr requires the changes in basho/riak_kv#1030, we should wait until that pr is merged, then merge this one, rather than change the rebar.config.

lukebakken · 2014-12-09T18:40:33Z

Could this be switched to lager:debug like the other messages instead of being removed outright?

OK to just go with this?
2014-10-16 11:26:53.304 [info] <0.1531.0>@riak_repl2_fssource_sup:enable:15 Starting replication fullsync source for 1278813932664540053428224228626747642198940975104 from 'dev1@127.0.0.1' to {{127,0,0,1},10026}

Going with the log entry from riak_repl2_fssource_sup sounds good to me.

bearcage · 2014-12-09T19:19:59Z

Apologies for the pedantry; cliserv's needs from the logs basically amount to being able to reconstruct a narrative of which partition began when, and ended when (and ended in what way; cancelled, succeeded, or the autogenerated error_logger entry). IMO anyway.

lixen · 2014-12-10T12:02:51Z

@aberghage Thanks for the comments important to not have black holes in the logs while troubleshooting .

engelsanchez · 2014-12-12T22:26:12Z

Here we are sending ourselves a not_responsible message. When it reaches us, we stop. I don't see the point of postponing that. Why not recode this so that it immediately bails? Just move it to a different function with the appropriate clauses. Or is there a subtle reason why we want to do this now?

Remove loop so we can receive cancel_fullsync during update of remote trees.

engelsanchez · 2014-12-15T19:37:36Z

Just for future reference: these changes fix a problem where using AAE FS with a cluster that could not understand the legacy binary format wouldn't work, as it was hard coded to send the newest format.

engelsanchez · 2014-12-15T21:12:31Z

I'm finished with the review of the existing changes. As soon as the commont about the not_responsible message in the update_trees state is addressed, I'll start running it through the tests.

engelsanchez · 2014-12-19T14:49:56Z

👍 6bdc099
Testing this was quite hard due to the many failures the repl builds are having. Finally all passed and made sense if tested with this branch (now merged to master) basho/riak_test#714

engelsanchez · 2014-12-19T14:50:35Z

@borshop merge

Improve AAE fullsync by estimating number of keys Reviewed-by: engelsanchez

jtuple and others added 12 commits September 22, 2014 12:44

Remove non-pipelined AAE fullsync mode

e444a03

Merge pull request #618 from branch 'feature/fssink-pool-2.0'

e7a2f09

Reviewed-by: jtuple

Merge pull request #619 from branch 'feature/pipelined-aae-fullsync-u…

916be13

…x-2.0' Reviewed-by: jtuple

Merge pull request #620 from basho/feature/pipelined-aae-fullsync-2.0

86faa77

Improved AAE fullsync integration/2.0 pull request Reviewed-by: jonmeredith

Merge branch 'miklix_estimate_keys-2.0' into feature/aae-fullsync-est…

d1c4130

…imate-keys Conflicts: src/riak_repl_aae_source.erl

Remove unused riak_repl_aae_source:replicate_diff/3

779c1ec

This was referenced Oct 7, 2014

Implement key count estimation via AAE trees basho/riak_kv#1030

Merged

Implement key count estimation via AAE trees basho/riak_core#633

Merged

jtuple self-assigned this Oct 7, 2014

Remove redundant and unnecessary logging

83dd192

Mikael Lixenstrand added 2 commits October 16, 2014 17:18

Revert some of "Remove redundant and unnecessary logging"

e43257a

Keep needed for testing as debug.

Fix xref and dialyzer

67b9535

lordnull reviewed Dec 8, 2014
View reviewed changes

Revert riak_kv deps changes

38b7b4b

lukebakken reviewed Dec 9, 2014
View reviewed changes

Update logging after review

ce0c1b4

engelsanchez reviewed Dec 12, 2014
View reviewed changes

Cancel directly on not_responsible from remote cluster

da47f0a

Remove loop so we can receive cancel_fullsync during update of remote trees.

engelsanchez reviewed Dec 15, 2014
View reviewed changes

handle not_responsible for local partitions

6bdc099

borshop added a commit that referenced this pull request Dec 19, 2014

Merge pull request #623 from basho/feature/aae-fullsync-estimate-keys

df42bcf

Improve AAE fullsync by estimating number of keys Reviewed-by: engelsanchez

borshop merged commit 6bdc099 into 2.0 Dec 19, 2014

lixen deleted the feature/aae-fullsync-estimate-keys branch January 8, 2015 10:30

jtuple unassigned jtuple Apr 25, 2015

Conversation

jtuple commented Oct 6, 2014

Uh oh!

jtuple commented Oct 7, 2014

Uh oh!

krestenkrab commented Oct 7, 2014

Uh oh!

lixen commented Oct 14, 2014

Uh oh!

krestenkrab commented Oct 16, 2014

Uh oh!

lixen commented Oct 16, 2014

Uh oh!

engelsanchez commented Dec 8, 2014

Uh oh!

lordnull Dec 8, 2014

Choose a reason for hiding this comment

Uh oh!

lukebakken Dec 9, 2014

Choose a reason for hiding this comment

Uh oh!

lixen Dec 10, 2014

Choose a reason for hiding this comment

Uh oh!

engelsanchez Dec 15, 2014

Choose a reason for hiding this comment

Uh oh!

bearcage commented Dec 9, 2014

Uh oh!

lixen commented Dec 10, 2014

Uh oh!

engelsanchez Dec 12, 2014

Choose a reason for hiding this comment

Uh oh!

engelsanchez Dec 15, 2014

Choose a reason for hiding this comment

Uh oh!

engelsanchez commented Dec 15, 2014

Uh oh!

engelsanchez commented Dec 19, 2014

Uh oh!

engelsanchez commented Dec 19, 2014

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

8 participants