Improve AAE fullsync by estimating number of keys#623
Conversation
This commit slightly refactors the riak_repl_aae_source code to make state transitions more obvious. This change also eases the porting of commits from a separate feature branch that implemented a slightly different approach to improved AAE syncing.
This commit merges in the separately prototyped buffer-then-send approach to directly sending key differences. The mode used is determined by the `fullsync_direct_mode` application variable, which defaults to `inline`. * inline differences selected for direct repair are sent as they are encountered during an AAE exchange. * buffered differences selected for direct repair are buffered in memory during an exchange. After the exchange, the buffer is sorted and keys are then sent in order.
A deadlock in AAE fullsync was identified during testing. The deadlock occurs when both the AAE fullsync source and sink processes block on TCP send due to TCP backpressure. Since both processes are blocked, neither can perform a receive to unblock the other. This commit addresses the issue by making the AAE sink use a helper process for sending, therefore ensuring the sink never blocks on a send. The sink will therefore always get around to receiving data at some point, unblocking the source if needed. NOTE: the SSL transport wraps the underlying socket in a gen_server, and all socket functions route through this server. As such, the SSL transport is still vulnerable to this deadlock. Fixing this requires either changing the AAE protocol or patching the OTP SSL implementation. One of the two is planned future work.
When performing direct sends, the AAE fullsync source retrieves individual objects in order to send them. Previously, this was accomplished using normal riak_client gets using the n_val=1 option. However, there are no guarantees about which replica the object would come from using such an approach. This commit changes the AAE fullsync source to instead retrieve objects directly from the vnode performing the fullsync. This ensures that the object sent aligns with the AAE hash used to determine differences. This approach may also have better performance.
Although keylisting has a pool for issuing gets, both the AAE and the Keylist strategies serialize writing objects from the sink process. This adds a pool with a default size of 100 shared across all fullsync sinks that make the writes async. No difference in safety properties, neither version checks the response from the put FSM.
…source data" msg. Minor changes for user experience, clarified with _limit to permit a future _percentage version of fullsync_direct. Lowered the "Gathering source data" message displayed every minute to a debug message.
Reviewed-by: jtuple
…x-2.0' Reviewed-by: jtuple
Improved AAE fullsync integration/2.0 pull request Reviewed-by: jonmeredith
…imate-keys Conflicts: src/riak_repl_aae_source.erl
|
I've cleaned up and rebased/merged all of the key estimation changes on top of the 2.0 branch, as well as finished up reviewing the code across all 3 repositories. In general, all the code looks good and does what it's supposed to do. Everything continues to pass the AAE replication tests I've been running, as well as the newer tests I've tried out from the A few questions/comments. First, I recall @jonmeredith mentioning something about the protocol negotiation (as stored in the source state) not actually being used. Jon, could you elaborate? More importantly, do you have any comments about the changes made in this pull-request related to protocol negotiation (commits c8ec0d9 and a13c865)? Second, my merge commit in this pull-request is rather substantial, with lots of code massaging. This is the same as the prior merge I did on the Third, is there any benefit to having a toggle to disable key estimation? Could there ever be a scenario where we would want to avoid iterating over 1% of the AAE tree's segments to compute the estimate? Eg. heavily loaded cluster? Finally, are we happy with the log messages in this branch from a user-facing perspective? Example output: And in cases where we don't do the bloom fold: Note: many of these message were the same/similar before this pull-request. So, this is more an open question than something specific to this pull-request itself: should we evaluate the signal/noise/usefulness factor of the AAE log messages before we ship 2.0.2? |
|
Joe, you're amazing. Nice work. I see there are other commits happening and we're moving to so-called 0-based partition names; would be nice to have "partition 3/16" rather than "182687704666362864775460604089535377456991567872" printed, eh? |
|
First: |
|
Some of those logs are used in the riak_test @rsltrifork and I wrote a while back as far as I remember. |
|
ok I will verify that they still work or update them. |
Keep needed for testing as debug.
|
@bsparrow435 Could you or another repl battle hardened CSE take a look at the logging that is being silenced by this PR? |
There was a problem hiding this comment.
If this pr requires the changes in basho/riak_kv#1030, we should wait until that pr is merged, then merge this one, rather than change the rebar.config.
There was a problem hiding this comment.
Could this be switched to lager:debug like the other messages instead of being removed outright?
There was a problem hiding this comment.
OK to just go with this?
2014-10-16 11:26:53.304 [info] <0.1531.0>@riak_repl2_fssource_sup:enable:15 Starting replication fullsync source for 1278813932664540053428224228626747642198940975104 from 'dev1@127.0.0.1' to {{127,0,0,1},10026}
There was a problem hiding this comment.
Going with the log entry from riak_repl2_fssource_sup sounds good to me.
|
Apologies for the pedantry; cliserv's needs from the logs basically amount to being able to reconstruct a narrative of which partition began when, and ended when (and ended in what way; cancelled, succeeded, or the autogenerated error_logger entry). IMO anyway. |
|
@aberghage Thanks for the comments important to not have black holes in the logs while troubleshooting . |
There was a problem hiding this comment.
Here we are sending ourselves a not_responsible message. When it reaches us, we stop. I don't see the point of postponing that. Why not recode this so that it immediately bails? Just move it to a different function with the appropriate clauses. Or is there a subtle reason why we want to do this now?
Remove loop so we can receive cancel_fullsync during update of remote trees.
There was a problem hiding this comment.
Just for future reference: these changes fix a problem where using AAE FS with a cluster that could not understand the legacy binary format wouldn't work, as it was hard coded to send the newest format.
|
I'm finished with the review of the existing changes. As soon as the commont about the not_responsible message in the update_trees state is addressed, I'll start running it through the tests. |
|
👍 6bdc099 |
|
@borshop merge |
Improve AAE fullsync by estimating number of keys Reviewed-by: engelsanchez
This pull-request extends AAE fullsync to sample the AAE trees to estimate the number of keys in a given partition.
This estimate is used to properly size the bloom filter, as well as enable a new percentage-based direct send threshold (eg. direct send up to 10% of differing keys).
To support AAE-based key estimation, this pull-request changes the fullsync logic to update all AAE trees before proceeding to the exchange phase. This change is necessary because all trees must be updated and sampled to calculate a correct estimate.
This pull-request also delays the creation of the bloom filter until it is needed. Fullsyncs that manage to send all differences directly will therefore avoid creating a bloom filter.
This work was performed by @lixen, @rsltrifork, and @krestenkrab. I rebased/squashed commits as necessary to make things apply cleanly on top of the current 2.0 branch, as this work was completed in parallel on top of older changes that were accidentally made against the develop branch.
Note: several commits were squashed together/re-ordered to make porting to the 2.0 branch easier. The original branch had numerous bi-directional merges that made things difficult to port. For reference, the original commits can be found in the jdb-miklix_estimate_keys branch.
Sibling pull-requests: basho/riak_core#633, basho/riak_kv#1030