Improved, pipelined AAE fullsync by jtuple · Pull Request #616 · basho/riak_repl

jtuple · 2014-09-18T06:40:56Z

This pull-request includes multiple improves to AAE fullsync, including:

Making AAE exchange pipelined. This is the most significant change, since non-pipelined AAE is sensitive to network latency. And in the worst case was shown to take weeks to complete, even for a trivial exchange. The new pipelined approach is designed to handle high latency links.
Ensuring AAE fullsync only does at most 1 bloom fold per partition. Previously, the implementation did a bloom fold per AAE tree per partition.
Directly sending up to a configurable threshold of key differences, thus avoiding the bloom fold entirely when clusters are mostly in sync.
Using the internal n_val=1 get option when retrieving objects to send. This sends only a single request to a single vnode, therefore reducing cluster load during a fullsync.

Preliminary test result can be found here. More testing is underway/planned.

Sibling pull-requests: basho/riak_core#628, basho/riak_kv#1023

The current AAE fullsync exchange was based on the exchange FSM used by Riak for local cluster AAE. However, this was a bad design decision as local AAE was never designed to tolerate high latency links. In the worst case, this design can require millions of sequential roundtrip messages. This is not an issue for a fast LAN, but is impractical for WANs with 10-200ms RTTs. This commit changes the AAE fullsync exchange to use a pipelined approach. Rather than performing a synchronous request for each bucket, the exchange determines all differences for a given level of the AAE tree and then requests the data for all differing buckets upfront. Ideally, this would supported by a new streaming protocol. However to retain protocol compatibility the pipelined approach sends all the known bucket requests upfront and relies upon TCP ordering to deliver the responses in the order requested. Note: this commit relies upon changes made to riak_core and riak_kv to extend the AAE subsystem to support the new level-by-level exchange that is utilized by this new fullsync approach.

Previous code created a bloom filter for each IndexN, populating it with diffs, and then did a backend fold for that vnode sending objects along the way. The code now 1. creates the bloom filter in init() 2. runs key exchange for each IndexN 3. then does a vnode fold to find objects in the bloom filter.

This change introduces a threshold for how to process fullsync diffs. A small number of changed KVs (currently fixed at 1000 per vnode) are read using GET; above that limit, objects are read using a backend fold. For small number og changed KVs, this avoids doing a full scan of the backend data.

This commit slightly refactors the riak_repl_aae_source code to make state transitions more obvious. This change also eases the porting of commits from a separate feature branch that implemented a slightly different approach to improved AAE syncing.

This commit merges in the separately prototyped buffer-then-send approach to directly sending key differences. The mode used is determined by the `fullsync_direct_mode` application variable, which defaults to `inline`. * inline differences selected for direct repair are sent as they are encountered during an AAE exchange. * buffered differences selected for direct repair are buffered in memory during an exchange. After the exchange, the buffer is sorted and keys are then sent in order.

A deadlock in AAE fullsync was identified during testing. The deadlock occurs when both the AAE fullsync source and sink processes block on TCP send due to TCP backpressure. Since both processes are blocked, neither can perform a receive to unblock the other. This commit addresses the issue by making the AAE sink use a helper process for sending, therefore ensuring the sink never blocks on a send. The sink will therefore always get around to receiving data at some point, unblocking the source if needed. NOTE: the SSL transport wraps the underlying socket in a gen_server, and all socket functions route through this server. As such, the SSL transport is still vulnerable to this deadlock. Fixing this requires either changing the AAE protocol or patching the OTP SSL implementation. One of the two is planned future work.

krestenkrab · 2014-09-18T07:57:45Z

Maybe we should use riak_kv_vnode:local_get(PartitionIndex,BKey) rather than Client:get(..., [{n_val,1}.... This would even further reduce the cluster load, as we don't risk going over the network to get the value. We are, after all, running in the local vnode that is supposed to have the value. The only obvious downside I can see is that local_getdoesn't let you specify a timeout.

krestenkrab · 2014-09-18T11:57:14Z

I'd like to propose that we rename fullsync_direct to be fullsync_direct_limit. In our next round of changes we have a sibling to this config option which is the cutoff percentage (of #keys in partition); and that one I'd like to call fullsync_direct_percentage_limit.

jonmeredith · 2014-09-18T21:50:34Z

buffer can also be undefined.

jonmeredith · 2014-09-19T14:23:39Z

Couldn't work out how to inline respond to Kresten's comment, but yes, I think it does need to use riak_kv_vnode:local_get/2 as there are no guarantees you will hit the local vnode if you use a reduced n_val get (which is probably something we should also fix for CS).

jonmeredith · 2014-09-19T14:39:12Z

Unrelated to the AAE changes, the mechanics for changing the controlling process are totally unnecessary for the Remote function defined in key_exchange/2 as there are no messages delivered by an {active, false} socket.

jonmeredith · 2014-09-19T14:49:56Z

If you get here and you have a bloom filter, you do the direct gets (at the moment using get FSMs) for all of the entries. Is there a good reason not to detect you will need to bloom fold anyway and just set the elements?

jonmeredith · 2014-09-19T17:06:18Z

Spoke with @jtuple and @matthewvon on the path to merging. This is relatively minimal change PR to a point release to address a significant pain point, so I'm accepting that we won't clean up some things I would like to.

Code review looks logically good. The only major thing that needs to be fixed is the switch to riak_kv_vnode:local_get/2. The reporting improvements and general refactoring can be done on the 2.1 schedule. Depending on how much improvement is needed to the protocol the module may be significantly refactored anyway. Joe said the AAE estimation branch moved the bloom filter creation, so if we get that merged it will take care of that.

Will approve once I've run some of the r_t AAE tests.

When performing direct sends, the AAE fullsync source retrieves individual objects in order to send them. Previously, this was accomplished using normal riak_client gets using the n_val=1 option. However, there are no guarantees about which replica the object would come from using such an approach. This commit changes the AAE fullsync source to instead retrieve objects directly from the vnode performing the fullsync. This ensures that the object sent aligns with the AAE hash used to determine differences. This approach may also have better performance.

jonmeredith · 2014-09-22T18:41:05Z

Compared riak_test output to 2.0.0 and get similar results for the intermittent cases. A little work to do there, but ok.

Like the other pull requests, this is targeted at develop, not 2.0.0

Plus one to merge once target is fixed.

…x-2.0' Reviewed-by: jtuple

jtuple and others added 9 commits September 17, 2014 22:28

Use N=1 for reading values for replication

5bf71c2

Make fullsync direct send threshold configurable

32316fa

Remove non-pipelined AAE fullsync mode

4109e75

This was referenced Sep 18, 2014

Update to use new breadth-first AAE exchange basho/riak_kv#1023

Closed

Add breadth-first AAE exchange basho/riak_core#628

Closed

jtuple assigned jonmeredith Sep 18, 2014

jonmeredith reviewed Sep 18, 2014
View reviewed changes

Comment thread src/riak_repl_aae_source.erl

Copy link
Copy Markdown

Contributor

jonmeredith Sep 18, 2014

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

buffer can also be undefined.

jonmeredith reviewed Sep 19, 2014
View reviewed changes

jonmeredith mentioned this pull request Sep 22, 2014

Small user experience fixes. #619

Closed

jtuple mentioned this pull request Sep 22, 2014

Improved AAE fullsync integration/2.0 pull request #620

Merged

jtuple referenced this pull request Sep 22, 2014

Merge pull request #619 from branch 'feature/pipelined-aae-fullsync-u…

916be13

…x-2.0' Reviewed-by: jtuple

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improved, pipelined AAE fullsync#616

Improved, pipelined AAE fullsync#616
jtuple wants to merge 10 commits into
developfrom
feature/pipelined-aae-fullsync

jtuple commented Sep 18, 2014

Uh oh!

krestenkrab commented Sep 18, 2014

Uh oh!

krestenkrab commented Sep 18, 2014

Uh oh!

jonmeredith Sep 18, 2014

Uh oh!

jonmeredith commented Sep 19, 2014

Uh oh!

jonmeredith commented Sep 19, 2014

Uh oh!

jonmeredith Sep 19, 2014

Uh oh!

jonmeredith commented Sep 19, 2014

Uh oh!

jonmeredith commented Sep 22, 2014

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

jtuple commented Sep 18, 2014

Uh oh!

krestenkrab commented Sep 18, 2014

Uh oh!

krestenkrab commented Sep 18, 2014

Uh oh!

jonmeredith Sep 18, 2014

Choose a reason for hiding this comment

Uh oh!

jonmeredith commented Sep 19, 2014

Uh oh!

jonmeredith commented Sep 19, 2014

Uh oh!

jonmeredith Sep 19, 2014

Choose a reason for hiding this comment

Uh oh!

jonmeredith commented Sep 19, 2014

Uh oh!

jonmeredith commented Sep 22, 2014

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants