Skip to content

Improved, pipelined AAE fullsync#616

Closed
jtuple wants to merge 10 commits into
developfrom
feature/pipelined-aae-fullsync
Closed

Improved, pipelined AAE fullsync#616
jtuple wants to merge 10 commits into
developfrom
feature/pipelined-aae-fullsync

Conversation

@jtuple
Copy link
Copy Markdown
Contributor

@jtuple jtuple commented Sep 18, 2014

This pull-request includes multiple improves to AAE fullsync, including:

  1. Making AAE exchange pipelined. This is the most significant change, since non-pipelined AAE is sensitive to network latency. And in the worst case was shown to take weeks to complete, even for a trivial exchange. The new pipelined approach is designed to handle high latency links.
  2. Ensuring AAE fullsync only does at most 1 bloom fold per partition. Previously, the implementation did a bloom fold per AAE tree per partition.
  3. Directly sending up to a configurable threshold of key differences, thus avoiding the bloom fold entirely when clusters are mostly in sync.
  4. Using the internal n_val=1 get option when retrieving objects to send. This sends only a single request to a single vnode, therefore reducing cluster load during a fullsync.

Preliminary test result can be found here. More testing is underway/planned.

Sibling pull-requests: basho/riak_core#628, basho/riak_kv#1023

jtuple and others added 9 commits September 17, 2014 22:28
The current AAE fullsync exchange was based on the exchange FSM used by
Riak for local cluster AAE. However, this was a bad design decision as
local AAE was never designed to tolerate high latency links.

In the worst case, this design can require millions of sequential
roundtrip messages. This is not an issue for a fast LAN, but is
impractical for WANs with 10-200ms RTTs.

This commit changes the AAE fullsync exchange to use a pipelined approach.
Rather than performing a synchronous request for each bucket, the exchange
determines all differences for a given level of the AAE tree and then
requests the data for all differing buckets upfront.

Ideally, this would supported by a new streaming protocol. However to
retain protocol compatibility the pipelined approach sends all the
known bucket requests upfront and relies upon TCP ordering to deliver
the responses in the order requested.

Note: this commit relies upon changes made to riak_core and riak_kv to
extend the AAE subsystem to support the new level-by-level exchange that
is utilized by this new fullsync approach.
Previous code created a bloom filter for each
IndexN, populating it with diffs, and then did
a backend fold for that vnode sending objects
along the way.

The code now

1. creates the bloom filter in init()
2. runs key exchange for each IndexN
3. then does a vnode fold to find objects in the
   bloom filter.
This change introduces a threshold for how
to process fullsync diffs.  A small number of
changed KVs (currently fixed at 1000 per vnode)
are read using GET; above that limit, objects
are read using a backend fold.

For small number og changed KVs, this avoids
doing a full scan of the backend data.
This commit slightly refactors the riak_repl_aae_source code to
make state transitions more obvious.

This change also eases the porting of commits from a separate feature
branch that implemented a slightly different approach to improved AAE
syncing.
This commit merges in the separately prototyped buffer-then-send
approach to directly sending key differences. The mode used is
determined by the `fullsync_direct_mode` application variable,
which defaults to `inline`.

* inline
  differences selected for direct repair are sent as they are encountered
  during an AAE exchange.

* buffered
  differences selected for direct repair are buffered in memory during an
  exchange. After the exchange, the buffer is sorted and keys are then sent
  in order.
A deadlock in AAE fullsync was identified during testing.

The deadlock occurs when both the AAE fullsync source and sink
processes block on TCP send due to TCP backpressure. Since both
processes are blocked, neither can perform a receive to unblock
the other.

This commit addresses the issue by making the AAE sink use a helper
process for sending, therefore ensuring the sink never blocks on a
send. The sink will therefore always get around to receiving data at
some point, unblocking the source if needed.

NOTE: the SSL transport wraps the underlying socket in a gen_server, and
all socket functions route through this server. As such, the SSL transport
is still vulnerable to this deadlock. Fixing this requires either changing
the AAE protocol or patching the OTP SSL implementation. One of the two is
planned future work.
@krestenkrab
Copy link
Copy Markdown
Contributor

Maybe we should use riak_kv_vnode:local_get(PartitionIndex,BKey) rather than Client:get(..., [{n_val,1}.... This would even further reduce the cluster load, as we don't risk going over the network to get the value. We are, after all, running in the local vnode that is supposed to have the value. The only obvious downside I can see is that local_getdoesn't let you specify a timeout.

@krestenkrab
Copy link
Copy Markdown
Contributor

I'd like to propose that we rename fullsync_direct to be fullsync_direct_limit. In our next round of changes we have a sibling to this config option which is the cutoff percentage (of #keys in partition); and that one I'd like to call fullsync_direct_percentage_limit.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

buffer can also be undefined.

@jonmeredith
Copy link
Copy Markdown
Contributor

Couldn't work out how to inline respond to Kresten's comment, but yes, I think it does need to use riak_kv_vnode:local_get/2 as there are no guarantees you will hit the local vnode if you use a reduced n_val get (which is probably something we should also fix for CS).

@jonmeredith
Copy link
Copy Markdown
Contributor

Unrelated to the AAE changes, the mechanics for changing the controlling process are totally unnecessary for the Remote function defined in key_exchange/2 as there are no messages delivered by an {active, false} socket.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you get here and you have a bloom filter, you do the direct gets (at the moment using get FSMs) for all of the entries. Is there a good reason not to detect you will need to bloom fold anyway and just set the elements?

@jonmeredith
Copy link
Copy Markdown
Contributor

Spoke with @jtuple and @matthewvon on the path to merging. This is relatively minimal change PR to a point release to address a significant pain point, so I'm accepting that we won't clean up some things I would like to.

Code review looks logically good. The only major thing that needs to be fixed is the switch to riak_kv_vnode:local_get/2. The reporting improvements and general refactoring can be done on the 2.1 schedule. Depending on how much improvement is needed to the protocol the module may be significantly refactored anyway. Joe said the AAE estimation branch moved the bloom filter creation, so if we get that merged it will take care of that.

Will approve once I've run some of the r_t AAE tests.

When performing direct sends, the AAE fullsync source retrieves
individual objects in order to send them. Previously, this was
accomplished using normal riak_client gets using the n_val=1
option. However, there are no guarantees about which replica
the object would come from using such an approach.

This commit changes the AAE fullsync source to instead retrieve
objects directly from the vnode performing the fullsync. This
ensures that the object sent aligns with the AAE hash used to
determine differences. This approach may also have better
performance.
@jonmeredith
Copy link
Copy Markdown
Contributor

Compared riak_test output to 2.0.0 and get similar results for the intermittent cases. A little work to do there, but ok.

Like the other pull requests, this is targeted at develop, not 2.0.0

Plus one to merge once target is fixed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants