Improved, pipelined AAE fullsync#616
Conversation
The current AAE fullsync exchange was based on the exchange FSM used by Riak for local cluster AAE. However, this was a bad design decision as local AAE was never designed to tolerate high latency links. In the worst case, this design can require millions of sequential roundtrip messages. This is not an issue for a fast LAN, but is impractical for WANs with 10-200ms RTTs. This commit changes the AAE fullsync exchange to use a pipelined approach. Rather than performing a synchronous request for each bucket, the exchange determines all differences for a given level of the AAE tree and then requests the data for all differing buckets upfront. Ideally, this would supported by a new streaming protocol. However to retain protocol compatibility the pipelined approach sends all the known bucket requests upfront and relies upon TCP ordering to deliver the responses in the order requested. Note: this commit relies upon changes made to riak_core and riak_kv to extend the AAE subsystem to support the new level-by-level exchange that is utilized by this new fullsync approach.
Previous code created a bloom filter for each IndexN, populating it with diffs, and then did a backend fold for that vnode sending objects along the way. The code now 1. creates the bloom filter in init() 2. runs key exchange for each IndexN 3. then does a vnode fold to find objects in the bloom filter.
This change introduces a threshold for how to process fullsync diffs. A small number of changed KVs (currently fixed at 1000 per vnode) are read using GET; above that limit, objects are read using a backend fold. For small number og changed KVs, this avoids doing a full scan of the backend data.
This commit slightly refactors the riak_repl_aae_source code to make state transitions more obvious. This change also eases the porting of commits from a separate feature branch that implemented a slightly different approach to improved AAE syncing.
This commit merges in the separately prototyped buffer-then-send approach to directly sending key differences. The mode used is determined by the `fullsync_direct_mode` application variable, which defaults to `inline`. * inline differences selected for direct repair are sent as they are encountered during an AAE exchange. * buffered differences selected for direct repair are buffered in memory during an exchange. After the exchange, the buffer is sorted and keys are then sent in order.
A deadlock in AAE fullsync was identified during testing. The deadlock occurs when both the AAE fullsync source and sink processes block on TCP send due to TCP backpressure. Since both processes are blocked, neither can perform a receive to unblock the other. This commit addresses the issue by making the AAE sink use a helper process for sending, therefore ensuring the sink never blocks on a send. The sink will therefore always get around to receiving data at some point, unblocking the source if needed. NOTE: the SSL transport wraps the underlying socket in a gen_server, and all socket functions route through this server. As such, the SSL transport is still vulnerable to this deadlock. Fixing this requires either changing the AAE protocol or patching the OTP SSL implementation. One of the two is planned future work.
|
Maybe we should use |
|
I'd like to propose that we rename |
There was a problem hiding this comment.
buffer can also be undefined.
|
Couldn't work out how to inline respond to Kresten's comment, but yes, I think it does need to use |
|
Unrelated to the AAE changes, the mechanics for changing the controlling process are totally unnecessary for the |
There was a problem hiding this comment.
If you get here and you have a bloom filter, you do the direct gets (at the moment using get FSMs) for all of the entries. Is there a good reason not to detect you will need to bloom fold anyway and just set the elements?
|
Spoke with @jtuple and @matthewvon on the path to merging. This is relatively minimal change PR to a point release to address a significant pain point, so I'm accepting that we won't clean up some things I would like to. Code review looks logically good. The only major thing that needs to be fixed is the switch to Will approve once I've run some of the r_t AAE tests. |
When performing direct sends, the AAE fullsync source retrieves individual objects in order to send them. Previously, this was accomplished using normal riak_client gets using the n_val=1 option. However, there are no guarantees about which replica the object would come from using such an approach. This commit changes the AAE fullsync source to instead retrieve objects directly from the vnode performing the fullsync. This ensures that the object sent aligns with the AAE hash used to determine differences. This approach may also have better performance.
|
Compared riak_test output to 2.0.0 and get similar results for the intermittent cases. A little work to do there, but ok. Like the other pull requests, this is targeted at develop, not 2.0.0 Plus one to merge once target is fixed. |
This pull-request includes multiple improves to AAE fullsync, including:
n_val=1get option when retrieving objects to send. This sends only a single request to a single vnode, therefore reducing cluster load during a fullsync.Preliminary test result can be found here. More testing is underway/planned.
Sibling pull-requests: basho/riak_core#628, basho/riak_kv#1023