Skip to content

Improve AAE fullsync by estimating number of keys#623

Merged
borshop merged 19 commits into
2.0from
feature/aae-fullsync-estimate-keys
Dec 19, 2014
Merged

Improve AAE fullsync by estimating number of keys#623
borshop merged 19 commits into
2.0from
feature/aae-fullsync-estimate-keys

Conversation

@jtuple
Copy link
Copy Markdown
Contributor

@jtuple jtuple commented Oct 6, 2014

This pull-request extends AAE fullsync to sample the AAE trees to estimate the number of keys in a given partition.

This estimate is used to properly size the bloom filter, as well as enable a new percentage-based direct send threshold (eg. direct send up to 10% of differing keys).

To support AAE-based key estimation, this pull-request changes the fullsync logic to update all AAE trees before proceeding to the exchange phase. This change is necessary because all trees must be updated and sampled to calculate a correct estimate.

This pull-request also delays the creation of the bloom filter until it is needed. Fullsyncs that manage to send all differences directly will therefore avoid creating a bloom filter.

This work was performed by @lixen, @rsltrifork, and @krestenkrab. I rebased/squashed commits as necessary to make things apply cleanly on top of the current 2.0 branch, as this work was completed in parallel on top of older changes that were accidentally made against the develop branch.

Note: several commits were squashed together/re-ordered to make porting to the 2.0 branch easier. The original branch had numerous bi-directional merges that made things difficult to port. For reference, the original commits can be found in the jdb-miklix_estimate_keys branch.

Sibling pull-requests: basho/riak_core#633, basho/riak_kv#1030

jtuple and others added 12 commits September 22, 2014 12:44
This commit slightly refactors the riak_repl_aae_source code to
make state transitions more obvious.

This change also eases the porting of commits from a separate feature
branch that implemented a slightly different approach to improved AAE
syncing.
This commit merges in the separately prototyped buffer-then-send
approach to directly sending key differences. The mode used is
determined by the `fullsync_direct_mode` application variable,
which defaults to `inline`.

* inline
  differences selected for direct repair are sent as they are encountered
  during an AAE exchange.

* buffered
  differences selected for direct repair are buffered in memory during an
  exchange. After the exchange, the buffer is sorted and keys are then sent
  in order.
A deadlock in AAE fullsync was identified during testing.

The deadlock occurs when both the AAE fullsync source and sink
processes block on TCP send due to TCP backpressure. Since both
processes are blocked, neither can perform a receive to unblock
the other.

This commit addresses the issue by making the AAE sink use a helper
process for sending, therefore ensuring the sink never blocks on a
send. The sink will therefore always get around to receiving data at
some point, unblocking the source if needed.

NOTE: the SSL transport wraps the underlying socket in a gen_server, and
all socket functions route through this server. As such, the SSL transport
is still vulnerable to this deadlock. Fixing this requires either changing
the AAE protocol or patching the OTP SSL implementation. One of the two is
planned future work.
When performing direct sends, the AAE fullsync source retrieves
individual objects in order to send them. Previously, this was
accomplished using normal riak_client gets using the n_val=1
option. However, there are no guarantees about which replica
the object would come from using such an approach.

This commit changes the AAE fullsync source to instead retrieve
objects directly from the vnode performing the fullsync. This
ensures that the object sent aligns with the AAE hash used to
determine differences. This approach may also have better
performance.
Although keylisting has a pool for issuing gets, both the
AAE and the Keylist strategies serialize writing objects
from the sink process.  This adds a pool with a default size
of 100 shared across all fullsync sinks that make the writes async.

No difference in safety properties, neither version checks
the response from the put FSM.
…source data" msg.

Minor changes for user experience, clarified with _limit to permit a future
_percentage version of fullsync_direct.

Lowered the "Gathering source data" message displayed every minute to a
debug message.
Improved AAE fullsync integration/2.0 pull request

Reviewed-by: jonmeredith
…imate-keys

Conflicts:
	src/riak_repl_aae_source.erl
@jtuple
Copy link
Copy Markdown
Contributor Author

jtuple commented Oct 7, 2014

I've cleaned up and rebased/merged all of the key estimation changes on top of the 2.0 branch, as well as finished up reviewing the code across all 3 repositories.

In general, all the code looks good and does what it's supposed to do. Everything continues to pass the AAE replication tests I've been running, as well as the newer tests I've tried out from the krab-aae-fullsync-wip branch of riak_test.

A few questions/comments.

First, I recall @jonmeredith mentioning something about the protocol negotiation (as stored in the source state) not actually being used. Jon, could you elaborate? More importantly, do you have any comments about the changes made in this pull-request related to protocol negotiation (commits c8ec0d9 and a13c865)?

Second, my merge commit in this pull-request is rather substantial, with lots of code massaging. This is the same as the prior merge I did on the jdb-miklix_estimate_keys branch. I've looked over the code many times and believe it's all good, but having another person review my work would be beneficial. I remember someone mentioning on one of our prior meetings that they looked over the previous merge on the jdb-miklix_estimate_keys branch, so perhaps the same person would make sense to double check again on this pull-request.

Third, is there any benefit to having a toggle to disable key estimation? Could there ever be a scenario where we would want to avoid iterating over 1% of the AAE tree's segments to compute the estimate? Eg. heavily loaded cluster?

Finally, are we happy with the log messages in this branch from a user-facing perspective? Example output:

2014-10-06 20:26:59.801 [info] <0.31687.0>@riak_repl_aae_source:init:90 AAE fullsync source worker started for partition 182687704666362864775460604089535377456991567872
2014-10-06 20:26:59.801 [info] <0.31687.0>@riak_repl_aae_source:update_trees:233 Start update for partition,IndexN 182687704666362864775460604089535377456991567872,[{1278813932664540053428224228626747642198940975104,3},{0,3},{182687704666362864775460604089535377456991567872,3}]
2014-10-06 20:27:00.185 [info] <0.31687.0>@riak_repl_aae_source:update_trees:256 EstimatedNrKeys 4097 for partition 182687704666362864775460604089535377456991567872
2014-10-06 20:27:01.019 [info] <0.31696.0>@riak_repl_aae_source:key_exchange:371 Full-sync with site "B"; fullsync difference generator for 182687704666362864775460604089535377456991567872 complete (completed in 0.83 secs)
2014-10-06 20:27:01.822 [info] <0.31718.0>@riak_repl_aae_source:key_exchange:371 Full-sync with site "B"; fullsync difference generator for 182687704666362864775460604089535377456991567872 complete (completed in 0.8 secs)
2014-10-06 20:27:02.645 [info] <0.31729.0>@riak_repl_aae_source:key_exchange:371 Full-sync with site "B"; fullsync difference generator for 182687704666362864775460604089535377456991567872 complete (completed in 0.82 secs)
2014-10-06 20:27:02.647 [info] <0.31687.0>@riak_repl_aae_source:maybe_send_direct:421 Directly sending 1000 differences for partition 182687704666362864775460604089535377456991567872
2014-10-06 20:27:02.861 [info] <0.31687.0>@riak_repl_aae_source:finish_sending_differences:467 Bloom folding over 2690/3690 differences for partition 182687704666362864775460604089535377456991567872 with EstimatedNrKeys 4097

And in cases where we don't do the bloom fold:

2014-10-06 20:27:09.141 [info] <0.31859.0>@riak_repl_aae_source:finish_sending_differences:453 No Bloom folding over 0/0 differences for partition 548063113999088594326381812268606132370974703616 with EstimatedNrKeys 3497

Note: many of these message were the same/similar before this pull-request. So, this is more an open question than something specific to this pull-request itself: should we evaluate the signal/noise/usefulness factor of the AAE log messages before we ship 2.0.2?

@krestenkrab
Copy link
Copy Markdown
Contributor

Joe, you're amazing. Nice work.

I see there are other commits happening and we're moving to so-called 0-based partition names; would be nice to have "partition 3/16" rather than "182687704666362864775460604089535377456991567872" printed, eh?

@lixen
Copy link
Copy Markdown
Contributor

lixen commented Oct 14, 2014

First:
Second: I had a look and can't find any issues. But I guess I might not be the best reviewer of this.
Third: I think the benefit of not creating unnecessary bloom could cover the cost of iterating 1%
Finally: I agree, there is a lot of noise in the logs that only are useful when debugging.

@krestenkrab
Copy link
Copy Markdown
Contributor

Some of those logs are used in the riak_test @rsltrifork and I wrote a while back as far as I remember.

@lixen
Copy link
Copy Markdown
Contributor

lixen commented Oct 16, 2014

ok I will verify that they still work or update them.

Mikael Lixenstrand added 2 commits October 16, 2014 17:18
@engelsanchez
Copy link
Copy Markdown
Contributor

@bsparrow435 Could you or another repl battle hardened CSE take a look at the logging that is being silenced by this PR?

Comment thread rebar.config Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this pr requires the changes in basho/riak_kv#1030, we should wait until that pr is merged, then merge this one, rather than change the rebar.config.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this be switched to lager:debug like the other messages instead of being removed outright?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK to just go with this?
2014-10-16 11:26:53.304 [info] <0.1531.0>@riak_repl2_fssource_sup:enable:15 Starting replication fullsync source for 1278813932664540053428224228626747642198940975104 from 'dev1@127.0.0.1' to {{127,0,0,1},10026}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Going with the log entry from riak_repl2_fssource_sup sounds good to me.

@bearcage
Copy link
Copy Markdown

bearcage commented Dec 9, 2014

Apologies for the pedantry; cliserv's needs from the logs basically amount to being able to reconstruct a narrative of which partition began when, and ended when (and ended in what way; cancelled, succeeded, or the autogenerated error_logger entry). IMO anyway.

@lixen
Copy link
Copy Markdown
Contributor

lixen commented Dec 10, 2014

@aberghage Thanks for the comments important to not have black holes in the logs while troubleshooting .

Comment thread src/riak_repl_aae_source.erl Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we are sending ourselves a not_responsible message. When it reaches us, we stop. I don't see the point of postponing that. Why not recode this so that it immediately bails? Just move it to a different function with the appropriate clauses. Or is there a subtle reason why we want to do this now?

Remove loop so we can receive cancel_fullsync during
update of remote trees.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for future reference: these changes fix a problem where using AAE FS with a cluster that could not understand the legacy binary format wouldn't work, as it was hard coded to send the newest format.

@engelsanchez
Copy link
Copy Markdown
Contributor

I'm finished with the review of the existing changes. As soon as the commont about the not_responsible message in the update_trees state is addressed, I'll start running it through the tests.

@engelsanchez
Copy link
Copy Markdown
Contributor

👍 6bdc099
Testing this was quite hard due to the many failures the repl builds are having. Finally all passed and made sense if tested with this branch (now merged to master) basho/riak_test#714

@engelsanchez
Copy link
Copy Markdown
Contributor

@borshop merge

borshop added a commit that referenced this pull request Dec 19, 2014
Improve AAE fullsync by estimating number of keys

Reviewed-by: engelsanchez
@borshop borshop merged commit 6bdc099 into 2.0 Dec 19, 2014
@lixen lixen deleted the feature/aae-fullsync-estimate-keys branch January 8, 2015 10:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants