2.0 port of AAE transient FS failures #640
Merged
Problem: transient AAE failures, such as trees not yet built or locks not
being acquired, would cause an AAE fullsync process to exit abnormally. This
could happen several times in a row, creating log spam.
Resolution: the concept of soft_exit. A soft_exit is a message sent from a
process that is about to exit to a soft_linked process. The exiting process
then exits normally, while any soft_linked process can handle the soft_exit
message in much the same way as an exit message. This indicates an exit
reason that should be handled, but is not serious enough for the system
logger to know about.
The soft_exit message sent from the aae worker to the fscoordinator is
as simple as `{soft_exit, pid(), term()}'.
The current implementation is not generic. There can be only one soft_link to
the AAE, and there's no general mechanism to use soft_links or soft_exits
elsewhere in the code base. Sorry.
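As a rough illustration of the soft_exit idea described above, here is a minimal, self-contained sketch. It is not the code from this PR: the module name, `worker/1`, `run/0`, the `tree_not_built` reason, and the use of a monitor in place of the real soft_link bookkeeping are all assumptions made for the example. Only the `{soft_exit, Pid, Reason}` message shape comes from the description.

```erlang
-module(soft_exit_sketch).
-export([run/0]).

%% Hypothetical worker: it hits a transient failure, reports it to its
%% owner as a soft_exit message, then exits with reason `normal` so no
%% crash report is logged.
worker(Owner) ->
    Owner ! {soft_exit, self(), {error, tree_not_built}},
    exit(normal).

%% Hypothetical owner (standing in for the fscoordinator): it treats the
%% soft_exit like a handled exit and decides to retry.
run() ->
    Owner = self(),
    {Pid, MonRef} = spawn_monitor(fun() -> worker(Owner) end),
    receive
        {soft_exit, Pid, Reason} ->
            %% The worker still terminates, but with reason `normal`.
            receive
                {'DOWN', MonRef, process, Pid, normal} -> ok
            end,
            {retry, Reason}
    after 5000 ->
        timeout
    end.
```

Calling `soft_exit_sketch:run()` should return `{retry, {error, tree_not_built}}`. The point of the design is that the failure reason travels in an ordinary message, so the exit itself stays `normal` and quiet.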
Another change rolled into this is consistent use of a #partition_info record
in the fscoordinator, and error tracking in the fscoordinator's state. Swapping
to a single data structure for the partition queue, whereis waiting list,
and purgatory queues makes the fscoordinator easier to understand (as
there is less code to modify structures).
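To show why one record shared by all queues reduces conversion code, here is a small sketch. The field names (`index`, `node`, `retries`) and the `requeue/2` helper are hypothetical; the actual #partition_info definition in the fscoordinator may differ.

```erlang
-module(partition_info_sketch).
-export([demo/0]).

%% Hypothetical fields; the real record may carry different ones.
-record(partition_info, {index, node = undefined, retries = 0}).

%% Because every queue holds the same record, moving an entry between
%% queues only updates a field; no tuple reshaping is needed.
requeue(#partition_info{retries = R} = P, Queue) ->
    queue:in(P#partition_info{retries = R + 1}, Queue).

demo() ->
    P = #partition_info{index = 0},
    Purgatory = requeue(P, queue:new()),
    {{value, #partition_info{index = I, retries = N}}, _Rest} =
        queue:out(Purgatory),
    {I, N}.
```

Here `partition_info_sketch:demo()` returns `{0, 1}`: the same record went into and came out of the queue, with only the retry count changed.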
This is a forward port of the fix done for 1.4. Conflicts favor existing code
where it does not directly affect the fix.
Conflicts:
Makefile
rebar.config
src/riak_repl2_fssource.erl
src/riak_repl2_rtq_proxy.erl
src/riak_repl_aae_source.erl
test/riak_core_cluster_mgr_tests.erl
increment_error_dict expects the partition, the element N of the error dict, and the state. It pulls the dict out of the state and puts it back in place, so it returns just the state. The call that passed the dict in was therefore wrong.
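The calling convention described above can be sketched as follows. This is an assumed shape, not the function from the PR: the `#state{}` record, the `{RetryExits, ErrorExits}` tuple stored per partition, and the meaning of each element index are hypothetical.

```erlang
-module(error_dict_sketch).
-export([demo/0]).

%% Hypothetical coordinator state holding the error dict.
-record(state, {error_dict = dict:new()}).

%% Takes the partition, the tuple element to bump, and the whole state.
%% The dict is read from the state and stored back, so the caller gets
%% a state back and must not pass the dict itself.
increment_error_dict(Partition, ElementN, #state{error_dict = Dict} = State) ->
    Old = case dict:find(Partition, Dict) of
              {ok, Counts} -> Counts;
              error -> {0, 0}   % assumed {RetryExits, ErrorExits}
          end,
    New = setelement(ElementN, Old, element(ElementN, Old) + 1),
    State#state{error_dict = dict:store(Partition, New, Dict)}.

demo() ->
    S1 = increment_error_dict(42, 1, #state{}),
    S2 = increment_error_dict(42, 2, S1),
    dict:fetch(42, S2#state.error_dict).
```

With these assumptions, `error_dict_sketch:demo()` returns `{1, 1}`. Passing the dict as the third argument would fail the `#state{}` pattern match, which is the kind of mistake the commit message describes.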
When a partition is not available, perhaps after a number of retries, the error exits stat should be incremented. Also, the retry exits stat should be incremented on each retry. This was discovered when backporting the repl_location_failures riak_test.
The one in riak_repl2_fssource is a legit bug in the code
+1 5479089
@borshop merge
borshop added a commit that referenced this pull request Dec 8, 2014
…ilures-2.0 2.0 port of AAE transient FS failures Reviewed-by: lordnull
2.0 version of PR #636