2.0 port of AAE transient FS failures #640

Merged
borshop merged 4 commits into 2.0 from feature/chatty-aae-transient-fs-failures-2.0
Dec 8, 2014

@engelsanchez
Contributor

2.0 version of PR #636

lordnull and others added 4 commits December 8, 2014 13:42
Problem: transient failures of aae, such as trees not yet built or locks not
being acquired, would cause an aae fullsync process to exit abnormally. This
could happen several times in a row, creating log spam.

Resolution: the concept of soft_exit. A soft_exit is a message sent from a
soon-to-be-exiting process to a soft_linked process. The exiting process then
exits normally, while any soft_linked process can handle the soft_exit
message in much the same way as an exit message. This indicates an exit
reason that should be handled, but that is not bad enough for the system
logger to know about it.

The soft_exit message sent from the aae worker to the fscoordinator is
as simple as `{soft_exit, pid(), term()}'.
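
A minimal sketch of the soft_exit idea, not the code in this PR. Only the `{soft_exit, pid(), term()}` message shape comes from the commit message. The module name, `worker/2`, `coordinator_wait/2`, the `{done, Pid}` message, and the `not_built` reason are hypothetical names chosen for illustration.

```erlang
-module(soft_exit_sketch).
-export([demo/0]).

%% Hypothetical worker. On a transient failure it tells its soft-linked
%% process why it is stopping, then exits with reason 'normal' so that no
%% crash report reaches the logger.
worker(SoftLink, Outcome) ->
    case Outcome of
        ok ->
            SoftLink ! {done, self()};
        {transient, Reason} ->
            %% Message shape taken from the commit message.
            SoftLink ! {soft_exit, self(), Reason}
    end,
    exit(normal).

%% Hypothetical coordinator side. A soft_exit is treated like an exit that
%% should be handled (here: reported as a retry), while a 'DOWN' message
%% without a preceding soft_exit is still reported as a crash.
coordinator_wait(Pid, MRef) ->
    receive
        {soft_exit, Pid, Reason} ->
            erlang:demonitor(MRef, [flush]),
            {retry, Reason};
        {done, Pid} ->
            erlang:demonitor(MRef, [flush]),
            done;
        {'DOWN', MRef, process, Pid, Reason} ->
            {crashed, Reason}
    after 5000 ->
        timeout
    end.

%% Spawn a worker that fails transiently and check that the coordinator
%% sees a retry rather than a crash.
demo() ->
    Self = self(),
    {Pid, MRef} = spawn_monitor(fun() -> worker(Self, {transient, not_built}) end),
    {retry, not_built} = coordinator_wait(Pid, MRef),
    ok.
```

Calling `soft_exit_sketch:demo()` from a shell exercises the transient-failure path. The sketch uses a monitor where the real coordinator may rely on links and trapped exits; that choice is an assumption made to keep the example self-contained.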

The current implementation is not generic. There can only be one soft_link to
the aae, and there's no general mechanism to use soft_links or soft_exits
elsewhere in the code base. Sorry.

Another change rolled into this is consistent use of a #partition_info record
in the fscoordinator, and error tracking in the fscoordinator's state. Swapping
to a single data structure in the partition queue, whereis waiting list,
and purgatory queues makes the fscoordinator easier to understand (as
there is less code modifying structures).

This is a forward port of the fix done for 1.4. Conflicts favor existing code
where it does not directly affect the fix.

Conflicts:
	Makefile
	rebar.config
	src/riak_repl2_fssource.erl
	src/riak_repl2_rtq_proxy.erl
	src/riak_repl_aae_source.erl
	test/riak_core_cluster_mgr_tests.erl

increment_error_dict expects the partition, the elementN of the error dict, and
the state. It pulls the dict out of the state and puts it back in place, so it
returns just the state. The call that passed the dict in was therefore wrong.

When a partition is not available, perhaps after a number of retries,
the error exits stat should be incremented. Also, the retry exits stat
should be incremented on each retry. This was discovered when
backporting the repl_location_failures riak_test.

The one in riak_repl2_fssource is a legit bug in the code

@lordnull
Contributor

lordnull commented Dec 8, 2014

+1 5479089

@engelsanchez
Contributor Author

@borshop merge

borshop added a commit that referenced this pull request Dec 8, 2014
…ilures-2.0

2.0 port of AAE transient FS failures

Reviewed-by: lordnull
@borshop borshop merged commit 5479089 into 2.0 Dec 8, 2014
@seancribbs seancribbs deleted the feature/chatty-aae-transient-fs-failures-2.0 branch April 1, 2015 23:48
3 participants