Implement soft_exit, primarily for aae_fullsyn. #636

Merged: borshop merged 4 commits into develop from feature/mw/forward-port-chatty-transient-aae-fs-failures on Dec 8, 2014
@lordnull
Contributor

Problem: transient failures of aae, such as trees not yet built or locks not
being acquired, would cause an aae fullsync process to exit abnormally. This
could happen several times in a row, creating log spam.

Resolution: the concept of soft_exit. A soft_exit is a message sent from a soon
to be exiting process to a soft_linked process. The exiting process would then
exit normally, while any soft_linked processes could handle the soft_exit
message in a similar fashion as an exit message. This would indicate an exit
reason that should be handled, but not bad enough to have the system logger
know about it.

The soft_exit message sent from the aae worker to the fscoordinator is
as simple as `{soft_exit, pid(), term()}'.
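
As a rough illustration of the idea described above (not the code in this PR), a worker could report a transient failure to its owner and then exit normally, while the owner handles the `soft_exit` message much like an exit. The module name, `start/1`, `worker/2`, and `await_worker/2` are hypothetical; only the `{soft_exit, pid(), term()}` message shape comes from the description.

```erlang
-module(soft_exit_sketch).
-export([start/1]).

%% Hypothetical worker: on a transient failure it tells its soft-linked
%% owner why it is stopping, then exits with reason 'normal' so that no
%% crash report reaches the logger.
worker(_Owner, ok) ->
    exit(normal);
worker(Owner, {transient, Reason}) ->
    Owner ! {soft_exit, self(), Reason},
    exit(normal).

%% Hypothetical owner side: a soft_exit is handled like an exit that needs
%% attention (here it just asks for a retry); a plain normal exit is 'done'.
await_worker(Pid, Ref) ->
    receive
        {soft_exit, Pid, Reason} ->
            erlang:demonitor(Ref, [flush]),
            {retry, Reason};
        {'DOWN', Ref, process, Pid, normal} ->
            done;
        {'DOWN', Ref, process, Pid, Other} ->
            {crashed, Other}
    end.

%% Spawn a monitored worker with the given outcome and wait for it.
start(Outcome) ->
    Owner = self(),
    {Pid, Ref} = spawn_monitor(fun() -> worker(Owner, Outcome) end),
    await_worker(Pid, Ref).
```

With this sketch, `soft_exit_sketch:start({transient, not_built})` yields a retry request instead of a logged crash. A monitor stands in for the single soft link here; that is a simplification of this example, not a claim about how riak_repl wires it up.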

The current implementation is not generic. There can only be one soft_link to
the aae, and there's no general mechanism to use soft_links or soft_exits
elsewhere in the code base. Sorry.

Another change rolled into this is consistent use of a #partition_info record
in the fscoordinator, and error tracking in the fscoordinator's state. Swapping
to a single data structure in the partition queue, whereis waiting list,
and purgatory queues makes it easier to understand the fscoordinator (as
there is less code to modify structures).
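
To make the single-data-structure point concrete, here is a minimal sketch in which the same record travels between two queues unchanged in shape. The field names, the `to_purgatory/2` helper, and the idea of bumping a counter on the move are assumptions for illustration; the real `#partition_info{}` record and queue handling may differ.

```erlang
-module(partition_info_sketch).
-export([demo/0]).

%% Hypothetical fields; the real #partition_info{} record may differ.
-record(partition_info, {
    index,             % partition index
    node,              % node believed to own the partition (assumed field)
    retry_exits = 0    % soft exits seen for this partition (assumed field)
}).

%% Move the head of the partition queue into purgatory and bump its retry
%% counter. The entry keeps the same shape in both queues, so no conversion
%% code is needed.
to_purgatory(PartitionQueue, Purgatory) ->
    case queue:out(PartitionQueue) of
        {{value, P}, Rest} ->
            Retries = P#partition_info.retry_exits + 1,
            {Rest, queue:in(P#partition_info{retry_exits = Retries}, Purgatory)};
        {empty, _} ->
            {PartitionQueue, Purgatory}
    end.

demo() ->
    Q0 = queue:from_list([#partition_info{index = 0, node = 'a@host'}]),
    {Q1, Purg} = to_purgatory(Q0, queue:new()),
    {queue:is_empty(Q1),
     [P#partition_info.retry_exits || P <- queue:to_list(Purg)]}.
```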

This is a forward port of the fix done for 1.4 (#626). A riak_test exists at basho/riak_test#703. Conflicts favor existing code
where it does not directly affect the fix.

Conflicts:
Makefile
rebar.config
src/riak_repl2_fssource.erl
src/riak_repl2_rtq_proxy.erl
src/riak_repl_aae_source.erl
test/riak_core_cluster_mgr_tests.erl

lordnull and others added 3 commits November 18, 2014 16:10
increment_error_dict expects the partition, elementN of the error dict, and the
state. It pulls the dict out of the state so that it can put it back in place,
and thus returns just the state. So this call, which passed the dict in, was wrong.
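
The calling convention described in this commit message can be sketched as follows. This is an assumption-laden illustration: the `#state{}` record, the tuple-of-counters value, and the default `{0, 0}` are hypothetical, and only the argument order (partition, element index, state) and the state-in, state-out behaviour come from the message.

```erlang
-module(error_dict_sketch).
-export([demo/0]).

%% Hypothetical state record holding the error dict.
-record(state, {error_dict = dict:new()}).

%% Takes the partition, the position of the counter to bump, and the state.
%% The dict is pulled out of the state and stored back into it, so the
%% caller passes the whole state and gets the whole state back.
%% Assumption: each dict value is a tuple of counters, defaulting to {0, 0}.
increment_error_dict(Partition, ElementN, #state{error_dict = Dict0} = State) ->
    Counters = case dict:find(Partition, Dict0) of
        {ok, Found} -> Found;
        error -> {0, 0}
    end,
    Bumped = setelement(ElementN, Counters, element(ElementN, Counters) + 1),
    State#state{error_dict = dict:store(Partition, Bumped, Dict0)}.

demo() ->
    S1 = increment_error_dict(42, 1, #state{}),
    S2 = increment_error_dict(42, 2, S1),
    S3 = increment_error_dict(42, 2, S2),
    dict:fetch(42, S3#state.error_dict).
```

Passing the dict itself as the third argument fails the `#state{}` match in the function head, which is the kind of mistake the commit describes.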
When a partition is not available, perhaps after a number of retries,
the error exits stat should be incremented. Also, the retry exits stat
should be incremented on each retry.  This was discovered when
backporting the repl_location_failures riak_test.
@engelsanchez engelsanchez force-pushed the feature/mw/forward-port-chatty-transient-aae-fs-failures branch from 812a9e1 to aea4b42 Compare December 5, 2014 15:54
@engelsanchez
Contributor

Cherry-picked commits f565ea2 and 7da94e8 from the 1.4 branch. They fix an issue that crashes the process due to incorrect syntax, and the regression in the error_exits/retry_exits stats. The stats can be verified with the repl_location_failures riak_test, which will fail without that commit.

The one in riak_repl2_fssource is a legit bug in the code
@engelsanchez
Contributor

Added another commit that fixes dialyzer warnings, one of which was an actual bug. @lordnull can you take a look at it? Initializing the process with an owner actually never put the owner in the state. Not sure if that was used yet.

@lordnull
Contributor Author

lordnull commented Dec 5, 2014

👍 new changes look good to me and don't explode the world.

borshop added a commit that referenced this pull request Dec 6, 2014
…nsient-aae-fs-failures

Implement soft_exit, primarily for aae_fullsyn.

Reviewed-by: engelsanchez
@engelsanchez
Contributor

@borshop merge

@lordnull Unfortunately we should have done this in the 2.0 branch first :(. I'll open a PR for that branch now. So. Many. Branches.

@borshop borshop merged commit c9beaa4 into develop Dec 8, 2014
@seancribbs seancribbs deleted the feature/mw/forward-port-chatty-transient-aae-fs-failures branch April 1, 2015 23:48