Bugfix mw chatty transiet aae fs failures#626
Conversation
|
I don't think this will work in a mixed cluster. Not sure if you saw the approach I outlined in the comments for https://bashoeng.atlassian.net/browse/RIAK-1235 If fscoordinator is running on a 1.3.X/1.4.10 or below node, it will try and start up processes on remote nodes by calling start_link/3 which will not exist on the older node and fail. https://github.com/basho/riak_repl/pull/626/files#diff-5075ea65b0560166a4e4684a839326b2R40 If fscoordinator is running on pre-1.4.11 nodes, and fssource is running on 1.4.11+ then it will send unhandled messages to undefined, and exit normally making the fscoordinator think it has succeeded when it has not. Why not add a step to the fscoordinator in And have On upgraded nodes, you could store it in the state and make a function that reports back Within a node between fssource/riak_repl_aae_source I think it is safe to just modify the start call, but you should verify there are not any external dependencies. |
riak_repl2_fscoordinator.erlSoft error countingIt would have been a smaller change to introduce a soft_retries dictionary to count Instead the factor to use the partition_info record is spread all over the module. then accessor code could be simplified to or even cleaner given how much the pattern is used, break out the accessor into a function There is at least one place where the new format is not handled correctly Presumably Soft error behaviorWhen soft errors are exceeded the partition is silently dropped from all lists. The log entry printed on completion does not include any information on Existing behavior is for the rtdirty flag to be reset regardless of exits. PurgatoryWhat is the purpose of the purgatory queue, why does it need any different Exit limitsThe existing setting to control how many times a partition can be retried is The That way the duplicate queing cases could be collapsed into a pair of cases. If purgatory is not needed, you could break it into |
There was a problem hiding this comment.
Why was the check for is_process_alive/1 removed before making the legacy_status request? Unnecessary as the sync_send_all_state checks? Why the change to an ugly if?
Is this change related to soft error handling?
There was a problem hiding this comment.
While testing the initial soft_exit change, I found that it was possible for the is_process_alive/1 to return true, but by the time it got to the sync_send_all_state_event/3, the process was no longer alive due to a soft_exit. The is_process_alive/1 was not doing it's job (ensure we only try to get stats from a living worker), so I removed it in favor of try/catch.
The more I've worked with erlang, the less 'if' has seemed ugly. It's a guard with a boolean output. Why is a case statement superior?
There was a problem hiding this comment.
Thanks for the explanation.
if is ugly to me because the fall you have to specify the fall through case
as true. I know others (CMJ, SeanC, Kresten) don't like it based on other
review feedback. In the rest of Erlang you'd specify that as an _Var on
something you didn't care about. As they aren't used much by others I
think it risks tripping people up.
On Tue, Oct 14, 2014 at 12:32 PM, Micah notifications@github.com wrote:
In src/riak_repl2_fssource.erl:
@@ -137,9 +145,22 @@ handle_call(stop_fullsync, _From, State=#state{fullsync_worker=FSW,
{reply, ok, State};
handle_call(legacy_status, _From, State=#state{fullsync_worker=FSW,
socket=Socket}) ->
- Res = case is_pid(FSW) andalso is_process_alive(FSW) of
true -> gen_fsm:sync_send_all_state_event(FSW, status, infinity);false -> []- lager:debug("Sending status to ~p", [FSW]),
- Res = if
While testing the initial soft_exit change, I found that it was possible
for the is_process_alive/1 to return true, but by the time it got to the
sync_send_all_state_event/3, the process was no longer alive due to a
soft_exit. The is_process_alive/1 was not doing it's job (ensure we only
try to get stats from a living worker), so I removed it in favor of
try/catch.The more I've worked with erlang, the less 'if' has seemed ugly. It's a
guard with a boolean output. Why is a case statement superior?—
Reply to this email directly or view it on GitHub
https://github.com/basho/riak_repl/pull/626/files#r18848806.
Jon Meredith
Chief Architect
Basho Technologies, Inc.
jmeredith@basho.com
|
@lordnull done reviewing for now. -1 until we resolve the rolling upgrade issue, lets talk about the best path forward in the morning. |
Soft Error CountingThe primary urge to move to using a record was for locality of information. Rather than needing to track data about a partition spread through out the state of the fscoordinator (retries in one dict, soft_retries in another, which node it's on in yet another data structure), it's all kept and updated in one record. That record gets updated and moved to various waiting areas and queues. Not to tied to the idea, but it seemed a cleaner method going forward than just tacking on another bit of state in the coordinator to reset. PurgatoryWhile testing, I found that the soft_exit_retries was increasing very rapidly. It was retrying partitions faster than the cause for the retry was resolved since the failed partition was put back into the partition_queue immediately. By putting the failed partition in a purgatory queue and feeding the partitions back in more slowly, this lowers the amount of useless work done. |
There was a problem hiding this comment.
This should count as an error_exit as the fullsync did not complete for all vnodes.
|
I've done further testing with the r_t linked above and eunit. The occasionally failure I will get in r_t is related to transfers, not repl; when it gets past transfers the test passes. I'm not getting the same errors on eunit as on the build machine; the errors I've gotten are related to rtq, while the build machine seems to have errors related to a test not properly cleaning itself up. Where do we go from here? |
|
I've tested this on a local rebase off of 1.4, and it still passes r_t still. Do we want to merge as is, see if the eunit tests can be fixed, or push the rebase and merge that? |
293b6ab to
2de8426
Compare
|
Looks good to me. |
|
I will be finishing the review for this PR |
There was a problem hiding this comment.
Here we are blocking for an 'EXIT' message from the fssource, which we should be about to get. But there is a chance of the disterl reconnection happening at the wrong time and this message getting lost forever. That would make this block forever too.
There was a problem hiding this comment.
The intent of the line is to avoid the extra log spam the secondary redundant exit message would cause. Given that, what do you think would be a reasonable wait time for the exit message?
There was a problem hiding this comment.
I suppose any timeout number is a guess, and that path will be very rarely touched in real life. So something like 1 min, etc would probably be OK. And normally these two messages will come at pretty much the same time.
Another alternative is to set up a monitor to the process here. Then in a function wait for EXIT optionally, and then DOWN. When you get the DOWN message, which is guaranteed, you know no more messages for that process will arrive. Since the process is likely already dead at this point, that should not take long normally.
There was a problem hiding this comment.
I prefer waiting on a known event than guessing at a timeout. Since we would be trying to monitor a dead process, the monitor option seems the most reliable and clean option.
Problem: transient failures of aae, such as trees not yet built or locks not
being aquired, would cause an aae fullsync process to exit abnormally. This
could happen several times in a row, creating log spam.
Resolution: the concept of soft_exit. A soft_exit is a message sent from a soon
to be exiting process to a soft_linked process. The exiting process would then
exit normally, while any soft_linked processes could handle the soft_exit
message in a similar fashion as an exit message. This would indicate an exit
reason that should be handled, but not bad enough to have the system logger
know about it.
The soft_exit message sent from the aae worker to the fscoordinator is
as simple as `{soft_exit, pid(), term()}'.
The current implementation is not generic. There can only one soft_link to
the aae, and there's no general mechanism to use soft_link's or soft_exits
elsewhere in the code base. Sorry.
Another change rolled into this is consistent use of a #partition_info record
in the fscoordinator, and error tracking the fscoordinator's state. By swapping
to useing a single data structure in the partition queue, whereis waiting list,
and purgatory queues it makes it easier to understand the fscordinator (as
there is less code modify structures).
1fdd46c to
0bdb1ba
Compare
|
+1 0bdb1ba We've finished reviewing this PR. All problems found during the review have been addressed. All eunit tests pass for me locally (the builder seems to have a setup problem though) and the aae fullsync riak_test passes too. I've also manually tested by interfering with AAE trees manually while AAE fullsyncs are running. As expected, the errors that used to spam the logs are only info entries now and the soft retries are incremented instead. This is good to go! |
…ailures Bugfix mw chatty transiet aae fs failures
There are cases where the aae full sync strategy will fail, but will be able to recover later, such as when the aae trees are not yet built. Until the cause of the transient failure is rectified (or the partition is restarted too many times), the crash log will be filled with useless messages. This PR introduces the use of a message 'soft_exit' to an owner process (passed into the start functions), and then exits the aae worker normally. This allows the owner process (fscoordinator) to retry the partition, but does not spam the crash logs.
The PR works, but an automated test is not yet complete.The aae full sync test has been updated: basho/riak_test/pull/691