Bugfix mw chatty transiet aae fs failures by lordnull · Pull Request #626 · basho/riak_repl

lordnull · 2014-10-10T21:23:30Z

There are cases where the aae full sync strategy will fail, but will be able to recover later, such as when the aae trees are not yet built. Until the cause of the transient failure is rectified (or the partition is restarted too many times), the crash log will be filled with useless messages. This PR introduces the use of a message 'soft_exit' to an owner process (passed into the start functions), and then exits the aae worker normally. This allows the owner process (fscoordinator) to retry the partition, but does not spam the crash logs.

~~The PR works, but an automated test is not yet complete.~~

The aae full sync test has been updated: basho/riak_test/pull/691

Edit/create riak_test to ensure feature works.

jonmeredith · 2014-10-13T17:45:54Z

jonmeredith · 2014-10-13T21:58:37Z

I don't think this will work in a mixed cluster. Not sure if you saw the approach I outlined in the comments for https://bashoeng.atlassian.net/browse/RIAK-1235

If fscoordinator is running on a 1.3.X/1.4.10 or below node, it will try and start up processes on remote nodes by calling start_link/3 which will not exist on the older node and fail.

https://github.com/basho/riak_repl/pull/626/files#diff-5075ea65b0560166a4e4684a839326b2R40

If fscoordinator is running on pre-1.4.11 nodes, and fssource is running on 1.4.11+ then it will send unhandled messages to undefined, and exit normally making the fscoordinator think it has succeeded when it has not.

Why not add a step to the fscoordinator in start_fssource to call something like

    case riak_repl2_fssource_sup:enable(LocalNode, Partition, {Ip, Port}) of
        {ok, Pid} ->
            link(Pid),
            riak_repl2_fssource:report_soft_exits(Pid),
            Running = orddict:store(Pid, PartitionVal, State#state.running_sources),
            State#state{running_sources = Running};

And have

report_soft_exits(Pid) ->
    try gen_server:call(Pid, {report_soft_exits, self()}, timer:seconds(15)) of
       ok -> %% Hit catchall
          false;
       {reporting_to, Pid} ->
          true
    catch
      _:Reason -> %% Probably a timeout if overloaded
          {error, Reason}
    end.

On upgraded nodes, you could store it in the state and make a function that reports back
{exit, soft_exit(Why, State), State}

soft_exit(Why, #state{soft_exits_to=undefined} = State) ->
  {stop, Why, State};
soft_exit(Why, #state{soft_exits_to=Pid} = State) ->
  Pid ! {soft_exit, self(), Why},
  {stop, normal, State}

Within a node between fssource/riak_repl_aae_source I think it is safe to just modify the start call, but you should verify there are not any external dependencies.

jonmeredith · 2014-10-13T23:43:48Z

riak_repl2_fscoordinator.erl

Soft error counting

It would have been a smaller change to introduce a soft_retries dictionary to count
those and check for going over. That would have only meant changing in the new
handle_abnormal_exit function, rather than in multiple places in the module.

Instead the factor to use the partition_info record is spread all over the module.
Given the switch to keytake in so many places, it would be cleaner to flatten
it as a list of records instead of an orddict.

-record(partition_info, {
    pid, index, retries = 0, soft_retries = 0, node
}).

then accessor code could be simplified to

    PartitionEntry = lists:keytake(Pid, #partition_info.pid, State#state.running_sources),
    case PartitionEntry of
        false ->
            % late exit or otherwise non-existant
            {noreply, State};
        {value, PartitionInfo, Running} ->
             do_something(PartitionInfo)
    end

or even cleaner given how much the pattern is used, break out the accessor into a function

   case take_source(Pid, State) of
     {false, _RunningSources} ->
         {noreply, State}
     {PartitionInfo, RunningSources} ->
         do_something(PartitionInfo)
   end

There is at least one place where the new format is not handled correctly
https://github.com/basho/riak_repl/pull/626/files#diff-ff7718aa59949a972fa2d2fae40b11d0R329

    [begin
        unlink(Pid),
        riak_repl2_fssource:stop_fullsync(Pid),
        riak_repl2_fssource_sup:disable(node(Pid), Part)
    end || {Pid, {Part, _PartN}} <- State#state.running_sources],

Presumably {Pid, {Part, _PartN}} should be {Pid, #partition_info(index = Part}}
or if you flattened running_sources, #partition_info{pid= Pid, index = Part}.

Soft error behavior

When soft errors are exceeded the partition is silently dropped from all lists.
Although the soft_retry count is updated, it should increment error_exits once
exceeded so that a status call can tell fullsync has not synchronized completely.

The log entry printed on completion does not include any information on
whether the current fullsync completed with error.

Existing behavior is for the rtdirty flag to be reset regardless of exits.
Should file a separate issue for that, outside the scope of this fix.

Purgatory

What is the purpose of the purgatory queue, why does it need any different
behavior from the partition_queue?

Exit limits

The existing setting to control how many times a partition can be retried is
max_fssource_retries, would recommend using something similar like max_fssource_soft_retries.

The if statements, seem ugly, how about writing a small helper like

hit_retry_limit(Count, infinity) ->
    false;
hit_retry_limit(Count, Limit) ->
    Count >= Limit.

That way the duplicate queing cases could be collapsed into a pair of cases.

If purgatory is not needed, you could break it into

{Count, Limit, State} = case ExitType of
    soft_exit ->
         %% update soft counts and get limits
    'EXIT'
         %% update hard exit counts and limits
    end
case hit_retry_limit(Count, Limit) of
  true ->
     %% Log failed message
  false ->
     %% Log retry message and add to partition_queue

jonmeredith · 2014-10-13T23:51:15Z

Why was the check for is_process_alive/1 removed before making the legacy_status request? Unnecessary as the sync_send_all_state checks? Why the change to an ugly if?

Is this change related to soft error handling?

While testing the initial soft_exit change, I found that it was possible for the is_process_alive/1 to return true, but by the time it got to the sync_send_all_state_event/3, the process was no longer alive due to a soft_exit. The is_process_alive/1 was not doing it's job (ensure we only try to get stats from a living worker), so I removed it in favor of try/catch.

The more I've worked with erlang, the less 'if' has seemed ugly. It's a guard with a boolean output. Why is a case statement superior?

Thanks for the explanation.

if is ugly to me because the fall you have to specify the fall through case
as true. I know others (CMJ, SeanC, Kresten) don't like it based on other
review feedback. In the rest of Erlang you'd specify that as an _Var on
something you didn't care about. As they aren't used much by others I
think it risks tripping people up.

On Tue, Oct 14, 2014 at 12:32 PM, Micah notifications@github.com wrote:

In src/riak_repl2_fssource.erl:

@@ -137,9 +145,22 @@ handle_call(stop_fullsync, _From, State=#state{fullsync_worker=FSW,
{reply, ok, State};
handle_call(legacy_status, _From, State=#state{fullsync_worker=FSW,
socket=Socket}) ->

Res = case is_pid(FSW) andalso is_process_alive(FSW) of

true -> gen_fsm:sync_send_all_state_event(FSW, status, infinity);

false -> []

lager:debug("Sending status to ~p", [FSW]),

Res = if

While testing the initial soft_exit change, I found that it was possible
for the is_process_alive/1 to return true, but by the time it got to the
sync_send_all_state_event/3, the process was no longer alive due to a
soft_exit. The is_process_alive/1 was not doing it's job (ensure we only
try to get stats from a living worker), so I removed it in favor of
try/catch.

The more I've worked with erlang, the less 'if' has seemed ugly. It's a
guard with a boolean output. Why is a case statement superior?

—
Reply to this email directly or view it on GitHub
https://github.com/basho/riak_repl/pull/626/files#r18848806.

Jon Meredith
Chief Architect
Basho Technologies, Inc.
jmeredith@basho.com

jonmeredith · 2014-10-14T01:35:51Z

@lordnull done reviewing for now. -1 until we resolve the rolling upgrade issue, lets talk about the best path forward in the morning.

lordnull · 2014-10-14T21:42:20Z

Soft Error Counting

The primary urge to move to using a record was for locality of information. Rather than needing to track data about a partition spread through out the state of the fscoordinator (retries in one dict, soft_retries in another, which node it's on in yet another data structure), it's all kept and updated in one record. That record gets updated and moved to various waiting areas and queues.

Not to tied to the idea, but it seemed a cleaner method going forward than just tacking on another bit of state in the coordinator to reset.

Purgatory

While testing, I found that the soft_exit_retries was increasing very rapidly. It was retrying partitions faster than the cause for the retry was resolved since the failed partition was put back into the partition_queue immediately. By putting the failed partition in a purgatory queue and feeding the partitions back in more slowly, this lowers the amount of useless work done.

jonmeredith · 2014-10-14T23:14:18Z

This should count as an error_exit as the fullsync did not complete for all vnodes.

lordnull · 2014-10-17T18:38:13Z

I've done further testing with the r_t linked above and eunit. The occasionally failure I will get in r_t is related to transfers, not repl; when it gets past transfers the test passes.

I'm not getting the same errors on eunit as on the build machine; the errors I've gotten are related to rtq, while the build machine seems to have errors related to a test not properly cleaning itself up.

Where do we go from here?

lordnull · 2014-10-22T21:38:03Z

I've tested this on a local rebase off of 1.4, and it still passes r_t still. Do we want to merge as is, see if the eunit tests can be fixed, or push the rebase and merge that?

cmeiklejohn · 2014-11-05T20:28:20Z

Pushed some minor cleanup changes, please review, @lordnull and/or @jonmeredith:

d6a987b
ec300dc
9ec4c7d
137bee8
94232f0
9ec4c7d

Let me know if you have any questions.

lordnull · 2014-11-05T22:46:08Z

Looks good to me.

engelsanchez · 2014-11-07T17:59:01Z

I will be finishing the review for this PR

engelsanchez · 2014-11-14T21:07:04Z

Here we are blocking for an 'EXIT' message from the fssource, which we should be about to get. But there is a chance of the disterl reconnection happening at the wrong time and this message getting lost forever. That would make this block forever too.

The intent of the line is to avoid the extra log spam the secondary redundant exit message would cause. Given that, what do you think would be a reasonable wait time for the exit message?

I suppose any timeout number is a guess, and that path will be very rarely touched in real life. So something like 1 min, etc would probably be OK. And normally these two messages will come at pretty much the same time.

Another alternative is to set up a monitor to the process here. Then in a function wait for EXIT optionally, and then DOWN. When you get the DOWN message, which is guaranteed, you know no more messages for that process will arrive. Since the process is likely already dead at this point, that should not take long normally.

I prefer waiting on a known event than guessing at a timeout. Since we would be trying to monitor a dead process, the monitor option seems the most reliable and clean option.

Problem: transient failures of aae, such as trees not yet built or locks not being aquired, would cause an aae fullsync process to exit abnormally. This could happen several times in a row, creating log spam. Resolution: the concept of soft_exit. A soft_exit is a message sent from a soon to be exiting process to a soft_linked process. The exiting process would then exit normally, while any soft_linked processes could handle the soft_exit message in a similar fashion as an exit message. This would indicate an exit reason that should be handled, but not bad enough to have the system logger know about it. The soft_exit message sent from the aae worker to the fscoordinator is as simple as `{soft_exit, pid(), term()}'. The current implementation is not generic. There can only one soft_link to the aae, and there's no general mechanism to use soft_link's or soft_exits elsewhere in the code base. Sorry. Another change rolled into this is consistent use of a #partition_info record in the fscoordinator, and error tracking the fscoordinator's state. By swapping to useing a single data structure in the partition queue, whereis waiting list, and purgatory queues it makes it easier to understand the fscordinator (as there is less code modify structures).

engelsanchez · 2014-11-18T20:37:52Z

+1 0bdb1ba

We've finished reviewing this PR. All problems found during the review have been addressed. All eunit tests pass for me locally (the builder seems to have a setup problem though) and the aae fullsync riak_test passes too. I've also manually tested by interfering with AAE trees manually while AAE fullsyncs are running. As expected, the errors that used to spam the logs are only info entries now and the soft retries are incremented instead. This is good to go!

…ailures Bugfix mw chatty transiet aae fs failures

jonmeredith reviewed Oct 13, 2014
View reviewed changes

Comment thread src/riak_repl2_fscoordinator.erl Outdated

Copy link
Copy Markdown

Contributor

jonmeredith Oct 13, 2014

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cruft

jonmeredith reviewed Oct 13, 2014
View reviewed changes

jonmeredith reviewed Oct 14, 2014
View reviewed changes

lordnull mentioned this pull request Oct 15, 2014

Fixed test to handle soft exits. basho/riak_test#691

Merged

lordnull force-pushed the bugfix-mw-chatty-transiet-aae-fs-failures branch from 293b6ab to 2de8426 Compare November 4, 2014 21:36

engelsanchez self-assigned this Nov 7, 2014

engelsanchez reviewed Nov 14, 2014
View reviewed changes

lordnull force-pushed the bugfix-mw-chatty-transiet-aae-fs-failures branch from 1fdd46c to 0bdb1ba Compare November 18, 2014 20:25

lordnull added a commit that referenced this pull request Nov 18, 2014

Merge pull request #626 from basho/bugfix-mw-chatty-transiet-aae-fs-f…

f902ae0

…ailures Bugfix mw chatty transiet aae fs failures

lordnull merged commit f902ae0 into 1.4 Nov 18, 2014

lordnull deleted the bugfix-mw-chatty-transiet-aae-fs-failures branch November 18, 2014 20:46

lordnull mentioned this pull request Nov 18, 2014

Implement soft_exit, primarily for aae_fullsyn. #636

Merged

Conversation

lordnull commented Oct 10, 2014

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jonmeredith commented Oct 13, 2014

Uh oh!

jonmeredith commented Oct 13, 2014

riak_repl2_fscoordinator.erl

Soft error counting

Soft error behavior

Purgatory

Exit limits

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jonmeredith commented Oct 14, 2014

Uh oh!

lordnull commented Oct 14, 2014

Soft Error Counting

Purgatory

Uh oh!

Choose a reason for hiding this comment

Uh oh!

lordnull commented Oct 17, 2014

Uh oh!

lordnull commented Oct 22, 2014

Uh oh!

cmeiklejohn commented Nov 5, 2014

Uh oh!

lordnull commented Nov 5, 2014

Uh oh!

engelsanchez commented Nov 7, 2014

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

engelsanchez commented Nov 18, 2014

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants