Skip to content

Bugfix mw chatty transiet aae fs failures#626

Merged
lordnull merged 1 commit into
1.4from
bugfix-mw-chatty-transiet-aae-fs-failures
Nov 18, 2014
Merged

Bugfix mw chatty transiet aae fs failures#626
lordnull merged 1 commit into
1.4from
bugfix-mw-chatty-transiet-aae-fs-failures

Conversation

@lordnull
Copy link
Copy Markdown
Contributor

There are cases where the aae full sync strategy will fail, but will be able to recover later, such as when the aae trees are not yet built. Until the cause of the transient failure is rectified (or the partition is restarted too many times), the crash log will be filled with useless messages. This PR introduces the use of a message 'soft_exit' to an owner process (passed into the start functions), and then exits the aae worker normally. This allows the owner process (fscoordinator) to retry the partition, but does not spam the crash logs.

The PR works, but an automated test is not yet complete.

The aae full sync test has been updated: basho/riak_test/pull/691

  • Edit/create riak_test to ensure feature works.

Comment thread src/riak_repl2_fscoordinator.erl Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cruft

@jonmeredith
Copy link
Copy Markdown
Contributor

I don't think this will work in a mixed cluster. Not sure if you saw the approach I outlined in the comments for https://bashoeng.atlassian.net/browse/RIAK-1235

If fscoordinator is running on a 1.3.X/1.4.10 or below node, it will try and start up processes on remote nodes by calling start_link/3 which will not exist on the older node and fail.

https://github.com/basho/riak_repl/pull/626/files#diff-5075ea65b0560166a4e4684a839326b2R40

If fscoordinator is running on pre-1.4.11 nodes, and fssource is running on 1.4.11+ then it will send unhandled messages to undefined, and exit normally making the fscoordinator think it has succeeded when it has not.

Why not add a step to the fscoordinator in start_fssource to call something like

    case riak_repl2_fssource_sup:enable(LocalNode, Partition, {Ip, Port}) of
        {ok, Pid} ->
            link(Pid),
            riak_repl2_fssource:report_soft_exits(Pid),
            Running = orddict:store(Pid, PartitionVal, State#state.running_sources),
            State#state{running_sources = Running};

And have

report_soft_exits(Pid) ->
    try gen_server:call(Pid, {report_soft_exits, self()}, timer:seconds(15)) of
       ok -> %% Hit catchall
          false;
       {reporting_to, Pid} ->
          true
    catch
      _:Reason -> %% Probably a timeout if overloaded
          {error, Reason}
    end.

On upgraded nodes, you could store it in the state and make a function that reports back
{exit, soft_exit(Why, State), State}

soft_exit(Why, #state{soft_exits_to=undefined} = State) ->
  {stop, Why, State};
soft_exit(Why, #state{soft_exits_to=Pid} = State) ->
  Pid ! {soft_exit, self(), Why},
  {stop, normal, State}

Within a node between fssource/riak_repl_aae_source I think it is safe to just modify the start call, but you should verify there are not any external dependencies.

@jonmeredith
Copy link
Copy Markdown
Contributor

riak_repl2_fscoordinator.erl

Soft error counting

It would have been a smaller change to introduce a soft_retries dictionary to count
those and check for going over. That would have only meant changing in the new
handle_abnormal_exit function, rather than in multiple places in the module.

Instead the factor to use the partition_info record is spread all over the module.
Given the switch to keytake in so many places, it would be cleaner to flatten
it as a list of records instead of an orddict.

-record(partition_info, {
    pid, index, retries = 0, soft_retries = 0, node
}).

then accessor code could be simplified to

    PartitionEntry = lists:keytake(Pid, #partition_info.pid, State#state.running_sources),
    case PartitionEntry of
        false ->
            % late exit or otherwise non-existant
            {noreply, State};
        {value, PartitionInfo, Running} ->
             do_something(PartitionInfo)
    end

or even cleaner given how much the pattern is used, break out the accessor into a function

   case take_source(Pid, State) of
     {false, _RunningSources} ->
         {noreply, State}
     {PartitionInfo, RunningSources} ->
         do_something(PartitionInfo)
   end

There is at least one place where the new format is not handled correctly
https://github.com/basho/riak_repl/pull/626/files#diff-ff7718aa59949a972fa2d2fae40b11d0R329

    [begin
        unlink(Pid),
        riak_repl2_fssource:stop_fullsync(Pid),
        riak_repl2_fssource_sup:disable(node(Pid), Part)
    end || {Pid, {Part, _PartN}} <- State#state.running_sources],

Presumably {Pid, {Part, _PartN}} should be {Pid, #partition_info(index = Part}}
or if you flattened running_sources, #partition_info{pid= Pid, index = Part}.

Soft error behavior

When soft errors are exceeded the partition is silently dropped from all lists.
Although the soft_retry count is updated, it should increment error_exits once
exceeded so that a status call can tell fullsync has not synchronized completely.

The log entry printed on completion does not include any information on
whether the current fullsync completed with error.

Existing behavior is for the rtdirty flag to be reset regardless of exits.
Should file a separate issue for that, outside the scope of this fix.

Purgatory

What is the purpose of the purgatory queue, why does it need any different
behavior from the partition_queue?

Exit limits

The existing setting to control how many times a partition can be retried is
max_fssource_retries, would recommend using something similar like max_fssource_soft_retries.

The if statements, seem ugly, how about writing a small helper like

hit_retry_limit(Count, infinity) ->
    false;
hit_retry_limit(Count, Limit) ->
    Count >= Limit.

That way the duplicate queing cases could be collapsed into a pair of cases.

If purgatory is not needed, you could break it into

{Count, Limit, State} = case ExitType of
    soft_exit ->
         %% update soft counts and get limits
    'EXIT'
         %% update hard exit counts and limits
    end
case hit_retry_limit(Count, Limit) of
  true ->
     %% Log failed message
  false ->
     %% Log retry message and add to partition_queue

Comment thread src/riak_repl2_fssource.erl Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why was the check for is_process_alive/1 removed before making the legacy_status request? Unnecessary as the sync_send_all_state checks? Why the change to an ugly if?

Is this change related to soft error handling?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While testing the initial soft_exit change, I found that it was possible for the is_process_alive/1 to return true, but by the time it got to the sync_send_all_state_event/3, the process was no longer alive due to a soft_exit. The is_process_alive/1 was not doing it's job (ensure we only try to get stats from a living worker), so I removed it in favor of try/catch.

The more I've worked with erlang, the less 'if' has seemed ugly. It's a guard with a boolean output. Why is a case statement superior?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation.

if is ugly to me because the fall you have to specify the fall through case
as true. I know others (CMJ, SeanC, Kresten) don't like it based on other
review feedback. In the rest of Erlang you'd specify that as an _Var on
something you didn't care about. As they aren't used much by others I
think it risks tripping people up.

On Tue, Oct 14, 2014 at 12:32 PM, Micah notifications@github.com wrote:

In src/riak_repl2_fssource.erl:

@@ -137,9 +145,22 @@ handle_call(stop_fullsync, _From, State=#state{fullsync_worker=FSW,
{reply, ok, State};
handle_call(legacy_status, _From, State=#state{fullsync_worker=FSW,
socket=Socket}) ->

  • Res = case is_pid(FSW) andalso is_process_alive(FSW) of
  •    true -> gen_fsm:sync_send_all_state_event(FSW, status, infinity);
    
  •    false -> []
    
  • lager:debug("Sending status to ~p", [FSW]),
  • Res = if

While testing the initial soft_exit change, I found that it was possible
for the is_process_alive/1 to return true, but by the time it got to the
sync_send_all_state_event/3, the process was no longer alive due to a
soft_exit. The is_process_alive/1 was not doing it's job (ensure we only
try to get stats from a living worker), so I removed it in favor of
try/catch.

The more I've worked with erlang, the less 'if' has seemed ugly. It's a
guard with a boolean output. Why is a case statement superior?


Reply to this email directly or view it on GitHub
https://github.com/basho/riak_repl/pull/626/files#r18848806.

Jon Meredith
Chief Architect
Basho Technologies, Inc.
jmeredith@basho.com

@jonmeredith
Copy link
Copy Markdown
Contributor

@lordnull done reviewing for now. -1 until we resolve the rolling upgrade issue, lets talk about the best path forward in the morning.

@lordnull
Copy link
Copy Markdown
Contributor Author

Soft Error Counting

The primary urge to move to using a record was for locality of information. Rather than needing to track data about a partition spread through out the state of the fscoordinator (retries in one dict, soft_retries in another, which node it's on in yet another data structure), it's all kept and updated in one record. That record gets updated and moved to various waiting areas and queues.

Not to tied to the idea, but it seemed a cleaner method going forward than just tacking on another bit of state in the coordinator to reset.

Purgatory

While testing, I found that the soft_exit_retries was increasing very rapidly. It was retrying partitions faster than the cause for the retry was resolved since the failed partition was put back into the partition_queue immediately. By putting the failed partition in a purgatory queue and feeding the partitions back in more slowly, this lowers the amount of useless work done.

Comment thread src/riak_repl2_fscoordinator.erl Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should count as an error_exit as the fullsync did not complete for all vnodes.

@lordnull
Copy link
Copy Markdown
Contributor Author

I've done further testing with the r_t linked above and eunit. The occasionally failure I will get in r_t is related to transfers, not repl; when it gets past transfers the test passes.

I'm not getting the same errors on eunit as on the build machine; the errors I've gotten are related to rtq, while the build machine seems to have errors related to a test not properly cleaning itself up.

Where do we go from here?

@lordnull
Copy link
Copy Markdown
Contributor Author

I've tested this on a local rebase off of 1.4, and it still passes r_t still. Do we want to merge as is, see if the eunit tests can be fixed, or push the rebase and merge that?

@lordnull lordnull force-pushed the bugfix-mw-chatty-transiet-aae-fs-failures branch from 293b6ab to 2de8426 Compare November 4, 2014 21:36
@cmeiklejohn
Copy link
Copy Markdown
Contributor

Pushed some minor cleanup changes, please review, @lordnull and/or @jonmeredith:

d6a987b
ec300dc
9ec4c7d
137bee8
94232f0
9ec4c7d

Let me know if you have any questions.

@lordnull
Copy link
Copy Markdown
Contributor Author

lordnull commented Nov 5, 2014

Looks good to me.

@engelsanchez
Copy link
Copy Markdown
Contributor

I will be finishing the review for this PR

@engelsanchez engelsanchez self-assigned this Nov 7, 2014
Comment thread src/riak_repl2_fscoordinator.erl Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we are blocking for an 'EXIT' message from the fssource, which we should be about to get. But there is a chance of the disterl reconnection happening at the wrong time and this message getting lost forever. That would make this block forever too.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intent of the line is to avoid the extra log spam the secondary redundant exit message would cause. Given that, what do you think would be a reasonable wait time for the exit message?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose any timeout number is a guess, and that path will be very rarely touched in real life. So something like 1 min, etc would probably be OK. And normally these two messages will come at pretty much the same time.

Another alternative is to set up a monitor to the process here. Then in a function wait for EXIT optionally, and then DOWN. When you get the DOWN message, which is guaranteed, you know no more messages for that process will arrive. Since the process is likely already dead at this point, that should not take long normally.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer waiting on a known event than guessing at a timeout. Since we would be trying to monitor a dead process, the monitor option seems the most reliable and clean option.

Problem: transient failures of aae, such as trees not yet built or locks not
being aquired, would cause an aae fullsync process to exit abnormally. This
could happen several times in a row, creating log spam.

Resolution: the concept of soft_exit. A soft_exit is a message sent from a soon
to be exiting process to a soft_linked process. The exiting process would then
exit normally, while any soft_linked processes could handle the soft_exit
message in a similar fashion as an exit message. This would indicate an exit
reason that should be handled, but not bad enough to have the system logger
know about it.

The soft_exit message sent from the aae worker to the fscoordinator is
as simple as `{soft_exit, pid(), term()}'.

The current implementation is not generic. There can only one soft_link to
the aae, and there's no general mechanism to use soft_link's or soft_exits
elsewhere in the code base. Sorry.

Another change rolled into this is consistent use of a #partition_info record
in the fscoordinator, and error tracking the fscoordinator's state. By swapping
to useing a single data structure in the partition queue, whereis waiting list,
and purgatory queues it makes it easier to understand the fscordinator (as
there is less code modify structures).
@lordnull lordnull force-pushed the bugfix-mw-chatty-transiet-aae-fs-failures branch from 1fdd46c to 0bdb1ba Compare November 18, 2014 20:25
@engelsanchez
Copy link
Copy Markdown
Contributor

+1 0bdb1ba

We've finished reviewing this PR. All problems found during the review have been addressed. All eunit tests pass for me locally (the builder seems to have a setup problem though) and the aae fullsync riak_test passes too. I've also manually tested by interfering with AAE trees manually while AAE fullsyncs are running. As expected, the errors that used to spam the logs are only info entries now and the soft retries are incremented instead. This is good to go!

lordnull added a commit that referenced this pull request Nov 18, 2014
…ailures

Bugfix mw chatty transiet aae fs failures
@lordnull lordnull merged commit f902ae0 into 1.4 Nov 18, 2014
@lordnull lordnull deleted the bugfix-mw-chatty-transiet-aae-fs-failures branch November 18, 2014 20:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants