
Handle incompatible records between the 1.3 and 1.4 release#120

Closed
cmeiklejohn wants to merge 15 commits into 1.4 from csm-1.4-bad-record

Conversation

@cmeiklejohn
Contributor

I've confirmed a problem with a rolling upgrade between the 1.3 and 1.4 versions of Riak.

The root of the problem is that the node hosting Riak Control initiates an RPC request to all nodes and pattern matches the result against a record. The format of this record changed between the 1.3 and 1.4 releases without a new record name being introduced, so the pattern match fails.

This can occur when either:

1.) Riak Control hosted on a 1.3 node makes an RPC request to a 1.4 node.
2.) Riak Control hosted on a 1.4 node makes an RPC request to a 1.3 node.

Both of these scenarios trigger a badmatch exception in the riak_control_session module. This causes the Riak supervisor to restart the riak_control application, which then continues to crash while polling for state until the maximum restart threshold is reached and the node shuts down.
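To see why the match fails: an Erlang record is just a tagged tuple, so adding fields changes the tuple's arity, and any pattern written against the old definition raises badmatch. A minimal sketch (the record name is real, but the fields are simplified, not the actual member_info definition):

```erlang
-module(record_mismatch).
-export([demo/0]).

%% "1.3-era" record, compiled as the tuple {member_info, Node, Status}.
-record(member_info, {node, status}).

demo() ->
    %% A 1.4 node returns the same record name with an extra field,
    %% i.e. a 4-tuple instead of the expected 3-tuple.
    RemoteResult = {member_info, 'dev1@127.0.0.1', valid, extra_field},
    try
        %% This pattern expects a 3-tuple; the arity mismatch raises
        %% badmatch -- the same class of error that crashed
        %% riak_control_session.
        #member_info{node = Node} = RemoteResult,
        {ok, Node}
    catch
        error:{badmatch, _} ->
            incompatible
    end.
```
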

This patch provides a few things:

  • Use macros for record definitions moving forward.
  • Make the 1.3 record the definitive member_info record, and the proposed-1.4.1 record the definitive member_info_v2 record.
  • When receiving an incompatible 1.4 record, display the node as incompatible.
  • Suppress display of node-specific information when an incompatible node is returned.
  • Fix failed low-watermark handling in node resource.
  • Add a capability to ensure that we don't send 1.4.1+ records to 1.2.x through 1.3.x nodes.
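The "display the node as incompatible" handling above can be sketched roughly like this, with both record versions reduced to a couple of fields (illustrative only, not the actual riak_control_session code):

```erlang
-module(member_info_compat).
-export([classify/1]).

-record(member_info, {node, status}).          %% definitive 1.3 format (simplified)
-record(member_info_v2, {node, status, vsn}).  %% proposed 1.4.1 format (simplified)

%% Normalize whatever a remote node returns. Anything we don't
%% recognize is displayed as an incompatible node instead of
%% crashing the session process.
classify(#member_info{node = Node, status = Status}) ->
    {Node, Status};
classify(#member_info_v2{node = Node, status = Status}) ->
    {Node, Status};
classify(Other) when is_tuple(Other), element(1, Other) =:= member_info ->
    %% Same record name but an unexpected arity: a 1.4.0-style record.
    {element(2, Other), incompatible};
classify(_) ->
    {unknown, incompatible}.
```
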

Updates

  • Capability negotiation could not be used reliably because of race conditions when adding a node with a greater capability than the existing cluster, which would trigger node failures.
  • Instead, the RPC call was renamed, with the original call triggering a badrpc which is handled.
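The renamed-call approach works because invoking a function that doesn't exist on the remote node yields a {badrpc, ...} tuple rather than a record in the wrong format, and that tuple is easy to handle. A sketch, with hypothetical function names:

```erlang
-module(rpc_compat).
-export([get_status/1]).

%% Call the new (renamed) entry point first. On nodes that predate
%% the rename, rpc:call returns {badrpc, {'EXIT', {undef, _}}}, so we
%% fall back to the legacy call. Function names are illustrative.
get_status(Node) ->
    case rpc:call(Node, riak_control_session, get_status_v2, []) of
        {badrpc, _Reason} ->
            %% Old node (or unreachable node): use the original,
            %% 1.3-compatible call instead.
            rpc:call(Node, riak_control_session, get_status, []);
        Result ->
            Result
    end.
```
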

Rolling Upgrades

  • With this patch, both Riak 1.3 (see riak_control#121) and Riak 1.4 (see riak_control#111, riak_control#120) will unfortunately throw webmachine errors on the nodes resource, because of badarith handling during the start of memsup or when the node is in an incompatible state. This is unfortunate, but the patch at least addresses the major cluster crash.
  • I've tested branches 1.2, 1.3 and 1.4 against a mixed cluster with 1.4.1 nodes and could no longer trigger the crash scenario; however, the webmachine issues above still occur.

I would appreciate someone testing these upgrade scenarios as well to ensure total coverage.

I've left the commits broken out for the reviewer, but will rebase before merge.

@jonmeredith @russelldb @seancribbs @jgnewman @bsparrow435 @rzezeski


What's wrong with matching against #member_info{} in this case? It seems the main difference is that the new record has additional fields.

A capability might also help, allowing new nodes to send the original record to old nodes.

Contributor Author


Can't match member_info because we can't guarantee a specific format for it. I've added a capability for handling legacy record formats, but we still don't have a solution for 1.4.0/1.4.1 compatibility.

@rzezeski
Contributor

This patch ensures that a proposed-1.4.1 node can successfully run Riak Control, and talk to 1.3 and 1.4 nodes, but an earlier 1.3 node will still exhibit errors talking to a 1.4+ node.

Does this mean that a mixed cluster where the Riak Control node is not updated first will cause the Riak Control node to continually go down because of max restarts?

@cmeiklejohn
Contributor Author

@rzezeski

I've added a capability that will prevent 1.4.1 nodes from sending the newer record to 1.2.x and 1.3.x nodes; however, there's still a situation where a 1.4.0 node running Control asks for information from a 1.4.1 node.

Handle race condition by triggering a badrpc exception, which has
backwards compatibility with <= 1.3.

Craft custom messages for 1.4.0 with the bad record format.
Capability negotiation has a race condition, so
it can't be relied on to determine 1.4.0, as a node
will momentarily look like a 1.4.0 even if it
is actually a 1.3.x node.
Capability negotiation is problematic because of race
conditions in negotiation.
@rzezeski
Contributor

I'm working on a riak test to verify this patch (so we don't have to test this by hand in the future). I've verified the crash upgrading from 1.2.1 to a recent master of Riak; the observed error is pasted below. I figure 1.2.1 -> current proves the same issue as 1.3 or 1.4 -> current. Now I'm going to test the patch while upgrading from the 1.3/1.4 versions.

2013-07-31 13:42:23.682 [error] <0.647.0> Supervisor riak_control_sup had child riak_control_session started with riak_control_session:start_link() at <0.648.0> exit with reason no case clause matching {member_info,'dev3@127.0.0.1',undefined,true,[],[],undefined,undefined,32996760000,4792588000,18248904,undefined,undefined} in riak_control_session:get_member_info/2 line 239 in context child_terminated
2013-07-31 13:42:23.683 [error] <0.3537.0> CRASH REPORT Process <0.3537.0> with 0 neighbours exited with reason: no case clause matching {member_info,'dev3@127.0.0.1',undefined,true,[],[],undefined,undefined,32997180000,4795068000,18249096,undefined,undefined} in riak_control_session:get_member_info/2 line 239 in gen_server:init_it/6 line 328

@rzezeski
Contributor

+1 to merge. The riak test I created fails without the patch and passes with it. So at least we know Riak Control should no longer crash a node in a mixed-cluster environment.

@cmeiklejohn
Contributor Author

Rebased to #122.

@cmeiklejohn
Contributor Author

Rather, rebased to #123.

rzezeski added a commit to basho/riak_test that referenced this pull request Jan 26, 2014
Some issues have been found when running a mixed cluster with Riak
Control enabled.  This tests the typical upgrade scenario of a
customer and verifies that Riak Control at least doesn't crash any of
the nodes while in a mixed cluster state.

See basho/riak_control#120
@seancribbs seancribbs deleted the csm-1.4-bad-record branch April 1, 2015 22:56