
Handle incompatible records between the 1.3 and 1.4 release#120

Closed
cmeiklejohn wants to merge 15 commits into 1.4 from csm-1.4-bad-record

Conversation

@cmeiklejohn
Contributor

I've confirmed a problem with a rolling upgrade between the 1.3 and 1.4 versions of Riak.

The root of the problem is that the node hosting Riak Control initiates an RPC request to all nodes and pattern matches the result against a record. The format of this record changed between the 1.3 and 1.4 releases without a new record name being introduced, so the pattern match fails.

This can occur when either:

1.) Riak Control hosted on a 1.3 node makes an RPC request to a 1.4 node.
2.) Riak Control hosted on a 1.4 node makes an RPC request to a 1.3 node.

Both of these scenarios trigger a badmatch exception in the riak_control_session module. This causes the Riak supervisor to restart the riak_control application, which then continues to crash while polling for state until the maximum restart threshold is reached and the node shuts down.
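To see why the match fails: an Erlang record is just a tagged tuple, so adding fields changes the tuple's arity, and any pattern written against the old definition raises badmatch. A minimal sketch (the record name is real, but the fields are simplified, not the actual member_info definition):

```erlang
-module(record_mismatch).
-export([demo/0]).

%% "1.3-era" record, compiled as the tuple {member_info, Node, Status}.
-record(member_info, {node, status}).

demo() ->
    %% A 1.4 node returns the same record name with an extra field,
    %% i.e. a 4-tuple instead of the expected 3-tuple.
    RemoteResult = {member_info, 'dev1@127.0.0.1', valid, extra_field},
    try
        %% This pattern expects a 3-tuple; the arity mismatch raises
        %% badmatch -- the same class of error that crashed
        %% riak_control_session.
        #member_info{node = Node} = RemoteResult,
        {ok, Node}
    catch
        error:{badmatch, _} ->
            incompatible
    end.
```
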

This patch provides a few things:

  • Use macros for record definitions moving forward.
  • Make the 1.3 record the definitive member_info record, and the proposed-1.4.1 record the definitive member_info_v2 record.
  • When receiving an incompatible 1.4 record, display the node as incompatible.
  • Suppress display of node-specific information when an incompatible node is returned.
  • Fix failed low-watermark handling in node resource.
  • Add a capability to ensure that we don't send 1.4.1+ records to 1.2.x through 1.3.x nodes.
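The "display the node as incompatible" handling above can be sketched roughly like this, with both record versions reduced to a couple of fields (illustrative only, not the actual riak_control_session code):

```erlang
-module(member_info_compat).
-export([classify/1]).

-record(member_info, {node, status}).          %% definitive 1.3 format (simplified)
-record(member_info_v2, {node, status, vsn}).  %% proposed 1.4.1 format (simplified)

%% Normalize whatever a remote node returns. Anything we don't
%% recognize is displayed as an incompatible node instead of
%% crashing the session process.
classify(#member_info{node = Node, status = Status}) ->
    {Node, Status};
classify(#member_info_v2{node = Node, status = Status}) ->
    {Node, Status};
classify(Other) when is_tuple(Other), element(1, Other) =:= member_info ->
    %% Same record name but an unexpected arity: a 1.4.0-style record.
    {element(2, Other), incompatible};
classify(_) ->
    {unknown, incompatible}.
```
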

Updates

  • Capability negotiation could not be used reliably because of race conditions when adding a node with a greater capability than the existing cluster, which would trigger node failures.
  • Instead, the RPC call was renamed, with the original call triggering a badrpc which is handled.
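The renamed-call approach works because invoking a function that doesn't exist on the remote node yields a {badrpc, ...} tuple rather than a record in the wrong format, and that tuple is easy to handle. A sketch, with hypothetical function names:

```erlang
-module(rpc_compat).
-export([get_status/1]).

%% Call the new (renamed) entry point first. On nodes that predate
%% the rename, rpc:call returns {badrpc, {'EXIT', {undef, _}}}, so we
%% fall back to the legacy call. Function names are illustrative.
get_status(Node) ->
    case rpc:call(Node, riak_control_session, get_status_v2, []) of
        {badrpc, _Reason} ->
            %% Old node (or unreachable node): use the original,
            %% 1.3-compatible call instead.
            rpc:call(Node, riak_control_session, get_status, []);
        Result ->
            Result
    end.
```
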

Rolling Upgrades

  • With this patch, both Riak 1.3 (see riak_control#121) and Riak 1.4 (see riak_control#111, riak_control#120) will unfortunately throw webmachine errors on the nodes resource, because of badarith handling during the start of memsup or when the node is in an incompatible state. This is unfortunate, but the patch at least addresses the major cluster crash.
  • I've tested branches 1.2, 1.3 and 1.4 against a mixed cluster with 1.4.1 nodes and could no longer trigger the crash scenario; however, the webmachine issues above still occur.

I would appreciate someone testing these upgrade scenarios as well to ensure total coverage.

I've left the commits broken out for the reviewer, but will rebase before merge.

@jonmeredith @russelldb @seancribbs @jgnewman @bsparrow435 @rzezeski


What's wrong with matching against #member_info{} in this case? It seems the main difference is that the new record has additional fields.

A capability might also help, allowing new nodes to send the original record to old nodes.

Contributor Author


Can't match member_info because we can't guarantee a specific format for it. I've added a capability for handling legacy record formats, but we still don't have a solution for 1.4.0/1.4.1 compatibility.

@rzezeski
Contributor

This patch ensures that a proposed-1.4.1 node can successfully run Riak Control, and talk to 1.3 and 1.4 nodes, but an earlier 1.3 node will still exhibit errors talking to a 1.4+ node.

Does this mean that a mixed cluster where the Riak Control node is not updated first will cause the Riak Control node to continually go down because of max restarts?

@cmeiklejohn
Contributor Author

@rzezeski

I've added a capability that will prevent 1.4.1 nodes from sending the newer record to 1.2.x and 1.3.x nodes; however, there's still a situation where a 1.4.0 node running Control asks for information from a 1.4.1 node.

Handle race condition by triggering a badrpc exception, which has
backwards compatibility with <= 1.3.

Craft custom messages for 1.4.0 with the bad record format.
Capability negotiation has a race condition, so
it can't be relied on to determine 1.4.0, as a node
will momentarily look like a 1.4.0 even if it
is actually a 1.3.x node.
Capability negotiation is problematic because of race
conditions in negotiation.
@rzezeski
Contributor

I'm working on a riak test to verify this patch (so we don't have to test this by hand in the future). I've verified the crash upgrading from 1.2.1 to a recent master of Riak; the observed error is pasted below. I figure 1.2.1 -> current proves the same issue as 1.3 or 1.4 -> current. Now I'm going to test the patch while upgrading from the 1.3/1.4 versions.

2013-07-31 13:42:23.682 [error] <0.647.0> Supervisor riak_control_sup had child riak_control_session started with riak_control_session:start_link() at <0.648.0> exit with reason no case clause matching {member_info,'dev3@127.0.0.1',undefined,true,[],[],undefined,undefined,32996760000,4792588000,18248904,undefined,undefined} in riak_control_session:get_member_info/2 line 239 in context child_terminated
2013-07-31 13:42:23.683 [error] <0.3537.0> CRASH REPORT Process <0.3537.0> with 0 neighbours exited with reason: no case clause matching {member_info,'dev3@127.0.0.1',undefined,true,[],[],undefined,undefined,32997180000,4795068000,18249096,undefined,undefined} in riak_control_session:get_member_info/2 line 239 in gen_server:init_it/6 line 328

@rzezeski
Contributor

+1 to merge. The riak test I created fails without the patch and passes with it. So at least we know Riak Control should no longer crash a node in a mixed-cluster environment.

@cmeiklejohn
Contributor Author

Rebased to #122.

@cmeiklejohn
Contributor Author

Rather, rebased to #123.

rzezeski added a commit to basho/riak_test that referenced this pull request Jan 26, 2014
Some issues have been found when running a mixed cluster with Riak
Control enabled.  This tests the typical upgrade scenario of a
customer and verifies that Riak Control at least doesn't crash any of
the nodes while in a mixed cluster state.

See basho/riak_control#120
@seancribbs seancribbs deleted the csm-1.4-bad-record branch April 1, 2015 22:56