Optimize to better handle large rings/nodes #319
Merged
Conversation
Numerous processes concurrently read and access the ring in a variety of time-sensitive code paths. To make this efficient, riak_core uses mochiglobal, which exploits the Erlang constant pool to provide constant-time access to the ring without needing to copy data into individual process heaps. However, updating a mochiglobal value is very slow, and becomes slower the larger the item being stored. In Riak clusters with large rings and many nodes, storing a ring in mochiglobal can take seconds. This delay is too long during periods of high gossip, where hundreds of ring events are being triggered per second. The riak_core_ring_manager will stall and start to exhibit message queue growth. This commit uses a hybrid approach to solve this problem. When a ring is first written, it is written to a shared ETS table. If no ring events have occurred for 90 seconds, the ring is then promoted to mochiglobal. This provides fast updates during periods of ring churn, while eventually providing very fast reads after the ring stabilizes. The downside is that reading from the ETS table before promotion is slower than mochiglobal, and requires copying the ring into individual process heaps.
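The write-fast/read-fast trade-off described above can be modeled with a short sketch. This is a hypothetical Python analogue of the hybrid scheme, not the actual riak_core code: the class and method names are invented for illustration, with a plain attribute standing in for ETS and an "immutable snapshot" standing in for mochiglobal.

```python
PROMOTE_AFTER = 90.0  # seconds of quiet before promotion (the 90s from the text)

class HybridRingStore:
    """Toy model of the hybrid scheme: writes land in a cheap mutable
    slot (standing in for ETS); after a quiet period the value is
    promoted to a read-optimized slot (standing in for mochiglobal)."""

    def __init__(self):
        self._ets = None          # fast-to-write location
        self._constant = None     # fast-to-read location after promotion
        self._last_write = None

    def set_ring(self, ring, now):
        self._ets = ring          # cheap write; no runtime code generation
        self._constant = None     # invalidate any previously promoted copy
        self._last_write = now

    def maybe_promote(self, now):
        # Expensive one-time step in real life (compiling a constant);
        # only done once the ring has stopped churning.
        if self._ets is not None and now - self._last_write >= PROMOTE_AFTER:
            self._constant = self._ets
            self._ets = None

    def get_ring(self):
        return self._constant if self._constant is not None else self._ets

store = HybridRingStore()
store.set_ring({"owners": ["n1", "n2"]}, now=0.0)
store.maybe_promote(now=10.0)   # too soon: still served from the "ETS" slot
assert store._constant is None
store.maybe_promote(now=95.0)   # quiet for 90s: promoted to the fast-read slot
assert store._constant is not None
```

During churn every `set_ring` is cheap; the expensive promotion cost is paid once, after the ring stabilizes.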
Several minor changes were made to fix pre-existing warnings that would now prevent the build given the `warnings_as_errors` option.
The hybrid ETS/mochiglobal approach solves the fast update problem, but hurts ring read performance during ring transitions. This issue can be avoided by exploiting the fact that the majority of Riak only reads the ring in order to determine index owners, preference lists, and bucket properties. This commit provides an alternate approach to getting access to these values without reading the entire ring. First, this commit creates a binary version of the chash structure which is stored in the ring manager's ETS table. This chashbin structure is fast to copy between processes due to off-heap binary sharing. Furthermore, this structure provides a secondary benefit of being much faster than the traditional chash structure for normal operations. Second, this commit extracts bucket metadata from the ring, storing the values in the ring manager's ETS table. This allows processes to directly query a given bucket's metadata without reading the ring.
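The key property of the chashbin structure is that the ring's (index, owner) records are packed into a single flat binary, which Erlang shares off-heap between processes instead of copying. A toy Python encoding (the record layout and helper names here are invented for illustration; they are not the actual chashbin format) looks like this:

```python
import struct

# Toy encoding of a chashbin-like structure: fixed-width records of
# (partition index, owner id) packed into one immutable bytes object.
# In Erlang, a large binary like this lives off-heap and is shared by
# reference, so passing it to another process does not copy the ring.
nodes = ["node1", "node2", "node3"]
ring = [(i * 100, nodes[i % 3]) for i in range(8)]  # hypothetical 8-partition ring

# Each record: 8-byte partition index + 2-byte owner id = 10 bytes.
packed = b"".join(struct.pack(">QH", idx, nodes.index(o)) for idx, o in ring)

def owner_of(slot):
    """O(1) fixed-offset lookup into the packed binary."""
    idx, owner_id = struct.unpack_from(">QH", packed, slot * 10)
    return idx, nodes[owner_id]

assert owner_of(4) == (400, "node2")
```

Fixed-width records make owner lookup a constant-offset read, which is part of why the binary form also outperforms the traditional chash structure for normal operations.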
This commit changes riak_core_apl to use the new chashbin structure, avoiding the need to read the ring in order to compute preflists. This change also makes preflist calculation significantly faster given the better performance of the chashbin structure over the traditional chash structure. Specifically, traditional preflist calculation time scales with the ring size whereas the new approach is essentially constant regardless of ring size.
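The shape of that preflist computation can be sketched as follows. This is an illustrative Python analogue, not riak_core code: the round-robin ownership and the key-hashing details are assumptions made to keep the example self-contained (riak_core hashes with SHA-1, but over an Erlang term, and walks a richer iterator).

```python
import bisect
import hashlib

RING_SIZE = 64
nodes = ["node1", "node2", "node3"]
# Hypothetical ring: partitions owned round-robin across the nodes.
owners = [nodes[i % len(nodes)] for i in range(RING_SIZE)]
increment = 2 ** 160 // RING_SIZE
indices = [i * increment for i in range(RING_SIZE)]  # sorted partition boundaries

def preflist(key, n):
    h = int.from_bytes(hashlib.sha1(key).digest(), "big")
    # Binary search for the first partition past the hash: O(log ring_size)
    # and independent of ring size in practice, versus walking the chash
    # owner list, which is O(ring_size).
    p = bisect.bisect_right(indices, h) % RING_SIZE
    return [(indices[(p + i) % RING_SIZE], owners[(p + i) % RING_SIZE])
            for i in range(n)]

pl = preflist(b"bucket/key", 3)
assert len(pl) == 3
assert all(owner in nodes for _, owner in pl)
```

The expensive part of the old path was not the search itself but reading the whole ring first; with the chashbin in ETS, only the small packed structure is touched.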
Change riak_core_apl to use an unordered list and lists:member rather than ordsets and ordsets:is_element for determining up nodes. This approach is much faster in practice. Both ordsets:is_element and lists:member are O(N); however, lists:member is a BIF and has a significantly lower constant factor. Likewise, this approach avoids the lists:usort performed by ordsets:from_list.
The standard ring event handler sends an event for every ring received. The vnode manager and capability manager do not care about every ring received; they only care if the local ring actually changes. In other words, if a ring is received that provides no new information, and therefore the local ring does not change, there is no reason to trigger an event. This commit changes these managers to no longer listen to ring events, but instead poll the ring manager at a fixed frequency. This not only solves the above problem, but also allows the managers to ignore high-frequency ring events. Given that the common case is for the ring to remain stable, this change also introduces the notion of the "ring id", which is a monotonic sequence number for the ring maintained by the ring manager. The periodic polling then simplifies to checking whether the ring id has changed, and only performing the more expensive state update if so.
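The ring-id polling pattern can be sketched in a few lines. This is a hypothetical Python model (the class names are invented; in riak_core these roles belong to the ring manager and the vnode/capability managers):

```python
class RingManager:
    """Toy model: bumps a monotonic ring id only when the ring actually
    changes, so pollers can compare ids instead of diffing whole rings."""

    def __init__(self, ring):
        self._ring = ring
        self._id = 0

    def set_ring(self, ring):
        if ring != self._ring:   # no-op gossip does not bump the id
            self._ring = ring
            self._id += 1

    def ring_id(self):
        return self._id

class PollingManager:
    """Stands in for the vnode/capability managers: on each poll tick,
    check the cheap ring id first; do expensive work only on change."""

    def __init__(self, rm):
        self.rm = rm
        self.seen_id = rm.ring_id()
        self.updates = 0

    def tick(self):
        rid = self.rm.ring_id()
        if rid != self.seen_id:  # cheap integer compare in the common case
            self.seen_id = rid
            self.updates += 1    # expensive state update would go here

rm = RingManager({"owners": ["n1"]})
pm = PollingManager(rm)
rm.set_ring({"owners": ["n1"]})        # gossip carrying no new information
pm.tick()
rm.set_ring({"owners": ["n1", "n2"]})  # a real change
pm.tick()
pm.tick()
assert pm.updates == 1                 # only the real change triggered work
```

A burst of hundreds of gossip events between two poll ticks collapses into at most one expensive update.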
Change certain uses of orddict to dict where the size of the structure depends on either the ring size or number of nodes in a cluster. Change update_never_started to only query ring indices when there are actually unknown modules.
Change from O(n*m) to O(m) + O(n*log32(m))
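The orddict-to-dict change above can be illustrated with a sketch. An orddict is an association list, so n lookups against it cost O(n*m); converting once to a hash-based map costs O(m), after which each lookup is near-constant (Erlang's dict is a hash structure, hence the log32 term). The data below is invented for illustration:

```python
# Hypothetical association-list data, standing in for an Erlang orddict.
orddict = [("idx%03d" % i, "owner%d" % (i % 4)) for i in range(100)]

def lookup_orddict(key):
    for k, v in orddict:   # O(m) scan per lookup, as in orddict:fetch
        if k == key:
            return v
    return None

hashed = dict(orddict)     # one O(m) conversion, as in dict:from_list

keys = ["idx007", "idx042", "idx099"]
# n lookups: O(n*m) via the scan, roughly O(n*log32(m)) via the hash map.
assert [lookup_orddict(k) for k in keys] == [hashed[k] for k in keys]
```

When m tracks ring size or cluster size, the scan-per-lookup cost dominates; the one-time conversion plus cheap lookups is the O(m) + O(n*log32(m)) shape from the commit message.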
The goal is to prevent multiple vnodes timing out simultaneously and overloading the vnode manager with tons of messages.
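A common way to prevent such thundering-herd timeouts is to splay each timer with random jitter. The sketch below is a generic illustration of that idea; the constants and function name are assumptions, not riak_core's actual values.

```python
import random

BASE_TIMEOUT_MS = 10_000  # hypothetical base inactivity timeout

def splayed_timeout(base=BASE_TIMEOUT_MS, splay=0.25):
    """Add up to 25% random jitter so vnodes started together do not
    all fire their timers in the same instant and flood the manager.
    (Illustrative constants; not riak_core's actual settings.)"""
    return base + random.randint(0, int(base * splay))

timeouts = [splayed_timeout() for _ in range(1000)]
assert all(BASE_TIMEOUT_MS <= t <= BASE_TIMEOUT_MS * 1.25 for t in timeouts)
```

Instead of one burst of messages at the shared deadline, the timeouts spread across the jitter window.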
Contributor
Discussed some surface-level changes (e.g. comments/test updates) with @jtuple offline; unable to line-comment on this PR currently due to a GH problem, however. Besides making those small changes, +1. This PR clearly improves performance of large clusters. Ran a simple test: try to make a 2-node cluster of ring size 2048 before and after this change (aae off; EDIT: also ran a few riak tests including
- Add module edoc to riak_core_ring_manager discussing the hybrid approach.
- Comment the approach used to update the bucket props in ETS.
- Ensure we do not silently ignore any errors in ensure_vnodes_started.
- Change chashbin:index_owner/2 error condition.
- Fix variable naming for chashbin:iterator/2.
jtuple added a commit that referenced this pull request on May 28, 2013:
Also, fix several build warnings since #319 includes a change to treat warnings as errors. Conflicts: src/riak_core_util.erl
jtuple added commits that referenced this pull request on May 30, 2013.
This pull-request consists of numerous changes that are all part of an effort to improve `riak_core` with respect to large rings and large clusters. As a data point, consider this mailing list post:
http://lists.basho.com/pipermail/riak-users_lists.basho.com/2013-March/011331.html
That email shows that adding a node to a Riak cluster with a 2048-size ring takes about an hour. I have seen similar results in my own testing. With the changes in this pull-request, adding a node to a cluster with a 16384-size ring takes about 5 minutes.
In general, this was a continual round of whack-a-mole. Profile to find bottleneck A, fix it; profile to find bottleneck B. Rinse/repeat. I've tried to organize the pull-request into logical, self-contained commits. For code review, the best approach is to go through commit-by-commit, being sure to read the detailed commit messages provided.
This pull-request requires basho/riak_kv#554
As a whole, the primary changes are as follows.
First, it's been long known that using `mochiglobal` to read/write the ring has been a blessing and a curse. `mochiglobal` reads are very, very fast: they take advantage of the Erlang constant pool and allow multiple processes to read the same data without locks and without copying data into individual process heaps. Unfortunately, `mochiglobal` writes are very, very slow, and they get slower the larger the ring is. The majority of this time is spent not in the code loader but inside code that converts the bare term into the abstract syntax tree that is passed to the Erlang compiler.

These slow writes hurt Riak during periods of high gossip traffic (such as cluster transitions), where hundreds of ring events may be triggered per minute. The `riak_core_ring_manager` stalls, its message queue backs up, etc. Bad news. Riak is now CPU bound, spinning its wheels generating code at runtime.

The solution? A hybrid approach. When a ring is first written, it's written to a shared ETS table, and the atom `ets` is written to mochiglobal. If the ring doesn't change for 90 seconds, it is then promoted to mochiglobal. Thus, we get fast writes during high ring churn, and fast reads once things stabilize.

Of course, reading from ETS is slower than reading from mochiglobal, predominantly due to the need to copy the data into the process heap, which for large rings is MBs of data.
To combat this, so that critical paths of Riak are not heavily affected during cluster transitions, the second change is to extract certain elements out of the ring and provide an alternate means to query them. Bucket properties are stored in the ETS table, and a binary version of the `chash` structure called `chashbin` is generated. Since `chashbin` uses a binary for the bulk of its data, the structure is quick to copy between processes thanks to off-heap binary sharing. The `chashbin` structure also provides nearly constant-time behavior for a variety of operations that the traditional `chash` structure performed in O(N) time. So, it's a win/win.

Various other commits modify different aspects of Riak to use the `chashbin` structure: the preflist calculation in `riak_core_apl`, the coverage calculations, various parts of the vnode manager, etc. See the commits for more details.

The next set of changes optimizes various other things. This consists mostly of taking algorithms that are O(N^3), O(N^2), or O(N) with large constant factors, where N is either the ring size or the number of nodes in the cluster, and replacing them with different algorithms and/or data structures of reduced complexity. Again, see the commits for details.

There is also a change to how ring events are triggered in a few manager processes. The traditional ring event system triggers an event for every ring received. Most things in Riak only care if the ring has actually changed. Thus, a new approach was built that reduces the number of events generated by coalescing multiple events together. This change, along with some of the algorithmic changes, makes the `riak_core_ring_manager`, `riak_core_vnode_manager`, and `riak_core_capability_manager` much more lightweight, whereas before they were the three dominant processes by number of reductions.

Finally, there are some other minor changes.
There's a lot of benchmarking/profiling I've done during this work, but most of it became outdated as I made further changes, or I didn't keep good notes. I will add more details in the coming days, but wanted to get the code out for review.
One set of information I do have around is from a micro-benchmark that compares the cost of computing preference lists in Riak between `master`, this branch using the non-`chashbin` approach, and this branch using the `chashbin` approach. The results are as follows (time in microseconds):