Optimize to better handle large rings/nodes #319
Merged
Conversation
Numerous processes concurrently read and access the ring in a variety of time-sensitive code paths. To make this efficient, riak_core uses mochiglobal, which exploits the Erlang constant pool to provide constant-time access to the ring without needing to copy data into individual process heaps. However, updating a mochiglobal value is very slow, and becomes slower the larger the item being stored. In Riak clusters with large rings and many nodes, storing a ring in mochiglobal can take seconds. This delay is too long during periods of high gossip, where hundreds of ring events are being triggered per second. The riak_core_ring_manager will stall and start to exhibit message queue growth. This commit uses a hybrid approach to solve this problem. When a ring is first written, it is written to a shared ETS table. If no ring events have occurred for 90 seconds, the ring is then promoted to mochiglobal. This provides fast updates during periods of ring churn, while eventually providing very fast reads after the ring stabilizes. The downside is that reading from the ETS table before promotion is slower than mochiglobal, and requires copying the ring into individual process heaps.
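The write-fast/read-fast trade-off described above can be modeled with a short sketch. This is a hypothetical Python analogue of the hybrid scheme, not the actual riak_core code: the class and method names are invented for illustration, with a plain attribute standing in for ETS and an "immutable snapshot" standing in for mochiglobal.

```python
PROMOTE_AFTER = 90.0  # seconds of quiet before promotion (the 90s from the text)

class HybridRingStore:
    """Toy model of the hybrid scheme: writes land in a cheap mutable
    slot (standing in for ETS); after a quiet period the value is
    promoted to a read-optimized slot (standing in for mochiglobal)."""

    def __init__(self):
        self._ets = None          # fast-to-write location
        self._constant = None     # fast-to-read location after promotion
        self._last_write = None

    def set_ring(self, ring, now):
        self._ets = ring          # cheap write; no runtime code generation
        self._constant = None     # invalidate any previously promoted copy
        self._last_write = now

    def maybe_promote(self, now):
        # Expensive one-time step in real life (compiling a constant);
        # only done once the ring has stopped churning.
        if self._ets is not None and now - self._last_write >= PROMOTE_AFTER:
            self._constant = self._ets
            self._ets = None

    def get_ring(self):
        return self._constant if self._constant is not None else self._ets

store = HybridRingStore()
store.set_ring({"owners": ["n1", "n2"]}, now=0.0)
store.maybe_promote(now=10.0)   # too soon: still served from the "ETS" slot
assert store._constant is None
store.maybe_promote(now=95.0)   # quiet for 90s: promoted to the fast-read slot
assert store._constant is not None
```

During churn every `set_ring` is cheap; the expensive promotion cost is paid once, after the ring stabilizes.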
Several minor changes were made to fix pre-existing warnings that would now prevent the build given the `warnings_as_errors` option.
The hybrid ETS/mochiglobal approach solves the fast update problem, but hurts ring read performance during ring transitions. This issue can be avoided by exploiting the fact that the majority of Riak only reads the ring in order to determine index owners, preference lists, and bucket properties. This commit provides an alternate approach to getting access to these values without reading the entire ring. First, this commit creates a binary version of the chash structure which is stored in the ring manager's ETS table. This chashbin structure is fast to copy between processes due to off-heap binary sharing. Furthermore, this structure provides a secondary benefit of being much faster than the traditional chash structure for normal operations. Second, this commit extracts bucket metadata from the ring, storing the values in the ring manager's ETS table. This allows processes to directly query a given bucket's metadata without reading the ring.
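The key property of the chashbin structure is that the ring's (index, owner) records are packed into a single flat binary, which Erlang shares off-heap between processes instead of copying. A toy Python encoding (the record layout and helper names here are invented for illustration; they are not the actual chashbin format) looks like this:

```python
import struct

# Toy encoding of a chashbin-like structure: fixed-width records of
# (partition index, owner id) packed into one immutable bytes object.
# In Erlang, a large binary like this lives off-heap and is shared by
# reference, so passing it to another process does not copy the ring.
nodes = ["node1", "node2", "node3"]
ring = [(i * 100, nodes[i % 3]) for i in range(8)]  # hypothetical 8-partition ring

# Each record: 8-byte partition index + 2-byte owner id = 10 bytes.
packed = b"".join(struct.pack(">QH", idx, nodes.index(o)) for idx, o in ring)

def owner_of(slot):
    """O(1) fixed-offset lookup into the packed binary."""
    idx, owner_id = struct.unpack_from(">QH", packed, slot * 10)
    return idx, nodes[owner_id]

assert owner_of(4) == (400, "node2")
```

Fixed-width records make owner lookup a constant-offset read, which is part of why the binary form also outperforms the traditional chash structure for normal operations.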
This commit changes riak_core_apl to use the new chashbin structure, avoiding the need to read the ring in order to compute preflists. This change also makes preflist calculation significantly faster given the better performance of the chashbin structure over the traditional chash structure. Specifically, traditional preflist calculation time scales with the ring size whereas the new approach is essentially constant regardless of ring size.
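The shape of that preflist computation can be sketched as follows. This is an illustrative Python analogue, not riak_core code: the round-robin ownership and the key-hashing details are assumptions made to keep the example self-contained (riak_core hashes with SHA-1, but over an Erlang term, and walks a richer iterator).

```python
import bisect
import hashlib

RING_SIZE = 64
nodes = ["node1", "node2", "node3"]
# Hypothetical ring: partitions owned round-robin across the nodes.
owners = [nodes[i % len(nodes)] for i in range(RING_SIZE)]
increment = 2 ** 160 // RING_SIZE
indices = [i * increment for i in range(RING_SIZE)]  # sorted partition boundaries

def preflist(key, n):
    h = int.from_bytes(hashlib.sha1(key).digest(), "big")
    # Binary search for the first partition past the hash: O(log ring_size)
    # and independent of ring size in practice, versus walking the chash
    # owner list, which is O(ring_size).
    p = bisect.bisect_right(indices, h) % RING_SIZE
    return [(indices[(p + i) % RING_SIZE], owners[(p + i) % RING_SIZE])
            for i in range(n)]

pl = preflist(b"bucket/key", 3)
assert len(pl) == 3
assert all(owner in nodes for _, owner in pl)
```

The expensive part of the old path was not the search itself but reading the whole ring first; with the chashbin in ETS, only the small packed structure is touched.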
Change riak_core_apl to use an unordered list and lists:member rather than ordsets and ordsets:is_element for determining up nodes. This approach is much faster in practice. Both ordsets:is_element and lists:member are O(N); however, lists:member is a BIF and has a significantly lower constant factor. Likewise, this approach avoids the lists:usort performed by ordsets:from_list.
The standard ring event handler sends an event for every ring received. The vnode manager and capability manager do not care about every ring received; they only care if the local ring actually changes. In other words, if a ring is received that provides no new information, and therefore the local ring does not change, there is no reason to trigger an event. This commit changes these managers to no longer listen to ring events, but instead poll the ring manager at a fixed frequency. This not only solves the above problem, but also allows the managers to ignore high-frequency ring events. Given that the common case is for the ring to remain stable, this change also introduces the notion of the "ring id", which is a monotonic sequence number for the ring maintained by the ring manager. The periodic polling then simplifies to checking whether the ring id has changed, and only performing the more expensive state update if so.
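The ring-id polling pattern can be sketched in a few lines. This is a hypothetical Python model (the class names are invented; in riak_core these roles belong to the ring manager and the vnode/capability managers):

```python
class RingManager:
    """Toy model: bumps a monotonic ring id only when the ring actually
    changes, so pollers can compare ids instead of diffing whole rings."""

    def __init__(self, ring):
        self._ring = ring
        self._id = 0

    def set_ring(self, ring):
        if ring != self._ring:   # no-op gossip does not bump the id
            self._ring = ring
            self._id += 1

    def ring_id(self):
        return self._id

class PollingManager:
    """Stands in for the vnode/capability managers: on each poll tick,
    check the cheap ring id first; do expensive work only on change."""

    def __init__(self, rm):
        self.rm = rm
        self.seen_id = rm.ring_id()
        self.updates = 0

    def tick(self):
        rid = self.rm.ring_id()
        if rid != self.seen_id:  # cheap integer compare in the common case
            self.seen_id = rid
            self.updates += 1    # expensive state update would go here

rm = RingManager({"owners": ["n1"]})
pm = PollingManager(rm)
rm.set_ring({"owners": ["n1"]})        # gossip carrying no new information
pm.tick()
rm.set_ring({"owners": ["n1", "n2"]})  # a real change
pm.tick()
pm.tick()
assert pm.updates == 1                 # only the real change triggered work
```

A burst of hundreds of gossip events between two poll ticks collapses into at most one expensive update.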
Change certain uses of orddict to dict where the size of the structure depends on either the ring size or number of nodes in a cluster. Change update_never_started to only query ring indices when there are actually unknown modules.
Change from O(n*m) to O(m) + O(n*log32(m))
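The orddict-to-dict change above can be illustrated with a sketch. An orddict is an association list, so n lookups against it cost O(n*m); converting once to a hash-based map costs O(m), after which each lookup is near-constant (Erlang's dict is a hash structure, hence the log32 term). The data below is invented for illustration:

```python
# Hypothetical association-list data, standing in for an Erlang orddict.
orddict = [("idx%03d" % i, "owner%d" % (i % 4)) for i in range(100)]

def lookup_orddict(key):
    for k, v in orddict:   # O(m) scan per lookup, as in orddict:fetch
        if k == key:
            return v
    return None

hashed = dict(orddict)     # one O(m) conversion, as in dict:from_list

keys = ["idx007", "idx042", "idx099"]
# n lookups: O(n*m) via the scan, roughly O(n*log32(m)) via the hash map.
assert [lookup_orddict(k) for k in keys] == [hashed[k] for k in keys]
```

When m tracks ring size or cluster size, the scan-per-lookup cost dominates; the one-time conversion plus cheap lookups is the O(m) + O(n*log32(m)) shape from the commit message.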
The goal is to prevent multiple vnodes timing out simultaneously and overloading the vnode manager with tons of messages.
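A common way to prevent such thundering-herd timeouts is to splay each timer with random jitter. The sketch below is a generic illustration of that idea; the constants and function name are assumptions, not riak_core's actual values.

```python
import random

BASE_TIMEOUT_MS = 10_000  # hypothetical base inactivity timeout

def splayed_timeout(base=BASE_TIMEOUT_MS, splay=0.25):
    """Add up to 25% random jitter so vnodes started together do not
    all fire their timers in the same instant and flood the manager.
    (Illustrative constants; not riak_core's actual settings.)"""
    return base + random.randint(0, int(base * splay))

timeouts = [splayed_timeout() for _ in range(1000)]
assert all(BASE_TIMEOUT_MS <= t <= BASE_TIMEOUT_MS * 1.25 for t in timeouts)
```

Instead of one burst of messages at the shared deadline, the timeouts spread across the jitter window.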
Contributor
Discussed some surface-level changes (e.g. comments/test updates) with @jtuple offline; unable to line-comment on this PR currently due to a GH problem, however. Besides making those small changes, +1. This PR clearly improves performance of large clusters. Ran a simple test: try to make a 2-node cluster of ring size 2048 before and after this change (aae off; EDIT: also ran a few riak tests including
- Add module edoc to riak_core_ring_manager discussing the hybrid approach.
- Comment the approach used to update the bucket props in ETS.
- Ensure we do not silently ignore any errors in ensure_vnodes_started.
- Change chashbin:index_owner/2 error condition.
- Fix variable naming for chashbin:iterator/2.
jtuple added a commit that referenced this pull request on May 28, 2013:
Also, fix several build warnings since #319 includes a change to treat warnings as errors. Conflicts: src/riak_core_util.erl
jtuple added commits that referenced this pull request on May 30, 2013.
This pull-request consists of numerous changes that are all part of an effort to improve `riak_core` with respect to large rings and large clusters. As a data point, consider this mailing list post:
http://lists.basho.com/pipermail/riak-users_lists.basho.com/2013-March/011331.html
That email shows that adding a node to a Riak cluster with a 2048-size ring takes about an hour. I have seen similar results in my own testing. With the changes in this pull-request, adding a node to a cluster with a 16384-size ring takes about 5 minutes.
In general, this was a continual round of whack-a-mole. Profile to find bottleneck A, fix it; profile to find bottleneck B. Rinse/repeat. I've tried to organize the pull-request into logical, self-contained commits. For code review, the best approach is to go through commit-by-commit, being sure to read the detailed commit messages provided.
This pull-request requires basho/riak_kv#554
As a whole, the primary changes are as follows.
First, it's been long known that using `mochiglobal` to read/write the ring has been a blessing and a curse. `mochiglobal` reads are very, very fast: they take advantage of the Erlang constant pool and allow multiple processes to read the same data without locks and without copying data into individual process heaps. Unfortunately, `mochiglobal` writes are very, very slow, and they get slower the larger the ring is. The majority of this time is spent not in the code loader but inside code that converts the bare term into the abstract syntax tree that is passed to the Erlang compiler.

These slow writes hurt Riak during periods of high gossip traffic (such as cluster transitions), where hundreds of ring events may be triggered per minute. The `riak_core_ring_manager` stalls, its message queue backs up, etc. Bad news. Riak is now CPU bound, spinning its wheels generating code at runtime.

The solution? A hybrid approach. When a ring is first written, it's written to a shared ETS table, and the atom `ets` is written to mochiglobal. If the ring doesn't change for 90 seconds, it is then promoted to mochiglobal. Thus, we get fast writes during high ring churn, and fast reads once things stabilize.

Of course, reading from ETS is slower than reading from mochiglobal, predominantly due to the need to copy the data into the process heap, which for large rings is MBs of data.
To combat this, so that critical paths of Riak are not heavily affected during cluster transitions, the second change is to extract certain elements out of the ring and provide an alternate means to query them. Bucket properties are stored in the ETS table, and a binary version of the `chash` structure called `chashbin` is generated. Since `chashbin` uses a binary for the bulk of its data, the structure is quick to copy between processes thanks to off-heap binary sharing. The `chashbin` structure also provides nearly constant-time behavior for a variety of operations that the traditional `chash` structure performed in O(N) time. So, it's a win/win.

Various other commits modify different aspects of Riak to use the `chashbin` structure: the preflist calculation in `riak_core_apl`, the coverage calculations, various parts of the vnode manager, etc. See the commits for more details.

The next set of changes optimizes various other things. This consists mostly of taking algorithms that are O(N^3), O(N^2), or O(N) with large constant factors, where N is either the ring size or the number of nodes in the cluster, and replacing them with different algorithms and/or data structures of reduced complexity. Again, see the commits for details.

There is also a change to how ring events are triggered in a few manager processes. The traditional ring event system triggers an event for every ring received. Most things in Riak only care if the ring has actually changed. Thus, a new approach was built that reduces the number of events generated by coalescing multiple events together. This change, along with some of the algorithmic changes, makes the `riak_core_ring_manager`, `riak_core_vnode_manager`, and `riak_core_capability_manager` much more lightweight, whereas before they were the three dominant processes by number of reductions.

Finally, there are some other minor changes.
There's a lot of benchmarking/profiling I've done during this work, but most of it became outdated as I made further changes, or I didn't keep good notes. I will add more details in the coming days, but wanted to get the code out for review.
One set of information I do have around is from a micro-benchmark that compares the cost of computing preference lists in Riak between `master`, this branch using the non-`chashbin` approach, and this branch using the `chashbin` approach. The results are as follows (time in microseconds):