In last post, we were introduced to concepts like hashing. In this post, we will learn in detail why “just” hashing will not work and why there is a need for an algorithm like consistent hashing. In specific, we will try to know more about the algorithm that is developed by google.
We now know that we have load balance the requests across multiple data centers. Consider an example for why classic hashing technique is not sufficient. If you have a collection of n cache machines then a common way of load balancing across them is to put object o in cache machine number hash(o) mod n. This works well until you add or remove cache machines (for whatever reason), for then n changes and every object is hashed to a new location.
This is why consistent hashing comes into the picture. It is interesting to note that it is only the client that needs to implement the consistent hashing algorithm – the memcached server is unchanged. Other systems that employ consistent hashing include Chord, which is a distributed hash table implementation, and Amazon’s Dynamo, which is a key-value store (not available outside Amazon).
The algorithm which we will be discussing in the post is called as “consistent hashing with bounded loads”. The main aim for the algorithm is to achieve both uniformity and consistency in the resulting allocations.
We can think about the servers as bins and clients as balls.
The uniformity objective encourages all bins to have a load roughly equal to the average density (the number of balls divided by the number of bins). For some parameter ε, we set the capacity of each bin to either floor or ceiling of the average load times (1+ε). This extra capacity allows us to design an allocation algorithm that meets the consistency objective in addition to the uniformity property.
Imagine a given range of numbers overlaid on a circle. We apply a hash function to balls and a separate hash function to bins to obtain numbers in that range that correspond to positions on that circle. We then start allocating balls in a specific order independent of their hash values (let’s say based on their ID). Then each ball is moved clockwise and is assigned to the first bin with spare capacity.
Consider the example above where 6 balls and 3 bins are assigned using two separate hash functions to random locations on the circle. For the sake of this instance, assume the capacity of each bin is set to 2. We start allocating balls in the increasing order of their ID values. Ball number 1 moves clockwise and goes to bin C. Ball number 2 goes to A. Balls 3 and 4 go to bin B. Ball number 5 goes to bin C. Then ball number 6 moves clockwise and hits bin B first. However, bin B has capacity 2 and already contains balls 3 and 4. So ball 6 keeps moving to reach bin C but that bin is also full. Finally, ball 6 ends up in bin A that has a spare slot for it.
Upon any update in the system (ball or bin insertion/deletion), the allocation is recomputed to keep the uniformity objective. The art of the analysis is to show that a small update (a few number of insertions and deletions) results in minor changes in the state of the allocation and therefore the consistency objective is met. In the paper, its also show that every ball removal or insertion in the system results in O(1/ε2) movements of other balls.
This algorithm is just not theoretical! This is in fact implemented in one of the famous companies. Andrew Rodland from Vimeo had found the paper and used it for their load balancing project at Vimeo. The results were dramatic: applying these algorithmic ideas helped them decrease the cache bandwidth by a factor of almost 8, eliminating a scaling bottleneck. He recently summarized this story in a blog post detailing his use case.
Check out simple implementation of simple consistent hashing algorithm check –link
This was one of the interesting topics yet on AM; YourMove. Check out the channel here SUBSCRIBE NOW!
And, I’ll see you in the next one !