In this article we will explore the implications of the elasticity that Amazon ElastiCache brings to an architecture. Amazon ElastiCache nodes currently use the Memcached engine, and they are not aware of the other Memcached nodes running inside the cache cluster. Since the cache nodes do not know how data is distributed among their peers, the balancing act has to be done by the clients. Cache clients usually use a simple hashing algorithm to PUT/GET values from the corresponding cache nodes.
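As a rough illustration of what such clients do internally, here is a minimal sketch of modulo-based key sharding. The node endpoints are hypothetical, md5 stands in for whatever hash the client actually uses, and real Memcached client libraries handle this for you:

import hashlib

# Hypothetical ElastiCache node endpoints, for illustration only.
NODES = ["cache-node-1:11211", "cache-node-2:11211"]

def node_for(key: str) -> str:
    """Pick a cache node with simple modulo hashing: hash(o) mod n."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return NODES[h % len(NODES)]

print(node_for("user:42:profile"))  # always the same node while NODES is static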
This works well while the cache cluster size is static, but for any growing web application a static cluster will not serve the purpose. Assume an online business decides to run a promotion during which nearly 6X more visitors are expected to hit the site. For any heavily cache-dependent site, memory needs will increase correspondingly during this period. The Amazon ElastiCache service understands this elastic behavior in online business and provides an easy mechanism, through the API or the console, to add or remove cache nodes in an existing cluster. Whenever your cache memory needs grow, you can simply add "N" new cache nodes to the Amazon ElastiCache cluster. But as an IT architect you need to understand that this elastic action comes with certain side effects in a distributed caching scenario with Memcached-based Amazon ElastiCache nodes: it may swamp your backend with requests if it is not dealt with properly. Let us understand these effects in detail:
Effect 1: Cold Cache: Amazon ElastiCache nodes using the Memcached engine are ephemeral in nature: the KV data is stored in memory and is not persisted to disk. Whenever you add new cache nodes you need to understand what percentage of capacity the increase represents, and choose a cache node scale-out strategy accordingly. Imagine you have 2 cache nodes, each of cache.m1.large capacity; if you decide to add 2 more cache nodes of the same type to the cluster, close to 50% of the cluster's capacity is now in a cold state, without proper cache warming. This action may bring undesirable consequences and swamp your backend with heavy requests until the new Amazon ElastiCache nodes are properly warmed. Whereas if you have 100 cache.m1.large nodes running in your cluster and you plan to add 5 more nodes, it will not have a big impact, provided the backend is designed to handle this small spike. Some best practices that can be followed here are:
- Plan for the addition of the nodes well in advance of the promotion, so that enough time is given for warming the cache nodes adequately.
- Add the cache nodes in proportions that your backend can optimally absorb without performance disruption. Example: instead of adding 2 cache nodes at once to a 2 x cache.m1.large cluster, adding them one by one, with enough time to warm each, adds only ~25% cold capacity at a time for your backend to cover (see the sketch after this list). For some advanced strategies using cache redundancy, maintenance windows etc. in AWS to address this situation, refer to this URL: http://harish11g.blogspot.in/2012/11/amazon-elasticache-memcached-ec2.html
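As an illustration of the one-by-one approach, here is a minimal sketch using the boto3 SDK. The cluster id, target size and warm-up window are hypothetical assumptions; the sketch simply raises NumCacheNodes one step at a time and waits between steps:

import time
import boto3

elasticache = boto3.client("elasticache", region_name="us-east-1")

CLUSTER_ID = "my-memcached-cluster"  # hypothetical cluster id
TARGET_NODES = 4                     # e.g. grow from 2 to 4, one node at a time
WARM_UP_SECONDS = 3600               # assumed warming window per node

def current_node_count() -> int:
    resp = elasticache.describe_cache_clusters(CacheClusterId=CLUSTER_ID)
    return resp["CacheClusters"][0]["NumCacheNodes"]

while current_node_count() < TARGET_NODES:
    elasticache.modify_cache_cluster(
        CacheClusterId=CLUSTER_ID,
        NumCacheNodes=current_node_count() + 1,  # add a single node
        ApplyImmediately=True,
    )
    # Wait until the cluster finishes modifying, then let the node warm.
    elasticache.get_waiter("cache_cluster_available").wait(CacheClusterId=CLUSTER_ID)
    time.sleep(WARM_UP_SECONDS)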
Effect 2: Object Remapping: The most common approach used by clients of Amazon ElastiCache nodes to distribute an object "o" among "n" cache nodes is to put object o in the cache node whose number is the result of hash(o) mod n. This approach is good for a static cache node scenario, but when you add or remove cache nodes, object "o" may need to be hashed to a new location every time "n" changes. This operation can thunder your backend with heavy load and cause undesirable consequences, as the short sketch below quantifies.
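A minimal sketch of why this hurts: under hash(o) mod n, growing a cluster from 3 to 4 nodes remaps roughly (n-1)/n, here around 75%, of the keys. The key names are synthetic and md5 stands in for the client's hash function:

import hashlib

def bucket(key: str, n: int) -> int:
    """Naive placement: hash(o) mod n."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % n

keys = [f"item:{i}" for i in range(100_000)]
moved = sum(1 for k in keys if bucket(k, 3) != bucket(k, 4))
print(f"{moved / len(keys):.0%} of keys remapped")  # roughly 75%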
Ideally, when a cache node is added or removed, only a fair share of the objects would be remapped to other cache nodes in Amazon ElastiCache. This can be achieved by using "Consistent Hashing" in the cache node clients. Since Amazon ElastiCache nodes are not peer aware, no change is required on the nodes themselves; it is only in the cache clients that we need to apply this intelligent hashing approach. Consistent hashing was introduced by David Karger et al. in the paper "Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web". The paper can be found here: http://www.akamai.com/dl/technical_publications/ConsistenHashingandRandomTreesDistributedCachingprotocolsforrelievingHotSpotsontheworldwideweb.pdf
Consistent hashing was first implemented for Memcached by the last.fm team as the "ketama" library. Refer to this URL: http://www.last.fm/user/RJ/journal/2007/04/10/rz_libketama_-_a_consistent_hashing_algo_for_memcache_clients
Now let us explore what consistent hashing is and how it helps. The idea behind the consistent hashing algorithm is to hash both the objects and the cache nodes using the same hash function, and to map objects to the same cache node as consistently as possible. Consistent hashing uses a mechanism that acts like a clock: the hash function maps objects and cache nodes onto a number range, and the values of that range wrap around like a circle, which is why we call this circle a continuum. Imagine in the picture below a circle with objects 1, 2, 3, 4 and cache nodes A, B and C. To decide which cache node an object goes to, you move clockwise around the circle until you find a cache node. So in the diagram below, objects 1 and 4 belong to cache node A, object 2 belongs to cache node B and object 3 belongs to cache node C.
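To make the continuum concrete, here is a minimal consistent-hash ring sketch, a toy stand-in for what ketama-style clients implement internally; the node names are hypothetical and md5 is used for illustration:

import bisect
import hashlib

def _hash(value: str) -> int:
    """Map a string onto the continuum."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes):
        # Each node sits at one point on the continuum.
        self._points = sorted((_hash(n), n) for n in nodes)

    def add(self, node):
        bisect.insort(self._points, (_hash(node), node))

    def remove(self, node):
        self._points.remove((_hash(node), node))

    def node_for(self, key: str) -> str:
        """Walk clockwise from the key's position to the next node."""
        points = [p for p, _ in self._points]
        idx = bisect.bisect(points, _hash(key)) % len(self._points)
        return self._points[idx][1]

ring = HashRing(["node-A", "node-B", "node-C"])
print(ring.node_for("object-1"))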
Consider what happens if cache node C is removed from the cache cluster: object 3 will now be remapped to cache node A, and all the other object mappings remain unchanged. In the same way, if we add another cache node D at the position marked in the diagram below, it will take objects 3 and 4, leaving only object 1 belonging to A.
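Continuing the HashRing sketch above, we can measure how few keys actually move when a node leaves the ring; the fraction is roughly the departing node's share of the continuum, which varies with where the nodes happen to land:

keys = [f"object-{i}" for i in range(100_000)]

ring = HashRing(["node-A", "node-B", "node-C"])
before = {k: ring.node_for(k) for k in keys}

ring.remove("node-C")  # only node C's keys should move
moved = sum(1 for k in keys if ring.node_for(k) != before[k])
print(f"{moved / len(keys):.0%} of keys remapped")  # roughly node C's share only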
The principal advantage of consistent hashing is that the departure or arrival of a node only affects its immediate neighbors, while the other nodes remain unaffected. This approach reduces the remapping of objects between the cache nodes and thereby significantly decreases the swamping of backend servers in the event of cache elasticity. A consistent hashing implementation is available in most of the popular Amazon ElastiCache-Memcached clients we use every day. But there is a saying that "everything comes with a consequence": since the placement of nodes on the continuum is effectively random, it is possible to end up with a very non-uniform distribution of data and load between cache nodes inside/across a cluster.
To address this issue, more advanced KV systems like Membase and Amazon Dynamo use a variant of consistent hashing. Instead of mapping a node to a single point on the circle, each node gets assigned to multiple points on the ring (Amazon Dynamo uses the concept of "virtual nodes"). A virtual node looks like a single node in the system, but each physical node can be responsible for more than one virtual node. Effectively, when a new node is added to the system, it is assigned multiple positions (henceforth, "tokens") on the ring. To know more about this approach, read the Amazon Dynamo paper. I hope the Amazon ElastiCache team implements this concept in their distributed caching service in future as well, because it will help us use nodes optimally.
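A minimal sketch of the virtual-node variant, extending the ring idea above: each physical node is hashed onto the continuum many times, which smooths out the distribution. The replica count of 100 is an arbitrary illustrative choice:

import bisect
import hashlib
from collections import Counter

def _hash(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class VirtualNodeRing:
    def __init__(self, nodes, replicas=100):
        # Each physical node is placed on the ring `replicas` times.
        self._points = sorted(
            (_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(replicas)
        )

    def node_for(self, key: str) -> str:
        points = [p for p, _ in self._points]
        idx = bisect.bisect(points, _hash(key)) % len(self._points)
        return self._points[idx][1]

ring = VirtualNodeRing(["node-A", "node-B", "node-C"])
print(Counter(ring.node_for(f"object-{i}") for i in range(30_000)))
# With virtual nodes the per-node counts come out far closer to uniform.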