Tuesday, July 10, 2012

Dissecting Amazon ELB: 18 things you should know

While designing highly scalable systems, the load balancing tier becomes an integral part of any architecture. We have captured some of our prior experience working with Amazon ELB in this article as the points detailed below. Some of the points mentioned here will be encountered only by advanced users in complex use cases. But if you or your team take note of them now, I feel it might shorten your effort while debugging a problem or designing a solution, and spare you the effort cycle and pain our team went through.

In AWS, there is a wide variety of solution choices for the load balancing layer, such as Amazon Elastic Load Balancing (ELB) and EC2 AMIs like HAProxy, Nginx, Zeus, and Citrix NetScaler. In this article we are going to dissect our experience with the Amazon ELB layer as 18 points which you will not frequently encounter in Amazon documents or the blogosphere.

To know more about configuring Amazon ELB in 4 easy steps, refer to this article:

Currently there are 18 points in this article and I plan to add some more in the coming days. So if you are an advanced user of Amazon ELB, please watch this article closely.

Some of the points are:

Point 1) Algorithms supported by Amazon ELB

Currently Amazon ELB supports only Round Robin (RR) and Session Sticky algorithms.

Round Robin algorithm can be used for load balancing traffic between
  •  Web/App EC2 instances which are designed to be stateless
  •  Web/App EC2 instances which synchronize state between themselves
  •  Web/App EC2 instances which synchronize state using common data stores like Memcached, ElastiCache, a database, etc.
Session Sticky algorithm can be used for load balancing traffic between
  • Web/App EC2 instances which are designed to be stateful
The current version of ELB does not support Weighted or Least Connection algorithms like other reverse proxies do. We can probably expect these algorithms to be supported in the future.
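To make the two supported strategies concrete, here is a minimal Python sketch of round robin versus cookie-based session stickiness. This is only an illustrative model; the instance IDs and helper names are our own and not an ELB API.

```python
from itertools import cycle

# Round robin: each request simply goes to the next instance in the ring.
instances = ["i-aaa111", "i-bbb222", "i-ccc333"]
ring = cycle(instances)

def round_robin():
    return next(ring)

# Session sticky: the first request of a session is placed round robin,
# after which the session cookie pins it to that same instance.
session_table = {}

def sticky(session_cookie):
    if session_cookie not in session_table:
        session_table[session_cookie] = round_robin()
    return session_table[session_cookie]
```

Stickiness trades even distribution for state locality, which is why it suits the stateful Web/App tier mentioned above.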

Point 2) Amazon ELB is not a PAGE CACHE
Amazon ELB is just a load balancer and is not to be confused with a page cache server or web accelerator. Web accelerators like Varnish can cache pages, static assets, etc. and also do RR load balancing to backend EC2 servers. Amazon ELB is designed to do just load balancing, efficiently and elastically. If you need a page accelerator + LB you can use Varnish or NetScaler in your LB tier. Refer Article Varnish or NetScaler. Amazon ELB can be used with Amazon CloudFront to deliver static assets, and dynamic assets that can be page cached, from the edge location itself to reduce latency for the above use cases.

Point 3) Amazon ELB can be pre-warmed on request
Amazon ELB can be pre-warmed by raising a request to the AWS Support team. The Amazon team will pre-warm the load balancers in the ELB tier to handle sudden load/flash traffic. This is advisable for scenarios like quarterly sales, launch campaigns, promotions, etc. which follow a flash traffic pattern. The AWS team will require details from your team such as estimated requests per second, average request size in bytes, average response size in bytes, what percentage of traffic is SSL vs non-SSL, and whether HTTP/1.1 keep-alive is enabled. Once provided, pre-warming will be activated by them. Amazon ELB pre-warming cannot be done on an hourly/daily basis (I think). It would be a cool feature if the Amazon team could collect these details and offer ELB pre-warming as a configurable feature in the AWS console (like the Amazon DynamoDB console).
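As a checklist, the details AWS support typically asks for can be captured like this. The field names and values below are illustrative, our own convention rather than any AWS API:

```python
# Hypothetical pre-warming request details, collected as a checklist
# before contacting AWS support.
prewarm_request = {
    "expected_requests_per_second": 20000,
    "average_request_size_bytes": 2048,
    "average_response_size_bytes": 32768,
    "ssl_traffic_percent": 40,        # SSL vs non-SSL split
    "http_keep_alive_enabled": True,  # HTTP/1.1 keep-alive on backends?
    "traffic_window_start": "2012-07-15T09:00:00Z",
}
```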

Point 4) Amazon ELB is not designed for sudden load spikes / flash traffic
Amazon ELB is designed to handle unlimited concurrent requests per second with a "gradually increasing" load pattern. It is not designed to handle a sudden heavy spike of load or flash traffic. For example, imagine an e-commerce website whose traffic increases gradually to thousands of concurrent requests/sec over hours; Amazon ELB can easily handle this traffic pattern. According to a RightScale benchmark, Amazon ELB was easily able to handle 20K+ requests/sec in such patterns. Whereas for use cases like a mass online exam, a GILT-style load pattern, or a 3-hour sales/launch campaign site expecting a sudden 20K+ concurrent requests/sec spike within a few minutes, Amazon ELB will struggle to handle the load. If this sudden spike pattern is not a frequent occurrence we can pre-warm the ELB; otherwise we need to look for alternative load balancers in AWS infrastructure.

For a comparison analysis of HAProxy vs Amazon ELB, refer article:

Point 5) Protocols supported by Amazon ELB
Currently Amazon ELB supports only the following protocols: HTTP, HTTPS (Secure HTTP), SSL (Secure TCP) and TCP. ELB supports load balancing for the following TCP ports: 25, 80, 443, and 1024-65535. In case the RTMP or HTTP streaming protocol is needed, use the Amazon CloudFront CDN in your architecture.

Point 6) Amazon ELB times out at 60 seconds (when kept idle)
Amazon ELB currently times out persistent socket connections at 60 seconds if they are kept idle. This will be a problem for use cases which generate large files (PDFs, reports, etc.) on the backend EC2 instance, send them back as the response, and keep the connection idle during the entire generation process. To avoid this you'll have to send something on the socket every 40 or so seconds to keep the connection active in Amazon ELB. Note: I have heard this value can be extended after explaining the use case to the AWS support team.

Point 7) Amazon ELB does not provide a permanent or fixed IP for its load balancers
Currently Amazon ELB does not provide fixed or permanent IP addresses for the load balancing instances that are launched in its tier. This will be a bottleneck for enterprises which are obliged to whitelist their load balancer IPs in external firewalls/gateways. For such use cases, we can currently use HAProxy, Nginx, or NetScaler on EC2, attached to Elastic IPs, as load balancers in AWS infrastructure.

Designing High Availability @ HAProxy / ELB Layer

Point 8) Amazon ELB cannot do Multi AWS Region Load Balancing
Amazon ELB can be used to Load balance
  • Multiple EC2 instances launched inside a Single Amazon Availability Zone
  • Multiple EC2 instances launched inside Multiple Availability Zones inside a Single Region
Amazon ELB cannot load balance between EC2 instances launched in multiple AWS regions. Use Route 53 DNS RR / LBR / failover configurations for load balancing at the DNS level between ELBs, EC2 instances, etc. launched in multiple AWS regions.
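The DNS-level round robin can be sketched like this. It is a toy resolver, not the Route 53 API, and the endpoint names are made up:

```python
from itertools import cycle

# Per-region ELB endpoints behind one hostname; each DNS lookup hands
# out the next region's CNAME, which is roughly what a Route 53
# round-robin record set does across regions.
region_endpoints = cycle([
    "myapp-1234.us-east-1.elb.amazonaws.com",
    "myapp-5678.eu-west-1.elb.amazonaws.com",
])

def resolve(hostname):
    # hostname is ignored in this toy model; a real resolver would
    # look up the record set configured for it
    return next(region_endpoints)
```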

To know more about DNS Load Balancing :

To know more about Geo Distributed Load Balancing using Amazon Route 53 :

Point 9) Amazon ELB sticks requests when traffic is generated from a single IP
This point comes as a surprise to many Amazon ELB users. Amazon ELB behaves a little strangely when incoming traffic originates from a single IP or a specific IP range: it does not round robin efficiently and instead sticks the requests. In such conditions, Amazon ELB starts favoring a single EC2 instance, or the EC2 instances in a single Availability Zone in Multi-AZ deployments. For example: suppose application A (a customer company) sends all of its traffic to application B, which is deployed inside AWS infrastructure behind an ELB. In this case the ELB in front of application B will not efficiently round robin the traffic to the Web/App EC2 instances deployed under it. This is because the entire incoming traffic from application A arrives from a single firewall/NAT or a specific IP range, and the ELB will unevenly stick the requests to a single EC2 instance or the EC2 instances in a single AZ.
Note: Users usually encounter this during load tests, so it is ideal to load test AWS infrastructure from multiple distributed agents.
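One plausible model of this behaviour (ELB's internal algorithm is not published, so this is only an assumption) is the balancer hashing the client's source IP to pick a backend; every request from one NAT'd office IP then lands on the same instance:

```python
import hashlib

# Hypothetical backend pool spread across two AZs.
backends = ["i-az1-web1", "i-az1-web2", "i-az2-web1", "i-az2-web2"]

def pick_backend(source_ip):
    # Hash the source IP and map it onto the backend list; a single
    # source IP therefore always maps to the same single backend.
    digest = hashlib.md5(source_ip.encode()).hexdigest()
    return backends[int(digest, 16) % len(backends)]
```

This also shows why a load test run from one agent looks skewed while distributed agents spread the traffic evenly.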

Point 10) Too-long load balancer CNAMEs cause issues in some firewalls/ISPs
Some ISPs do not allow Amazon ELB CNAMEs that exceed 32 characters, and some firewall versions/models (like Cisco PIX) will not allow longer CNAMEs; in such cases try to use a shorter name.

Point 11) Amazon ELB cannot load balance based on URL patterns
Amazon ELB cannot load balance based on URL patterns like other reverse proxies can. For example, Amazon ELB cannot direct and load balance between the request URLs www.xyz.com/URL1 and www.xyz.com/URL2. Currently for such use cases you can use HAProxy on EC2.
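For reference, here is a minimal HAProxy configuration sketch of that URL-pattern routing. Backend names and addresses are hypothetical:

```haproxy
frontend www
    bind *:80
    acl is_url1 path_beg /URL1
    acl is_url2 path_beg /URL2
    use_backend pool_url1 if is_url1
    use_backend pool_url2 if is_url2
    default_backend pool_url1

backend pool_url1
    server web1 10.0.1.11:80 check
    server web2 10.0.1.12:80 check

backend pool_url2
    server api1 10.0.2.11:80 check
    server api2 10.0.2.12:80 check
```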

Point 12) Amazon ELB can easily support 20K+ concurrent reqs/sec
Amazon ELB is designed to handle unlimited concurrent requests per second. ELB is inherently scalable and can elastically increase/decrease its capacity depending upon the traffic. According to a benchmark done by RightScale, Amazon ELB was easily able to scale out and handle 20K or more concurrent requests/sec. Refer URL: http://blog.rightscale.com/2010/04/01/benchmarking-load-balancers-in-the-cloud/

Point 13) Amazon ELB does not provide logs
Amazon ELB currently does not provide access to its log files for analysis. We cannot debug load balancing problems, analyze traffic and access patterns, or categorize bots/visitors, because we do not have access to the ELB logs. This will also be a bottleneck for organizations which have strong audit/compliance requirements to be met at all layers of their infrastructure. Amazon ELB could generate the logs and put them in Amazon S3 buckets (feature request to the Amazon ELB product team).

Point 14) Monitoring Amazon ELB
Amazon ELB is an AWS building block and does not currently provide access to its logs or stats files for monitoring. Secondly, we cannot get full access to the load balancers launched inside the ELB tier, nor install any monitoring agents on them. This closed model makes us rely only on CloudWatch metrics for monitoring. Refer to this URL for the ELB metrics that can currently be monitored: http://harish11g.blogspot.in/2012/02/cloudwatch-elastic-load-balancing.html

Point 15) Amazon ELB and Compliance requirements
SSL termination can be done at 2 levels using Amazon ELB in your application architecture. They are:
  • SSL termination at the Amazon ELB tier, which means the connection is encrypted between the client (browser etc.) and Amazon ELB, but the connection between ELB and the Web/App EC2 instance is in the clear. This configuration may not be acceptable in strictly secure environments and will not pass compliance requirements.
  • SSL termination at the backend with end-to-end encryption, which means the connection is encrypted between the client and Amazon ELB, and the connection between ELB and the Web/App EC2 backend is also encrypted. This is the recommended ELB configuration for meeting compliance requirements at the LB level.
  • Important ELB-SSL Reference URLs 
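
The two modes above map onto ELB listener configuration, shown here as plain dictionaries with the four listener fields (certificate details omitted; this is a sketch of the shape, not an API call):

```python
# SSL terminated at the ELB: encrypted from client to ELB only.
terminate_at_elb = {
    "Protocol": "HTTPS", "LoadBalancerPort": 443,
    "InstanceProtocol": "HTTP", "InstancePort": 80,   # clear text behind the ELB
}

# End-to-end encryption: the ELB re-encrypts to the backend instance.
end_to_end = {
    "Protocol": "HTTPS", "LoadBalancerPort": 443,
    "InstanceProtocol": "HTTPS", "InstancePort": 443,
}
```
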
Point 16) Amazon ELB and x.x.x.255 IP addresses
Sometimes ELB assigns its load balancers IP addresses ending in x.x.x.255. Though this is technically fine, certain networks will not properly route to an IP address ending in .255. Unfortunately, it is currently not possible to exclude an IP address ending in .255 from ELB. In such circumstances, some requests from certain users may face issues. Note this when you are debugging ELB for missing requests.

Point 17) Amazon ELB is an inherently fault tolerant and scalable service
Elastic Load Balancer does not cap the number of connections that it can attempt to establish with the load balanced Amazon EC2 instances. We can expect this number to scale with the number of concurrent HTTP, HTTPS, or SSL requests or the number of concurrent TCP connections that the Elastic Load Balancer receives. Since multiple load balancers are launched in the ELB tier, it is inherently fault tolerant as well. If you need a scalable and elastic LB layer, then ELB comes highly recommended. Amazon ELB can be deployed to support the following HA architectures in AWS: http://harish11g.blogspot.in/2012/02/elastic-load-balancing-aws-deployment.html

Point 18) Amazon ELB + Amazon Auto Scaling: no graceful connection termination
Amazon ELB can be configured to work seamlessly with Amazon Auto Scaling and Amazon CloudWatch. New EC2 instances launched by Auto Scaling are added to the ELB for load balancing automatically, and whenever load drops, existing EC2 instances can be removed from the ELB by Auto Scaling. Both Auto Scaling and ELB use CloudWatch monitoring to enable this functionality. The important point to remember while using this kind of integration is that Amazon Auto Scaling does not gracefully (without interruption to existing connections) remove Web/App EC2 instances from Amazon ELB. Connections are dropped instantly when a Web/App EC2 instance is removed, and no grace period is given by ELB or Auto Scaling. This behavior of Auto Scaling can cause dozens or hundreds of users to get error pages while using the application when such an event occurs in the backend infrastructure.
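A sketch of the graceful removal ELB/Auto Scaling does not do for us, assuming we script it ourselves: deregister the instance first, let in-flight connections drain, and only then terminate. The callbacks and connection counter below are stand-ins for the real deregister/terminate API calls:

```python
import time

def remove_gracefully(instance_id, deregister, terminate,
                      inflight_count, timeout=20):
    deregister(instance_id)          # stop new requests reaching the instance
    deadline = time.time() + timeout
    # let existing connections finish, up to the timeout
    while inflight_count(instance_id) > 0 and time.time() < deadline:
        time.sleep(0.01)
    terminate(instance_id)           # only now pull the plug
```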

To know more about Amazon Auto Scaling :

Article under progress


Seren Thompson said...

Nicely done.

Seren Thompson said...

You might want to add that, when creating an ELB within a VPC, the subnets assigned to the ELB must be internet-routable, even if the backend EC2 instances are on subnets that are private and have no route to the internet.

Unknown said...

It is a very good post. I faced some of these issues.

Anonymous said...

Good post, some useful points. For point 9, are you sure it wasn't caused by client-side DNS caching, since ELB uses DNS round robin to distribute incoming traffic initially across ELBs? And for point 15, you can have SSL termination on the ELB and have the ELB initiate a new SSL session to the backend instance, so you still retain end-to-end encryption, albeit via two SSL sessions.

Rajshekar said...

It is nicely done.... How about a comparison between HAProxy on a Large instance and ELB: which one would perform better? Of course availability would be a challenge...

Anonymous said...

Very nice. This kind of post could save hours of troubleshooting for someone new. Even if it does so for one person (ahem... maybe me) it is an extremely worthy post. It would be great to see such posts on other AWS services such as EC2, DynamoDB, RDS, SES.

mxx said...

Regarding point 4:
You are basing this on the results of RightScale's test from 2010. Are you sure that ELB has not made any internal improvements in how fast it can scale up?
In the world of AWS things move and change very quickly. I'm not sure if data from 2010 is still accurate/relevant.

Ira said...

Another important issue that's good to know: when you add an ELB to support more than one AZ, it adds more IP addresses to the DNS record. There is no reason to add an AZ you are not running instances in. Also, that front end only works by the grace of DNS round robin, so really DO NOT enable AZs you are not running instances in.

This also means that running two machines in one AZ and one in a second means the traffic won't be equal. If you are using more than one AZ behind the ELB, make sure you have roughly the same number of instances in each zone.

Anonymous said...

when will you update this article?

Harish Ganesan said...

@rajshekar: We cannot compare HAProxy with ELB on a concurrent req/sec basis because we do not know what instance type is being used in the ELB tier at any point in time. Also, ELB keeps scaling its capacity up/out depending upon the traffic and its internal algorithm. Since it is not a direct apples-to-apples comparison, it becomes tough to elaborate from this perspective. ELB also comes with scalability + HA benefits packaged, whereas we need to custom build this tier when using Nginx or HAProxy. On the other hand, HAProxy as an LB in Amazon EC2 has its own benefits and we use it in many of our implementations as well; some of my blog readers have requested that I detail this comparison. I will surely do this in the coming weeks.

Harish Ganesan said...

@mxx: I agree that the ELB team would have made some improvements in scale up/out speed, but it is not directly measurable from outside since we do not know their internal scale-out algorithm. If you have done any benchmarks on this front, please share them; they will be really useful.
Unless you pre-warm ELB, I am sure it cannot spike its capacity in seconds to handle flash traffic of 20K reqs/sec.

Anonymous said...

I disagree with point 1. ELB does not entirely depend on the Round Robin algorithm. Sure, it comes into play, but at a certain point.

My tests reveal that it uses a Least Pending Connections algorithm while sending the traffic to the backend instances. It seems like ELB keeps track of these connections. Round Robin comes into play when the backend instances have about (approximately) the same number of pending requests.

But very good article!

Anonymous said...

How sure are you about point 18?

I've seen some people suggesting that the AutoScaling group tells the ELB to stop making connections to the EC2, and then attempts to cleanly shut down the EC2 (using 'shutdown -h now' or equivalent), which would mean that so long as connections finish within 20 seconds or so you should be fine (my reading tells me that after 20 seconds or so the AutoScaling group will indeed pull the plug on the EC2).

None of that is from Amazon themselves though, who seem to be very quiet in their documentation on the exact shut down procedure.

Unknown said...

Great article, still high on Google's page rank so might be worth an edit to update the points that AWS have addressed such as ELB logging, Route53 aliasing for the CNAME length issue etc.

Anonymous said...

Thanks Harish.. Your post was helpful to identify one of our issues which started when ELB assigned itself with a .255 address and legacy hardware unable to route packets destined for it...

Two thumbs up.. Cheers

Anonymous said...

Amazon ELB - Can provide logs..

You need to enable ELB Logging to an S3 Bucket you like. Please alter the article. Thanks!

Harish Ganesan said...

Viewers have requested for updates in this article. Will do the same in coming weeks.

Anonymous said...

Point #18 has also been addressed with the ConnectionDrainingPolicy that was introduced a while back :)

Anonymous said...

RE: Point 9. This definitely used to be an issue. It was not due to anything "fixable" -- ELB simply didn't route properly across AZ's from a small number of source ips (very likely they were hashing the IP to route). However, they've fixed this. At least with round robin, we no longer see volume imbalances across AZ's.

Anonymous said...

I can appreciate not having the time to update the article in detail but I suspect it would be helpful to readers to add a disclaimer to the top indicating that some (many?) of the issues have been addressed.

