When designing highly scalable systems, the load balancing tier becomes an integral part of the architecture. This article captures some of our prior experience working with Amazon ELB as the detailed points below. Some of these points will be encountered only by advanced users in complex use cases, but if you or your team take note of them, it may shorten your effort while debugging a problem or designing a solution, and spare you the effort cycle and pain our team went through.
In AWS, there is a wide variety of solution choices for the load balancing layer, such as Amazon Elastic Load Balancing (ELB) and EC2 AMIs like HAProxy, Nginx, Zeus and Citrix NetScaler. In this article we dissect our experience with the Amazon ELB layer as 18 points which you will not frequently encounter in Amazon documentation or the blogosphere.
To know more about configuring Amazon ELB in 4 easy steps, refer to the article:
Currently there are 18 points in this article, and I plan to add more in the coming days. So if you are an advanced user of Amazon ELB, please watch this article closely.
The points are:
Point 1) Algorithms supported by Amazon ELB
Currently Amazon ELB supports only Round Robin (RR) and Session Sticky algorithms.
The Round Robin algorithm can be used for load balancing traffic between:
- Web/App EC2 instances which are designed to be stateless
- Web/App EC2 instances which synchronize state between themselves
- Web/App EC2 instances which synchronize state using common data stores like Memcached, ElastiCache, a database etc.
The Session Sticky algorithm is the one to use for Web/App EC2 instances which are designed to be stateful.
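The round robin behavior can be sketched as a simple rotation over the backend instances (a minimal illustration of the algorithm, not ELB's actual implementation; the instance IDs are hypothetical):

```python
from itertools import cycle

# Minimal round-robin sketch: each incoming request is handed to the
# next backend in a fixed rotation. Instance IDs are hypothetical.
backends = ["i-aaaa1111", "i-bbbb2222", "i-cccc3333"]
rotation = cycle(backends)

def route_request():
    """Return the backend that should serve the next request."""
    return next(rotation)

# Six requests are spread evenly: each backend serves exactly two.
served = [route_request() for _ in range(6)]
```

Because the rotation carries no notion of a client session, any per-user state on a backend is lost between requests, which is why stateful instances need sticky sessions instead.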
Point 2) Amazon ELB is not a PAGE CACHE
Amazon ELB is just a load balancer and should not be confused with a page cache server or web accelerator. Web accelerators like Varnish can cache pages, static assets etc. and also do RR load balancing to backend EC2 servers. Amazon ELB is designed to do just load balancing, efficiently and elastically. If you need a page accelerator + LB, you can use Varnish or NetScaler in your LB tier (refer to the article "Varnish or NetScaler"). For the above use cases, Amazon ELB can also be used with Amazon CloudFront, which delivers static assets and page-cacheable dynamic assets from the edge locations themselves to reduce latency.
Point 3) Amazon ELB can be pre-warmed on request
Amazon ELB can be pre-warmed by raising a request with the AWS Support team. The Amazon team will pre-warm the load balancers in the ELB tier to handle the sudden load/flash traffic. This is advisable for scenarios like quarterly sales, launch campaigns, promotions etc. which follow a flash traffic pattern. The AWS team will require details from your team such as the estimated requests per second, average request size in bytes, average response size in bytes, the percentage of SSL vs. non-SSL traffic, and whether HTTP/1.1 keep-alive is enabled. Once these are provided, pre-warming will be activated by them. As far as I know, ELB pre-warming cannot be done on an hourly/daily basis. It would be a cool feature if the Amazon team offered ELB pre-warming as a configurable option in the AWS console (like the Amazon DynamoDB console).
Point 4) Amazon ELB is not designed for sudden load spikes / flash traffic
Amazon ELB is designed to handle a practically unlimited number of concurrent requests per second under a "gradually increasing" load pattern. It is not designed to handle a heavy sudden spike of load or flash traffic. For example, imagine an e-commerce website whose traffic increases gradually to thousands of concurrent requests/sec over hours: Amazon ELB can easily handle this traffic pattern. According to a RightScale benchmark, Amazon ELB was easily able to handle 20K+ requests/sec under such patterns. Whereas for use cases like a mass online exam, a GILT-style load pattern, or a 3-hour sales/launch campaign site expecting a sudden spike to 20K+ concurrent requests/sec within a few minutes, Amazon ELB will struggle to handle the load. If this sudden spike pattern is not a frequent occurrence, we can pre-warm the ELB; otherwise we need to look for alternative load balancers in the AWS infrastructure.
For a comparison analysis of HAProxy vs Amazon ELB, refer to the article:
Point 5) Protocols and ports supported by Amazon ELB
Currently Amazon ELB supports only the following protocols: HTTP, HTTPS (secure HTTP), SSL (secure TCP) and TCP. ELB supports load balancing for the following TCP ports: 25, 80, 443, and 1024-65535. If RTMP or HTTP streaming protocols are needed, we need to use the Amazon CloudFront CDN in the architecture.
Point 6) Amazon ELB times out idle connections at 60 seconds
Amazon ELB currently times out persistent socket connections after 60 seconds if they are kept idle. This is a problem for use cases which generate large files (PDFs, reports etc.) on the backend EC2, send them back as the response, and keep the connection idle during the entire generation process. To avoid this you will have to send something on the socket every 40 seconds or so to keep the connection active in Amazon ELB. Note: I have heard this value can be extended after explaining the use case to the AWS Support team.
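The timing logic of such a keep-alive workaround can be sketched as below (the 40-second interval and the helper name are our own choices, not anything ELB prescribes):

```python
import time

IDLE_LIMIT = 60   # ELB's idle timeout in seconds
HEARTBEAT = 40    # send filler comfortably before the idle limit is hit

def heartbeat_due(last_write, now=None):
    """Return True when we should write keep-alive bytes (e.g. a space
    or whitespace chunk) to the socket so ELB does not see it as idle."""
    now = time.time() if now is None else now
    return (now - last_write) >= HEARTBEAT
```

In a long-running report generator, the worker would periodically call `heartbeat_due()` and, when it returns True, flush some harmless bytes to the client and record the write time, keeping the connection alive until the real response is ready.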
Point 7) Amazon ELB does not provide a permanent or fixed IP for its load balancers
Currently Amazon ELB does not provide fixed or permanent IP addresses for the load balancing instances launched in its tier. This is a bottleneck for enterprises which are required to whitelist their load balancer IPs in external firewalls/gateways. For such use cases, we can currently use HAProxy, Nginx or NetScaler on EC2 instances attached to Elastic IPs as the load balancers in AWS infrastructure.
Designing High Availability @ HAProxy / ELB Layer
http://harish11g.blogspot.in/2012/10/high-availability-haproxy-amazon-ec2.html
Point 8) Amazon ELB cannot do multi-region load balancing
Amazon ELB can be used to load balance:
- Multiple EC2 instances launched inside a single Amazon Availability Zone
- Multiple EC2 instances launched across multiple Availability Zones inside a single Region
It cannot distribute traffic across AWS Regions; for that, DNS-based load balancing or Amazon Route 53 is needed.
To know more about DNS load balancing: http://harish11g.blogspot.in/2012/06/aws-high-availability-dns-load.html
To know more about geo-distributed load balancing using Amazon Route 53: http://harish11g.blogspot.in/2012/09/geo-distributed-route53-lbr-latency.html
Point 9) Amazon ELB sticks requests when traffic is generated from a single IP
This point comes as a surprise to many users of Amazon ELB. Amazon ELB behaves a little strangely when the incoming traffic originates from a single IP or a specific IP range: it does not round robin efficiently and instead sticks the requests. Under such conditions Amazon ELB starts favoring a single EC2 instance, or EC2 instances in a single Availability Zone, even in Multi-AZ deployments. For example: suppose Application A (a customer company) sends all of its traffic from a single host to Application B, which is deployed inside AWS infrastructure with an ELB front end. In this case the ELB of Application B will not efficiently round robin the traffic to the Web/App EC2 instances deployed under it, because the entire incoming traffic from Application A arrives from a single firewall/NAT or a specific IP range, and the ELB will start unevenly sticking the requests to a single EC2 instance or to EC2 instances in a single AZ.
Note: Users usually encounter this during load testing, so it is ideal to load test AWS infrastructure from multiple distributed agents.
Point 10) Too-long load balancer CNAMEs cause issues in some firewalls/ISPs
Some ISPs do not allow Amazon ELB CNAMEs that exceed 32 characters, and some firewall versions/models (like Cisco PIX) will not allow longer CNAMEs. In such cases, try to use a shorter name.
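A quick sanity check against this limit is easy to automate (the 32-character threshold reflects the ISP/firewall behavior described above, not any ELB rule, and the ELB DNS name shown is hypothetical):

```python
def cname_too_long(cname, limit=32):
    """Flag ELB CNAMEs longer than the limit some ISPs/firewalls reject."""
    return len(cname) > limit

# A typical auto-generated ELB DNS name easily exceeds 32 characters:
elb_cname = "my-loadbalancer-1234567890.us-east-1.elb.amazonaws.com"
```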
Point 11) Amazon ELB cannot load balance based on URL patterns
Amazon ELB cannot load balance based on URL patterns like other reverse proxies can. For example, Amazon ELB cannot direct and load balance requests between the URLs www.xyz.com/URL1 and www.xyz.com/URL2. Currently, for such use cases, you can use HAProxy on EC2.
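For reference, URL-pattern routing of this kind is straightforward in HAProxy; a minimal configuration sketch (the backend names and addresses are hypothetical):

```
frontend www
    bind *:80
    acl is_url1 path_beg /URL1
    acl is_url2 path_beg /URL2
    use_backend pool_url1 if is_url1
    use_backend pool_url2 if is_url2
    default_backend pool_url1

backend pool_url1
    server app1 10.0.1.10:80 check

backend pool_url2
    server app2 10.0.1.20:80 check
```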
Point 12) Amazon ELB can easily support more than 20K+ concurrent reqs/sec
Amazon ELB is designed to handle a practically unlimited number of concurrent requests per second. ELB is inherently scalable, and it can elastically increase/decrease its capacity depending upon the traffic. According to a benchmark done by RightScale, Amazon ELB was easily able to scale out and handle 20K or more concurrent requests/sec. Refer URL: http://blog.rightscale.com/2010/04/01/benchmarking-load-balancers-in-the-cloud/
Point 13) Amazon ELB does not provide logs
Amazon ELB currently does not provide access to its log files for analysis. We cannot debug load balancing problems, analyze traffic and access patterns, or categorize bots/visitors, because we do not have access to the ELB logs. This is also a bottleneck for organizations with strong audit/compliance requirements that must be met at all layers of their infrastructure. Amazon ELB could generate the logs and put them in Amazon S3 buckets (a feature request to the Amazon ELB product team).
Point 14) Monitoring Amazon ELB
Amazon ELB is an AWS building block, and it does not currently provide access to its logs or stats files for monitoring. Secondly, we cannot get full access to the load balancers launched inside the ELB tier to install monitoring agents on them. This closed model of ELB makes us rely only on CloudWatch metrics for monitoring. Refer to this URL for the ELB metrics that can currently be monitored: http://harish11g.blogspot.in/2012/02/cloudwatch-elastic-load-balancing.html
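As a sketch, those CloudWatch metrics can be pulled programmatically; the helper below only builds the query parameters, and the actual boto3 call is shown commented out (the load balancer name "my-elb" is hypothetical, and credentials are assumed to be configured):

```python
import datetime

def elb_metric_query(lb_name, metric="Latency"):
    """Build get_metric_statistics parameters for a classic ELB metric
    over the last hour, aggregated in 5-minute averages."""
    now = datetime.datetime.utcnow()
    return {
        "Namespace": "AWS/ELB",
        "MetricName": metric,
        "Dimensions": [{"Name": "LoadBalancerName", "Value": lb_name}],
        "StartTime": now - datetime.timedelta(hours=1),
        "EndTime": now,
        "Period": 300,
        "Statistics": ["Average"],
    }

# Usage (requires boto3 and AWS credentials):
# import boto3
# cw = boto3.client("cloudwatch")
# stats = cw.get_metric_statistics(**elb_metric_query("my-elb"))
```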
Point 15) Amazon ELB and compliance requirements
SSL termination can be done at two levels using Amazon ELB in your application architecture. They are:
- SSL termination at the Amazon ELB tier, which means the connection is encrypted between the client (browser etc.) and Amazon ELB, but the connection between ELB and the Web/App EC2 instances is in the clear. This configuration may not be acceptable in strictly secure environments and will not pass compliance requirements.
- SSL termination at the backend with end-to-end encryption, which means the connection is encrypted between the client and Amazon ELB, and the connection between ELB and the Web/App EC2 backend is also encrypted. This is the recommended ELB configuration for meeting compliance requirements at the LB level.
- Important ELB-SSL reference URLs:
- Refer to this URL to understand how to configure SSL offloading in Amazon ELB: http://harish11g.blogspot.in/2012/03/ssl-offloading-elastic-load-balancing.html
- How Amazon ELB SSL support/offloading helps cloud admins: http://harish11g.blogspot.in/2010/10/amazon-elastic-load-balancing-support.html
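The two termination modes above correspond to different listener configurations; a sketch using the classic ELB listener shape (the certificate ARN is a placeholder):

```python
# SSL terminated at the ELB tier: encrypted from client to ELB,
# plain HTTP from ELB to the backend instances.
terminate_at_elb = {
    "Protocol": "HTTPS", "LoadBalancerPort": 443,
    "InstanceProtocol": "HTTP", "InstancePort": 80,
    "SSLCertificateId": "arn:aws:iam::123456789012:server-certificate/my-cert",
}

# End-to-end encryption: the ELB re-encrypts to the backend, which is
# the recommended configuration for compliance at the LB level.
end_to_end = {
    "Protocol": "HTTPS", "LoadBalancerPort": 443,
    "InstanceProtocol": "HTTPS", "InstancePort": 443,
    "SSLCertificateId": "arn:aws:iam::123456789012:server-certificate/my-cert",
}
```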
Point 16) Amazon ELB IP addresses ending in .255
Sometimes ELB assigns its load balancers an IP address ending in X.X.X.255. Though this is technically fine, certain networks will not properly route to an IP address ending in .255. Unfortunately, it is currently not possible to exclude an IP address ending in .255 from ELB. It is therefore possible, in such circumstances, that requests from certain users will face issues. Keep this in mind when you are debugging ELB for missing requests.
Point 17) Amazon ELB is an inherently fault tolerant and scalable service
Elastic Load Balancer does not cap the number of connections that it can attempt to establish with the load balanced Amazon EC2 instances. We can expect this number to scale with the number of concurrent HTTP, HTTPS or SSL requests, or the number of concurrent TCP connections, that the Elastic Load Balancer receives. Since multiple load balancers are launched in the ELB tier, it is inherently fault tolerant as well. If you need a scalable and elastic LB layer, then ELB comes highly recommended. Amazon ELB can be deployed to support the following HA architectures in AWS: http://harish11g.blogspot.in/2012/02/elastic-load-balancing-aws-deployment.html
Point 18) Amazon ELB + Amazon Auto Scaling: no graceful connection termination
Amazon ELB can be configured to work seamlessly with Amazon Auto Scaling and Amazon CloudWatch. New EC2 instances launched by Auto Scaling are automatically added to the ELB for load balancing, and whenever load drops, existing EC2 instances can be removed from the ELB by Auto Scaling. Both Auto Scaling and ELB use CloudWatch monitoring to enable this functionality. The important point to remember with this kind of integration is that Amazon Auto Scaling does not gracefully (i.e. without interruption to existing connections) remove Web/App EC2 instances from Amazon ELB. The connections are dropped instantly when a Web/App EC2 instance is removed, and no grace period is given by ELB or Auto Scaling. This behavior can cause dozens or hundreds of users to get error pages while they are using the application when such an event occurs in the backend infrastructure.
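One workaround is to deregister the instance yourself, wait for in-flight requests to finish, and only then let the instance be terminated; a sketch using the classic ELB API (the load balancer name, instance ID and 30-second grace period are our own choices):

```python
import time

DRAIN_SECONDS = 30  # grace period for in-flight requests (our choice)

def drain_and_remove(elb_client, lb_name, instance_id, wait=DRAIN_SECONDS):
    """Deregister an instance from a classic ELB, then pause so existing
    connections can complete before the instance is terminated."""
    elb_client.deregister_instances_from_load_balancer(
        LoadBalancerName=lb_name,
        Instances=[{"InstanceId": instance_id}],
    )
    time.sleep(wait)

# Usage (requires boto3 and AWS credentials):
# import boto3
# drain_and_remove(boto3.client("elb"), "my-elb", "i-0123456789abcdef0")
```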
To know more about Amazon Auto Scaling :
http://harish11g.blogspot.in/2011/04/auto-scaling-in-aws-cloud.html
Harish's presentation @ AWS Summit on Amazon Auto Scaling: http://harish11g.blogspot.in/2011/11/scale-new-business-peaks-with-amazon.html
How Auto Scaling can save costs: http://harish11g.blogspot.in/2013/04/Amazon-Web-Services-AWS-Cost-Saving-Tips-how-Amazon-AutoScaling-can-reduce-leakage-save-costs.html
Other Load Balancing Articles
21 comments:
Nicely done.
You might want to add that, when creating an ELB within a VPC, the subnets assigned to the ELB must be internet-routable, even if the backend EC2 instances are on subnets that are private and have no route to the internet.
It is a very good post. I faced some of these issues myself.
Good post, some useful points. For point 9 are you sure it wasn't caused by client side DNS caching since ELB uses DNS round robin to distribute incoming traffic initially across ELBs. And for point 15 you can have SSL termination on the ELB and have the ELB initiate a new SSL session to the backend instance so still retain end to end encryption albeit via two SSL sessions.
It is nicely done.... How about a comparision between HA proxy on a Large instance & ELB which one will be performing better....of course availability would be a challenge...
Very Nice. This kind of post could save hours of troubleshooting for someone new. Even if it does so for one person (ahem.. may be me) it is an extremely worthy post. It will be great to see such posts on other AWS services such as EC2, DynamoDB, RDS, SES.
Regarding point 4:
You are basing this on results of Rightscale's test from 2010. Are you sure that ELB has not done any internal improvements in how fast it can scale up?
In the world of AWS things move and change very quickly. I'm not sure if data from 2010 is still accurate/relevant.
Another important issue that's good to know: when you add the ELB to support more than one AZ, it means it adds more IP addresses to the DNS record. no reason to add AZ you are not running instances on. also, that front is only working with the grace of the DNS roundrobin, so really DO NOT enable AZs you are not running instances on.
This also means that running two machines in one AZ and one in a second means the traffic won't be equal. if you are using more than one AZ behind the ELB, make sure you have roughly the same number of instances in each zone.
when will you update this article?
@rajshekar : We cannot compare HAProxy with ELB on concurrent req/sec basis because we do not know what is the instance type being used in the ELB tier at the point in time. Also ELB keeps scaling up/out its capacity depending upon the traffic and its internal algorithm. Since it is not direct apples to apples comparison it becomes tough to elaborate in this perspective. ELB also comes with Scalability + HA benefits packaged whereas we need to custom program this tier when using Nginx or HAProxy. On the other hand HAProxy as a LB in Amazon EC2 has got its own benefits and we use it in many of our implementations as well, some of my blog readers have requested me to detail this comparison out. I will surely do this in coming weeks.
@mxx : I agree that ELB team would have done some improvements on the scale up /out speed, but it is not directly measurable from outside since we do not know their internal scale out algorithm. if you had done any benchmarks on this front please share it across it will be really useful.
Unless you prewarm ELB i am sure it cannot spike its capacity in seconds to handle Flash Traffic of 20K reqs/sec.
I disagree with point 1. ELB does not entirely depend on Round Robin algorithm. Sure it comes into play, but at certain point.
My tests reveal that it uses a Least Pending Connections algorithm while sending the traffic to the backend instances. It seems like ELB keeps track of these connections. Round Robin comes into play when the backend instances have approximately the same number of pending requests.
But very good article!
How sure are you about point 18?
I've seen some people suggesting that the AutoScaling group tells the ELB to stop making connections to the EC2, and then attempts to cleanly shut down the EC2 (using 'shutdown -h now' or equivalent), which would mean that so long as connections finish within 20 seconds or so you should be fine (my reading tells me that after 20 seconds or so the AutoScaling group will indeed pull the plug on the EC2).
None of that is from Amazon themselves though, who seem to be very quiet in their documentation on the exact shut down procedure.
Great article, still high on Google's page rank so might be worth an edit to update the points that AWS have addressed such as ELB logging, Route53 aliasing for the CNAME length issue etc.
Thanks Harish.. Your post was helpful to identify one of our issues which started when ELB assigned itself with a .255 address and legacy hardware unable to route packets destined for it...
Two thumbs up.. Cheers
Manik
Amazon ELB - Can provide logs..
You need to enable ELB Logging to an S3 Bucket you like. Please alter the article. Thanks!
Viewers have requested for updates in this article. Will do the same in coming weeks.
Point #18 has also been addressed with the ConnectionDrainingPolicy that was introduced a while back :)
RE: Point 9. This definitely used to be an issue. It was not due to anything "fixable" -- ELB simply didn't route properly across AZ's from a small number of source ips (very likely they were hashing the IP to route). However, they've fixed this. At least with round robin, we no longer see volume imbalances across AZ's.
I can appreciate not having the time to update the article in detail but I suspect it would be helpful to readers to add a disclaimer to the top indicating that some (many?) of the issues have been addressed.