DNS is a globally distributed service that translates human-readable names like www.abc.com into numeric IP addresses like 192.0.2.1. DNS servers translate requests for names into IP addresses, controlling which server an end user connects to when they type a domain name into their web browser. In the Amazon Web Services infrastructure this function is provided by Amazon Route 53.
Route 53 is a scalable, authoritative Domain Name System (DNS) web service. It is also a Tier-0 service, one for which availability matters above all else.
Route 53 responds to DNS queries using a global network of authoritative DNS servers, which reduces latency. It also provides secure and reliable routing to infrastructure that uses Amazon Web Services (AWS) products such as Amazon Elastic Compute Cloud (Amazon EC2) and Elastic Load Balancing. AWS recently enhanced Route 53 with latency-based routing, which serves user requests from the AWS region with the lowest network latency.
If our application is hosted on Amazon EC2 instances in multiple AWS regions, we can reduce latency for our end users by serving their requests from the region with the lowest network latency. Route 53 LBR lets us use DNS to route end-user requests to the AWS region that will give our application users the fastest response, helping us improve the application's performance for a global audience.
Why do we need Latency Based Routing (LBR)?
Imagine XYZ Low Cost Airlines, a Singapore-based carrier that flies to 100+ destinations around the globe (and is rapidly expanding its operations every year), with the following characteristics:
- The majority of bookings and business happens through online and mobile channels
- Their website and online services are visited by users from Japan, Singapore, Australia, Europe and the Middle East throughout the year
- During sales promotions and holiday seasons they have visitors from even more locations around the world
Since their business depends heavily on online channels (web and mobile), their site needs to be highly available, scalable and fast. Like most online companies, XYZ Airlines started their operations with a centralized architecture.
Effect of a Centralized Architecture
The entire web/app infrastructure is provisioned inside a single AWS region (for example, the Singapore region). User requests originating from all around the world are directed to the centralized infrastructure launched in that single region. This architecture may be optimal to start with, but when your user base is distributed across multiple geographies things start to crumble. Users accessing the site from different geographies will see different response times because of network latency on the internet. There is also a single point of failure if the network link to that particular AWS region is broken (though the latter is a very rare occurrence). Example: users from Singapore and Malaysia will see faster response times from the AWS Singapore servers than users from Europe and the MEA regions (who may feel the latency creeping up).
Since XYZ Airlines follows a simple centralized architecture, with its data center in Singapore and its visitors distributed across the globe, let us look at the round-trip latency measurements to access their website on this centralized infrastructure:
Website hosted in the Singapore region (centralized architecture):

| From Japan (ms) | From Germany (ms) | From Malaysia (ms) | From Rio (ms) |
|-----------------|-------------------|--------------------|---------------|
| 74              | 121               | 30                 | 176           |
Note: The above
measurements are not constant and may keep varying every few seconds.
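As a rough illustration of how such figures can be gathered, round-trip time can be approximated by timing a TCP handshake from each client location to the site's endpoint. A minimal sketch in Python; the hostname is a placeholder and real measurements should be averaged over many samples:

```python
# Rough round-trip estimate: time a TCP handshake to the site's HTTPS port.
# The hostname below is a placeholder; run this from each client geography.
import socket
import time

HOST = "www.xyzairlines.example"  # placeholder domain
PORT = 443
SAMPLES = 5

timings = []
for _ in range(SAMPLES):
    start = time.time()
    with socket.create_connection((HOST, PORT), timeout=5):
        pass                      # connection established; close immediately
    timings.append((time.time() - start) * 1000)

print(f"avg TCP connect time: {sum(timings) / len(timings):.1f} ms")
```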
If a single round trip takes 121 ms from Europe, imagine content-heavy pages loading multiple images, JavaScript files, HTTP/S requests and hundreds of AJAX calls. A page that needs just five sequential round trips already spends more than 600 ms on the network alone, before any server-side processing. It is certain they will run into problems as they expand.
What is the Cost of Latency?
One of the most important rules for a company whose business depends on online channels is PERFORMANCE. No visitor likes a slow-loading site, and everybody feels that faster is always better. Many companies doing business online sooner or later realize that latency has a definite impact on their sales, bounce rates, average time visitors spend on the site, and so on. Poor performance hurts online sales heavily: 79% of dissatisfied shoppers are less likely to buy from an online site again, and 75% would be less likely to return to the website at all.
In 2009, a study by Forrester Research found that online shoppers expected pages to load in two seconds or fewer, and that at three seconds a large share of visitors abandon the site. This two-second rule is still often cited as a standard for web commerce sites, but many in the online industry feel it is already outdated. Let us see what they think:
- Amazon found that shaving 100 ms off of load time resulted in a 1% increase in sales on their .com site.
- Google engineers found that users begin to get frustrated with a site after waiting just 400 milliseconds. (Note: Google considers your site speed in its ranking calculations. If the site loads very slowly, your .com will rank lower and eventually lose business to your competitors.)
- "Two hundred fifty milliseconds, either slower or faster, is close to the magic number now for competitive advantage on the Web." People will visit a website less often if it is slower than a close competitor by more than 250 milliseconds. - Microsoft
- A year-long performance redesign resulted in a 5-second speedup (from ~7 seconds to ~2 seconds). This produced a 25% increase in page views, a 7-12% increase in revenue, and a 50% reduction in hardware. The last point shows the win-win of performance improvements: increasing revenue while driving down operating costs. - Shopzilla
Now that we have understood the importance and the cost of architectures that do not concentrate on reducing latency, let us explore how to address this problem.
Welcome to Geo Distributed Architecture (Using Route 53 LBR)
In this architecture, the web/app infrastructure of XYZ Airlines is geo-distributed across multiple AWS regions (for example Singapore, Japan and Europe).
User requests originating from all around the world are directed to the nearest AWS region or, more precisely, to the AWS region with the lowest network latency. For example, suppose we have load balancers in the Singapore, Japan and Europe regions and we have created a latency resource record set in Route 53 for each load balancer. An end user in Dubai enters the name of our domain in their browser, and the Domain Name System routes the request to a Route 53 name server. Route 53 refers to its data on latency between Dubai and the Europe region and between Dubai and the Singapore region. If latency is lower between Dubai and the Europe region (as it is most of the time), Route 53 responds to the end user's request with the IP address of our load balancer in the Europe region. If latency is lower between Dubai and the Singapore region, Route 53 responds with the IP address of the load balancer in Singapore. This architecture dramatically cuts down latency and gives the user a better experience. Also, in case one of the regions is facing network problems, requests can be routed to an alternate low-latency region, achieving high availability at the overall website level. Though this architecture has benefits, it comes with various complexities depending upon the use case and the technology stack used; we will uncover some of them in this article.
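As a rough illustration of how such latency resource record sets might be created, here is a minimal sketch using the AWS SDK for Python (boto3). The hosted zone ID, domain name and regional ELB endpoints are placeholders, not values from any real setup:

```python
# Hypothetical sketch: one latency-based (LBR) alias record per region,
# each pointing at that region's Elastic Load Balancer. All IDs/names are placeholders.
import boto3

route53 = boto3.client("route53")

HOSTED_ZONE_ID = "ZEXAMPLE12345"         # placeholder public hosted zone
DOMAIN = "www.xyzairlines.example."      # placeholder domain

regional_elbs = [
    # (AWS region, ELB hosted zone ID, ELB DNS name) -- placeholders
    ("ap-southeast-1", "ZELBSGEXAMPLE", "xyz-sg.ap-southeast-1.elb.amazonaws.com."),
    ("eu-west-1",      "ZELBEUEXAMPLE", "xyz-eu.eu-west-1.elb.amazonaws.com."),
]

changes = []
for region, elb_zone_id, elb_dns in regional_elbs:
    changes.append({
        "Action": "CREATE",
        "ResourceRecordSet": {
            "Name": DOMAIN,
            "Type": "A",
            "SetIdentifier": "lbr-" + region,   # unique per latency record
            "Region": region,                   # marks the record as latency-based
            "AliasTarget": {
                "HostedZoneId": elb_zone_id,
                "DNSName": elb_dns,
                "EvaluateTargetHealth": True,   # skip a region whose ELB is unhealthy
            },
        },
    })

route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={"Comment": "LBR records for regional ELBs", "Changes": changes},
)
```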
If XYZ Airlines follows the geo-distributed architecture, with data centers in Singapore, Japan and Europe, let us look at the round-trip latency measurements to access their website:
The XYZ Airlines website hosted in multiple AWS regions with LBR configured:

| From Japan (ms) | From Germany (ms) | From Malaysia (ms) | From Rio (ms) |
|-----------------|-------------------|--------------------|---------------|
| 12              | 21                | 30                 | 75            |
Note: The above
measurements are not constant and may keep varying every few seconds.
From the above table we can observe that with the geo-distributed architecture, HTTP/S and AJAX calls are delivered from the AWS region with the lowest latency (usually the nearest region); the round-trip latency measurements have dropped significantly and overall performance has improved for the users.
Complexities and Best Practices behind Geo Distributed Architecture + Route 53 LBR
Let us take a simple geo-distributed online application stack and explore the technicalities and best practices:
DNS and CDN Layer: Configure Route 53 to manage the DNS entries, mapping domain names to CloudFront distributions and to latency-based routing entries. The LBR records point to the Amazon Elastic Load Balancer endpoints in Europe and Singapore. Amazon Route 53's Latency Based Routing (LBR) feature will route Amazon CloudFront origin requests to the AWS region that provides the lowest possible latency. Internally, Amazon Route 53 is integrated with Amazon CloudFront to collect latency measurements from each CloudFront edge location, resulting in optimal performance for origin fetches and improving overall performance.
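To make the combined setup concrete, here is a hypothetical sketch of the CDN side of this layer: the latency records from the earlier sketch would sit on an origin hostname (say, origin.xyzairlines.example) that the CloudFront distribution pulls from, while the public site name is aliased to the distribution itself. The hosted zone ID and distribution domain below are placeholders; Z2FDTNDATAQYW2 is the hosted zone ID Route 53 documents for CloudFront aliases:

```python
# Hypothetical sketch: alias the public site name to a CloudFront distribution,
# while the LBR-managed origin hostname stays pointed at the regional ELBs.
import boto3

route53 = boto3.client("route53")

route53.change_resource_record_sets(
    HostedZoneId="ZEXAMPLE12345",                        # placeholder hosted zone
    ChangeBatch={
        "Comment": "Point the site at CloudFront; CloudFront pulls from the LBR origin",
        "Changes": [{
            "Action": "CREATE",
            "ResourceRecordSet": {
                "Name": "www.xyzairlines.example.",      # placeholder site name
                "Type": "A",
                "AliasTarget": {
                    "HostedZoneId": "Z2FDTNDATAQYW2",    # fixed zone ID for CloudFront aliases
                    "DNSName": "d111111abcdef8.cloudfront.net.",  # placeholder distribution
                    "EvaluateTargetHealth": False,
                },
            },
        }],
    },
)
```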
Load Balancing Layer: Amazon Elastic Load Balancing (ELB) is used as the load balancing layer. ELB can elastically expand its capacity to handle load during peak traffic. ELB should be configured with SSL termination at the Apache backends to meet security and compliance requirements in case sensitive information is passed. The round-robin algorithm is suitable for most scenarios. ELB should be configured to load balance across multiple AZs inside an AWS region. For more details about architecting with ELB refer to http://harish11g.blogspot.in/2012/07/aws-elastic-load-balancing-elb-amazon.html
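As a sketch of this layer (the load balancer name, AZs and health-check path are placeholder assumptions), a classic ELB spanning two AZs could be created with boto3 like this, with HTTPS passed through as TCP so that SSL terminates on the Apache backends:

```python
# Hypothetical sketch: a classic ELB in one region, spanning two AZs,
# passing HTTPS through (TCP) so SSL terminates on the Apache backends.
import boto3

elb = boto3.client("elb", region_name="ap-southeast-1")

elb.create_load_balancer(
    LoadBalancerName="xyz-web-sg",                       # placeholder name
    Listeners=[
        {"Protocol": "HTTP", "LoadBalancerPort": 80, "InstancePort": 80},
        # TCP pass-through keeps the TLS session intact until the Apache backend
        {"Protocol": "TCP", "LoadBalancerPort": 443, "InstancePort": 443},
    ],
    AvailabilityZones=["ap-southeast-1a", "ap-southeast-1b"],  # Multi-AZ
)

elb.configure_health_check(
    LoadBalancerName="xyz-web-sg",
    HealthCheck={
        "Target": "HTTP:80/health",   # placeholder health-check path
        "Interval": 30,
        "Timeout": 5,
        "UnhealthyThreshold": 2,
        "HealthyThreshold": 3,
    },
)
```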
Web/App Layer: Apache Tomcat EC2 instances are launched from S3-backed Linux AMIs in multiple AZs, and logs are periodically shipped to S3. Amazon Auto Scaling can be configured based on CPU or custom metrics to elastically increase/decrease the number of EC2 instances across multiple AZs (the recommended approach for scalability and HA). ELB, Auto Scaling, CloudWatch and Route 53 work together. Session state is synchronized in Memcached. For more details on Amazon EC2 Availability Zones refer to http://harish11g.blogspot.in/2012/07/amazon-availability-zones-aws-az.html
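A hedged sketch of the scaling setup described above, using boto3; the AMI ID, instance type, group names and thresholds are placeholders:

```python
# Hypothetical sketch: an Auto Scaling group for the Tomcat layer, spread over
# two AZs, registered behind the regional ELB, scaling out on high CPU.
import boto3

autoscaling = boto3.client("autoscaling", region_name="ap-southeast-1")
cloudwatch = boto3.client("cloudwatch", region_name="ap-southeast-1")

autoscaling.create_launch_configuration(
    LaunchConfigurationName="xyz-tomcat-lc",
    ImageId="ami-12345678",            # placeholder S3-backed Linux AMI
    InstanceType="m1.large",
)

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="xyz-tomcat-asg",
    LaunchConfigurationName="xyz-tomcat-lc",
    MinSize=2,
    MaxSize=10,
    AvailabilityZones=["ap-southeast-1a", "ap-southeast-1b"],  # Multi-AZ for HA
    LoadBalancerNames=["xyz-web-sg"],  # the ELB from the previous sketch
    HealthCheckType="ELB",
    HealthCheckGracePeriod=300,
)

# Scale out by two instances whenever average CPU stays above 70% for 10 minutes.
policy = autoscaling.put_scaling_policy(
    AutoScalingGroupName="xyz-tomcat-asg",
    PolicyName="scale-out-on-cpu",
    AdjustmentType="ChangeInCapacity",
    ScalingAdjustment=2,
    Cooldown=300,
)

cloudwatch.put_metric_alarm(
    AlarmName="xyz-tomcat-high-cpu",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Statistic="Average",
    Period=300,
    EvaluationPeriods=2,
    Threshold=70.0,
    ComparisonOperator="GreaterThanThreshold",
    Dimensions=[{"Name": "AutoScalingGroupName", "Value": "xyz-tomcat-asg"}],
    AlarmActions=[policy["PolicyARN"]],
)
```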
Solr Search Layer: Solr search instances are launched from EBS-backed AMIs. Solr EC2 instances can be replicated between multiple AZs or sharded inside an AZ depending upon the need. High-memory instances with RAID (EBS striping) + EBS-optimized + Provisioned IOPS give better performance on AWS. Periodic snapshots are taken and copied across regions, and Solr and the DB are synced periodically. For more details on Solr sharding refer to http://harish11g.blogspot.in/2012/02/apache-solr-sharding-amazon-ec2.html
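A small sketch of the periodic snapshot-and-copy step, assuming boto3 and a placeholder volume ID:

```python
# Hypothetical sketch: snapshot the Solr EBS volume periodically and copy the
# snapshot to another region for disaster recovery. Volume ID is a placeholder.
import boto3

ec2_sg = boto3.client("ec2", region_name="ap-southeast-1")   # source region
ec2_eu = boto3.client("ec2", region_name="eu-west-1")        # destination region

snapshot = ec2_sg.create_snapshot(
    VolumeId="vol-12345678",                 # placeholder Solr data volume
    Description="Periodic Solr index snapshot",
)

# Wait until the snapshot completes before copying it across regions.
ec2_sg.get_waiter("snapshot_completed").wait(SnapshotIds=[snapshot["SnapshotId"]])

# copy_snapshot is called in the destination region, referencing the source.
ec2_eu.copy_snapshot(
    SourceRegion="ap-southeast-1",
    SourceSnapshotId=snapshot["SnapshotId"],
    Description="Solr snapshot copied from Singapore",
)
```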
Database Layer:
If the use case demands that data be localized inside a single AWS region, then one of the following approaches is recommended (see the sketch after this list):
- RDS with Multi-AZ for HA, plus HAProxy-load-balanced RDS Read Replicas across multiple AZs for read scaling
- MySQL Master with 1-2 Slaves spread across multiple AZs inside a region; RAID 0 with XFS + EBS-optimized instances + PIOPS for performance
If the use case demands unidirectional data synchronization across AWS regions, then:
- A MySQL Master can sync data to a MySQL Slave in another AWS region. Data can be sent over SSL or in the clear, according to the requirements.
- If MySQL is inside a VPC (private subnet), then an IPsec tunnel should be established between the two AWS regions for communication.
If the use case demands bidirectional data synchronization across AWS regions, then:
- MySQL Master-Master replication across regions and Master-Slave inside regions can be configured. Though bidirectional data sync can be achieved this way, transactional integrity becomes complex, and overall this model is not very efficient as the number of AWS regions increases.
- Usually the best practice is to avoid bidirectional sync altogether and expose the function as a common data web service that can be consumed over the web. This way the geo-distributed applications in both AWS regions can consume that function for information.
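A minimal sketch of the first (region-localized) approach with boto3: a Multi-AZ RDS MySQL primary plus a read replica in another AZ that HAProxy could front for read scaling. All identifiers and sizes are placeholders:

```python
# Hypothetical sketch: Multi-AZ RDS MySQL primary plus a cross-AZ read replica
# for read scaling (fronted by HAProxy). All identifiers are placeholders.
import boto3

rds = boto3.client("rds", region_name="ap-southeast-1")

rds.create_db_instance(
    DBInstanceIdentifier="xyz-bookings-master",
    Engine="mysql",
    DBInstanceClass="db.m1.large",
    AllocatedStorage=100,
    MasterUsername="admin",
    MasterUserPassword="change-me-please",   # placeholder; keep real credentials secret
    MultiAZ=True,                            # synchronous standby in another AZ for HA
)

# Wait for the primary to become available before creating the replica.
rds.get_waiter("db_instance_available").wait(DBInstanceIdentifier="xyz-bookings-master")

rds.create_db_instance_read_replica(
    DBInstanceIdentifier="xyz-bookings-replica-1",
    SourceDBInstanceIdentifier="xyz-bookings-master",
    DBInstanceClass="db.m1.large",
    AvailabilityZone="ap-southeast-1b",      # place the replica in a different AZ
)
```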
Caching Layer: Use Memcached/ElastiCache for storing sessions, the results of heavy queries, frequently used queries and complex DB/Solr queries, thereby significantly reducing the database load. ElastiCache cannot be distributed across AZs, so Memcached on Amazon EC2 with Multi-AZ distribution is recommended for a website that relies heavily on the caching layer. The cache need not be replicated across regions; if that is needed, you have to sync the Master/Slave DB replication with Memcached to ensure some consistency.
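As an illustration of the read-through caching pattern described above, here is a minimal sketch using the pymemcache client; the Memcached endpoint, key scheme and query function are placeholders:

```python
# Hypothetical sketch: read-through caching of a heavy DB/Solr query in Memcached
# using the pymemcache library. Endpoint, key scheme and query are placeholders.
import json
from pymemcache.client.base import Client

cache = Client(("memcached.internal.example", 11211))   # placeholder Memcached endpoint

def run_heavy_schedule_query(route: str) -> dict:
    # Placeholder standing in for the real DB/Solr query.
    return {"route": route, "departures": []}

def get_flight_schedule(route: str) -> dict:
    key = "schedule:" + route
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)                        # cache hit: skip the database
    result = run_heavy_schedule_query(route)             # cache miss: hit DB/Solr once
    cache.set(key, json.dumps(result), expire=300)       # keep it for 5 minutes
    return result
```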
Storage Layer: Amazon S3 is used for storing images, JS and other static assets, and can serve as the CloudFront origin. All logs and user-uploaded files are synced to S3 in the local AWS region.
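A tiny sketch of shipping logs and user uploads to the regional bucket with boto3; the bucket names and file paths are placeholders:

```python
# Hypothetical sketch: ship rotated web logs and user uploads to S3 in the
# local region. Bucket names and file paths are placeholders.
import boto3

s3 = boto3.client("s3", region_name="ap-southeast-1")

# Rotated Apache/Tomcat access log shipped periodically (e.g. from cron).
s3.upload_file("/var/log/httpd/access.log.1",
               "xyz-logs-ap-southeast-1",          # placeholder bucket
               "web/access-2012-09-01.log")

# User-uploaded file stored once, then served via CloudFront with S3 as origin.
s3.upload_file("/tmp/boarding-pass.pdf",
               "xyz-assets-ap-southeast-1",
               "uploads/boarding-pass.pdf")
```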
Functional Patterns for a Geo Distributed App (Sample)
In any online application (for example XYZLowCostAirlines.com), some functions are used far more heavily than others. Functions like searching for tickets, searching for products, user profiles and product details, flight schedules and status, trip itineraries and printing, and discounts and offers are accessed heavily compared to the rest. These modules need to be highly scalable and available for visitors to consume information at any time. Since they are also customer-facing pages, these functions are usually designed with heavy content or lots of AJAX calls. Performance becomes very critical when we adopt a content-heavy design, and this adds to the latency burden. Also, millions of hits land on the search and product view functions, but only a small percentage (a few thousand) of them convert into bookings, so it becomes very critical for the company to architect these functions and customer-facing pages for a proper user experience and fast performance. For every extra second these pages take to load, thousands of customers lose patience and leave the site, which can add up to millions of dollars of business lost in a year. The diagram below illustrates a sample list of functions.
If you observe the characteristics of functions like booking, payments, user profile and registration, they are performed by only a few thousand users a day (compared to the millions of hits on the landing and search pages). They are not graphics-heavy, but are usually secured using SSL/HTTPS. Users can afford to wait a little while accessing these pages, but security, availability and data integrity take precedence in these functions. Depending upon the use case, these functions can be segregated as common services and exposed over the internet as HTTP/S or web services (REST, SOAP) for consumption by other services. Also, since these functions need not be regionally distributed, they can be shared and served from a common region/location. Example: search, discounts, events and schedules can be delivered from the nearest AWS region, whereas bookings and payments can be delivered from a common AWS region. This way both performance and data integrity can be maintained at the overall application level.
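To illustrate, here is a hypothetical sketch of a regionally deployed front end calling the centrally hosted booking service over HTTPS using the requests library; the endpoint and payload shape are assumptions for illustration only:

```python
# Hypothetical sketch: a regionally deployed front end calling the centrally
# hosted booking service over HTTPS. Endpoint, payload and auth are placeholders.
import requests

BOOKING_SERVICE = "https://bookings.xyzairlines.example/api/v1/bookings"  # placeholder

def create_booking(flight_id: str, passenger: dict) -> dict:
    # Latency-tolerant function: a longer timeout is acceptable here, while
    # integrity is enforced centrally in the single booking region.
    response = requests.post(
        BOOKING_SERVICE,
        json={"flight_id": flight_id, "passenger": passenger},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()
```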
Benefits of using Geo Distributed Architecture + Route 53 LBR
- Better performance for users than running in a single AWS region
- More business conversions for online companies
- Little or no business lost because of latency (the cost of latency)
- Overall improved reliability compared to running in a single region
- Easier implementation and much lower prices than traditional DNS solutions
- Trade-off: more complex and costlier to maintain than a centralized architecture