Sunday, September 23, 2012

Cost of Latency and Route53 LBR

DNS is a globally distributed service that translates human-readable domain names into numeric IP addresses. DNS servers translate requests for names into IP addresses, controlling which server an end user connects to when they type a domain name into their web browser. In the Amazon Web Services infrastructure this function is provided by Route 53.
Route 53 is a scalable and authoritative Domain Name System (DNS) web service. Route 53 is also a Tier-0 service – where availability is most important for its success.
Route 53 responds to DNS queries using a global network of authoritative DNS servers, which reduces latency. It also provides secure and reliable routing to our infrastructure that uses Amazon Web Services (AWS) products, such as Amazon Elastic Compute Cloud (Amazon EC2) and Elastic Load Balancing.

AWS recently enhanced Route 53 with Latency Based Routing (LBR), which serves each user request from the AWS region with the lowest network latency.
If our application is hosted on Amazon EC2 instances in multiple AWS regions, we can reduce latency for our end users by serving their requests from the EC2 region with lowest network latency. Route 53 LBR lets us use DNS to route end-user requests to the AWS region that will give our application users the fastest response. This way it helps us to improve our application’s performance for a global audience.

Why do we need Latency Based Routing (LBR)?

Imagine XYZ Low Cost Airlines, a Singapore-based carrier that serves 100+ destinations around the globe (and expands its operations every year), with the following characteristics:
  • The majority of its bookings and business happen through online and mobile channels
  • Its website and online services are visited by users from Japan, Singapore, Australia, Europe and the Middle East throughout the year
  • During sales promotions and holiday seasons it gets visitors from more locations around the world

Since the business depends heavily on online channels (web and mobile), the site needs to be highly available, scalable and fast. Like most online companies, XYZ Airlines started its operations with a centralized architecture.

Effect of Centralized Architecture 

The entire web/app infrastructure is provisioned inside a single AWS region (for example, the Singapore region). User requests originating all around the world are directed to this centralized infrastructure. This architecture may be a good starting point, but when your user base is distributed across multiple geographies things start crumbling. Users accessing the site from different geographies see different response times because of network latency on the internet. There is also a single point of failure if the network link to that particular Amazon EC2 region is broken (though this is a very rare occurrence).
Example: Users from Singapore and Malaysia will have faster response times with AWS Singapore servers than users from Europe and the MEA region (who might feel the latency creeping up).
Since XYZ Airlines follows a simple centralized architecture with its data center in Singapore while its visitors are distributed across the globe, let us look at the round-trip latency measurements for accessing its website on the centralized infrastructure:

Table: round-trip latency (ms) to the website hosted in the Singapore region (centralized architecture), as measured from Japan, Germany, Malaysia and Rio.

Note: The above measurements are not constant and may keep varying every few seconds.
If a single round trip takes 121 ms from Europe, imagine content-heavy pages that load multiple images and scripts over HTTP/S and make hundreds of AJAX calls. Every one of those requests pays the round-trip cost, so the airline is sure to run into performance problems as it expands.
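To make that arithmetic concrete, here is a rough back-of-the-envelope sketch. Only the 121 ms figure comes from the measurement above. The request count, the limit of 6 parallel connections and the 20 ms comparison value are assumptions for illustration, and the model ignores bandwidth, server time and connection setup.

```python
import math


def network_wait_ms(rtt_ms: float, requests: int, parallel_connections: int = 6) -> float:
    """Estimate the time a page spends purely waiting on round trips.

    Simplifying assumptions: every request costs exactly one round trip,
    and the browser keeps `parallel_connections` requests in flight at once.
    """
    if requests <= 0:
        return 0.0
    sequential_batches = math.ceil(requests / parallel_connections)
    return sequential_batches * rtt_ms


# Hypothetical page with 60 requests (images, scripts, AJAX calls).
far_away = network_wait_ms(rtt_ms=121, requests=60)  # user far from the region
nearby = network_wait_ms(rtt_ms=20, requests=60)     # assumed nearby user
print(f"far: {far_away:.0f} ms, nearby: {nearby:.0f} ms")
```

Even in this crude model, the round-trip time multiplies with every batch of requests, which is why distance to the serving region matters so much for content-heavy pages.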

What is the Cost of Latency?

One of the most important rules for a company whose business depends on online channels is PERFORMANCE. No visitor likes a slow-loading site, and everybody feels that faster is better. Many companies doing business online sooner or later realize that latency has a definite impact on their sales, bounce rates, average time visitors spend on the site, and so on. Poor performance affects online sales heavily: 79% of dissatisfied shoppers are less likely to buy from an online site again, and 75% would be less likely to return to the website.
In 2009, a study by Forrester Research found that online shoppers expected pages to load in two seconds or fewer, and that at three seconds a large share of visitors abandon the site. This two-second rule is still often cited as a standard for web commerce sites, but many in the online industry feel it is already outdated. Let us see what they think:
Amazon found that shaving 100ms off of load time results in a 1% increase in sales in their .com site.
Google engineers found that users begin to get frustrated with a site after waiting just 400 milliseconds. (Note: Google considers site speed in its ranking calculations. If your site loads very slowly, it is going to be ranked lower and will eventually lose business to your competitors.)
“Two hundred fifty milliseconds, either slower or faster, is close to the magic number now for competitive advantage on the Web.” People will visit a web site less often if it is slower than a close competitor by more than 250 milliseconds. - Microsoft
A year-long performance redesign resulted in a 5 second speed up (from ~7 seconds to ~2 seconds). This resulted in a 25% increase in page views, a 7-12% increase in revenue, and a 50% reduction in hardware. This last point shows the win-win of performance improvements, increasing revenue while driving down operating costs - Shopzilla
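As a rough illustration of what such figures can mean in money terms, the sketch below takes the Amazon data point quoted above (100 ms is worth about 1% of sales) and extrapolates it linearly. The linear extrapolation and the revenue figure are assumptions for illustration only, not findings from any of the studies.

```python
def estimated_sales_change_pct(latency_saved_ms: float, pct_per_100ms: float = 1.0) -> float:
    """Percentage change in sales for a given latency saving.

    Assumes the quoted "1% per 100 ms" relationship scales linearly,
    which is a simplification.
    """
    return latency_saved_ms / 100.0 * pct_per_100ms


def estimated_revenue_gain(annual_revenue: float, latency_saved_ms: float) -> float:
    """Extra annual revenue implied by the linear model above."""
    return annual_revenue * estimated_sales_change_pct(latency_saved_ms) / 100.0


# Hypothetical airline with 10 million in annual online revenue
# that shaves 250 ms off its page loads.
print(estimated_revenue_gain(10_000_000, 250))
```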
Now that we understand the importance of latency and the cost of architectures that do not concentrate on reducing it, let us explore how to address this problem.

Welcome to Geo Distributed Architecture (Using Route53 LBR)

In this architecture, the web/app infrastructure of the XYZ Airlines is geo distributed across multiple AWS regions (example Singapore, Japan, Europe etc). 

User requests originating all around the world are directed to the nearest AWS region or, more precisely, to the AWS region with the lowest network latency. For example, suppose we have load balancers in the Singapore, Japan and Europe Amazon EC2 regions and have created a latency resource record set in Route 53 for each load balancer. An end user in Dubai enters our domain name in their browser, and the Domain Name System routes the request to a Route 53 name server. Route 53 refers to its data on latency between Dubai and the Europe EC2 region and between Dubai and the Singapore EC2 region. If latency is lower between Dubai and the Europe region (which is the case most of the time), Route 53 responds with the IP address of our load balancer in the Europe EC2 region. If latency is lower between Dubai and the Singapore region, Route 53 responds with the IP address of the load balancer in the Singapore region. This architecture sharply cuts down latency and gives the user a better experience. Also, if one of the regions is facing network problems, requests can be routed to an alternate low-latency region, achieving high availability at the overall website level. Though this architecture has benefits, it comes with various complexities depending on the use case and the technical stack used; we will uncover some of them in this article.
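The selection rule described above can be sketched in a few lines. This is a toy model and not Route 53's actual implementation: the latency numbers are invented for the Dubai example, and the `unhealthy` set is an assumed stand-in for however a region with network problems gets excluded.

```python
from typing import Dict, Optional, Set


def pick_region(latency_ms: Dict[str, float], unhealthy: Optional[Set[str]] = None) -> str:
    """Return the region with the lowest latency, skipping unhealthy ones."""
    unhealthy = unhealthy or set()
    candidates = {region: ms for region, ms in latency_ms.items() if region not in unhealthy}
    if not candidates:
        raise ValueError("no healthy region available")
    return min(candidates, key=candidates.get)


# Invented latencies as seen from a resolver near Dubai.
from_dubai = {
    "eu-west-1": 110.0,       # Europe
    "ap-southeast-1": 140.0,  # Singapore
    "ap-northeast-1": 210.0,  # Japan
}

print(pick_region(from_dubai))                           # lowest latency wins
print(pick_region(from_dubai, unhealthy={"eu-west-1"}))  # next best region
```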
If XYZ Airlines follows a geo distributed architecture with data centers in Singapore, Japan and Europe, let us look at the round-trip latency measurements for accessing its website:

Table: round-trip latency (ms) to the XYZ Airlines website hosted in multiple AWS regions with LBR configured, as measured from Japan, Germany, Malaysia and Rio.

Note: The above measurements are not constant and may keep varying every few seconds.
From the above table we can observe that with the geo distributed architecture the HTTP/S and AJAX calls are served from the AWS region with the lowest latency (usually the nearest region). The round-trip latency has dropped significantly and overall performance has improved for the users.

Complexities and Best Practices behind Geo Distributed + R53 LBR

Let us take a simple Geo distributed online app stack and explore the technicalities and best practices a bit:

DNS and CDN Layer: Configure Route 53 to manage the DNS entries, mapping domain names to CloudFront distributions and to Latency Based Routing entries. The LBR records point to the Amazon Elastic Load Balancer endpoints in Europe and Singapore. Amazon Route 53’s Latency Based Routing (LBR) feature will route Amazon CloudFront origin requests to the AWS region that provides the lowest possible latency. Internally, Amazon Route 53 is integrated with Amazon CloudFront to collect latency measurements from each Amazon CloudFront edge location, resulting in optimal performance for origin fetches and improving overall performance.
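As a minimal sketch of what the LBR entries could look like, the function below builds a change batch containing one latency alias record per regional load balancer. The dictionary shape follows the `ChangeBatch` argument of boto3's Route 53 `change_resource_record_sets` call. The domain name, load balancer DNS names and hosted zone IDs are placeholders, and nothing here talks to AWS.

```python
from typing import Dict, List


def latency_change_batch(domain: str, load_balancers: Dict[str, Dict[str, str]]) -> dict:
    """Build one latency alias record set per region for `domain`.

    `load_balancers` maps an AWS region name to the ELB's DNS name and the
    hosted zone ID of that ELB (both supplied by the caller).
    """
    changes: List[dict] = []
    for region, elb in sorted(load_balancers.items()):
        changes.append({
            "Action": "CREATE",
            "ResourceRecordSet": {
                "Name": domain,
                "Type": "A",
                # SetIdentifier must be unique among records sharing a name and type.
                "SetIdentifier": f"{region}-elb",
                # Region is what makes this a latency record set.
                "Region": region,
                "AliasTarget": {
                    "HostedZoneId": elb["hosted_zone_id"],
                    "DNSName": elb["dns_name"],
                    "EvaluateTargetHealth": True,
                },
            },
        })
    return {"Comment": "Latency based routing records", "Changes": changes}


# Placeholder values only; real ELB names and zone IDs come from your account.
example_lbs = {
    "eu-west-1": {"hosted_zone_id": "ZPLACEHOLDEREU", "dns_name": "xyz-eu.example.com."},
    "ap-southeast-1": {"hosted_zone_id": "ZPLACEHOLDERSG", "dns_name": "xyz-sg.example.com."},
}
batch = latency_change_batch("www.example.com.", example_lbs)

# With credentials configured, the batch could then be submitted with:
#   boto3.client("route53").change_resource_record_sets(
#       HostedZoneId="<your hosted zone id>", ChangeBatch=batch)
```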

Load Balancing Layer: Amazon Elastic Load Balancing (ELB) is used as the load balancing layer. ELB can elastically expand its capacity to handle load during peak traffic. To meet security and compliance requirements when sensitive information is passed, Amazon ELB should be configured so that SSL is terminated at the Apache backends. The Round Robin algorithm is ideal for most scenarios. ELB should be configured to load balance across multiple AZs inside an Amazon EC2 region. For more details about architecting using ELB refer

Web/App Layer: Apache Tomcat EC2 instances are launched from S3-backed Linux AMIs in multiple AZs. Logs are periodically shipped to S3. Amazon Auto Scaling can be configured on CPU or custom metrics to elastically increase or decrease the number of EC2 instances across multiple AZs (the recommended approach for scalability and HA). ELB, Amazon Auto Scaling, CloudWatch and Route 53 work together. Session state is synchronized on MemCached. For more details on Amazon EC2 Availability Zones refer

Solr Search Layer: Solr search instances are launched from EBS-backed AMIs. Solr EC2 instances can be replicated across multiple AZs or sharded inside an AZ, depending on need. High Memory instances with RAID (EBS striping) + EBS optimized + Provisioned IOPS give better performance on AWS. Periodic snapshots are taken and moved across regions. Solr and the DB are synchronized periodically. For more details on Solr Sharding refer

Database Layer:
If the use case demands that the data be localized inside an Amazon EC2 region, then one of the following approaches is recommended:

  • RDS with Multi-AZ for HA, plus HAProxy load-balanced RDS Read Replicas across multiple AZs for read scaling
  • MySQL Master with 1-2 Slaves spread across multiple AZs inside a region, with RAID 0 on XFS + EBS optimized + PIOPS for performance
If the use case demands the unidirectional Data synchronization across Amazon EC2 regions then:

  • A MySQL Master can sync data to a MySQL Slave in another Amazon EC2 region. Data can be sent over SSL or in the clear, according to the requirements.
  • If MySQL is inside a VPC (private subnet), then an IPsec tunnel should be established between the two Amazon EC2 regions for communication.
If the use case demands the Bi directional Data synchronization across Amazon EC2 regions then:

  • MySQL Master-Master across regions and Master-Slave inside regions can be configured. Though bidirectional data sync can be achieved this way, transactional integrity becomes complex. Overall, this model is not very efficient as the number of AWS regions increases.
  • Usually the best practice is to avoid bidirectional sync and instead expose the function as a common data web service that can be consumed over the web. This way the geo distributed applications in both Amazon EC2 regions can consume that function for information.
Note: Geographically distributed databases are a hot field and new systems are emerging every day, such as Google Spanner, Yahoo PNUTS, NuoDB, TransLattice Elastic Database, Cloudant and ClearDB. In the coming days you will be using these systems, which will make architecting geo distributed applications easier. I also hope the AWS product team comes up with a solution for this geo distributed database problem.

Caching Layer: Use MemCacheD/ElastiCache for storing sessions and the results of heavy, frequently used and complex DB/Solr queries, thereby significantly reducing the database load. ElastiCache cannot be distributed over AZs. MemCacheD on Amazon EC2 with Multi-AZ distribution is recommended for websites that rely heavily on the caching layer. The cache need not be replicated across regions; if it is needed, you have to sync the Master/Slave DB replication with MemCacheD to ensure some consistency.
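The usual way to put a cache in front of heavy DB/Solr queries is the cache-aside pattern: look in the cache first, and only run the query on a miss. The sketch below uses a small in-process dictionary with expiry as a stand-in for MemCacheD so that it runs on its own. The class, function and key names are hypothetical, and a real deployment would call a Memcached client instead.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TTLCache:
    """Tiny in-memory stand-in for a Memcached-style cache with expiry."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._data: Dict[str, Tuple[float, Any]] = {}
        self._clock = clock

    def get(self, key: str) -> Optional[Any]:
        entry = self._data.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._data[key]  # entry expired, treat as a miss
            return None
        return value

    def set(self, key: str, value: Any, ttl_seconds: float) -> None:
        self._data[key] = (self._clock() + ttl_seconds, value)


def cached_query(cache: TTLCache, key: str, run_query: Callable[[], Any],
                 ttl_seconds: float = 60.0) -> Any:
    """Cache-aside: return the cached result, or run the query and store it."""
    value = cache.get(key)
    if value is None:
        value = run_query()  # the expensive DB/Solr call happens only on a miss
        cache.set(key, value, ttl_seconds)
    return value
```

A short TTL keeps results such as flight search listings reasonably fresh while still absorbing most of the repeated reads.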

Storage Layer: Use S3 for storing images, JS and other static assets. S3 can be the CloudFront origin. All logs and user-uploaded files are synced to S3 in the Amazon EC2 region.

Functional Patterns for Geo Distributed App (Sample)

In any online application (for example, the XYZ Low Cost Airlines site) some functions are used far more heavily than others. Functions like Search for Tickets, Search for Products, User Profiles and Product Details, Flight Schedules and Status, Trip Itinerary and Printing, and Discounts and Offers are accessed much more often than the rest. These modules need to be highly scalable and available so that visitors can consume information at any time. Since they are also the customer-facing pages, these functions are usually designed with heavy content or lots of AJAX calls. Performance becomes critical when we adopt a content-heavy design, because it adds to the latency burden. Also, millions of hits will happen on the search and product view functions, but only a small fraction (a few thousand) will convert into bookings, so it is critical for the company to architect these functions and customer-facing pages with a proper user experience and fast performance. For every extra second these pages take to load, thousands of customers will lose patience and leave the site, which may result in millions of dollars of lost business in a year. Below diagram illustrates the sample list of functions

If you observe the characteristics of functions like Booking, Payments, User Profile and Registrations, they are used by only a few thousand users a day (compared to millions of hits on landing and search pages). They are not graphics heavy, but are usually secured using SSL/HTTPS. Users can afford to wait while accessing these pages, but security, availability and data integrity take precedence. Depending on the use case, these functions can be segregated as common services and exposed over the internet as HTTP/S or web services (REST, SOAP) for consumption by other services. Since these functions need not be regionally distributed, they can be shared and served from a common region/location. Example: Search, Discounts, Events and Schedules can be delivered from the nearest AWS region, whereas Bookings and Payments can be delivered from a common AWS region. This way both performance and data integrity can be maintained at the overall application level.
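The split between regionally served and centrally served functions can be captured in a small lookup. This is only a sketch of the idea: the function names mirror the example above, while the region identifiers and the choice of Singapore as the common region are assumptions.

```python
# Functions served from whichever region is closest to the user.
REGIONAL_FUNCTIONS = {"search", "discounts", "events", "schedules"}
# Functions served from one common region to protect data integrity.
CENTRAL_FUNCTIONS = {"bookings", "payments"}


def serving_region(function: str, nearest_region: str,
                   common_region: str = "ap-southeast-1") -> str:
    """Return the AWS region that should serve the given application function."""
    if function in CENTRAL_FUNCTIONS:
        return common_region
    if function in REGIONAL_FUNCTIONS:
        return nearest_region
    raise KeyError(f"unknown function: {function}")


# A user whose lowest-latency region is Europe (assumed to be eu-west-1).
print(serving_region("search", nearest_region="eu-west-1"))
print(serving_region("payments", nearest_region="eu-west-1"))
```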

Benefits of using Geo Distributed + Route 53 LBR
  • Better performance for users than running in a single AWS region
  • More business conversions for online companies
  • Little or no business lost because of latency (the cost of latency)
  • Overall improved reliability compared to running in a single region
  • Easier implementation and much lower prices than traditional DNS solutions
  • Trade-off: more complex and costlier to maintain than a centralized architecture


