Sunday, April 29, 2012

The Art of Infrastructure Elasticity

"The Art of Infrastructure Elasticity" was presented by Harish Ganesan on April 28th at the Cloud Developer Conference 2012, Bangalore.


Agenda :

Problem
Challenges
Requirements
Solution Architecture
Q&A


What is the problem scenario?
  • Big Sales Promotion every quarter by the Enterprise
  • Massive numbers of concurrent online Visitors
  • Limited processing capacity of the Booking Engine (~3k requests/sec)
  • Unhappy Visitors
  • More Booking opportunities lost
Solution (Step 1):
  • Create a Queuing App before the Booking engine
  • Efficiently Queue the concurrent visitors
Solution (Step 2):
  • Moderate and move the visitors waiting in the Queuing app to the Booking engine


What are the Challenges?


Challenge 1: Concurrency
  • HTTP/AJAX/REST requests
  • Total: 500+ million requests in 6 hours
  • Average: 23k+ requests/sec
  • Peak: 80k+ requests/sec
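
As a quick sanity check on these figures, the average rate follows directly from the total and the time window. The sketch below is only arithmetic on the rounded lower bounds quoted above; the variable names are illustrative.

```python
# Rounded lower-bound figures from the talk ("500+ million requests in 6 hours").
total_requests = 500_000_000
window_seconds = 6 * 60 * 60          # 6 hours = 21,600 seconds

average_rps = total_requests / window_seconds
peak_rps = 80_000                     # stated peak, requests/sec
booking_engine_rps = 3_000            # stated Booking Engine capacity, requests/sec

print(f"average: {average_rps:,.0f} requests/sec")
print(f"peak vs average: {peak_rps / average_rps:.1f}x")
print(f"peak vs booking engine capacity: {peak_rps / booking_engine_rps:.1f}x")
```

The average works out to roughly 23.1k requests/sec, which matches the "23k+" figure, and the peak is well over an order of magnitude above what the Booking Engine can absorb directly.
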


Challenge 2: Queue efficiency
  • Allot unique Queue Numbers for visitors
  • Queue Number allotment on a fair basis (as far as possible)
  • Reduce the wait time in the Queue Number allotment process
  • Reduce overall Queue wait time for the visitor
Challenge 3: Load Volatility
  • Massive utilization during the promo and under-utilization at other times


Challenge 4: IP Whitelisting
  • Booking engine and other 3rd party web services need EC2 IP whitelisting for security
  • Consecutive IP range needed for whitelisting
Challenge 5: Variety of OS / Software in the tech stack
  • RedHat OS for Load Balancer, NoSQL and Queue Layer
  • Apache Tomcat for the Java Web/App Layer
  • CentOS for Processing Programs
  • MySQL for Result storage
  • Hadoop for Analytics
What are the requirements from the enterprise?
Elastic Infrastructure
  • Create the infrastructure 2 hrs before the promo
  • Tear down the infrastructure 2 hrs after the promo
  • Elastically expand the infra during the promo
Highly Scalable and Available
Log Analytics
Complete Infrastructure Automation


Solution Architecture
Option 1: Single Queue (initial thought)
Option 2: Parallel Queue (recommended)


Request types
  • A Customer Visit is an HTTP request to the Queuing Application
  • The Current Visitor Queue position is an AJAX call to the Queuing Application every X seconds. The longer the wait, the more calls
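
To see why position polling tends to dominate the request volume, here is a minimal sketch of the relationship described above. The talk does not give a value for X, so the visitor count, wait time and poll interval below are hypothetical, and both function names are made up for illustration.

```python
import math

def polling_rps(waiting_visitors, poll_interval_s):
    """Steady-state AJAX poll rate if every waiting visitor polls once per interval."""
    if poll_interval_s <= 0:
        raise ValueError("poll interval must be positive")
    return waiting_visitors / poll_interval_s

def polls_per_visitor(wait_s, poll_interval_s):
    """Number of position polls a single visitor makes while waiting."""
    if poll_interval_s <= 0:
        raise ValueError("poll interval must be positive")
    return math.floor(wait_s / poll_interval_s)

# Hypothetical inputs: 100,000 waiting visitors, X = 10 seconds, a 10 minute wait.
print(polling_rps(100_000, 10))
print(polls_per_visitor(600, 10))
```

Under these assumed inputs, a single HTTP visit fans out into dozens of follow-up calls, so shortening the queue wait reduces load as well as improving the visitor experience.
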
Solution Step 1: The Cloud?
  • Amazon Web Services
  • We had 4+ years of architecture experience on AWS
  • It satisfied many of the customer requirements and challenges in this use case
Solution Step 2: Route 53 / Network
  • Amazon VPC with Multi-AZ subnet configurations (HA)
  • Amazon Route 53 for Managed DNS
  • DNS round-robin (RR) algorithm at Route 53
Solution Step 3: Load Balancing
  • HAProxy vs Amazon ELB
  • Custom programs to Auto Scale HAProxy
  • Elastic HAProxy -> Attach to / Detach from Route 53
  • HAProxy IP whitelisting in the 3rd party Gateway
  • 16 HAProxy Instances, 2 AZs, 2 Subnets
  • Round-robin (RR) Load Balancing algorithm
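
Round-robin simply hands each new request to the next backend in a fixed rotation. The sketch below illustrates the idea in plain Python; it is not HAProxy's or Route 53's implementation, and the backend names are hypothetical.

```python
from itertools import cycle

def round_robin(backends):
    """Return a function that yields the next backend in strict rotation."""
    if not backends:
        raise ValueError("need at least one backend")
    rotation = cycle(list(backends))   # copy, so later edits to the caller's list are ignored
    return lambda: next(rotation)

# Hypothetical names; the talk lists 16 HAProxy instances across 2 AZs.
next_backend = round_robin(["haproxy-az1-01", "haproxy-az2-01", "haproxy-az1-02"])
picks = [next_backend() for _ in range(6)]
print(picks)
```

When a backend is attached or detached, this sketch would need a new rotation built from the updated list.
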
Solution Step 4: Web/App Servers
  • Web/App instances under every HAProxy
  • c1.xlarge Instance Type for Web/App Instances
  • Custom programs to Auto Scale c1.xlarge
  • Automatic Attach to / Detach from HAProxy
  • Every Web/App Instance has an EIP for IP whitelisting
  • 48 Web/App EC2 Instances spread across 2 AZs
Solution Step 5: Queue Servers
  • RabbitMQ vs Amazon SQS
  • Needs: FIFO / Concurrency / No Duplicate messages
  • 1 RabbitMQ instance for queuing in every sector
  • m1.large Instance Type
  • 16 RabbitMQ Instances overall
Solution Step 6: Redis
  • Redis vs Amazon DynamoDB
  • Redis: NoSQL key-value (KV) data store
  • Visitors are shown their Current Queue position every X seconds from Redis
  • 1 Redis Master-Slave setup for every sector
  • m1.large Instance Type for Redis
  • 32 Redis Instances overall
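
One way the position lookup could work against a key-value store is sketched below. This is an assumption, not the design from the talk: a plain dict stands in for a sector's Redis instance, the key names are hypothetical, and the position is taken to be the visitor's queue number minus the last number admitted to the Booking Engine.

```python
# A plain dict stands in for the sector's Redis instance (simple GET/SET on string keys).
store = {}

def record_allotment(visitor_id, queue_number):
    """Remember the queue number allotted to a visitor (hypothetical key layout)."""
    store[f"visitor:{visitor_id}:queue_no"] = queue_number

def mark_admitted_up_to(queue_number):
    """Record the highest queue number already moved to the Booking Engine."""
    store["sector:last_admitted"] = queue_number

def current_position(visitor_id):
    """Places left until admission; 0 means already admitted, None means unknown visitor."""
    queue_number = store.get(f"visitor:{visitor_id}:queue_no")
    if queue_number is None:
        return None
    last_admitted = store.get("sector:last_admitted", 0)
    return max(queue_number - last_admitted, 0)

record_allotment("alice", 42)
mark_admitted_up_to(40)
print(current_position("alice"))
```

Each poll is then a couple of key reads, which is the kind of workload a KV store handles cheaply.
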
Solution Step 7: Processors
  • BG Processors: Java programs that
      • RabbitMQ -> Redis: allot Queue numbers to visitor requests and insert them into Redis
      • Redis -> Booking Engine: moderate the movement of queued visitors from Redis to the Booking Engine
      • Process the Response Status / Booking Status / Inactive Visitors / Timeouts
  • 2 BG Processor nodes per sector
  • CPU intensive: c1.xlarge Instance Type
  • 32 BG Processor Instances overall
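
The two main processor duties can be sketched as a pair of small functions. The original processors were Java programs working against RabbitMQ and Redis; here in-memory Python structures stand in for both, and the function names, the duplicate handling and the per-tick capacity are assumptions made for illustration.

```python
from collections import deque
from itertools import count

incoming = deque()         # stand-in for the sector's RabbitMQ queue (FIFO)
queue_numbers = {}         # stand-in for Redis: visitor id -> queue number
waiting = deque()          # visitor ids in allotted order, not yet admitted
_next_number = count(1)    # monotonically increasing, so numbers are unique

def allot_queue_numbers():
    """RabbitMQ -> Redis: drain arrivals in FIFO order and number them."""
    while incoming:
        visitor_id = incoming.popleft()
        if visitor_id in queue_numbers:    # ignore a repeated arrival
            continue
        queue_numbers[visitor_id] = next(_next_number)
        waiting.append(visitor_id)

def admit_batch(capacity):
    """Redis -> Booking Engine: release at most `capacity` visitors per tick."""
    admitted = []
    while waiting and len(admitted) < capacity:
        admitted.append(waiting.popleft())
    return admitted

incoming.extend(["v1", "v2", "v3", "v2", "v4"])
allot_queue_numbers()
first = admit_batch(capacity=2)
print(queue_numbers, first, list(waiting))
```

Capping each batch is what keeps the Booking Engine within its limit regardless of how many visitors are queued.
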
Scalability
  • New sectors, each containing the LB, Web, Queue, NoSQL and BG stack, are created automatically depending on the load
  • Same AZ or multi-AZ can be specified for the creation
  • CloudWatch custom metrics used
  • Automated Java programs were used for the sector creation
  • No manual intervention needed
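
The scale-out decision can be reduced to a small calculation: how many sectors does the observed load need, and how many are missing? The sketch below assumes a per-sector capacity of 5,000 requests/sec, which is not a figure from the talk (it is simply the 80k peak divided by the 16 sectors implied by one RabbitMQ instance per sector). The function names are hypothetical.

```python
import math

def sectors_needed(load_rps, per_sector_rps, min_sectors=1):
    """Smallest number of sectors that can carry the given load."""
    if per_sector_rps <= 0:
        raise ValueError("per-sector capacity must be positive")
    return max(min_sectors, math.ceil(load_rps / per_sector_rps))

def sectors_to_create(load_rps, per_sector_rps, active_sectors):
    """How many new sectors to launch; never negative (scale-in is handled elsewhere)."""
    return max(0, sectors_needed(load_rps, per_sector_rps) - active_sectors)

# Assumed capacity of 5,000 requests/sec per sector, with 12 sectors currently active.
print(sectors_needed(80_000, 5_000))
print(sectors_to_create(80_000, 5_000, active_sectors=12))
```

In the setup described above, the load figure would come from the CloudWatch custom metrics and the result would drive the automated sector creation.
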
High Availability @ Instance Level
  • HA built @ Web/App, Redis and BG processor instances
  • Any failed / non-responsive EC2 instances are automatically detected and replaced by Java programs
  • No manual intervention needed
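
The detect-and-replace loop can be sketched as a reconcile step over the fleet. This is a simplified illustration rather than the original Java programs: the health check and the replacement launcher are injected as plain callables, which in practice would wrap real health probes and EC2 API calls, and all names are hypothetical.

```python
def reconcile(instances, is_healthy, launch_replacement):
    """Replace every instance that fails its health check; return the new fleet."""
    fleet = []
    for instance_id in instances:
        if is_healthy(instance_id):
            fleet.append(instance_id)
        else:
            # The failed instance is dropped and its replacement takes its slot.
            fleet.append(launch_replacement(instance_id))
    return fleet

# Fake check and launcher for illustration: "web-2" is treated as non-responsive.
fleet = reconcile(
    ["web-1", "web-2", "web-3"],
    is_healthy=lambda instance_id: instance_id != "web-2",
    launch_replacement=lambda instance_id: instance_id + "-replacement",
)
print(fleet)
```

Running such a step on a timer is one simple way to get the "no manual intervention" behaviour described above.
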
High Availability @ Sector level
  • Any failed / non-responsive instances inside Sectors are automatically detected and replaced by Java programs
  • If sector-3 fails, the other sectors remain active and can take requests
High Availability @ AZ level
  • If the entire AZ-2 fails, the load is balanced to instances in AZ-1
  • Automated programs create new sectors inside AZ-1 to handle the load
Log Analytics
  • Redis, Web/App, HAProxy and RabbitMQ logs synced to S3
  • Elastic MapReduce jobs to process / analyze the logs
  • Processed results moved to RDS MySQL for reports / visualizations
Monitoring
  • Nagios + Puppet (combined) for auto-scaled monitoring infra and deployment
  • CloudWatch custom metrics / Tomcat Valve / automated Java programs for EC2
Backup
  • No backups -> only syncs to S3
  • Golden AMI snapshots to S3
  • Periodic sync of data between EC2 and S3
  • Periodic log sync from Web/App to S3
Overall Infrastructure 
  • Amazon Route 53
  • Amazon VPC: Public and Private subnets
  • 150+ EC2 instances, 2 AZs, 1 Region
  • 70+ Elastic IPs
  • 200+ EBS volumes
  • S3 buckets
  • Suite of monitoring tools
  • 1 Puppet Server
  • Amazon CloudWatch 
  • Amazon CloudFront
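
The per-tier counts given in the solution steps can be tallied against the "150+ EC2 instances" figure. The sketch below only adds up numbers stated in the talk; treating the 16 RabbitMQ instances (one per sector) as the sector count is an inference, not something the talk states outright.

```python
# Instance counts as listed in the individual solution steps.
tiers = {
    "HAProxy": 16,
    "Web/App": 48,
    "RabbitMQ": 16,
    "Redis": 32,
    "BG processors": 32,
}

listed_total = sum(tiers.values())

# Inferred: one RabbitMQ instance per sector implies 16 sectors.
sectors = tiers["RabbitMQ"]
per_sector = {tier: n // sectors for tier, n in tiers.items()}

print(listed_total)
print(per_sector)
```

The listed tiers add up to 144 instances, so the remainder of the 150+ presumably covers supporting roles such as the Puppet server and monitoring tools.
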
Infrastructure Elasticity
  • Entire infra created 2 hrs before the promo
  • Infra torn down 2 hrs after the promo
  • ~30 mins to launch the infra in AWS
  • ~45 mins to tear down
  • Automated Failure detection/rectification
  • Automated Programs for Infra creation
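
The timings above translate into a simple schedule around each promo. The sketch below derives the create and tear-down times from a promo window; the promo date is hypothetical, the 6 hour duration is borrowed from the concurrency figures, and the function name is made up for illustration.

```python
from datetime import datetime, timedelta

LEAD = timedelta(hours=2)               # infra is created 2 hrs before the promo
LAG = timedelta(hours=2)                # and torn down 2 hrs after it
LAUNCH_TIME = timedelta(minutes=30)     # ~30 mins to launch the infra
TEARDOWN_TIME = timedelta(minutes=45)   # ~45 mins to tear it down

def infra_schedule(promo_start, promo_end):
    """Return the key timestamps for one promo's infrastructure lifecycle."""
    create_at = promo_start - LEAD
    teardown_at = promo_end + LAG
    return {
        "create_at": create_at,
        "ready_by": create_at + LAUNCH_TIME,
        "teardown_at": teardown_at,
        "gone_by": teardown_at + TEARDOWN_TIME,
    }

# Hypothetical promo window of 6 hours starting at 10:00.
start = datetime(2012, 4, 1, 10, 0)
plan = infra_schedule(start, start + timedelta(hours=6))
for name, when in plan.items():
    print(name, when)
```

With these assumptions the infrastructure exists for under 11 hours per promo, which is where the cost advantage of elasticity comes from.
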
Infrastructure Cost 
  • ~10K USD per promo
  • Not inclusive of data charges
  • Unthinkable savings
  • Visitor experience was good
  • More bookings per promo
  • Power of Elasticity is simply priceless, AWS is “AWSome”

1 comment:

Anand said...

Good one.
I have one question. How would you manage the RabbitMQ instances? Have you created a cluster? If yes, how do you distribute load among the cluster?

DISCLAIMER
All posts, comments and views expressed in this blog are my own and do not represent the positions or views of my past, present or future employers. The intention of this blog is to share my experience and views. Content is subject to change without notice. While I will do my best to credit the original authors or copyright owners wherever I reference them, if you find any of the content / images violating copyright, please let me know and I will act upon it immediately. Lastly, I encourage you to share the content of this blog with other online communities for non-commercial and educational purposes.