Sunday, April 29, 2012

The Art of Infrastructure Elasticity

"The Art of Infrastructure Elasticity" was presented by Harish Ganesan on April 28th at Cloud Developer Conference 2012, Bangalore.


Agenda:

Problem
Challenges
Requirements
Solution Architecture
Q&A


What is the problem scenario?
  • Big Sales Promotion every quarter by the Enterprise
  • Massive number of concurrent online visitors
  • Limited processing capacity of the Booking Engine (~3k requests/sec)
  • Unhappy visitors
  • Booking opportunities lost
Solution (Step 1):
  • Create a Queuing App in front of the Booking Engine
  • Efficiently queue the concurrent visitors
Solution (Step 2):
  • Moderate the flow, moving visitors waiting in the Queuing App to the Booking Engine


What are the Challenges?


Challenge 1: Concurrency
HTTP/AJAX/REST requests
Total: 500+ million requests in 6 hours
Average: 23k+ requests/sec
Peak: 80k+ requests/sec
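As a quick sanity check on these figures, the arithmetic can be worked through in a few lines. This is only a sketch in Python (the talk's own tooling was Java); the variable names are illustrative, and the ~3k requests/sec Booking Engine capacity is taken from the problem scenario above.

```python
# Figures quoted in the talk. Each is a lower bound, since it is stated with a "+".
total_requests = 500_000_000      # 500+ million requests
window_seconds = 6 * 60 * 60      # 6 hours
peak_rps = 80_000                 # 80k+ requests/sec
booking_engine_rps = 3_000        # ~3k requests/sec

average_rps = total_requests / window_seconds
peak_to_average = peak_rps / average_rps
peak_to_engine = peak_rps / booking_engine_rps

print(f"average: {average_rps:,.0f} requests/sec")
print(f"peak is {peak_to_average:.1f}x the average")
print(f"peak is {peak_to_engine:.1f}x the Booking Engine capacity")
```

The average works out to about 23,148 requests/sec, consistent with the quoted 23k+. The peak is roughly 3.5 times the average and about 27 times what the Booking Engine can absorb, which is why a queue has to sit in front of it.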


Challenge 2: Queue efficiency
  • Allot unique Queue Numbers to visitors
  • Allot Queue Numbers on a fair basis (as much as possible)
  • Reduce the wait time in the Queue Number allotment process
  • Reduce the overall Queue wait time for the visitor
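One way to picture the "unique and fair" requirement is a counter that hands out strictly increasing numbers in arrival order. The sketch below is a single-process stand-in written in Python; it is not the talk's implementation, which used background Java programs between RabbitMQ and Redis. The class and function names are hypothetical.

```python
import threading


class QueueNumberAllocator:
    """Hands out unique, strictly increasing queue numbers."""

    def __init__(self, start=1):
        self._next = start
        self._lock = threading.Lock()

    def allot(self):
        # The lock makes read-and-increment atomic, so no number is issued twice.
        with self._lock:
            number = self._next
            self._next += 1
            return number


allocator = QueueNumberAllocator()
issued = []
issued_lock = threading.Lock()


def visitor_arrives():
    number = allocator.allot()
    with issued_lock:
        issued.append(number)


# Simulate 200 visitors arriving concurrently.
threads = [threading.Thread(target=visitor_arrives) for _ in range(200)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(issued), len(set(issued)))
```

Every visitor gets a distinct number and nobody can jump ahead of a number that was already issued. With several parallel queues, fairness across queues becomes approximate, which matches the "as much as possible" wording.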
Challenge 3: Load Volatility
Massive utilization during the promo and underutilization at other times


Challenge 4: IP Whitelisting
  • Booking Engine and other 3rd party web services need EC2 IP whitelisting for security
  • Consecutive IP range needed for whitelisting
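To illustrate what "consecutive IP range" means in practice, here is a small Python sketch that checks whether a set of addresses forms one gap-free run and formats it as a range. The function names are hypothetical, and the addresses come from the reserved documentation blocks, not from the actual deployment.

```python
import ipaddress


def is_consecutive(addresses):
    """Return True if the addresses form one gap-free run with no duplicates."""
    values = sorted(int(ipaddress.ip_address(a)) for a in addresses)
    return all(b - a == 1 for a, b in zip(values, values[1:]))


def as_range(addresses):
    """Format a non-empty list of addresses as 'first-last'."""
    ips = sorted(ipaddress.ip_address(a) for a in addresses)
    return f"{ips[0]}-{ips[-1]}"


elastic_ips = ["203.0.113.12", "203.0.113.10", "203.0.113.11"]
scattered_ips = ["203.0.113.10", "203.0.113.12", "198.51.100.7"]

print(is_consecutive(elastic_ips), as_range(elastic_ips))
print(is_consecutive(scattered_ips))
```

A third party can whitelist the first list as a single range, while the second list would need one rule per address.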
Challenge 5: Variety of OS / software in the tech stack
  • RedHat OS for Load Balancer, NoSQL and Queue Layer
  • Apache Tomcat for the Java Web/App Layer
  • CentOS for Processing Programs
  • MySQL for Result storage
  • Hadoop for Analytics
What are the requirements from the enterprise?
Elastic Infrastructure 
  • Create the Infrastructure 2 hrs before the promo
  • Tear down infrastructure 2 hrs after the promo
  • Elastically expand the infra during the promo
Highly Scalable and Available
Log Analytics
Complete Infrastructure Automation


Solution Architecture
Option 1: Single Queue (initial thought)
Option 2: Parallel Queue (recommended)


Request types
  • A Customer Visit is an HTTP request to the Queuing Application
  • The Current Visitor Queue position is an AJAX call to the Queuing Application every X seconds, so a longer wait means more calls
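The "more wait, more calls" relationship can be made concrete with a rough estimate. The talk does not state the value of X or the number of waiting visitors, so the numbers below are purely hypothetical, as are the function names. Python is used for brevity.

```python
def polling_rps(waiting_visitors, poll_interval_seconds):
    """Steady-state position-poll requests per second from visitors in the queue."""
    if poll_interval_seconds <= 0:
        raise ValueError("poll interval must be positive")
    return waiting_visitors / poll_interval_seconds


def polls_per_visitor(wait_seconds, poll_interval_seconds):
    """How many position polls one visitor makes while waiting."""
    return wait_seconds // poll_interval_seconds


# Hypothetical values: 100,000 visitors waiting, polling every 5 seconds.
load = polling_rps(100_000, 5)
calls_short_wait = polls_per_visitor(60, 5)     # 1 minute in the queue
calls_long_wait = polls_per_visitor(600, 5)     # 10 minutes in the queue

print(load, calls_short_wait, calls_long_wait)
```

Under these assumed values, polling alone generates 20,000 requests/sec, and a visitor who waits ten times longer makes ten times as many calls. Raising X lowers the load at the cost of a less responsive position display.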
Solution Step 1: The Cloud?
  • Amazon Web Services
  • We had 4+ years of architecture experience on AWS
  • It satisfied many of the customer requirements and challenges in this use case
Solution Step 2: Route 53 / Network
  • Amazon VPC with Multi-AZ subnet configurations (HA)
  • Amazon Route 53 for Managed DNS
  • DNS round-robin (RR) algorithm at Route 53
Solution Step 3: Load Balancing
  • HAProxy vs Amazon ELB
  • Custom programs to Auto Scale HAProxy
  • Elastic HAProxy -> Attach / Detach from Route 53
  • HAProxy IP whitelisting in 3rd party Gateway
  • 16 HAProxy instances, 2 AZs, 2 subnets
  • RR Load Balancing algorithm
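Round robin is used at two levels here: DNS at Route 53 spreads visitors over the HAProxy instances, and each HAProxy spreads requests over its Web/App instances. The Python sketch below shows the idealized algorithm only; the class and node names are hypothetical, and real DNS round robin is less even because resolvers cache answers.

```python
from collections import Counter
from itertools import cycle


class RoundRobin:
    """Cycles through a fixed list of backends in order."""

    def __init__(self, backends):
        if not backends:
            raise ValueError("at least one backend is required")
        self._cycle = cycle(list(backends))

    def next_backend(self):
        return next(self._cycle)


# 16 HAProxy instances, as in the talk; the names are made up.
haproxy_nodes = [f"haproxy-{i}" for i in range(1, 17)]
balancer = RoundRobin(haproxy_nodes)

hits = Counter(balancer.next_backend() for _ in range(160))
print(hits["haproxy-1"], hits["haproxy-16"])
```

Each node receives exactly the same share, which is what makes the per-sector capacity predictable when a new HAProxy is attached to or detached from Route 53.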
Solution Step 4: Web/App Servers
  • Web/App instances under every HAProxy
  • C1.Xlarge Instance Type for Web/App instances
  • Custom programs to Auto Scale C1.Xlarge
  • Automatic Attach / Detach from HAProxy
  • Every Web/App instance with an EIP for IP whitelisting
  • 48 Web/App EC2 instances spread across 2 AZs
Solution Step 5: Queue Servers
  • RabbitMQ vs Amazon SQS
  • FIFO / Concurrency / No Duplicate messages
  • 1 RabbitMQ instance for queuing in every sector
  • M1.Large Instance Type
  • 16 RabbitMQ instances overall
Solution Step 6: Redis
  • Redis vs Amazon DynamoDB
  • Redis: NoSQL KV data store
  • Visitors are shown their Current Queue position every X seconds from Redis
  • 1 Redis Master-Slave pair for every sector
  • M1.Large Instance Type for Redis
  • 32 Redis instances overall
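A position lookup of this kind can be sketched with a plain key-value mapping. The Python class below is an in-memory stand-in, not Redis and not the talk's code. It assumes visitors are admitted strictly in queue-number order, which ignores the timeouts and inactive visitors a real system has to handle. All names are hypothetical.

```python
class PositionStore:
    """In-memory stand-in for a per-sector key-value store."""

    def __init__(self):
        self._queue_number = {}    # visitor_id -> allotted queue number
        self._last_admitted = 0    # highest queue number already moved on

    def register(self, visitor_id, queue_number):
        self._queue_number[visitor_id] = queue_number

    def mark_admitted(self, queue_number):
        self._last_admitted = max(self._last_admitted, queue_number)

    def position(self, visitor_id):
        """Place in line (1 = next to be admitted); 0 means already admitted."""
        number = self._queue_number[visitor_id]
        return max(0, number - self._last_admitted)


store = PositionStore()
store.register("visitor-a", 120)
store.mark_admitted(100)
before = store.position("visitor-a")

store.mark_admitted(120)
after = store.position("visitor-a")
print(before, after)
```

Each poll is a single key read plus one subtraction, which is the kind of access pattern a key-value store serves cheaply even at tens of thousands of requests per second.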
Solution Step 7: Processors
  • BG Processors: Java programs that
      • RabbitMQ -> Redis: Allot Queue Numbers to visitor requests and insert them into Redis
      • Redis -> Booking Engine: Moderate the movement of queued visitors from Redis to the Booking Engine
      • Process the Response Status / Booking Status / Inactive Visitors / Timeouts
  • 2 BG Processor nodes per sector
  • CPU intensive: C1.Xlarge Instance Type
  • 32 BG Processor instances overall
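The "moderate the movement" step amounts to releasing visitors from the head of the queue no faster than the Booking Engine can take them. The Python sketch below assumes one tick equals one second and uses the ~3k requests/sec capacity quoted earlier; the function name and the 10,000-visitor backlog are hypothetical, and the original processors were Java programs.

```python
from collections import deque


def release_batch(waiting, capacity_per_tick):
    """Pop up to capacity_per_tick visitors from the head of the queue, in order."""
    batch = []
    while waiting and len(batch) < capacity_per_tick:
        batch.append(waiting.popleft())
    return batch


BOOKING_ENGINE_RPS = 3_000              # capacity quoted in the talk
waiting = deque(range(1, 10_001))       # hypothetical backlog: queue numbers 1..10000

ticks = []
while waiting:
    # A real processor would wait for the next second here and
    # share the budget across all sectors.
    ticks.append(release_batch(waiting, BOOKING_ENGINE_RPS))

print(len(ticks), [len(batch) for batch in ticks])
```

The backlog drains in four ticks of 3000, 3000, 3000 and 1000 visitors, always lowest queue number first, so the Booking Engine never sees more than its rated load.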
Scalability
  • New sectors containing the LB, Web, Queue, NoSQL and BG stack are created automatically depending on the load
  • Same AZ or multi-AZ can be specified for the creation
  • CloudWatch custom metrics used
  • Automated Java programs were used for the sector creation
  • No manual intervention needed
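The scale-out decision can be reduced to a small calculation: how many sectors does the observed load require, and how many are missing. The sketch below is in Python rather than the Java the talk used. The per-sector capacity of 5,000 requests/sec is an assumption (it is simply the 80k peak divided by 16 sectors), and the function names are hypothetical.

```python
import math


def sectors_needed(observed_rps, per_sector_rps, min_sectors=1):
    """Smallest number of sectors that can carry the observed load."""
    if per_sector_rps <= 0:
        raise ValueError("per-sector capacity must be positive")
    return max(min_sectors, math.ceil(observed_rps / per_sector_rps))


def sectors_to_create(observed_rps, per_sector_rps, running_sectors):
    """How many new sectors to launch; never negative."""
    return max(0, sectors_needed(observed_rps, per_sector_rps) - running_sectors)


PER_SECTOR_RPS = 5_000   # assumed capacity, not a figure from the talk

at_peak = sectors_to_create(80_000, PER_SECTOR_RPS, running_sectors=10)
at_average = sectors_to_create(23_000, PER_SECTOR_RPS, running_sectors=10)
print(at_peak, at_average)
```

With ten sectors running, the peak load calls for six more while the average load calls for none. In a real deployment the observed rate would come from a monitoring metric and the result would drive the sector-creation programs.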
High Availability @ Instance Level
  • HA built @ Web/App, Redis and BG Processor instances
  • Any failed / non-responsive EC2 instances are automatically detected and replaced by Java programs
  • No manual intervention needed
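The detect-and-replace loop can be sketched as a reconciliation pass over the fleet. In this Python sketch the health probe and the launch call are injected as plain functions, so nothing here is a real AWS API; the instance names are made up and the talk's own programs were written in Java.

```python
def reconcile(instances, is_healthy, launch_replacement):
    """Return the fleet after replacing every instance that fails its health probe."""
    fleet = []
    for instance_id in instances:
        if is_healthy(instance_id):
            fleet.append(instance_id)
        else:
            # A real program would also terminate the failed instance
            # and re-attach the replacement to its load balancer.
            fleet.append(launch_replacement(instance_id))
    return fleet


# Hypothetical fleet in which web-2 has stopped responding.
down = {"web-2"}
new_ids = iter(range(100, 200))

fleet = reconcile(
    ["web-1", "web-2", "web-3"],
    is_healthy=lambda instance_id: instance_id not in down,
    launch_replacement=lambda instance_id: f"web-{next(new_ids)}",
)
print(fleet)
```

Running such a pass on a schedule keeps the fleet at its intended size without anyone being paged.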
High Availability @ Sector level
  • Any failed / non-responsive instances inside a sector are automatically detected and replaced by Java programs
  • If sector-3 fails, the other sectors remain active and can take requests
High Availability @ AZ level
  • If the entire AZ-2 fails, the load is balanced to instances in AZ-1
  • Automated programs create new sectors inside AZ-1 to handle the load
Log Analytics
  • Redis, Web/App, HAProxy and RabbitMQ logs synced to S3
  • Elastic MapReduce jobs to process / analyze the logs
  • Processed results moved to RDS MySQL for reports / visualizations
Monitoring
  • Nagios + Puppet (combined) for auto-scaled monitoring infra and deployment
  • CloudWatch custom metrics / Tomcat Valve / automated Java programs for EC2
Backup
  • No backups -> only syncs to S3
  • Golden AMI snapshots to S3
  • Periodic sync of data between EC2 and S3
  • Periodic log sync from Web/App to S3
Overall Infrastructure
  • Amazon Route 53
  • Amazon VPC: public and private subnets
  • 150+ EC2 instances, 2 AZs, 1 Region
  • 70+ Elastic IPs
  • 200+ EBS volumes
  • S3 buckets
  • Suite of monitoring tools
  • 1 Puppet Server
  • Amazon CloudWatch
  • Amazon CloudFront
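The per-layer instance counts given in the solution steps can be tallied against the 150+ total. In the Python sketch below, the figure of 16 sectors is inferred from "1 RabbitMQ instance per sector, 16 overall", and the even split of the other layers across sectors is an assumption, not something the talk states.

```python
# Instance counts per layer, as quoted in the solution steps.
per_layer = {
    "HAProxy": 16,
    "Web/App": 48,
    "RabbitMQ": 16,
    "Redis": 32,
    "BG Processor": 32,
}

# Inferred: one RabbitMQ instance per sector means 16 sectors.
sectors = per_layer["RabbitMQ"]

# Assumed: every layer is spread evenly across the sectors.
per_sector = {layer: count // sectors for layer, count in per_layer.items()}

total = sum(per_layer.values())
print(sectors, per_sector, total)
```

That gives 9 instances per sector (1 HAProxy, 3 Web/App, 1 RabbitMQ, 2 Redis, 2 BG Processors) and 144 in total. The rest of the 150+ is presumably made up of instances the talk does not itemise, such as the monitoring tools and the Puppet server.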
Infrastructure Elasticity
  • Entire Infra created 2 hrs before promo
  • Tear down infra 2 hrs after promo
  • ~30 Mins to launch the infra in AWS
  • ~45 Mins to tear down 
  • Automated Failure detection/rectification
  • Automated Programs for Infra creation
Infrastructure Cost 
  • ~10K USD per promo
  • Not inclusive of Data charges
  • Unthinkable Savings 
  • Visitor experience was good
  • More Bookings per Promo
  • The power of elasticity is simply priceless; AWS is “AWSome”

1 comment:

  1. Good one.
    I have one question: how would you manage the RabbitMQ instances? Have you created a cluster? If yes, how do you distribute load across the cluster?
