Cloud, Big Data and Mobile: The Art of Infrastructure Elasticity

"The Art of Infrastructure Elasticity" was presented by Harish Ganesan on April 28th at the Cloud Developer Conference 2012, Bangalore.

Agenda:

Problem
Challenges
Requirements
Solution Architecture
Q&A

What is the problem scenario?

Big Sales Promotion every quarter by the Enterprise
Massive online Concurrent Visitors
Limited processing capacity of the Booking Engine (~3k requests/sec)
Unhappy Visitors
Booking opportunities lost

Solution (Step 1):

Create a Queuing App in front of the Booking Engine
Efficiently queue the concurrent visitors

Solution (Step 2):

Moderate the flow of visitors waiting in the Queuing App and move them to the Booking Engine

What are the challenges?

Challenge 1: Concurrency
HTTP/AJAX/REST requests
Total: 500+ million requests in 6 hours
Average: 23K+ requests/sec
Peak: 80K+ requests/sec
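
As a quick check of the figures above, the average rate follows directly from the total and the promo duration: 500 million requests over 6 hours comes to roughly 23,148 requests/sec. The short sketch below reproduces that arithmetic and compares the peak against the average and against the booking engine's approximate capacity. The constant names are illustrative, not from the presentation.

```python
# Figures taken from the slides: 500+ million requests over a 6-hour promo.
TOTAL_REQUESTS = 500_000_000
PROMO_HOURS = 6
PEAK_RPS = 80_000             # peak request rate from the slides
BOOKING_ENGINE_RPS = 3_000    # approximate booking engine capacity

promo_seconds = PROMO_HOURS * 3600            # 21,600 seconds
average_rps = TOTAL_REQUESTS / promo_seconds  # requests per second

print(f"average: {average_rps:,.0f} requests/sec")
print(f"peak / average: {PEAK_RPS / average_rps:.1f}x")
print(f"peak / booking engine capacity: {PEAK_RPS / BOOKING_ENGINE_RPS:.1f}x")
```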

Challenge 2: Queue efficiency

Allot unique Queue Numbers for visitors
Queue Number allotment on a fair basis (as far as possible)
Reduce the wait time in Queue Number allotment process
Reduce overall Queue wait time for the visitor
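
One way to picture unique, fair allotment is a first-come, first-served counter: visitors are numbered in arrival order and a repeat arrival never receives a second number. This is a minimal single-process sketch under that assumption; the class and method names are hypothetical and it does not reproduce the presenter's actual implementation.

```python
from collections import deque
from itertools import count


class QueueNumberAllotter:
    """Hands out unique, increasing queue numbers in arrival (FIFO) order."""

    def __init__(self):
        self._arrivals = deque()      # visitor ids waiting for a number
        self._seen = set()            # every visitor id that has arrived
        self._next_number = count(1)  # yields 1, 2, 3, ...
        self.allotted = {}            # visitor id -> queue number

    def arrive(self, visitor_id):
        # Ignore repeat arrivals so a visitor never holds two numbers.
        if visitor_id not in self._seen:
            self._seen.add(visitor_id)
            self._arrivals.append(visitor_id)

    def allot_pending(self):
        # Drain arrivals oldest-first so earlier visitors get lower numbers.
        while self._arrivals:
            visitor_id = self._arrivals.popleft()
            self.allotted[visitor_id] = next(self._next_number)
        return self.allotted


allotter = QueueNumberAllotter()
for visitor in ["alice", "bob", "alice", "carol"]:
    allotter.arrive(visitor)
numbers = allotter.allot_pending()
print(numbers)
```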

Challenge 3: Load Volatility
Massive utilization during the promo and underutilization at other times

Challenge 4: IP Whitelisting

Booking engine and other 3rd party web services need EC2 IP whitelisting for security
Consecutive IP range needed for whitelisting

Challenge 5: Variety of OS / software in the tech stack

RedHat OS for Load Balancer, NoSQL and Queue Layer
Apache Tomcat (Java) for the Web/App Layer
CentOS for Processing Programs
MySQL for Result storage
Hadoop for Analytics

What are the requirements from the enterprise?
Elastic Infrastructure

Create the Infrastructure 2 hrs before the promo
Tear down infrastructure 2 hrs after the promo
Elastically expand the infra during the promo

Highly Scalable and Available
Log Analytics
Complete Infrastructure Automation

Solution Architecture
Option 1: Single Queue (initial thought)
Option 2: Parallel Queue (recommended)

Request types

Customer Visit is an HTTP request to the Queuing Application
Current Visitor Queue position is an AJAX call made every X seconds to the Queuing Application. The longer the wait, the more calls are made
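
The relationship between wait time and call volume can be estimated with simple arithmetic. The sketch below is an illustration only: the presentation does not give the value of X, so the 10-second interval, the 10-minute wait and the 100,000 waiting visitors are hypothetical inputs.

```python
def position_polls(wait_seconds, poll_interval_seconds):
    """Number of queue-position calls one visitor makes while waiting."""
    if poll_interval_seconds <= 0:
        raise ValueError("poll interval must be positive")
    return int(wait_seconds // poll_interval_seconds)


def polling_rps(waiting_visitors, poll_interval_seconds):
    """Steady-state AJAX requests/sec generated by all waiting visitors."""
    if poll_interval_seconds <= 0:
        raise ValueError("poll interval must be positive")
    return waiting_visitors / poll_interval_seconds


# Hypothetical inputs: X = 10 seconds, a 10-minute wait, 100,000 visitors waiting.
calls_per_visitor = position_polls(wait_seconds=600, poll_interval_seconds=10)
aggregate_rps = polling_rps(waiting_visitors=100_000, poll_interval_seconds=10)
print(calls_per_visitor, aggregate_rps)
```

Doubling X halves the polling load, at the cost of a less frequently refreshed position for the visitor.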

Solution Step 1: The Cloud?

Amazon Web Services
We had 4+ years of architecture experience on AWS
It satisfied many customer requirements and challenges in this use case

Solution Step 2: Route 53 / Network

Amazon VPC with Multi-AZ subnet configurations (HA)
Amazon Route 53 for Managed DNS
DNS round-robin (RR) algorithm at Route 53

Solution Step 3: Load Balancing

HAProxy vs Amazon ELB
Custom programs to Auto Scale HAProxy
Elastic HAProxy: instances are attached to / detached from Route 53
HAProxy IP whitelisting in 3rd party Gateway
16 HAProxy instances, 2 AZs, 2 subnets
Round-robin (RR) load balancing algorithm
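
Round-robin selection with attach and detach can be sketched in a few lines. This is a generic illustration of the technique, not HAProxy's or Route 53's implementation, and the `RoundRobinPool` class and backend names are hypothetical.

```python
class RoundRobinPool:
    """Cycles through attached backends in order, one per request."""

    def __init__(self, backends=()):
        self._backends = list(backends)
        self._index = 0

    def attach(self, backend):
        # Newly launched instances join the rotation.
        if backend not in self._backends:
            self._backends.append(backend)

    def detach(self, backend):
        # Failed or terminated instances leave the rotation.
        if backend in self._backends:
            self._backends.remove(backend)

    def next_backend(self):
        if not self._backends:
            raise RuntimeError("no backends attached")
        backend = self._backends[self._index % len(self._backends)]
        self._index += 1
        return backend


pool = RoundRobinPool(["haproxy-1", "haproxy-2", "haproxy-3"])
picks = [pool.next_backend() for _ in range(4)]
pool.detach("haproxy-2")
picks_after_detach = [pool.next_backend() for _ in range(4)]
print(picks, picks_after_detach)
```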

Solution Step 4: Web/App Servers

Web/App instances under every HAProxy
C1.Xlarge Instance Type for Web/App Instances
Custom programs to Auto Scale C1.Xlarge
Automatic Attach / Detach from HAProxy
Every Web/App instance has an EIP for IP whitelisting
48 Web/App EC2 instances spread across 2 AZs

Solution Step 5: Queue Servers

RabbitMQ vs Amazon SQS
FIFO/Concurrency/No Duplicate messages
1 RabbitMQ instance for queuing in every sector
M1.Large Instance Type
16 RabbitMQ Instances overall

Solution Step 6: Redis

Redis vs Amazon DynamoDB
Redis: NoSQL key-value (KV) data store
Visitors are shown their Current Queue position every X seconds from Redis
1 Redis Master-Slave instance for every sector
M1.Large Instance Type for Redis
32 Redis Instances overall
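
The presentation does not describe the key layout used in Redis, so the following is only a sketch of one plausible scheme: store each visitor's queue number, track the highest number already released to the booking engine, and derive the current position from the difference. A plain dictionary stands in for the key-value store so the example runs without a Redis server; all key names are hypothetical.

```python
class PositionStore:
    """In-memory stand-in for the key-value store (a dict instead of Redis)."""

    def __init__(self):
        # Highest queue number already released to the booking engine.
        self._kv = {"released_up_to": 0}

    def save_queue_number(self, visitor_id, queue_number):
        self._kv[f"visitor:{visitor_id}"] = queue_number

    def mark_released_up_to(self, queue_number):
        self._kv["released_up_to"] = queue_number

    def current_position(self, visitor_id):
        """1-based place in line; 0 means the visitor has been released."""
        queue_number = self._kv.get(f"visitor:{visitor_id}")
        if queue_number is None:
            return None  # unknown visitor
        return max(queue_number - self._kv["released_up_to"], 0)


store = PositionStore()
store.save_queue_number("alice", 5)
store.mark_released_up_to(3)
position_now = store.current_position("alice")    # numbers 4 and 5 still wait
store.mark_released_up_to(5)
position_later = store.current_position("alice")  # alice has been released
print(position_now, position_later)
```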

Solution Step 7: Processors

BG (background) Processors: Java programs that
RabbitMQ -> Redis: Allot Queue numbers to visitor requests and insert them into Redis
Redis -> Booking Engine: Moderate the movement of queued visitors from Redis to the Booking Engine
Process the Response Status / Booking Status / Inactive Visitors / Timeouts
2 BG Processor nodes per sector
CPU intensive: C1.Xlarge Instance Type
32 BG Processor Instances overall
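
The moderation step can be thought of as releasing visitors in bounded batches so the booking engine never receives more than it can handle. The original processors were Java programs; the Python sketch below only illustrates the idea, and the batch size of 3 is a small stand-in for a real per-interval limit derived from the ~3k requests/sec capacity.

```python
from collections import deque


def release_batch(waiting, capacity_per_tick):
    """Pop up to capacity_per_tick visitors from the front of the queue."""
    batch = []
    while waiting and len(batch) < capacity_per_tick:
        batch.append(waiting.popleft())
    return batch


# Ten queued visitors, identified here by their queue numbers.
waiting = deque(range(1, 11))
first_tick = release_batch(waiting, capacity_per_tick=3)
second_tick = release_batch(waiting, capacity_per_tick=3)
print(first_tick, second_tick, list(waiting))
```

Because visitors are always taken from the front, the release order matches the queue-number order.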

Scalability

New sectors containing the LB, Web, Queue, NoSQL and BG stack are created automatically depending on the load
Same AZ or multi-AZ can be specified for the creation
CloudWatch Custom parameters used
Automated Java Programs were used for the sector creation
No Manual intervention needed
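
A load-based scaling decision can be reduced to a ceiling division. The sketch below assumes a per-sector capacity of 5,000 requests/sec, a figure inferred here by dividing the 80K peak across the 16 sectors implied by the instance counts; it is not a number stated in the presentation, and the function names and headroom parameter are hypothetical.

```python
import math

# Assumption: 80K peak requests/sec spread over 16 sectors.
PER_SECTOR_RPS = 5_000


def sectors_needed(observed_rps, per_sector_rps=PER_SECTOR_RPS, headroom=0.2):
    """Sectors required to serve the observed load plus spare headroom."""
    if per_sector_rps <= 0:
        raise ValueError("per-sector capacity must be positive")
    target_rps = observed_rps * (1 + headroom)
    return max(1, math.ceil(target_rps / per_sector_rps))


def sectors_to_create(observed_rps, active_sectors, headroom=0.2):
    """How many new sectors to launch; never negative."""
    return max(0, sectors_needed(observed_rps, headroom=headroom) - active_sectors)


print(sectors_needed(80_000, headroom=0.0))
print(sectors_to_create(60_000, active_sectors=10, headroom=0.0))
```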

High Availability @ Instance Level

HA built @ Web/App, Redis and BG processor instances
Any failed / non-responsive EC2 instances are automatically detected and replaced by Java programs
No Manual intervention needed

High Availability @ Sector level

Any failed / non-responsive instances inside a sector are automatically detected and replaced by Java programs
If sector-3 fails, the other sectors remain active and can take requests
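
Sector-level failover amounts to routing only among sectors that currently pass a health check. The sketch below illustrates that with a static health map; how health is actually detected is not covered here, and the function and sector names are hypothetical.

```python
def healthy_sectors(sector_health):
    """Names of sectors currently marked healthy, in a stable order."""
    return sorted(name for name, is_healthy in sector_health.items() if is_healthy)


def route(request_number, sector_health):
    """Spread requests round-robin across healthy sectors only."""
    candidates = healthy_sectors(sector_health)
    if not candidates:
        raise RuntimeError("no healthy sectors available")
    return candidates[request_number % len(candidates)]


health = {"sector-1": True, "sector-2": True, "sector-3": False, "sector-4": True}
routes = [route(n, health) for n in range(6)]
print(routes)
```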

High Availability @ AZ level

If the entire AZ-2 fails, the load will be balanced to instances in AZ-1
Automated programs will create new sectors inside AZ-1 to handle the load

Log Analytics

Redis, Web/App, HAProxy and RabbitMQ logs synced to S3
Elastic MapReduce jobs to process / analyze the logs
Processed results moved to RDS MySQL for reports / visualizations

Monitoring

Nagios + Puppet (combined) for Auto scaled monitoring infra and deployment
CloudWatch Custom metrics / Tomcat Valve / Automated Java Programs for EC2

Backup

No backups -> only Syncs to S3
Golden AMI snapshots to S3
Periodic Sync of data between EC2 and S3
Periodic log Sync from Web/App to S3

Overall Infrastructure

Amazon Route53
Amazon VPC: public and private subnets
150+ EC2 instances, 2 AZs, 1 Region
70+ Elastic IPs
200+ EBS volumes
S3 buckets
Suite of monitoring tools
1 Puppet Server
Amazon CloudWatch
Amazon CloudFront

Infrastructure Elasticity

Entire Infra created 2 hrs before promo
Tear down infra 2 hrs after promo
~30 Mins to launch the infra in AWS
~45 Mins to tear down
Automated Failure detection/rectification
Automated Programs for Infra creation
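
The timings above translate into a simple schedule around each promo. The sketch below computes it for a hypothetical promo running from 10:00 to 16:00; the date, the function name and the dictionary keys are illustrative, while the 2-hour margins and the ~30 and ~45 minute durations come from the figures above.

```python
from datetime import datetime, timedelta

LEAD = timedelta(hours=2)                  # create before / tear down after
LAUNCH_DURATION = timedelta(minutes=30)    # ~30 mins to launch
TEARDOWN_DURATION = timedelta(minutes=45)  # ~45 mins to tear down


def promo_schedule(promo_start, promo_end):
    """Key timestamps for creating and tearing down the infrastructure."""
    create_start = promo_start - LEAD
    teardown_start = promo_end + LEAD
    return {
        "create_start": create_start,
        "infra_ready": create_start + LAUNCH_DURATION,
        "teardown_start": teardown_start,
        "teardown_done": teardown_start + TEARDOWN_DURATION,
    }


# Hypothetical 6-hour promo.
schedule = promo_schedule(datetime(2012, 6, 1, 10, 0), datetime(2012, 6, 1, 16, 0))
lifetime = schedule["teardown_done"] - schedule["create_start"]
for name, moment in schedule.items():
    print(f"{name}: {moment:%H:%M}")
print(f"infrastructure lifetime: {lifetime}")
```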

Infrastructure Cost

~10K USD per promo
Not inclusive of Data charges
Unthinkable Savings
Visitor experience was good
More Bookings per Promo
Power of Elasticity is simply priceless, AWS is “AWSome”

Sunday, April 29, 2012