Agenda:
Problem
Challenges
Requirements
Solution Architecture
Q&A
What is the problem scenario?
- Big Sales Promotion run by the Enterprise every quarter
- Massive numbers of concurrent online Visitors
- Limited processing capacity of the Booking Engine (~3k requests/sec)
- Unhappy Visitors
- Booking opportunities lost
- Create a Queuing App in front of the Booking Engine
- Efficiently queue the concurrent visitors
- Moderate and move the visitors waiting in the Queuing App to the Booking Engine
What are the Challenges?
Challenge 1: Concurrency
HTTP/AJAX/REST requests
Total: 500+ million requests in 6 hours
Average: 23k+ requests/sec
Peak: 80k+ requests/sec
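The average figure follows directly from the total. A quick check, using only the numbers on this slide plus the ~3k requests/sec Booking Engine limit quoted earlier (the ratio is only a rough comparison, since traffic to the Queuing App includes position polls as well as booking attempts):

```python
# Back-of-the-envelope check of the traffic figures quoted above.
TOTAL_REQUESTS = 500_000_000   # "500+ million requests"
PROMO_HOURS = 6                # "in 6 hours"
PEAK_RPS = 80_000              # "80k+ requests/sec"
BOOKING_ENGINE_RPS = 3_000     # "~3k requests/sec"

average_rps = TOTAL_REQUESTS / (PROMO_HOURS * 3600)
peak_to_engine_ratio = PEAK_RPS / BOOKING_ENGINE_RPS

print(f"average: {average_rps:,.0f} requests/sec")             # average: 23,148 requests/sec
print(f"peak is {peak_to_engine_ratio:.1f}x engine capacity")  # peak is 26.7x engine capacity
```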
Challenge 2: Queue efficiency
- Allot unique Queue Numbers to visitors
- Allot Queue Numbers on a fair basis (as far as possible)
- Reduce the wait time in the Queue Number allotment process
- Reduce the overall Queue wait time for the visitor
Challenge 3: Massive utilization during the promo and under-utilization at other times
Challenge 4: IP Whitelisting
- The Booking Engine and other 3rd-party web services need EC2 IP whitelisting for security
- A consecutive IP range is needed for whitelisting
Requirements
- RedHat OS for the Load Balancer, NoSQL and Queue Layers
- Apache Tomcat for the Java Web/App Layer
- CentOS for Processing Programs
- MySQL for Result storage
- Hadoop for Analytics
Elastic Infrastructure
- Create the Infrastructure 2 hrs before the promo
- Tear down infrastructure 2 hrs after the promo
- Elastically expand the infra during the promo
Log Analytics
Complete Infrastructure Automation
Solution Architecture
Option 1: Single Queue (initial thought)
Option 2: Parallel Queue (recommended)
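With parallel queues, every queue has to hand out Queue Numbers that stay unique across all the others. The slides do not say how this was done. One common approach, shown here purely as an assumption, is striping: queue i of n issues i, i + n, i + 2n, and so on, so no coordination between queues is needed. The class name `StripedAllocator` and its parameters are hypothetical.

```python
from itertools import count


class StripedAllocator:
    """Issues queue numbers for one parallel queue (hypothetical helper).

    Queue i of n issues i, i + n, i + 2n, ... so numbers never collide
    across queues. This is an assumed scheme, not the one from the slides.
    """

    def __init__(self, queue_index: int, queue_count: int):
        if not 0 <= queue_index < queue_count:
            raise ValueError("queue_index must be in [0, queue_count)")
        self._numbers = count(start=queue_index, step=queue_count)

    def next_number(self) -> int:
        return next(self._numbers)


QUEUE_COUNT = 16  # matches the 16 RabbitMQ instances mentioned later
allocators = [StripedAllocator(i, QUEUE_COUNT) for i in range(QUEUE_COUNT)]

# Three visitors arrive at every queue; all 48 numbers are distinct.
issued = [a.next_number() for a in allocators for _ in range(3)]
print(len(issued), len(set(issued)))  # 48 48
```

Striping keeps numbers unique but is only approximately fair: a visitor on a busy queue can get a higher number than a later visitor on a quiet one, which is consistent with the "as far as possible" fairness goal.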
Request types
- A Customer Visit is an HTTP request to the Queuing Application
- The visitor's current Queue position is an AJAX call to the Queuing Application every X seconds: the longer the wait, the more calls
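The "longer wait, more calls" relationship can be made concrete with a small estimate. The wait time and polling interval used below are illustrative values only; the slides leave X unspecified.

```python
import math


def position_polls(wait_seconds: float, poll_interval_seconds: float) -> int:
    """Number of queue-position AJAX calls one visitor makes while waiting."""
    if poll_interval_seconds <= 0:
        raise ValueError("poll interval must be positive")
    return math.floor(wait_seconds / poll_interval_seconds)


def requests_per_visitor(wait_seconds: float, poll_interval_seconds: float) -> int:
    """One initial HTTP visit plus the periodic position polls."""
    return 1 + position_polls(wait_seconds, poll_interval_seconds)


# Hypothetical example: a 10-minute wait, polling every 10 seconds.
print(requests_per_visitor(600, 10))  # 61
```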
- Amazon Web Services
- We had 4+ years of architecture experience on AWS
- It satisfied many customer requirements and challenges in this use case
- Amazon VPC with Multi-AZ subnet configurations (HA)
- Amazon Route 53 for Managed DNS
- DNS round-robin (RR) algorithm at Route 53
- HAProxy vs Amazon ELB
- Custom programs to Auto Scale HAProxy
- Elastic HAProxy: attach to / detach from Route 53
- HAProxy IP whitelisting at the 3rd-party Gateway
- 16 HAProxy Instances, 2 AZs, 2 Subnets
- RR Load Balancing algorithm
- Web/App instances under every HAProxy
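Round-robin is used twice above: at Route 53 across the HAProxy instances, and at each HAProxy across its Web/App instances. A minimal sketch of the selection logic follows; the backend names are hypothetical, and this illustrates the algorithm rather than how HAProxy or Route 53 implement it internally.

```python
from collections import Counter
from itertools import cycle


class RoundRobin:
    """Cycles through a fixed list of backends in order."""

    def __init__(self, backends):
        self._backends = list(backends)
        if not self._backends:
            raise ValueError("at least one backend is required")
        self._cycle = cycle(self._backends)

    def pick(self) -> str:
        return next(self._cycle)


# Hypothetical Web/App instances behind one HAProxy.
balancer = RoundRobin(["web-1", "web-2", "web-3"])
picks = [balancer.pick() for _ in range(6)]
print(picks)           # ['web-1', 'web-2', 'web-3', 'web-1', 'web-2', 'web-3']
print(Counter(picks))  # each backend receives 2 of the 6 requests
```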
- c1.xlarge Instance Type for Web/App Instances
- Custom programs to Auto Scale the c1.xlarge instances
- Automatic attach to / detach from HAProxy
- Every Web/App Instance has an EIP for IP whitelisting
- 48 Web/App EC2 Instances spread across 2 AZs
- RabbitMQ vs Amazon SQS
- FIFO/Concurrency/No Duplicate messages
- 1 RabbitMQ instance for queuing in every sector
- m1.large Instance Type
- 16 RabbitMQ Instances overall
- Redis vs Amazon DynamoDB
- Redis: NoSQL key-value (KV) data store
- Visitors are shown their current Queue position every X seconds from Redis
- 1 Redis Master-Slave pair for every sector
- m1.large Instance Type for Redis
- 32 Redis Instances overall
- BG (background) Processors: Java programs that
- RabbitMQ -> Redis: allot Queue Numbers to visitor requests and insert them into Redis
- Redis -> Booking Engine: moderate the movement of queued visitors from Redis to the Booking Engine
- Process the response status / booking status / inactive visitors / timeouts
- 2 BG Processor nodes per sector
- CPU intensive: c1.xlarge Instance Type
- 32 BG Processor Instances overall
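The "moderate the movement" step is essentially rate-limited admission: on each tick, release only as many waiting visitors as the Booking Engine can absorb, in Queue Number order. The sketch below uses an in-memory deque as a stand-in for Redis, and the per-tick capacity is a made-up value. The original programs were written in Java; Python is used here only for brevity.

```python
from collections import deque


def admit_batch(waiting: deque, capacity_per_tick: int) -> list:
    """Pop up to capacity_per_tick visitors from the front of the queue.

    `waiting` holds queue numbers in ascending order (a stand-in for the
    Redis store); lower numbers are admitted first.
    """
    if capacity_per_tick < 0:
        raise ValueError("capacity must be non-negative")
    batch = []
    while waiting and len(batch) < capacity_per_tick:
        batch.append(waiting.popleft())
    return batch


waiting = deque(range(1, 11))     # visitors holding queue numbers 1..10
first = admit_batch(waiting, 3)   # hypothetical capacity: 3 per tick
second = admit_batch(waiting, 3)
print(first, second, list(waiting))  # [1, 2, 3] [4, 5, 6] [7, 8, 9, 10]
```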
- New sectors, each containing the LB, Web, Queue, NoSQL and BG stack, are created automatically depending on the load
- The same AZ or multiple AZs can be specified for the creation
- CloudWatch custom metrics are used
- Automated Java Programs were used for the sector creation
- No Manual intervention needed
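The scaling decision behind automatic sector creation can be sketched as a simple capacity calculation. The per-sector capacity of 5,000 requests/sec is an illustrative assumption (roughly the 80k peak divided by 16 sectors), not a figure from the slides, and the original automation was written in Java rather than Python.

```python
import math


def sectors_needed(load_rps: float, sector_capacity_rps: float, minimum: int = 1) -> int:
    """Number of sectors required to serve load_rps, never below `minimum`."""
    if sector_capacity_rps <= 0:
        raise ValueError("sector capacity must be positive")
    return max(minimum, math.ceil(load_rps / sector_capacity_rps))


SECTOR_CAPACITY_RPS = 5_000  # assumed value, for illustration only

print(sectors_needed(23_000, SECTOR_CAPACITY_RPS))  # 5
print(sectors_needed(80_000, SECTOR_CAPACITY_RPS))  # 16
```

In practice the load figure would come from a monitored metric, and some headroom would normally be added before new sectors are launched.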
- HA built into the Web/App, Redis and BG Processor instances
- Any failed or non-responsive EC2 instance is automatically detected and replaced by Java programs
- No Manual intervention needed
- Any failed or non-responsive instance inside a sector is automatically detected and replaced by Java programs
- If sector 3 fails, the other sectors remain active and keep taking requests
- If the entire AZ-2 fails, the load is balanced across the instances in AZ-1
- Automated programs will create new sectors inside AZ-1 to handle the load
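The detect-and-replace behaviour, including the AZ failure case, can be sketched as a planning function: unhealthy instances are replaced in place, and instances in a failed AZ are recreated in a surviving one. The `Instance` type, its field names and the instance IDs are all hypothetical. The real programs were Java and would call the EC2 APIs, which are omitted here.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    instance_id: str
    az: str
    healthy: bool


def plan_replacements(instances, failed_azs=frozenset()):
    """Return (instance_id, target_az) pairs for every instance to replace."""
    surviving_azs = sorted({i.az for i in instances} - set(failed_azs))
    if not surviving_azs:
        raise RuntimeError("no surviving AZ to launch replacements in")
    plan = []
    for inst in instances:
        az_down = inst.az in failed_azs
        if inst.healthy and not az_down:
            continue  # nothing to do for this instance
        # Stay in the same AZ unless that AZ itself has failed.
        target = surviving_azs[0] if az_down else inst.az
        plan.append((inst.instance_id, target))
    return plan


fleet = [
    Instance("i-a", "az-1", True),
    Instance("i-b", "az-1", False),
    Instance("i-c", "az-2", True),
]
print(plan_replacements(fleet))            # [('i-b', 'az-1')]
print(plan_replacements(fleet, {"az-2"}))  # [('i-b', 'az-1'), ('i-c', 'az-1')]
```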
- Redis, Web/App, HAProxy and RabbitMQ logs synced to S3
- Elastic MapReduce jobs to process and analyze the logs
- Processed results moved to RDS MySQL for reports and visualizations
- Nagios + Puppet (combined) for monitoring and deployment of the auto-scaled infra
- CloudWatch custom metrics / Tomcat Valve / automated Java programs for EC2
- No backups -> only syncs to S3
- Golden AMI snapshots stored in S3
- Periodic Sync of data between EC2 and S3
- Periodic log Sync between Web/App to S3
- Amazon Route53
- Amazon VPC: public and private subnets
- 150+ EC2 instances, 2 AZs, 1 Region
- 70+ Elastic IPs
- 200+ EBS volumes
- S3 buckets
- Suite of monitoring tools
- 1 Puppet Server
- Amazon CloudWatch
- Amazon CloudFront
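The "150+ EC2 instances" figure can be cross-checked against the per-tier counts given earlier. The tally below uses only those counts. The 16-sector figure is inferred from "1 RabbitMQ instance per sector" and "16 RabbitMQ instances overall", and the per-sector breakdown is derived by division rather than stated in the slides.

```python
# Per-tier instance counts as quoted in the architecture slides.
instances_per_tier = {
    "HAProxy": 16,
    "Web/App": 48,
    "RabbitMQ": 16,
    "Redis": 32,
    "BG Processor": 32,
}
SECTORS = 16  # inferred: 16 RabbitMQ instances at 1 per sector

total_listed = sum(instances_per_tier.values())
per_sector = {tier: n // SECTORS for tier, n in instances_per_tier.items()}

print(total_listed)  # 144
print(per_sector)    # {'HAProxy': 1, 'Web/App': 3, 'RabbitMQ': 1, 'Redis': 2, 'BG Processor': 2}
```

The remaining instances up to 150+ would be supporting nodes such as the Puppet server and the monitoring tools.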
- Entire Infra created 2 hrs before promo
- Tear down infra 2 hrs after promo
- ~30 mins to launch the infra in AWS
- ~45 mins to tear it down
- Automated Failure detection/rectification
- Automated Programs for Infra creation
- ~10K USD per promo
- Not inclusive of Data charges
- Unthinkable Savings
- Visitor experience was good
- More Bookings per Promo
- The power of elasticity is simply priceless; AWS is “AWSome”
1 comment:
Good one.
I have one question. How do you manage the RabbitMQ instances? Have you created a cluster? If yes, how do you distribute the load across the cluster?