This case was presented @ Cloud Connect 2013 . You can view this in Slide share as well.
To know more about Amazon EMR + Spot refer the following articles. They will set the technical context right before you go through this presentation. Posts are listed below:
Post 1: How elastic thinking saves cost in Amazon EMR Clusters ?
Post 2: Add Spot Instances with Amazon EMR
Mobile advertising company based in USA. They have Forbes 1000 clientele using their mobile advertising product. The company approached us for architecture and implementation help for their Clicks, Ad and other analytic's hosting and processing on AWS.
The LOCK ?
Variety of unstructured logs and semi structured files have to be processed for data analysis. It includes logs from CDN, server logs, XML Files, Text files, Geo Data Files and Structured DB records. The logs flow every hour from variety of sources into AWS. It amounts to few hundred GB's -> ~ 1 TB @ peak hours/seasons.
The STOCK ?
Mobile Ad company wanted an efficient architecture and infrastructure on AWS for collecting, storing, processing and share the data. They wanted to avoid cost leakages because of bad architectural practices in AWS and save $$$ wherever possible.
Their analysis patterns can be categorized into Hourly, Monthly and Historical. Some of the major challenge were:
- How do we Transfer, Store, Analyze and Share ?
- How to optimize costs at this scale ?
- Architect their entire front end analysis module in AWS. Use the AWS technologies like ELB, EC2, R53, RDS ,Cache for delivering their analysis results to online users.
- Use Amazon Elastic MapReduce with Spot EC2 instances for their back-end processing jobs.
- Automate their infrastructure using Chef, Scripts and Custom Java Modules wherever applicable.
First time data was transferred using AWS Import/Export. On going data was transferred to AWS using Tsunami UDP. High Bandwidth EC2's were installed with Tsunami UDP on AWS side for receiving data faster. The Data collected is stored temporarily in the Receiver EC2's , some pre-processing tasks were run on them and files are moved to S3 Buckets post that.
For more details about using Tsunami UDP and Aspera on AWS refer these articles: Tsunami UDP on AWS and Aspera on AWS
Some of the other popular models are :
AWS Direct Connect: Establishes a dedicated connection between Data centers to AWS using Direct Connect. This model is little costlier and is suitable only enterprises.
WAN Optimization: Use WAN optimization tools like Riverbed Steel head, aryaka etc to speed up the transfer between source endpoints to AWS.
Stage 2: Storage of Data:
Temporarily data was stored on Receiver EC2's. After Pre-processing tasks they were moved to S3. Amazon S3 is a default choice because of its inherent fault tolerance and scalability features. Around ~2 TB of compressed logs are stored in Amazon S3 Daily for processing. Amazon S3 Reduced Redundancy Storage option was used for storing intermediate log outputs. S3 Automatic Object Expiry was used for efficiency and cost savings. Archival data is moved to Amazon Glacier periodically.
Stage 3: Analysis using Amazon EMR + Spot EC2
Amazon EMR was used for log processing and analysis. Amazon Elastic MapReduce (EMR) is a web service that helps customers with big data processing using Hadoop framework on EC2 and S3. Amazon Elastic MapReduce lets customers focus on crunching data instead of worrying about time-consuming set-up, management or tuning of Hadoop clusters or the EC2 capacity upon which they operate. This in built automation provided by AWS already saves huge labor cost for the customers. At peak hours some jobs run with 2000 Mappers/ 750 Reducers. For peak periods ~250 m1.xlarge or equivalent EC2 capacity was used for processing the logs. We developed a Custom EMR manager to introduce Spot EC2 into the Amazon EMR equation. Our Spot Bidding strategy was either on-Demand price or 20% above On-demand price. Since the Spot prices vary in different AZ's we collected the past price history, current market price etc and chose the right AZ's appropriately. Choosing right AZ sometimes mean creating the entire EMR Cluster on low priced AZ (All Master,Core & Task should reside on Same AZ). The Master and Core Nodes were running On-Demand EC2 and Task Nodes were running on Spot EC2 or On-Demand EC2. The Custom EMR Manager has the capability to increase the Core nodes depending upon the Log Data Volume + Create the entire EMR cluster (with revised Core+task Node numbers) in New AZ depending upon the Spot prices. This decision was made by the Custom EMR manager periodically depending upon the log data volume fluctuations and Log Volume patterns.
- Spot+On Demand EC2 for EMR is deadly combination for Cost Savings. Though this combination is suitable for Time in-sensitive jobs, but if some intelligence and imagination can be applied in your architecture and design, you can use this for certain time sensitive jobs as well.
- Bigger files are better, so we merge the files into Bigger chunks before proccesing in Amazon EMR.
- In Pre processing stage, we split files with ratio of 1 file per mapper. This gives us better time predictability during processing. Also the file transfer was faster between S3 and EMR because of this manageable size splits.
- Data is compressed at all possible levels. Snappy and .lzo compression was used.
- We dynamically increase/decrease the task nodes using the Custom job manager. If no Spot EC2 available for Task nodes, then Custom EMR manager adds on-Demand task nodes to the cluster.
- The jobs were designed to be small in processing logic size and intermediate output data is stored in S3. This way we can re-create the cluster with reproducible data. Only Processing data was kept in EMR cluster, rest of the data were kept in S3.
- The EMR clusters can be sized according to the log data volumes. This intelligence was built in the custom EMR manager based on the past patterns, Data Density tests etc, during pre processing phase.
- Certain Intermediate reducers were designed to create number of result files according to number of Mappers in next jobs. This helped us in optimum utilization of capacity inside a hour.
- Decision to re-size clusters are made nearest to the hour. If AWS brings per minute pricing , it will help such use cases to save more costs.
- Every API call and millisecond matters in big data programming, tune the MR code and test the 3rd party APIs for performance before integrating them into your code.
- Understand NW capacities of to/fro data transfer to S3, m1.Xlarge EMR Nodes (Memory Sizing, Mapper / reducer numbers) and work leveraging the strengths of AWS.
- If your processing requirements are totally time insensitive, you can bid very close to spot price and have AWS interrupt(mostly) within the hour and literally get entire processing stuff done for free. (If AWS interrupts the Spot EC2 you will not be charged for any partial hour of usage). This strategy was not used by us. If any of you use this, please write to me, will be happy to discuss.
- EMR with Spot brought ~56% cost savings from pure On-Demand model for Core+ Task Nodes.
- Customer CXO's were happy !!!