Introduction: Amazon Elastic MapReduce (EMR) is a web service that helps customers with big data processing using Hadoop framework on EC2 and S3. Amazon Elastic MapReduce lets customers focus on crunching data instead of worrying about time-consuming set-up, management or tuning of Hadoop clusters or the EC2 capacity upon which they operate. This in built automation provided by AWS already saves huge labor cost for the customers.
What does the word Elastic mean in Hadoop/EMR context ? Ans: You can dynamically increase the number of processing nodes depending upon the volume/velocity of the data. Adding or removing servers takes minutes, which is much faster than making similar changes in clusters running on physical servers. Let us explore this in detail and analyse how it will help you save costs in AWS Big data processing.
Before getting into the savings part, lets understand the composition of an Amazon EMR Cluster. An Amazon EMR cluster consists of following server components. They are:
Master Node: This node Manages the cluster, it coordinates the distribution of the MapReduce executable and subsets of the raw data, to the core and task nodes. There is only one master node in a cluster. You cannot expand or reduce your Master Node in the EMR Cluster.
Core Node(s): A core node is an EC2 instance that runs Hadoop map/reduce tasks and stores data using the Hadoop Distributed File System (HDFS). Core nodes are managed by the master node. You can add more core nodes to running cluster, but you cannot remove them from a cluster because since it stores data you have an risk of losing data.
Task Node(s): As the name suggests these nodes run tasks and they map to equivalent of Hadoop slave node. These nodes are optional in nature. Task nodes are managed by the master node. While a cluster is running you can increase and decrease the number of task nodes. Because they don't store data and can be added and removed from a cluster, you can use task nodes to manage the EC2 instance capacity your cluster, by increasing capacity to handle peak loads and decreasing it later when there is no load.
Imagine the log volume flow is not constant and it varies every hour, some hours you receive few hundred GB's and some hours few GB's of logs for processing. For peak hours your use case needs around 192 mappers/72 reducers and normal hours you need ~64 mappers/24 reducers or less. The peak and normal numbers can be arrived based on the analysis done on the past data. This elasticity in log volume scenario is a usual occurrence in many big data projects and it is source of cost leakage. Simple approach what many architects take is that they run their cluster infrastructure @ peak capacity always since the operation is time sensitive, but this might not be an optimal approach in amazon cloud-big data world. Since you can elastically increase/decrease the number of nodes in an Amazon EMR cluster it is optimal if you can size the number of nodes dynamically every hour. Since you pay by usage in amazon cloud, having this elasticity built in your architecture will save costs.
Based on the number of mappers/reducers required, we have chosen the node capacity to be in m1.xlarge EC2 units. So during Peak hours you will need 24 processing nodes and normal (avg) hours it will be reduced to 8 processing nodes.
Elastic Approach-1: Vary the Task Nodes:
In this approach, number of Master and Core nodes are maintained constant. 1 - master node and 4- core nodes are used for processing and data storage always. The task nodes are increased and decreased between 4->20 every hour depending upon the log volume flow. Since the data is present in the core nodes and only tasks/jobs are assigned in the task nodes, adding/removing task nodes will not cause problems. You can engineer a custom Job manager using AWS API's and manage this entire cluster easily. If you do a simple math that in average only 8 processing nodes are needed (4 core + 4 task nodes) and during peak hours you need ( 4 core + 20 task nodes) with this approach you can save ~60 % costs by not running your cluster in ALWAYS peak capacity. This model is a recommended approach for many elastic big data use cases in AWS. Refer the below table for cost savings:
No. of Processing Nodes
Elastic Approach-2: Vary both the Core and Task Nodes:
In this approach, the number of both Core and Task nodes are varied dynamically. Since the Core nodes can be only increased and cannot be decreased in a running cluster(because it could lead to data loss), this approach is recommended only for advanced use cases. Since the entire data is stored in S3(is reproducible) and can be moved to the EMR cluster every hour, using the custom Job manager an entire cluster can be created(even every hour) depending upon the log data volume (GB's). Example: Imagine first hour: 4 Core + 10 Task nodes are used for processing, second hour data volume is increased and 4 Core + 20 Task nodes are added in the cluster, third/fourth hour etc there is hardly few GB's of data flow and only 8 Mappers/3 reducers are needed for processing, instead of running 20 task + 4 core nodes(of prev hour), a new EMR cluster can be created with just 1 master and 1-2 Core nodes. This approach requires engineering a custom job manager using AWS API's for managing the cluster. Though this approach is little complex to engineer, it saves more cost than approach-1 on medium to long term.
Note:The approaches illustrated are not theoretical in nature. I have put both the above techniques to production use for some customers and they are already seeing huge cost savings.
Coming Soon - Adding Spot to this equation gives brutal savings ...
Cost Saving Tip 1: Amazon SQS Long Polling and Batch requests
Cost Saving Tip 2: How right search technology choice saves cost in AWS ?
Cost Saving Tip 3: Using Amazon CloudFront Price Class to minimize costs
Cost Saving Tip 4 : Right Sizing Amazon ElastiCache Cluster
Cost Saving Tip 5: How Amazon Auto Scaling can save costs ?
Cost Saving Tip 6: Amazon Auto Scaling Termination policy and savings
Cost Saving Tip 7: Use Amazon S3 Object Expiration
Cost Saving Tip 8: Use Amazon S3 Reduced Redundancy Storage
Cost Saving Tip 9: Have efficient EBS Snapshots Retention strategy in place
Cost Saving Top 10: Make right choice between PIOPS vs Std EBS volumes and save costs
Cost Saving Top 11: How elastic thinking saves cost in Amazon EMR Clusters ?
Cost Saving Top 12: Add Spot Instances with Amazon EMR
Cost Saving Top 13: Use Amazon Glacier for archive data and save costs (new)
Cost Saving Top 14: Plan your deletion in Amazon Glacier and avoid cost leakage (new)