Cloud, Big Data and Mobile: Part 11: Understanding Amazon Elastic Block Store Snapshots

We know that EBS volumes have built-in redundancy, which means that a volume will not fail just because an individual drive fails. But this redundancy is limited to the scope of a single Availability Zone (AZ). Unlike some other AWS services (S3, DynamoDB, RDS with Multi-AZ, etc.), EBS does not automatically replicate data across multiple Availability Zones.

AWS describes the durability of EBS on its site as follows:

“The durability of your EBS volume depends both on the size of your volume and the percentage of the data that has changed since your last snapshot. As an example, volumes that operate with 20 GB or less of modified data since their most recent Amazon EBS snapshot can expect an annual failure rate (AFR) of between 0.1% – 0.5%, where failure refers to a complete loss of the volume. This compares with commodity hard disks that will typically fail with an AFR of around 4%, making EBS volumes 10 times more reliable than typical commodity disk drives.”
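To put the quoted figures in perspective, here is a minimal sketch that turns the AFR values into expected volume losses per year across a fleet. The AFR values are the ones quoted above; the fleet size of 1,000 and the function names are assumptions chosen only for illustration.

```python
# AFR figures come from the AWS quote; FLEET_SIZE is a hypothetical value.
EBS_AFR_LOW = 0.001     # 0.1%
EBS_AFR_HIGH = 0.005    # 0.5%
COMMODITY_AFR = 0.04    # 4%
FLEET_SIZE = 1000       # assumed number of volumes or disks


def expected_failures(afr: float, count: int) -> float:
    """Expected number of complete failures per year across `count` units."""
    return afr * count


def reliability_ratio(afr: float, baseline_afr: float) -> float:
    """How many times lower `afr` is than `baseline_afr`."""
    return baseline_afr / afr


if __name__ == "__main__":
    rows = [
        ("EBS (low estimate)", EBS_AFR_LOW),
        ("EBS (high estimate)", EBS_AFR_HIGH),
        ("Commodity disk", COMMODITY_AFR),
    ]
    for label, afr in rows:
        failures = expected_failures(afr, FLEET_SIZE)
        print(f"{label}: about {failures:.0f} failures per year per {FLEET_SIZE} units")
    low = reliability_ratio(EBS_AFR_HIGH, COMMODITY_AFR)
    high = reliability_ratio(EBS_AFR_LOW, COMMODITY_AFR)
    print(f"EBS vs commodity disk: {low:.0f}x to {high:.0f}x lower AFR")
```

With these inputs the ratio works out to between roughly 8x and 40x, which is in line with the "10 times" figure in the quote.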

Technically, we can guard against the loss of a single volume by mirroring EBS volumes, but mirroring does not help if there is a failure at the AZ level. This constraint strongly suggests that, to safeguard your data, you need to take backups and store them across multiple Availability Zones. Common challenges in the backup process include the time it takes to create data copies, the disk space required, and the impact on server operations during the copy. New-generation storage arrays can dramatically speed up the backup process by using a technique called a “snapshot”.

A snapshot is the state of a system (for example, a LUN-level copy of data) at a particular point in time. One of the most common types is the differential snapshot, which allows fast creation and reduced disk space consumption. Common implementations of differential snapshots include copy-on-write and allocate-on-write. Better implementations of these techniques create copies instantly, allow the copies to be used read-write, and permit many copies to coexist and be active at the same time.
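As a rough illustration of the copy-on-write idea, the sketch below models a tiny block volume in memory. The class and method names are hypothetical, and this is a teaching toy, not how any particular storage array (or EBS) implements snapshots.

```python
from typing import Dict, List


class CowVolume:
    """Toy block volume with copy-on-write snapshots (illustrative only)."""

    def __init__(self, num_blocks: int) -> None:
        self.blocks: List[bytes] = [b""] * num_blocks
        # One dict per snapshot: block index -> content preserved before overwrite.
        self.snapshots: List[Dict[int, bytes]] = []

    def snapshot(self) -> int:
        """Create a snapshot instantly; no block data is copied at this point."""
        self.snapshots.append({})
        return len(self.snapshots) - 1

    def write(self, index: int, data: bytes) -> None:
        """Preserve the old block for snapshots that still need it, then overwrite."""
        old = self.blocks[index]
        for saved in self.snapshots:
            saved.setdefault(index, old)  # keep only the first preserved copy
        self.blocks[index] = data

    def read_snapshot(self, snap_id: int, index: int) -> bytes:
        """Return a block as it was when snapshot `snap_id` was taken."""
        return self.snapshots[snap_id].get(index, self.blocks[index])


if __name__ == "__main__":
    vol = CowVolume(4)
    vol.write(0, b"A")
    snap = vol.snapshot()
    vol.write(0, b"A1")
    print(vol.read_snapshot(snap, 0))  # b'A'
    print(vol.blocks[0])               # b'A1'
```

Creating the snapshot costs almost nothing; the copying is deferred until a block is actually overwritten, which is why differential snapshots are fast to create and consume little space.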

Amazon EBS snapshots are incremental backups, meaning that every snapshot copies only the blocks in the volume that have changed since the last snapshot. In subsequent snapshots, only the table of contents (TOC) and the changed blocks are copied (in compressed form) to S3. If you have a volume with 10 GB of data, but only 2 GB of data has changed since your last snapshot, only the 2 GB of modified data is written to Amazon S3 during the snapshot process.

AWS does not disclose the internals of its snapshot technology, but based on our understanding of storage systems, let us explore how it might work:

Step 1) When you take a snapshot of an EBS volume for the first time, it is a full snapshot, but it copies only the blocks in the EBS volume that contain data. During the first snapshot, the full TOC and all blocks containing data (A, B, C, D, and E) are moved asynchronously to S3.

Step 2) Imagine that in the meantime blocks D and E were changed and F was newly added since snapshot 1. When you take snapshot 2, only the TOC and the changed blocks D1, E1, and F are moved to S3.

Step 3) By the time you take snapshot 3, blocks E and F have changed and G has been newly added, as per the diagram. This time only the TOC and the changed blocks E2, F1, and G are moved to S3.

Step 4) Since snapshot 3 is the most recent and contains the latest data, you can go ahead and delete older snapshots such as 1 and 2. The capacity occupied by blocks like D, E, F, and E1 is no longer relevant; it is released and no longer charged by AWS.
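The four steps above can be sketched as a small in-memory model. This is only a sketch of our assumed mechanism (one TOC per snapshot, plus block versions shared between snapshots); the class name, method names, and data layout are hypothetical and do not describe AWS's actual implementation.

```python
from typing import Dict, List, Set


class SnapshotStore:
    """Toy model of incremental snapshots: one TOC per snapshot, shared blocks."""

    def __init__(self) -> None:
        self.tocs: Dict[int, Dict[str, str]] = {}  # snapshot id -> {slot: block version}
        self.stored: Set[str] = set()              # block versions currently held
        self._next_id = 1

    def take_snapshot(self, volume: Dict[str, str]) -> List[str]:
        """Store only block versions not already held; return what was uploaded."""
        uploaded = sorted(v for v in volume.values() if v not in self.stored)
        self.stored.update(uploaded)
        self.tocs[self._next_id] = dict(volume)    # the full TOC is written every time
        self._next_id += 1
        return uploaded

    def delete_snapshot(self, snap_id: int) -> List[str]:
        """Drop a TOC and free block versions that no remaining snapshot references."""
        del self.tocs[snap_id]
        still_needed = {v for toc in self.tocs.values() for v in toc.values()}
        freed = sorted(self.stored - still_needed)
        self.stored = still_needed
        return freed


if __name__ == "__main__":
    store = SnapshotStore()
    volume = {"A": "A", "B": "B", "C": "C", "D": "D", "E": "E"}
    print(store.take_snapshot(volume))   # snapshot 1: ['A', 'B', 'C', 'D', 'E']
    volume.update({"D": "D1", "E": "E1", "F": "F"})
    print(store.take_snapshot(volume))   # snapshot 2: ['D1', 'E1', 'F']
    volume.update({"E": "E2", "F": "F1", "G": "G"})
    print(store.take_snapshot(volume))   # snapshot 3: ['E2', 'F1', 'G']
    print(store.delete_snapshot(1))      # frees ['D', 'E']
    print(store.delete_snapshot(2))      # frees ['E1', 'F']
```

Note that deleting snapshots 1 and 2 frees only D, E, E1, and F. Blocks A, B, C, and D1 were uploaded by the older snapshots but are still referenced by snapshot 3, so they are retained and snapshot 3 remains fully restorable.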

You can observe that the above mechanism is much more cost-effective, because you pay only for what has changed. Second, the overall backup capacity is used efficiently, and third, snapshots are faster to take than traditional backups. Note that taking a snapshot can reduce the IOPS you get from your volume while the snapshot is pending; this usually lasts from a few milliseconds to a few seconds, depending on how much changed between snapshots.

In Amazon infrastructure, snapshots are usually used to achieve some of the following objectives:

Expand the size of an EBS volume
Create multiple duplicate volumes (copies) inside an AZ
Create volumes across Amazon Availability Zones inside an Amazon EC2 region (in the event of a failure)
Create similar volumes across Amazon EC2 regions using the EBS snapshot copy mechanism. This feature helps with geographic expansion, data center migration, and disaster recovery.
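As a hedged sketch of how these objectives map onto the EC2 API, the two helpers below take an `ec2` object that is assumed to behave like a boto3 EC2 client (`boto3.client("ec2", region_name=...)`). The helper names and descriptions are hypothetical; `create_snapshot`, `get_waiter("snapshot_completed")`, `create_volume`, and `copy_snapshot` are the boto3 client calls they rely on.

```python
from typing import Any, Optional


def restore_in_other_az(ec2: Any, volume_id: str, target_az: str,
                        size_gib: Optional[int] = None) -> str:
    """Snapshot a volume, then create a new volume from it in `target_az`.

    Passing a `size_gib` larger than the source volume also covers the
    "expand the size of an EBS volume" use case.
    """
    snap_id = ec2.create_snapshot(
        VolumeId=volume_id, Description="restore_in_other_az sketch"
    )["SnapshotId"]
    # Wait for the snapshot to complete before creating a volume from it.
    ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snap_id])
    params = {"SnapshotId": snap_id, "AvailabilityZone": target_az}
    if size_gib is not None:
        params["Size"] = size_gib
    return ec2.create_volume(**params)["VolumeId"]


def copy_to_region(dest_ec2: Any, source_region: str, snapshot_id: str) -> str:
    """Copy a snapshot into the region that `dest_ec2` is bound to."""
    return dest_ec2.copy_snapshot(
        SourceRegion=source_region,
        SourceSnapshotId=snapshot_id,
        Description="copy_to_region sketch",
    )["SnapshotId"]
```

The cross-region copy is issued against a client for the destination region, with the source region passed as a parameter. Any volume IDs, snapshot IDs, and AZ names you pass in are your own; none are implied by this article.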

Although EBS snapshots can be taken regardless of whether the volume is attached to a running Amazon EC2 instance, it is strongly recommended to either detach the volume or freeze all writes before taking a snapshot, so that the snapshot does not capture inconsistent or incomplete data. We cannot always detach a volume to take a snapshot: imagine you are running a database or Solr search on EC2. These services need to run continuously, so detaching is not feasible and might prove very costly. In the Amazon cloud it is a recommended practice to use a file system such as XFS, which provides an option to freeze writes for a while so that the snapshot is taken consistently. XFS is also very useful when we use EBS striping (RAID 0).
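The freeze-then-snapshot pattern can be sketched as a small context manager around the `xfs_freeze` command (`-f` freezes, `-u` unfreezes). The function names and the `start_snapshot` callable are hypothetical; the latter stands for whatever call initiates the EBS snapshot in your setup. Running `xfs_freeze` normally requires root privileges.

```python
import subprocess
from contextlib import contextmanager
from typing import Any, Callable, Iterator, List, Optional


def _run_checked(cmd: List[str]) -> None:
    """Run a command and raise if it exits with a non-zero status."""
    subprocess.run(cmd, check=True)


@contextmanager
def frozen_filesystem(mount_point: str,
                      run: Optional[Callable[[List[str]], Any]] = None) -> Iterator[None]:
    """Freeze writes on an XFS mount point for the duration of the `with` block."""
    runner = run if run is not None else _run_checked
    runner(["xfs_freeze", "-f", mount_point])      # suspend writes
    try:
        yield
    finally:
        runner(["xfs_freeze", "-u", mount_point])  # always resume writes


def snapshot_consistently(mount_point: str,
                          start_snapshot: Callable[[], Any],
                          run: Optional[Callable[[List[str]], Any]] = None) -> Any:
    """Freeze the file system, initiate the snapshot, then unfreeze."""
    with frozen_filesystem(mount_point, run):
        return start_snapshot()
```

The `finally` clause matters here: if initiating the snapshot fails, the file system is still unfrozen, so the running service is not left blocked on writes. The `run` parameter exists only so the command runner can be swapped out, for example in a dry run.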

A snapshot of an EBS volume writes a copy of the volume data to Amazon S3 (but not to buckets that are visible to you). S3 is an excellent option for snapshot storage because:

S3 is a separate infrastructure from EBS storage, so it improves availability and reduces the dependency in the event of an EBS failure.
EBS volumes have Availability Zone scope and can be attached only to Amazon EC2 instances launched in the same AZ. On the other hand, since the snapshots are stored in S3, you can create a new volume from them in any AZ inside the Amazon EC2 region.
Since the snapshots are not stored directly in your buckets, you cannot access them using the S3 APIs; you can only list the snapshots using the EC2 API.
On the other hand, one negative I have observed is that accessing data for the first time from an Amazon S3 snapshot can cause latency during the initial loading period. Whenever you create a new volume from an existing snapshot, its data loads lazily in the background. If your EC2 instance accesses data that has not yet been loaded from S3, the volume immediately downloads the requested data from S3 and then continues loading the rest of the data in the background. If you are accessing S3 snapshots from a private subnet inside a VPC, make sure your NAT instance capacity is right-sized to reduce latency during loading.
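One way to avoid the first-access latency described above is to read every block of the restored volume once before putting it into service, so that all data is pulled from S3 up front. The sketch below does this with a plain sequential read. The function name is hypothetical, the device path you pass (for example, something like `/dev/xvdf`) depends on your instance, and reading a raw device typically requires root privileges.

```python
def read_all_blocks(path: str, chunk_size: int = 1024 * 1024) -> int:
    """Read `path` sequentially from start to end and return the bytes read.

    On a volume restored from a snapshot, touching every block this way
    forces the lazily loaded data to be fetched before real traffic arrives.
    """
    total = 0
    with open(path, "rb", buffering=0) as device:
        while True:
            chunk = device.read(chunk_size)
            if not chunk:      # an empty read means end of device or file
                break
            total += len(chunk)
    return total
```

On a large volume this read can take a long time, so it is a trade-off between a slower start and predictable latency afterwards.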

EBS Article Series (continued..)

Part 1: Understanding Amazon Elastic Block Store

Part 2: Understanding Standard EBS Volumes

Part 3: Understanding EBS PIOPS Volumes

Part 4: Understanding EBS-Optimized Instances

Part 5: Understanding Latency in EBS

Part 7: 10% of your provisioned IOPS 99.9% of the time

Part 8: Performance Tuning - Pre Warming the EBS volume

Part 9: Performance Tuning - EBS Striping

Part 10: Performance Tuning - IO Block Size

Part 11: Understanding Amazon EBS Snapshots

Part 12: Securing Amazon EBS volumes - EBS Encryption using SecureCloud