I set up a VM (a Bitnami image running DokuWiki), and when I create manual snapshots, the size varies wildly between 1 MB and 1 GB. Nothing happens on the VM; the snapshots are created minutes apart from each other.
What is happening here? Am I missing something obvious? I want to set up automatic backups, but if manual snapshot creation is not reliable, I would not trust an automated system.
Cheers
The snapshots are incremental.
When incremental snapshots are performed, the most recent existing snapshot is used as the baseline for the next one. The system can create the new snapshot more quickly because it only has to read the new or changed data from the persistent disk.
Every new snapshot contains only data that is new or has changed since the previous snapshot. This is why the sizes vary between backups.
For more information, you can read this article in the GCP public documentation.
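To make the size variation concrete, here is a toy model of the bookkeeping (purely illustrative; it is not how the snapshot service is implemented internally): each snapshot stores only the blocks that differ from the previous baseline, so its size tracks write activity rather than disk size.

```python
# Toy model of incremental snapshots: a snapshot stores only blocks that
# changed since the previous baseline, so its size reflects write activity.
# Illustration only -- not the snapshot service's actual implementation.

def incremental_snapshot(disk, last_state):
    """Keep only the blocks that differ from the previous snapshot's state."""
    return {i: d for i, d in disk.items() if last_state.get(i) != d}

disk = {0: "boot", 1: "wiki-pages-v1", 2: "logs-v1"}

snap1 = incremental_snapshot(disk, {})       # first snapshot: full copy
state1 = dict(disk)                          # baseline for the next snapshot

disk[2] = "logs-v2"                          # only one block is rewritten
snap2 = incremental_snapshot(disk, state1)   # second snapshot: the delta

print(len(snap1))  # -> 3 (all blocks)
print(len(snap2))  # -> 1 (just the changed block)
```

In this picture, a VM that happened to rewrite a large file (say, a log or a package cache) between two snapshots would produce one large snapshot and one tiny one, which matches the 1 MB to 1 GB spread you are seeing.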
When creating a snapshot in the AWS console, it takes a while to finish after I click to create it - let's say 5-10 minutes.
Will it capture any changes that happen during that time window?
If it doesn't capture those changes, how does AWS achieve that, given that the resources keep changing, and how does it know the state of the resource before a change happens?
An Amazon EBS volume is a 'virtual disk'. It is not an actual physical disk. Rather, Amazon EBS is a SAN-like storage service where each block is allocated and stored separately. There is an index of all the blocks that points to where they are stored. Thus, the service can keep track of which blocks are used, unused, and changed.
When an Amazon EBS snapshot is created, the service looks at the 'index' to determine which blocks are currently in use. It then copies those blocks into snapshot storage. (It's pretty smart -- only blocks that have been added or changed since the last snapshot are copied.) Any blocks that change after the snapshot is started will not be included in the snapshot. The EBS service tracks all of those blocks and knows which ones were created at what time. Blocks are even replicated between devices in case of failure.
Bottom line: Don't apply traditional disk concepts to Amazon EBS. Trust that it does its job, and does it well.
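One way to picture the point-in-time behaviour described above: the snapshot captures the block index as it stood when the snapshot started, and later writes go to freshly allocated blocks instead of overwriting old ones. The following toy sketch shows the idea only; EBS's real internals are more sophisticated.

```python
# Toy model: a snapshot is a cheap copy of the block index at a point in
# time. Writes after that point allocate new blocks, so the snapshot's
# view stays frozen. Illustration only, not EBS's actual design.

block_store = {}   # block_id -> data (blocks are never overwritten here)
next_id = 0

def write(index, key, data):
    """Write data to a fresh block and point the volume's index at it."""
    global next_id
    block_store[next_id] = data
    index[key] = next_id
    next_id += 1

volume_index = {}                      # key -> block_id for the live volume
write(volume_index, "db", "rows-v1")

snapshot_index = dict(volume_index)    # snapshot = copy of the index

write(volume_index, "db", "rows-v2")   # a write AFTER the snapshot started

print(block_store[snapshot_index["db"]])  # -> rows-v1 (frozen view)
print(block_store[volume_index["db"]])    # -> rows-v2 (live volume)
```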
Will it capture any changes that happen during that time window?
If it doesn't capture those changes, how does AWS achieve that, given that the resources keep changing, and how does it know the state of the resource before a change happens?
No. This is mentioned in the AWS documentation:
"When you create an EBS volume based on a snapshot, the new volume begins as an exact replica of the original volume that was used to create the snapshot. The replicated volume loads data in the background so that you can begin using it immediately.
So any changes made afterwards will be in the main EBS volume, not in the copy that was replicated in the background.
I'm writing because I'm very confused about the mechanism responsible for taking EBS snapshots.
First of all, as far as I understand the difference between a "backup" and a "snapshot": a backup is a full one-to-one copy of the volume's blocks, whereas a snapshot is a "delta" approach where only the changed blocks are copied. Right?
If that definition is right, then I can assume that taking an EBS snapshot should really be called a backup, as we typically make a full copy of all the blocks that the particular EBS volume is built on.
In almost every document on the AWS website, I read that EBS snapshots are taken incrementally (the first one is full, subsequent ones store only the difference from the previous state). But after my small exercise in the AWS console, I was not able to see that in action.
I took a snapshot of my EBS volume (50 GB) and the snapshot's size was exactly 50 GB. Then I took another snapshot - again, 50 GB. This made me incredibly confused :///
All my experiments were done using only the root volume (the first one attached to the EC2 instance). Now I am wondering: if I have a database (PostgreSQL) installed on an EC2 instance that has only the root volume attached, is it safe to take a snapshot of the EBS volume (as a backup for my DB) while the machine is running? Or do I unfortunately have to periodically take the whole instance offline and only then back up my DB volume?
EBS Snapshots work like this:
On your initial snapshot, it creates a block-level copy of your volume on S3 in the background. On subsequent snapshots it only saves to S3 the blocks that have changed since the last snapshot, and for the rest it keeps pointers to the original blocks. The third snapshot works like the second: it again stores the blocks that have changed since the second snapshot and adds pointers to the other blocks.
If you restore the second snapshot, it creates a new volume, looks up in its metadata store which pointers belong to that snapshot, and then retrieves from S3 the blocks these point to.
If you delete snapshot two, it removes the pointers to the blocks that belong to snapshot two. If any block on S3 has no pointer left, i.e. doesn't belong to any snapshot anymore, it is deleted.
To you as the client this whole process is transparent - you can delete or restore any snapshot you like, and EBS takes care of the specifics in the background. Note that the "size" shown for a snapshot in the console is the size of the source volume, not the amount of data actually stored for the snapshot, which is why both of your snapshots appeared to be 50 GB.
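The delete-and-clean-up behaviour can be pictured as reference counting over shared blocks. A toy sketch of that idea (again, an illustration, not EBS's actual design):

```python
# Toy model of snapshot deletion: each snapshot is a set of pointers into
# block storage, and a block is removed only once no snapshot points to it.

blocks = {"a": "block-1", "b": "block-2", "b2": "block-2-changed"}

snapshots = {
    "snap1": {"a", "b"},    # full snapshot: points at both original blocks
    "snap2": {"a", "b2"},   # incremental: new block b2, still shares a
}

def delete_snapshot(name):
    """Drop a snapshot's pointers, then garbage-collect orphaned blocks."""
    del snapshots[name]
    still_referenced = set().union(*snapshots.values()) if snapshots else set()
    for block_id in list(blocks):
        if block_id not in still_referenced:
            del blocks[block_id]   # no snapshot points here anymore

delete_snapshot("snap1")
print(sorted(blocks))  # -> ['a', 'b2']: 'b' belonged only to snap1, so it is gone
```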
Should you be more interested in the details under the hood, I can recommend this article: The Jellyfish-inspired database under AWS Block Storage
I'm creating periodic snapshots of my EBS volume using a scheduled cron expression rule (thanks, John C).
My data is all binary, and I suspect that the automatic compression AWS performs on my data will actually enlarge the resulting snapshots (compressing data that is already compressed or high-entropy can make it bigger).
Is there a way to instruct AWS not to apply compression when creating snapshots (so I could compare the snapshot's size with and without compression)?
Note:
Creating an Amazon EBS Snapshot seems to indicate that using compression is mandatory.
You have no control over the compression used for EBS snapshots.
EBS snapshots are incremental (except for the first one). The data is compressed based on AWS's own heuristics, and you have no visibility into the actual compressed data's size.
When you're looking at an EBS snapshot, the snapshot's "size" will always be reported as the originating EBS volume's size, regardless of the actual size of the snapshot.
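You can confirm the reported-size behaviour through the API as well. A minimal boto3 sketch, assuming configured AWS credentials and with a placeholder snapshot ID:

```python
# Shows that the EC2 API reports only the source volume's size for a
# snapshot, never the compressed/stored size. The snapshot ID below is a
# placeholder; boto3 and AWS credentials are assumed to be set up.
import boto3

ec2 = boto3.client("ec2")

resp = ec2.describe_snapshots(SnapshotIds=["snap-0123456789abcdef0"])
for snap in resp["Snapshots"]:
    # VolumeSize is the originating volume's size in GiB; there is no
    # field exposing how much compressed data the snapshot actually holds.
    print(snap["SnapshotId"], snap["VolumeSize"], "GiB", snap["State"])
```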
I don't think EBS snapshots are compressed these days (I am not sure if they were earlier), and I could not find any reference to compression in the AWS documentation either. That is why the size of the initial snapshot is the same as the size of the volume. After the first snapshot, subsequent snapshots are incremental, so only the blocks on the device that have changed or been added since the last snapshot are saved in the new snapshot.
You can refer to the blog on how EBS snapshot backup and restore works.
Currently I take manual backups of our EC2 instance by zipping the data and downloading it locally as well as to Dropbox.
But I am wondering: is there an option where a complete copy of the whole system is taken automatically every day, so that if something goes wrong or crashes, I can replace it with the previous copy immediately rather than spend hours installing and configuring things?
I can see there is an option to take an "Image", but can I automate it so that I keep just one latest image and can restore the system with a single click?
You can create a single image (AMI) of your instance as a backup of your instance configuration.
To keep a backup of your data, you can use snapshots of your volumes. Snapshots store data incrementally whenever you make changes.
Whenever needed, you can create a volume from a snapshot and attach it to your instance. Automating the "one latest image" part is also possible; see the sketch below.
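A minimal boto3 sketch of the "keep just one latest image" idea (the instance ID and name prefix are placeholders, and you would run this daily from cron or a scheduled Lambda, with your own error handling):

```python
# Sketch: create a fresh AMI of the instance, then deregister the previous
# one so only the latest remains. Instance ID and prefix are placeholders.
import datetime
import boto3

ec2 = boto3.client("ec2")
INSTANCE_ID = "i-0123456789abcdef0"   # placeholder: your instance
NAME_PREFIX = "daily-backup-"

# Find any previous backup image so we can replace it afterwards.
old = ec2.describe_images(
    Owners=["self"],
    Filters=[{"Name": "name", "Values": [NAME_PREFIX + "*"]}],
)["Images"]

# Create today's image. NoReboot=True avoids downtime, at the cost of a
# crash-consistent (rather than fully quiesced) image.
stamp = datetime.datetime.now(datetime.timezone.utc).strftime("%Y-%m-%d-%H%M")
new = ec2.create_image(
    InstanceId=INSTANCE_ID, Name=NAME_PREFIX + stamp, NoReboot=True
)
print("created", new["ImageId"])

# Deregister the previous image(s) so only the latest one remains.
for image in old:
    ec2.deregister_image(ImageId=image["ImageId"])
```

Note that deregistering an AMI does not delete the EBS snapshots behind it; those would need to be cleaned up separately if you also want to cap storage costs.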
It is not a good idea to do an "external backup" of an EC2 instance snapshot before you read the AWS pricing details.
First, AWS charges for every GB of data you transfer OUT of the AWS cloud; check the pricing. Generally speaking, after the first GB, the rest is charged at least $0.09/GB, versus S3 Standard storage at roughly $0.023/GB per month. For example, transferring 100 GB out costs about $9 each time, while storing those 100 GB in S3 Standard costs about $2.30 per month.
Second, the snapshot created is actually charged at S3 pricing (check: Copying an Amazon EBS Snapshot), not EBS pricing. After offsetting the transfer cost, perhaps you should consider keeping multiple snapshots rather than repeatedly transferring backups out.
HOWEVER, if you happen to use an instance with ephemeral storage, snapshots will not help. You need to copy the data out of the ephemeral storage yourself; then it is your choice whether to store it in S3 or somewhere else.
Third, if you worry about an AWS region going down, check the multi-AZ options, or check out the option of an alternate AWS region.
Fourth, when storing backup data in S3, you can always store it under Infrequent Access, which saves you some bucks, and you won't face an insane Glacier bill during an emergency restore (avoid Glacier unless you are pretty sure about your own requirements).
Fifth, once you have planned to do everything inside AWS, you can write a bash script (AWS CLI) or use an API such as boto3 to do the automatic backup; a sketch is at the end of this answer.
Lastly, here is the way AWS creates and maintains snapshots. Though each snapshot is deemed "incremental", when you delete an old snapshot:
"the snapshot deletion process is designed so that you need to retain only the most recent snapshot in order to restore the volume."
You can always "test" the restore by creating another EC2 instance that loads the backup snapshot, or by mounting a volume created from the snapshot on another EC2 instance to check the contents.
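As a sketch of the boto3 automation mentioned above (the volume ID and retention count are placeholders; add your own error handling before relying on it):

```python
# Sketch: take a snapshot of one EBS volume, then prune all but the newest
# KEEP snapshots of that volume. Volume ID and KEEP are placeholders.
import boto3

ec2 = boto3.client("ec2")
VOLUME_ID = "vol-0123456789abcdef0"   # placeholder: your EBS volume
KEEP = 7                              # how many snapshots to retain

# Take a new snapshot of the volume.
snap = ec2.create_snapshot(VolumeId=VOLUME_ID, Description="automated backup")
print("created", snap["SnapshotId"])

# List this volume's snapshots, newest first, and prune the oldest ones.
snaps = ec2.describe_snapshots(
    OwnerIds=["self"],
    Filters=[{"Name": "volume-id", "Values": [VOLUME_ID]}],
)["Snapshots"]
snaps.sort(key=lambda s: s["StartTime"], reverse=True)

for old in snaps[KEEP:]:
    ec2.delete_snapshot(SnapshotId=old["SnapshotId"])
    print("deleted", old["SnapshotId"])
```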
I'm trying to back up my Cassandra cluster to AWS S3, and found this tool, which seems to do the job:
https://github.com/tbarbugli/cassandra_snapshotter/
But the problem is, in our current cluster we can't afford to have snapshots on the same disk as the actual data, since we are using SSDs with limited space.
I've also looked at the nodetool snapshot documentation, but I didn't find any option to change the snapshot directory.
So, how can I back up Cassandra to another disk, without using the data disk?
Cassandra snapshots are just hard links to all the live sstables at the moment you take the snapshot, so initially they don't take up any additional space on disk. As time passes, new live sstables will supersede the old ones, at which point your snapshots start to count against your storage space.
Generally you take a snapshot to get a consistent view of the database at a given point in time, and then use an external tool or script to copy that backup to external storage (and finally clean up the snapshot). A sketch of such a script is below.
There is no additional tool provided with Cassandra to handle copying the snapshots to external storage. This isn't too surprising, as backup strategies vary a lot across companies.
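As a rough sketch of that pattern using boto3 (the data directory, bucket name, and snapshot tag are assumptions; adjust them for your cluster):

```python
# Sketch: take a nodetool snapshot, copy the hard-linked files to S3, then
# clear the snapshot. Data path, bucket, and tag below are assumptions.
import os
import subprocess
import boto3

DATA_DIR = "/var/lib/cassandra/data"   # default data path; yours may differ
BUCKET = "my-cassandra-backups"        # placeholder bucket
TAG = "nightly"

s3 = boto3.client("s3")

# 1. Take the snapshot (hard links: near-instant, no extra space yet).
subprocess.run(["nodetool", "snapshot", "-t", TAG], check=True)

# 2. Walk the data directories and upload every file under snapshots/TAG.
for root, _dirs, files in os.walk(DATA_DIR):
    if os.path.join("snapshots", TAG) not in root:
        continue
    for name in files:
        path = os.path.join(root, name)
        key = os.path.relpath(path, DATA_DIR)
        s3.upload_file(path, BUCKET, key)

# 3. Remove the snapshot so the hard links don't pin old sstables.
subprocess.run(["nodetool", "clearsnapshot", "-t", TAG], check=True)
```

Because the snapshot is cleared right after the upload, the hard links only pin old sstables for the duration of the copy, which keeps the extra space consumed on the data disk minimal.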