Backup Cassandra to another disk - amazon-web-services

I'm trying to backup my cassandra cluster to AWS' S3, and found this tool, which seems to do the work:
https://github.com/tbarbugli/cassandra_snapshotter/
But the problem is that in our current cluster we can't afford to keep snapshots on the same disk as the actual data, since we are using SSDs with limited space.
I've also looked through the nodetool snapshot documentation, but I didn't find any option to change the snapshot directory.
So, how can I backup cassandra to another disk, without using the data disk?

Cassandra snapshots are just hard links to all the live sstables at the moment you take the snapshot, so initially they don't take up any additional space on disk. As time passes, new live sstables will supersede the old ones, at which point your snapshots will start to count against your storage space.
Generally you will take a snapshot to get a consistent view of the database at a given point in time and then use an external tool or script to copy that backup to external storage (and finally clean up the snapshot).
There is no additional tool provided with Cassandra to handle copying snapshots to external storage. This isn't too surprising, as backup strategies vary a lot across companies.
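The hard-link behavior can be demonstrated locally (the directories below are hypothetical stand-ins, not real Cassandra data directories): a hard link adds a second name for the same inode, so no extra blocks are consumed until the original file is replaced.

```shell
# Hypothetical directories standing in for a table's data dir and its snapshot dir.
workdir=$(mktemp -d)
mkdir "$workdir/sstables" "$workdir/snapshot"
echo "sstable contents" > "$workdir/sstables/data-1.db"

# This is essentially what nodetool snapshot does per sstable file:
ln "$workdir/sstables/data-1.db" "$workdir/snapshot/data-1.db"

# Both names point at the same inode, i.e. the same on-disk blocks:
stat -c %i "$workdir/sstables/data-1.db"
stat -c %i "$workdir/snapshot/data-1.db"
stat -c %h "$workdir/sstables/data-1.db"   # link count is now 2
```

Once compaction replaces a live sstable, the snapshot's link keeps the old file alive, and that is when the snapshot starts consuming real space.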

Related

Are Google Compute Engine (GCE) persistent disk (PD) snapshots crash-consistent?

When I take a persistent disk snapshot, what are the consistency guarantees for this snapshot? I know it is not guaranteed to be "application consistent", but is it "crash consistent"? Or are there no consistency guarantees?
-- EDIT --
For comparison, Machine Images are guaranteed to be crash-consistent as is. However, the docs are silent on this issue for persistent disk snapshots.
TL;DR: PD disk snapshots are crash consistent.
From the Compute Engine persistent disk documentation:
When you take a snapshot of a persistent disk, you don't need to take any additional steps to make your snapshot crash consistent. In particular, you do not need to pause your workload.
As per the snapshot best practices, you can create a consistent snapshot from a persistent disk even while the application is writing data to it. If your app requires strict consistency, follow the steps below to ensure a consistent snapshot.
To prepare your persistent disk before you take a snapshot do the following:
1- Connect to your instance using SSH.
2- Run an app flush to disk. For example, MySQL has a FLUSH statement. Use whichever tool is available for your app.
3- Stop your apps from writing to your persistent disk.
4- Run sudo sync.
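Steps 2-4 above can be sketched as a shell session (the MySQL invocation is one example of an app-level flush and assumes a running server; substitute whatever mechanism your application provides):

```shell
# Step 2: flush application writes to disk. For MySQL this could be
# (example only; requires a running server and credentials):
#   mysql -e 'FLUSH TABLES WITH READ LOCK;'

# Step 3: stop or pause anything writing to the disk being snapshotted.

# Step 4: flush the OS page cache so buffered writes reach the disk
# (run with sudo if your user lacks permission):
sync
```

After the snapshot completes, release any locks and resume your writers.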

Loading large amount of data from local machine into Amazon Elastic Block Store

I am interested in doing some machine learning using an AWS EC2 instance. I have played around with launching instances with an attached EBS volume, and I was able to load files into it via scp from my local command line. I will have several gigabytes of data to load onto this EBS volume (I know that isn't a lot by ML standards, but that's not really my point). I would like to know the appropriate way to load this data. I'm concerned about racking up large fees because I did something in a silly way.
So far I have just uploaded a few files to the EC2 instance's associated EBS manually via the command line, like this:
scp -i keys/ec2-ml-micro2.pem data/BB000000001.png ubuntu@<my instance ip>:/data
This seems to me to be a rather primitive approach (not that that is always a bad thing). Is it the "right" way? I'm not opposed to letting a batch job run overnight like this, but I am not sure whether it may incur some data transfer fees. I've looked around for information on this, and I have read the page on EBS pricing. I didn't see anything on costs associated with loading data, but I just wanted to confirm with someone who has done something similar that this is the correct approach, and if not, what a better one is.
When managing large objects in AWS, always consider S3 first: it provides practically unlimited storage capacity and is better suited to object storage than EBS (block storage). EBS bills you for the size of the volume you provision, which means you risk over-provisioning (overhead cost) or under-provisioning (which can lead to poor performance or even downtime).
With S3 you are billed per GB per month for the storage you actually consume, a pay-for-what-you-use model that is very cheap compared to EBS.
And lastly, evaluate the AWS machine learning services first; one of them may fit your use case and save you a lot of time and effort.
Data Transfer from S3 to EBS within the same region is free of charge.
AWS Pricing Details

Snapshot size is inconsistent

I set up a VM (using Bitnami running DokuWiki) and when I create manual snapshots, the size varies wildly between 1MB and 1GB. Nothing happens to the VM, the snapshots are created minutes apart from each other.
What is happening here? Am I missing something obvious? I want to set up auto backup, but if the manual creation of snapshots is not reliable I would not trust an auto system.
Cheers
The snapshots are incremental.
When incremental snapshots are performed, the most recent existing snapshot is used as the baseline for subsequent snapshots. The system can create a new snapshot more quickly when it uses the previous snapshot and reads only the new or changed data from the persistent disk.
Each new snapshot contains only new or modified data, which is why the sizes vary between backups.
For more information in this regard, you may read this article from the GCP public documentation.

How can find out how much space my snapshots are taking in Amazon EC2

I am creating daily snapshots (backups) in Amazon Ec2, and I need to find out how much space the snapshots are taking so that I can remove them if they take up too much space. I have looked and I am unable to find what I need.
I know that it's on S3, but I have not seen any bucket created there in which I can see the snapshots.
Also, is there a way to download a snapshot to my computer (where I can store it), and upload it when needed?
Snapshots are persisted to S3, but are not stored in buckets owned by the user - so you won't be able to see them there.
To see how much space is being used by snapshots, log in to the AWS Console through your web browser and look under "Elastic Block Store", or, if you've installed the command-line tools, run the ec2-describe-snapshots command, which returns the following parameter:
volume-size
The size of the volume
As for downloading your snapshots, it's possible for non-Windows instance snapshots, but it's quite involved. Here are the instructions.
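Note that volume-size reports the size of the source volume, not the storage the snapshot actually consumes; because EBS snapshots are incremental, summing it only gives an upper bound. A sketch with the modern AWS CLI (the sample numbers below stand in for real output so the pipeline runs standalone):

```shell
# The real query (requires AWS credentials) would be:
#   aws ec2 describe-snapshots --owner-ids self \
#     --query 'Snapshots[].VolumeSize' --output text
# Sample output substituted here:
echo "8 30 100" | tr ' \t' '\n\n' | awk '{sum += $1} END {print sum " GiB (upper bound)"}'
```

For the sample values this prints `138 GiB (upper bound)`; the true billed storage for incremental snapshots will usually be smaller.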

Amazon instance store

As far as I understand, a newly created Amazon instance uses the ephemeral data store by default, unless an EBS store is configured.
After stopping an instance that uses the ephemeral data store, I will lose all data. Is that correct?
I noticed that an EBS store was created automatically for my instance. I created a few files in the home directory, and these files were not deleted after a reboot. So where is the ephemeral data stored?
I want to install a database on an Amazon host. Should I worry about data loss with the default setup, and what is the common configuration? For example:
Create instance
Install and configure database on ephermeral data store
Make AMI
Create EBS store and configure database to use it as storage
After stopping an instance that uses the ephemeral data store, I will lose all data. Is that correct?
To be specific, after you terminate or stop a node, any data on instance-specific storage will be lost. A reboot is different, and your data is intact in those cases. I am using these terms to match the terms in the AWS console.
To confuse matters slightly, some EBS-backed nodes also have some instance-specific storage. All instance-storage nodes are 100% instance-backed, though. So you really need to understand whether your data is hitting an EBS disk or instance-local storage.
I noticed that an EBS store was created automatically for my instance. I created a few files in the home directory, and these files were not deleted after a reboot. So where is the ephemeral data stored?
Several points here:
For an EBS-backed instance, your /home partition is on the EBS root device, and hence data will persist provided the volume exists.
Again a reboot wouldn't delete your data even if you had an instance-storage node, but it sounds like you chose an EBS-backed node.
If you had instead created these files in /mnt, then stopped your instance and later started it again, you might have lost them. Again it depends exactly which ec2 node type you're running.
Regarding your last point - I would recommend that you just make sure your data is being stored on some EBS backed disk. Whether that is your root device or a separate EBS volume is up to you and depends on your specific needs.
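A quick way to check which device actually backs a given path is df; on an EC2 instance you can then cross-check the device name against the block-device mapping (the metadata endpoint below is EC2-only and will not respond elsewhere):

```shell
# Which filesystem/device backs a path (the root path is used here so the
# command works anywhere; on the instance you would check /home or /mnt):
df -h /

# On an EC2 instance only, list the block-device mapping from instance metadata:
#   curl -s http://169.254.169.254/latest/meta-data/block-device-mapping/
```

If the device for your data path maps to an EBS volume, the data survives stop/start; if it maps to instance storage, it does not.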
I want to install database to Amazon host.
You should give some thought to not installing and maintaining your own database. Doing so is complex, error prone, and can be quite time consuming.
A better option for most folks is a turnkey database solution like RDS. This is a performant database that you don't have to really think about - it'll just work. RDS isn't for everyone, as there are some restrictive permission issues, but generally speaking it's great. I use it every day.
You can run databases on top of EBS and it'll work just fine. But you are biting off being a database admin at that point, and need to worry about all the complexity that comes with it. In my opinion, better to focus your time & energy on things like database schema, queries, and other aspects of your business.