Amazon Redshift Automated Snapshot Trigger

What actually triggers an automatic incremental backup/snapshot for Amazon Redshift? Is it time-based? The documentation says Redshift "periodically takes snapshots and tracks incremental changes to the cluster since the last snapshot", and I know that whenever I modify the cluster itself (delete it, resize it, or change the node type) a snapshot is taken. But what about when a database on the cluster is altered? I have inserted, loaded, and deleted many rows, but no automatic snapshot is taken. Would I just have to do manual backups then?
I have asked around and looked online, and no one has been able to give me an answer. I am trying to figure out an optimal backup strategy for my workload.

Automated backups are taken every 8 hours or every 5 GB of inserted data, whichever happens first.
Source: I work for AWS Redshift.
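If that cadence is not frequent enough for your recovery point objective, you can supplement it with manual snapshots on your own schedule. A minimal sketch with boto3 (the region and cluster identifier are placeholders, not values from the question):

```python
import datetime

import boto3

# Region and cluster identifier are placeholders.
redshift = boto3.client("redshift", region_name="us-east-1")

snapshot_id = "my-cluster-manual-" + datetime.datetime.utcnow().strftime("%Y%m%d%H%M")

# Manual snapshots are kept until you delete them (unlike automated ones),
# so pair this with your own cleanup if you run it on a schedule.
redshift.create_cluster_snapshot(
    SnapshotIdentifier=snapshot_id,
    ClusterIdentifier="my-cluster",
)
```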

Related

Merge AWS EBS snapshots

I am exploring an AWS EBS snapshot policy to minimize data loss if a failure occurs on the server. I am thinking of an hourly snapshot policy with 7 days of retention. It will serve the purpose of minimizing data loss, but it will flood the AWS snapshot console, which may lead to mistakes in the future. To prevent this, I am exploring a way for the hourly backups to be merged together daily.
Scenario
An hourly snapshot policy with 7 days of retention means 24 snapshots per day until the end of the week = 168 snapshots for a server, and 1 merged snapshot would be created at the end of the week.
What I am exploring
An hourly snapshot policy with 7 days of retention and 1-day merging means snapshots would be created hourly until the end of the day and then merged into 1 single snapshot, so I would have one snapshot per day rather than 24.
I explored the AWS documentation but that doesn't help. Any help would be really appreciated.
If you delete any of the snapshots in between, you will find that AWS automatically performs this merge functionality to ensure there is no missing data between snapshots.
Deleting a snapshot might not reduce your organization's data storage costs. Other snapshots might reference that snapshot's data, and referenced data is always preserved. If you delete a snapshot containing data being used by a later snapshot, costs associated with the referenced data are allocated to the later snapshot.
If you delete any snapshots (including the first), the data will be merged into the next snapshot that was taken.
Therefore you can relax and adjust the policies as required, without the risk of data loss.
More details are available in the how incremental snapshots work documentation.
I like to think of an Amazon EBS Snapshot as consisting of two items:
Individual backups of each 'block' on the disk
An 'index' of all the blocks on the disk and where their backup is stored
When an EBS Snapshot is created, a backup is made of any blocks that are not already backed up. An index is also made that lists all the blocks in that "backup".
For example, let's say that an EBS Volume has Snapshot #1 and then one block is modified on the disk. If another Snapshot (#2) is created, only one block will be backed-up, but the Snapshot index will point to all the blocks in the backup.
If Snapshot #1 is then deleted, all the blocks that Snapshot #2 references will be retained automatically. Thus, there is no need to "merge" snapshots -- this is all done automatically.
Bottom line: You can delete any snapshots you want. The blocks required to restore all remaining Snapshots will be retained.
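Because deleted snapshots never orphan blocks that later snapshots still need, "merging" effectively amounts to deleting the snapshots you no longer want to keep. A minimal sketch of such a pruning job with boto3 (the volume ID and 7-day cutoff are example values, not a recommendation, and pagination is omitted for brevity):

```python
import datetime

import boto3

ec2 = boto3.client("ec2")
cutoff = datetime.datetime.now(datetime.timezone.utc) - datetime.timedelta(days=7)

# Only snapshots we own, for one example volume; the volume ID is a placeholder.
snapshots = ec2.describe_snapshots(
    OwnerIds=["self"],
    Filters=[{"Name": "volume-id", "Values": ["vol-0123456789abcdef0"]}],
)["Snapshots"]

for snap in snapshots:
    # StartTime is timezone-aware, so it compares directly against the cutoff.
    if snap["StartTime"] < cutoff:
        ec2.delete_snapshot(SnapshotId=snap["SnapshotId"])
```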

Archiving data of specific table in AWS RDS periodically

I use AWS RDS as the database for my Spring Boot application. I would like to archive data older than 6 months from one specific table. In this context, I have gone through a few articles here but did not get any concrete idea of how to do this. Could anyone please help?
If you are looking to back up with RDS itself, your options are limited. You can, of course, use automated RDS snapshots, but that won't let you pick a specific table (it will back up the entire database) and can't be set for retention longer than 35 days. Alternatively, you could manually initiate a snapshot, but you can't set a retention period on it. In that case, though, you could instead use the AWS-published rds-snapshot-tool, which will help you automate the snapshot process and let you specify a maximum age for the snapshots. This is likely the easiest way to use RDS for your purpose. If you only wanted to restore one specific table (and didn't care about having the other tables in the backup), you could restore the snapshot and immediately DROP the tables you don't care about before you start using it.
However, if you really care about backing up only one specific table, then RDS itself is out as a means of taking the backups on your behalf. I am assuming a MySQL database for your Spring application, in which case you will need to use the mysqldump tool to grab the table you are interested in. You will need to call that tool from an application or script and then store the output persistently somewhere (perhaps S3). You will also need to manage the lifecycle of that backup, but if you do use S3, you can set a lifecycle policy to automatically age out and delete old files (backups, in this case).
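A minimal sketch of that mysqldump-to-S3 approach, assuming a MySQL engine and that mysqldump is installed where the script runs; the host, user, database, table, date column, and bucket names are placeholders you would replace:

```python
import datetime
import subprocess

import boto3

# All of these values are placeholders for illustration.
host = "mydb.abc123.us-east-1.rds.amazonaws.com"
user = "admin"
database = "appdb"
table = "orders"
bucket = "my-archive-bucket"

dump_file = f"{table}-{datetime.date.today()}.sql"

# Dump only rows older than 6 months from the one table. The date column name
# is a placeholder; the password is expected via ~/.my.cnf or MYSQL_PWD rather
# than on the command line.
with open(dump_file, "w") as out:
    subprocess.run(
        [
            "mysqldump",
            "-h", host,
            "-u", user,
            "--where=created_at < DATE_SUB(NOW(), INTERVAL 6 MONTH)",
            database,
            table,
        ],
        stdout=out,
        check=True,
    )

# Upload to S3; an S3 lifecycle rule on the bucket can expire old dumps.
boto3.client("s3").upload_file(dump_file, bucket, f"archives/{dump_file}")
```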

Calculating AWS snapshot usage cost programmatically

I am planning to calculate snapshot usage cost using a script.
As per the documentation, if we have the GB-month value we can calculate the cost based on it. Is there any way to calculate a snapshot's size and its age? I could not find any method to fetch the snapshot size. When I describe a snapshot I do get volume-size in snapshotInfo, but I don't think that's the snapshot size. Also, the age of a snapshot is not given in the description; only the timestamp when the snapshot was initiated is in the output.
I don't want the cost for all the snapshots. I will be filtering snapshots based on a custom tag. I saw https://aws.amazon.com/blogs/aws/new-cost-allocation-for-ebs-snapshots/ but this is via the UI and needs special permissions.
The cost and usage report is the only way to capture this information. It is not accessible through the service API.
EBS snapshots are -- logically -- the same size as the source volume, because every EBS snapshot contains a reference to a stored representation of every single block on the volume.
But it's only a reference -- a pointer -- because EBS doesn't store the actual data blocks inside the snapshot itself. It maintains a mapping and has the ability to determine which blocks are unchanged from snapshot to snapshot, so that it doesn't redundantly store them.
The price you pay for a given snapshot is directly determined by how many blocks in that snapshot differ from those in the most recent prior snapshot of the same volume that still exists. Deleting older snapshots preserves any blocks that are still needed to restore newer snapshots, and rolls the cost of those blocks forward into the snapshots that still exist: the cost shifts into the oldest remaining snapshot that still needs the blocks once any older ones are deleted.
So the cost of a given snapshot changes as previous snapshots of the same volume are deleted.
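To illustrate with hypothetical numbers: suppose snapshot #1 of a volume captures 100 GB of blocks and snapshot #2 adds 5 GB of changed blocks, so together they are billed for about 105 GB. If you then delete snapshot #1, any of its blocks that snapshot #2 still references roll forward and are billed against snapshot #2, while blocks that only snapshot #1 referenced (for example, blocks overwritten before snapshot #2 was taken) are freed and no longer billed.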
Also:
Only the timestamp when the snapshot was initiated is in the output.
That's the age. Snapshots are snapshots -- an image of the disk at the moment in time the snapshot was initiated. Regardless of how long the snapshot takes to run, the data it captures is the data as it existed on the volume when the snapshot was initiated.
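If the initiation timestamp is all you need, you can compute an age from it yourself. A minimal sketch with boto3, filtering by a custom tag as the question describes (the tag key and value are placeholders); note that VolumeSize is the source volume's size in GiB, not the billed snapshot size, which per the answer above only appears in the cost and usage report:

```python
import datetime

import boto3

ec2 = boto3.client("ec2")
now = datetime.datetime.now(datetime.timezone.utc)

# Tag key/value are placeholders for whatever custom tag you filter on.
paginator = ec2.get_paginator("describe_snapshots")
for page in paginator.paginate(
    OwnerIds=["self"],
    Filters=[{"Name": "tag:Team", "Values": ["analytics"]}],
):
    for snap in page["Snapshots"]:
        age_days = (now - snap["StartTime"]).days
        # VolumeSize is the source volume size, not the billed snapshot size.
        print(snap["SnapshotId"], snap["VolumeSize"], "GiB volume,", age_days, "days old")
```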

Describing snapshots whose associated volume is deleted or no longer present

I was trying to do cost optimisation for my AWS account and came across the snapshot count; I saw lots of snapshots in my console.
Some of those snapshots were created from a volume, and the volume has since been deleted.
How can I describe the snapshots whose volume is no longer present? (I know we can use ec2-describe-snapshots, but I need the filters and a way to get this.)
Thanks in advance. :)
If I were you I would create a Lambda function with this code and have it executed by CloudWatch Events daily; that way you clean up regularly without having to remember! ;)
I am going to reference the node.js API here but the method in the madness is the same for all APIs.
Use ec2 describeSnapshots to get your collection for iteration (http://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/EC2.html#describeSnapshots-property)
For each snapshot, call describeVolumes using the VolumeId from the Snapshot result. If the volume doesn't exist anymore you will get an error. (http://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/EC2.html#describeVolumes-property)
Call deleteSnapshot to delete the snapshot that you no longer need (http://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/EC2.html#deleteSnapshot-property)
Should be a fun little project! :)
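A rough sketch of those three steps, written with Python and boto3 rather than the node.js SDK the links above refer to (the approach is the same). It assumes it runs as a Lambda handler on a daily CloudWatch Events / EventBridge schedule, and that deleting every snapshot whose source volume is gone is really what you want:

```python
import boto3
from botocore.exceptions import ClientError

ec2 = boto3.client("ec2")


def handler(event, context):
    # Step 1: list snapshots owned by this account.
    paginator = ec2.get_paginator("describe_snapshots")
    for page in paginator.paginate(OwnerIds=["self"]):
        for snap in page["Snapshots"]:
            try:
                # Step 2: check whether the source volume still exists.
                ec2.describe_volumes(VolumeIds=[snap["VolumeId"]])
            except ClientError as err:
                if err.response["Error"]["Code"] == "InvalidVolume.NotFound":
                    # Step 3: the volume is gone, so delete the orphaned snapshot.
                    ec2.delete_snapshot(SnapshotId=snap["SnapshotId"])
                else:
                    raise
```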

Do I need to set up a backup data pipeline for AWS DynamoDB on a daily basis?

I am considering using AWS DynamoDB for an application we are building. I understand that setting up a backup job that exports data from DynamoDB to S3 involves a Data Pipeline with EMR. But my question is: do I need to worry about having a backup job set up on day 1? What are the chances that data loss would happen?
There are multiple use-cases for DynamoDB table data copy elsewhere:
(1) Create a backup in S3 on a daily basis, in order to restore in case of accidental deletion of data or, worse yet, a dropped table (code bugs?)
(2) Create a backup in S3 to become the starting point of your analytics workflows. Once this data is backed up in S3, you can combine it with, say, your RDBMS system (RDS or on-premise) or other S3 data from log files. Data Integration workflows could involve EMR jobs to be ultimately loaded into Redshift (ETL) for BI queries. Or directly load these into Redshift to do more ELT style - so transforms happen within Redshift
(3) Copy (the whole set or a subset of) data from one table to another (either within the same region or another region) - so the old table can be garbage collected for controlled growth and cost containment. This table-to-table copy could also be used as a readily consumable backup table in case of, say region-specific availability issues. Or, use this mechanism to copy data from one region to another to serve it from an endpoint closer to the DynamoDB client application that is using it.
(4) Periodic restore of data from S3. Possibly as a way to load back post-analytics data back into DynamoDB for serving it in online applications with high-concurrency, low-latency requirements.
AWS Data Pipeline helps schedule all these scenarios with flexible data transfer solutions (using EMR underneath).
One caveat when using these solutions: this is not a point-in-time backup, so any changes to the underlying table that happen while the backup is running might leave it inconsistent.
This is really subjective. IMO you shouldn't worry about them 'now'.
You can also use simpler solutions other than Data Pipeline. Perhaps that will be a good place to start.
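As one example of such a simpler, non-Pipeline approach, here is a rough sketch that scans a table and writes a JSON dump to S3 (the table and bucket names are placeholders, and the same caveat applies: a scan is not a point-in-time backup):

```python
import datetime
import json

import boto3

# Table and bucket names are placeholders.
table = boto3.resource("dynamodb").Table("my-table")
s3 = boto3.client("s3")

# Scan the whole table, following pagination.
items = []
kwargs = {}
while True:
    page = table.scan(**kwargs)
    items.extend(page["Items"])
    if "LastEvaluatedKey" not in page:
        break
    kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]

# Write one dated JSON dump per day; default=str handles Decimal values.
key = f"dynamodb-backups/my-table/{datetime.date.today()}.json"
s3.put_object(Bucket="my-backup-bucket", Key=key, Body=json.dumps(items, default=str))
```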
After running DynamoDB as our main production database for more than a year, I can say it has been a great experience. No data loss and no downtime. The only things we have to care about are the SDK occasionally misbehaving and tweaking provisioned throughput.
Data Pipeline is only available in a limited set of regions.
https://docs.aws.amazon.com/general/latest/gr/rande.html#datapipeline_region
I would recommend setting up a Data Pipeline to back up to an S3 bucket on a daily basis, if you want to be really safe.
DynamoDB itself might be very reliable, but nobody can protect you from your own accidental deletions (what if you or a colleague deleted a table from the console by mistake?). So I would suggest setting up a backup on a daily basis; it doesn't cost much in any case.
You can tell the Pipeline to consume only, say, 25% of the table's read capacity while the backup is running so that your real users don't see any delay. Every backup is "full" (not incremental), so at some periodic interval you can delete old backups if you are concerned about storage.