I would like to back up a snapshot of my Amazon Redshift cluster into Amazon Glacier.
I don't see a way to do that using the API of either Redshift or Glacier. I also don't see a way to export a Redshift snapshot to a custom S3 bucket so that I can write a script to move the files into Glacier.
Any suggestion on how I should accomplish this?
There is no function in Amazon Redshift to export data directly to Amazon Glacier.
Amazon Redshift snapshots, while stored in Amazon S3, are only accessible via the Amazon Redshift console for restoring data back to Redshift. The snapshots are not accessible for any other purpose (e.g. moving them to Amazon Glacier).
The closest option for moving data from Redshift to Glacier would be to use the Redshift UNLOAD command to export data to files in Amazon S3, and then to lifecycle the data from S3 into Glacier.
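For illustration, here is a minimal boto3 sketch of that approach; the cluster identifier, database, IAM role, bucket, and table names are all hypothetical placeholders, and the UNLOAD statement could equally be run from any SQL client:

```python
import boto3

# Hypothetical identifiers -- replace with your own cluster, bucket and IAM role.
CLUSTER_ID = "my-redshift-cluster"
DATABASE = "mydb"
DB_USER = "admin"
UNLOAD_ROLE = "arn:aws:iam::123456789012:role/MyRedshiftUnloadRole"
BUCKET = "my-redshift-exports"

# 1. Export a table to S3 with UNLOAD (run here via the Redshift Data API).
redshift_data = boto3.client("redshift-data")
redshift_data.execute_statement(
    ClusterIdentifier=CLUSTER_ID,
    Database=DATABASE,
    DbUser=DB_USER,
    Sql=f"""
        UNLOAD ('SELECT * FROM my_table')
        TO 's3://{BUCKET}/exports/my_table/'
        IAM_ROLE '{UNLOAD_ROLE}'
        GZIP;
    """,
)

# 2. Lifecycle the exported files from S3 into Glacier after 30 days.
s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket=BUCKET,
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-redshift-exports",
                "Filter": {"Prefix": "exports/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
            }
        ]
    },
)
```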
Alternatively, simply keep the data in Redshift snapshots. Backup storage beyond the provisioned storage size of your cluster, and backups stored after your cluster is terminated, are billed at standard Amazon S3 rates. This has the benefit of being easily loadable back into a Redshift cluster. While you'd be paying slightly more for storage (compared to Glacier), the real cost saving is in the convenience of quickly restoring the data in the future.
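If you go that route, taking and later restoring a manual snapshot is a couple of API calls; a hedged sketch with hypothetical identifiers:

```python
import boto3

redshift = boto3.client("redshift")

# Take a manual snapshot (kept until you delete it, billed at S3 rates
# beyond the free backup allowance). Identifiers are hypothetical.
redshift.create_cluster_snapshot(
    SnapshotIdentifier="my-cluster-archive-2024-01-01",
    ClusterIdentifier="my-redshift-cluster",
)

# Later, restore the snapshot into a new cluster when the data is needed again.
redshift.restore_from_cluster_snapshot(
    ClusterIdentifier="my-restored-cluster",
    SnapshotIdentifier="my-cluster-archive-2024-01-01",
)
```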
Is there any use case for taking a separate backup, given that Redshift automatically keeps snapshots?
I see that AWS DocumentDB creates automatic snapshots daily, and I can also create manual snapshots from the AWS Console. The documentation says that the snapshots are saved in S3, but they are not visible to me in S3.
I basically want to move the DocumentDB data to S3 in order to propagate it further to other AWS services for monitoring purposes. I was thinking I could trigger a manual snapshot daily and have a Lambda triggered by the S3 file upload from DocumentDB.
How can I see the automatic and manual snapshots created by DocumentDB in S3?
Backups in Amazon DocumentDB are stored in service-managed S3 buckets and thus there is no way to access the backups directly.
Two options here are:
1. Use mongodump/mongoexport on a schedule: https://docs.aws.amazon.com/documentdb/latest/developerguide/backup_restore-dump_restore_import_export_data.html
2. Use change streams to incrementally write to S3 (see the sketch after this list): https://docs.aws.amazon.com/documentdb/latest/developerguide/change_streams.html
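As a rough illustration of option 2, the sketch below tails a change stream with pymongo and writes each event to S3. The connection string, database, collection, and bucket names are hypothetical, and change streams must already be enabled on the collection (see the change_streams documentation linked above):

```python
import json
import boto3
import pymongo

# Hypothetical DocumentDB connection string and S3 bucket name.
client = pymongo.MongoClient(
    "mongodb://user:password@docdb-cluster.cluster-xxxx.us-east-1.docdb.amazonaws.com:27017/"
    "?tls=true&replicaSet=rs0"
)
collection = client["mydb"]["mycollection"]
s3 = boto3.client("s3")

# Iterate over change events and write each one to S3 as a small JSON object.
with collection.watch() as stream:
    for change in stream:
        s3.put_object(
            Bucket="my-docdb-changes",
            Key=f"changes/{change['_id']['_data']}.json",
            Body=json.dumps(change, default=str),
        )
```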
I am looking to replicate Amazon Redshift data to a different region. However, Redshift only allows cross-region replication for backup purposes. What I am looking for is to process the data in one region and then replicate the processed data to another region, so that tools in the second region can connect directly to Redshift in that region with lower latency.
I know I could build multi-AZ or multi-region Amazon Redshift clusters using Kinesis, as described in the "Build Multi-AZ or Multi-Region Amazon Redshift Clusters" post on the AWS Big Data Blog, but since this is not a transactional app we really don't want to do that.
I use AWS and have automatic backup enabled.
For one of our clients, we need to know exactly where the backup data is stored.
From the AWS FAQ website, I can see that:
Q: Where are my automated backups and DB Snapshots stored and how do I manage their retention?
Amazon RDS DB snapshots and automated backups are stored in S3.
My understanding is that you can have an S3 instance located anywhere you want, so it's not clear to me where this data is.
Just to be clear, I'm interested in the physical location (is it Europe, the US, ...?).
It is stored in the same AWS region where the RDS instance is located.
When you directly store data in S3, you store it in an S3 container called a bucket (S3 doesn't use the term "instance") in the AWS region you choose, and the data always remains only in that region.
RDS snapshots and backups are not something you store directly -- RDS stores them for you, on your behalf -- so there is no option to select the bucket or region: they are always stored in an S3 bucket in the same AWS region where the RDS instance is located. This can't be modified.
The data from RDS backups and snapshots is not visible to you from the S3 console, because it is not stored in one of your S3 buckets -- it is stored in a bucket owned and controlled by the RDS service within the region.
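You can confirm this by listing snapshots through the RDS API in the instance's region; they never show up in any S3 bucket you own. A small sketch, with an example region and instance identifier:

```python
import boto3

# Snapshots are listed through the RDS API in the instance's own region;
# the region and instance identifier here are examples.
rds = boto3.client("rds", region_name="eu-west-1")

response = rds.describe_db_snapshots(DBInstanceIdentifier="my-db-instance")
for snapshot in response["DBSnapshots"]:
    print(snapshot["DBSnapshotIdentifier"], snapshot["SnapshotCreateTime"])
```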
According to this:
Your Amazon RDS backup storage for each region is composed of the automated backups and manual DB snapshots for that region. Your backup storage is equivalent to the sum of the database storage for all instances in that region
I think that means it is stored in that region only, and S3 stores data like this:
Amazon S3 redundantly stores data in multiple facilities and on multiple devices within each facility. To increase durability, Amazon S3 synchronously stores your data across multiple facilities before confirming that the data has been successfully stored.
https://aws.amazon.com/rds/details/backup/
...By default, Amazon RDS creates and saves automated backups of your DB instance securely in Amazon S3 for a user-specified retention period.
...Database snapshots are user-initiated backups of your instance stored in Amazon S3 that are kept until you explicitly delete them
I want to keep a backup of an AWS S3 bucket. If I use Glacier, it will archive the files from the bucket and move them to Glacier, but it will also remove the files from S3. I don't want to remove the files from S3. One option is to use an EBS volume: you can mount the AWS S3 bucket with s3fs and copy it to the EBS volume. Another way is to do an rsync of the existing bucket to a new bucket, which will act as a clone. Is there any other way?
What you are looking for is cross-region replication:
https://aws.amazon.com/blogs/aws/new-cross-region-replication-for-amazon-s3/
Set up versioning and set up the replication.
On the target bucket you could set up a lifecycle policy to archive to Glacier (or you could just use the bucket as a backup as is).
(This will only work between two regions, i.e. the buckets cannot be in the same region.)
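A minimal boto3 sketch of that setup, assuming hypothetical bucket names in two different regions and an existing replication IAM role:

```python
import boto3

# The source and backup buckets live in different regions, so use a client per region.
# Bucket names, regions, and the IAM role ARN are hypothetical.
src = boto3.client("s3", region_name="us-east-1")
dst = boto3.client("s3", region_name="eu-west-1")

# Versioning must be enabled on both buckets before replication can be configured.
src.put_bucket_versioning(
    Bucket="source-bucket", VersioningConfiguration={"Status": "Enabled"}
)
dst.put_bucket_versioning(
    Bucket="backup-bucket", VersioningConfiguration={"Status": "Enabled"}
)

# Replicate everything from the source bucket to the backup bucket in the other region.
src.put_bucket_replication(
    Bucket="source-bucket",
    ReplicationConfiguration={
        "Role": "arn:aws:iam::123456789012:role/s3-replication-role",
        "Rules": [
            {
                "ID": "replicate-all",
                "Priority": 1,
                "Filter": {},
                "Status": "Enabled",
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": "arn:aws:s3:::backup-bucket"},
            }
        ],
    },
)

# Optionally archive the replicated copies to Glacier on the backup bucket.
dst.put_bucket_lifecycle_configuration(
    Bucket="backup-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-to-glacier",
                "Filter": {"Prefix": ""},
                "Status": "Enabled",
                "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
            }
        ]
    },
)
```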
If you want your data to be present in both primary and backup locations then this is more of a data replication use case.
Consider using AWS Lambda, which is an event-driven compute service.
You can write a simple piece of code to copy the data wherever you want. It will execute every time there is a change in the S3 bucket.
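A minimal sketch of such a Lambda handler, assuming it is subscribed to the source bucket's ObjectCreated events and copies each new object into a hypothetical backup bucket:

```python
import boto3
from urllib.parse import unquote_plus

s3 = boto3.client("s3")
BACKUP_BUCKET = "my-backup-bucket"  # hypothetical destination bucket

def handler(event, context):
    """Copy every newly created object into the backup bucket."""
    for record in event.get("Records", []):
        source_bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])
        s3.copy_object(
            Bucket=BACKUP_BUCKET,
            Key=key,
            CopySource={"Bucket": source_bucket, "Key": key},
        )
```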
For more info check the official documentation.
I plan to run a MapReduce job on data stored in S3. The data size is around 1 PB. Will EMR copy the entire 1 PB of data to the spawned VMs with replication factor 3 (if my rf = 3)? If yes, does Amazon charge for copying the data from S3 to the MapReduce cluster?
Also, is it possible to use EMR for data not residing in S3?
Amazon Elastic Map Reduce accesses data directly from Amazon S3. It does not copy the data to HDFS. (It might use some local temp storage, I'm not 100% sure.)
However, it certainly won't trigger your HDFS replication factor, since the data is not stored in HDFS. For example, Task Nodes that don't have HDFS can still access data in S3.
There is no Data Transfer charge for data movements between Amazon S3 and Amazon EMR within the same Region, but it will count towards the S3 Request count.
Amazon Elastic Map Reduce can certainly be used on data not residing in Amazon S3 -- it's just a matter of loading the data from your data source, such as using scp to copy the data into HDFS. Please note that the contents of HDFS will disappear when your cluster terminates. That's why S3 is a good place to store data for EMR -- it is persistent and there is no limit on the amount of data that is stored.
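As an illustration of reading input straight from S3 and writing results back to S3 (so nothing important lives in HDFS), here is a hedged PySpark sketch with hypothetical bucket names:

```python
from pyspark.sql import SparkSession

# On EMR, EMRFS lets Spark read and write s3:// paths directly;
# the buckets and prefixes here are hypothetical.
spark = SparkSession.builder.appName("s3-wordcount").getOrCreate()

lines = spark.read.text("s3://my-input-bucket/logs/")

counts = (
    lines.rdd.flatMap(lambda row: row.value.split())
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b)
)

# Results go back to S3, so they survive after the cluster terminates.
counts.toDF(["word", "count"]).write.parquet("s3://my-output-bucket/wordcounts/")
```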