I would like to shut down my Redshift cluster but keep a backup of it.
I understand I can create a manual snapshot and it will be saved to S3.
To reduce costs even more, I would like to move the snapshot from S3 to Glacier, but I can't find the snapshot in my S3 account.
Where is the Snapshot being saved? Is AWS keeping it in a different account?
Or maybe I am not going about this the right way at all; should I be backing up my Redshift cluster differently?
Thank you,
Oren.
It's not stored in any of your account's S3 buckets; it's stored "behind the scenes" in S3. Amazon only makes a point of telling you it is stored in S3 so you understand the fault tolerance involved in the storage of your snapshots. If you need to store a backup in one of your own S3 buckets, you would need to do a pg_dump of the database and copy the dump file to S3.
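For reference, a rough sketch of that pg_dump-and-upload approach in Python with boto3 is below. The host, credentials, bucket and key names are placeholders, and pg_dump's compatibility with Redshift is limited (it is mostly useful for the schema), so treat this as a starting point rather than a tested recipe:

import os
import subprocess
import boto3

DUMP_FILE = "redshift_backup.sql"

# Redshift speaks the PostgreSQL protocol on port 5439, so pg_dump can be
# pointed at the cluster endpoint. PGPASSWORD supplies the password.
subprocess.run(
    [
        "pg_dump",
        "--host", "your-cluster.abc123.us-east-1.redshift.amazonaws.com",
        "--port", "5439",
        "--username", "your_user",
        "--dbname", "your_db",
        "--file", DUMP_FILE,
    ],
    env={**os.environ, "PGPASSWORD": "your_password"},
    check=True,
)

# Copy the dump file into one of your own buckets so you control its lifecycle.
boto3.client("s3").upload_file(DUMP_FILE, "your_bucket", "backups/" + DUMP_FILE)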
You can use Redshift's UNLOAD to dump tables straight to an S3 bucket. Unfortunately, you need to do it separately for each table. You'll also want to archive the schema queries (CREATE TABLE, etc.) for your tables; the pg_dump solution doesn't have this problem since it captures the table definitions, but it requires local storage of the files and a manual push to S3. It might still be worth it for a case like archiving and complete shutdown.
UNLOAD('select * from your_table') TO 's3://your_bucket/your_table.csv'
WITH CREDENTIALS 'aws_access_key_id=YOUR_KEY;aws_secret_access_key=YOUR_SECRET'
DELIMITER ',' NULL 'null' ALLOWOVERWRITE;
Once you have all your tables in an S3 bucket, you can set a lifecycle rule (created in the Properties pane of the bucket, on the Lifecycle panel) to archive them to the Glacier storage class.
It's a bit confusing because Glacier is its own service, but when you archive via a lifecycle rule the files stay in the S3 bucket. You can tell they're in Glacier by selecting a file in the S3 console, opening the Properties pane and then the Details panel; there it should say Storage class: Glacier.
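If you prefer to script it rather than click through the console, the same kind of lifecycle rule can be created with boto3. A minimal sketch (the bucket name, rule ID and 30-day delay are placeholders you would adjust):

import boto3

s3 = boto3.client("s3")

# Transition everything in the bucket to the Glacier storage class after 30 days.
s3.put_bucket_lifecycle_configuration(
    Bucket="your_bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-redshift-unloads",
                "Filter": {"Prefix": ""},  # empty prefix = whole bucket
                "Status": "Enabled",
                "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
            }
        ]
    },
)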
If you ever need to restore you'll use the COPY command:
COPY your_table FROM 's3://your_bucket/your_table.csv'
CREDENTIALS 'aws_access_key_id=[YOURKEY];aws_secret_access_key=[YOURSECRET]'
DELIMITER ',' NULL 'null' IGNOREBLANKLINES EMPTYASNULL
BLANKSASNULL TIMEFORMAT 'auto' FILLRECORD MAXERROR 1
The Amazon Redshift documentation states that the best way to load data into the database is by using the COPY command. How can I run it automatically every day with a data file uploaded to S3?
The longer version: I have launched a Redshift cluster and set up the database. I have created an S3 bucket and uploaded a CSV file. Now, from the Redshift query editor, I can easily run the COPY command manually. How do I automate this?
Before you finalize your approach, consider the following important points:
If possible, compress the CSV files with gzip and then ingest them into the corresponding Redshift tables. This will reduce the file size by a good margin and will increase overall data ingestion performance.
Finalize the compression encodings for the table columns. If you want Redshift to do the job, automatic compression can be enabled with COMPUPDATE ON in the COPY command. Refer to the AWS documentation.
Now, to answer your question:
As you have already created an S3 bucket for this, create a directory for each table and place that table's files there. If your input files are large, split them into multiple files (the number of files should be chosen according to the number of nodes you have, to enable better parallel ingestion; refer to the AWS documentation for more details).
Your COPY command should look something like this:
PGPASSWORD=<password> psql -h <host> -d <dbname> -p 5439 -U <username> -c "copy <table_name> from 's3://<bucket>/<table_dir_path>/' credentials 'aws_iam_role=<iam role identifier to ingest s3 files into redshift>' delimiter ',' region '<region>' GZIP COMPUPDATE ON REMOVEQUOTES IGNOREHEADER 1"
The next step is to create a Lambda function and enable SNS on the Redshift S3 bucket; this SNS topic should trigger the Lambda as soon as new files arrive in the bucket. An alternative is to set up a CloudWatch scheduler to run the Lambda.
The Lambda function (Java, Python, or any supported language) reads the S3 files, connects to Redshift, and ingests the files into the tables using the COPY command; a sketch is shown after these points.
Lambda has a 15-minute execution limit; if that is a concern for you, Fargate would be better. Running the jobs on EC2 will cost more than Lambda or Fargate (in case you forget to turn the EC2 machine off).
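A minimal sketch of such a Lambda in Python is below. It uses the Redshift Data API (the boto3 "redshift-data" client) so no database driver has to be packaged with the function; the cluster, database, user, table and IAM role names are placeholders, and it assumes the Lambda is triggered directly by an S3 event notification rather than via SNS (an SNS-delivered event would need to be unwrapped first):

import boto3

redshift_data = boto3.client("redshift-data")

def lambda_handler(event, context):
    # The S3 event notification tells us which object just arrived.
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]

    copy_sql = (
        f"copy your_table from 's3://{bucket}/{key}' "
        "iam_role 'arn:aws:iam::123456789012:role/redshift-s3-ingest' "
        "delimiter ',' gzip compupdate on ignoreheader 1"
    )

    # The Data API runs the statement asynchronously, which helps stay well
    # inside Lambda's 15-minute limit.
    response = redshift_data.execute_statement(
        ClusterIdentifier="your-cluster",
        Database="your_db",
        DbUser="your_user",
        Sql=copy_sql,
    )
    return {"statement_id": response["Id"]}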
You could create an external table (Redshift Spectrum) over your bucket, and Redshift will automatically scan all the files in the bucket. Bear in mind that query performance may not be as good as with data loaded via COPY, but what you gain is that no scheduler is needed.
Also, once you have an external table you can load it into Redshift once with a single CREATE TABLE AS SELECT ... FROM your_external_table. The benefit of that approach is that it's idempotent: you don't need to keep track of your files, as it will always load all data from all files in the bucket.
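Roughly, the statements involved look like the sketch below, sent here through the Redshift Data API only to keep the example self-contained; the schema, table, column and IAM role names are all placeholders:

import time
import boto3

redshift_data = boto3.client("redshift-data")

def run(sql):
    # execute_statement is asynchronous, so poll until each step finishes
    # before issuing the next one.
    stmt = redshift_data.execute_statement(
        ClusterIdentifier="your-cluster",
        Database="your_db",
        DbUser="your_user",
        Sql=sql,
    )
    while True:
        status = redshift_data.describe_statement(Id=stmt["Id"])["Status"]
        if status in ("FINISHED", "FAILED", "ABORTED"):
            return status
        time.sleep(2)

# External schema backed by the Glue Data Catalog.
run("""
    create external schema if not exists spectrum_schema
    from data catalog database 'spectrum_db'
    iam_role 'arn:aws:iam::123456789012:role/redshift-spectrum-role'
    create external database if not exists
""")

# External table pointing at the CSV files in the bucket.
run("""
    create external table spectrum_schema.your_table (
        id bigint,
        payload varchar(256)
    )
    row format delimited fields terminated by ','
    stored as textfile
    location 's3://your_bucket/your_table_dir/'
""")

# One-shot load into a regular Redshift table.
run("create table your_table as select * from spectrum_schema.your_table")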
There is a huge amount of data in an S3 bucket in one of my AWS accounts, and I want to process that data using Redshift in another AWS account. I want to save the cost of data transfer and storage, since I already have the data in the first account.
Does Redshift provide the functionality to process data from a shared S3 bucket?
Thanks in advance.
I haven't tested this myself, but you can reference an S3 bucket from any account when copying or unloading data between S3 and Redshift. You simply have to supply the appropriate IAM role or S3 credentials.
See the COPY or UNLOAD syntax; it just asks for an access key/secret, with no account information.
COPY sales FROM 's3://s3-path/to/data/data.csv' CREDENTIALS 'aws_access_key_id=**********;aws_secret_access_key=*******' FORMAT as CSV;
Similarly, the UNLOAD command needs the same thing:
unload ('SELECT * FROM example') TO 's3://path/to/S3/' credentials 'aws_access_key_id=XXXXXXXXXX;aws_secret_access_key=XXXXXXXXXXXXXX' delimiter '|' NULL AS '\\N' escape;
Amazon Athena Log Analysis Services with S3 Glacier
We have petabytes of data in S3. We are https://www.pubnub.com/ and we store our network's usage data in S3 for billing purposes. We have tab-delimited log files stored in an S3 bucket, and Athena is giving us a HIVE_CURSOR_ERROR failure.
Our S3 bucket is set up to automatically push to AWS Glacier after 6 months. The bucket has S3 files hot and ready to read in addition to the Glacier backup files, and we are getting access errors from Athena because of this. The file referenced in the error is a Glacier backup.
My guess is the answer will be: don't keep Glacier backups in the same bucket. That isn't an easy option for us due to our data volume, and I believe Athena will not work in this setup, so we will not be able to use Athena for our log analysis.
However, if there is a way we can use Athena, we would be thrilled. Is there a solution to HIVE_CURSOR_ERROR and a way to skip Glacier files? Our S3 bucket is a flat bucket without folders.
The S3 object name shown in the screenshots has been redacted. The file referenced in the HIVE_CURSOR_ERROR is in fact the Glacier object, as the screenshot of our S3 bucket shows.
Note I tried to post on https://forums.aws.amazon.com/ but that was no bueno.
The documentation from AWS dated May 16 2017 states specifically that Athena does not support the GLACIER storage class:
Athena does not support different storage classes within the bucket specified by the LOCATION clause, does not support the GLACIER storage class, and does not support Requester Pays buckets. For more information, see Storage Classes, Changing the Storage Class of an Object in Amazon S3, and Requester Pays Buckets in the Amazon Simple Storage Service Developer Guide.
We are also interested in this; if you get it to work, please let us know how. :-)
Since the release of February 18, 2019, Athena ignores objects with the GLACIER storage class instead of failing the query:
[…] As a result of fixing this issue, Athena ignores objects transitioned to the GLACIER storage class. Athena does not support querying data from the GLACIER storage class.
You must have an S3 bucket to work with. In addition, the AWS account that you use to initiate an S3 Glacier Select job must have write permissions for the S3 bucket. The Amazon S3 bucket must be in the same AWS Region as the vault that contains the archive object that is being queried.
S3 Glacier Select runs the query and stores the results in an S3 bucket.
Bottom line, you must move the data into an S3 bucket to use the S3 Glacier Select statement, then use Athena on the 'new' S3 bucket.
I want to do a daily backup of my S3 buckets. I was wondering if anyone knew the best practice?
I was thinking of using a Lambda function to copy contents from one S3 bucket to another as the source bucket is updated. But that won't protect against an S3 failure. How do I copy contents from one S3 bucket to another Amazon service like Glacier using Lambda? What's the best practice here for backing up S3 buckets?
NOTE: I want to do a backup, not an archive (where the content is deleted afterward).
Look into S3 cross-region replication to keep a backup copy of everything in another S3 bucket in another region. Note that you can even have the destination bucket be in a different AWS Account, so that it is safe even if your primary S3 account is hacked.
Note that a combination of Cross Region Replication and S3 Object Versioning (which is required for replication) will allow you to keep old versions of your files available even if they are deleted from the source bucket.
Then look into S3 lifecycle management to transition objects to Glacier to save storage costs.
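If you want to script this rather than click through the console, a rough boto3 sketch of the versioning plus replication setup is below. The bucket names and the replication role ARN are placeholders; the lifecycle transition to Glacier on the destination bucket can then be added like any other lifecycle rule:

import boto3

s3 = boto3.client("s3")

# Versioning must be enabled on both the source and the destination bucket
# before replication can be configured.
for bucket in ("source-bucket", "backup-bucket"):
    s3.put_bucket_versioning(
        Bucket=bucket,
        VersioningConfiguration={"Status": "Enabled"},
    )

# Replicate every new object from the source bucket into the backup bucket
# (which lives in another region, and optionally in another account).
s3.put_bucket_replication(
    Bucket="source-bucket",
    ReplicationConfiguration={
        "Role": "arn:aws:iam::123456789012:role/s3-replication-role",
        "Rules": [
            {
                "ID": "replicate-everything",
                "Prefix": "",          # empty prefix = all objects
                "Status": "Enabled",
                "Destination": {"Bucket": "arn:aws:s3:::backup-bucket"},
            }
        ],
    },
)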
I want to keep a backup of an AWS S3 bucket. If I use Glacier, it will archive the files from the bucket and move them to Glacier, but it will also delete the files from S3, and I don't want to delete the files from S3. One option is to try an EBS volume: you can mount the S3 bucket with s3fs and copy it to the EBS volume. Another way is to do an rsync of the existing bucket to a new bucket which will act as a clone. Is there any other way?
What you are looking for is cross-region replication:
https://aws.amazon.com/blogs/aws/new-cross-region-replication-for-amazon-s3/
Set up versioning and set up the replication.
On the target bucket you can set up a lifecycle policy to archive to Glacier (or you can just use the bucket as a backup as-is).
(This will only work between two regions, i.e. the buckets cannot be in the same region.)
If you want your data to be present in both primary and backup locations, then this is more of a data replication use case.
Consider using AWS Lambda, which is an event-driven compute service.
You can write a simple piece of code to copy the data wherever you want; it will execute every time there is a change in the S3 bucket.
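A minimal sketch of such a Lambda (the backup bucket name is a placeholder, and it assumes the function is wired to the source bucket's ObjectCreated event notifications):

import boto3

s3 = boto3.client("s3")
BACKUP_BUCKET = "my-backup-bucket"

def lambda_handler(event, context):
    # Copy each newly created object into the backup bucket under the same key.
    for record in event["Records"]:
        source_bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        s3.copy_object(
            Bucket=BACKUP_BUCKET,
            Key=key,
            CopySource={"Bucket": source_bucket, "Key": key},
        )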
For more info check the official documentation.