Amazon Athena Log Analysis Services with S3 Glacier
We have petabytes of data in S3. We are https://www.pubnub.com/ and we store usage data from our network in S3 for billing purposes. The logs are tab-delimited files stored in an S3 bucket, and Athena is giving us a HIVE_CURSOR_ERROR failure.
Our S3 bucket is set up to automatically transition objects to AWS Glacier after 6 months. The bucket contains hot S3 files that are ready to read alongside the Glacier backup files, and we are getting access errors from Athena because of this. The file referenced in the error is a Glacier backup.
My guess is the answer will be: don't keep Glacier backups in the same bucket. That isn't an easy option for us given our data volumes. I suspect Athena will not work in this setup and we will not be able to use it for our log analysis.
However, if there is a way we can use Athena, we would be thrilled. Is there a solution to the HIVE_CURSOR_ERROR and a way to skip Glacier files? Our S3 bucket is a flat bucket without folders.
The S3 object name shown in the screenshots above and below has been omitted. The file referenced in the HIVE_CURSOR_ERROR is in fact the Glacier object; you can see it in the screenshot of our S3 bucket.
Note I tried to post on https://forums.aws.amazon.com/ but that was no bueno.
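For anyone reproducing this, a minimal boto3 sketch (the bucket name is a placeholder, not our real bucket) that confirms which objects in a flat bucket have transitioned to the GLACIER storage class:
import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

# Walk the flat bucket and report every object whose storage class is
# GLACIER; these are the objects Athena trips over.
for page in paginator.paginate(Bucket="example-usage-logs"):
    for obj in page.get("Contents", []):
        if obj.get("StorageClass") == "GLACIER":
            print(obj["Key"])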
The documentation from AWS dated May 16, 2017 states specifically that Athena does not support the GLACIER storage class:
Athena does not support different storage classes within the bucket specified by the LOCATION clause, does not support the GLACIER storage class, and does not support Requester Pays buckets. For more information, see Storage Classes, Changing the Storage Class of an Object in Amazon S3, and Requester Pays Buckets in the Amazon Simple Storage Service Developer Guide.
We are also interested in this; if you get it to work, please let us know how. :-)
Since the release of February 18, 2019, Athena ignores objects with the GLACIER storage class instead of failing the query:
[…] As a result of fixing this issue, Athena ignores objects transitioned to the GLACIER storage class. Athena does not support querying data from the GLACIER storage class.
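A hedged sketch of kicking off such a query with boto3, just to illustrate the behavior; the database, table, and results location below are placeholders, not names from the question:
import boto3

athena = boto3.client("athena")

# Start a query against a table whose LOCATION holds both hot and
# GLACIER-class objects; since the 2019-02-18 release the GLACIER
# objects are silently skipped instead of raising HIVE_CURSOR_ERROR.
resp = athena.start_query_execution(
    QueryString="SELECT count(*) FROM usage_logs",
    QueryExecutionContext={"Database": "logs_db"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
print(resp["QueryExecutionId"])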
You must have an S3 bucket to work with. In addition, the AWS account that you use to initiate an S3 Glacier Select job must have write permissions for that S3 bucket. The Amazon S3 bucket must be in the same AWS Region as the vault that contains the archive object being queried.
S3 Glacier Select runs the query and stores the results in an S3 bucket.
Bottom line: you must move the data into an S3 bucket to use the S3 Glacier Select statement, then use Athena on the 'new' S3 bucket.
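Purely as an illustration of that flow, a hedged boto3 sketch of a Glacier Select job; the vault name, archive ID, output bucket, and serialization options are all placeholders and would need to match your actual log format:
import boto3

glacier = boto3.client("glacier")

# Run a Glacier Select job against one archived log file; matching rows
# are written to the S3 bucket named in OutputLocation, where Athena can
# query them afterwards.
resp = glacier.initiate_job(
    accountId="-",
    vaultName="example-log-vault",
    jobParameters={
        "Type": "select",
        "ArchiveId": "EXAMPLE-ARCHIVE-ID",
        "Tier": "Standard",
        "SelectParameters": {
            "InputSerialization": {"csv": {"FieldDelimiter": "\t"}},
            "ExpressionType": "SQL",
            "Expression": "SELECT * FROM archive",
            "OutputSerialization": {"csv": {}},
        },
        "OutputLocation": {
            "S3": {"BucketName": "example-select-results", "Prefix": "athena/"}
        },
    },
)
print(resp["jobId"])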
Related
I was using the GCP data transfer service to transfer a set of files in an S3 bucket over to a GCP storage bucket. The files were in Glacier, so I first restored them and then tried copying them. The transfer job runs without any errors but ignores all of the Glacier-restored files. Is this the expected behavior? If it is, it seems like a huge oversight not to mention it in the documentation; I could easily imagine a scenario where you think you've mirrored a bucket when you really haven't.
I have created a lifecycle policy for one of my buckets as below:
Name and scope
  Name: MoveToGlacierAndDeleteAfterSixMonths
  Scope: Whole bucket
Transitions
  For previous versions of objects: Transition to Amazon Glacier after 1 day
Expiration
  Permanently delete after 360 days
  Clean up incomplete multipart uploads after 7 days
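For reference, roughly the same rule expressed through the API, as a hedged boto3 sketch; the bucket name is a placeholder, and it reads the "Permanently delete" setting as applying to previous versions of objects, since the bucket is versioned:
import boto3

s3 = boto3.client("s3")

# Previous versions move to Glacier after 1 day, are permanently deleted
# after 360 days, and incomplete multipart uploads are cleaned up after 7 days.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "MoveToGlacierAndDeleteAfterSixMonths",
                "Filter": {"Prefix": ""},
                "Status": "Enabled",
                "NoncurrentVersionTransitions": [
                    {"NoncurrentDays": 1, "StorageClass": "GLACIER"}
                ],
                "NoncurrentVersionExpiration": {"NoncurrentDays": 360},
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
            }
        ]
    },
)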
I would like to get answers to the following questions:
When will the data be deleted from S3 under this policy?
Do I have to do anything on the Glacier end in order to move my S3 bucket to Glacier?
My S3 bucket is 6 years old and all the versions in the bucket are even older, but I am not able to see any data in the Glacier console even though my transition policy is set to move data to Glacier 1 day after it is created. Please explain this behavior.
Does this policy affect only new files added to the bucket after the lifecycle policy was created, or does it affect all the files already in the S3 bucket?
Please answer these questions.
When will the data be deleted from S3 under this policy?
Never, for current versions. A lifecycle policy to transition objects to Glacier doesn't delete the data from S3 -- it migrates it out of S3 primary storage and over into Glacier storage -- but it technically remains an S3 object.
Think of it as S3 having its own Glacier account and storing data in that separate account on your behalf. You will not see these objects in the Glacier console -- they will remain in the S3 console, but if you examine an object that has transitioned, its storage class will have changed from whatever it was, e.g. STANDARD, to GLACIER.
Do I have to do anything on the Glacier end in order to move my S3 bucket to Glacier?
No, you don't. As mentioned above, it isn't "your" Glacier account that will store the objects. On your AWS bill, the charges will appear under S3, but labeled as Glacier, and the price will be the same as the published pricing for Glacier.
My S3 bucket is 6 years old and all the versions in the bucket are even older, but I am not able to see any data in the Glacier console even though my transition policy is set to move data to Glacier 1 day after it is created. Please explain this behavior.
Two parts: first, check the object storage class displayed in the console or with aws s3api list-objects --output=text and see whether some objects already show the GLACIER class. Second, the transition is a background process; it won't happen immediately, but you should see things changing within 24 to 48 hours of creating the policy. If you have logging enabled on your bucket, I believe the transition events will also be logged.
Does this policy affect only new files added to the bucket after the lifecycle policy was created, or does it affect all the files already in the S3 bucket?
This affects all objects in the bucket.
I would like to shut down my Redshift cluster but would like to keep a backup of it.
I understand I can create a manual snapshot and it will be saved to S3.
To reduce costs even more, I would like to move the snapshot from S3 to Glacier, but I can't find the snapshot in my S3 account.
Where is the snapshot being saved? Is AWS keeping it in a different account?
Or maybe I am not going about this the right way; should I be backing up my Redshift cluster differently?
Thank you,
Oren.
It's not stored in any of your account's S3 buckets. It's being stored "behind the scenes" in S3. Amazon only makes a point of telling you it is stored in S3 so you understand the fault-tolerance involved in the storage of your snapshots. If you need to store a backup in one of your S3 buckets you would need to do a pg_dump of the database and copy the dump file to S3.
You can use Redshift's UNLOAD to dump tables straight to an S3 bucket. Unfortunately you need to do it separately for each table. You'll also want to archive all the schema queries, CREATE, etc. for your tables (the pg_dump solution doesn't have this problem since it can capture the table definitions, but requires local storage of the files and manual push to S3...might be worth it though for a case like archive and complete shutdown).
UNLOAD('select * from your_table') TO 's3://your_bucket/your_table.csv'
WITH CREDENTIALS 'aws_access_key_id=YOUR_KEY;aws_secret_access_key=YOUR_SECRET'
DELIMITER ',' NULL 'null' ALLOWOVERWRITE;
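Since UNLOAD has to be issued per table, here is a hedged sketch of scripting that step in Python with psycopg2; the connection details and schema are placeholders, and the credentials string mirrors the example above:
import psycopg2

# Placeholder connection details for the cluster being archived.
conn = psycopg2.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="mydb",
    user="admin",
    password="...",
)
conn.autocommit = True
cur = conn.cursor()

# Find every table in the schema, then issue one UNLOAD per table.
cur.execute("SELECT tablename FROM pg_tables WHERE schemaname = 'public'")
for (table,) in cur.fetchall():
    cur.execute(
        f"UNLOAD ('SELECT * FROM public.{table}') "
        f"TO 's3://your_bucket/{table}.csv' "
        "WITH CREDENTIALS 'aws_access_key_id=YOUR_KEY;aws_secret_access_key=YOUR_SECRET' "
        "DELIMITER ',' NULL 'null' ALLOWOVERWRITE"
    )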
Once you have all your tables in an S3 bucket you can set the lifecycle (rules created in the Properties pane of the bucket on the Lifecycle panel) to archive to glacier storage class.
It's a bit confusing because Glacier is its own service, but when you archive via lifecycle the files stay in the S3 bucket. You can tell they're in Glacier by selecting a file in the S3 console, opening the properties pane, and expanding the Details panel; there it should say Storage class: Glacier.
If you ever need to restore you'll use the COPY command:
COPY your_table FROM 's3://your_bucket/your_table.csv'
CREDENTIALS 'aws_access_key_id=[YOURKEY];aws_secret_access_key=[YOURSECRET]'
DELIMITER ',' NULL 'null' IGNOREBLANKLINES EMPTYASNULL
BLANKSASNULL TIMEFORMAT 'auto' FILLRECORD MAXERROR 1
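One caveat, hedged: if the unloaded files have already transitioned to the GLACIER storage class, COPY can't read them until they are restored to S3. A minimal boto3 sketch (bucket and key are placeholders):
import boto3

s3 = boto3.client("s3")

# Ask S3 to restore a GLACIER-class object for 7 days so COPY can read it.
s3.restore_object(
    Bucket="your_bucket",
    Key="your_table.csv",
    RestoreRequest={"Days": 7, "GlacierJobParameters": {"Tier": "Standard"}},
)

# The Restore field of head_object reports ongoing-request="false" once
# the temporary copy is available.
print(s3.head_object(Bucket="your_bucket", Key="your_table.csv").get("Restore"))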
I want to do a daily backup of my S3 buckets. I was wondering if anyone knew what the best practice is?
I was thinking of using a Lambda function to copy contents from one S3 bucket to another as the bucket is updated, but that won't mitigate against an S3 failure. How do I copy contents from one S3 bucket to another Amazon service like Glacier using Lambda? What's the best practice here for backing up S3 buckets?
NOTE: I want to do a backup, not an archive (where content is deleted afterwards).
Look into S3 cross-region replication to keep a backup copy of everything in another S3 bucket in another region. Note that you can even have the destination bucket be in a different AWS Account, so that it is safe even if your primary S3 account is hacked.
Note that a combination of Cross Region Replication and S3 Object Versioning (which is required for replication) will allow you to keep old versions of your files available even if they are deleted from the source bucket.
Then look into S3 lifecycle management to transition objects to Glacier to save storage costs.
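As a rough sketch of wiring that up with boto3 (the bucket names and IAM role ARN are placeholders):
import boto3

s3 = boto3.client("s3")

# Versioning must be enabled on both the source and destination buckets
# before replication can be configured.
s3.put_bucket_versioning(
    Bucket="example-source-bucket",
    VersioningConfiguration={"Status": "Enabled"},
)

# Replicate everything to a backup bucket in another region (optionally in
# another account).
s3.put_bucket_replication(
    Bucket="example-source-bucket",
    ReplicationConfiguration={
        "Role": "arn:aws:iam::123456789012:role/example-replication-role",
        "Rules": [
            {
                "ID": "backup-everything",
                "Prefix": "",
                "Status": "Enabled",
                "Destination": {"Bucket": "arn:aws:s3:::example-backup-bucket"},
            }
        ],
    },
)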
I want to keep a backup of an AWS S3 bucket. If I use Glacier, it will archive the files from the bucket and move them to Glacier, but it will also delete the files from S3, and I don't want to delete the files from S3. One option is to use an EBS volume: you can mount the AWS S3 bucket with s3fs and copy it to the EBS volume. Another way is to do an rsync of the existing bucket to a new bucket that will act as a clone. Is there any other way?
What you are looking for is cross-region replication:
https://aws.amazon.com/blogs/aws/new-cross-region-replication-for-amazon-s3/
Set up versioning and set up the replication.
On the target bucket you could set up a policy to archive to Glacier (or you could just use the bucket as a backup as-is).
(This only works between two regions, i.e. the buckets cannot be in the same region.)
If you want your data to be present in both primary and backup locations then this is more of a data replication use case.
Consider using AWS Lambda, which is an event-driven compute service.
You can write a simple piece of code to copy the data wherever you want; it will execute every time there is a change in the S3 bucket.
For more info check the official documentation.
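To illustrate, a hedged sketch of an S3-triggered Lambda handler that copies each newly created object into a backup bucket (the backup bucket name is a placeholder):
import urllib.parse

import boto3

s3 = boto3.client("s3")
BACKUP_BUCKET = "example-backup-bucket"  # placeholder destination

def handler(event, context):
    # Triggered by an S3 "object created" event; copy each new object
    # into the backup bucket under the same key.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        s3.copy_object(
            Bucket=BACKUP_BUCKET,
            Key=key,
            CopySource={"Bucket": bucket, "Key": key},
        )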