I created external tables on underlying data in S3. I pointed table1 to S3 Standard storage and table2 to Glacier storage. Table1 reads data, but table2 does not.
Can anyone explain why?
S3 objects in the Glacier storage class are not accessible in the same way as normal objects. They need to be retrieved from Glacier before they can be read, which requires a special API call and also costs money.
Athena is reading objects from S3 just like you would with the S3 API, which means reading objects with the Glacier storage class doesn't work.
It would also not make any sense for Athena to even try reading Glacier objects since the retrieval time is longer than the maximum query duration for an Athena query.
Update: in the release notes for February 18, 2019 it says that Athena now ignores objects transitioned to Glacier, rather than failing the query. The change was likely in effect earlier since releases are often made in different regions at different times and release notes only get updated once it's fully deployed.
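To see ahead of time which objects under a table location Athena would skip, you can check the storage class of each object. The following is only a minimal sketch: it assumes object summaries shaped like the `Contents` entries returned by boto3's `list_objects_v2` (dicts with `Key` and `StorageClass`), and the keys in the sample are hypothetical.

```python
from typing import Iterable, List, Mapping, Tuple

# Assumption: these are the storage classes that need a restore before they can be read.
ARCHIVED_CLASSES = {"GLACIER", "DEEP_ARCHIVE"}


def split_by_readability(objects: Iterable[Mapping[str, str]]) -> Tuple[List[str], List[str]]:
    """Partition object summaries into (readable_keys, archived_keys)."""
    readable, archived = [], []
    for obj in objects:
        # Fall back to STANDARD if the field is missing from the summary.
        storage_class = obj.get("StorageClass", "STANDARD")
        if storage_class in ARCHIVED_CLASSES:
            archived.append(obj["Key"])
        else:
            readable.append(obj["Key"])
    return readable, archived


# Hypothetical listing for the two tables from the question.
sample = [
    {"Key": "table1/part-0000.parquet", "StorageClass": "STANDARD"},
    {"Key": "table2/part-0000.parquet", "StorageClass": "GLACIER"},
    {"Key": "table2/part-0001.parquet", "StorageClass": "DEEP_ARCHIVE"},
]
readable, archived = split_by_readability(sample)
print("readable:", readable)
print("archived:", archived)
```

In a real bucket the summaries would come from paginating the listing for the table's prefix; anything that ends up in the archived list would have to be restored or copied back before Athena can query it.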
Related
In our organization, we are facing cost issues because our S3 buckets are overloaded. Too many junk files and archives are stored, which is causing this issue.
I recently got approval to work on a lifecycle policy in AWS S3. Before I start working on this, I need to clarify one thing: our Athena databases have their storage in one of the S3 buckets.
If we change the storage class, will that impact the Athena database queries?
That depends on the storage class.
If you archive data into Glacier, Athena won't be able to read it and will just ignore it. Athena can still read the other storage classes, e.g. the infrequent access ones, but those classes charge for retrieval, so your costs will increase if the data is read at least once per month.
We have daily database backups created and stored on a server. In order to free up space, it was decided that all the backups older than 30 days should be archived using AWS Glacier.
So far so good, I managed to write a PowerShell script to select the required files and upload them to Glacier, but since I am new to all the AWS stuff, I have one question: is it possible to check that the files I have uploaded are indeed in the archive and that there has been no information loss?
My first approach was to send job retrieval requests for all the files that we have uploaded, and 4 hours later compare the checksums and archive ids of our original files and the ones we retrieved from Glacier. However, I think this process takes too long, costs extra money, and most importantly, makes no sense at all.
I have also found that I can use inventory retrieval, but as far as I can tell this approach would be very similar to the one described above, just without downloading all the files again.
Lastly, is there even a point to trying to ensure that a file upload was successful if there are no errors? My vague understanding is that AWS would come back with error messages should an upload to Glacier fail, and it computes checksums internally during uploads.
I know that StackOverflow has seen more precisely worded questions, but any clarification regarding this would be immensely appreciated.
You have to try pretty hard to upload a corrupt file to Glacier, because Glacier requires checksums sent with each API request, and will reject the uploads if they don't match the hashes. Obviously you need to spot check your archives, but each one does not need to be downloaded and verified because of the built-in protections.
See Computing Checksums in the Amazon S3 Glacier Developer Guide for descriptions of how this works, on the wire.
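For reference, the tree hash described in that guide can be sketched in a few lines of Python. This is an illustrative sketch, not AWS's own code: it hashes each 1 MiB chunk with SHA-256, then hashes adjacent pairs of digests together until one digest remains. The function name `tree_hash` is made up for this example.

```python
import hashlib

CHUNK_SIZE = 1024 * 1024  # tree hashes are built over 1 MiB chunks


def tree_hash(data: bytes) -> str:
    """Return the hex SHA-256 tree hash of `data`."""
    # Hash every 1 MiB chunk; an empty payload is treated as one empty chunk.
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)] or [b""]
    level = [hashlib.sha256(chunk).digest() for chunk in chunks]
    # Combine adjacent digests pairwise until a single root digest is left.
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), 2):
            if i + 1 < len(level):
                next_level.append(hashlib.sha256(level[i] + level[i + 1]).digest())
            else:
                # An unpaired digest is carried up to the next level unchanged.
                next_level.append(level[i])
        level = next_level
    return level[0].hex()


print(tree_hash(b"nightly database backup"))
```

For a payload of 1 MiB or less, the tree hash is simply the plain SHA-256 of the data. Glacier recomputes the same value on its side and rejects the upload if the two differ, which is why a corrupt upload is hard to produce by accident.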
Then, consider not using Glacier at all... not directly, anyway. Use S3, and upload your files using the GLACIER or DEEP_ARCHIVE storage class. Or upload them as Standard, with a lifecycle policy that moves them into one of the archive storage classes after 1 day. (Useful because if you delete Glacier or Deep Archive uploads before the minimum storage time, you're billed for the entire minimum time... this way you have a 24 hour "oops I don't like the way I set this up" window, since Standard storage has no minimum storage time period).
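The early-deletion point can be illustrated with a small calculation. The sketch below assumes minimum storage durations of 90 days for GLACIER and 180 days for DEEP_ARCHIVE (check the current S3 pricing page before relying on these numbers); the helper name `billable_days` is invented for this example.

```python
# Assumed minimum storage durations in days; Standard has none.
MINIMUM_STORAGE_DAYS = {"STANDARD": 0, "GLACIER": 90, "DEEP_ARCHIVE": 180}


def billable_days(storage_class: str, days_stored: int) -> int:
    """Days of storage you are billed for if the object is deleted after `days_stored`."""
    return max(days_stored, MINIMUM_STORAGE_DAYS[storage_class])


# Deleting a misconfigured upload after 2 days:
print(billable_days("STANDARD", 2))      # billed for the 2 days actually stored
print(billable_days("GLACIER", 2))       # billed for the full minimum period
print(billable_days("DEEP_ARCHIVE", 2))  # billed for the full minimum period
```

This is why uploading as Standard and letting a lifecycle rule do the transition a day later gives you a cheap window to undo mistakes.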
Using S3 is a far better solution, because S3 has a much better API and console, but the pricing is identical, because S3 is actually using Glacier as its backend, while you have the advantage of S3 as the frontend. Glacier has essentially no console functionality, is very opaque, and is not really designed for human interaction -- Glacier appears to have been designed as a backing store for an archiving system or service, which is exactly how S3 uses Glacier.
Amazon Simple Storage Service (Amazon S3) supports lifecycle configuration on an S3 bucket, which enables you to transition objects to the Amazon S3 GLACIER storage class for archival. When you transition Amazon S3 objects to the GLACIER storage class, Amazon S3 internally uses Glacier for durable storage at lower cost. Although the objects are stored in Glacier, they remain Amazon S3 objects that you manage in Amazon S3, and you cannot access them directly through Glacier.
https://docs.aws.amazon.com/amazonglacier/latest/dev/introduction.html
It is unfortunate that AWS recently muddied this issue by rebranding "Glacier" as "S3 Glacier," as if they were the same thing, when they are two very different services, one of which operates in a mode that gives you a gateway to the other. It's similarly unfortunate how Glacier has traditionally been marketed. Without S3 in front, Glacier is not well suited for very many applications.
I have a website where I serve content that is stored on an AWS S3 bucket. As the amount of content grows, I have started thinking about back-up options. Using AWS Glacier came up as a natural route.
After reading up on it, I didn't understand whether it does what I intend to do with it. From what I have understood, using Glacier, you set lifecycle policies on objects stored on your S3 buckets. According to these policies, objects will be transferred to Glacier and deleted from your S3 bucket at a specific point in time after they have been uploaded to S3. At this point, the object's storage class changes to 'GLACIER'. Amazon explains that, once this is done, you can no longer access the objects through S3 but "their index entry will remain as is". Simultaneously, they say that retrieval of objects from Glacier takes 3-5 hours.
My question is: Does this mean that, once objects are transferred to Glacier, I will not be able to serve them on my website without retrieving them first? Or does it mean that they will still be served from the S3 bucket as usual but that, in case something happens with the files on S3, I will just be able to retrieve them in 3-5 hours? Glacier would only be a viable backup solution for me if users of my website would still be able to load content on the website after the corresponding objects are transferred to Glacier. Also, is it possible to have objects transferred to Glacier without them being deleted from the S3 bucket?
Thank you
To answer your question: Does this mean that, once objects are transferred to Glacier, I will not be able to serve them on my website without retrieving them first?
No, you won't be able to serve them on your website unless you first restore them from Glacier (or move them back to the Standard or Standard-IA class), which takes 3-5 hours. Glacier is generally used to archive cold data, such as old logs, that is accessed only rarely. So if you need real-time access to the object, Glacier isn't a valid option for you.
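If you do restore archived objects, you need some way to tell when they are readable again. As a rough sketch: S3 reports restore progress in the `x-amz-restore` response header (surfaced as the `Restore` field of boto3's `head_object` response), in the form `ongoing-request="true"` while the restore is running and `ongoing-request="false", expiry-date="..."` once a temporary copy is available. The helper name and its return values below are made up for illustration.

```python
import re
from typing import Optional


def restore_state(restore_header: Optional[str]) -> str:
    """Classify the value of the x-amz-restore header of an archived object."""
    if restore_header is None:
        # No header at all: no restore has been requested for this object.
        return "not-requested"
    match = re.search(r'ongoing-request="(true|false)"', restore_header)
    if match is None:
        return "unknown"
    return "in-progress" if match.group(1) == "true" else "restored"


print(restore_state(None))
print(restore_state('ongoing-request="true"'))
print(restore_state('ongoing-request="false", expiry-date="Fri, 21 Dec 2012 00:00:00 GMT"'))
```

A website could only serve the object in the "restored" state, and only until the expiry date, which is another reason Glacier is a poor fit for content that users load directly.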
I am working on an app which uses S3 to store important documents. These documents need to be backed up on a daily, weekly rotation basis much like how database backups are maintained.
Does S3 support a feature where a bucket can be backed up into multiple buckets periodically, or perhaps into Amazon Glacier? I want to avoid using an external service as much as possible, and was hoping S3 had some mechanism to do this, as it's a common use case.
Any help would be appreciated.
Quote from Amazon S3 FAQ about durability:
Amazon S3 is designed to provide 99.999999999% durability of objects over a given year. This durability level corresponds to an average annual expected loss of 0.000000001% of objects. For example, if you store 10,000 objects with Amazon S3, you can on average expect to incur a loss of a single object once every 10,000,000 years
First of all, these numbers are almost unbeatable. In other words, your data is safe in Amazon S3.
Thus, the only reason why you would need to backup your data objects is to prevent their accidental loss (by your own mistake). To solve this problem Amazon S3 enables versioning of S3 objects. Enable this feature on your S3 bucket and you're safe.
P.S. Actually, there is one more possible reason: cost optimization. Amazon Glacier is cheaper than S3. I would recommend using AWS Data Pipeline to move S3 data to Glacier routinely.
Regarding Glacier, you can configure your bucket to archive (old) S3 data to Glacier once it is older than a specified duration. This can save you money if you want infrequently accessed data to be archived.
An S3 bucket has lifecycle rules, which we can use to automatically move data from S3 to Glacier.
But if you want to access these important documents frequently from the backup, then you can also use another S3 bucket to back up your data. This backup can be scheduled daily, weekly, etc. using AWS Data Pipeline.
*Glacier is cheaper than S3 because it trades retrieval time for lower storage cost.
I created a Windows application that will allow you to schedule S3 bucket backups. You can create three kinds of backups: Cumulative, Synchronized and Snapshots. You can also include or exclude root level folders and files from your backups. You can try it free with no registration at https://www.bucketbacker.com
Is there a way to set an expiry date in Amazon Glacier? I want to copy in weekly backup files, but I don't want to hang on to more than one year's worth.
Can the files be set to "expire" after one year, or is this something I will have to do manually?
While this is not available natively within Amazon Glacier, AWS has recently enabled Archiving Amazon S3 Data to Amazon Glacier, which makes working with Glacier much easier in the first place:
[...] Amazon S3 was designed for rapid retrieval. Glacier, in
contrast, trades off retrieval time for cost, providing storage for as
little as $0.01 per Gigabyte per month while retrieving data within
three to five hours.
How would you like to have the best of both worlds? How about rapid
retrieval of fresh data stored in S3, with automatic, policy-driven
archiving to lower cost Glacier storage as your data ages, along with
easy, API-driven or console-powered retrieval? [emphasis mine]
[...] You can now use Amazon Glacier as a storage option for Amazon S3.
This is enabled by Amazon S3 Object Lifecycle Management, which not only drives the mentioned Object Archival (Transition Objects to the Glacier Storage Class) but also includes optional Object Expiration. That allows you to achieve what you want, as outlined in the section Before You Decide to Expire Objects within Lifecycle Configuration Rules:
The Expiration action deletes objects
You might have objects in Amazon S3 or archived to Amazon Glacier. No
matter where these objects are, Amazon S3 will delete them. You will
no longer be able to access these objects. [emphasis mine]
So at the small price of having your objects stored in S3 for a short time (which actually eases working with Glacier a lot due to removing the need to manage archives/inventories) you gain the benefit of optional automatic expiration.
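As an illustration of such a rule, here is a sketch that builds a lifecycle configuration which archives objects to Glacier and then expires them after a year. The dict follows the shape accepted by boto3's `put_bucket_lifecycle_configuration` (and by the CLI's `put-bucket-lifecycle-configuration`); the rule ID, the `weekly-backups/` prefix, and the helper name are assumptions made up for this example.

```python
import json


def backup_lifecycle(prefix: str, archive_after_days: int, expire_after_days: int) -> dict:
    """Build a lifecycle configuration that archives objects, then deletes them."""
    if expire_after_days <= archive_after_days:
        raise ValueError("expiration must come after the Glacier transition")
    return {
        "Rules": [
            {
                "ID": "archive-then-expire",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                # Move objects to the GLACIER storage class after N days...
                "Transitions": [{"Days": archive_after_days, "StorageClass": "GLACIER"}],
                # ...and delete them once they reach the expiration age.
                "Expiration": {"Days": expire_after_days},
            }
        ]
    }


config = backup_lifecycle("weekly-backups/", archive_after_days=1, expire_after_days=365)
print(json.dumps(config, indent=2))
```

The printed JSON could be saved to a file and applied to the bucket; day counts are measured from object creation, so each weekly backup would be removed one year after it was uploaded.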
You can do this in the AWS Command Line Interface.
http://docs.aws.amazon.com/AmazonS3/latest/dev/object-lifecycle-mgmt.html