Approach to move a file from S3 to S3 Glacier

I need to create a Python Flask application that moves a file from S3 storage to S3 Glacier. I cannot use a lifecycle policy to do this because I need to use Glacier Vault Lock, which isn't possible with the lifecycle method since I wouldn't be able to use any Glacier features on those files. The files will be multiple GBs in size, so I need to download these files and then upload them to Glacier. I was thinking of adding a script on EC2 that will be triggered by Flask and will start downloading the files and uploading them to Glacier.
This is the only solution I have come up with, and it doesn't seem very efficient, but I'm not sure. I am pretty new to AWS, so any tips or thoughts would be appreciated.
I'm not posting any code because I don't really have a problem with the coding, just with the approach I should take.

It appears that your requirement is to use Glacier Vault Lock on some objects to guarantee that they cannot be deleted within a certain timeframe.
Fortunately, a similar capability has recently been added to Amazon S3, called Amazon S3 Object Lock. It works at the object or bucket level.
Therefore, you could simply use Object Lock instead of moving the objects to Glacier.
If the objects will be infrequently accessed, you might also want to change the storage class to something cheaper before locking them.
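For illustration, here is a minimal boto3 sketch of that approach; the bucket name, key, and retention date are hypothetical, and note that Object Lock can only be enabled when the bucket is created:

```python
import datetime
import boto3

s3 = boto3.client("s3")

# Object Lock can only be turned on at bucket creation time.
s3.create_bucket(
    Bucket="my-locked-backups",  # hypothetical bucket name
    ObjectLockEnabledForBucket=True,
)

# Upload with a retention period and a cheaper storage class in one call.
with open("backup.dump", "rb") as f:
    s3.put_object(
        Bucket="my-locked-backups",
        Key="backups/backup.dump",
        Body=f,
        StorageClass="GLACIER",
        ObjectLockMode="COMPLIANCE",  # compliance-mode retention cannot be shortened or removed
        ObjectLockRetainUntilDate=datetime.datetime(
            2026, 1, 1, tzinfo=datetime.timezone.utc
        ),
    )
```

This avoids the download/re-upload step entirely: the objects stay in S3, locked, with Glacier-class storage pricing.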
See: Introduction to Amazon S3 Object Lock - Amazon Simple Storage Service

Related

Uploading files to Glacier using AWS S3 v/s S3 Glacier upload

The standard S3 console supports uploading files and changing the storage class, but for S3 Glacier we need to create a vault, and console support is not provided. Let's say I selected the S3 Glacier storage class in a standard S3 upload: how is that different from Glacier? Will it internally create a vault? Is there any price variation?
Uploading to Glacier via Amazon S3 storage classes looks simpler and easier.
There are two different types of Glacier.
The 'original' Amazon Glacier uses vaults and jobs. Quite frankly, it is awful to use. It's bearable if you are using a software package that knows how to use Glacier, but it is not a pleasant experience. For example, even just listing the contents of a vault requires waiting for a job to run, and then results need to be retrieved.
Using Glacier as a Storage Class in Amazon S3 is a much more pleasant way to use Glacier. You can use all standard S3 commands and utilities and it gives immediate feedback when you list objects. The only thing that takes time is retrieving an object that is in a Glacier storage class.
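To make the contrast concrete, here is a hedged boto3 sketch; the vault and bucket names are hypothetical:

```python
import boto3

# The 'original' Glacier service: even listing a vault's contents
# is an asynchronous job that can take hours to complete.
glacier = boto3.client("glacier")
job = glacier.initiate_job(
    accountId="-",  # '-' means the account that owns the credentials
    vaultName="my-vault",
    jobParameters={"Type": "inventory-retrieval"},
)
# ...come back later and collect the results:
# glacier.get_job_output(accountId="-", vaultName="my-vault", jobId=job["jobId"])

# The Glacier storage class in S3: listing is a normal, immediate S3 call.
s3 = boto3.client("s3")
for obj in s3.list_objects_v2(Bucket="my-bucket").get("Contents", []):
    print(obj["Key"], obj["StorageClass"])
```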
Plus, the Glacier and Glacier Deep Archive storage classes are cheaper than the original Glacier service itself! I'd like to prove this, but the pricing page for Glacier now redirects to S3 pricing, so it's no longer possible to see what the old service costs.
Bottom line: Use S3 storage classes, not the old 'Glacier' service that uses Vaults.

Amazon S3 Glacier vs Glacier Storage Class

This question might look like an easy beginner question, but I'm really confused about the difference between the two of them.
Why do I need to use Amazon S3 Glacier if I can use a normal S3 bucket and just change the storage class of the objects inside to Glacier, either manually or by using a lifecycle rule?
Thanks in advance,
In the old days, Amazon Glacier was only available as a separate product. Frankly, the Glacier service is a pain to use.
Every request has to be submitted as a Job, which takes a long time to return. Even obtaining a list of archives is slow, let alone restoring a file from the archive.
The best way to use the Amazon Glacier service is with a third-party tool (eg Cloudberry Backup) that knows how to interface with Glacier, isolating you from having to use it directly.
Then, in 2012, the Amazon S3 team introduced a new Glacier storage class where S3 would move the data to Glacier, but still present the objects as being "in S3". (Well, the objects appear in S3 and their metadata is accessible, but the contents of the objects are stored in Glacier.) Then, in 2019, the new Glacier Deep Archive storage class offered even lower prices than were available through Amazon Glacier itself.
Therefore, it is now both easier and lower cost to use Glacier via Amazon S3 storage classes.
Amazon Glacier still remains available for use, and has been renamed Amazon S3 Glacier, which further confuses things. There might be some use-cases where it is preferable (eg acting like traditional tape backups for AWS Storage Gateway Tape Gateways), but Glacier Deep Archive in S3 would be the lowest-cost option.
These days most people will just use the S3 Glacier storage class, because the S3 API is much more convenient to work with than the Glacier API.
However, Amazon S3 Glacier offers some extra functionality not available in regular S3. Most notably, this is Vault Lock Policies, which allow fine-grained control over locking your vaults of archives for regulatory purposes.
S3 offers Object Lock, which performs a similar function, but it is not as versatile as vault lock policies. For example, S3 Object Lock can only be enabled at bucket creation, and legal holds apply only to individual versions of objects. In contrast, vault lock policies, as the name suggests, are policy documents written in JSON, which don't have such limitations.
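As an illustration, here is a hedged boto3 sketch of locking a vault, modeled on the deny-deletes-younger-than-365-days example from the AWS documentation; the vault name and account ID are placeholders:

```python
import json
import boto3

glacier = boto3.client("glacier")

# Policy denying deletion of any archive younger than 365 days --
# the kind of fine-grained regulatory control described above.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "deny-early-delete",
        "Principal": "*",
        "Effect": "Deny",
        "Action": "glacier:DeleteArchive",
        "Resource": "arn:aws:glacier:us-east-1:123456789012:vaults/my-vault",
        "Condition": {
            "NumericLessThan": {"glacier:ArchiveAgeInDays": "365"}
        },
    }],
}

# Locking is a two-step process: initiating returns a lock ID, and you
# have 24 hours to test the policy before completing the lock.
resp = glacier.initiate_vault_lock(
    accountId="-",
    vaultName="my-vault",
    policy={"Policy": json.dumps(policy)},
)
glacier.complete_vault_lock(
    accountId="-",
    vaultName="my-vault",
    lockId=resp["lockId"],
)
```

Once completed, the lock is immutable; aborting is only possible during the 24-hour in-progress window.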

Checking the integrity of an archive uploaded to AWS Glacier

We have daily database backups created and stored on a server. In order to free up space, it was decided that all the backups older than 30 days should be archived using AWS Glacier.
So far so good, I managed to write a PowerShell script to select the required files and upload them to Glacier, but since I am new to all the AWS stuff, I have one question: is it possible to check that the files I have uploaded are indeed in the archive and that there has been no information loss?
My first approach was to send retrieval job requests for all the files that we have uploaded and, 4 hours later, compare the checksums and archive IDs of our original files with the ones we retrieved from Glacier. However, this process takes a long time, costs extra money, and, most importantly, makes no sense at all.
I have also found that I can use inventory retrieval, but as far as I can tell this approach would be very similar to the one described above, just without downloading all the files again.
Lastly, is there even a point in trying to ensure that a file upload was successful if there are no errors? My vague understanding is that AWS would return error messages should an upload to Glacier fail, and that it computes checksums internally during uploads.
I know that StackOverflow has seen more precisely worded questions, but any clarification regarding this would be immensely appreciated.
You have to try pretty hard to upload a corrupt file to Glacier, because Glacier requires checksums sent with each API request, and will reject the uploads if they don't match the hashes. Obviously you need to spot check your archives, but each one does not need to be downloaded and verified because of the built-in protections.
See Computing Checksums in the Amazon S3 Glacier Developer Guide for descriptions of how this works, on the wire.
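For reference, here is a minimal Python sketch of the SHA-256 tree hash that guide describes: hash each 1 MiB chunk, then combine digests pairwise up the tree. Treat it as an illustration rather than a replacement for the SDK's built-in implementation:

```python
import hashlib

def tree_hash(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Compute the SHA-256 tree hash Glacier uses to verify uploads."""
    # Level 0: hash every 1 MiB chunk of the file.
    hashes = []
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            hashes.append(hashlib.sha256(chunk).digest())
    if not hashes:  # empty file
        return hashlib.sha256(b"").hexdigest()
    # Combine pairs of digests until a single root hash remains;
    # an odd digest at the end of a level is promoted unchanged.
    while len(hashes) > 1:
        paired = []
        for i in range(0, len(hashes), 2):
            if i + 1 < len(hashes):
                paired.append(hashlib.sha256(hashes[i] + hashes[i + 1]).digest())
            else:
                paired.append(hashes[i])
        hashes = paired
    return hashes[0].hex()
```

Comparing this value against the checksum Glacier returns on upload gives you end-to-end verification without retrieving anything.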
Then, consider not using Glacier at all... not directly, anyway. Use S3, and upload your files using the GLACIER or DEEP_ARCHIVE storage class. Or upload them as Standard, with a lifecycle policy that moves them into one of the archive storage classes after 1 day. (This is useful because if you delete Glacier or Deep Archive uploads before the minimum storage time, you're billed for the entire minimum time; this way you have a 24-hour "oops, I don't like the way I set this up" window, since Standard storage has no minimum storage time period.)
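A hedged boto3 sketch of that lifecycle rule; the bucket name and prefix are hypothetical:

```python
import boto3

s3 = boto3.client("s3")

# Keep new uploads in Standard for one day (the 24-hour "undo" window),
# then transition them to the cheapest archive storage class.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-backup-bucket",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "archive-after-one-day",
            "Status": "Enabled",
            "Filter": {"Prefix": "backups/"},
            "Transitions": [{
                "Days": 1,
                "StorageClass": "DEEP_ARCHIVE",
            }],
        }],
    },
)
```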
Using S3 is a far better solution, because S3 has a much better API and console, but the pricing is identical, because S3 is actually using Glacier as its backend, while you have the advantage of S3 as the frontend. Glacier has essentially no console functionality, is very opaque, and is not really designed for human interaction -- Glacier appears to have been designed as a backing store for an archiving system or service, which is exactly how S3 uses Glacier.
Amazon Simple Storage Service (Amazon S3) supports lifecycle configuration on an S3 bucket, which enables you to transition objects to the Amazon S3 GLACIER storage class for archival. When you transition Amazon S3 objects to the GLACIER storage class, Amazon S3 internally uses Glacier for durable storage at lower cost. Although the objects are stored in Glacier, they remain Amazon S3 objects that you manage in Amazon S3, and you cannot access them directly through Glacier.
https://docs.aws.amazon.com/amazonglacier/latest/dev/introduction.html
It is unfortunate that AWS recently muddied this issue by dumbing things down, rebranding "Glacier" as "S3 Glacier" as if they were the same thing, when they are in fact two very different services, one of which operates in a mode that gives you a gateway to the other. It's similarly unfortunate how Glacier has traditionally been marketed. Without S3 in front, Glacier is not well suited to very many applications.

Does moving files to Glacier lead to new uploads using sync in S3?

Currently I am backing up data to AWS S3 using the CLI sync command. I'd like to make use of AWS Glacier. The idea is to back up data to S3 and move this data to Glacier based on lifecycle rules. However, I am wondering if an S3 sync still works after files have been moved to Glacier by these rules, or if those files would be uploaded to S3 again. Any information or experiences on this?
Thanks in advance
Yes, it will work fine.
The sync process works on filesize and modified date. These remain the same regardless of storage class.
You can now upload directly to the Glacier storage class without needing a lifecycle policy, which is ideal for backup-type files that you know you will rarely access. Just use: --storage-class GLACIER
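For example, a nightly backup sync might look like this (the local path and bucket name are hypothetical):

```
aws s3 sync /backups s3://my-backup-bucket/backups --storage-class GLACIER
```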

Using AWS Glacier as back-up

I have a website where I serve content that is stored on an AWS S3 bucket. As the amount of content grows, I have started thinking about back-up options. Using AWS Glacier came up as a natural route.
After reading up on it, I didn't understand whether it does what I intend to do with it. From what I have understood, using Glacier, you set lifecycle policies on objects stored in your S3 buckets. According to these policies, objects will be transferred to Glacier and deleted from your S3 bucket at a specific point in time after they have been uploaded to S3. At this point, the object's storage class changes to 'GLACIER'. Amazon explains that, once this is done, you can no longer access the objects through S3, but "their index entry will remain as is". At the same time, they say that retrieval of objects from Glacier takes 3-5 hours.
My question is: does this mean that, once objects are transferred to Glacier, I will not be able to serve them on my website without retrieving them first? Or does it mean that they will still be served from the S3 bucket as usual, but that, in case something happens to the files on S3, I will be able to retrieve them in 3-5 hours? Glacier would only be a viable backup solution for me if users of my website could still load content after the corresponding objects are transferred to Glacier. Also, is it possible to have objects transferred to Glacier without them being deleted from the S3 bucket?
Thank you
To answer your question: does this mean that, once objects are transferred to Glacier, I will not be able to serve them on my website without retrieving them first?
No, you won't be able to serve them on your website unless you first restore them from Glacier back to the STANDARD or STANDARD_IA class, which takes 3-5 hours. Glacier is generally used to archive cold data, like old logs, that is accessed only in rare circumstances. So if you need real-time access to the objects, Glacier isn't a valid option for you.
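For completeness, a hedged boto3 sketch of such a restore; the bucket and key are hypothetical, and the restore produces a temporary copy while the object's storage class remains GLACIER:

```python
import boto3

s3 = boto3.client("s3")

# Ask S3 to make a temporary copy of a Glacier-class object available
# for 7 days, using the standard retrieval tier (typically 3-5 hours).
s3.restore_object(
    Bucket="my-site-assets",
    Key="images/photo.jpg",
    RestoreRequest={
        "Days": 7,
        "GlacierJobParameters": {"Tier": "Standard"},
    },
)

# Poll until the restore completes: the Restore header changes from
# 'ongoing-request="true"' to include an expiry date when it is done.
head = s3.head_object(Bucket="my-site-assets", Key="images/photo.jpg")
print(head.get("Restore"))
```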