I uploaded 365 files (one file per day) to an S3 bucket all in one go, so all the files now have the same upload date. I want to move the files that are more than 6 months old to S3 Glacier. An S3 lifecycle policy will only take effect after 6 months, because every file has the same upload date in S3. The actual upload date of each file is stored in a DynamoDB table along with its S3KeyUrl.
I want to know the best way to move the files to S3 Glacier. I came up with the following approach:
Create an S3 lifecycle policy to move files to S3 Glacier, which will start working after 6 months.
Create an app that queries the DynamoDB table for the files that are more than 6 months old, downloads each file from S3 (ArchiveTransferManager only uploads files from a local directory), and uses ArchiveTransferManager (Amazon.Glacier.Transfer) to upload the file to an S3 Glacier vault.
In the production scenario there will be some 10 million files, so the solution should be reliable.
There are two versions of Glacier:
The 'original' Amazon Glacier, which uses Vaults and Archives
The Amazon S3 Storage Classes of Glacier and Glacier Deep Archive
Trust me... You do not want to use the 'original' Glacier. It is slow and difficult to use. So, avoid anything that mentions Vaults and Archives.
Instead, you simply want to change the Storage Class of the objects in Amazon S3.
Normally, the easiest way to do this is to "Edit storage class" in the S3 management console. However, you mention millions of objects, so this wouldn't be feasible.
Instead, you will need to copy objects over themselves, while changing the storage class. This can be done with the AWS CLI:
aws s3 cp s3://<bucket-name>/ s3://<bucket-name>/ --recursive --storage-class <storage_class>
Note that this would change the storage class for all objects in the given bucket/path. Since you only wish to selectively change the storage class, you would either need to issue lots of the above commands (each for only one object), or you could use an AWS SDK to script the process. For example, you could write a Python program that loops through the list of objects, checks DynamoDB to determine whether the object is '6 months old' and then copies it over itself with the new Storage Class.
See: StackOverflow: How to change storage class of existing key via boto3
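The Python loop described above might look roughly like the following. This is only a sketch under some assumptions: the (key, upload date) pairs have already been read from the DynamoDB table (the shape of `records` is hypothetical), "6 months" is approximated as 183 days, and `s3_client` is a boto3 S3 client. The function names are made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def is_older_than(upload_date, days, now=None):
    """Return True if upload_date is more than `days` days before `now`."""
    now = now or datetime.now(timezone.utc)
    return now - upload_date > timedelta(days=days)


def archive_old_objects(s3_client, bucket, records, days=183,
                        storage_class="GLACIER", now=None):
    """Copy each sufficiently old object over itself with a new storage class.

    `records` is an iterable of (key, upload_date) pairs, for example
    built from the DynamoDB table (hypothetical shape; adapt to your schema).
    Returns the list of keys whose storage class was changed.
    """
    changed = []
    for key, upload_date in records:
        if not is_older_than(upload_date, days, now):
            continue
        # In-place copy: same bucket and key, only the storage class differs.
        # Note: copy_object handles objects up to 5 GB; larger objects
        # need a multipart copy instead.
        s3_client.copy_object(
            Bucket=bucket,
            Key=key,
            CopySource={"Bucket": bucket, "Key": key},
            StorageClass=storage_class,
            MetadataDirective="COPY",
        )
        changed.append(key)
    return changed
```

You would call it with something like archive_old_objects(boto3.client("s3"), "my-bucket", records). With 10 million objects you would probably also want retries and some record of which keys have already been processed, so the job can be resumed.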
If you have millions of objects, it can take a long time to merely list the objects. Therefore, you could consider using Amazon S3 Inventory, which can provide a daily or weekly CSV file listing all objects. You could then use this CSV file as the 'input list' for your 'copy' operation rather than having to list the bucket itself.
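As a rough sketch of reading such an inventory file in Python: the column order is not fixed, so the function below takes it as a parameter (in a real report it is listed in the "fileSchema" field of the inventory's manifest.json). Real inventory files are gzip-compressed, so decompress them first (for example with the gzip module). Keys in CSV reports are URL-encoded; the sketch decodes them with `unquote`, and it is an assumption on my part that this matches how spaces and "+" are encoded in your report, so check a few keys. The function name is hypothetical.

```python
import csv
import io
from urllib.parse import unquote


def keys_from_inventory(csv_text, schema="Bucket, Key"):
    """Yield decoded object keys from S3 Inventory CSV text.

    `schema` is the comma-separated column list, as found in the
    "fileSchema" field of the inventory manifest.json.
    """
    columns = [name.strip() for name in schema.split(",")]
    key_index = columns.index("Key")
    for row in csv.reader(io.StringIO(csv_text)):
        if row:  # skip blank lines
            # Assumption: plain percent-decoding is the right decoding here.
            yield unquote(row[key_index])
```

The resulting keys can then be matched against the dates in DynamoDB and passed to whatever does the copy.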
Or, just be lazy (which is always more productive!) and archive everything to Glacier. Then, if somebody actually needs one of the files in the next 6 months, simply restore it from Glacier before use. So simple!
Related
I have a log archive bucket, and that bucket has 2.5m+ objects.
I am looking to download some specific time period files. For this I have tried different methods but all of them are failing.
My observation is those queries start from oldest file, but the files I seek are the newest ones. So it takes forever to find them.
aws s3 sync s3://mybucket . --exclude "*" --include "2021.12.2*" --include "2021.12.3*" --include "2022.01.01*"
Am I doing something wrong?
Is it possible to make these query start from newest files so it might take less time to complete?
I also tried using S3 Browser and CloudBerry. Same problem. Tried with a EC2 that is inside the same AWS network. Same problem.
2.5m+ objects in an Amazon S3 bucket is indeed a large number of objects!
When listing the contents of an Amazon S3 bucket, the S3 API only returns 1000 objects per API call. Therefore, when the AWS CLI (or CloudBerry, etc) is listing the objects in the S3 bucket it requires 2500+ API calls. This is most probably the reason why the request is taking so long (and possibly failing due to lack of memory to store the results).
You can possibly reduce the time by specifying a Prefix, which reduces the number of objects returned from the API calls. This would help if the objects you want to copy are all in a sub-folder.
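A Prefix matches the start of the key name, so it is not limited to real sub-folders. If your keys actually begin with the date (as the --include patterns in the question suggest, which is an assumption), a prefix such as "2021.12.2" would restrict the listing to those objects. A minimal sketch with boto3's paginator, where `s3_client` is a boto3 S3 client and `list_keys` is a made-up helper name:

```python
def list_keys(s3_client, bucket, prefix):
    """Yield every key in `bucket` whose name starts with `prefix`."""
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent from a page with no matching objects.
        for obj in page.get("Contents", []):
            yield obj["Key"]
```

For example, list_keys(boto3.client("s3"), "mybucket", "2021.12.2") would only page through keys starting with that string, instead of all 2.5m+ objects.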
Failing that, you could use Amazon S3 Inventory, which can provide a daily or weekly CSV file listing all objects. You could then extract from that CSV file a list of objects you want to copy (eg use Excel or write a program to parse the file). Then, specifically copy those objects using aws s3 cp or from a programming language. For example, a Python program could parse the CSV file and then use download_file() to download each of the desired objects.
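The download step could be sketched like this, assuming you already have the list of keys (for example extracted from the inventory file) and that the wanted objects can be recognised by the start of their key. `download_selected` and its parameters are hypothetical names; `s3_client` is a boto3 S3 client.

```python
import os


def download_selected(s3_client, bucket, keys, wanted_prefixes, dest_dir="."):
    """Download keys starting with any wanted prefix; return the local paths."""
    downloaded = []
    for key in keys:
        # Skip folder placeholders and keys outside the wanted periods.
        if key.endswith("/") or not key.startswith(tuple(wanted_prefixes)):
            continue
        # Flatten any "folder" part so the file lands directly in dest_dir.
        local_path = os.path.join(dest_dir, os.path.basename(key))
        s3_client.download_file(bucket, key, local_path)
        downloaded.append(local_path)
    return downloaded
```

With the patterns from the question, wanted_prefixes would be something like ["2021.12.2", "2021.12.3", "2022.01.01"].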
The simple fact is that a flat-structure Amazon S3 bucket with 2.5m+ objects will always be difficult to list. If possible, I would encourage you to use 'folders' to structure the bucket so that you would only need to list portions of the bucket at a time.
The standard S3 console supports uploading files and changing the storage class, but in S3 Glacier we need to create a vault, and console support is not provided. If I select the S3 Glacier storage class in a standard S3 upload, how is that different from Glacier? Will it internally create a vault? Is there any price difference?
Uploading to Glacier via Amazon S3 storage classes looks simpler and easier.
There are two different types of Glacier.
The 'original' Amazon Glacier uses vaults and jobs. Quite frankly, it is awful to use. It's bearable if you are using a software package that knows how to use Glacier, but it is not a pleasant experience. For example, even just listing the contents of a vault requires waiting for a job to run, and then results need to be retrieved.
Using Glacier as a Storage Class in Amazon S3 is a much more pleasant way to use Glacier. You can use all standard S3 commands and utilities and it gives immediate feedback when you list objects. The only thing that takes time is retrieving an object that is in a Glacier storage class.
Plus, the Glacier and Glacier Deep Archive storage classes are cheaper than Glacier itself! I'd like to prove this, but the pricing page for Glacier now redirects to S3 pricing so it's not possible to see how much it costs!
Bottom line: Use S3 storage classes, not the old 'Glacier' service that uses Vaults.
We have daily database backups created and stored on a server. In order to free up space, it was decided that all the backups older than 30 days should be archived using AWS Glacier.
So far so good, I managed to write a PowerShell script to select the required files and upload them to Glacier, but since I am new to all the AWS stuff, I have one question: is it possible to check that the files I have uploaded are indeed in the archive and that there has been no information loss?
My first approach was to send job retrieval requests for all the files that we have uploaded, and 4 hours later compare the checksums and archive ids of our original files and the ones we retrieved from Glacier. However, I think this process takes too long, costs extra money, and most importantly, makes no sense at all.
I have also found that I can use inventory retrieval, but as far as I can tell this approach would be very similar to the one described above, just without downloading all the files again.
Lastly, is there even a point to trying to ensure that a file upload was successful if there are no errors? My vague understanding is that AWS would come back with error messages should an upload to Glacier fail, and it computes checksums internally during uploads.
I know that StackOverflow has seen more precisely worded questions, but any clarification regarding this would be immensely appreciated.
You have to try pretty hard to upload a corrupt file to Glacier, because Glacier requires checksums sent with each API request, and will reject the uploads if they don't match the hashes. Obviously you need to spot check your archives, but each one does not need to be downloaded and verified because of the built-in protections.
See Computing Checksums in the Amazon S3 Glacier Developer Guide for descriptions of how this works, on the wire.
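For reference, the checksum Glacier expects for an archive is a SHA-256 "tree hash": the data is split into 1 MiB chunks, each chunk is hashed, and neighbouring digests are hashed together pairwise until one remains. The following is a small in-memory sketch of that scheme as I understand it from the guide, not the SDK's own implementation (the SDKs and tools normally compute it for you); `tree_hash` is just an illustrative name.

```python
import hashlib

CHUNK_SIZE = 1024 * 1024  # tree hashes are built from 1 MiB chunks


def tree_hash(data: bytes) -> str:
    """Return the SHA-256 tree hash of `data` as a hex string."""
    # Hash each 1 MiB chunk; empty input is treated as one empty chunk.
    chunks = [data[i:i + CHUNK_SIZE]
              for i in range(0, len(data), CHUNK_SIZE)] or [b""]
    level = [hashlib.sha256(chunk).digest() for chunk in chunks]
    # Combine neighbouring digests pairwise until a single root remains.
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), 2):
            pair = level[i:i + 2]
            if len(pair) == 2:
                next_level.append(hashlib.sha256(pair[0] + pair[1]).digest())
            else:
                next_level.append(pair[0])  # odd digest is carried up unchanged
        level = next_level
    return level[0].hex()
```

For data of 1 MiB or less, the tree hash is simply the ordinary SHA-256 of the data. For real backups you would read the file in chunks rather than loading it all into memory.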
Then, consider not using Glacier at all... not directly, anyway. Use S3, and upload your files using the GLACIER or DEEP_ARCHIVE storage class. Or upload them as Standard, with a lifecycle policy that moves them into one of the archive storage classes after 1 day. (Useful because if you delete Glacier or Deep Archive uploads before the minimum storage time, you're billed for the entire minimum time... this way you have a 24 hour "oops I don't like the way I set this up" window, since Standard storage has no minimum storage time period).
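The "upload as Standard, transition after 1 day" variant could be set up with boto3 roughly as follows. This is a sketch only: the rule ID and the helper names are made up, `s3_client` is a boto3 S3 client, and the empty prefix is meant to apply the rule to the whole bucket. Be aware that put_bucket_lifecycle_configuration replaces any lifecycle configuration the bucket already has.

```python
def glacier_lifecycle_rule(days=1, storage_class="GLACIER", prefix=""):
    """Build a lifecycle configuration that archives objects after `days` days."""
    return {
        "Rules": [
            {
                "ID": f"archive-after-{days}-days",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},  # "" matches every object
                "Transitions": [{"Days": days, "StorageClass": storage_class}],
            }
        ]
    }


def apply_lifecycle(s3_client, bucket, config):
    """Replace the bucket's lifecycle configuration with `config`."""
    s3_client.put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=config
    )
```

To upload straight into an archive class instead, upload_file accepts ExtraArgs={"StorageClass": "DEEP_ARCHIVE"} (or "GLACIER").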
Using S3 is a far better solution, because S3 has a much better API and console, but the pricing is identical, because S3 is actually using Glacier as its backend, while you have the advantage of S3 as the frontend. Glacier has essentially no console functionality, is very opaque, and is not really designed for human interaction -- Glacier appears to have been designed as a backing store for an archiving system or service, which is exactly how S3 uses Glacier.
Amazon Simple Storage Service (Amazon S3) supports lifecycle configuration on an S3 bucket, which enables you to transition objects to the Amazon S3 GLACIER storage class for archival. When you transition Amazon S3 objects to the GLACIER storage class, Amazon S3 internally uses Glacier for durable storage at lower cost. Although the objects are stored in Glacier, they remain Amazon S3 objects that you manage in Amazon S3, and you cannot access them directly through Glacier.
https://docs.aws.amazon.com/amazonglacier/latest/dev/introduction.html
It is unfortunate that AWS recently muddied this issue by dumbing things down, rebranding "Glacier" as "S3 Glacier," as if they were the same thing, when they are two very different services, one of which operates in a mode that gives you a gateway to the other. It's similarly unfortunate how Glacier has traditionally been marketed. Without S3 in front, Glacier is not well suited for very many applications.
I have a website where I serve content that is stored on an AWS S3 bucket. As the amount of content grows, I have started thinking about back-up options. Using AWS Glacier came up as a natural route.
After reading up on it, I didn't understand whether it does what I intend to do with it. From what I have understood, using Glacier, you set lifecycle policies on objects stored in your S3 buckets. According to these policies, objects will be transferred to Glacier and deleted from your S3 bucket at a specific point in time after they have been uploaded to S3. At this point, the object's storage class changes to 'GLACIER'. Amazon explains that, once this is done, you can no longer access the objects through S3 but "their index entry will remain as is". At the same time, they say that retrieval of objects from Glacier takes 3-5 hours.
My question is: Does this mean that, once objects are transferred to Glacier, I will not be able to serve them on my website without retrieving them first? Or does it mean that they will still be served from the S3 bucket as usual but that, in case something happens to the files on S3, I will just be able to retrieve them in 3-5 hours? Glacier would only be a viable backup solution for me if users of my website would still be able to load content on the website after the corresponding objects are transferred to Glacier. Also, is it possible to have objects transferred to Glacier without them being deleted from the S3 bucket?
Thank you
To answer your question: Does this mean that, once objects are transferred to Glacier, I will not be able to serve them on my website without retrieving them first?
No, you won't be able to serve them on your website unless you transfer them from Glacier back to the Standard or Standard-IA class, which takes 3-5 hours. Glacier is generally used to archive cold data, such as old logs, that is rarely accessed. So if you need real-time access to the objects, Glacier isn't a valid option for you.
I have an S3 bucket on which I've configured a Lifecycle policy which says to archive all objects in the bucket after 1 day(s) (since I want to keep the files in there temporarily but if there are no issues then it is fine to archive them and not have to pay for the S3 storage)
However, I have noticed there are some files in that bucket that were created in February.
So, am I right in thinking that if you select 'Archive' as the lifecycle option, that means "copy-to-glacier-and-then-delete-from-S3"? In which case this issue of the files left from February would be a fault, since they haven't been deleted?
Only I saw there is another option - 'Archive and then Delete' - but I assume that means "copy-to-glacier-and-then-delete-from-glacier" - which I don't want.
Has anyone else had issues with S3 -> Glacier?
What you describe sounds normal. Check the storage class of the objects.
The correct way to understand the S3/Glacier integration is that S3 is the "customer" of Glacier -- not you -- and Glacier is a back-end storage provider for S3. Your relationship is still with S3 (if you go into Glacier in the console, your stuff isn't visible there, if S3 put it in Glacier).
When S3 archives an object to Glacier, the object is still logically "in" the bucket and is still an S3 object, and visible in the S3 console, but can't be downloaded from S3 because S3 has migrated it to a different backing store.
The difference you should see in the console is that objects will have a "storage class" of Glacier instead of the usual Standard or Reduced Redundancy. They don't disappear from there.
To access the object later, you ask S3 to initiate a restore from Glacier, which S3 does... but the object is still in Glacier at that point, with S3 holding a temporary copy, which it will again purge after some number of days.
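From code, checking the storage class and asking for a restore might look like the sketch below. It assumes `s3_client` is a boto3 S3 client; the helper name, the 7-day retention of the temporary copy and the "Standard" retrieval tier are arbitrary choices for illustration.

```python
ARCHIVE_CLASSES = {"GLACIER", "DEEP_ARCHIVE"}


def restore_if_archived(s3_client, bucket, key, days=7, tier="Standard"):
    """Request a temporary restore if the object is archived.

    Returns "available", "in-progress", or "requested".
    """
    head = s3_client.head_object(Bucket=bucket, Key=key)
    # S3 omits StorageClass from the response for Standard objects.
    if head.get("StorageClass", "STANDARD") not in ARCHIVE_CLASSES:
        return "available"
    restore = head.get("Restore", "")
    if 'ongoing-request="true"' in restore:
        return "in-progress"  # a restore was already requested
    if 'ongoing-request="false"' in restore:
        return "available"  # a temporary restored copy already exists
    s3_client.restore_object(
        Bucket=bucket,
        Key=key,
        RestoreRequest={"Days": days, "GlacierJobParameters": {"Tier": tier}},
    )
    return "requested"
```

After a "requested" result you would poll head_object again later; the object can be downloaded once the restore has finished, and the storage class still shows as Glacier throughout.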
Note that your attempt at saving may be a little bit off target if you do not intend to keep these files for 3 months, because any time you delete an object from Glacier, you are billed for the remainder of the three months, if that object has been in Glacier for a shorter time than that.