How to create one glacier archive for many s3 objects? - amazon-web-services

One of our clients asked us to provide all the videos they uploaded to the system. The files are stored on S3. The client expects to receive a single link that downloads an archive containing all the videos.
Is there a way to create such an archive without downloading the files, archiving them, and uploading them back to AWS?
So far I haven't found a solution.
Is it possible to do this with Glacier, or to move the files into a folder and expose that?

Unfortunately, you can't create zip-like archives from existing objects directly on S3. Similarly, you can't transfer them to Glacier to do this; Glacier will not produce a single zip or rar (or any type of) archive from multiple S3 objects for you.
Instead, you have to download them first, zip or rar them (or use whichever archiving format you prefer), and then re-upload the archive to S3. Then you can share the zip/rar with your customers.
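For illustration, a minimal boto3 sketch of that download-zip-reupload flow; the bucket name, prefix, and output key are assumptions:

```python
# A minimal sketch, assuming hypothetical bucket/prefix names.
import os
import zipfile

import boto3

s3 = boto3.client("s3")
bucket = "my-bucket"          # assumption: source bucket
prefix = "client-videos/"     # assumption: prefix holding the client's videos
archive_path = "/tmp/videos.zip"

# 1. Download every object under the prefix and add it to a local zip.
paginator = s3.get_paginator("list_objects_v2")
with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            local = os.path.join("/tmp", os.path.basename(obj["Key"]))
            s3.download_file(bucket, obj["Key"], local)
            zf.write(local, arcname=os.path.basename(obj["Key"]))
            os.remove(local)

# 2. Re-upload the archive and share it via a pre-signed URL.
s3.upload_file(archive_path, bucket, "exports/videos.zip")
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": bucket, "Key": "exports/videos.zip"},
    ExpiresIn=7 * 24 * 3600,  # link valid for 7 days (the SigV4 maximum)
)
print(url)
```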
There is also the possibility of using the S3 multipart upload API to merge S3 objects without downloading them. But this requires programming a custom solution to merge the objects, and the result is a plain concatenation, not a zip/rar-type archive.
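If that concatenation approach is acceptable, a rough sketch with upload_part_copy might look like this; the bucket and key names are assumptions, and every part except the last must be at least 5 MB:

```python
# A minimal sketch of concatenating objects server-side with UploadPartCopy.
# The result is a byte-level concatenation, not a zip/rar archive.
import boto3

s3 = boto3.client("s3")
bucket = "my-bucket"                             # assumption
source_keys = ["videos/a.mp4", "videos/b.mp4"]   # assumption
target_key = "exports/merged.bin"                # assumption

mpu = s3.create_multipart_upload(Bucket=bucket, Key=target_key)
parts = []
for number, key in enumerate(source_keys, start=1):
    resp = s3.upload_part_copy(
        Bucket=bucket,
        Key=target_key,
        UploadId=mpu["UploadId"],
        PartNumber=number,
        CopySource={"Bucket": bucket, "Key": key},
    )
    parts.append({"ETag": resp["CopyPartResult"]["ETag"], "PartNumber": number})

s3.complete_multipart_upload(
    Bucket=bucket,
    Key=target_key,
    UploadId=mpu["UploadId"],
    MultipartUpload={"Parts": parts},
)
```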

You can move everything under a specific prefix (what you see as a subfolder) to Glacier by using an S3 lifecycle rule to perform the transition.
More information is available here.
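For reference, a minimal sketch of what such a lifecycle rule might look like via boto3; the bucket name and prefix are assumptions:

```python
# A minimal sketch: transition everything under a prefix to the GLACIER
# storage class via a lifecycle rule. Bucket and prefix names are assumptions.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-bucket",  # assumption
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-videos-prefix",
                "Filter": {"Prefix": "client-videos/"},  # assumption
                "Status": "Enabled",
                "Transitions": [{"Days": 0, "StorageClass": "GLACIER"}],
            }
        ]
    },
)
```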
Original answer
There is no native way to retrieve all objects as an archive via S3.
S3 simply exposes objects as they were uploaded; unfortunately, you will need to perform the archiving as a separate process afterwards.

Related

How to migrate data from s3 bucket to glacier?

I have a TB-sized S3 bucket with PDF files. I need to migrate the old files to Glacier. I know that I can create a lifecycle rule to migrate files that are older than a certain number of days. But in my case the bucket currently contains both old and new PDF files, and they were all added at the same time, so they may share the same upload date. In this case a lifecycle rule won't be useful.
Inside the PDF files there is a field called capture_date, so I need to migrate the files based on that capture_date (i.e. migrate all PDF files whose capture_date < 2015-05-21, and so on).
Would a Fargate job be useful here? If so, please give a brief idea.
Please suggest your ideas. Thanks in advance.
S3 by itself will not read your PDF files. Thus you have to read them yourself, extract the data that determines which ones are old and which are new, and use the AWS SDK (or CLI) to move them to Glacier.
Since the files are not too big, you could use S3 Batch Operations along with a Lambda function that changes the storage class to Glacier.
Alternatively, you could do this on an EC2 instance, using S3 Inventory's CSV list of your objects (useful given the large number of them).
And the most traditional way is to just list your bucket and iterate over each object, as sketched below.
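A minimal sketch of that last option: reading capture_date out of each PDF is application-specific, so extract_capture_date below is a hypothetical placeholder, while the copy_object call that changes the storage class is the standard AWS API:

```python
# A minimal sketch, assuming the capture_date can somehow be extracted from
# each PDF (extract_capture_date is a hypothetical placeholder).
import datetime

import boto3

s3 = boto3.client("s3")
bucket = "my-pdf-bucket"  # assumption
cutoff = datetime.date(2015, 5, 21)

def extract_capture_date(bucket, key):
    """Hypothetical helper: download the PDF and parse its capture_date field."""
    raise NotImplementedError

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket):
    for obj in page.get("Contents", []):
        if extract_capture_date(bucket, obj["Key"]) < cutoff:
            # Changing the storage class is done with an in-place copy.
            s3.copy_object(
                Bucket=bucket,
                Key=obj["Key"],
                CopySource={"Bucket": bucket, "Key": obj["Key"]},
                StorageClass="GLACIER",
            )
```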

How to download archive from Glacier Deep Archive locally using boto3?

I uploaded a zip file to Glacier Deep Archive. Now I want to download the same file locally with the boto3 library. What is the way to do it? I tried to find an example, but failed. I have the vault_name and archive_id saved. The file is less than 1 GB in size. How long does it take to download such a file?
Before you can actually download the object, in boto3 you have to execute the restore_object operation. Once the object is restored, you can then download it from S3. From the docs:
Objects in the GLACIER and DEEP_ARCHIVE storage classes are archived. To access an archived object, you must first initiate a restore request. This restores a temporary copy of the archived object.
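Assuming the object lives in S3 under the DEEP_ARCHIVE storage class, a minimal sketch of the restore-then-download flow might look like this; bucket and key names are assumptions:

```python
# A minimal sketch: request a restore, wait for it, then download.
import time

import boto3

s3 = boto3.client("s3")
bucket, key = "my-bucket", "backup.zip"  # assumptions

# 1. Ask S3 to restore a temporary copy. With the Standard tier, DEEP_ARCHIVE
#    restores typically complete within about 12 hours; Bulk is cheaper but slower.
s3.restore_object(
    Bucket=bucket,
    Key=key,
    RestoreRequest={"Days": 2, "GlacierJobParameters": {"Tier": "Standard"}},
)

# 2. Poll until the restore has completed, then download as usual.
while True:
    head = s3.head_object(Bucket=bucket, Key=key)
    if 'ongoing-request="false"' in head.get("Restore", ""):
        break
    time.sleep(600)  # check every 10 minutes

s3.download_file(bucket, key, "backup.zip")
```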

Cloud function is unable to move files in archive bucket

I have implemented an architecture as per the link https://cloud.google.com/solutions/streaming-data-from-cloud-storage-into-bigquery-using-cloud-functions
But the issue is that when multiple files arrive in the bucket at the same time (e.g. 3 files with the same timestamp, 21/06/2020, 12:13:54 UTC+5:30), the Cloud Function is unable to move all of these files to the success bucket after processing.
Can someone please suggest a solution?
Google Cloud Storage is not a file system. You can only CREATE, READ, and DELETE a blob; therefore, you can't MOVE a file. The MOVE that exists in the console or in some client libraries (in Python, for example) performs a CREATE (copying the existing blob to the target name) followed by a DELETE of the old blob.
Consequently, you can't keep the original timestamp when you perform a MOVE operation.
NOTE: because a MOVE performs a CREATE and a DELETE, you are charged an early-deletion fee when you use storage classes such as Nearline, Coldline, and Archive.
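In other words, the "move" in the linked architecture resolves to a copy followed by a delete; a minimal sketch with the google-cloud-storage Python client, where bucket and object names are assumptions:

```python
# A minimal sketch of a "move" in Cloud Storage: copy the blob, then delete
# the original. Bucket and object names are assumptions.
from google.cloud import storage

client = storage.Client()
source_bucket = client.bucket("incoming-bucket")   # assumption
success_bucket = client.bucket("success-bucket")   # assumption

blob = source_bucket.blob("file-2020-06-21.json")  # assumption
source_bucket.copy_blob(blob, success_bucket, new_name=blob.name)
blob.delete()
```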

Download bulk objects from Amazon S3 bucket

I have a large bucket folder with over 30 million objects (images). Now I need to download only 700,000 objects (images) from that large folder.
I have the names of the objects (images) that I need to download in a .txt file.
I can use the AWS CLI, but I'm not sure whether it supports downloading many objects with one command.
Is there a straightforward solution that you would have in mind?
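One common pattern is to feed the key list to parallel downloads; a minimal boto3 sketch, assuming keys.txt holds one object key per line and the bucket name is a placeholder:

```python
# A minimal sketch: read a list of keys and download them in parallel.
from concurrent.futures import ThreadPoolExecutor

import boto3

s3 = boto3.client("s3")
bucket = "my-image-bucket"  # assumption

with open("keys.txt") as f:
    keys = [line.strip() for line in f if line.strip()]

def fetch(key):
    s3.download_file(bucket, key, key.replace("/", "_"))

# 700,000 sequential downloads would take far too long, so parallelize.
with ThreadPoolExecutor(max_workers=32) as pool:
    pool.map(fetch, keys)
```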

How can I know the names of the files / archives that I loaded into a vault?

I wish to upload my zipped folders to Amazon Glacier.
It seems that after loading files/archives into a vault, I can only get information about how many archives I have in the vault.
How can I find out the names of the files/archives that I have in the vault?
Thanks.
If you are talking about the files within the zip files, then you have to track that yourself.
If you are talking about the archives themselves, then Amazon provides inventory data that you can retrieve via the InitiateJob API. As always with Glacier, retrievals take several hours, and this inventory data is only updated about once a day.
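For reference, a minimal boto3 sketch of that inventory retrieval; the vault name is an assumption, and archive "names" only appear if you set an ArchiveDescription at upload time:

```python
# A minimal sketch of requesting a vault inventory; the job itself typically
# takes several hours to complete before get_job_output will succeed.
import json

import boto3

glacier = boto3.client("glacier")
vault = "my-vault"  # assumption

job = glacier.initiate_job(
    accountId="-",
    vaultName=vault,
    jobParameters={"Type": "inventory-retrieval"},
)

# Hours later, once the job reports Completed=True:
status = glacier.describe_job(accountId="-", vaultName=vault, jobId=job["jobId"])
if status["Completed"]:
    output = glacier.get_job_output(accountId="-", vaultName=vault, jobId=job["jobId"])
    inventory = json.loads(output["body"].read())
    for archive in inventory["ArchiveList"]:
        print(archive["ArchiveId"], archive.get("ArchiveDescription", ""))
```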
You may also consider storing files in S3 with the storage class set to GLACIER: this gives you all the convenience of the regular APIs for browsing bucket contents, but with Glacier pricing for the storage.