How to download archive from Glacier Deep Archive locally using boto3? - amazon-web-services

I uploaded a zip file to Glacier Deep Archive. Now I want to download the same file locally with the boto3 library. What is the way to do it? I tried to find an example, but failed. I have the vault_name and archive_id saved. The file is less than 1 GB in size. How long does it take to download such a file?

Before you can actually download the object, you have to execute the restore_object operation in boto3. Once the object is restored, you can then download it from S3. From the docs:
Objects in the GLACIER and DEEP_ARCHIVE storage classes are archived. To access an archived object, you must first initiate a restore request. This restores a temporary copy of the archived object.
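As an illustration, here is a minimal restore-then-download sketch, assuming the archive lives in an S3 bucket under the DEEP_ARCHIVE storage class (the bucket and key names below are placeholders):

```python
import time
import boto3

s3 = boto3.client("s3")
BUCKET = "my-bucket"    # placeholder
KEY = "my-archive.zip"  # placeholder

# Kick off the restore. For DEEP_ARCHIVE, the Standard tier typically
# completes within 12 hours and Bulk within 48 hours.
s3.restore_object(
    Bucket=BUCKET,
    Key=KEY,
    RestoreRequest={
        "Days": 7,  # how long the temporary copy stays available
        "GlacierJobParameters": {"Tier": "Standard"},
    },
)

# Poll until the temporary copy is ready, then download it.
while True:
    head = s3.head_object(Bucket=BUCKET, Key=KEY)
    if 'ongoing-request="false"' in head.get("Restore", ""):
        break
    time.sleep(600)  # check every 10 minutes

s3.download_file(BUCKET, KEY, "my-archive.zip")
```

Note that the question mentions a vault_name and archive_id; if the archive was uploaded through the native Glacier vault API rather than as an S3 object, you would instead use the boto3 glacier client's initiate_job and get_job_output calls.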

Related

How to migrate data from s3 bucket to glacier?

I have a TB-sized S3 bucket with PDF files. I need to migrate the old files to Glacier. I know that I can create a lifecycle rule to migrate files that are older than a certain number of days. But in my case the bucket currently contains both old and new PDF files, and they were added at the same time, so they may have the same upload date. In this case a lifecycle rule won't be useful.
The PDF files contain a field called capture_date, so I need to migrate the files based on capture_date (i.e., migrate all PDF files where capture_date < 2015-05-21, and so on).
Would a Fargate job be useful here? If so, please give a brief idea.
Please suggest your ideas. Thanks in advance.
S3 by itself will not read your PDF files. Thus you have to read them yourself, extract the data that determines which ones are old and which are new, and use the AWS SDK (or CLI) to move them to Glacier.
Since the files are not too big, you could use S3 Batch Operations along with a Lambda function that changes the storage class to Glacier.
Alternatively, you could do this on an EC2 instance, using an S3 Inventory CSV list of your objects (assuming a large number of them).
And the most traditional way is to just list your bucket and iterate over each object.
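A minimal sketch of that list-and-iterate approach, assuming the capture_date can be read from each PDF's document metadata with PyPDF2 (the bucket name, cutoff date, and the /capture_date metadata key are all assumptions here, not something the question confirms):

```python
import io
import boto3
from PyPDF2 import PdfReader  # assumed way of reading the capture_date field

s3 = boto3.client("s3")
BUCKET = "my-pdf-bucket"  # placeholder
CUTOFF = "2015-05-21"     # migrate files captured before this date

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        if not key.lower().endswith(".pdf"):
            continue
        body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
        meta = PdfReader(io.BytesIO(body)).metadata or {}
        capture_date = meta.get("/capture_date")  # hypothetical metadata key
        if capture_date and str(capture_date) < CUTOFF:  # assumes ISO-formatted dates
            # Changing the storage class is done with an in-place copy.
            s3.copy_object(
                Bucket=BUCKET,
                Key=key,
                CopySource={"Bucket": BUCKET, "Key": key},
                StorageClass="GLACIER",
            )
```

The same per-object logic could be packaged as the Lambda function invoked by S3 Batch Operations if iterating from a single machine turns out to be too slow.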

How to create one glacier archive for many s3 objects?

One of our clients is asking for all the videos they uploaded to the system. The files are stored in S3. The client expects to get one link that downloads an archive with all the videos.
Is there a way to create such an archive without downloading the files, archiving them, and uploading them back to AWS?
So far I haven't found a solution.
Is it possible to do it with Glacier, or to move the files to a folder and expose it?
Unfortunately, you can't create zip-like archives from existing objects directly on S3. Similarly, you can't transfer them to Glacier to do this. Glacier is not going to produce a single zip or rar (or any type of) archive from multiple S3 objects for you.
Instead, you have to download them first, zip or rar them (or use whichever archiving format you prefer), and then re-upload the result to S3. Then you can share the zip/rar with your customers.
There is also the possibility of using the multipart upload API to merge S3 objects without downloading them. But this requires programming a custom solution to concatenate objects (it does not create zip/rar-type archives).
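As an illustration of that multipart approach, the following sketch concatenates a few source objects into one target object entirely server-side via upload_part_copy (bucket and key names are placeholders; every part except the last must be at least 5 MB):

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "my-video-bucket"                      # placeholder
SOURCE_KEYS = ["videos/a.mp4", "videos/b.mp4"]  # placeholders
TARGET_KEY = "exports/all-videos.bin"           # placeholder

# Start a multipart upload for the merged object.
mpu = s3.create_multipart_upload(Bucket=BUCKET, Key=TARGET_KEY)

parts = []
for i, key in enumerate(SOURCE_KEYS, start=1):
    # Server-side copy of each source object as one part; nothing is downloaded.
    resp = s3.upload_part_copy(
        Bucket=BUCKET,
        Key=TARGET_KEY,
        UploadId=mpu["UploadId"],
        PartNumber=i,
        CopySource={"Bucket": BUCKET, "Key": key},
    )
    parts.append({"PartNumber": i, "ETag": resp["CopyPartResult"]["ETag"]})

s3.complete_multipart_upload(
    Bucket=BUCKET,
    Key=TARGET_KEY,
    UploadId=mpu["UploadId"],
    MultipartUpload={"Parts": parts},
)
```

The result is a single object containing the raw bytes of the sources back to back, not a zip/rar that a customer could unpack into separate files.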
You can create a glacier archive for a specific prefix (what you see as a subfolder) by using AWS lifecycle rules to perform this action.
More information available here.
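A minimal sketch of such a lifecycle rule with boto3 (the bucket name and prefix are placeholders); note that this transitions each object under the prefix to the Glacier storage class rather than bundling them into a single downloadable archive:

```python
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-video-bucket",  # placeholder
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-videos-prefix",
                "Filter": {"Prefix": "videos/"},  # the "subfolder" to archive
                "Status": "Enabled",
                "Transitions": [{"Days": 0, "StorageClass": "GLACIER"}],
            }
        ]
    },
)
```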
Original answer
There is no native way to be able to retrieve all objects as an archive via S3.
S3 simply exposes all objects as they were uploaded; unfortunately, you will need to perform the archiving as a separate process afterwards.

Cloud function is unable to move files in archive bucket

I have implemented the architecture described at https://cloud.google.com/solutions/streaming-data-from-cloud-storage-into-bigquery-using-cloud-functions
The issue arises when multiple files land in the bucket at the same time (e.g., 3 files with the same timestamp, 21/06/2020, 12:13:54 UTC+5:30). In this scenario, the Cloud Function is unable to move all of the files with the same timestamp to the success bucket after processing.
Can someone please suggest a solution?
Google Cloud Storage is not a file system. You can only CREATE, READ and DELETE a blob; therefore, you can't MOVE a file. The MOVE that exists in the console or in some client libraries (in Python, for example) performs a CREATE (copying the existing blob to the target name) and then a DELETE of the old blob.
Consequently, you can't keep the original timestamp when you perform a MOVE operation.
NOTE: because a MOVE performs a CREATE and a DELETE, you are charged an early-deletion fee when you use storage classes such as Nearline, Coldline and Archive.
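A minimal sketch of such a copy-then-delete "move" with the google-cloud-storage Python client (the bucket names are placeholders):

```python
from google.cloud import storage

client = storage.Client()
src_bucket = client.bucket("incoming-bucket")  # placeholder
dst_bucket = client.bucket("success-bucket")   # placeholder

def move_blob(blob_name: str) -> None:
    """GCS has no native MOVE: copy the blob, then delete the original."""
    src_blob = src_bucket.blob(blob_name)
    # CREATE: copy the existing blob to the destination bucket.
    src_bucket.copy_blob(src_blob, dst_bucket, new_name=blob_name)
    # DELETE: remove the old blob (this step is what can trigger early-deletion charges).
    src_blob.delete()
```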

Spark doesn't output .crc files on S3

When I use Spark locally, writing data to my local filesystem, it creates some useful .crc files.
Running the same job on AWS EMR and writing to S3, the .crc files are not written.
Is this normal? Is there a way to force the writing of .crc files on S3?
Those .crc files are just created by the low-level bits of the Hadoop FS binding so that it can identify when a block is corrupt and, on HDFS, switch to another datanode's copy of the data for the read and kick off a re-replication from one of the good copies.
On S3, stopping corruption is left to AWS.
What you can get from S3 is the ETag of a file, which is the md5sum for a small upload; for a multipart upload it is some other string, which again changes whenever you re-upload the object.
You can get at this value with the Hadoop 3.1+ version of the S3A connector, though it's off by default because distcp gets very confused when uploading from HDFS. For earlier versions, you can't get at it, nor does the aws s3 command show it. You'd have to try some other S3 library (it's just a HEAD request, after all).
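For example, with boto3 the ETag comes back in the response to a plain HEAD request (the bucket and key below are placeholders):

```python
import boto3

s3 = boto3.client("s3")
resp = s3.head_object(Bucket="my-bucket", Key="output/part-00000.parquet")
# md5 of the body for single-part uploads; "<hash>-<part count>" for multipart uploads
print(resp["ETag"])
```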

Download bulk objects from Amazon S3 bucket

I have a large bucket folder with over 30 million objects (images). Now I need to download only 700,000 objects (images) from that large folder.
I have the names of the objects (images) that I need to download in a .txt file.
I can use the AWS CLI, but I'm not sure whether it supports downloading many objects in one command.
Is there a straightforward solution you would have in mind?
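Since there is no single AWS CLI command that takes an arbitrary list of 700,000 keys, one possible approach is a short boto3 script that reads the .txt file and downloads each key, optionally in parallel (the file name, bucket, and output directory below are placeholders):

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
import boto3

s3 = boto3.client("s3")
BUCKET = "my-image-bucket"   # placeholder
OUT_DIR = Path("downloads")  # placeholder
OUT_DIR.mkdir(exist_ok=True)

def fetch(key: str) -> None:
    # Keep only the object's base name locally; adjust if keys collide.
    s3.download_file(BUCKET, key, str(OUT_DIR / Path(key).name))

# keys.txt: one object key per line (placeholder file name)
with open("keys.txt") as f:
    keys = [line.strip() for line in f if line.strip()]

# Download with a thread pool; boto3 low-level clients are safe to share across threads.
with ThreadPoolExecutor(max_workers=32) as pool:
    list(pool.map(fetch, keys))
```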