How to migrate data from s3 bucket to glacier? - amazon-web-services

I have a TB sized S3 bucket with pdf files. I need to migrate the old files to glacier. I know that I can create a life cycle rule to migrate files which are older than certain number of days. But in my case currently the bucket consists of both old and new pdf files and they were added at a same time. So they may have same uploaded date. In this case a life cycle rule won't be useful.
In the pdf files there is a field called capture_date. So i need to migrate those files based on the capture_date. (ie: migrate all pdf files if the capture_date < 2015-05-21 likewise).
Will a Fargate job will be useful here? if so, please give a brief idea.
Please suggest your ideas. Thanks in advance

S3 by itself will not read your pdf files. Thus you have to read them yourself, extract data that determine which ones are old and new, and using AWS SDK (or CLI) to move them to Glacier.
Since the files are not too big, you could use S3 Batch along with lambda function which would do the change of the class to glacier.
Alternatively, you could do this on an EC2 instance, using S3 Inventory's CSV list of your objects (assuming large number of them).
And the most traditional way is to just list your bucket, and iterate over each object.

Related

Copy limited number of files from S3?

We are using an S3 bucket to store a growing number of small JSON files (~1KB each) that contain some build-related data. Part of our pipeline involves copying these files from S3 and putting them into memory to do some operations.
That copy operation is done via S3 cli tool command that looks something like this:
aws s3 cp s3://bucket-path ~/some/local/path/ --recursive --profile dev-profile
The problem is that the number of json files on S3 is getting pretty large since more are being made every day. It's nothing even close to the capacity of the S3 bucket since the files are so small. However, in practical terms, there's no need to copy ALL these JSON files. Realistically the system would be safe just copying the most recent 100 or so. But we do want to keep older ones around for other purposes.
So my question boils down to: is there a clean way to copy a specific number of files from S3 (maybe sorted by most recent)? Is there some kind of pruning policy we can set on an S3 bucket to delete files older than X days or something?
The aws s3 sync command in the AWS CLI sounds perfect for your needs.
It will copy only files that are New or Modified since the last sync. However, it means that the destination will need to retain a copy of the 'old' files so that they are not copied again.
Alternatively, you could write a script (eg in Python) that lists the objects in S3 and then only copies objects added since the last time the copy was run.
You can set the Lifecycle policies to the S3 buckets which will remove them after certain period of time.
To copy only some days old objects you will need to write a script

How to create one glacier archive for many s3 objects?

One of our client asks to get all the video that they uploaded to the system. The files are stored at s3. Client expect to get one link that will download archive with all the videos.
Is there a way to create such an archive without downloading files archiving it and uploading back to aws?
So far I didn't find the solution.
Is it possible to do it with glacier, or move the files to folder and expose it?
Unfortunately, you can't create a zip-like archives from existing objects directly on S3. Similarly you can't transfer them to Glacier to do this. Glacier is not going to produce a single zip or rar (or any time of) archive from multiple s3 objects for you.
Instead, you have to download them first, zip or rar (or use which ever archiving format you prefer), and the re-upload to S3. Then you can share the zip/rar with your customers.
There is also a possibility of using multi-part AWS API to merge S3 objects without downloading them. But this requires programming custom solution to merge objects (not creating zip/rar-type archives).
You can create a glacier archive for a specific prefix (what you see as a subfolder) by using AWS lifecycle rules to perform this action.
More information available here.
Original answer
There is no native way to be able to retrieve all objects as an archive via S3.
S3 simply exposes all objects as they are uploaded, unfortunately you will need to perform the archiving as a separate process afterwards.

How can I detect orphaned objects in S3 that aren't mapped to our database?

I am trying to find possible orphans in an S3 bucket. What I mean is that we might delete something out of the DB, and for whatever reason, it doesn't get cleared from S3. This can be a bug in our system or something of that nature. I want to double check against our API that the object in S3 maps to something that exists - the naming convention let's us map things together like that.
Scraping an entire bucket every X days seems unscalable. I was thinking that for each object in the bucket, it can add itself to an SQS queue for the relevant checking to happen, every 30 days or so.
I've only found events around uploads and specific modifications over at https://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html. Is there anything more generalized I can't find? Any creative solutions to this problem?
You should activate Amazon S3 Inventory, which can provide a regular CSV file (as often as daily) that contains a list of every object in the Amazon S3 bucket.
You could then trigger some code that compares the contents of the CSV file against the database to find 'orphan' objects.

How to diff very large buckets in Amazon S3?

I have a use case where I have to back up a 200+TB, 18M object S3 bucket to another account that changes often (used in batch processing of critical data). I need to add a verification step, but due to the large size of both bucket, object count, and frequency of change this is tricky.
My current thoughts are to pull the eTags from the original bucket and archive bucket, and the write a streaming diff tool to compare the values. Has anyone here had to approach this problem and if so did you come up with a better answer?
Firstly, if you wish to keep two buckets in sync (once you've done the initial sync), you can use Cross-Region Replication (CRR).
To do the initial sync, you could try using the AWS Command-Line Interface (CLI), which has a aws s3 sync command. However, it might have some difficulties with a large number of files -- I suggest you give it a try. It uses keys, dates and filesize to determine which files to sync.
If you do wish to create your own sync app, then eTag is definitely a definitive way to compare files.
To make things simple, activate Amazon S3 Inventory, which can provide a daily listing of all files in a bucket, including eTag. You could then do a comparison between the Inventory files to discover which remaining files require synchronization.
For anyone looking for a way to solve this problem in an automated way (as was I),
I created a small python script that leverages S3 Inventories and Athena to do the comparison somewhat efficiently. (This is basically automation of John Rosenstein's suggestion)
You can find it here https://github.com/forter/s3-compare

Download bulk objects from Amazon S3 bucket

I have a large bucket folder with over 30 million object (images). Now, I need to download only 700,000 object (image) from that large folder.
I have the names of objects (images) that I need to download in a .txt file.
I can use AWS CLI, but not sure if it support downloading many objects at one command.
Is there a straight forward solution for that you would have in mind?