Adding metadata to millions of S3 objects - amazon-web-services

I have an S3 bucket with over 20 million objects (2.3TB).
The objects need to have their Content-Disposition metadata populated with user-defined names while preserving their existing Content-Type metadata.
The file names are stored in a separate RDS database.
It looks like I could use the copy command for a small number of files, but with a bucket this big that doesn't really sound like a sane option.
Any help would be greatly appreciated!

It seems like a perfect use case for S3 Batch Operations. You could create a Lambda function that applies your changes, and S3 Batch Operations will invoke it concurrently across the bucket.
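For reference, here is a minimal sketch of what such a Lambda function could look like, following the S3 Batch Operations Lambda invocation/response schema (version 1.0). The lookup_filename() helper is a placeholder standing in for the RDS query, and the key handling assumes the manifest keys are URL-encoded:

    import urllib.parse
    import boto3

    s3 = boto3.client("s3")

    def lookup_filename(key):
        # Placeholder: query the RDS database for the user-defined name of this key.
        raise NotImplementedError

    def handler(event, context):
        results = []
        for task in event["tasks"]:
            bucket = task["s3BucketArn"].split(":::")[-1]
            key = urllib.parse.unquote_plus(task["s3Key"])
            try:
                head = s3.head_object(Bucket=bucket, Key=key)
                filename = lookup_filename(key)
                # Copy the object onto itself, replacing Content-Disposition while
                # preserving the existing Content-Type and user metadata.
                s3.copy_object(
                    Bucket=bucket,
                    Key=key,
                    CopySource={"Bucket": bucket, "Key": key},
                    MetadataDirective="REPLACE",
                    ContentType=head["ContentType"],
                    ContentDisposition=f'attachment; filename="{filename}"',
                    Metadata=head.get("Metadata", {}),
                )
                results.append({"taskId": task["taskId"], "resultCode": "Succeeded", "resultString": key})
            except Exception as exc:
                results.append({"taskId": task["taskId"], "resultCode": "PermanentFailure", "resultString": str(exc)})
        return {
            "invocationSchemaVersion": "1.0",
            "treatMissingKeysAs": "PermanentFailure",
            "invocationId": event["invocationId"],
            "results": results,
        }

Note that copy_object only works for objects up to 5 GB; anything larger needs a multipart copy.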

Related

Efficient way to Copy/Replicate S3 Objects?

I need to replicate millions of S3 objects (a one-time job) by modifying their metadata, keeping the same bucket and object paths.
To do this, we have the options listed below and need to choose the most cost-effective method:
AWS COPY requests
AWS Batch Operations
AWS DataSync
References:
https://repost.aws/knowledge-center/s3-large-transfer-between-buckets
I've read the AWS docs but could not work out which one is better in terms of cost.
To update metadata on an Amazon S3 object, it is necessary to COPY the object to itself while specifying the new metadata.
From Copying objects - Amazon Simple Storage Service:
Each Amazon S3 object has metadata. It is a set of name-value pairs. You can set object metadata at the time you upload it. After you upload the object, you cannot modify object metadata. The only way to modify object metadata is to make a copy of the object and set the metadata. In the copy operation, set the same object as the source and target.
However, you have a choice as to how to trigger the COPY operation:
You can write your own code that loops through the objects and performs the copy, or
You can use S3 Batch Operations to perform the copy
Given that you have millions of objects, I would recommend using S3 Batch Operations since it can perform the process with massive scale.
I would recommend this process:
Activate Amazon S3 Inventory on the bucket, which can provide a daily or weekly CSV file listing all objects.
Take the S3 Inventory output file and use it as the manifest file for the batch operation. You may need to edit the file (either via code or an Excel spreadsheet) so that it contains just the bucket and key columns; the copy-to-same-location behaviour and the desired replacement metadata are specified on the Batch Operations job itself rather than in the manifest.
Submit the manifest file to S3 Batch Operations. (It can take some time to start executing.) A boto3 sketch of creating such a job follows this answer.
I suggest that you try the S3 Batch Operations step on a small subset of objects (eg 10 objects) first to confirm that it operates the way you expect. This will be relatively fast and will surface any errors before you run the full job.
Note that S3 Batch Operations charges $1.00 per million object operations performed.
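If you prefer to create the job from code rather than the console, a rough boto3 sketch is below. The account ID, bucket names, role ARN, manifest ETag and replacement metadata are all placeholders, and the new metadata here is applied uniformly to every object in the manifest:

    import boto3

    s3control = boto3.client("s3control")

    response = s3control.create_job(
        AccountId="111111111111",                         # placeholder account ID
        ConfirmationRequired=True,
        Priority=10,
        RoleArn="arn:aws:iam::111111111111:role/batch-operations-role",
        Operation={
            "S3PutObjectCopy": {
                "TargetResource": "arn:aws:s3:::my-bucket",   # copy back into the same bucket
                "MetadataDirective": "REPLACE",
                "NewObjectMetadata": {
                    "UserMetadata": {"reviewed": "true"},     # example replacement metadata
                },
            }
        },
        Manifest={
            "Spec": {
                "Format": "S3BatchOperations_CSV_20180820",
                "Fields": ["Bucket", "Key"],
            },
            "Location": {
                "ObjectArn": "arn:aws:s3:::my-bucket/manifests/objects.csv",
                "ETag": "manifest-object-etag",               # ETag of the manifest object
            },
        },
        Report={
            "Bucket": "arn:aws:s3:::my-bucket",
            "Format": "Report_CSV_20180820",
            "Enabled": True,
            "Prefix": "batch-reports",
            "ReportScope": "FailedTasksOnly",
        },
    )
    print("Created job:", response["JobId"])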

S3 Bucket AWS CLI takes forever to get specific files

I have a log archive bucket, and that bucket has 2.5m+ objects.
I am looking to download files from some specific time periods. I have tried different methods for this, but all of them are failing.
My observation is that these queries start from the oldest files, but the files I'm after are the newest ones, so it takes forever to reach them.
aws s3 sync s3://mybucket . --exclude "*" --include "2021.12.2*" --include "2021.12.3*" --include "2022.01.01*"
Am I doing something wrong?
Is it possible to make these queries start from the newest files so they might take less time to complete?
I also tried using S3 Browser and CloudBerry. Same problem. I tried from an EC2 instance that is inside the same AWS network. Same problem.
2.5m+ objects in an Amazon S3 bucket is indeed a large number of objects!
When listing the contents of an Amazon S3 bucket, the S3 API only returns 1000 objects per API call. Therefore, when the AWS CLI (or CloudBerry, etc) is listing the objects in the S3 bucket it requires 2500+ API calls. This is most probably the reason why the request is taking so long (and possibly failing due to lack of memory to store the results).
You can possibly reduce the time by specifying a Prefix, which reduces the number of objects returned from the API calls. This would help if the objects you want to copy are all in a sub-folder.
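For example, if the object keys begin with the date (as the include patterns in the question suggest), a prefix listing only returns the matching keys. A minimal boto3 sketch, assuming that key layout:

    import boto3

    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")

    # Only keys beginning with "2021.12.3" are returned, so far fewer API calls are needed.
    for page in paginator.paginate(Bucket="mybucket", Prefix="2021.12.3"):
        for obj in page.get("Contents", []):
            print(obj["Key"], obj["Size"])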
Failing that, you could use Amazon S3 Inventory, which can provide a daily or weekly CSV file listing all objects. You could then extract from that CSV file a list of the objects you want to copy (eg using Excel or a small program to parse the file). Then copy those specific objects using aws s3 cp or from a programming language. For example, a Python program could parse the inventory file and then use download_file() to download each of the desired objects, as sketched below.
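A minimal sketch of that approach, assuming a CSV-format inventory report whose first two columns are bucket and key (keys in S3 Inventory reports are URL-encoded):

    import csv
    import urllib.parse
    import boto3

    s3 = boto3.client("s3")

    with open("inventory.csv", newline="") as f:
        for row in csv.reader(f):
            bucket, key = row[0], urllib.parse.unquote_plus(row[1])
            # Keep only the date ranges we care about, then download each object.
            if key.startswith(("2021.12.2", "2021.12.3", "2022.01.01")):
                s3.download_file(bucket, key, key.replace("/", "_"))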
The simple fact is that a flat-structure Amazon S3 bucket with 2.5m+ objects will always be difficult to list. If possible, I would encourage you to use 'folders' to structure the bucket so that you would only need to list portions of the bucket at a time.

Fastest and most cost efficient way to copy over an S3 bucket from another AWS account

I have an S3 bucket that is 9TB and I want to copy it over to another AWS account.
What would be the fastest and most cost efficient way to copy it?
I know I can rsync them and also use S3 replication.
I think rsync will take too long and will be a bit pricey.
I have not played with S3 replication so I am not sure of its speed and cost.
Are there any other methods that I might not be aware of?
FYI - The source and destination buckets will be in the same region (but different accounts).
There is no quicker way to do it than using sync, and I do not believe it is that pricey. You do not mention the number of files you are copying, though.
You will pay $0.004 per 10,000 requests for the GET operations on the files you are copying and then $0.005 per 1,000 requests for the PUT operations on the files you are writing; for example, copying 1 million objects works out to roughly $0.40 in GET requests and $5.00 in PUT requests. Also, I believe you won't pay data transfer costs since this is within the same region.
If you want to speed this up, you could use multiple sync jobs if the bucket has a way of being logically divided, eg s3://examplebucket/job1 and s3://examplebucket/job2.
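One rough way to fan those sync jobs out, assuming the bucket divides cleanly by prefix (the prefixes and bucket names below are placeholders):

    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    prefixes = ["job1/", "job2/", "job3/", "job4/"]   # hypothetical logical divisions

    def sync(prefix):
        # Each worker runs its own "aws s3 sync" for one prefix.
        return subprocess.run(
            ["aws", "s3", "sync",
             f"s3://examplebucket/{prefix}",
             f"s3://destinationbucket/{prefix}"],
            check=True,
        )

    with ThreadPoolExecutor(max_workers=len(prefixes)) as pool:
        list(pool.map(sync, prefixes))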
You can use S3 Batch Operations to copy large quantities of objects between buckets in the same region.
It can accept a CSV file containing a list of objects, or you can use the output of Amazon S3 Inventory, which can provide a daily or weekly CSV file listing all objects.
While copying, it can also update tags, metadata and ACLs.
See: Cross-account bulk transfer of files using Amazon S3 Batch Operations | AWS Storage Blog
I wound up finding the page below and used replication together with the copy-to-itself method.
https://aws.amazon.com/premiumsupport/knowledge-center/s3-large-transfer-between-buckets/

AWS 100 TB data transformation at rest S3

I have about 50 TB of data in an S3 bucket, and the bucket doesn't have any partitioning. The files are JSON files, approximately 100 KB each in size.
I need to partition this data and put it in a different S3 bucket, either storing it in a structure of yyyy/mm/dd/filename.json, or adding a custom metadata field to each file containing its original last-modified date and moving it to the different bucket.
I have looked into options like
Doing it with a Spark cluster, mounting both buckets as DBFS and then doing the transformation and copying to the destination bucket.
I have also tried writing a Lambda function which can do the same for a given file and invoking it from another program; 1,000 files take about 15 seconds to copy. (A sketch of this per-object copy appears after this question.)
I also looked into generating an S3 Inventory and running a job on it, but that is not customizable enough to add metadata or create a partition structure, so to speak.
Is there an obvious choice I may be missing, or are there better ways to do it?
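For reference, a minimal sketch of the per-object copy described in the question, which derives the yyyy/mm/dd partition from the object's original last-modified date and also stores that date as custom metadata (bucket names are placeholders; copy_object is fine here since the files are around 100 KB):

    import boto3

    s3 = boto3.client("s3")

    def copy_with_partition(src_bucket, dst_bucket, key):
        head = s3.head_object(Bucket=src_bucket, Key=key)
        lm = head["LastModified"]
        # Build a partitioned destination key such as 2023/07/14/filename.json.
        dst_key = f"{lm:%Y/%m/%d}/{key.rsplit('/', 1)[-1]}"
        s3.copy_object(
            Bucket=dst_bucket,
            Key=dst_key,
            CopySource={"Bucket": src_bucket, "Key": key},
            MetadataDirective="REPLACE",
            ContentType=head["ContentType"],
            Metadata={**head.get("Metadata", {}),
                      "original-lastmodified": lm.isoformat()},
        )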

AWS S3 'CloudSearch'

I have gone through the Amazon SDK/documentation and there isn't a lot around programmatically querying/searching for documents in an S3 bucket.
Sure, I can get a document by ID/name, but I want the ability to search by other metadata tags such as author.
I would appreciate some guidance and a specific example of a query being executed, rather than a local iteration once all documents or items have been pulled down locally.
[…] there isn't a lot around programmatically querying/searching for documents in an S3 bucket.
Right. S3 is flat file storage, and doesn't provide a query interface.
[…] I want the ability to search by other metadata tags such as author.
This will need to be solved by your application logic. This is not built-in to S3.
For example, you can store the metadata about an S3 document/file in DynamoDB. You query DynamoDB for the metadata, which includes a pointer to the file in S3.
Unfortunately, if you already have a bunch of files in S3, you'll need to find a way to build that initial index of your data.
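A minimal sketch of that pattern, assuming a hypothetical DynamoDB table named document-index with a global secondary index called author-index on the author attribute:

    import boto3
    from boto3.dynamodb.conditions import Key

    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("document-index")

    # Index a document: store searchable metadata plus a pointer to the S3 object.
    table.put_item(Item={
        "doc_id": "report-2024-001",
        "author": "jsmith",
        "s3_bucket": "my-docs",
        "s3_key": "reports/report-2024-001.pdf",
    })

    # Search by author via the secondary index; each item points back at an S3 key.
    resp = table.query(
        IndexName="author-index",
        KeyConditionExpression=Key("author").eq("jsmith"),
    )
    for item in resp["Items"]:
        print(item["s3_bucket"], item["s3_key"])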
Amazon just released new features for CloudSearch:
http://aws.amazon.com/about-aws/whats-new/2014/03/24/amazon-cloudsearch-introduces-powerful-new-search-and-admin-features/