Efficient way to Copy/Replicate S3 Objects? - amazon-web-services

I need to replicate millions of S3 objects (a one-time task) by modifying their metadata, keeping the same bucket and object path.
To do this, we have the options listed below and need to choose the most cost-effective method:
AWS COPY requests
AWS Batch Operations
AWS DataSync
References:
https://repost.aws/knowledge-center/s3-large-transfer-between-buckets
I've read the AWS docs but could not work out which one is better in terms of cost.

To update metadata on an Amazon S3 object, it is necessary to COPY the object to itself while specifying the new metadata.
From Copying objects - Amazon Simple Storage Service:
Each Amazon S3 object has metadata. It is a set of name-value pairs. You can set object metadata at the time you upload it. After you upload the object, you cannot modify object metadata. The only way to modify object metadata is to make a copy of the object and set the metadata. In the copy operation, set the same object as the source and target.
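For reference, here is a minimal boto3 sketch of that copy-onto-itself operation; the bucket, key, and metadata values are placeholders, and note that copy_object only handles objects up to 5 GB (larger objects need a multipart copy):

import boto3

s3 = boto3.client('s3')

bucket = 'my-bucket'          # placeholder bucket name
key = 'path/to/object.txt'    # placeholder object key

# Copy the object onto itself, replacing its metadata.
# MetadataDirective='REPLACE' tells S3 to use the metadata supplied
# in this request instead of copying the existing metadata.
s3.copy_object(
    Bucket=bucket,
    Key=key,
    CopySource={'Bucket': bucket, 'Key': key},
    Metadata={'my-new-key': 'my-new-value'},
    MetadataDirective='REPLACE',
)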
However, you have a choice as to how to trigger the COPY operation:
You can write your own code that loops through the objects and performs the copy, or
You can use S3 Batch Operations to perform the copy
Given that you have millions of objects, I would recommend using S3 Batch Operations since it can perform the process with massive scale.
I would recommend this process:
Activate Amazon S3 Inventory on the bucket, which can provide a daily or weekly CSV file listing all objects.
Take the S3 Inventory output file and use it as the manifest file for the batch operation. You may need to edit the file (either via code or a spreadsheet) so that it lists just the objects you want; the copy-to-itself destination and the desired metadata are then specified when you create the Batch Operations job.
Submit the manifest file to S3 Batch Operations. (It can take some time to start executing.)
I suggest that you try the S3 Batch Operations step on a subset of objects (eg 10 objects) first to confirm that it operates the way you expect. This will be relatively fast and will avoid any potential errors.
Note that S3 Batch Operations charges $1.00 per million object operations performed.
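For what it's worth, here is a rough boto3 sketch of what submitting such a Batch Operations copy job could look like; the account ID, ARNs, manifest location and metadata are all placeholders you would need to replace, so treat this as an outline rather than a ready-to-run job definition:

import boto3

s3control = boto3.client('s3control')

response = s3control.create_job(
    AccountId='111122223333',          # placeholder account ID
    ConfirmationRequired=False,        # start without manual confirmation
    Priority=10,
    RoleArn='arn:aws:iam::111122223333:role/s3-batch-copy-role',  # placeholder role
    Operation={
        'S3PutObjectCopy': {
            # Same bucket as the source, so objects are copied onto themselves
            'TargetResource': 'arn:aws:s3:::my-bucket',
            'MetadataDirective': 'REPLACE',
            'NewObjectMetadata': {'UserMetadata': {'my-key': 'my-value'}},
        }
    },
    Manifest={
        'Spec': {
            'Format': 'S3BatchOperations_CSV_20180820',
            'Fields': ['Bucket', 'Key'],
        },
        'Location': {
            'ObjectArn': 'arn:aws:s3:::my-bucket/manifests/manifest.csv',  # placeholder
            'ETag': 'manifest-object-etag',                                # placeholder
        },
    },
    Report={
        'Bucket': 'arn:aws:s3:::my-bucket',   # where to write the completion report
        'Prefix': 'batch-reports',
        'Format': 'Report_CSV_20180820',
        'Enabled': True,
        'ReportScope': 'FailedTasksOnly',
    },
)
print(response['JobId'])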

Related

S3 Bucket AWS CLI takes forever to get specific files

I have a log archive bucket, and that bucket has 2.5m+ objects.
I want to download files from a specific time period. I have tried different methods, but all of them are failing.
My observation is that these queries start from the oldest files, but the files I need are the newest ones, so it takes forever to find them.
aws s3 sync s3://mybucket . --exclude "*" --include "2021.12.2*" --include "2021.12.3*" --include "2022.01.01*"
Am I doing something wrong?
Is it possible to make these query start from newest files so it might take less time to complete?
I also tried using S3 Browser and CloudBerry. Same problem. I also tried from an EC2 instance inside the same AWS network. Same problem.
2.5m+ objects in an Amazon S3 bucket is indeed a large number of objects!
When listing the contents of an Amazon S3 bucket, the S3 API only returns 1000 objects per API call. Therefore, when the AWS CLI (or CloudBerry, etc) is listing the objects in the S3 bucket it requires 2500+ API calls. This is most probably the reason why the request is taking so long (and possibly failing due to lack of memory to store the results).
You can possibly reduce the time by specifying a Prefix, which reduces the number of objects returned by each API call. This would help if the objects you want to download are all in a sub-folder.
Failing that, you could use Amazon S3 Inventory, which can provide a daily or weekly CSV file listing all objects. You could then extract from that CSV file a list of objects you want to download (eg use Excel or write a program to parse the file). Then, specifically copy those objects using aws s3 cp or from a programming language. For example, a Python program could parse the file and then use download_file() to download each of the desired objects.
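As a rough illustration of that approach (the CSV file name, bucket and date prefixes are placeholders based on the patterns in the question), a short Python script could filter the Inventory report and download only the matching objects:

import csv
import boto3

s3 = boto3.client('s3')
bucket = 'mybucket'                      # placeholder bucket name
wanted = ('2021.12.2', '2021.12.3', '2022.01.01')

# A CSV-format S3 Inventory report typically lists bucket, key, size, ...
with open('inventory.csv', newline='') as f:
    for row in csv.reader(f):
        key = row[1]
        if key.startswith(wanted):
            # Download each matching object; flatten any '/' in the key
            s3.download_file(bucket, key, key.replace('/', '_'))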
The simple fact is that a flat-structure Amazon S3 bucket with 2.5m+ objects will always be difficult to list. If possible, I would encourage you to use 'folders' to structure the bucket so that you would only need to list portions of the bucket at a time.

Best way to move contents from one s3 object/folder to another within the same bucket?

I have a job that needs to transfer ~150GB from one folder into another. This runs once a day.
def copy_new_data_to_official_location(bucket_name):
    s3 = retrieve_aws_connection('s3')

    # Note: list_objects returns at most 1,000 objects per call
    objects_to_move = s3.list_objects(
        Bucket=bucket_name, Prefix='my/prefix/here')

    for item in objects_to_move['Contents']:
        print(item['Key'])
        copy_source = {
            'Bucket': bucket_name,
            'Key': item['Key']
        }
        # Server-side copy within the same bucket under a shortened key
        original_key_name = item['Key'].split('/')[2]
        s3.copy(copy_source, bucket_name, original_key_name)
That is what I currently have. This process takes a bit of time and also, if I'm reading correctly, I'm paying transfer fees for moving data between objects.
Is there a better way?
Flow:
Run large scale job on Spark to feed data in from folder_1 and external source
Copy output to folder_2
Delete all contents from folder_1
Copy contents of folder_2 to folder_1
Repeat above flow on daily cadence.
Spark is a bit strange, so I need to copy the output to folder_2; pointing it directly at folder_1 causes a data wipe before the job even kicks off.
There are no Data Transfer fees if the source and destination buckets are in the same Region. Since you are simply copying within the same bucket, there would be no Data Transfer fee.
150 GB is not very much data, but it can take some time to copy if there are many objects. The overhead of initiating the copy can sometimes take more time than actually copying the data. When using the copy() command, all data is transferred within Amazon S3 -- nothing is copied down to the computer where the command is issued.
There are several ways you could make the process faster:
You could issue the copy() commands in parallel. In fact, this is how the AWS Command-Line Interface (CLI) works when using aws s3 cp --recursive and aws s3 sync. (A rough sketch of this appears after this list.)
You could use the AWS CLI to copy the objects rather than writing your own program.
Instead of copying objects once per day, you could configure replication within Amazon S3 so that objects are copied as soon as they are created. (Although I haven't tried this with the same source and destination bucket.)
If you need to be more selective about the objects to copy immediately, you could configure Amazon S3 to trigger an AWS Lambda function whenever a new object is created. The Lambda function could apply some business logic to determine whether to copy the object, and then it can issue the copy() command.
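As a rough sketch of the parallel-copy idea from the first bullet (the bucket name and prefixes are placeholders, and the thread count is arbitrary), boto3's copy() can be fanned out across a thread pool:

import boto3
from concurrent.futures import ThreadPoolExecutor

s3 = boto3.client('s3')
bucket = 'my-bucket'        # placeholder
src_prefix = 'folder_1/'    # placeholder source 'folder'
dst_prefix = 'folder_2/'    # placeholder destination 'folder'

def copy_one(key):
    # Server-side copy within S3; nothing is downloaded to this machine
    s3.copy({'Bucket': bucket, 'Key': key}, bucket,
            key.replace(src_prefix, dst_prefix, 1))

paginator = s3.get_paginator('list_objects_v2')
with ThreadPoolExecutor(max_workers=20) as pool:
    for page in paginator.paginate(Bucket=bucket, Prefix=src_prefix):
        for obj in page.get('Contents', []):
            pool.submit(copy_one, obj['Key'])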

right way to move large objects between folders/buckets in S3

I need to move some large files (1 terabyte to 5 terabytes) from one S3 location to a different directory in the same bucket or to a different bucket.
There are a few ways I can think of to do this more robustly:
Trigger a Lambda function on the ObjectCreated:Put event and use boto3 to copy the file to the new location and delete the source file. Plain and simple. But if there is any error while copying the files, I lose the event. I would have to design some sort of tracking system alongside this.
Use same-region replication and delete the source once the replication is completed. I do not think any event is emitted once the object is replicated, so I am not sure about this.
Trigger a Step Function and have Copy and Delete as separate steps. This way, if the Copy or Delete step fails for some reason, I can rerun the state machine. The problem here again is: what if the file is too big for Lambda to copy?
Trigger a Lambda function on the ObjectCreated:Put event, create a data pipeline, and move the file using aws s3 mv. This can get a little expensive.
What is the right way to do this reliably?
I am looking for advice on the right approach. I am not looking for code. Please do not post aws s3 cp or aws s3 mv or aws s3api copy-object one-liners.
Your situation appears to be:
New objects are being created in Bucket A
You wish to 'move' them to Bucket B (or move them to a different location in Bucket A)
The move should happen immediately after object creation
The simplest solution, of course, would be to create the objects in the correct location without needing to move them. I will assume you have a reason for not being able to do this.
To respond to your concepts:
Using an AWS Lambda function: This is the easiest and most-responsive method. The code would need to do a multi-part copy since the objects can be large. If there is an unrecoverable error, the original object would be left in the source bucket for later retry.
Using same-region replication: This is a much easier way to copy the objects to a desired destination. S3 could push the object creation information to an Amazon SQS queue, which could be consulted for later deletion of the source object. You are right that timing would be slightly tricky. If you are fine with keeping some of the source files around for a while, the queue could be processed at regular intervals (eg every 15 minutes).
Using a Step Function: You would need something to trigger the Step Function (another Lambda function?). This is probably overkill since the first option (using Lambda) could delete the source object after a successful copy, without needing to invoke a subsequent step. However, Step Functions might be able to provide some retry functionality.
Use Data Pipeline: Don't. Enough said.
Using an AWS Lambda function to copy an object would require it to send a Copy command for each part of an object, thereby performing a multi-part copy. This can be made faster by running multiple requests in parallel through multiple threads. (I haven't tried that in Lambda, but it should work.)
Such multi-threading has already been implemented in the AWS CLI. So, another option would be to trigger an AWS Lambda function (#1 above) that calls out to run the AWS CLI aws s3 mv command. Yes, this is possible, see: How to use AWS CLI within a Lambda function (aws s3 sync from Lambda) :: Ilya Bezdelev. The benefit of this method is that the code already exists, it works, using aws s3 mv will delete the object after it is successfully copied, and it will run very fast because the AWS CLI implements multi-part copying in parallel.
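If you prefer to keep everything in boto3 (option #1 above), a minimal sketch of such a Lambda handler might look like the following; the destination bucket is a placeholder, boto3's managed copy() is assumed to handle the multipart parts and threading, and Lambda's 15-minute timeout should be kept in mind for very large objects:

import urllib.parse
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client('s3')
DEST_BUCKET = 'destination-bucket'   # placeholder

# Use multipart copy with several parallel threads for large objects
config = TransferConfig(multipart_threshold=100 * 1024 * 1024, max_concurrency=10)

def handler(event, context):
    for record in event['Records']:
        src_bucket = record['s3']['bucket']['name']
        src_key = urllib.parse.unquote_plus(record['s3']['object']['key'])
        # Managed transfer: boto3 switches to multipart copy above the threshold
        s3.copy({'Bucket': src_bucket, 'Key': src_key}, DEST_BUCKET, src_key,
                Config=config)
        # Delete the source only after the copy succeeded
        s3.delete_object(Bucket=src_bucket, Key=src_key)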

Fastest and most cost efficient way to copy over an S3 bucket from another AWS account

I have an S3 bucket that is 9TB and I want to copy it over to another AWS account.
What would be the fastest and most cost efficient way to copy it?
I know I can rsync them and also use S3 replication.
I think rsync will take too long and will be a bit pricey.
I have not played with S3 replication so I am not sure of its speed and cost.
Are there any other methods that I might not be aware of?
FYI - The source and destination buckets will be in the same region (but different accounts).
There is no quicker way to do it than using sync, and I do not believe it is that pricey. You do not mention the number of files you are copying, though.
You will pay $0.004 / 10,000 requests on the GET operations on the files you are copying and then $0.005 / 1,000 requests on the PUT operations on the files you are writing. Also, I believe you won't pay data transfer costs if this is in the same region.
If you want to speed this up, you could run multiple sync jobs in parallel if the bucket can be logically divided, e.g. s3://examplebucket/job1 and s3://examplebucket/job2.
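For example (bucket names and prefixes are placeholders), the split might look like running these in separate terminals or as background jobs:
aws s3 sync s3://examplebucket/job1 s3://destinationbucket/job1
aws s3 sync s3://examplebucket/job2 s3://destinationbucket/job2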
You can use S3 Batch Operations to copy large quantities of objects between buckets in the same region.
It can accept a CSV file containing a list of objects, or you can use the output of Amazon S3 Inventory, which can provide a daily or weekly CSV file listing all objects.
While copying, it can also update tags, metadata and ACLs.
See: Cross-account bulk transfer of files using Amazon S3 Batch Operations | AWS Storage Blog
I wound up finding the page below and used replication with the copy to itself method.
https://aws.amazon.com/premiumsupport/knowledge-center/s3-large-transfer-between-buckets/

AWS S3: How do I enable S3 object encryption for objects that existed before?

I had a series of buckets that did not have encryption turned on. The boto3 code to turn it on is easy; I'm just using basic AES256.
Unfortunately, any object that already exists will not have server-side encryption set. I've been looking at the API and cannot find a call to change that attribute. Via the console it is there, but I am not about to do that for 10,000 objects.
I'm not willing to copy that much data out and then back in again.
The S3 object put call looks like it expects to write a new object; it does not seem to update an existing one.
Anyone willing to offer a pointer?
Amazon S3 has the ability to do a COPY operation where the source file and the destination file are the same (in object name only). This copy operation happens on S3, which means that you do not need to download and reupload the file.
To turn on encryption for a file, known as Server-Side Encryption (SSE, AES-256), you can use the AWS CLI copy command:
aws s3 cp s3://mybucket/myfile.zip s3://mybucket/myfile.zip --sse
The source file will be copied to the destination (notice the same object names) and SSE will be enabled (the file will be encrypted).
If you have a list of files, you could easily create a batch script to process each file.
Or you could write a simple Python program to scan each file in S3 and, if SSE is not enabled, encrypt it with the AWS CLI command or with the Python S3 APIs.
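A minimal sketch of that kind of scan-and-encrypt script (the bucket name is a placeholder, and this assumes objects under 5 GB, since copy_object cannot handle larger ones) might look like:

import boto3

s3 = boto3.client('s3')
bucket = 'mybucket'   # placeholder bucket name

paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=bucket):
    for obj in page.get('Contents', []):
        key = obj['Key']
        head = s3.head_object(Bucket=bucket, Key=key)
        if head.get('ServerSideEncryption') != 'AES256':
            # Copy the object onto itself, requesting SSE-S3 (AES256) encryption
            s3.copy_object(
                Bucket=bucket,
                Key=key,
                CopySource={'Bucket': bucket, 'Key': key},
                ServerSideEncryption='AES256',
            )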
I've been reading and talking to friends. I tried something for the heck of it.
aws s3 cp s3://bucket/tools/README.md s3://bucket/tools/README.md
Encryption was turned on. Is AWS smart enough to recognize this and just apply the bucket's encryption policy, or did it really re-copy the object on top of itself?
You can do something like this to copy objects between buckets and encrypt them.
But copying is not without side effects; to understand what happens behind the scenes, we have to look at the S3 user guide.
Each object has metadata. Some of it is system metadata and some is user-defined. Users control some of the system metadata, such as the storage class to use for the object and the server-side encryption configuration. When you copy an object, user-controlled system metadata and user-defined metadata are also copied. Amazon S3 resets the system-controlled metadata. For example, when you copy an object, Amazon S3 resets the creation date of the copied object. You don't need to set any of these values in your copy request.
You can find more about metadata here.
Note that if you choose to update any of the object's user-configurable metadata (system or user-defined) during the copy, you must explicitly specify in your request all the user-configurable metadata present on the source object, even if you are only changing one of the values.
You will also have to pay for the copy requests; however, there won't be any charge for the delete requests. Since there is no need to copy objects between regions in this case, you won't be charged for bandwidth.
So keep these points in mind when going ahead with copying objects in S3.