I need to move some large files (1 terabyte to 5 terabytes) from one S3 location to a different directory in the same bucket or to a different bucket.
There are a few ways I can think of to do this robustly.
Trigger a Lambda function on an ObjectCreated:Put trigger and use boto3 to copy the file to the new location and delete the source file. Plain and simple. But if there is any error while copying the files, I lose the event. I would have to design some sort of tracking system alongside this.
Use Same-Region Replication and delete the source once the replication is complete. I do not think any event is emitted once the object is replicated, so I am not sure about this approach.
Trigger a Step Functions state machine with Copy and Delete as separate steps. That way, if the Copy or Delete step fails for some reason, I can rerun the state machine. The problem here again: what if the file is too big for Lambda to copy?
Trigger a Lambda function on an ObjectCreated:Put trigger, create a Data Pipeline, and move the file using aws s3 mv. This can get a little expensive.
What is the right way of doing this reliably?
I am looking for advice on the right approach. I am not looking for code. Please do not post aws s3 cp or aws s3 mv or aws s3api copy-object one-line commands.
Your situation appears to be:
New objects are being created in Bucket A
You wish to 'move' them to Bucket B (or move them to a different location in Bucket A)
The move should happen immediately after object creation
The simplest solution, of course, would be to create the objects in the correct location without needing to move them. I will assume you have a reason for not being able to do this.
To respond to your concepts:
Using an AWS Lambda function: This is the easiest and most responsive method. The code would need to do a multi-part copy, since the objects can be large. If there is an unrecoverable error, the original object would be left in the source bucket for a later retry.
Using same-region replication: This is a much easier way to copy the objects to a desired destination. S3 could push the object-creation information to an Amazon SQS queue, which could be consulted for later deletion of the source object. You are right that the timing would be slightly tricky. If you are fine with keeping some of the source files around for a while, the queue could be processed at regular intervals (e.g. every 15 minutes); a rough sketch of such a queue-driven cleanup appears after this list.
Using a Step Function: You would need something to trigger the Step Function (another Lambda function?). This is probably overkill since the first option (using Lambda) could delete the source object after a successful copy, without needing to invoke a subsequent step. However, Step Functions might be able to provide some retry functionality.
Using Data Pipeline: Don't. Enough said.
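As a rough illustration of the replication-plus-queue idea above, a small scheduled job could drain the SQS queue and delete source objects whose replication has finished. This is a minimal sketch, not a definitive implementation: the queue URL is a placeholder, it assumes the queue receives the bucket's s3:ObjectCreated events, and the exact ReplicationStatus string returned by head_object should be verified for your setup.

import json
import boto3
from urllib.parse import unquote_plus

# Placeholder queue URL for illustration only.
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/replication-cleanup"

sqs = boto3.client("sqs")
s3 = boto3.client("s3")

def drain_queue_once():
    """Read pending S3 event messages and delete source objects that have replicated."""
    response = sqs.receive_message(
        QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=5)
    for message in response.get("Messages", []):
        body = json.loads(message["Body"])
        for record in body.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            key = unquote_plus(record["s3"]["object"]["key"])
            head = s3.head_object(Bucket=bucket, Key=key)
            # Only delete once the source object is no longer pending replication
            # (verify the exact status value S3 returns in your account).
            if head.get("ReplicationStatus") in ("COMPLETE", "COMPLETED"):
                s3.delete_object(Bucket=bucket, Key=key)
                sqs.delete_message(
                    QueueUrl=QUEUE_URL, ReceiptHandle=message["ReceiptHandle"])

Messages for objects that are still pending simply return to the queue and are retried on a later run, which matches the "process every 15 minutes" idea above.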
Using an AWS Lambda function to copy an object would require it to send a Copy command for each part of an object, thereby performing a multi-part copy. This can be made faster by running multiple requests in parallel through multiple threads. (I haven't tried that in Lambda, but it should work.)
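For reference, boto3's managed copy already does this: above a size threshold it switches to a multipart copy and copies the parts on multiple threads. Here is a minimal sketch of a Lambda handler along those lines; the destination bucket and prefix are placeholders, and the source details come from the S3 event itself.

import boto3
from urllib.parse import unquote_plus
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# Placeholder destination settings for this sketch.
DEST_BUCKET = "my-destination-bucket"
DEST_PREFIX = "moved/"

# Switch to multipart copying above 100 MB and copy parts on 10 threads in parallel.
config = TransferConfig(multipart_threshold=100 * 1024 * 1024, max_concurrency=10)

def handler(event, context):
    for record in event["Records"]:
        source_bucket = record["s3"]["bucket"]["name"]
        source_key = unquote_plus(record["s3"]["object"]["key"])
        # Managed, server-side copy; boto3 issues the part copies for large objects.
        s3.copy({"Bucket": source_bucket, "Key": source_key},
                DEST_BUCKET, DEST_PREFIX + source_key, Config=config)
        # Delete the source only after the copy succeeds; on failure it stays behind for retry.
        s3.delete_object(Bucket=source_bucket, Key=source_key)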
Such multi-threading has already been implemented in the AWS CLI. So, another option would be to trigger an AWS Lambda function (#1 above) that calls out to run the AWS CLI aws s3 mv command. Yes, this is possible, see: How to use AWS CLI within a Lambda function (aws s3 sync from Lambda) :: Ilya Bezdelev. The benefit of this method is that the code already exists, it works, using aws s3 mv will delete the object after it is successfully copied, and it will run very fast because the AWS CLI implements multi-part copying in parallel.
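A very rough sketch of that CLI-from-Lambda approach, assuming the AWS CLI binary has been packaged with the function (for example as a layer, as the linked article describes); the binary path and destination bucket are illustrative only.

import subprocess
from urllib.parse import unquote_plus

# The path depends entirely on how the CLI was packaged into the function; /opt/aws is a guess.
AWS_CLI = "/opt/aws"
DEST_BUCKET = "my-destination-bucket"  # placeholder

def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])
        # 'mv' performs a parallel multipart copy and then deletes the source object.
        subprocess.run(
            [AWS_CLI, "s3", "mv", f"s3://{bucket}/{key}", f"s3://{DEST_BUCKET}/{key}"],
            check=True)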
Related
I need to copy files from one S3 bucket to another S3 bucket using an event notification (so a file should be copied as soon as it is created in the first bucket).
I am planning to use AWS Lambda (Python) and write code in Lambda to copy the files. But I am afraid that for large files the copy time might be greater than 15 minutes. Also, if Lambda fails for some reason, we lose the event.
Is there a better way to do this? Maybe somehow use an SQS queue and trigger some other copy function from that queue? Please advise on the best approach.
Required procedure:
Someone does an upload to an S3 bucket.
This triggers a Lambda function that does some processing on the uploaded file(s).
Processed objects are now copied into a "processed" folder within the same bucket.
The copy-operation in Step 3 should never re-trigger the initial Lambda function itself.
I know that the general guidance is to use a different bucket for storing the processed objects in a situation like this (but this is not possible in this case).
So my approach was to set up the S3 trigger to listen only to the PUT/POST methods and exclude the COPY method. The Lambda function itself uses boto3 (S3_CLIENT.copy_object(..)). The approach seems to work (the Lambda function does not appear to be re-triggered by the copy operation).
However, I wanted to ask whether this approach is really reliable - is it?
You can filter which events trigger the S3 notification.
In general, there are two ways to trigger Lambda from an S3 event: bucket notifications and EventBridge.
Notifications: https://docs.aws.amazon.com/AmazonS3/latest/userguide/notification-how-to-filtering.html
EventBridge: https://aws.amazon.com/blogs/aws/new-use-amazon-s3-event-notifications-with-amazon-eventbridge/
In your case, a quick search doesn't show me that you can set up a "negative" rule, i.e. "everything which doesn't have the processed prefix". But you can rework your bucket structure a bit, dump unprocessed items under an unprocessed prefix, and set up the filter based on that prefix only.
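As a sketch of that layout, the bucket notification can be scoped to the unprocessed prefix, so the function's own copies into processed/ never match the filter. The bucket name and function ARN below are placeholders.

import boto3

s3 = boto3.client("s3")

s3.put_bucket_notification_configuration(
    Bucket="my-bucket",  # placeholder
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [{
            "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:process-upload",
            # Fire only for Put/Post uploads, not for Copy operations.
            "Events": ["s3:ObjectCreated:Put", "s3:ObjectCreated:Post"],
            # Fire only for keys under the unprocessed/ prefix.
            "Filter": {"Key": {"FilterRules": [
                {"Name": "prefix", "Value": "unprocessed/"},
            ]}},
        }]
    },
)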
When setting up an S3 trigger for a Lambda function, it is possible to define which specific kinds of S3 events should be listened to (for example, Put and Post but not Copy).
I have a job that needs to transfer ~150GB from one folder into another. This runs once a day. This is what I have:
def copy_new_data_to_official_location(bucket_name):
    # retrieve_aws_connection is a local helper that returns a boto3 S3 client.
    s3 = retrieve_aws_connection('s3')
    objects_to_move = s3.list_objects(
        Bucket=bucket_name, Prefix='my/prefix/here')
    for item in objects_to_move['Contents']:
        print(item['Key'])
        copy_source = {
            'Bucket': bucket_name,
            'Key': item['Key']
        }
        # Derive the new key from the original key and copy within the same bucket (server-side).
        original_key_name = item['Key'].split('/')[2]
        s3.copy(copy_source, bucket_name, original_key_name)
This process takes a bit of time and, if I'm reading correctly, I'm paying transfer fees moving between objects.
Is there a better way?
Flow:
Run a large-scale Spark job that feeds in data from folder_1 and an external source
Copy output to folder_2
Delete all contents from folder_1
Copy contents of folder_2 to folder_1
Repeat the above flow on a daily cadence.
Spark is a bit strange, so I need to copy the output to folder_2; otherwise, writing directly to folder_1 causes a data wipe before the job even kicks off.
There are no Data Transfer fees if the source and destination buckets are in the same Region. Since you are simply copying within the same bucket, there would be no Data Transfer fee.
150 GB is not very much data, but it can take some time to copy if there are many objects. The overhead of initiating the copy can sometimes take more time than actually copying the data. When using the copy() command, all data is transferred within Amazon S3 -- nothing is copied down to the computer where the command is issued.
There are several ways you could make the process faster:
You could issue the copy() commands in parallel; in fact, this is how the AWS Command-Line Interface (CLI) works when using aws s3 cp --recursive and aws s3 sync. (A short sketch of this appears after this list.)
You could use the AWS CLI to copy the objects rather than writing your own program.
Instead of copying objects once per day, you could configure replication within Amazon S3 so that objects are copied as soon as they are created. (Although I haven't tried this with the same source and destination bucket.)
If you need to be more selective about the objects to copy immediately, you could configure Amazon S3 to trigger an AWS Lambda function whenever a new object is created. The Lambda function could apply some business logic to determine whether to copy the object, and then it can issue the copy() command.
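As a minimal sketch of the parallel-copy idea from the first bullet, the same list/copy calls can be fanned out over a thread pool (boto3 clients are thread-safe). The prefixes are illustrative.

from concurrent.futures import ThreadPoolExecutor
import boto3

s3 = boto3.client("s3")

def copy_one(bucket, key, dest_prefix):
    # Server-side copy within S3; no data comes down to this machine.
    dest_key = dest_prefix + key.split("/")[-1]
    s3.copy({"Bucket": bucket, "Key": key}, bucket, dest_key)

def copy_prefix_in_parallel(bucket, source_prefix, dest_prefix, workers=20):
    paginator = s3.get_paginator("list_objects_v2")
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = []
        for page in paginator.paginate(Bucket=bucket, Prefix=source_prefix):
            for item in page.get("Contents", []):
                futures.append(pool.submit(copy_one, bucket, item["Key"], dest_prefix))
        for f in futures:
            f.result()  # surface any copy errors

# e.g. copy_prefix_in_parallel("my-bucket", "folder_2/", "folder_1/")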
I wrote a Lambda function to copy files from one S3 bucket into another S3 bucket, and I need to move a very large number of these files. To try to meet the volume requirements, I was looking for a way to send these requests to S3 in large batches to cut down on overhead. However, I cannot find any information on how to do this in Python. There's a Batch class in the boto3 documentation, but I can't make sense of how it works or even what it actually does.
There is no underlying Amazon S3 API call that can copy multiple files in one request.
The best option is to issue requests in parallel so that they will execute faster.
The boto3 Transfer Manager might be able to assist with this effort.
Side-note: There is no such thing as a 'move' command in S3. Instead, you will need to copy, then delete. Just mentioning it for other readers.
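To make that side-note concrete, a 'move' in boto3 is just a copy followed by a delete. A minimal sketch (using the managed copy(), which also handles objects larger than the 5 GB single-request CopyObject limit):

import boto3

s3 = boto3.client("s3")

def move_object(source_bucket, source_key, dest_bucket, dest_key):
    # Managed, server-side copy (multipart for large objects).
    s3.copy({"Bucket": source_bucket, "Key": source_key}, dest_bucket, dest_key)
    # Delete the source only after the copy has succeeded.
    s3.delete_object(Bucket=source_bucket, Key=source_key)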
I wrote an AWS Lambda function in Node.js for image resizing, and it is triggered when images are uploaded.
I already have more than 1,000,000 existing images in the bucket.
I want to run this Lambda function on those images but have not found a way to do so yet.
How can I run an AWS Lambda function on the existing images in an S3 bucket?
Note: I know this question has already been asked on Stack Overflow, but the issue is that none of those questions have a working solution yet.
Unfortunately, Lambda cannot be triggered automatically for objects that already exist in an S3 bucket.
You will have to invoke your Lambda function manually for each image in your S3 bucket.
First, you will need to list existing objects in your S3 bucket using the ListObjectsV2 action.
For each object in your S3 bucket, you must then invoke your Lambda function and provide the S3 object's information as the Payload.
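A rough sketch of that backfill, assuming a hypothetical bucket and function name, and assuming the function reads the bucket/key from an S3-event-shaped payload (adjust the payload to whatever your handler actually expects):

import json
import boto3

s3 = boto3.client("s3")
lam = boto3.client("lambda")

BUCKET = "my-image-bucket"              # placeholder
FUNCTION = "my-image-resize-function"   # placeholder

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET):
    for item in page.get("Contents", []):
        # Mimic the shape of an S3 event so the existing handler code works unchanged.
        payload = {"Records": [
            {"s3": {"bucket": {"name": BUCKET}, "object": {"key": item["Key"]}}}
        ]}
        lam.invoke(FunctionName=FUNCTION,
                   InvocationType="Event",   # asynchronous invocation
                   Payload=json.dumps(payload))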
Yes, it's completely true that Lambda cannot be triggered by objects already present in your S3 bucket, but invoking your Lambda manually for each object is a completely dumb idea.
With some clever techniques you can perform your tasks on those images easily:
The hard way: write a program locally that does exactly the same thing as your Lambda function, but with two additions. First, iterate over each object in your bucket; then run your code on each one and save the result to the destination path in S3 after resizing. In other words, for all images already stored in your S3 bucket, instead of using Lambda you resize the images locally on your computer and save them back to the S3 destination.
The easiest way: first make sure that you have configured the S3 notification's event type Object Created (All) as the trigger for your Lambda.
Then move all your already-stored images to a new temporary bucket, and then move those images back to the original bucket; this is how your Lambda will get triggered for each image automatically. You can do the moving task easily by using the SDKs provided by AWS (for example, boto3 in Python).
Instead of moving, i.e. cut and paste, you can use copy and paste too.
In addition to Mausam Sharma's comment, you can run the copy between buckets using the AWS CLI:
aws s3 sync s3://SOURCE-BUCKET-NAME s3://DESTINATION-BUCKET-NAME --source-region SOURCE-REGION-NAME --region DESTINATION-REGION-NAME
from here:
https://medium.com/tensult/copy-s3-bucket-objects-across-aws-accounts-e46c15c4b9e1
You can simply copy back to the same bucket with the CLI, which will replace the original file with itself, and the Lambda will run as a result.
aws s3 cp s3://SOURCE-BUCKET-NAME s3://SOURCE-BUCKET-NAME --recursive
You can also use include/exclude glob patterns to selectively run against, say, a particular day or specific extensions.
aws s3 cp s3://SOURCE-BUCKET-NAME s3://SOURCE-BUCKET-NAME --recursive --exclude "*" --include "2020-01-15*"
It's worth noting that, like many of the other answers here, this will incur S3 request costs for reads/writes, so apply it cautiously to buckets containing lots of files.