Run AWS lambda function on existing S3 images - amazon-web-services

I wrote an AWS Lambda function in Node.js that resizes images, triggered when images are uploaded.
I already have more than 1,000,000 images in the bucket.
I want to run this Lambda function on those existing images, but I haven't found a way to do it yet.
How can I run an AWS Lambda function on the existing images of an S3 bucket?
Note: I know this question has been asked on Stack Overflow before, but none of those questions have a working solution yet.

Unfortunately, Lambda cannot be triggered automatically for objects that already exist in an S3 bucket.
You will have to invoke your Lambda function manually for each image in your S3 bucket.
First, list the existing objects in your S3 bucket using the ListObjectsV2 action.
For each object, invoke your Lambda function and provide the S3 object's information as the Payload.
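A minimal sketch of that loop in Python with boto3 (the bucket and function names are placeholders; the payload mimics the shape of a real S3 notification event, so the existing resize handler can consume it unchanged):

```python
import json

def make_s3_event(bucket: str, key: str) -> dict:
    """Build a payload shaped like a real S3 ObjectCreated notification,
    so the existing resize handler can process it unchanged."""
    return {"Records": [{
        "eventSource": "aws:s3",
        "eventName": "ObjectCreated:Put",
        "s3": {"bucket": {"name": bucket}, "object": {"key": key}},
    }]}

def invoke_for_existing_objects(bucket: str, function_name: str) -> None:
    import boto3  # only needed for the actual AWS calls

    s3 = boto3.client("s3")
    lam = boto3.client("lambda")

    # ListObjectsV2 returns at most 1000 keys per call; the paginator
    # follows the continuation tokens for us.
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket):
        for obj in page.get("Contents", []):
            lam.invoke(
                FunctionName=function_name,
                InvocationType="Event",  # async, so the loop isn't blocked
                Payload=json.dumps(make_s3_event(bucket, obj["Key"])),
            )
```

You would call it as, e.g., invoke_for_existing_objects("my-image-bucket", "my-resize-function"). The asynchronous invocation type lets Lambda absorb the million invocations at its own concurrency rather than making the loop wait on each resize.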

Yes, it's completely true that Lambda cannot be triggered by objects already present in your S3 bucket, but invoking your Lambda manually for each object is impractical at this scale.
With some clever techniques you can perform your task on those images easily:
The hard way is to write a program locally that does exactly the same thing as your Lambda function, with two additions: first, iterate over every object in your bucket; then run your code on each object and save the result to the destination path in S3 after resizing. In other words, for all images already stored in your S3 bucket, instead of using Lambda you resize the images locally on your computer and save them back to the S3 destination.
The easiest way is to first make sure that your S3 notification's event type is configured as Object Created (All) as the trigger for your Lambda.
Then move all your already-stored images to a new temporary bucket, and then move those images back to the original bucket; this way your Lambda will get triggered automatically for each image. You can do the moving easily using the SDKs provided by AWS. For example, for moving files using boto3 in Python, you can refer to this moving example in Python using boto3.
Instead of moving (i.e. cut and paste), you can also copy and paste, which leaves the originals in place during the process.
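A sketch of that round trip with boto3 (bucket names are placeholders; this copies rather than moves, so the originals are never deleted):

```python
def round_trip_pairs(keys, original, temporary):
    """Plan the two copy passes: original -> temporary, then back again.
    Returns (source_bucket, dest_bucket, key) tuples in order."""
    plan = [(original, temporary, k) for k in keys]
    plan += [(temporary, original, k) for k in keys]
    return plan

def retrigger_bucket(original, temporary):
    import boto3

    s3 = boto3.client("s3")

    # Collect every key in the original bucket (1000 per API page).
    keys = []
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=original):
        keys += [o["Key"] for o in page.get("Contents", [])]

    # The copy back into the original bucket is what fires the
    # ObjectCreated trigger and runs the resize Lambda on each image.
    for src, dst, key in round_trip_pairs(keys, original, temporary):
        s3.copy_object(Bucket=dst, Key=key,
                       CopySource={"Bucket": src, "Key": key})
```

Called as, e.g., retrigger_bucket("my-image-bucket", "my-image-bucket-temp"). Note that copy_object is limited to objects up to 5 GB, which is fine for images; larger objects would need a multi-part copy.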

In addition to Mausam Sharma's answer, you can run the copy between buckets using the AWS CLI:
aws s3 sync s3://SOURCE-BUCKET-NAME s3://DESTINATION-BUCKET-NAME --source-region SOURCE-REGION-NAME --region DESTINATION-REGION-NAME
from here:
https://medium.com/tensult/copy-s3-bucket-objects-across-aws-accounts-e46c15c4b9e1

You can simply copy back to the same bucket with the CLI, which replaces each original file with itself and runs the Lambda as a result. S3 refuses to copy an object onto itself unless something about it changes, so include a metadata directive:
aws s3 cp s3://SOURCE-BUCKET-NAME s3://SOURCE-BUCKET-NAME --recursive --metadata-directive REPLACE
You can also use include/exclude glob patterns to selectively run against, say, a particular day or specific extensions:
aws s3 cp s3://SOURCE-BUCKET-NAME s3://SOURCE-BUCKET-NAME --recursive --metadata-directive REPLACE --exclude "*" --include "2020-01-15*"
It's worth noting that, like many of the other answers here, this will incur S3 request costs for the reads and writes, so apply it cautiously on buckets containing lots of files.

Related

AWS S3 sync command from one bucket to another

I want to use the AWS S3 sync command to sync a large bucket with another bucket.
I found this answer that says the files are synced over the AWS backbone and are not copied to the local machine, but I can't find a reference anywhere in the documentation. Does anyone have proof of this behavior? Any formal documentation that explains how it works?
I tried to find something in the documentation, but there is nothing there.
To learn more about the sync command, check the CLI docs; you can refer directly to the section named:
Sync from S3 bucket to another S3 bucket
The following sync command syncs objects to a specified bucket and prefix from objects in another specified bucket and prefix by copying s3 objects. An s3 object will require copying if one of the following conditions is true:
The s3 object does not exist in the specified bucket and prefix destination.
The sizes of the two s3 objects differ.
The last modified time of the source is newer than the last modified time of the destination.
Use the S3 replication capability if you only want to replicate the data that moves from bucket1 to bucket2.

S3 Bucket AWS CLI takes forever to get specific files

I have a log archive bucket, and that bucket has 2.5m+ objects.
I am looking to download the files from a specific time period. For this I have tried different methods, but all of them fail.
My observation is that these queries start from the oldest files, but the files I seek are the newest, so it takes forever to find them.
aws s3 sync s3://mybucket . --exclude "*" --include "2021.12.2*" --include "2021.12.3*" --include "2022.01.01*"
Am I doing something wrong?
Is it possible to make these query start from newest files so it might take less time to complete?
I also tried using S3 Browser and CloudBerry; same problem. I also tried from an EC2 instance inside the same AWS network; same problem.
2.5m+ objects in an Amazon S3 bucket is indeed a large number of objects!
When listing the contents of an Amazon S3 bucket, the S3 API only returns 1000 objects per API call. Therefore, when the AWS CLI (or CloudBerry, etc) is listing the objects in the S3 bucket it requires 2500+ API calls. This is most probably the reason why the request is taking so long (and possibly failing due to lack of memory to store the results).
You can possibly reduce the time by specifying a Prefix, which reduces the number of objects returned from the API calls. This would help if the objects you want to copy are all in a sub-folder.
Failing that, you could use Amazon S3 Inventory, which can provide a daily or weekly CSV file listing all objects. You could then extract from that CSV file a list of the objects you want to copy (e.g. using Excel, or by writing a program to parse the file). Then copy exactly those objects, either with aws s3 cp or from a programming language; for example, a Python program could parse the CSV and then use download_file() to download each of the desired objects.
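A sketch of that inventory-driven approach in Python (the inventory file name and the column layout are assumptions; check the manifest of your own inventory configuration for the actual column order):

```python
import csv

def wanted_keys(rows, prefixes):
    """Pick object keys from S3 Inventory CSV rows whose key starts with
    any of the given prefixes. Assumes the common bucket,key,... layout
    with the key in the second column."""
    return [row[1] for row in rows
            if any(row[1].startswith(p) for p in prefixes)]

def download_matching(inventory_csv, bucket, prefixes):
    import boto3

    with open(inventory_csv, newline="") as f:
        keys = wanted_keys(csv.reader(f), prefixes)

    s3 = boto3.client("s3")
    for key in keys:
        # Fetch each matching object directly; no bucket listing needed.
        s3.download_file(bucket, key, key.replace("/", "_"))
```

For the time period in the question this would be, e.g., download_matching("inventory.csv", "mybucket", ["2021.12.2", "2021.12.3", "2022.01.01"]); because the keys come from the inventory file, the slow 2500+ listing calls are skipped entirely.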
The simple fact is that a flat-structure Amazon S3 bucket with 2.5m+ objects will always be difficult to list. If possible, I would encourage you to use 'folders' to structure the bucket so that you would only need to list portions of the bucket at a time.

right way to move large objects between folders/buckets in S3

I need to move some large file(s) (1 terabyte to 5 terabyte) from one S3 location to a different directory in the same bucket or to a different bucket.
There are a few ways I can think of to do this robustly.
Trigger a lambda function based on ObjectCreated:Put trigger and use boto3 to copy the file to new location and delete source file. Plain and simple. But if there is any error while copying the files, I lose the event. I have to design some sort of tracking system along with this.
Use same-region-replication and delete the source once the replication is completed. I do not think there is any event emitted once the object is replicated so I am not sure.
Trigger a Step function and have Copy and Delete as separate steps. This way if for some reason Copy or Delete steps fail, I can rerun the state machine. Here again the problem is what if the file size is too big for lambda to copy?
Trigger a lambda function based on ObjectCreated:Put trigger and create a data pipeline and move the file using aws s3 mv. This can get a little expensive.
What is the right way of doing this to make this reliable?
I am looking for advice on the right approach. I am not looking for code. Please do not post one-line aws s3 cp, aws s3 mv, or aws s3api copy-object commands.
Your situation appears to be:
New objects are being created in Bucket A
You wish to 'move' them to Bucket B (or move them to a different location in Bucket A)
The move should happen immediately after object creation
The simplest solution, of course, would be to create the objects in the correct location without needing to move them. I will assume you have a reason for not being able to do this.
To respond to your concepts:
Using an AWS Lambda function: This is the easiest and most-responsive method. The code would need to do a multi-part copy since the objects can be large. If there is an unrecoverable error, the original object would be left in the source bucket for later retry.
Using same-region replication: This is a much easier way to copy the objects to a desired destination. S3 could push the object creation information to an Amazon SQS queue, which could be consulted for later deletion of the source object. You are right that timing would be slightly tricky. If you are fine with keeping some of the source files around for a while, the queue could be processed at regular intervals (eg every 15 minutes).
Using a Step Function: You would need something to trigger the Step Function (another Lambda function?). This is probably overkill since the first option (using Lambda) could delete the source object after a successful copy, without needing to invoke a subsequent step. However, Step Functions might be able to provide some retry functionality.
Use Data Pipeline: Don't. Enough said.
Using an AWS Lambda function to copy an object would require it to send a Copy command for each part of an object, thereby performing a multi-part copy. This can be made faster by running multiple requests in parallel through multiple threads. (I haven't tried that in Lambda, but it should work.)
Such multi-threading has already been implemented in the AWS CLI. So, another option would be to trigger an AWS Lambda function (#1 above) that calls out to run the AWS CLI aws s3 mv command. Yes, this is possible, see: How to use AWS CLI within a Lambda function (aws s3 sync from Lambda) :: Ilya Bezdelev. The benefit of this method is that the code already exists, it works, using aws s3 mv will delete the object after it is successfully copied, and it will run very fast because the AWS CLI implements multi-part copying in parallel.
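A sketch of the Lambda-based move in Python with boto3 (bucket names and keys are placeholders). boto3's managed copy() is the multi-threaded implementation described above, so there is no need to hand-roll the multi-part logic:

```python
def parse_s3_uri(uri):
    """Split s3://bucket/key/with/slashes into (bucket, key)."""
    bucket, _, key = uri.removeprefix("s3://").partition("/")
    return bucket, key

def move_object(s3, src_bucket, src_key, dst_bucket, dst_key):
    # The managed copy() switches to a multi-part, multi-threaded copy
    # under the hood once the object exceeds the multipart threshold,
    # so terabyte-scale objects are handled without extra code.
    s3.copy({"Bucket": src_bucket, "Key": src_key}, dst_bucket, dst_key)
    # Delete the source only after the copy succeeded; if copy() raises,
    # the original stays in place for a later retry.
    s3.delete_object(Bucket=src_bucket, Key=src_key)
```

With s3 = boto3.client("s3"), a move is then, e.g., move_object(s3, *parse_s3_uri("s3://bucket-a/incoming/huge.dat"), "bucket-b", "incoming/huge.dat"). The delete-after-copy ordering is what gives the retry safety discussed in the first option.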

Parse data from aws s3 bucket and save parsed data to another bucket

I'm new to AWS S3, and I was reading this tutorial from AWS on how to move data from one bucket to another:
How can I copy objects between Amazon S3 buckets?
However, I didn't notice any mention of whether you can apply a hook or any intermediate step before the data is saved.
Ideally, we want to take the data from a log bucket (it's very dirty and we want to clean it up a bit) and save a copy of the parsed data in another S3 bucket. We also want to do this periodically, so automation will be necessary going forward.
What I want to know is: can I do this with just S3, or do I need another service to do the parsing and save to the other bucket?
Any insight is appreciated, thanks!
S3 by itself is simply for storage. You should be looking at using AWS Lambda with Amazon S3.
Every time a file is pushed to your Log bucket, S3 can trigger a Lambda function (that you write) that can read the file, do the clean up, and then push the cleaned data to the new S3 bucket.
Hope this helps.
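A sketch of such a Lambda handler in Python (clean_log is a stand-in for whatever clean-up your logs actually need, and the destination bucket name is a placeholder):

```python
def clean_log(text):
    """Example clean-up: strip trailing whitespace and drop blank lines.
    Replace this with the real parsing your logs need."""
    lines = (line.rstrip() for line in text.splitlines())
    return "\n".join(line for line in lines if line) + "\n"

def handler(event, context):
    import boto3  # bundled with the Lambda runtime

    s3 = boto3.client("s3")
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        # Read the raw log that triggered the event, clean it, and
        # write the result to the separate "parsed" bucket.
        raw = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode()
        s3.put_object(
            Bucket="my-parsed-logs-bucket",  # placeholder destination
            Key=key,
            Body=clean_log(raw).encode(),
        )
```

The destination must be a different bucket (or at least a prefix excluded from the trigger), otherwise the put_object would re-trigger the function in a loop.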

In AWS how to Copy file from one s3 bucket to another s3 bucket using lambda function

I have created two S3 buckets named 'ABC' and 'XYZ'.
If I upload a file (object) to the 'ABC' bucket, it should automatically be copied to 'XYZ'.
For the above scenario I have to write a Lambda function using Node.js.
I am new to Lambda, so detailed steps would be very helpful.
It would be good if we can do it via the web console, but otherwise no problem.
This post should be useful for copying between buckets in the same region: https://aws.amazon.com/blogs/compute/content-replication-using-aws-lambda-and-amazon-s3/
If the use case you are trying to achieve is for DR purposes in another region, you may use cross-region replication: https://aws.amazon.com/blogs/aws/new-cross-region-replication-for-amazon-s3/. S3 does the replication for you natively, but it's unclear from your question whether you need the same region or a different one.
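The question asks for Node.js, which the linked posts cover in detail; as a language-neutral reference, here is the core logic sketched in Python with boto3 (the same flow ports directly to the Node.js AWS SDK; the destination bucket name is taken from the question):

```python
DEST_BUCKET = "XYZ"  # bucket to mirror uploads into

def source_of(record):
    """Pull (bucket, key) out of one S3 event record."""
    return record["s3"]["bucket"]["name"], record["s3"]["object"]["key"]

def handler(event, context):
    import boto3  # bundled with the Lambda runtime

    s3 = boto3.client("s3")
    for record in event["Records"]:
        bucket, key = source_of(record)
        # One copy_object call per upload replicates it into XYZ.
        s3.copy_object(
            Bucket=DEST_BUCKET,
            Key=key,
            CopySource={"Bucket": bucket, "Key": key},
        )
```

Wire it up in the console by adding an S3 trigger on bucket 'ABC' with event type Object Created (All), and give the function's role s3:GetObject on 'ABC' and s3:PutObject on 'XYZ'.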