AWS Lambda S3 events for existing files

I am considering moving to Lambdas, and after spending some time reading the docs and various blogs with user experiences, I am still struggling with a simple question: is there a proposed/proper way to use Lambda with existing S3 files?
I have an S3 bucket that contains archived data spanning a couple of years. The total size of this data is rather large (hundreds of GB). Each file is a simple txt file, and each line in a file represents an event as a comma-separated string.
My end goal is to consume these files, parse each one of them line by line, apply some transformation, create batches of lines, and send them to an external service. From what I've read so far, if I write a proper Lambda, it will be triggered by an S3 event (for example, an upload of a new file).
Is there a way to apply the lambda to all the existing contents of my bucket?
Thanks

For existing objects you would need to write a script that gets a listing of all of them and sends each item to a Lambda function somehow. I'd probably look into sending the location of each of your existing S3 objects to a Kinesis stream and configuring a Lambda function to pull records from that stream and process them.
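A rough backfill sketch along those lines, assuming a placeholder bucket 'my-archive-bucket' and a placeholder Kinesis stream 'backfill-stream':
import json
import boto3

s3 = boto3.client('s3')
kinesis = boto3.client('kinesis')

paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket='my-archive-bucket'):
    for obj in page.get('Contents', []):
        # Send each object location to the stream; the Lambda consumes the stream.
        kinesis.put_record(
            StreamName='backfill-stream',
            Data=json.dumps({'bucket': 'my-archive-bucket', 'key': obj['Key']}).encode('utf-8'),
            PartitionKey=obj['Key']
        )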

Try using s3cmd.
s3cmd modify --recursive --add-header="touched:touched" s3://path/to/s3/bucket-or-folder
This modifies the object metadata, which generates an ObjectCreated event for each object and so invokes your Lambda.

I had a similar problem, and I solved it with minimal changes to my existing Lambda function. The solution involves creating an API Gateway trigger (in addition to the S3 trigger): the API Gateway trigger is used to process historical files already in S3, while the regular S3 trigger processes files as they are uploaded to my S3 bucket.
Initially I built my function to expect an S3 event as the trigger. Recall that S3 events have a "Records" structure, so I would look up the S3 bucket name and key to process, like so:
for record in event['Records']:
    bucket = record['s3']['bucket']['name']
    key = unquote_plus(record['s3']['object']['key'], encoding='utf-8')
    temp_dir = tempfile.TemporaryDirectory()
    video_filename = os.path.basename(key)
    local_video_filename = os.path.join(temp_dir.name, video_filename)
    s3_client.download_file(bucket, key, local_video_filename)
But when the API Gateway trigger fires, there is no "Records" object in the request/event. You can pass query parameters through the API Gateway trigger, so the modification required to the above snippet of code is:
if 'Records' in event:
    # this means we are working off of an S3 event
    records_to_process = event['Records']
else:
    # this is for ad-hoc posts via the API Gateway trigger for Lambda
    records_to_process = [{
        "s3": {"bucket": {"name": event["queryStringParameters"]["bucket"]},
               "object": {"key": event["queryStringParameters"]["file"]}}
    }]
for record in records_to_process:
    # the lines below are the same as in the earlier snippet
    bucket = record['s3']['bucket']['name']
    key = unquote_plus(record['s3']['object']['key'], encoding='utf-8')
    temp_dir = tempfile.TemporaryDirectory()
    video_filename = os.path.basename(key)
    local_video_filename = os.path.join(temp_dir.name, video_filename)
    s3_client.download_file(bucket, key, local_video_filename)
(Screenshot: Postman result of sending the POST request.)
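For reference, the same ad-hoc request can be sent from Python instead of Postman; the endpoint URL, bucket, and file values below are placeholders for your own API Gateway deployment:
import requests

# Hypothetical endpoint; substitute the invoke URL of your API Gateway stage.
response = requests.post(
    'https://abc123.execute-api.us-east-1.amazonaws.com/prod/process',
    params={'bucket': 'my-existing-bucket', 'file': 'archive/2019/events.txt'}
)
print(response.status_code, response.text)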

Try copying your bucket content and catching the resulting create events with Lambda.
copy:
s3cmd sync s3://from/this/bucket/ s3://to/this/bucket
for larger buckets:
https://github.com/paultuckey/s3_bucket_to_bucket_copy_py
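The same idea as a boto3 sketch, with placeholder bucket names; each copy fires an ObjectCreated event in the destination bucket:
import boto3

s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket='source-bucket'):
    for obj in page.get('Contents', []):
        # copy_object handles objects up to 5 GB; larger ones need a multipart copy
        s3.copy_object(
            Bucket='destination-bucket',
            Key=obj['Key'],
            CopySource={'Bucket': 'source-bucket', 'Key': obj['Key']}
        )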

Related

Why is lambda getting randomly timed out while trying to read the head object for a key on S3 bucket?

I am working on a feature where a user can upload multiple files, which need to be parsed and converted to PDF if required. For that I'm using AWS, and when the user selects N files for upload, the following happens:
The client browser is connected to an AWS WebSocket API which is responsible for sending back the parsed data to respective clients later.
A signed URL for S3 is obtained from the web server, and all of the user's files are uploaded to an S3 bucket using it.
As soon as each file is uploaded, a lambda function is triggered for it which fetches the object for that file in order to get the content and some metadata to associate the files with respective clients.
Once the files are parsed, the response data is sent back to the respective connected clients via the WebSocket and the browser JS catches the event data and renders it.
The issue I'm facing here is that the lambda function randomly times out at the line which fetches the object of the file (either just head_object or get_object). This is happening for roughly 50% of the files (Usually I test by just sending 15 files at once and 6-7 of them fail)
import urllib.parse
import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    bucket = event["Records"][0]["s3"]["bucket"]["name"]
    key = urllib.parse.unquote_plus(event["Records"][0]["s3"]["object"]["key"], encoding="utf-8")
    response = s3.get_object(Bucket=bucket, Key=key)  # This or head_object gets stuck for 50% of the files
What I have observed is that even if head_object or get_object is called for a file that already exists on S3, instead of for the file whose upload triggered the Lambda, it still times out at the same rate.
But if the objects are fetched in bulk via a local script using boto3, all 15 files are fetched in under a second.
I have also tried using my own AWS access key ID and secret key in the Lambda to rule out any issue caused by the temporarily generated credentials.
So it seems that multiple Lambda instances have trouble fetching S3 objects in parallel, which shouldn't happen, as AWS is supposed to scale well.
What should be done to get around it?

AWS Lambda avoid recursive trigger

I'm downloading data from an API and writing it to a csv file that I store in an S3 bucket. I'm then copying my file from this input bucket into an output bucket with a Lambda function. From the output bucket I'm ingesting it into a MySQL RDS instance with another Lambda function.
The copy-to-another-bucket and upload-to-RDS lambda functions both get triggered when I create a new object in a bucket. Since I'm appending to my csv file, the upload-to-RDS function gets triggered way more than it should and I end up with ~30 rows in my database instead of 6.
I thought by copying the files between S3 buckets I could avoid this, but it doesn't help. Is there any way to only upload the csv file to the database once it has been written and not while it's being updated? Can I delay the trigger maybe?
The only other solution I can think of is to skip the copy-to-another-bucket function altogether and to schedule the upload-to-RDS function.
You need to realize that S3 doesn't support updating an existing file. If you are appending a row to an existing CSV file in S3, then that operation requires uploading the entire contents of the CSV file to S3 again, which S3 sees as a new object.
If you need to store a temporary version of the CSV file in S3 while you are updating it, then you should store it in a separate path, like s3://your_bucket/tmp and then when you have completed your updates, move it to the final path like s3://your_bucket/complete and only configure the Lambda trigger on the /complete path.
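A minimal sketch of that flow with boto3, reusing the placeholder bucket and path names from above:
import boto3

s3 = boto3.client('s3')
bucket = 'your_bucket'
tmp_key = 'tmp/data.csv'
final_key = 'complete/data.csv'

# ... upload/overwrite the CSV under tmp/ as many times as needed, e.g.:
# s3.upload_file('data.csv', bucket, tmp_key)

# When the file is final, copy it to the complete/ prefix and remove the tmp copy;
# only the complete/ prefix is configured to trigger the upload-to-RDS Lambda.
s3.copy_object(Bucket=bucket, Key=final_key,
               CopySource={'Bucket': bucket, 'Key': tmp_key})
s3.delete_object(Bucket=bucket, Key=tmp_key)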

AWS S3: Notification for files in particular folder

In our S3 buckets we have a folder where incoming files are placed, and then one of our systems picks them up and processes them.
I want to know which files in this folder are older than some period and then send a notification to the corresponding team.
I.e. if a file arrived in the S3 bucket today and it's still there after 3 hours, I want to get notified.
I am thinking of using the boto Python library to iterate through all the objects in the S3 bucket at a scheduled interval to check which files are still in the folder, and then send a notification. However, this polling solution doesn't seem good.
I am thinking of having some event-based solution. I know S3 has events which I can subscribe to using either a queue or a Lambda. However, I don't want to take any action as soon as a file is available; I just want to check which files are older than some time and send an email notification.
Can we achieve this using an event-based solution?
We are expecting around 1,000 files per hour. Once a file is processed it is moved to a different folder; however, if something goes wrong it will stay there. So in a day I am not expecting more than 10,000 files in one bucket. Consider that I have multiple buckets.
Iterating through S3 files to do that kind of filter is not a good idea. It can get very slow when you have more than a thousand files in there. I would suggest you use a database to store those records.
You can have a DynamoDB table with 2 columns: file name and upload date. Or, if budget is a problem, you can even keep a sqlite3 file in the bucket and fetch it whenever you need to query or add data to it. I did this using Lambda, and it works just fine. Just don't forget to upload the file again when new records are inserted.
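A minimal sketch of the DynamoDB variant, assuming a placeholder table named 'incoming-files' keyed on the file name:
import boto3
from datetime import datetime, timezone

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('incoming-files')

def record_upload(key):
    # Called from the S3-triggered Lambda when a file lands in the folder.
    table.put_item(Item={
        'file_name': key,
        'upload_date': datetime.now(tz=timezone.utc).isoformat()
    })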
You could create an Amazon CloudWatch Event rule that triggers an AWS Lambda function at a desired time interval (eg every 5 minutes or once an hour).
The AWS Lambda function could list the desired folder looking for files older than a desired time period. It would be something like this:
import boto3
from datetime import datetime, timedelta, timezone

s3_client = boto3.client('s3')

paginator = s3_client.get_paginator('list_objects_v2')
page_iterator = paginator.paginate(
    Bucket='my-bucket',
    Prefix='to-be-processed/'
)

for page in page_iterator:
    for object in page['Contents']:
        if object['LastModified'] < datetime.now(tz=timezone.utc) - timedelta(hours=3):
            # Print name of object older than given age
            print(object['Key'])
You could then have it notify somebody. The easiest way would be to send a message to an Amazon SNS topic, and then people can subscribe to that topic via SMS or email to receive a notification.
The above code is quite simple in that it will keep reporting the same files on every run, not just files that have newly crossed the age threshold.
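For the notification part, a minimal SNS sketch (the topic ARN is a placeholder), assuming the stale keys found by the loop above are collected into a list:
import boto3

sns = boto3.client('sns')

def notify(stale_keys):
    # Publish one summary message listing the stale objects.
    if stale_keys:
        sns.publish(
            TopicArn='arn:aws:sns:us-east-1:123456789012:stale-files',
            Subject='Files older than 3 hours',
            Message='\n'.join(stale_keys)
        )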

Lambda reading file on S3 - flushing S3 cache

I have a problem regarding caching and S3. Basically, I have a Lambda that reads a file on S3 which is used as configuration. This file is JSON. I am using Python with boto3 to extract the needed info.
Snippet of my code:
import json
import boto3

s3 = boto3.resource('s3')
bucketname = "configurationbucket"
itemname = "conf.json"

obj = s3.Object(bucketname, itemname)
body = obj.get()['Body'].read()
json_parameters = json.loads(body)

def my_handler(event, context):
    # using the json_parameters data
The problem is that when I change the json content and I upload the file again on S3, my lambda seems to read the old values, which I suppose is due to S3 doing caching somewhere.
Now I think that there are two ways to solve this problem:
to force S3 to invalidate its cache content
to force my lambda to reload the file from S3 without using the cache
I would prefer the first solution, because I think it will reduce computation time (reloading the file is an expensive procedure). So, how can I flush my cache? I didn't find a simple way to do this in the console or in the AWS guides.
The problem is that the code outside of the function handler is initialized only once per execution environment. It won't be re-initialized while the Lambda is warm, so this is not S3 caching at all. Read the file inside the handler instead:
def my_handler(event, context):
    # read from S3 on every invocation
    obj = s3.Object(bucketname, itemname)
    body = obj.get()['Body'].read()
    json_parameters = json.loads(body)
    # use the json_parameters data
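If reloading on every invocation feels too expensive, one compromise (not part of the answer above, just a common pattern) is to cache the parsed config and refresh it after a time-to-live; the bucket and key reuse the names from the question:
import json
import time
import boto3

s3 = boto3.resource('s3')
_cache = {'loaded_at': 0.0, 'params': None}
CACHE_TTL_SECONDS = 300  # re-read the config at most every 5 minutes

def my_handler(event, context):
    if _cache['params'] is None or time.time() - _cache['loaded_at'] > CACHE_TTL_SECONDS:
        obj = s3.Object('configurationbucket', 'conf.json')
        _cache['params'] = json.loads(obj.get()['Body'].read())
        _cache['loaded_at'] = time.time()
    json_parameters = _cache['params']
    # use json_parameters here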

AWS Lambda function getting called repeatedly

I have written a Lambda function which gets invoked automatically when a file comes into my S3 bucket.
I perform certain validations on this file, modify particular details, and put the file back at the same location.
Due to this "put", my Lambda is called again and the process goes on until my Lambda execution times out.
Is there any way to trigger this Lambda only once?
I found an approach where I can store the file name in DynamoDB and apply a check in the Lambda function, but is there any other approach that avoids using DynamoDB?
You have a couple options:
You can put the processed file in a different location in S3 and delete the original
You can add a metadata field to the S3 object when you update it, then check for the presence of that field so you know whether you have processed it already (a rough sketch of this follows below). Now this might not work perfectly, since S3 does not always provide the most recent data on reads after updates.
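A rough sketch of that second option, with hypothetical helper names; note that rewriting metadata is itself a copy, which fires another event that the check then short-circuits:
import boto3

s3 = boto3.client('s3')

def already_processed(bucket, key):
    # Hypothetical helper: look for a 'processed' marker in the object metadata.
    head = s3.head_object(Bucket=bucket, Key=key)
    return head.get('Metadata', {}).get('processed') == 'true'

def mark_processed(bucket, key):
    # Changing metadata requires copying the object onto itself with REPLACE.
    s3.copy_object(
        Bucket=bucket, Key=key,
        CopySource={'Bucket': bucket, 'Key': key},
        Metadata={'processed': 'true'},
        MetadataDirective='REPLACE'
    )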
AWS allows different types of S3 event triggers. You can try playing with s3:ObjectCreated:Put vs s3:ObjectCreated:Post.
You can upload your files in a folder, say
s3://bucket-name/notvalidated
and store the validated files in another folder, say
s3://bucket-name/validated.
Update your S3 event notification to invoke your Lambda function whenever there is an ObjectCreated (All) event under the notvalidated/ prefix.
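If you want to set that up from code rather than the console, something like this boto3 sketch should work (the bucket name and Lambda ARN are placeholders):
import boto3

s3 = boto3.client('s3')
s3.put_bucket_notification_configuration(
    Bucket='bucket-name',
    NotificationConfiguration={
        'LambdaFunctionConfigurations': [{
            'LambdaFunctionArn': 'arn:aws:lambda:us-east-1:123456789012:function:validate',
            'Events': ['s3:ObjectCreated:*'],
            # Only objects under notvalidated/ will invoke the function.
            'Filter': {'Key': {'FilterRules': [
                {'Name': 'prefix', 'Value': 'notvalidated/'}
            ]}}
        }]
    }
)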
The second answer does not seem to be correct (put vs post) - there is not really a concept of update in S3 in terms of POST or PUT. The request to update an object will be the same as the initial POST of the object. See here for details on the available S3 events.
I had this exact problem last year: I was doing an image resize on PUT, and every time a file was overwritten the function would be triggered again. My recommended solution would be to have two folders in your S3 bucket, one for the original file and one for the finalized file. You could then create the Lambda trigger with a prefix filter so it only fires for files in the original folder.
Events are triggered in S3 when an object is created via Put, Post, Copy, or Complete Multipart Upload; all of these operations correspond to ObjectCreated as per the AWS documentation:
https://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html
The best solution is to restrict your S3 object-created event to a particular bucket location (prefix), so that only changes in that location trigger the Lambda function.
You can then write the modified object to some other location that is not configured to trigger the Lambda function when an object is created there.
Hope it helps!