AWS CloudWatch Log Trigger for Lambda - amazon-web-services

I have a problem in AWS regarding CloudWatch Log Triggers.
I have two Lambda functions. One (business-lambda) gets triggered when I upload a file to an S3 bucket. The other Lambda function (log-lambda) is triggered whenever business-lambda encounters an invalid file, which results in an ERROR log entry. I implemented this with a CloudWatch Log Trigger using the filter pattern "?ERROR" and subscribed the log-lambda to the log group of the business-lambda.
Everything works fine as long as I upload one file at a time, or at most ~3 files at a time.
But when I upload e.g. 10 invalid files at once, the log-lambda doesn't get triggered for all of the files. Instead it only gets triggered for 4-5 of them.
Is there some kind of "CloudWatch log triggers per second" limit?

I found a solution - luk2302 made the correct suggestion in their comment.
In the log-lambda code I only processed the first entry of an incoming log event, but the log-lambda gets triggered once for a whole batch of error-log entries from the business-lambda. I did not take this into account in the log-lambda code.
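For reference, CloudWatch Logs delivers the subscribed entries to the log-lambda as one compressed, base64-encoded batch. A minimal sketch of iterating over the whole batch (handle_error_entry is just a placeholder for my own handling) looks roughly like this:

    import base64
    import gzip
    import json

    def handle_error_entry(message):
        # Placeholder for the real per-entry handling in log-lambda.
        print(message)

    def handler(event, context):
        # CloudWatch Logs sends one compressed payload that can contain
        # many log events, not just one.
        payload = base64.b64decode(event["awslogs"]["data"])
        data = json.loads(gzip.decompress(payload))

        # Iterate over every log event in the batch instead of only the first.
        for log_event in data["logEvents"]:
            handle_error_entry(log_event["message"])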
Thanks to everybody for their time!

Related

Is there a way to check if AWS Lambda is running from Java code?

There is a Lambda, let's say its name is XYZ, which has an S3 file-upload trigger. Now let's say I upload multiple files, around 100, and the Lambda executions start.
Is there any way I can track whether the Lambda is still running or has processed all the files?
The reason for this is that once the Lambda has finished processing all the files, I want to trigger a Step Function, and I only want to do that once all the files have been processed by my Lambda (XYZ).
FYI: There is no current way to track how many files have been uploaded
I think it's not a good design to run a Step Functions state machine after the Lambda completes the job without a well-defined completion event.
For example, suppose the Lambda has completed a batch of files (say 100) and you fire the Step Functions job, whether via a CloudWatch alarm once a custom metric reaches 100, or by checking a value in DynamoDB or the number of objects under a custom folder. If another file arrives 5 seconds later as number 101, you may miss it.
If you don't have a specific event or condition that marks the completion of the files, then you can work with time-based scheduling instead: trigger your Step Function from a CloudWatch scheduled event, e.g. every 15 minutes, check whether there is work to do, and if not, exit the job; see the sketch below.
Otherwise, either include the file-processing Lambda in your Step Function as a step, or change your design.
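To make the scheduled variant concrete, here is a minimal sketch, assuming a hypothetical PendingFiles DynamoDB table that tracks unprocessed files and a STATE_MACHINE_ARN environment variable; it only illustrates the idea and is not a drop-in implementation:

    import json
    import os
    import boto3

    dynamodb = boto3.resource("dynamodb")
    sfn = boto3.client("stepfunctions")

    # Hypothetical names; adjust to your own table and state machine.
    TABLE_NAME = os.environ.get("PENDING_FILES_TABLE", "PendingFiles")
    STATE_MACHINE_ARN = os.environ["STATE_MACHINE_ARN"]

    def handler(event, context):
        """Runs on a CloudWatch scheduled event, e.g. every 15 minutes."""
        table = dynamodb.Table(TABLE_NAME)
        pending = table.scan(Limit=1).get("Items", [])

        if not pending:
            # No work to do; exit quietly until the next scheduled run.
            return {"started": False}

        sfn.start_execution(
            stateMachineArn=STATE_MACHINE_ARN,
            input=json.dumps({"trigger": "scheduled-check"}),
        )
        return {"started": True}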

Automated Real Time Data Processing on AWS with Lambda

I am interested in doing automated real-time data processing on AWS using Lambda and I am not certain about how I can trigger my Lambda function. My data processing code involves taking multiple files and concatenating them into a single data frame after performing calculations on each file. Since files are uploaded simultaneously onto S3 and files are dependent on each other, I would like the Lambda to be only triggered when all files are uploaded.
Current Approaches/Attempts:
- An S3 trigger, but my concern is that an S3 trigger will cause an error when a single file upload starts the Lambda before the other files are present. An alternative would be adding a wait time, but that is not preferred since it wastes compute resources.
- A scheduled trigger using CloudWatch/EventBridge, but this would not be real-time processing.
- An SNS trigger, but I am not certain the message can be automated without knowing when the file uploads are complete.
Any suggestion is appreciated! Thank you!
If you really cannot do it with a scheduled function, the best option is to trigger a Lambda function when an object is created.
The tricky bit is that it will fire your function on each object upload. So you can either identify the "last part", e.g. based on some metadata, or you will need to store and track the state of all uploads, e.g. in a DynamoDB table, and do the actual processing only when a batch is complete.
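To illustrate the second option, here is a minimal sketch, assuming a hypothetical UploadTracker DynamoDB table, that a batch id can be derived from the key prefix, and that the expected number of files per batch is known:

    import boto3

    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("UploadTracker")  # hypothetical table name

    EXPECTED_FILES = 10  # assumption: the batch size is known in advance

    def process_batch(batch_id):
        # Placeholder for the real concatenation / calculation step.
        print(f"processing batch {batch_id}")

    def handler(event, context):
        batches = set()
        for record in event["Records"]:
            key = record["s3"]["object"]["key"]
            batch_id = key.split("/")[0]  # assumption: batch id is the key prefix
            batches.add(batch_id)
            # Remember that this key has arrived for its batch.
            table.update_item(
                Key={"batch_id": batch_id},
                UpdateExpression="ADD uploaded_keys :k",
                ExpressionAttributeValues={":k": {key}},
            )

        # Only start the real processing once a whole batch has arrived.
        for batch_id in batches:
            item = table.get_item(Key={"batch_id": batch_id}).get("Item", {})
            if len(item.get("uploaded_keys", set())) >= EXPECTED_FILES:
                process_batch(batch_id)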
Best, Stefan
Your files coming in as parts might be named as -
filename_part1.ext
filename_part2.ext
If one of your systems is generating those files, then have that system also generate a final dummy blank file named -
filename.final
Since your S3 event trigger can filter on a suffix, use the .final extension to invoke the Lambda and process the records.
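A rough sketch of a handler for this first approach, assuming the suffix filter is set to .final and that the parts share the marker file's name (process_part is a placeholder):

    import boto3

    s3 = boto3.client("s3")

    def process_part(bucket, key):
        # Placeholder for the real record processing.
        print(f"processing s3://{bucket}/{key}")

    def handler(event, context):
        # Invoked only for objects ending in ".final" via the S3 suffix filter.
        record = event["Records"][0]
        bucket = record["s3"]["bucket"]["name"]
        final_key = record["s3"]["object"]["key"]

        # Assumption: the parts share the marker file's name without ".final".
        prefix = final_key.rsplit(".final", 1)[0]
        parts = s3.list_objects_v2(Bucket=bucket, Prefix=prefix).get("Contents", [])

        for part in parts:
            if part["Key"] != final_key:
                process_part(bucket, part["Key"])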
As an alternative approach, if you do not have access to the server putting objects into your S3 bucket, then on each PUT operation in the bucket, invoke the Lambda and insert an entry in DynamoDB.
You need to put a unique entry per file (not per file part) in DynamoDB with -
filename and last_part_received_time
The last_part_received_time keeps getting updated as long as file parts keep arriving.
Now this table can be looked up by a cron Lambda invocation, which checks whether the time skew (the difference between the SYSTIME of the Lambda invocation and the DynamoDB entry's last_part_received_time) is large enough to start processing the records.
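A sketch of that cron check, with table and attribute names only as examples and a 5-minute skew chosen arbitrarily:

    import time
    import boto3

    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("FileParts")  # hypothetical table name

    SKEW_SECONDS = 300  # assumption: 5 minutes of silence means the file is complete

    def process_file(filename):
        # Placeholder for the real record processing.
        print(f"processing {filename}")

    def handler(event, context):
        """Cron Lambda: process files whose parts stopped arriving a while ago."""
        now = int(time.time())
        # Pagination is ignored for brevity.
        for item in table.scan().get("Items", []):
            if now - int(item["last_part_received_time"]) > SKEW_SECONDS:
                process_file(item["filename"])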
I would still prefer the first approach, as the second one still leaves room for error.
Since you want this to be as real time as possible, perhaps you could just perform your logic every single time a file is uploaded, updating the version of the output as new files are added, and iterating through an S3 prefix per grouping of files, like in this other SO answer.
In terms of the architecture, you could add in an SQS queue or two to make this more resilient. An S3 Put Event can trigger an SQS message, which can trigger a Lambda function, and you can have error handling logic in the Lambda function that puts that event in a secondary queue with a visibility timeout (sort of like a backoff strategy) or back in the same queue for retries.
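As a rough illustration of that retry logic (the queue URL is a placeholder, and I'm using SQS DelaySeconds on the secondary queue as the backoff):

    import json
    import os
    import boto3

    sqs = boto3.client("sqs")
    RETRY_QUEUE_URL = os.environ["RETRY_QUEUE_URL"]  # hypothetical secondary queue

    def process_s3_event(body):
        # Placeholder for the real per-file processing.
        print(body)

    def handler(event, context):
        for record in event["Records"]:
            try:
                process_s3_event(json.loads(record["body"]))
            except Exception:
                # Re-enqueue the failed event with a delay, which acts as a
                # simple backoff before the next attempt.
                sqs.send_message(
                    QueueUrl=RETRY_QUEUE_URL,
                    MessageBody=record["body"],
                    DelaySeconds=60,
                )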

"Realtime" syncing of large numbers of log files to S3

I have a large number of logfiles from a service that I need to regularly run analysis on via EMR/Hive. There are thousands of new files per day, and they can technically come out of order relative to the file name (e.g. a batch of files comes a week after the date in the file name).
I did an initial load of the files via Snowball, then set up a script that syncs the entire directory tree once per day using the 'aws s3 sync' CLI command. This is good enough for now, but I will need a more realtime solution in the near future. The issue with this approach is that it takes a very long time, on the order of 30 minutes per day, and uses a ton of bandwidth all at once! I assume this is because it needs to scan the entire directory tree to determine which files are new, then sends them all at once.
A realtime solution would be beneficial in 2 ways. One, I can get the analysis I need without waiting up to a day. Two, the network use would be lower and more spread out, instead of spiking once a day.
It's clear that 'aws s3 sync' isn't the right tool here. Has anyone dealt with a similar situation?
One potential solution could be:
Set up a service on the log-file side that continuously syncs (or aws s3 cp) new files based on the modified date. But wouldn't that need to scan the whole directory tree on the log server as well?
For reference, the log-file directory structure is like:
/var/log/files/done/{year}/{month}/{day}/{source}-{hour}.txt
There is also a /var/log/files/processing/ directory for files being written to.
Any advice would be appreciated. Thanks!
You could have a Lambda function triggered automatically as a new object is saved on your S3 bucket. Check Using AWS Lambda with Amazon S3 for details. The event passed to the Lambda function will contain the file name, allowing you to target only the new files in the syncing process.
If you'd like wait until you have, say 1,000 files, in order to sync in batch, you could use AWS SQS and the following workflow (using 2 Lambda functions, 1 CloudWatch rule and 1 SQS queue):
S3 invokes Lambda whenever there's a new file to sync
Lambda stores the filename in SQS
CloudWatch triggers another Lambda function every X minutes/hours to check how many files are in SQS for syncing. Once there are 1,000 or more, it retrieves those filenames and runs the syncing process.
Keep in mind that Lambda has a hard execution timeout (historically 5 minutes, now 15 minutes). If your sync job takes too long, you'll need to break it into smaller chunks.
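A rough sketch of the two functions in this workflow, with the queue URL and the 1,000-file threshold as placeholders:

    import os
    import boto3

    sqs = boto3.client("sqs")
    QUEUE_URL = os.environ["SYNC_QUEUE_URL"]  # hypothetical queue

    def sync_files(keys):
        # Placeholder for the actual syncing job.
        print(f"syncing {len(keys)} files")

    def enqueue_handler(event, context):
        """Triggered by S3: store each new object's key in SQS."""
        for record in event["Records"]:
            sqs.send_message(
                QueueUrl=QUEUE_URL,
                MessageBody=record["s3"]["object"]["key"],
            )

    def sync_handler(event, context):
        """Triggered by a CloudWatch schedule: sync once enough files queued up."""
        attrs = sqs.get_queue_attributes(
            QueueUrl=QUEUE_URL,
            AttributeNames=["ApproximateNumberOfMessages"],
        )["Attributes"]
        if int(attrs["ApproximateNumberOfMessages"]) < 1000:
            return

        # Drain in batches of 10 (the SQS maximum per receive call).
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10)
        messages = resp.get("Messages", [])
        sync_files([m["Body"] for m in messages])
        for m in messages:
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=m["ReceiptHandle"])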
You could set the bucket up to log HTTP requests to a separate bucket, then parse the logs to look for newly created files and their paths. One trouble spot: as well as PUT requests, you have to look for the multipart-upload operations, which are a sequence of POSTs. It's best to log for a few days to see what gets created before putting any effort into this approach.

AWS Lambda: error creating the event source mapping: Configuration is ambiguously defined

There was an error creating the event source mapping: Configuration is ambiguously defined. Cannot have overlapping suffixes in two rules if the prefixes are overlapping for the same event type.
I created an event from the GUI console 6-7 days ago and it was working fine. The next day the event was just missing; I couldn't see it anymore in the Lambda console GUI, but every S3 object still seemed to trigger the Lambda function without a problem. Since not being able to see it is not good, I deleted the Lambda function and waited 5-10 seconds before creating another new function. Now I receive the error above when I try to create the event source.
When I click "Submit", the event sources tab says "You do not have any event sources for this function" and the Lambda does not get triggered; the entire application flow is now broken :(
The problem is almost the same as in https://forums.aws.amazon.com/thread.jspa?messageID=670712, but somehow I can't reply to that thread, so I created a new thread here instead. Has anyone encountered this issue?
In fact, I tried to respond to the existing AWS forum thread https://forums.aws.amazon.com/thread.jspa?messageID=670712,
but I keep getting this odd error: "Your message quota has been reached. Please try again later." I wasn't even posting anything, so how can I have used up my quota?
What I suspect is that your S3 bucket may still be "linked" to the Lambda function.
Maybe check your S3 bucket for events and remove them there, then try creating the lambda events again?
i.e. S3 bucket-> properties-> Events
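If the console doesn't show anything, here is a small sketch using boto3 to inspect the bucket's notification configuration and, if necessary, clear it (the bucket name is a placeholder, and clearing removes all notification settings on the bucket):

    import boto3

    s3 = boto3.client("s3")
    bucket = "my-bucket"  # placeholder: your bucket name

    # Show any notification configuration still attached to the bucket.
    print(s3.get_bucket_notification_configuration(Bucket=bucket))

    # Careful: this removes ALL notification settings on the bucket.
    s3.put_bucket_notification_configuration(
        Bucket=bucket,
        NotificationConfiguration={},
    )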
After 6 years, it's nice to see people still benefiting from this answer.
Here is a shameless plug for a YouTube video I uploaded on 2022-12-13:
https://www.youtube.com/watch?v=rjpOU7jbgEs
The issue must be that the S3 bucket is already linked with the suffix/prefix you are trying to use. Remove the link in S3 and try again.
When you set up a Lambda function with an S3 trigger, the notification gets added to the Properties section of that S3 bucket.
The mentioned error occurs when the earlier Lambda function has been deleted and you're trying to set up the same kind of trigger again. The thing to note is that the S3 notification is not deleted when you delete the Lambda function.
Go to S3 bucket > Properties > Event notifications,
delete the old setting, and then set up the trigger again on the new Lambda function.
Here is a link to a youtube video profiling this issue and demonstrating the solution:
https://www.youtube.com/watch?v=1Tfmc9nEtbU
Just as Ridwaan Manuel said, you must remove the events by going to S3 bucket -> Properties -> Events, as the video shows.
Steps to reproduce this issue:
1. Create a bucket and create a folder called "example/"
2. Create a Lambda function
3. Add an S3 trigger to the Lambda using the bucket from (1) with default settings
4. Save the trigger
5. Click Save and notice the error
6. Refresh the page and notice that the triggers disappeared
7. Add the same bucket again and notice the ambiguous-reference error

Is it possible for S3 notifications to SQS to fail?

I have set up an S3 bucket to publish messages for each PUT and POST action. Files get uploaded to that bucket using the CLI. It works fine, but out of 4 files pushed sequentially, only one triggers a message. I am not sure this has always happened, but it is happening consistently now. Note that it does not happen when I upload files manually (i.e. I always get a message per file).
I have made sure that there is no downstream system processing the messages (as a confirmation, I still see the original message triggered after the first file).
Is there any reason to believe that this AWS feature is not reliable? Since this is unlikely, what could be the problem here?
As suggested by Michael in the comments, the problem was that the bucket only listened to s3:ObjectCreated:Put. What was happening was that all the files but the first were uploaded using multipart upload, which did not trigger any message creation.
I modified the bucket to trigger messages on s3:ObjectCreated:* and it now works as expected.
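For reference, a sketch of the equivalent configuration with boto3; the bucket name and queue ARN are placeholders:

    import boto3

    s3 = boto3.client("s3")

    # Listen for every object-creation variant (Put, Post, Copy,
    # CompleteMultipartUpload) instead of only s3:ObjectCreated:Put.
    s3.put_bucket_notification_configuration(
        Bucket="my-bucket",  # placeholder
        NotificationConfiguration={
            "QueueConfigurations": [
                {
                    "QueueArn": "arn:aws:sqs:us-east-1:123456789012:my-queue",  # placeholder
                    "Events": ["s3:ObjectCreated:*"],
                }
            ]
        },
    )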
Inspired by RaySF's answer, I fixed the issue directly in the AWS console.
Sign in AWS console
S3
Find your bucket and click on it
Properties tab
Events
Edit the related event
Change from PUT to All object create events