Realtime-ness of S3 event notification - amazon-web-services

I am interested in the traffic lifecycle of objects (i.e. when the objects were created and deleted).
One approach is to scan the bucket periodically, track each object's lastModifiedTime explicitly, and diff against the previous scan result to identify deleted objects.
Another alternative I was considering is to enable S3 event notifications. However, the notification data does not contain the object's lastModifiedTime. Can eventTime be used as a proxy instead? Is there a guarantee on how quickly the event is sent? In my case, it is acceptable if delivery of the event is delayed, as long as eventTime is not significantly later than the object's modification time.
Also, are there any other alternatives for capturing the lifecycle of S3 objects?

Yeah, the eventTime is a pretty good approximation of the lastModifiedTime of an object. One caveat here is that lastModifiedTime is defined as:
Object creation date or the last modified date, whichever is the latest.
So in order to use eventTime as an approximation, you probably need a trigger that covers all the events where an object is either created or modified. Regarding your question of how quickly the event is sent, here is a quote from the S3 documentation:
Amazon S3 event notifications are designed to be delivered at least once. Typically, event notifications are delivered in seconds but can sometimes take a minute or longer.
If you want the accurate lastModifiedTime, you need to do a headObject operation for each object.
Your first approach, the periodic pull, could work, but be careful not to do it naively if you have millions of objects. I mean, don't call listObjects in a while loop. That doesn't scale at all, and the listObjects API is pretty expensive. If you only need to do this traffic analysis once a day or once a week, I recommend using S3 Inventory. The lastModifiedTime is included in the inventory report.
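To make the eventTime idea concrete, here is a minimal Python sketch of turning an S3 notification into lifecycle rows. The function name and the row layout are hypothetical; the field names (eventName, eventTime, s3.bucket.name, s3.object.key) follow the S3 event message structure. Object keys arrive URL-encoded in the event, hence the unquote_plus call.

```python
from datetime import datetime
from urllib.parse import unquote_plus


def lifecycle_rows(event):
    """Return (bucket, key, action, event_time) tuples from an S3 notification.

    event_time is the notification's eventTime, used here only as a proxy for
    lastModifiedTime; call headObject if the exact value is needed.
    """
    rows = []
    for record in event.get("Records", []):
        name = record["eventName"]  # e.g. "ObjectCreated:Put"
        if name.startswith("ObjectCreated:"):
            action = "created"
        elif name.startswith("ObjectRemoved:"):
            action = "deleted"
        else:
            continue  # ignore restore, replication, and other event types
        event_time = datetime.fromisoformat(
            record["eventTime"].replace("Z", "+00:00")
        )
        rows.append(
            (
                record["s3"]["bucket"]["name"],
                unquote_plus(record["s3"]["object"]["key"]),
                action,
                event_time,
            )
        )
    return rows
```

The same function can run inside a Lambda handler or a queue consumer; where the rows are stored is left open.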

There is no guarantee for how long it takes to deliver the events. From the docs:
Amazon S3 event notifications are designed to be delivered at least once. Typically, event notifications are delivered in seconds but can sometimes take a minute or longer.
Also, events occurring at the same time may be represented by a single event at the end:
If two writes are made to a single non-versioned object at the same time, it is possible that only a single event notification will be sent. If you want to ensure that an event notification is sent for every successful write, you can enable versioning on your bucket. With versioning, every successful write will create a new version of your object and will also send an event notification.
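Because delivery is at least once and not necessarily in order, a consumer usually has to drop duplicates and stale events itself. The sketch below keeps only the latest record per key using the sequencer field that S3 includes on create and delete events. As I understand the S3 documentation, sequencers are hexadecimal strings that are compared lexicographically after right-padding the shorter one with zeros, and they are only meaningful for events on the same key; treat that comparison rule as an assumption to verify. The function names are hypothetical.

```python
def is_newer(seq_a, seq_b):
    """True if sequencer seq_a is later than seq_b (same object key only)."""
    width = max(len(seq_a), len(seq_b))
    # Right-pad the shorter hex string with zeros, then compare as text.
    return seq_a.ljust(width, "0").upper() > seq_b.ljust(width, "0").upper()


def latest_per_key(records):
    """Keep one record per object key: the one with the greatest sequencer.

    Exact duplicates (same key and sequencer) are dropped as a side effect.
    """
    latest = {}
    for record in records:
        obj = record["s3"]["object"]
        key, seq = obj["key"], obj.get("sequencer")
        if seq is None:
            continue  # no ordering information on this record
        current = latest.get(key)
        if current is None or is_newer(seq, current["s3"]["object"]["sequencer"]):
            latest[key] = record
    return latest
```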

Related

Is there any notification event I can trace for completion of an execution of AWS S3 lifecycle rule?

I wanted to delete a large number of S3 files (maybe a few 100K or 1000K; I do not control the count) in a bulk async process. I looked through multiple blogs and collated the strategies below:
Leverage AWS S3 REST API from the async thread of custom application
Here the drawbacks are:
I will have to make a huge number of S3 API calls, as one request is limited to 1000 S3 objects, and I may not know the exact S3 objects.
Even if I identify the S3 objects to delete, I will have to first GET and then DELETE, which will make the solution costly.
I will have to keep track of deleted chunks and, in case of a failure in the middle of the operation, build a mechanism to re-trigger the chunks that failed to be deleted.
Leveraging S3 lifecycle policy
Here the drawbacks are:
We store multiple customers' data in the same bucket, segregated by customer-id in the prefix. With a growing number of customers, we foresee hitting the hard limit of 1000 rules per bucket.
To get around that drawback, we can delete a rule to free up quota for the next requests. But we were looking for an event-based notification that tells us the bulk delete operation is complete.
Again, with a growing number of customers, we may lose predictability of the bulk delete operation: jobs accumulate once the quota limit is reached, and a submitted bulk delete job may have to wait for days to complete.
Create only 1 rule with a special bulk delete tag and use it to set 1 S3 lifecycle policy
With this approach, we believe we will not hit the limit issue we expect with the approach above. As we understand it, S3 lifecycle rules are executed once a day (though we don't know exactly when), so we are assured that the rule will be triggered within at most 24 hours, and it will then take some time to actually complete the bulk delete operation (maybe a few minutes or hours, we don't know). Here too we have an open question: is there a notification event after the completion of one execution of an S3 lifecycle rule that we can listen for, so that we can update the status of all submitted bulk delete jobs to DONE? Without such a notification event, it becomes difficult to communicate the status transparently back to the end user who triggered the async bulk delete operation.
Any comments/advice on the above strategies will be helpful, as will an answer to the question on the last strategy, which I guess is the most preferable choice I have as of now.
I tried all the above stated strategies and got stuck at the mentioned problem for each. Any input/advice on the above will be of great help.
After all evaluations, we have decided to delete the relevant data for a specific time range in code, as an async Java process leveraging the S3 bulk delete SDK (DeleteObjectsRequest).
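The final choice above was a Java process; purely as an illustration of the same batching idea, here is a Python sketch built around boto3's delete_objects call, which accepts up to 1000 keys per request. The s3_client argument is assumed to be a boto3 S3 client (for example boto3.client("s3")) and is passed in rather than created here. The function names are hypothetical.

```python
def chunked(items, size=1000):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def delete_keys(s3_client, bucket, keys):
    """Delete keys in batches of up to 1000; return keys S3 reported as failed.

    The returned list is what a caller would persist and retry later,
    which covers the "track failed chunks" concern at key granularity.
    """
    failed = []
    for batch in chunked(list(keys)):
        response = s3_client.delete_objects(
            Bucket=bucket,
            Delete={"Objects": [{"Key": k} for k in batch], "Quiet": True},
        )
        # In quiet mode the response lists only the keys that failed.
        failed.extend(err["Key"] for err in response.get("Errors", []))
    return failed
```

A request that raises (for example on a network error) is not handled here; a real job would catch that and record the whole batch for retry.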

Best architecture for time triggered events in serverless architecture

I am currently designing a serverless system where I want to store objects that will be used when the user defines. For example the user can say "this object will do x in 3 days." The object gets stored in the DB, and then 3 days later, an action will happen with that object.
I want this to be as real time as possible. An answer to "Time driven events architecture" suggests using a priority queue, which I think is a great idea. But in a serverless architecture, how can I pull objects off that priority queue only after the time that the user set passes? The only way I can think to do this now is to poll the queue every so often, but it seems like it would be better if the priority queue could pop itself and take action if the object at the front of the queue is expired.
This seems like it could work https://aws.amazon.com/blogs/mt/build-scheduler-as-a-service-amazon-cloudwatch-events-amazon-eventbridge-aws-lambda/ but I worry that it's overkill or that it's not a perfect fit that will run into scaling issues.
I would suggest using the TTL feature of AWS DynamoDB.
ref - https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/TTL.html
When you store the object, add a record to DynamoDB with the details of the object and the epoch time (now + user-provided time).
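Once the time you give expires, DynamoDB will delete the record. Note that TTL deletion runs as a background process, so the record may be removed some time after the expiry rather than at the exact moment.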
Create a DynamoDB stream that triggers a lambda function that can now process the record info it receives in the event.
ref - https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/time-to-live-ttl-streams.html
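The TTL steps above can be sketched in Python roughly as follows. The attribute names object_id and expires_at are hypothetical, and the sketch assumes the table's TTL attribute is set to expires_at and the stream view type includes old images. TTL deletions appear in the stream as REMOVE records whose userIdentity names the DynamoDB service, which is how they are told apart from deletes made by users.

```python
import time


def build_item(object_id, delay_seconds, now=None):
    """Build the item to store: the TTL attribute is an epoch time in seconds."""
    now = time.time() if now is None else now
    return {"object_id": object_id, "expires_at": int(now) + int(delay_seconds)}


def expired_object_ids(stream_event):
    """Return object ids from stream records produced by TTL expiry only."""
    ids = []
    for record in stream_event.get("Records", []):
        identity = record.get("userIdentity") or {}
        is_ttl_delete = (
            record.get("eventName") == "REMOVE"
            and identity.get("type") == "Service"
            and identity.get("principalId") == "dynamodb.amazonaws.com"
        )
        if not is_ttl_delete:
            continue  # inserts, updates, and manual deletes are ignored
        old_image = record["dynamodb"].get("OldImage", {})
        if "object_id" in old_image:
            # Stream images use DynamoDB JSON, e.g. {"S": "abc"} for strings.
            ids.append(old_image["object_id"]["S"])
    return ids
```

For "this object will do x in 3 days", the item would be built with delay_seconds=3 * 24 * 3600 and the Lambda handler would act on each id returned by expired_object_ids.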

Automated Real Time Data Processing on AWS with Lambda

I am interested in doing automated real-time data processing on AWS using Lambda and I am not certain about how I can trigger my Lambda function. My data processing code involves taking multiple files and concatenating them into a single data frame after performing calculations on each file. Since files are uploaded simultaneously onto S3 and files are dependent on each other, I would like the Lambda to be only triggered when all files are uploaded.
Current Approaches/Attempts:
- I am considering an S3 trigger, but my concern is that an S3 trigger will result in an error in the case where a single file upload triggers the Lambda to start. An alternate option would be adding a wait time, but that is not preferred, in order to limit the computation resources used.
- A scheduled trigger using CloudWatch/EventBridge, but this would not be real-time processing.
- An SNS trigger, but I am not certain if the message can be automated without knowing when the file uploads are complete.
Any suggestion is appreciated! Thank you!
If you really cannot do it with a scheduled function, the best option is to trigger a Lambda function when an object is created.
The tricky bit is that it will fire your function on each object upload. So you either identify the "last part", e.g. based on some metadata, or you need to store and track the state of all uploads, e.g. in DynamoDB, and do the actual processing only when a batch is complete.
Best, Stefan
Your file coming in parts might be named as -
filename_part1.ext
filename_part2.ext
If one of your own systems is generating those files, then have that system generate a final, empty dummy file named:
filename.final
Since your S3 event trigger can filter on a key suffix, use the .final extension to invoke the Lambda and process the records.
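As a small illustration of the marker-file approach, the Python sketch below maps each .final key in an S3 event to the prefix shared by its parts, following the filename_part1.ext naming shown above. The suffix filter itself is configured on the bucket notification; the check in the code is only a defensive repeat. The function name is hypothetical.

```python
from urllib.parse import unquote_plus


def parts_prefixes(event, marker_suffix=".final"):
    """Return the key prefix of the part files for each marker object."""
    prefixes = []
    for record in event.get("Records", []):
        # Keys are URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        if key.endswith(marker_suffix):
            # "incoming/filename.final" -> "incoming/filename_part"
            prefixes.append(key[: -len(marker_suffix)] + "_part")
    return prefixes
```

The handler would then list the objects under each returned prefix and run the concatenation once per marker.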
Alternatively, if you do not have access to the server putting objects into your S3 bucket, then on each PUT operation in the bucket, invoke the Lambda and insert an entry in DynamoDB.
You need to put one unique entry per file (not per file part) in DynamoDB, with:
filename and last_part_recieved_time
The last_part_recieved_time keeps getting updated for as long as file parts keep arriving.
This table can then be checked by a cron-triggered Lambda, which tests whether the time skew (the difference between the system time of the Lambda invocation and the DynamoDB entry's last_part_recieved_time) is large enough to process the records.
I would still prefer the first approach, as the second one still leaves a chance for error.
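The quiet-period check in the second approach could look roughly like this in Python. A plain dict stands in for the DynamoDB table here, which is an assumption made to keep the sketch runnable; the 120-second threshold and the function names are also hypothetical. Times are epoch seconds.

```python
def record_part(table, filename, received_at):
    """Upsert the latest part arrival time for a file (one entry per file)."""
    table[filename] = max(received_at, table.get(filename, received_at))


def ready_files(table, now, quiet_seconds=120):
    """Return files whose last part arrived at least quiet_seconds ago."""
    return sorted(
        name for name, last_part in table.items()
        if now - last_part >= quiet_seconds
    )
```

The chance for error mentioned above is visible in the sketch: a part that arrives later than quiet_seconds after the previous one would be missed by the batch that was already processed.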
Since you want this to be as real time as possible, perhaps you could just perform your logic every single time a file is uploaded, updating the version of the output as new files are added, and iterating through an S3 prefix per grouping of files, like in this other SO answer.
In terms of the architecture, you could add in an SQS queue or two to make this more resilient. An S3 Put Event can trigger an SQS message, which can trigger a Lambda function, and you can have error handling logic in the Lambda function that puts that event in a secondary queue with a visibility timeout (sort of like a backoff strategy) or back in the same queue for retries.

What's the best aws approach to send a notification message to validate if all records have been processed in dynamoDB

Introduction
We are building an application to process a monthly file, and there are many aws components involved in this project:
A Lambda reads the file from S3, parses it, and pushes it to DynamoDB with a flag (PENDING) for each record.
Another Lambda processes these records after the first Lambda is done, and flags each record as (PROCESSED) once it is done with it.
Problem:
We want to send a result to SQS after all records are processed.
Our approach
Is to use DynamoDB Streams to trigger a Lambda each time a record gets updated, have the Lambda query DynamoDB to check if all records are processed, and send the notification when that's true.
Questions
Is there any other approach that can achieve this goal without triggering a Lambda each time a record gets updated?
Is there a better approach that doesn't include DynamoDB Streams?
I would recommend DynamoDB Streams, as they are reliable enough. Triggering a Lambda for an update is pretty cheap, and execution will usually take 1-100 ms. Even if you have millions of executions, it is a robust solution. There is also a way to keep a shared counter of processed messages using ElastiCache: once you receive an update and the counter is 0, you are complete.
Are there any other approach that can achieve this goal without triggering Lambda each time a record gets updated?
Another option is a scheduled Lambda execution that checks the status of the records in the DB (query for PROCESSED) and sends the result to SQS. Depending on the load, you can define how often it runs (trigger it with a CloudWatch scheduled event).
What about having a table monthly_file_process with a row for every month, each holding an extra counter?
Once the S3 file is read, count the records and persist the total as the counter. With every PROCESSED record, decrease the counter; if the counter is 0 after the update, send the SQS notification. All of this, including sending to SQS, could be done from the second Lambda, which processes the records, as just an extra step that checks the counter.
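The counter step could be sketched in Python like this. The table argument is assumed to be a boto3 DynamoDB Table resource for monthly_file_process; the key attribute month and the counter attribute remaining are hypothetical names. The ADD update expression decrements atomically, and ReturnValues="UPDATED_NEW" returns the value after the update, so only one caller sees it reach zero.

```python
def mark_processed(table, month):
    """Atomically decrement the month's counter; True when it reaches zero."""
    response = table.update_item(
        Key={"month": month},
        UpdateExpression="ADD #r :dec",
        # Alias the attribute name to avoid clashing with reserved words.
        ExpressionAttributeNames={"#r": "remaining"},
        ExpressionAttributeValues={":dec": -1},
        ReturnValues="UPDATED_NEW",
    )
    return response["Attributes"]["remaining"] == 0
```

The processing Lambda would call mark_processed after flagging a record as PROCESSED and send the SQS message when it returns True. If the same record can be processed twice (for example on a retry), the counter would be decremented twice, so the decrement should be tied to the record's actual PENDING to PROCESSED transition.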

aws dynamodb stream lambda processes too quickly

I have a DynamoDB table that I send data into; a stream on it is processed by a Lambda that rolls up some stats and inserts them back into the table.
My issue is that my Lambda is processing the events too quickly, so almost every insert results in a write back to the DynamoDB table, and those writes are causing throttling.
I need to slow my lambda down!
I have set my concurrency to 1
I had thought about just putting a sleep statement into the lambda code, but this will be billable time.
Can I delay the Lambda to only start once every x minutes?
You can't easily limit how often the Lambda runs, but you could re-architect things a little bit and use a scheduled CloudWatch Event as a trigger instead of your DynamoDB stream. Then you could have the Lambda execute every x minutes, collate the stats for records added since the last run, and push them to the table.
I never tried this myself, but I think you could do the following:
Put a delay queue between the stream and your Lambda.
That is, you would have a new Lambda function that just pushes events from the DDB stream to this SQS queue. You can set a delay of up to 15 minutes on the queue. Then set up your original Lambda to be triggered by the messages in this queue. Be wary of SQS limits though.
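A sketch of such a forwarding Lambda in Python, under the same "never tried this" caveat as the answer. The sqs_client argument is assumed to be a boto3 SQS client and queue_url is a placeholder. It uses send_message_batch, which takes at most 10 entries per call, and the per-message DelaySeconds, which is capped at 900 seconds (15 minutes).

```python
import json


def forward_with_delay(sqs_client, queue_url, stream_event, delay_seconds=900):
    """Forward DynamoDB stream records to SQS with a delay; return failures."""
    if not 0 <= delay_seconds <= 900:
        raise ValueError("SQS allows a delay of 0 to 900 seconds")
    records = stream_event.get("Records", [])
    failed = 0
    for start in range(0, len(records), 10):  # batch API limit: 10 entries
        entries = [
            {
                "Id": str(start + offset),  # must be unique within the batch
                "MessageBody": json.dumps(record["dynamodb"]),
                "DelaySeconds": delay_seconds,
            }
            for offset, record in enumerate(records[start:start + 10])
        ]
        response = sqs_client.send_message_batch(QueueUrl=queue_url, Entries=entries)
        failed += len(response.get("Failed", []))
    return failed
```

A non-zero return value means some records were not queued; raising in that case would make Lambda retry the stream batch.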
As per the Lambda docs: "By default, Lambda invokes your function as soon as records are available in the stream. If the batch it reads from the stream only has one record in it, Lambda only sends one record to the function. To avoid invoking the function with a small number of records, you can tell the event source to buffer records for up to 5 minutes by configuring a batch window. Before invoking the function, Lambda continues to read records from the stream until it has gathered a full batch, or until the batch window expires." Using this, you can add a bit of a delay, and perhaps process the batch sequentially after receiving it. Also, since faster execution is not your priority, you will save cost as well: fewer Lambda invocations, and no cost for sleeping. From the AWS Lambda docs: "You are charged based on the number of requests for your functions and the duration, the time it takes for your code to execute."
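For reference, the batch window is a setting on the event source mapping. The Python sketch below only builds the parameters; the helper name, the default batch size of 100, and the example names in the usage note are illustrative. The parameter names match boto3's create_event_source_mapping, and the 300-second cap mirrors the 5 minutes quoted above.

```python
def stream_mapping_params(function_name, stream_arn, batch_size=100, window_seconds=300):
    """Build keyword arguments for a DynamoDB stream event source mapping."""
    if not 0 <= window_seconds <= 300:
        raise ValueError("batch window must be between 0 and 300 seconds")
    if batch_size < 1:
        raise ValueError("batch size must be at least 1")
    return {
        "FunctionName": function_name,
        "EventSourceArn": stream_arn,
        "StartingPosition": "LATEST",
        "BatchSize": batch_size,
        # Lambda waits up to this long to fill a batch before invoking.
        "MaximumBatchingWindowInSeconds": window_seconds,
    }
```

With boto3 this would be applied as boto3.client("lambda").create_event_source_mapping(**stream_mapping_params("rollup-fn", stream_arn)), where "rollup-fn" and stream_arn are placeholders.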
No, unfortunately you cannot do it.
Having the concurrency set to 1 will definitely help, but it won't solve the problem. What you could do instead is increase the table's capacity a little to prevent throttling (since the throttling comes from writing back into the table, that means the WCUs).
To circumvent the problem though, @bwest's approach seems very good. I'd go with that.
Instead of adding a delay or setting concurrency to 1, you can do the following:
Increase the batch size, so that you process a few events together. This introduces some delay and also costs less.
Instead of putting data back into DynamoDB, put it into another store where you are charged not by WCU but by the amount of memory/RAM you use.
Have a CloudWatch-triggered Lambda that takes data from this temporary store and puts it back into DynamoDB.
This ensures a few things:
You can control the lag w.r.t. staleness of the aggregated data (i.e. you can define two strategies, say 15 minutes or 1000 events, whichever comes first).
Your Lambda won't have to discard events when you are writing aggregated data very often (this problem remains even if you use SQS).
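Whichever store holds the intermediate data, the saving comes from collapsing a whole batch into one aggregated write instead of one write per event. Here is a minimal Python sketch of that roll-up step, assuming a string partition key named pk and roll-up items whose keys start with STATS#; both are hypothetical conventions. Skipping the roll-up items matters when the stats live in the same table, because those writes show up in the stream as well.

```python
from collections import Counter


def rollup(stream_event, stats_prefix="STATS#"):
    """Count INSERT records per partition key, ignoring the roll-up items."""
    counts = Counter()
    for record in stream_event.get("Records", []):
        if record.get("eventName") != "INSERT":
            continue
        pk = record["dynamodb"]["Keys"]["pk"]["S"]
        if pk.startswith(stats_prefix):
            continue  # written by the roll-up itself; do not count it again
        counts[pk] += 1
    return dict(counts)
```

A batch of 1000 inserts spread over a handful of keys then results in a handful of counter updates rather than 1000 writes.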