AWS SNS trigger when 3 objects are loaded into S3 - amazon-web-services

I have a client that performs a weekly data upload of 3 CSV files to an S3 bucket, always within (at most) 5 minutes of each other. I have python code that ingests the 3 files, aggregates them, and writes and output that I would like to use to create a lambda function to fully automate this job. The issue is that I can't configure an S3 trigger that is every 3 object creates. I could have the lambda trigger every upload and exit until the 3 files are there but I don't want to do that as it's not really cost effective.
So I came across this question here that suggested having an SNS Topic that gets notified after a batch of uploads is completed, however I'm having trouble figuring out how to configure that. Basically what I'd like to do is create something similar to a CloudWatch Alarm that triggers when 3 object PUTS have occurred within 5 minutes of each other. Is this possible? Or, how can I configure my SNS event in a way as is suggested in the linked question?

I think you are misreading the answer in the post you linked. The SNS option is literally someone just sending an SNS message after they have pushed their batch of files to S3. So the sender would need to update their process to include a new SNS message step.
Option 3 in that question/answer is about using S3 triggers to handle this in an automatic fashion, without the sender needing to do any extra steps besides upload files, but it is also much more complicated since it involves using a DynamoDB table with Atomic Counters to ensure the files aren't processed multiple times.
I would like to point out that your concern over the cost effectiveness of triggering a Lambda function for each S3 object is going to be on the order of a few pennies a month, maybe less.

Related

Is there any notification event I can trace for completion of an execution of AWS S3 lifecycle rule?

I wanted to delete large number of S3 files (may be few 100K or 1000K, which I do not have control) in a bulk async process. I tried to look into multiple blogs and collated below strategies:
Leverage AWS S3 REST API from the async thread of custom application
Here the drawbacks are:
I will have to make huge number of S3 API calls as 1 request is limited for 1000 S3 objects and I may not know the exact S3 object.
Even if I identify the S3 objects to delete, I will have to first GET and then DELETE which will make the solution costly.
Here I will have to keep track of deleted chunks and in case of any failure in middle of operation, I will have to build a mechanism to re-trigger the chunks which failed to be deleted.
Leveraging S3 lifecycle policy
Here the drawbacks are:
We are storing multiple customer data into same bucket segregated by customer-id in prefix. With growing number of customers, we foresee that the 1000 rules per bucket hard limit may hit us.
To surpass above drawback, we can delete the rule and free-up the quota for next requests. But we were looking for any event based notification which can tell us back that the bulk delete operation is complete.
Again with growing number of customers, here we may loose predictability of the bulk delete operation. This is because of accumulated jobs due to reached quota limit and a submitted bulk delete job may have to wait for days to be completed.
Create only 1 rule with a special bulk delete tag and use it to set 1 S3 lifecycle policy
With this approach, we believe we will not hit the limit issue as we are expecting in above approach. And as we understood that these S3 lifecycle rules gets executed once a day (though we don't know exactly when), so we are assured that in max next 24h, the rule will get triggered and then it will take some time to actually complete the bulk delete operation (may be few mins or hours, we don't know). Here also we have the open question as: Is there a notification event after completion of 1 execution of S3 lifecycle rule which we can listen and update the status of all submitted bulk delete jobs as DONE? In lack of such notification event, it becomes difficult to let transparently communicate it back to the end-user who triggered the bulk delete async operation.
Any comments/advice on below strategies will be helpful. Also if you can help me with the answer for the last strategy which I guess is the most preferable choice I have as of now.
I tried all the above stated strategies and got stuck at the mentioned problem for each. Any inputs/advice on above will be of great help.
After all evaluations, we have finalized to go with codeful delete relevant data for specific time-range as an async java process leveraging S3 bulk delete SDK (DeleteObjectsRequest).

Automated Real Time Data Processing on AWS with Lambda

I am interested in doing automated real-time data processing on AWS using Lambda and I am not certain about how I can trigger my Lambda function. My data processing code involves taking multiple files and concatenating them into a single data frame after performing calculations on each file. Since files are uploaded simultaneously onto S3 and files are dependent on each other, I would like the Lambda to be only triggered when all files are uploaded.
Current Approaches/Attempts:
-I am considering an S3 trigger, but my concern is that an S3 Trigger will result in an error in the case where a single file upload triggers the Lambda to start. An alternate option would be adding a wait time but that is not preferred to limit the computation resources used.
-A scheduled trigger using Cloudwatch/EventBridge, but this would not be real-time processing.
-SNS trigger, but I am not certain if the message can be automated without knowing the completion in file uploads.
Any suggestion is appreciated! Thank you!
If you really cannot do it with a scheduled function, the best option is to trigger a Lambda function when an object is created.
The tricky bit is that it will fire your function on each object upload. So you either can identify the "last part", e.g., based on some meta data, or you will need to store and track the state of all uploads, e.g. in a DynamoDB, and do the actual processing only when a batch is complete.
Best, Stefan
Your file coming in parts might be named as -
filename_part1.ext
filename_part2.ext
If any of your systems is generating those files, then use the system to generate a final dummy blank file name as -
filename.final
Since in your S3 event trigger you can use a suffix to generate an event, use .final extension to invoke lambda, and process records.
In an alternative approach, if you do not have access to the server putting objects to your s3 bucket, then with each PUT operation in your s3 bucket, invoke the lambda and insert an entry in dynamoDB.
You need to put a unique entry per file (not file parts) in dynamo with -
filename and last_part_recieved_time
The last_part_recieved_time keeps getting updated till you keep getting the file parts.
Now, this table can be looked up by a cron lambda invocation which checks if the time skew (time difference between SYSTIME of lambda invocation and dynamoDB entry - last_part_recieved_time) is enough to process the records.
I will still prefer to go with the first approach as the second one still has a chance for error.
Since you want this to be as real time as possible, perhaps you could just perform your logic every single time a file is uploaded, updating the version of the output as new files are added, and iterating through an S3 prefix per grouping of files, like in this other SO answer.
In terms of the architecture, you could add in an SQS queue or two to make this more resilient. An S3 Put Event can trigger an SQS message, which can trigger a Lambda function, and you can have error handling logic in the Lambda function that puts that event in a secondary queue with a visibility timeout (sort of like a backoff strategy) or back in the same queue for retries.

AWS S3 folder put event notification

I've written a function in Python that uploads a folder and its content to S3. Now I would like S3 to generate an event (so I can send it to a lambda function). S3 allows to generate events only at file level, in fact folders on s3 are just a visualization layer, which means that S3 has no internal representation for folders, keys with the same root are simply grouped together. That said, as for now I've come up with three approaches that revolves around the idea of a 'poison pill'.
Send a special file at the end of the folder upload process, the creation of which sends an event to lambda that can open the file to read custom directives to act on. Seems that this approach is quite flexible, however it poses serious concerns security-wise (I know that ACLs are in place for this reason but I'm not quite sure if it's enough), and generates some overhead while downloading/uploading/deleting the file from/to local memory.
Map an event to the target lambdas and fire it directly. The difference in approaches is simply that in this case I'm not really creating a file on S3, I'm just making S3 believe so. I would use CloudWatch to fire custom S3-object-created events with the name of the folder for lambda to pick up. This approach feels a little more hacky than the other two, plus when I did my research on the matter it seemed like it shouldn't be possible to generate "mock" events on AWS (i.e. Trigger S3 create event). To my understanding however, the function put_events should do the trick.
Using SQS would allow to put the folder name into an SQS task that can be later consumed by lambda. This has some advantages over the other two approaches, since SQS has now a LIFO variant that allows for exactly-once-delivery, failures reprocessing (via dead letters queue), etc, however this generates a non-trivial amount of complexity compared to the other approaches.
At this point I'm trying to opt for the most 'correct' approach, and
in order to do so I'm trying to weight pros and cons to make an informed decision, which led me to some questions:
Is there another way I'm missing out to proceed that does not involve client notification ? (all the aforementioned approaches rely on the client sending the notification in one way or another, which is not very "cloudy")?
Is there a substantial difference between approaches 2 and 3, considering that both rely on sending the information in and out of a stream (CloudWatch and SQS respectively)?
Have you consider using the prefix option of S3 bucket event, I tested it and it worked fine. In my S3 bucket I created two folder test1 and test2. On s3 event I added prefix test1 with that in place every time put/copy operation happen on bucket lambda is trigger.
I think your question nets down to "how can I trigger a Lambda function after I have uploaded a folder full of files to S3?"
Unless you have some information a priori server-side that you can use to determine when the folder upload has completed, the client is going to have to tell you.
Options I would consider:
change your client to publish a message to SNS or to SQS upon the completion of uploading to S3. That message can then trigger your Lambda function.
after the last file has been uploaded to folder images/dogs/, upload a zero-sized object whose key is the same as the folder (images/dogs/). This is a 'sentinel file'. Use an S3 event trigger with suffix of / to detect the upload of that 'folder' object and trigger your Lambda.
I prefer the 1st option. It achieves the end goal without resulting in extraneous S3 objects. With SNS you can also configure multiple downstream processes in response to the ‘finished upload’ message (a fan out) if needed.

"Realtime" syncing of large numbers of log files to S3

I have a large number of logfiles from a service that I need to regularly run analysis on via EMR/Hive. There are thousands of new files per day, and they can technically come out of order relative to the file name (e.g. a batch of files comes a week after the date in the file name).
I did an initial load of the files via Snowball, then set up a script that syncs the entire directory tree once per day using the 'aws s3 sync' cli command. This is good enough for now, but I will need a more realtime solution in the near future. The issue with this approach is that it takes a very long time, on the order of 30 minutes per day. And using a ton of bandwidth all at once! I assume this is because it needs to scan the entire directory tree to determine what files are new, then sends them all at once.
A realtime solution would be beneficial in 2 ways. One, I can get the analysis I need without waiting up to a day. Two, the network use would be lower and more spread out, instead of spiking once a day.
It's clear that 'aws s3 sync' isn't the right tool here. Has anyone dealt with a similar situation?
One potential solution could be:
Set up a service on the log-file side that continuously syncs (or aws s3 cp) new files based on the modified date. But wouldn't that need to scan the whole directory tree on the log server as well?
For reference, the log-file directory structure is like:
/var/log/files/done/{year}/{month}/{day}/{source}-{hour}.txt
There is also a /var/log/files/processing/ directory for files being written to.
Any advice would be appreciated. Thanks!
You could have a Lambda function triggered automatically as a new object is saved on your S3 bucket. Check Using AWS Lambda with Amazon S3 for details. The event passed to the Lambda function will contain the file name, allowing you to target only the new files in the syncing process.
If you'd like wait until you have, say 1,000 files, in order to sync in batch, you could use AWS SQS and the following workflow (using 2 Lambda functions, 1 CloudWatch rule and 1 SQS queue):
S3 invokes Lambda whenever there's a new file to sync
Lambda stores the filename in SQS
CloudWatch triggers another Lambda function every X minutes/hours to check how many files are there in SQS for syncing. Once there's 1,000 or more, it retrieves those filenames and run the syncing process.
Keep in mind that Lambda has a hard timeout of 5 minutes. If you sync job takes too long, you'll need to break it in smaller chunks.
You could set the bucket up to log HTTP requests to a separate bucket, then parse the log to look for newly created files and their paths. One troublespot, as well as PUT requests, you have to look for the multipart upload ops which are a sequence of POSTs. Best to log for a few days to see what gets created before putting any effort in to this approach

AWS - want to upload multiple files to S3 and only when all are uploaded trigger a lambda function

I am seeking advice on what's the best way to design this -
Use Case
I want to put multiple files into S3. Once all files are successfully saved, I want to trigger a lambda function to do some other work.
Naive Approach
The way I am approaching this is by saving a record in Dynamo that contains a unique identifier and the total number of records I will be uploading along with the keys that should exist in S3.
A basic implementation would be to take my existing lambda function which is invoked anytime my S3 bucket is written into, and have it check manually whether all the other files been saved.
The Lambda function would know (look in Dynamo to determine what we're looking for) and query S3 to see if the other files are in. If so, use SNS to trigger my other lambda that will do the other work.
Edit: Another approach is have my client program that puts the files in S3 be responsible for directly invoking the other lambda function, since technically it knows when all the files have been uploaded. The issue with this approach is that I do not want this to be the responsibility of the client program... I want the client program to not care. As soon as it has uploaded the files, it should be able to just exit out.
Thoughts
I don't think this is a good idea. Mainly because Lambda functions should be lightweight, and polling the database from within the Lambda function to get the S3 keys of all the uploaded files and then checking in S3 if they are there - doing this each time seems ghetto and very repetitive.
What's the better approach? I was thinking something like using SWF but am not sure if that's overkill for my solution or if it will even let me do what I want. The documentation doesn't show real "examples" either. It's just a discussion without much of a step by step guide (perhaps I'm looking in the wrong spot).
Edit In response to mbaird's suggestions below-
Option 1 (SNS) This is what I will go with. It's simple and doesn't really violate the Single Responsibility Principal. That is, the client uploads the files and sends a notification (via SNS) that its work is done.
Option 2 (Dynamo streams) So this is essentially another "implementation" of Option 1. The client makes a service call, which in this case, results in a table update vs. a SNS notification (Option 1). This update would trigger the Lambda function, as opposed to notification. Not a bad solution, but I prefer using SNS for communication rather than relying on a database's capability (in this case Dynamo streams) to call a Lambda function.
In any case, I'm using AWS technologies and have coupling with their offering (Lambda functions, SNS, etc.) but I feel relying on something like Dynamo streams is making it an even tighter coupling. Not really a huge concern for my use case but still feels dirty ;D
Option 3 with S3 triggers My concern here is the possibility of race conditions. For example, if multiple files are being uploaded by the client simultaneously (think of several async uploads fired off at once with varying file sizes), what if two files happen to finish uploading at around the same time, and two or more Lambda functions (or whatever implementations we use) query Dynamo and gets back N as the completed uploads (instead of N and N+1)? Now even though the final result should be N+2, each one would add 1 to N. Nooooooooooo!
So Option 1 wins.
If you don't want the client program responsible for invoking the Lambda function directly, then would it be OK if it did something a bit more generic?
Option 1: (SNS) What if it simply notified an SNS topic that it had completed a batch of S3 uploads? You could subscribe your Lambda function to that SNS topic.
Option 2: (DynamoDB Streams) What if it simply updated the DynamoDB record with something like an attribute record.allFilesUploaded = true. You could have your Lambda function trigger off the DynamoDB stream. Since you are already creating a DynamoDB record via the client, this seems like a very simple way to mark the batch of uploads as complete without having to code in knowledge about what needs to happen next. The Lambda function could then check the "allFilesUploaded" attribute instead of having to go to S3 for a file listing every time it is called.
Alternatively, don't insert the DynamoDB record until all files have finished uploading, then your Lambda function could just trigger off new records being created.
Option 3: (continuing to use S3 triggers) If the client program can't be changed from how it works today, then instead of listing all the S3 files and comparing them to the list in DynamoDB each time a new file appears, simply update the DynamoDB record via an atomic counter. Then compare the result value against the size of the file list. Once the values are the same you know all the files have been uploaded. The down side to this is that you need to provision enough capacity on your DynamoDB table to handle all the updates, which is going to increase your costs.
Also, I agree with you that SWF is overkill for this task.