How to debounce events on AWS grouped by a key? - amazon-web-services

Our frontend application sends user actions to a lambda function behind an API gateway, which then stores these actions in dynamodb.
We then use dynamodb streams to trigger a separate lambda function that'll parse these actions in dynamodb and decide if the user's actions should result in any notifications being sent (we call these notification events).
For example, if a user places a comment in our app, we'll store a "CREATED_COMMENT" action in dynamodb, which will then trigger a new lambda through a dynamodb stream. The new lambda may then create an "email notification event", which we may send to an email provider like customer.io
However, our users have informed us that they receive emails too frequently, and thus we'd like to start sending email digests aggregating multiple actions over time into a single email rather than sending an email for each action.
Our idea was to use something like AWS EventBridge, Kinesis, Step Functions, or even DynamoDB streams to resend the dynamodb stream actions to, but then configure the new stream's events to be grouped by email address and for these events to be debounced by e.g. 10 minutes. If the user then performs a new action, that user's stream will continue gathering actions for another 10 minutes, until there's been no new actions from that user for 10 minutes. Once that happens, the stream will "release" all gathered actions and invoke a lambda function. Our lambda function will then generate the email notification event and send it to e.g. customer.io.
However, we've been unable to find such grouping and debounced flushing configuration in any of the aforementioned AWS stream services. For such a common thing as digesting (or rolling up), shouldn't there be a serverless approach to doing this without having to write our own queueing service?

To me, the answer is a tool such as SQS. SQS lets you accumulate messages in a queue, and every x minutes a Lambda function running on a scheduled event can read the queue. The Lambda does not need to be triggered by SQS; it can read the queue "manually" from within the function instead.
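A minimal sketch of the "read the queue manually" part, assuming `sqs` is a boto3 SQS client (or anything with the same `receive_message` and `delete_message_batch` methods). The function name `drain_queue` and the `max_messages` cap are made up for illustration. For brevity it deletes each batch right after reading; real code would probably delete only after the digest has been sent.

```python
def drain_queue(sqs, queue_url, max_messages=1000):
    """Collect up to max_messages from the queue, 10 at a time."""
    collected = []
    while len(collected) < max_messages:
        response = sqs.receive_message(
            QueueUrl=queue_url,
            # SQS returns at most 10 messages per ReceiveMessage call.
            MaxNumberOfMessages=min(10, max_messages - len(collected)),
            WaitTimeSeconds=1,
        )
        messages = response.get("Messages", [])
        if not messages:
            break  # nothing visible in the queue right now
        collected.extend(messages)
        # Assumption: deleting immediately is acceptable for this sketch.
        sqs.delete_message_batch(
            QueueUrl=queue_url,
            Entries=[
                {"Id": str(index), "ReceiptHandle": message["ReceiptHandle"]}
                for index, message in enumerate(messages)
            ],
        )
    return collected
```

The scheduled Lambda would call this once per run and then group the returned messages by user before building the digest.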

Gareth McCumskey is on the right track.
Use a normal SQS queue strictly for debouncing.
Set a batch window, e.g. 5 seconds. Use a really large batch size when you read from the queue.
In code, use a hash map to group the messages with the same messageId together. Now use your deduped messageIds to do your work.
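The grouping step could look roughly like this. It assumes the Lambda receives a standard SQS event (each record has `messageId` and `body`) and that each body is JSON with an `email` field; that field name is an assumption, not something from the question. Duplicate deliveries are dropped by `messageId`, and the remaining actions are grouped per email address.

```python
import json
from collections import defaultdict


def group_by_email(event):
    """Drop duplicate SQS records, then group the actions by email address."""
    grouped = defaultdict(list)
    seen_ids = set()
    for record in event["Records"]:
        if record["messageId"] in seen_ids:
            continue  # same message delivered twice in this batch
        seen_ids.add(record["messageId"])
        action = json.loads(record["body"])
        grouped[action["email"]].append(action)  # "email" is an assumed field
    return dict(grouped)
```

Each key of the result would then become one digest email.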

I wrote a blog post on something just like this. The short version of it is that it uses a scheduled Lambda function to identify the records that need to be processed.
The problem with using the delay in SQS is that you can only receive 10 messages at a time, so in order to get all the messages you'd have to call SQS repeatedly to clear the queue. At that point, you can aggregate the messages. This doesn't scale very well, as all the messages have to be read in order for it to work. By using DynamoDB you can actually have just one record that represents the collection of records, and query the single record, which then can result in a message in a queue for that specific group of messages. Consider the following data:
user | comment | time
user 1 | comment 1 | 11:43am
user 1 | comment 2 | 11:50am
user 2 | comment 1 | 11:51am
You can add another record that is a signal for the need to send a message for each user (in this example 15 minutes after the first message).
user | scheduled
user 1 | 11:58
user 2 | 12:06
When you insert the second set of records you are inserting the time when you want to send the batch. You only do the insert if there isn't a record already, so you don't end up constantly increasing the time. Your scheduled process reads that record to know which users it needs to send messages to and collects all the data for each of those users. The process of sending the messages to each user can be done in parallel (you could send a message to SQS for each user or use a Map state in a step function, for example).
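Here is an in-memory sketch of that logic, with a plain dict standing in for the second table. The function names are made up for illustration. In DynamoDB the "insert only if absent" step would be a conditional put using `attribute_not_exists` on the key, and the scheduled process would query for records whose time has passed.

```python
from datetime import datetime, timedelta

DIGEST_DELAY = timedelta(minutes=15)


def schedule_if_absent(schedule, user, action_time):
    """Record when the user's digest is due, unless one is already pending."""
    if user not in schedule:
        schedule[user] = action_time + DIGEST_DELAY
    return schedule[user]


def pop_due_users(schedule, now):
    """Return the users whose digest is due and clear their schedule records."""
    due = [user for user, when in schedule.items() if when <= now]
    for user in due:
        del schedule[user]
    return due


schedule = {}
schedule_if_absent(schedule, "user 1", datetime(2024, 1, 1, 11, 43))
schedule_if_absent(schedule, "user 1", datetime(2024, 1, 1, 11, 50))  # no change
schedule_if_absent(schedule, "user 2", datetime(2024, 1, 1, 11, 51))
```

With the sample data above, user 1 stays scheduled for 11:58 and user 2 for 12:06, so a run at 12:00 would pick up only user 1.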

Related

I want to know when a batch of messages has completed in an AWS SQS queue

I think this is more of an 'architecture design' question.
I have a producer lambda that will put ~600 messages on an SQS queue (there are multiple producers) as a batch (so not 1 message with a body of ~600 messages). A consumer lambda will take individual messages and deal with them (at scale). What I want to do is run another lambda when each batch is complete.
My initial idea was to create a 'unique batch number', a 'total batch number' and a 'batch position number' and add them to the message attributes of every message, and then in the consumer lambda check these to decide if the batch is complete.
But does that mean I would need to use a FIFO queue, partition on the batch number, and only have one lambda consumer per batch? Or do I run some sort of state management in DynamoDB (is there a pattern out there for this? Please guide me on this).
Regards, J
It seems like the goal is to achieve Fork-Join capabilities in a distributed system. One way to handle this in AWS is using Step Functions. Assuming a queue service needs to be used, state of the overall operation will need to be tracked. Some ways to do this are:
Store state of the overall operation in a DB.
Put a 'termination' message in the queue after all others and process FIFO.
Create a metadata service which receives 'start' and 'stop' messages for each service and handles them accordingly.
Reference: Fork and Join with Amazon Lambda
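As a rough illustration of the "store state in a DB" option, here is an in-memory sketch that uses the batch id, position, and total from the question. The class name `BatchTracker` is hypothetical. In DynamoDB the per-batch set could be a string-set attribute updated atomically; here a plain Python set stands in for it. Recording positions rather than a bare counter keeps it safe against duplicate deliveries.

```python
class BatchTracker:
    """Tracks which positions of each batch have been processed."""

    def __init__(self):
        self._seen = {}  # batch_id -> set of processed positions

    def record(self, batch_id, position, total):
        """Return True exactly once: when this message completes the batch."""
        seen = self._seen.setdefault(batch_id, set())
        already_complete = len(seen) == total
        seen.add(position)  # adding a duplicate position changes nothing
        return len(seen) == total and not already_complete


tracker = BatchTracker()
results = [
    tracker.record("batch-1", 1, 3),
    tracker.record("batch-1", 1, 3),  # duplicate delivery
    tracker.record("batch-1", 2, 3),
    tracker.record("batch-1", 3, 3),  # completes the batch
    tracker.record("batch-1", 3, 3),  # duplicate after completion
]
```

The consumer would trigger the follow-up lambda only when `record` returns True.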

Multiple curl calls php issue

My problem: every 20 minutes I want to execute around 25,000 or more curl requests and save each response in a database. PHP does not handle this properly. Which AWS services, other than Lambda, would be best for this?
A common technique for processing large number of similar calls is:
Create an Amazon Simple Queue Service (SQS) queue and push each request into the queue as a separate message. In your case, the message would contain the URL that you wish to retrieve.
Create an AWS Lambda function that performs the download and stores the data in the database.
Configure the Lambda function to trigger off the SQS queue
This way, the SQS queue can trigger hundreds of Lambda functions running parallel. The default concurrency limit is 1000 Lambda functions, but you can request for this to be increased.
You would then need a separate process that, every 20 minutes, queries the database for the URLs and pushes the messages into the SQS queue.
The complete process is:
Schedule -> Lambda pusher -> messages into SQS -> Lambda workers -> database
The beauty of this design is that it can scale to handle large workloads and operates in parallel, rather than each curl request having to wait. If a message cannot be processed, Lambda will automatically try again. Repeated failures will send the message to a Dead Letter Queue for later analysis and reprocessing.
If you wish to perform 25,000 queries every 20 minutes (1200 seconds), this would need a query to complete every 0.05 seconds. That's why it is important to work in parallel.
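To make the arithmetic concrete, this small sketch estimates how many workers must run in parallel for a given request duration. The 2-second figure is only an assumed example, not a measurement.

```python
import math


def required_concurrency(requests, window_seconds, seconds_per_request):
    """Minimum number of parallel workers needed to finish within the window."""
    return math.ceil(requests * seconds_per_request / window_seconds)


# 25,000 requests in a 20-minute (1,200 second) window.
seconds_per_slot = 1200 / 25000          # time budget per request if run serially
workers = required_concurrency(25000, 1200, 2)  # assuming 2 s per request
```

Under that assumption about 42 concurrent workers are needed, which is well within the default Lambda concurrency limit mentioned above.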
By the way, if you are attempting to scrape this information from a single website, I suggest you investigate whether they provide an API otherwise you might be violating the Terms & Conditions of the website, which I strongly advise against.

What is the recommended way to fanout in SQS lambda environment?

I would like to send a push notification to users in my database in a lambda environment via SQS / messaging queue architecture, in order to do that
I would first need to query all users in my database with push notifications enabled.
loop over all of them
send a SQS event/message for each user.
let my sqs triggered lambda handle/send the push notification
Is there a better way to implement this to avoid querying a big number of users and/or looping over all the results to send a SQS message for each?
I would take a slightly different approach here, but similar.
Query the database for the users
Loop over the users
Send one message to SQS for each batch of records, and use the SendMessageBatch operation of SQS to send them. So batches of batches. Each message would carry several "users" to send to, not just one. This should increase your performance because a batch will require fewer lambda invocations.
Lambda handles SQS messages (probably more than one), and each SQS message results in sending many push notifications. In the case of Firebase I believe there is a way to send batches, which is even better. Even without that you can send several messages at once using a Promise.all type logic.
With this structure you can send a very large number of messages really quickly, and probably a lot cheaper. Imagine you need to send to 1M users. SendMessageBatch accepts at most 10 messages per call, so if each message carries 100 users, one call to SQS covers 1,000 users. That would mean 1,000 calls to SQS, far better than the 100K you'd have to make if you sent single-user messages in batches of 10.
On the receiving side, even if you throttled the SQS integration to 1 message per invocation you'd have 10,000 lambda invocations. If you assume even 1s per invocation, and 1000 concurrent invocations, it would take 10 seconds (likely less). If you send one message per user you'd have to make 1M lambda invocations. If you assume each invocation takes 100ms then you can send 10/second, so with 1000 concurrent executions it would take 100 seconds. In reality the numbers are probably even better than that for the batch version, especially if you don't limit it to 1 message at a time.
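A sketch of the "batches of batches" idea, under these assumptions: 100 user ids per message, and 10 messages per call, because SendMessageBatch accepts at most 10 entries per request. The function names and the `user_ids` payload field are made up for illustration.

```python
import json

USERS_PER_MESSAGE = 100
MESSAGES_PER_CALL = 10  # SendMessageBatch accepts at most 10 entries


def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_send_calls(user_ids):
    """Return one list of SendMessageBatch entries per call to SQS."""
    messages = [
        json.dumps({"user_ids": group})
        for group in chunked(user_ids, USERS_PER_MESSAGE)
    ]
    return [
        [{"Id": str(i), "MessageBody": body} for i, body in enumerate(call)]
        for call in chunked(messages, MESSAGES_PER_CALL)
    ]


calls = build_send_calls(list(range(1_000_000)))
```

Each inner list can be passed as `Entries` to a boto3 SQS client's `send_message_batch`. For 1M users this yields 10,000 messages, which matches the number of lambda invocations estimated above.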
Edit
Based on the comments the question seemed to be a bit more about the first part of the process. With that in mind I'd suggest the following options.
If you find yourself needing to address the same large groups repeatedly, most messaging services (Firebase and SNS for sure) support some sort of topic subscription model. Given that these are push notifications you can subscribe a device to the topic in code. What this ultimately leads to is one message sent from your code to the messaging service. The service handles the rest. This is probably the preferred solution for anything that has mass recipients, especially if you can know the recipients up front. This even works for dynamic topics. For example, consider a situation where a person comments on a post. Any new comment on that post should send a message to everyone who has commented on that post. You can create a topic on the fly when the post is created, and add recipients to the topic as they comment. If a user wishes to stop receiving messages you can remove the user from the topic.
If you don't know the recipients up front the above solution is a solid solution. However, if you are concerned with Lambda timeouts on the first two steps I'd modify slightly. I would take advantage of AWS Step Functions and page the data in the lambda. Lambda will tell you, via the context object supplied in the invocation, how much time is remaining. You can check that periodically to determine if you should exit the lambda and pass to the step function the current paging information. The step function can pass that paging information back into the lambda, which should be coded to accept the paging information as part of the request, and continue from that point if supplied.
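The paging loop might look something like this. `fetch_page` and `handle_page` are hypothetical callables you would supply (one returns `(items, next_cursor)`, the other sends the messages), and the 10-second safety margin is an arbitrary assumption. `context.get_remaining_time_in_millis()` is the method the Python Lambda context object provides.

```python
SAFETY_MARGIN_MS = 10_000  # assumed margin; stop paging when less time remains


def process_pages(fetch_page, handle_page, context, cursor=None):
    """Process pages until the data or the Lambda's time budget runs out."""
    while True:
        items, cursor = fetch_page(cursor)  # hypothetical: (items, next_cursor)
        handle_page(items)                  # hypothetical: e.g. send to SQS
        if cursor is None:
            return {"done": True, "cursor": None}
        if context.get_remaining_time_in_millis() < SAFETY_MARGIN_MS:
            # Hand the paging position back to the step function.
            return {"done": False, "cursor": cursor}
```

The step function can then branch on `done` and invoke the lambda again with the returned `cursor` until the data is exhausted.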
I would suggest an additional piece in your application architecture. Assuming you have a large user base, I personally prefer to avoid using the primary database for heavy querying.
I suggest maintaining your user list in a search engine like ElasticSearch or CloudSearch, in a simple AWS DynamoDB table holding just the user list, or in a read replica of your DB.
To keep it simple: use a search engine (first choice) or AWS DynamoDB.
Querying a read-optimized datastore avoids putting pressure on your primary database and won't affect other modules in operation.
It's also much faster to query this way.
Step 2: loop over all of them
Step 3: batch send messages to SQS using its SendMessageBatch method like Jason is suggesting
Step 4: Based on your SQS setting, you may process multiple messages on your Lambda function

How to trigger AWS Lambda just once on multiple S3 notifications

We are designing a pipeline. We get a number of raw files which come into S3 buckets and then we apply a schema and then save them as parquet.
As of now we are triggering a lambda function for each file written, but ideally we would like to start this process only after all the files are written. How can we trigger the lambda just once?
I encourage you to use an alternative that maintains the separation between the publisher (whoever is writing) and the subscriber (you). The publisher tells you when things are written; it's your responsibility to choose when to process those things. The neat pattern here would be for the publisher to write its files in batches and publish manifests for you to trigger on: i.e. a list which says "I just wrote all these things, you can find them in these places". Since you don't have that / can't change the publisher, I suggest the following:
Send the notifications from the publisher to an SQS queue.
Run your lambda on a schedule; how often is determined by how long you're willing to delay ingestion. If you want data to be delayed at most 5min between being published and getting ingested by your system, set your lambda to trigger every 4min. You can use a CloudWatch Events (now EventBridge) scheduled rule for this.
When your lambda runs, poll the queue. Keep going until you accumulate the maximum number of notifications, X, you want to process in one go, or the queue is empty.
Process. If the queue wasn't empty when you stopped polling, immediately trigger another lambda execution.
Things to keep in mind on the above:
As written, it's not parallel, so if your rate of lambda execution is slower than the rate at which the queue fills up, you'll need to 1. run more frequently or 2. insert a load-balancing step: a lambda that is triggered on a schedule, polls the queue, and calls as many processing lambdas as necessary so that each one gets X notifications.
SNS in general and SQS non-FIFO queues specifically don't guarantee exactly-once delivery. They can send you duplicate notifications. Make sure you can handle duplicate processing cleanly.
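One way to handle duplicates before processing is to collapse the polled notifications to unique objects. This sketch assumes S3 delivers its event notifications straight to the SQS queue, so each message body is the S3 event JSON with a `Records` list; if the notifications pass through SNS first, the body is wrapped in an SNS envelope and would need unwrapping. The function name is made up.

```python
import json


def unique_s3_objects(message_bodies):
    """Return each (bucket, key) pair once, in first-seen order."""
    seen = set()
    objects = []
    for body in message_bodies:
        # Messages without "Records" (e.g. S3 test events) are skipped.
        for record in json.loads(body).get("Records", []):
            obj = (
                record["s3"]["bucket"]["name"],
                record["s3"]["object"]["key"],
            )
            if obj not in seen:
                seen.add(obj)
                objects.append(obj)
    return objects
```

This only removes duplicates within one run; duplicates that arrive in different runs still need idempotent processing, for example by overwriting the same parquet output path.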
Hook your Lambda up to a Webhook (API Gateway) and then just call it from your client app once your client app is done.
Solutions:
Zip all the files together and have the Lambda unzip the archive.
Write client code that sends the files one by one and triggers the Lambda when the last one is sent.
Have the Lambda check for the files: if it doesn't find all of them, it quits silently; if it finds all of them, it handles them all in one thread.

Group/throttle burst of events with AWS SQS

I'm working on an event-based email notification service. An email is sent on every single event at the moment. The problem with this approach is sometimes there are too many events for a short period of time, thus the user gets too many emails at once. Instead of this, I'd like to throttle and group these emails by user.
I'm looking at AWS SQS to pipe these events into it and somehow consume them with a lambda that picks and groups the ones that are ready (ready = been there for at least 3 minutes) to be sent. Is there a built-in solution? Can I tag events with a user ID and, on the consumer side, pick the latest one, check if it's been there for 3 minutes, and if so pick all the remaining ones...? Just an idea...