We are designing a pipeline. We get a number of raw files which come into S3 buckets and then we apply a schema and then save them as parquet.
As of now we are triggering a lambda function for each file written but ideally we would like to start this process only after all the files are written. How we can we trigger the lambda just once?
I encourage you to use an alternative that maintains the separation between the publisher (whoever is writing) and the subscriber (you). The publisher tells you when things are written; it's your responsibility to choose when to process those things. The neat pattern here would be for the publisher to write its files in batches and publish manifests for you to trigger on: i.e. a list which says "I just wrote all these things, you can find them in these places". Since you don't have that / can't change the publisher, I suggest the following:
Send the notifications from the publisher to an SQS queue.
Schedule your lambda to run on a schedule; how often is determined by how long you're willing to delay ingestion. If you want data to be delayed at most 5min between being published and getting ingested by your system, set your lambda to trigger every 4min. You can use Cloudwatch notifications for this.
When your lambda runs, poll the queue. Keep going until you accumulate the maximum amount of notifications, X, you want to process in one go, or the queue is empty.
Process. If the queue wasn't empty when you stopped polling, immediately trigger another lambda execution.
Things to keep in mind on the above:
As written, it's not parallel, so if your rate of lambda execution is slower than the rate at which the queue fills up, you'll need to 1. run more frequently or 2. insert a load-balancing step: a lambda that is triggered on a schedule, polls the queue, and calls as many processing lambdas as necessary so that each one gets X notifications.
SNS in general and SQS non-FIFO queues specifically don't guarantee exactly-once delivery. They can send you duplicate notifications. Make sure you can handle duplicate processing cleanly.
Hook your Lambda up to a Webhook (API Gateway) and then just call it from your client app once your client app is done.
Solutions:
Zip all files together, Lambda unzip it
create a UI code and send files one by one, trigger lambda from it when the last one is sent
Lambda check files, if didn't find all files, silent quit. if it finds all files, then handle all files in one thread
Related
Background
This archhitecture relies solely on Lambda's asyncronous invocation mechanism as described here:
https://docs.aws.amazon.com/lambda/latest/dg/invocation-async.html
I have a collector function that is invoked once a minute and fetches a batch of data in that can vary drastically in size (tens of of KB to potentially 1-3MB). The data contains a JSON array containing one-to-many records. The collector function segregates these records and publishes them individually to an SNS topic.
A parser function is subribed the SNS topic and has a concurrency limit of 3. SNS asynchronously invokes the parser function per record, meaning that the built-in AWS managed Lambda asyncronous queue begins to fill up as the instances of the parser maxes out at 3. The Lambda queueing mechanism initiates retries at incremental backups when throttling occurs, until the invocation request can be processed by the parser function.
It is imperitive that a record does not get lost during this process as they can not be resurrected. I will be using dead letter queues where needed to ensure they ultimately end up somewhere in case of error.
Testing this method out resulted in no lost invocation. Everything worked as expected. Lambda reported hundreds of throttle responses but I'm relying on this to initiate the Lambda retry behaviour for async invocations. My understanding is that this behaivour is effectively the same as that which I'd have to develop and initiate myself if I wanted to retry consuming a message coming from SQS.
Questions
1. Is the built-in AWS managed Lambda asyncronous queue reliable?
The parser could be subject to a consistent load of 200+ invocations per minute for prelonged periods so I want to understand whether the Lambda queue can handle this as sensibly as an SQS service. The main part that concerns me is this statement:
Even if your function doesn't return an error, it's possible for it to receive the same event from Lambda multiple times because the queue itself is eventually consistent. If the function can't keep up with incoming events, events might also be deleted from the queue without being sent to the function. Ensure that your function code gracefully handles duplicate events, and that you have enough concurrency available to handle all invocations.
This implies that an incoming invocation may just be deleted out of thin air. Also in my implementation I'm relying on the retry behaviour when a function throttles.
2. When a message is in the queue, what happens when the message timeout is exceeded?
I can't find a difinitive answer but I'm hoping the message would end up in the configured dead letter queue.
3. Why would I use SQS over the Lambda queue when SQS presents other problems?
See the articles below for arguments against SQS. Overpulling (described in the second link) is of particular concern:
https://lumigo.io/blog/sqs-and-lambda-the-missing-guide-on-failure-modes/
https://medium.com/#zaccharles/lambda-concurrency-limits-and-sqs-triggers-dont-mix-well-sometimes-eb23d90122e0
I can't find any articles or discussions of how the Lambda queue performs.
Thanks for reading!
Quite an interesting question. There's a presentation that covered queues in detail. I can't find it at the moment. The premise is the same as this queues are leaky buckets
So what if I add more Leaky Buckets. We'll you've delayed the leaking, however it's now leaking into another bucket. Have you solved the problem or delayed it?
What if I vibrate the buckets at different frequencies?
Further reading:
operate lambda
message expiry
message timeout
DDIA / DDIA Online
SQS Performance
sqs failure modes
mvce is missing from this question so I cannot address the the precise problem you are having.
As for an opinion on which to choose for SQS and Lambda queue I'll point to the Meta on this
sqs faq mentions Kinesis streams
sqs sns kinesis comparison
TL;DR;
It depends
I think the biggest advantage of using your own queue is the fact that you as a user have visibility into the state of the your backpressure.
Using the Lambda async invoke method, you have the potential to get throttled exceptions with the 'guarantee' that lambda will retry over an interval. If using a SQS source queue instead, you have complete visibility into the state of your message processing at all times with no ambiguity.
Secondly regarding overpulling. In theory this is a concern but in practice its never happened to me. I've run applications requiring thousands of transactions per second and never once had problems with SQS -> Lambda. Obviously set your retry policy appropriately and use a DLQ as transient/unpredictable errors CAN occur.
My problem every 20minutes I want to execute the curl request which is around 25000 or more than that and save the curl response in database. In PHP it is not handled properly which is the best AWS services I can use except lambda.
A common technique for processing large number of similar calls is:
Create an Amazon Simple Queue Service (SQS) queue and push each request into the queue as a separate message. In your case, the message would contain the URL that you wish to retrieve.
Create an AWS Lambda function that performs the download and stores the data in the database.
Configure the Lambda function to trigger off the SQS queue
This way, the SQS queue can trigger hundreds of Lambda functions running parallel. The default concurrency limit is 1000 Lambda functions, but you can request for this to be increased.
You would then need a separate process that, every 20 minutes, queries the database for the URLs and pushes the messages into the SQS queue.
The complete process is:
Schedule -> Lambda pusher -> messages into SQS -> Lambda workers -> database
The beauty of this design is that it can scale to handle large workloads and operates in parallel, rather than each curl request having to wait. If a message cannot be processed, it Lambda will automatically try again. Repeated failures will send the message to a Dead Letter Queue for later analysis and reprocessing.
If you wish to perform 25,000 queries every 20 minutes (1200 seconds), this would need a query to complete every 0.05 seconds. That's why it is important to work in parallel.
By the way, if you are attempting to scrape this information from a single website, I suggest you investigate whether they provide an API otherwise you might be violating the Terms & Conditions of the website, which I strongly advise against.
I would like to send a push notification to users in my database in a lambda environment via SQS / messaging queue architecture, in order to do that
I would first need to query all users in my database with push notifications enabled.
loop over all of them them
send a SQS event/message for each user.
let my sqs triggered lambda handle/send the push notification
Is there a better way to implement this to avoid querying a big number of users and/or looping over all the results to send a SQS message for each?
I would take a slightly different approach here, but similar.
Query the database for the users
Loop over the users
Send one messages to SQS for a batch of records to send, and use the SendMessageBatch operation of SQS to send them. So batches of batches. Each batch of messages would have several "users" to send to, not just one. This will should increase your performance because a batch will require fewer lambda invocations.
Lambda handles SQS messages (probably more than one), and each SQS message results in sending many push notifications. In the case of Firebase I believe there is a way to send batches, which is even better. Even without that you can send several messages at once using a Promise.all type logic.
With this structure you can send a very large number of messages really quickly, and probably a lot cheaper. Imagine you need to send to 1M users. If you send batches of 100, in batches of 25 to SQS, then you have 2,500 messages per call to SQS. That would mean 400 calls to SQS, far better than even the 40K you'd have to make if you sent single messages in batches of 25.
On the receiving side, even if you throttled the SQS integration to 1 message per invocation you'd have 10,000 lambda invocations. If you assume even 1s per invocation, and 1000 concurrent invocations, it would take 10 seconds (likely less). If you send one message per user you'd have to make 1M lambda invocations. If you assume each invocation takes 100ms then you can send 10/second, so with 1000 concurrent executions it would take 100 seconds. In reality the numbers are probably even better than that for the batch version, especially if you don't limit it to 1 message at a time.
Edit
Based on the comments the question seemed to be a bit more about the first part of the process. With that in mind I'd suggest the following options.
If you find yourself needing to address the same large groups repeatedly most messaging services (Firebase and SNS for sure) support some sort of topic subscription model. Given that these are push notifications you can subscribe a device to the topic in code. What this ultimately leads to is one messages sent from your code to the messaging service. The service handles the rest. This is probably the preferred solution for anything that has mass recipients, especially if you can know the recipients up front. This even works for dynamic topics. For example, consider a situation where a person comments on a post. Any new comment on that post should send a message to everyone who has commented on that post. You can create a topic on the fly when the post is created, and add recipients to the topic as they comment. If a user wishes to stop receiving messages you can remove the user from the topic.
If you don't know the recipients up front the above solution is a solid solution. However, if you are concerned with Lambda timeouts on the first two steps I'd modify slightly. I would take advantage of AWS Step Functions and page the data in the lambda. Lambda will tell you, via the context object supplied in the invocation, how much time is remaining. You can check that periodically to determine if you should exit the lambda and pass to the step function the current paging information. The step function can pass that paging information back into the lambda, which should be coded to accept the paging information as part of the request, and continue from that point if supplied.
I would suggest an additional piece in your application architecture,
I personally prefer to avoid using the Primary database for heavy querying,
assuming you have a large user base.
I will suggest maintaining your user list in a Search Engine like ElasticSearch or CloudSearch, or a simple table with just the user list in AWS DynamoDb or create a Read Replica of your DB.
To no confuse you, use a Search Engine(first choice) or an AWS DynamoDb
This will avoid creating pressure on your database when you query the read specialty datastore and won't affect other modules in operation
And it's way fast to query this way
Step 2: loop over all of them them
Step 3: batch send messages to SQS using its SendMessageBatch method like Jason is suggesting
Step 4: Based on your SQS setting, you may process multiple messages on your Lambda function
I have around 3 AWS Lambda functions taking the following form:
Lambda function 1: Reads from an SQS queue and puts a Message on an SQS queue (the incoming and outgoing message formats are different)
Lambda function 2: Reads the message from Lambda function 1, and puts a Message on an SQS queue (the incoming and outgoing message formats are different)
Lambda function 3: Reads the message from Lambda function 3, and updates storage.
There are 3 queues involved and the message format (structure) in each queue is different, however they have one uniqueId which are the same and can be used the relate one to each other. So much question is, is there any way in SQS on or some other tool to track the messages, what I'm specifically looking at is stuff like:
Time the message was entered into the queue
Time the message was taken by the Lambda function for processing
My problem is that the 3 Lambda functions individually perform within a couple of milliseconds, but the time taken for end to end execution is way too long, I suspect that the messages are taking too long in transit.
I'm open for any other ideas on optimisation.
AWS Step Functions is specifically designed for passing information between Lambda functions and orchestrating the whole process.
However, you would need to change the way you have written your functions to take advantage of Step Functions.
Since your only real desire is to explore why it takes "way too long", then AWS X-Ray should be a good way to gather this information. It can track a single transaction end-to-end through each process. I think it's just a matter of including a library and activating X-Ray.
See: AWS Lambda and AWS X-Ray - AWS X-Ray
Or, just start doing some manual investigations in the log files. The log shows how long each function takes to run, so you should be able to identify whether the time taken is within a particular Lambda function, or whether the time is spent between functions, waiting for them to trigger.
I have one publisher and one subscriber for my SNS topic in AWS.
Suppose my subscriber is getting failed and exiting with a failure.
Will SNS repush those failed messages?
If not...
Is there another way to achieve that goal where my system starts processing from the last successful lambda execution?
There is a retry policy, but if your application already received the message, then no. If something goes wrong you won't see it again and since Lambdas don't carry state...You could be in trouble.
I might consider looking at SQS instead of SNS. Remember, messages in SQS won't be removed until you remove them and you can set a window of invisibility. Therefore, you can easily ensure the next Lambda execution picks up where things left off (depending on your settings). Each Lambda would then be responsible for removing that message from SQS and that's how you'd know the message was processed.
Without knowing more about your application and needs, I couldn't say for sure...But I would take a look at it. I've built a "taskmaster" Lambda before that ran on a schedule and read from an SQS queue (multiple queues actually - the scheduled job passed different JSON event based on which queue to read from). It would then pass the job off to the appropriate Lambda "worker" which would then remove that message. Should it stop working...Well, the invisibility period would timeout (and 5 minutes isn't bad here given that's all Lambdas can execute for) and the next Lambda would pick it up. The taskmaster then would run as often as needed and read as many jobs from the queue as necessary. This really helps you have complete control over at what rate you are processing things, how many times you are retrying things, etc. Then you can also make use of a dead-letter queue to catch anything that may have failed (also, think about sticking things back into the queue).
You have a LOT of flexibility with SQS that I'm not really sure you get with SNS to be honest. I was never fond of SNS, though it too has a place and time and so again without knowing more here I couldn't say if SQS would be the fit for you...But I think your concerns can be taken care of with SQS if it makes sense for your application.