Concurrency of Lambda not increasing as expected

Concurrency of Lambda not increasing as expected - amazon-web-services

We've configured an MSK (kafka) event source as the trigger for our Lambda function. Even though the offset lag is increasing the lambda concurrency is limited to 4-5 almost all the time as can be seen in the graph below. The configuration used for the MSK event source is:
Batch Size: 50
Batch window: 30 seconds
Number of partitions in the Kafka topic: 10
I made sure that the load is distributed equally across all the partitions. Is there anything I'm missing here which is causing the concurrency issue? Any solution is appreciated. Thanks in advance.

I think you are hitting the same limitation we found some months ago, this link led us in the right way (aka workaround in our case):
AWS MSK lambda concurrent consumers
It honestly makes sense that the partitions are not being used in all their capability because the jump from the msk EC2 setup to the lambda runtime is not something trivial. Maybe you can try other connectors.
https://docs.confluent.io/kafka-connectors/aws-lambda/current/overview.html#multiple-tasks
It also makes sense that bridging through Kinesis you would not have these specific issues as it is all Amazon native stuff.

Ideally concurrency should be a number with no decimals match the count of consumers count.
When you initially create an an Apache Kafka event source, Lambda allocates one consumer to process all partitions in the Kafka topic. Each consumer has multiple processors running in parallel to handle increased workloads. Additionally, Lambda automatically scales up or down the number of consumers, based on workload. To preserve message ordering in each partition, the maximum number of consumers is one consumer per partition in the topic.
source: https://docs.amazonaws.cn/en_us/lambda/latest/dg/with-kafka.html#services-kafka-scaling
The Offset Lag indicates performance issue for this blog gives better explanation Offset lag metric

Related

Patterns to write to DynamoDB from SQS queue with maximum throughput

I would like to set up a system that transfers data from an SQS queue to DynamoDB. Is there a mechanism to write at the approximate maximum throughput of the respective DynamoDB table if this is the only place that writes into that table avoiding throttling errors as much as possible?
I haven't seen such a pattern yet. If I have a lambda behind the SQS queue it is hard to measure how many writes are currently occuring because I have no control over the number of lambda instances. Then there might be temporary throughput limitations that need to be handled. The approach I have been thinking about is to have some sort of adaptive mechanism that lowers the write speed if throttling errors occur, possibly supported by real-time queries to CloudWatch to get the throughput in the last few seconds.
I have read the posts related to this topic here but didn't find a solution to this.
Thanks in advance

If I have a lambda behind the SQS queue it is hard to measure how many writes are currently occuring because I have no control over the number of lambda instances
Yes you do !
To me, lambda is definitely the way to go. You can set a maximum concurrency limit on every lambda function so that it does not fire too many parallel invocations. More details here
Also, unless you are doing some fine-tuned costs optimization, dynamoDB provides a on-demand feature where you don't have to care about provisioning (and therefore throttling) anymore. Using this feature could also guarantee that no throttling occurs.

Autoscale AWS Lambda concurrency based off throttling errors

I have a AWS Lambda function using an AWS SQS trigger to pull messages, process them with an AWS Comprehend endpoint, and put the output in AWS S3. The AWS Comprehend endpoint has a rate limit which goes up and down throughout the day based off something I can control. The fastest way to process my data, which also optimizes the costs I am paying for the AWS Comprehend endpoint to be up, is to set concurrency high enough that I get throttling errors returned from the api. This however comes with the caveat, that I am paying for more AWS Lambda invocations, the flip side being, that to optimize the costs I am paying for AWS Lambda, I want 0 throttling errors.
Is it possible to set up autoscaling for the concurrency limit of the lambda such that it will increase if it isn't getting any throttling errors, but decrease if it is getting too many?

Very interesting use case.
Let me start by pointing out something that I found out the hard way in an almost 4 hour long call with AWS Tech Support after being puzzled for a couple days.
With SQS acting as a trigger for AWS Lambda, the concurrency cannot go beyond 1K. Even if the concurrency of Lambda is set at a higher limit.
There is now a detailed post on this over at Knowledge Center.
With that out of the way and assuming you are under 1K limit at any given point in time and so only have to use one SQS queue, here is what I feel can be explored:
Either use an existing cloudwatch metric (via Comprehend) or publish a new metric that is indicative of the load that you can handle at any given point in time. you can then use this to set an appropriate concurrency limit for the lambda function. This would ensure that even if you have SQS queue flooded with messages to be processed, lambda picks them up at the rate at which it can actually be processed.
Please Note: This comes out of my own philosophy of being proactive vs being reactive. I would not wait for something to fail to trigger other processes eg invocation errors in this case to adjust concurrency. System failures should be rare and actually raise alarm (if not panic!) rather than being normal that occurs a couple of times a day !
To build up on that, if possible I would suggest that you approach this the other way around i.e. scale Comprehend processing limit and AWS Lambda concurrency based on the messages in the SQS queue (backlog) or a combination of this backlog and the time of the day etc. This way, if every part of your pipeline is a function of the amount of backlog in the Queue, you can be rest assured that you are not spending more than you have at any given point in time.
More importantly, you always have capacity in place should the need arise or something out of normal happens.

Limit concurrent invocation of a AWS Lambda triggered from AWS SQS (Reserved concurrency ignored)?

To me this seemed like a simple use case when I started, but it turned out a lot harder than I had anticipated.
Problem
I have an AWS SQS acting as a job queue that triggers a worker AWS Lambda. However since the worker lambdas are sharing non-scalable resources it is important to limit the number of concurrent running lambdas to (for the sake of example) no more than 5 lambdas running simultaneously.
Simple enough, according to Managing Concurrency for a Lambda Function
Reserved concurrency also limits the maximum concurrency for the
function, and applies to the function as a whole
However, setting the Reserved concurrency-property to 5 seems to be completely ignored by SQS, with the queue Messages in Flight-property in my case showing closer to 20-30 concurrent executions depending on the amount of messages put into the queue.
Question
The closest I have come to a solution is to use a SQS FIFO queue and setting the MessageGroupId to a value of either randomly selecting or alternating between 1-5. However, due to uneven workload this is not optimal as it would be better to have the concurrency distributed by actual workload rather than by chance.
I have also tried using the AWS Step Functions as the Map-state has a MaxConcurrency parameter, which seemed to work well on small job queues, but due to each state having an input/output limit of 32kb, this was not feasible in my use-case.
Has anyone found a better or alternative solution? Are there any other ways Reserved concurrency is supposed to be used?
Similar
Here are some similar questions I have found, but I think my question is different because I am not interested in limiting the total number of invocation, and (although I have not tried it myself) I can not see why triggers from S3 or Kinesis Steam would behave different from SQS.

According to AWS docs AWS SQS doesn't take into account reserved concurrency. If number of batches to be processed is greater than reserved concurrency, your messages might end up in a dead-letter queue:
If your function returns an error, or can't be invoked because it's at
maximum concurrency, processing might succeed with additional
attempts. To give messages a better chance to be processed before
sending them to the dead-letter queue, set the maxReceiveCount on the
source queue's redrive policy to at least 5.
https://docs.aws.amazon.com/lambda/latest/dg/with-sqs.html
You can check this article for details: https://zaccharles.medium.com/lambda-concurrency-limits-and-sqs-triggers-dont-mix-well-sometimes-eb23d90122e0

This issue is resolved today Jan 2023. You can use maximum concurrency as suggested in this blog . I was using FIFO with groupid as my backend was non-scalable and i wanted to not have any throttling issue as having too many messages on DLQ does not help.
https://aws.amazon.com/blogs/compute/introducing-maximum-concurrency-of-aws-lambda-functions-when-using-amazon-sqs-as-an-event-source/

AWS Lambda is seemingly not highly available when invoked from SNS

I am invoking a data processing lambda in bulk fashion by submitting ~5k sns requests in an asynchronous fashion. This causes all the requests to hit sns in a very short time. What I am noticing is that my lambda seems to have exactly 5k errors, and then seems to "wake up" and handle the load.
Am I doing something largely out of the ordinary use case here?
Is there any way to combat this?

I suspect it's a combination of concurrency, and the way lambda connects to SNS.
Lambda is only so good at automatically scaling up to deal with spikes in load.
Full details are here: (https://docs.aws.amazon.com/lambda/latest/dg/scaling.html), but the key points to note that
There's an account-wide concurrency limit, which you can ask to be
raised. By default it's much less than 5k, so that will limit how
concurrent your lambda could ever become.
There's a hard scaling limit (+1000 instances/minute), which means even if you've managed to convince AWS to let you have a concurrency limit of 30k, you'll have to be under sustained load for 30 minutes before you'll have that many lambdas going at once.
SNS is a non-stream-based asynchronous invocation (https://docs.aws.amazon.com/lambda/latest/dg/invoking-lambda-function.html#supported-event-source-sns) so what you see is a lot of errors as each SNS attempts to invoke 5k lambdas, but only the first X (say 1k) get through, but they keep retrying. The queue then clears concurrently at your initial burst (typically 1k, depending on your region), +1k a minute until your reach maximum capacity.
Note that SNS only retries three times at intervals (AWS is a bit sketchy about the intervals, but it is probably based on the retry: delay the service returns, so should be approximately intelligent); I suggest you setup a DLQ to make sure you're not dropping messages because the time for the queue to clear.
While your pattern is not a bad one, it seems like you're very exposed to the concurrency issues that surround lambda.
An alternative is to use a stream based event-source (like Kinesis), which processes in batches at a set concurrency (e.g. 500 records per lambda, concurrent by shard count, rather than 1:1 with SNS), and waits for each batch to finish before processing the next.

Why should I use Amazon Kinesis and not SNS-SQS?

I have a use case where there will be stream of data coming and I cannot consume it at the same pace and need a buffer. This can be solved using an SNS-SQS queue. I came to know the Kinesis solves the same purpose, so what is the difference? Why should I prefer (or should not prefer) Kinesis?

Keep in mind this answer was correct for Jun 2015
After studying the issue for a while, having the same question in mind, I found that SQS (with SNS) is preferred for most use cases unless the order of the messages is important to you (SQS doesn't guarantee FIFO on messages).
There are 2 main advantages for Kinesis:
you can read the same message from several applications
you can re-read messages in case you need to.
Both advantages can be achieved by using SNS as a fan out to SQS. That means that the producer of the message sends only one message to SNS, Then the SNS fans-out the message to multiple SQSs, one for each consumer application. In this way you can have as many consumers as you want without thinking about sharding capacity.
Moreover, we added one more SQS that is subscribed to the SNS that will hold messages for 14 days. In normal case no one reads from this SQS but in case of a bug that makes us want to rewind the data we can easily read all the messages from this SQS and re-send them to the SNS. While Kinesis only provides a 7 days retention.
In conclusion, SNS+SQSs is much easier and provides most capabilities. IMO you need a really strong case to choose Kinesis over it.

On the surface they are vaguely similar, but your use case will determine which tool is appropriate. IMO, if you can get by with SQS then you should - if it will do what you want, it will be simpler and cheaper, but here is a better explanation from the AWS FAQ which gives examples of appropriate use-cases for both tools to help you decide:
FAQ's

Semantics of these technologies are different because they were designed to support different scenarios:
SNS/SQS: the items in the stream are not related to each other
Kinesis: the items in the stream are related to each other
Let's understand the difference by example.
Suppose we have a stream of orders, for each order we need to reserve some stock and schedule a delivery. Once this is complete, we can safely remove the item from the stream and start processing the next order. We are fully done with the previous order before we start the next one.
Again, we have the same stream of orders, but now our goal is to group orders by destinations. Once we have, say, 10 orders to the same place, we want to deliver them together (delivery optimization). Now the story is different: when we get a new item from the stream, we cannot finish processing it; rather we "wait" for more items to come in order to meet our goal. Moreover, if the processor process crashes, we must "restore" the state (so no order will be lost).
Once processing of one item cannot be separated from processing another one, we must have Kinesis semantics in order to handle all the cases safely.

Kinesis support multiple consumers capabilities that means same data records can be processed at a same time or different time within 24 hrs at different consumers, similar behavior in SQS can be achieved by writing into multiple queues and consumers can read from multiple queues. However writing again into multiple queue will add sub seconds {few milliseconds} latency in system.
Second, Kinesis provides routing capability to selective route data records to different shards using partition key which can be processed by particular EC2 instances and can enable micro batch calculation {Counting & aggregation}.
Working on any AWS software is easy but with SQS is easiest one. With Kinesis, there is a need to provision enough shards ahead of time, dynamically increasing number of shards to manage spike load and decrease to save cost also required to manage. it's pain in Kinesis, No such things are required with SQS. SQS is infinitely scalable.

Excerpt from AWS Documentation:
We recommend Amazon Kinesis Streams for use cases with requirements that are similar to the following:
Routing related records to the same record processor (as in streaming MapReduce). For example, counting and aggregation are simpler when all records for a given key are routed to the same record processor.
Ordering of records. For example, you want to transfer log data from the application host to the processing/archival host while maintaining the order of log statements.
Ability for multiple applications to consume the same stream concurrently. For example, you have one application that updates a real-time dashboard and another that archives data to Amazon Redshift. You want both applications to consume data from the same stream concurrently and independently.
Ability to consume records in the same order a few hours later. For example, you have a billing application and an audit application that runs a few hours behind the billing application. Because Amazon Kinesis Streams stores data for up to 7 days, you can run the audit application up to 7 days behind the billing application.
We recommend Amazon SQS for use cases with requirements that are similar to the following:
Messaging semantics (such as message-level ack/fail) and visibility timeout. For example, you have a queue of work items and want to track the successful completion of each item independently. Amazon SQS tracks the ack/fail, so the application does not have to maintain a persistent checkpoint/cursor. Amazon SQS will delete acked messages and redeliver failed messages after a configured visibility timeout.
Individual message delay. For example, you have a job queue and need to schedule individual jobs with a delay. With Amazon SQS, you can configure individual messages to have a delay of up to 15 minutes.
Dynamically increasing concurrency/throughput at read time. For example, you have a work queue and want to add more readers until the backlog is cleared. With Amazon Kinesis Streams, you can scale up to a sufficient number of shards (note, however, that you'll need to provision enough shards ahead of time).
Leveraging Amazon SQS’s ability to scale transparently. For example, you buffer requests and the load changes as a result of occasional load spikes or the natural growth of your business. Because each buffered request can be processed independently, Amazon SQS can scale transparently to handle the load without any provisioning instructions from you.

The biggest advantage for me is the fact that Kinesis is a replayable queue, and SQS is not. So you can have multiple consumers of the same messages of Kinesis (or the same consumer at different times) where with SQS, once a message has been ack'd, it's gone from that queue.
SQS is better for worker queues because of that.

Another thing: Kinesis can trigger a Lambda, while SQS cannot. So with SQS you either have to provide an EC2 instance to process SQS messages (and deal with it if it fails), or you have to have a scheduled Lambda (which doesn't scale up or down - you get just one per minute).
Edit: This answer is no longer correct. SQS can directly trigger Lambda as of June 2018
https://docs.aws.amazon.com/lambda/latest/dg/with-sqs.html

The pricing models are different, so depending on your use case one or the other may be cheaper. Using the simplest case (not including SNS):
SQS charges per message (each 64 KB counts as one request).
Kinesis charges per shard per hour (1 shard can handle up to 1000 messages or 1 MB/second) and also for the amount of data you put in (every 25 KB).
Plugging in the current prices and not taking into account the free tier, if you send 1 GB of messages per day at the maximum message size, Kinesis will cost much more than SQS ($10.82/month for Kinesis vs. $0.20/month for SQS). But if you send 1 TB per day, Kinesis is somewhat cheaper ($158/month vs. $201/month for SQS).
Details: SQS charges $0.40 per million requests (64 KB each), so $0.00655 per GB. At 1 GB per day, this is just under $0.20 per month; at 1 TB per day, it comes to a little over $201 per month.
Kinesis charges $0.014 per million requests (25 KB each), so $0.00059 per GB. At 1 GB per day, this is less than $0.02 per month; at 1 TB per day, it is about $18 per month. However, Kinesis also charges $0.015 per shard-hour. You need at least 1 shard per 1 MB per second. At 1 GB per day, 1 shard will be plenty, so that will add another $0.36 per day, for a total cost of $10.82 per month. At 1 TB per day, you will need at least 13 shards, which adds another $4.68 per day, for a total cost of $158 per month.

Kinesis solves the problem of map part in a typical map-reduce scenario for streaming data. While SQS doesnt make sure of that. If you have streaming data that needs to be aggregated on a key, kinesis makes sure that all the data for that key goes to a specific shard and the shard can be consumed on a single host making the aggregation on key easier compared to SQS

Kinesis Use Cases
Log and Event Data Collection
Real-time Analytics
Mobile Data Capture
“Internet of Things” Data Feed
SQS Use Cases
Application integration
Decoupling microservices
Allocate tasks to multiple worker nodes
Decouple live user requests from intensive background work
Batch messages for future processing

I'll add one more thing nobody else has mentioned -- SQS is several orders of magnitude more expensive.

In very simple terms, and keeping costs out of the picture, the real intention of SNS-SQS are to make services loosely coupled. And this is only primary reason to use SQS where the order of the msgs are not so important and where you have more control of the messages. If you want a pattern of job queue using an SQS is again much better. Kinesis shouldn't be used be used in such cases because it is difficult to remove messages from kinesis because kinesis replays the whole batch on error. You can also use SQS as a dead letter queue for more control. With kinesis all these are possible but unheard of unless you are really critical of SQS.
If you want a nice partitioning then SQS won't be useful.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js