Patterns to write to DynamoDB from SQS queue with maximum throughput - amazon-web-services

I would like to set up a system that transfers data from an SQS queue to DynamoDB. Is there a mechanism to write at approximately the maximum throughput of the target DynamoDB table, assuming this is the only writer to that table, while avoiding throttling errors as much as possible?
I haven't seen such a pattern yet. If I have a Lambda behind the SQS queue, it is hard to measure how many writes are currently occurring because I have no control over the number of Lambda instances. There might also be temporary throughput limitations that need to be handled. The approach I have been considering is some sort of adaptive mechanism that lowers the write speed when throttling errors occur, possibly supported by real-time queries to CloudWatch to get the throughput over the last few seconds.
I have read the posts related to this topic here but didn't find a solution to this.
Thanks in advance
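The adaptive mechanism the question describes could look like an additive-increase / multiplicative-decrease (AIMD) controller: back off sharply when DynamoDB throttles, then creep back up while writes succeed. This is a minimal sketch with illustrative, untuned numbers; the class name and rates are assumptions, not an existing library.

```python
class AdaptiveWriteRate:
    """AIMD rate controller for a DynamoDB writer (hypothetical sketch).

    Call on_throttle() when a write raises a throttling error and
    on_success() after each successful write; sleep for delay()
    between writes to hold the current target rate.
    """

    def __init__(self, initial_rate=50.0, min_rate=1.0, max_rate=1000.0):
        self.rate = initial_rate      # target writes per second
        self.min_rate = min_rate
        self.max_rate = max_rate

    def on_success(self):
        # Additive increase: gently probe for more capacity.
        self.rate = min(self.max_rate, self.rate + 1.0)

    def on_throttle(self):
        # Multiplicative decrease: halve the rate on a throttling error.
        self.rate = max(self.min_rate, self.rate * 0.5)

    def delay(self):
        # Inter-write pause that holds the current rate.
        return 1.0 / self.rate
```

In a Lambda consumer you would call `on_throttle()` when `put_item` raises a `ProvisionedThroughputExceededException` and `on_success()` otherwise, which reacts faster than polling CloudWatch for recent throughput.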

If I have a Lambda behind the SQS queue, it is hard to measure how many writes are currently occurring because I have no control over the number of Lambda instances
Yes, you do!
To me, Lambda is definitely the way to go. You can set a maximum concurrency limit on every Lambda function so that it does not fire too many parallel invocations. More details here
Also, unless you are doing some fine-tuned cost optimization, DynamoDB provides an on-demand feature where you no longer have to care about provisioning (and therefore throttling). Using this feature should also keep throttling at bay in most cases.
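Capping a function's parallel invocations is a single Lambda API call, `put_function_concurrency`. A minimal sketch, with the client injected so it can be exercised without AWS credentials (the function name below is hypothetical):

```python
def cap_lambda_concurrency(lambda_client, function_name, limit):
    """Set a reserved concurrency limit on a Lambda function.

    lambda_client would be boto3.client("lambda") in real use; it is
    passed in here so the sketch stays testable offline.
    """
    return lambda_client.put_function_concurrency(
        FunctionName=function_name,
        ReservedConcurrentExecutions=limit,
    )
```

Note that reserved concurrency also subtracts from the account-wide concurrency pool available to other functions.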

Related

Autoscale AWS Lambda concurrency based off throttling errors

I have an AWS Lambda function using an AWS SQS trigger to pull messages, process them with an AWS Comprehend endpoint, and put the output in AWS S3. The AWS Comprehend endpoint has a rate limit that goes up and down throughout the day based on something I can control. The fastest way to process my data, which also optimizes the cost of keeping the AWS Comprehend endpoint up, is to set the concurrency high enough that I get throttling errors back from the API. The caveat is that I then pay for more AWS Lambda invocations; on the flip side, to optimize what I pay for AWS Lambda, I want zero throttling errors.
Is it possible to set up autoscaling for the concurrency limit of the lambda such that it will increase if it isn't getting any throttling errors, but decrease if it is getting too many?
Very interesting use case.
Let me start by pointing out something I found out the hard way, in an almost four-hour call with AWS Tech Support after being puzzled for a couple of days.
With SQS acting as a trigger for AWS Lambda, the concurrency cannot go beyond 1K, even if the concurrency of the Lambda function is set to a higher limit.
There is now a detailed post on this over at Knowledge Center.
With that out of the way, and assuming you stay under the 1K limit at any given point in time and therefore only need one SQS queue, here is what I feel can be explored:
Either use an existing CloudWatch metric (via Comprehend) or publish a new metric that is indicative of the load you can handle at any given point in time. You can then use this to set an appropriate concurrency limit for the Lambda function. This ensures that even if the SQS queue is flooded with messages, Lambda picks them up at the rate at which they can actually be processed.
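The metric-to-concurrency translation above can be sketched as a small sizing function. All numbers (records per invocation, invocation duration, the 900 ceiling) are illustrative assumptions you would replace with measurements from your own pipeline:

```python
import math

def concurrency_for_load(records_per_minute, records_per_invocation=10,
                         seconds_per_invocation=5, floor=1, ceiling=900):
    """Translate a published load metric into a Lambda concurrency limit.

    records_per_minute is the load your downstream (e.g. Comprehend)
    can currently absorb; the defaults are hypothetical, not measured.
    """
    invocations_per_minute = records_per_minute / records_per_invocation
    # Each concurrent instance completes this many invocations per minute.
    per_instance = 60 / seconds_per_invocation
    needed = math.ceil(invocations_per_minute / per_instance)
    # Clamp to a sane range so a zero metric never stalls the pipeline
    # and a spike never exceeds the account's usable concurrency.
    return max(floor, min(ceiling, needed))
```

A scheduled Lambda could read the metric, call this function, and apply the result with `put_function_concurrency`, closing the loop proactively rather than reacting to errors.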
Please note: this comes from my own philosophy of being proactive rather than reactive. I would not wait for something to fail (e.g., invocation errors in this case) before adjusting concurrency. System failures should be rare and should actually raise an alarm (if not panic!), rather than being a normal occurrence a couple of times a day.
To build on that, if possible I would suggest approaching this the other way around, i.e., scale the Comprehend processing limit and AWS Lambda concurrency based on the number of messages in the SQS queue (the backlog), or a combination of this backlog and the time of day, etc. This way, if every part of your pipeline is a function of the backlog in the queue, you can rest assured that you are not spending more than you need to at any given point in time.
More importantly, you always have capacity in place should the need arise or something out of the ordinary happen.

Limit concurrent invocation of a AWS Lambda triggered from AWS SQS (Reserved concurrency ignored)?

To me this seemed like a simple use case when I started, but it turned out a lot harder than I had anticipated.
Problem
I have an AWS SQS queue acting as a job queue that triggers a worker AWS Lambda. However, since the worker Lambdas share non-scalable resources, it is important to limit the number of concurrently running Lambdas to (for the sake of example) no more than 5 running simultaneously.
Simple enough, according to Managing Concurrency for a Lambda Function:
Reserved concurrency also limits the maximum concurrency for the function, and applies to the function as a whole
However, setting the Reserved concurrency property to 5 seems to be completely ignored by SQS, with the queue's Messages in Flight property in my case showing closer to 20-30 concurrent executions depending on the number of messages put into the queue.
Question
The closest I have come to a solution is to use an SQS FIFO queue and set the MessageGroupId to a value chosen randomly, or by alternating, between 1-5. However, due to uneven workload this is not optimal, as it would be better to have the concurrency distributed by actual workload rather than by chance.
I have also tried using AWS Step Functions, as the Map state has a MaxConcurrency parameter, which seemed to work well on small job queues; but because each state has an input/output limit of 32 KB, this was not feasible in my use case.
Has anyone found a better or alternative solution? Are there any other ways Reserved concurrency is supposed to be used?
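The FIFO MessageGroupId workaround mentioned above can be made deterministic rather than random: cycle through a fixed set of group IDs so messages spread evenly. Since a FIFO queue delivers at most one in-flight batch per message group, the number of groups effectively caps consumer concurrency. A sketch, with the SQS client injected (the queue URL is a placeholder):

```python
import itertools

def fifo_group_sender(sqs_client, queue_url, group_count=5):
    """Return a send function that round-robins messages across
    group_count MessageGroupIds on a FIFO queue.

    sqs_client would be boto3.client("sqs") in real use.
    """
    groups = itertools.cycle(str(i) for i in range(group_count))

    def send(body):
        return sqs_client.send_message(
            QueueUrl=queue_url,
            MessageBody=body,
            MessageGroupId=next(groups),
        )

    return send
```

This still suffers from the uneven-workload problem the question raises: a slow message blocks everything behind it in the same group.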
Similar
Here are some similar questions I have found, but I think my question is different because I am not interested in limiting the total number of invocations, and (although I have not tried it myself) I cannot see why triggers from S3 or a Kinesis Stream would behave differently from SQS.
According to the AWS docs, AWS SQS doesn't take reserved concurrency into account. If the number of batches to be processed is greater than the reserved concurrency, your messages might end up in a dead-letter queue:
If your function returns an error, or can't be invoked because it's at maximum concurrency, processing might succeed with additional attempts. To give messages a better chance to be processed before sending them to the dead-letter queue, set the maxReceiveCount on the source queue's redrive policy to at least 5.
https://docs.aws.amazon.com/lambda/latest/dg/with-sqs.html
You can check this article for details: https://zaccharles.medium.com/lambda-concurrency-limits-and-sqs-triggers-dont-mix-well-sometimes-eb23d90122e0
This issue is resolved as of January 2023: you can use maximum concurrency as suggested in this blog. I was using FIFO with a group ID because my backend was non-scalable and I wanted to avoid throttling issues, since having too many messages land in the DLQ does not help.
https://aws.amazon.com/blogs/compute/introducing-maximum-concurrency-of-aws-lambda-functions-when-using-amazon-sqs-as-an-event-source/
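The maximum-concurrency feature from that blog post is configured on the SQS event source mapping via its `ScalingConfig` (valid range 2-1000), so it caps concurrency without reserving it away from other functions. A sketch with the client injected (the mapping UUID is a placeholder):

```python
def set_sqs_max_concurrency(lambda_client, mapping_uuid, max_concurrency):
    """Cap how many concurrent Lambda invocations an SQS event source
    mapping may drive, via its ScalingConfig.

    lambda_client would be boto3.client("lambda") in real use;
    MaximumConcurrency must be between 2 and 1000.
    """
    if not 2 <= max_concurrency <= 1000:
        raise ValueError("MaximumConcurrency must be between 2 and 1000")
    return lambda_client.update_event_source_mapping(
        UUID=mapping_uuid,
        ScalingConfig={"MaximumConcurrency": max_concurrency},
    )
```

Unlike reserved concurrency, excess messages simply stay in the queue instead of failing invocations and drifting toward the DLQ.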

When should I use auto-scaling and when to use SQS?

I was studying DynamoDB and am stuck on a question for which I can't find any common solution.
My question is: if I have an application with DynamoDB as the database, with an initial write capacity of 100 writes per second, and there is heavy load during peak hours, say 300 writes per second, which service should I use to reduce the load on the database?
My take is that we should go for auto-scaling, but somewhere I read that we can use SQS to queue the data, and Kinesis as well if the order of the data matters.
In the old days, before DynamoDB Auto-Scaling, a common use pattern was:
The application attempts to write to DynamoDB
If the request is throttled, the application stores the information in an Amazon SQS queue
A separate process regularly checks the SQS queue and attempts to write the data to DynamoDB. If successful, it removes the message from SQS
This allowed DynamoDB to be provisioned for average workload rather than peak workload. However, it has more parts that need to be managed.
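The three steps above can be sketched as a write-with-overflow helper. The table and queue handles are injected so the sketch runs without AWS; with boto3 they would be a DynamoDB Table resource and an SQS client, and the exception check would match the error code inside a `botocore.exceptions.ClientError`:

```python
import json

def write_or_buffer(table, queue, item,
                    throttle_errors=("ProvisionedThroughputExceededException",)):
    """Try a direct DynamoDB write; on throttling, buffer the item in SQS
    for a separate drainer process to retry later.
    """
    try:
        table.put_item(Item=item)
        return "written"
    except Exception as exc:
        # Only swallow throttling errors; anything else is a real bug
        # and should propagate to the caller.
        if type(exc).__name__ not in throttle_errors:
            raise
        queue.send_message(MessageBody=json.dumps(item))
        return "buffered"
```

The drainer process then retries buffered items at a controlled pace and deletes each SQS message only after the DynamoDB write succeeds.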
These days, DynamoDB can use adaptive capacity and burst capacity to handle temporary changes. For larger changes, you can implement DynamoDB Auto Scaling, which is probably easier to implement than the SQS method.
The best solution depends on the characteristics of your application. Can it tolerate asynchronous database writes? Can it tolerate any throttling on database writes?
If you can handle some throttling from DynamoDB when there’s a sudden increase in traffic, you should use DynamoDB autoscaling.
If throttling is not okay, but asynchronous writes are okay, then you could use SQS in front of DynamoDB to manage bursts in traffic. In this case, you should still have autoscaling enabled to ensure that your queue workers have enough throughput available to them.
If you must have synchronous writes and you cannot tolerate any throttling from DynamoDB, you should use DynamoDB's on-demand mode. (However, note that there can still be throttling if you exceed 1k WCU or 3k RCU for a single partition key.)
Of course, cost is also a consideration. Using DynamoDB with autoscaling will be the most cost-effective method; I'm not sure how on-demand compares to the cost of using SQS.

AWS Lambda is seemingly not highly available when invoked from SNS

I am invoking a data-processing Lambda in bulk by submitting ~5k SNS requests asynchronously. This causes all the requests to hit SNS in a very short time. What I am noticing is that my Lambda seems to log exactly 5k errors and then seems to "wake up" and handle the load.
Am I doing something largely out of the ordinary use case here?
Is there any way to combat this?
I suspect it's a combination of concurrency, and the way lambda connects to SNS.
Lambda is only so good at automatically scaling up to deal with spikes in load.
Full details are here (https://docs.aws.amazon.com/lambda/latest/dg/scaling.html), but the key points to note are:
There's an account-wide concurrency limit, which you can ask to be raised. By default it's much less than 5k, so that will limit how concurrent your Lambda could ever become.
There's a hard scaling limit (+1,000 instances/minute), which means that even if you've convinced AWS to give you a concurrency limit of 30k, you'll have to be under sustained load for about 30 minutes before you have that many Lambdas going at once.
SNS is a non-stream-based asynchronous invocation source (https://docs.aws.amazon.com/lambda/latest/dg/invoking-lambda-function.html#supported-event-source-sns), so what you see is a lot of errors as SNS attempts to invoke 5k Lambdas but only the first X (say 1k) get through; the rest keep retrying. The backlog then clears concurrently at your initial burst limit (typically 1k, depending on your region), plus 1k a minute, until you reach maximum capacity.
Note that SNS only retries three times at intervals (AWS is a bit sketchy about the intervals, but they are probably based on the retry delay the service returns, so should be approximately intelligent); I suggest you set up a DLQ to make sure you're not dropping messages during the time it takes the queue to clear.
While your pattern is not a bad one, it seems like you're very exposed to the concurrency issues that surround lambda.
An alternative is to use a stream-based event source (like Kinesis), which processes in batches at a set concurrency (e.g., 500 records per Lambda, concurrent by shard count rather than 1:1 with SNS), and waits for each batch to finish before processing the next.

Avoid throttle dynamoDB

I am new to cloud computing, but I have a question: does a mechanism like the one I am about to describe exist, or is it possible to create?
DynamoDB has provisioned throughput (e.g., 100 writes/second). Of course, in a real-world application the actual throughput is very dynamic and will almost never match your provisioned 100 writes/second. What I was thinking would be great is some type of queue for DynamoDB. For example, during peak hours my DynamoDB table may receive 500 write requests per second (5 times what I have allocated) and would return errors. Is there some queue I can put between the client and the database, so that client requests go to the queue, the client is acknowledged, and the queue then feeds requests to DynamoDB at exactly 100 writes per second? That way no errors are returned, and I don't need to raise the throughput, which would raise my costs.
Putting AWS SQS in front of DynamoDB would solve this problem for you, and is not an uncommon design pattern. SQS is already well suited to scale as big as it needs to and to ingest a large number of messages with unpredictable flow patterns.
You could either put all the messages into SQS first, or use SQS as an overflow buffer when you exceed the design throughput of your DynamoDB database.
One or more worker instances can then read messages from the SQS queue and put them into DynamoDB at exactly the pace you decide.
If the order of the incoming messages is extremely important, Kinesis is another option: ingest the incoming messages and then insert them into DynamoDB, in the same order they arrived, at a pace you define.
IMO, SQS will be easier to work with, but Kinesis will give you more flexibility if your needs are more complicated.
This cannot be accomplished using DynamoDB alone. DynamoDB is designed for uniform, scalable, predictable workloads. If you want to put a queue in front of DynamoDB, you have to do that yourself.
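A paced worker loop for this pattern can be sketched with injected callables, so it carries no AWS dependency: `receive()` would wrap `sqs.receive_message` (returning an empty list when the queue is drained) and `write(msg)` would wrap a DynamoDB `put_item` plus the SQS delete. Both names are assumptions of this sketch:

```python
import time

def drain_queue(receive, write, rate_per_second,
                clock=time.monotonic, sleep=time.sleep):
    """Move messages from a queue into a table at a fixed pace.

    rate_per_second is the provisioned write capacity you want to
    stay under; clock and sleep are injectable for testing.
    Returns the number of messages written.
    """
    interval = 1.0 / rate_per_second
    written = 0
    while True:
        batch = receive()
        if not batch:
            return written
        for msg in batch:
            start = clock()
            write(msg)
            written += 1
            # Pace writes so the average rate never exceeds the cap.
            elapsed = clock() - start
            if elapsed < interval:
                sleep(interval - elapsed)
```

In production this loop would run forever (or until a Lambda timeout) rather than exiting on an empty batch, and each message would be deleted from SQS only after its write succeeds.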
DynamoDB does have a little tolerance for burst capacity, but that is not for sustained use. You should read the best practices section Consider Workload Uniformity When Adjusting Provisioned Throughput, but here are a few, what I think are important, paragraphs with a few things emphasized by me:
For applications that are designed for use with uniform workloads, DynamoDB's partition allocation activity is not noticeable. A temporary non-uniformity in a workload can generally be absorbed by the bursting allowance, as described in Use Burst Capacity Sparingly. However, if your application must accommodate non-uniform workloads on a regular basis, you should design your table with DynamoDB's partitioning behavior in mind (see Understand Partition Behavior), and be mindful when increasing and decreasing provisioned throughput on that table.
If you reduce the amount of provisioned throughput for your table, DynamoDB will not decrease the number of partitions. Suppose that you created a table with a much larger amount of provisioned throughput than your application actually needed, and then decreased the provisioned throughput later. In this scenario, the provisioned throughput per partition would be less than it would have been if you had initially created the table with less throughput.
There are tools that help with auto-scaling DynamoDB, such as sebdah/dynamic-dynamodb which may be worth looking into.
One update for those seeing this recently: to handle bursty workloads, DynamoDB launched the On-Demand capacity mode in 2018.
You don't need to decide on capacity beforehand; it scales read and write capacity to follow demand.
See:
https://aws.amazon.com/blogs/aws/amazon-dynamodb-on-demand-no-capacity-planning-and-pay-per-request-pricing/
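Switching an existing table to on-demand is a single `update_table` call setting the billing mode to `PAY_PER_REQUEST`. A sketch with the client injected (the table name is a placeholder; note AWS allows switching capacity modes at most once per 24 hours per table):

```python
def enable_on_demand(dynamodb_client, table_name):
    """Switch a DynamoDB table to on-demand (pay-per-request) capacity.

    dynamodb_client would be boto3.client("dynamodb") in real use.
    """
    return dynamodb_client.update_table(
        TableName=table_name,
        BillingMode="PAY_PER_REQUEST",
    )
```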