Avoid throttling DynamoDB - amazon-web-services

I am new to cloud computing, but I have a question: does a mechanism like the one I am about to describe exist, or is it possible to create one?
DynamoDB has provisioned throughput (e.g. 100 writes/second). Of course, in a real-world application the actual throughput is very dynamic and will almost never match your provisioned 100 writes/second. I was thinking that some type of queue for DynamoDB would be great. For example, during peak hours my DynamoDB table may receive 500 write requests per second (5 times what I have allocated) and would return errors. Is there some queue I can put between the client and the database, so that client requests go to the queue, the client gets an acknowledgement that its request has been handled, and the queue then feeds requests to DynamoDB at exactly 100 writes per second? That way no errors are returned, and I don't need to raise the throughput, which would raise my costs.

Putting AWS SQS in front of DynamoDB would solve this problem for you, and is not an uncommon design pattern. SQS is already well suited to scale as big as it needs to, and to ingest a large volume of messages with unpredictable flow patterns.
You could either put all the messages into SQS first, or use SQS as an overflow buffer when you exceed the designed throughput of your DynamoDB table.
One or more worker instances can then read messages from the SQS queue and put them into DynamoDB at exactly the pace you decide.
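As an illustration, here is a minimal Python (boto3) sketch of such a worker. The queue URL, table name, message shape, and write rate are placeholder assumptions, and the sleep-based rate limiting is deliberately crude:

```python
# Sketch of a worker that drains an SQS queue into DynamoDB at a fixed pace.
# Queue URL, table name ("Events"), and message format are illustrative assumptions.
import json
import time

import boto3

sqs = boto3.client("sqs")
table = boto3.resource("dynamodb").Table("Events")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/ingest-queue"
WRITES_PER_SECOND = 100  # match the table's provisioned write capacity


def drain_queue():
    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL,
            MaxNumberOfMessages=10,  # SQS returns at most 10 messages per call
            WaitTimeSeconds=20,      # long polling to reduce empty receives
        )
        for msg in resp.get("Messages", []):
            item = json.loads(msg["Body"])
            table.put_item(Item=item)
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
            time.sleep(1.0 / WRITES_PER_SECOND)  # crude rate limit: ~100 writes/second


if __name__ == "__main__":
    drain_queue()
```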
If the order of the messages coming in is extremely important, Kinesis is another option for you to ingest the incoming messages and then insert them into DynamoDB, in the same order they arrived, at a pace you define.
IMO, SQS will be easier to work with, but Kinesis will give you more flexibility if your needs are more complicated.

This cannot be accomplished using DynamoDB alone. DynamoDB is designed for uniform, scalable, predictable workloads. If you want to put a queue in front of DynamoDB, you have to do that yourself.
DynamoDB does have a little tolerance in the form of burst capacity, but that is not meant for sustained use. You should read the best practices section Consider Workload Uniformity When Adjusting Provisioned Throughput; here are a few paragraphs that I think are important, with a few things emphasized by me:
For applications that are designed for use with uniform workloads, DynamoDB's partition allocation activity is not noticeable. A temporary non-uniformity in a workload can generally be absorbed by the bursting allowance, as described in Use Burst Capacity Sparingly. However, if your application must accommodate non-uniform workloads on a regular basis, you should design your table with DynamoDB's partitioning behavior in mind (see Understand Partition Behavior), and be mindful when increasing and decreasing provisioned throughput on that table.
If you reduce the amount of provisioned throughput for your table, DynamoDB will not decrease the number of partitions. Suppose that you created a table with a much larger amount of provisioned throughput than your application actually needed, and then decreased the provisioned throughput later. In this scenario, the provisioned throughput per partition would be less than it would have been if you had initially created the table with less throughput.
There are tools that help with auto-scaling DynamoDB, such as sebdah/dynamic-dynamodb which may be worth looking into.

One update for those seeing this recently: to handle bursty workloads, DynamoDB launched the On-Demand capacity mode in 2018.
You don't need to decide on the capacity beforehand; it scales read and write capacity to follow demand.
See:
https://aws.amazon.com/blogs/aws/amazon-dynamodb-on-demand-no-capacity-planning-and-pay-per-request-pricing/
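For reference, here is a minimal Python (boto3) sketch of switching an existing table to on-demand mode; the table name is a placeholder:

```python
# Hypothetical example: switching an existing table to on-demand (pay-per-request) mode.
# "MyTable" is an illustrative table name.
import boto3

dynamodb = boto3.client("dynamodb")
dynamodb.update_table(
    TableName="MyTable",
    BillingMode="PAY_PER_REQUEST",  # on-demand mode; no RCU/WCU provisioning needed
)
```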

Related

Patterns to write to DynamoDB from SQS queue with maximum throughput

I would like to set up a system that transfers data from an SQS queue to DynamoDB. Is there a mechanism to write at approximately the maximum throughput of the respective DynamoDB table, avoiding throttling errors as much as possible, given that this is the only place that writes into that table?
I haven't seen such a pattern yet. If I have a Lambda behind the SQS queue, it is hard to measure how many writes are currently occurring because I have no control over the number of Lambda instances. There might also be temporary throughput limitations that need to be handled. The approach I have been thinking about is to have some sort of adaptive mechanism that lowers the write speed if throttling errors occur, possibly supported by real-time queries to CloudWatch to get the throughput of the last few seconds.
I have read the posts related to this topic here but didn't find a solution to this.
Thanks in advance
If I have a Lambda behind the SQS queue it is hard to measure how many writes are currently occurring because I have no control over the number of Lambda instances
Yes, you do!
To me, Lambda is definitely the way to go. You can set a maximum concurrency limit on every Lambda function so that it does not fire too many parallel invocations. More details here.
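A minimal Python (boto3) sketch of capping that concurrency; the function name and the limit of 5 are illustrative assumptions:

```python
# Sketch: capping parallel invocations of the queue-consuming Lambda function.
# "sqs-to-dynamodb-writer" and the limit of 5 are illustrative values.
import boto3

lambda_client = boto3.client("lambda")
lambda_client.put_function_concurrency(
    FunctionName="sqs-to-dynamodb-writer",
    ReservedConcurrentExecutions=5,  # at most 5 instances write to the table at once
)
```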
Also, unless you are doing some fine-tuned cost optimization, DynamoDB provides an on-demand feature where you don't have to care about provisioning (and therefore throttling) anymore. Using this feature could also guarantee that no throttling occurs.

When should I use auto-scaling and when to use SQS?

I was studying DynamoDB and I am stuck on a question for which I can't find any common solution.
My question is: if I have an application with DynamoDB as the database, with an initial write capacity of 100 writes per second, and there is heavy load during peak hours of, say, 300 writes per second, which service should I use to reduce the load on the database?
My take is that we should go for auto-scaling, but I have also read that we can use SQS to queue the data, or Kinesis if the order of the data is important.
In the old days, before DynamoDB Auto-Scaling, a common use pattern was:
* The application attempts to write to DynamoDB
* If the request is throttled, the application stores the information in an Amazon SQS queue
* A separate process regularly checks the SQS queue and attempts to write the data to DynamoDB. If successful, it removes the message from SQS
This allowed DynamoDB to be provisioned for average workload rather than peak workload. However, it has more parts that need to be managed.
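A minimal Python (boto3) sketch of the overflow step in that old pattern; the table name and queue URL are illustrative assumptions:

```python
# Sketch of the old pattern above: try the write, and on throttling
# park the item in SQS for a background process to retry later.
# Table name and queue URL are placeholders.
import json

import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("Orders")
sqs = boto3.client("sqs")
OVERFLOW_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/overflow-queue"


def write_with_overflow(item):
    try:
        table.put_item(Item=item)
    except ClientError as err:
        if err.response["Error"]["Code"] == "ProvisionedThroughputExceededException":
            # Throttled: buffer the item in SQS instead of failing the request.
            sqs.send_message(QueueUrl=OVERFLOW_QUEUE_URL, MessageBody=json.dumps(item))
        else:
            raise
```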
These days, DynamoDB can use adaptive capacity and burst capacity to handle temporary changes. For larger changes, you can implement DynamoDB Auto Scaling, which is probably easier to implement than the SQS method.
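If you go the Auto Scaling route, here is a minimal Python (boto3) sketch using the Application Auto Scaling API; the table name, capacity bounds, and target utilization are illustrative assumptions:

```python
# Sketch: enabling DynamoDB Auto Scaling on a table's write capacity.
# Table name "Orders", min/max capacity, and 70% target are illustrative values.
import boto3

autoscaling = boto3.client("application-autoscaling")

autoscaling.register_scalable_target(
    ServiceNamespace="dynamodb",
    ResourceId="table/Orders",
    ScalableDimension="dynamodb:table:WriteCapacityUnits",
    MinCapacity=100,
    MaxCapacity=500,
)

autoscaling.put_scaling_policy(
    PolicyName="orders-write-scaling",
    ServiceNamespace="dynamodb",
    ResourceId="table/Orders",
    ScalableDimension="dynamodb:table:WriteCapacityUnits",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,  # aim for ~70% utilization of provisioned writes
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "DynamoDBWriteCapacityUtilization"
        },
    },
)
```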
The best solution depends on the characteristics of your application. Can it tolerate asynchronous database writes? Can it tolerate any throttling on database writes?
If you can handle some throttling from DynamoDB when there’s a sudden increase in traffic, you should use DynamoDB autoscaling.
If throttling is not okay, but asynchronous writes are okay, then you could use SQS in front of DynamoDB to manage bursts in traffic. In this case, you should still have autoscaling enabled to ensure that your queue workers have enough throughput available to them.
If you must have synchronous writes and you can not tolerate any throttling from DynamoDB, you should use DynamoDB’s on demand mode. (However, do note that there can still be throttling if you exceed 1k WCU or 3k RCU for a single partition key.)
Of course cost is also a consideration. Using DynamoDB with autoscaling will be the most cost effective method. I’m not sure how On Demand compares to the cost of using SQS.

How to compute initial Auto-scaling limits for DynamoDb table

Our table has bursty writes, expected once a week. We have auto-scaling enabled, with a provisioned capacity of 5 WCUs and 70% target utilization. This suffices for our off-peak (non-bursty) traffic. However, during the bursty writes, the WCUs reach around 1.5-2k, which leads to a lot of throttled writes and ultimately write failures as well.
1) Is auto-scaling suitable for such a use case?
2) If yes, what should our initial provisioned capacity be?
This answer will tell you why auto-scaling is not working for you:
https://stackoverflow.com/a/53005089/4985580
This answer will tell you how you can configure your SDK to retry operations over a much longer period (and therefore stop your operation failures during peak requests):
What should be done when the provisioned throughput is exceeded?
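Assuming a Python (boto3) client, a minimal sketch of stretching the SDK's retry behaviour looks like this; the retry count is an illustrative value:

```python
# Sketch: configuring boto3 to retry throttled operations more times,
# so short bursts are absorbed by client-side backoff rather than surfacing as errors.
import boto3
from botocore.config import Config

retry_config = Config(
    retries={
        "max_attempts": 10,  # more attempts means a longer total backoff window
        "mode": "adaptive",  # client-side rate limiting on top of exponential backoff
    }
)

dynamodb = boto3.client("dynamodb", config=retry_config)
```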
Ultimately you should probably move your tables to on-demand.
For tables using on-demand mode, DynamoDB instantly accommodates customers' workloads as they ramp up or down to any previously observed traffic level. If the level of traffic hits a new peak, DynamoDB adapts rapidly to accommodate the workload.
No, auto-scaling is not suitable for your needs. It takes a few minutes to scale up, and it does that by increasing your current capacity by a fixed percentage each time. There is also a limited number of times it can scale up or down per day, so you can't get from 5 to 2,000 in a matter of minutes. You may not even get there in a matter of hours.
I'd suggest trying on-demand mode, or manually setting capacity to 2,000 some time before you actually need it (it doesn't really scale instantly).
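A minimal Python (boto3) sketch of that manual bump, e.g. from a scheduled job; the table name and capacity values are illustrative, and this assumes auto-scaling is not fighting the change:

```python
# Sketch: manually raising write capacity ahead of the known weekly burst,
# then lowering it again afterwards. Table name and numbers are illustrative.
import boto3

dynamodb = boto3.client("dynamodb")


def set_write_capacity(table_name, wcu, rcu=5):
    dynamodb.update_table(
        TableName=table_name,
        ProvisionedThroughput={
            "ReadCapacityUnits": rcu,
            "WriteCapacityUnits": wcu,
        },
    )


# Run shortly before the expected burst...
set_write_capacity("WeeklyImports", wcu=2000)
# ...and scale back down once the burst is over, e.g.:
# set_write_capacity("WeeklyImports", wcu=5)
```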
I strongly advise reading the ENTIRE DynamoDB documentation with regard to best practices for primary keys, GSIs and data architecture. Depending on the size of your table (larger than 10 GB), the 2,000 units may get spread across partitions and you could potentially still have throttled requests.

Hot partition problem in DynamoDB gone with the new on-demand feature?

I read the following announcement with great interest.
https://aws.amazon.com/about-aws/whats-new/2018/11/announcing-amazon-dynamodb-on-demand/
The new "on-demand" feature really helps with capacity planning. Reading the documentation, I can't really see if they do some "magic" to resolve the problem of hot partitions, and partition key distribution.
Is partition key design just as important if you provision a table "on-demand"?
Yes, partition key design is just as important. That aspect has not changed.
Since you mentioned adaptive capacity in a comment, one thing to make sure is clear: once it is on for a table, it is on, and DynamoDB is monitoring your table.
There are two features at play here:
* On-demand capacity mode
* Adaptive capacity
On-demand capacity mode allows you to pay for each request to DynamoDB instead of provisioning a particular amount of RCUs/WCUs (which is called provisioned capacity). The benefit is that you only pay for what you use (and not for what you provision), but the downside is that if you receive a constant flow of requests, you would end up paying more than if you had provisioned the right amount of RCUs/WCUs. The on-demand capacity mode is the best fit for spiky traffic, while the provisioned mode is better for applications with a constant, predictable stream of requests.
Adaptive capacity is a different feature, and it works with either on-demand or provisioned capacity mode. It allows a table to "borrow" unused capacity from other partitions if one of your partitions receives a higher share of requests. It used to take some time for adaptive capacity to kick in, but as of now, adaptive capacity is enabled immediately.
Even with adaptive capacity, a good key design is still important. Adaptive capacity only helps in cases where it is hard to achieve a balanced distribution of requests among partitions. A single partition in DynamoDB can only handle up to 3K RCUs and 1K WCUs, so if a single partition receives more than that, requests will be throttled even with adaptive capacity. You have to design your keys to avoid this scenario.
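One common key-design technique for this is write sharding. A minimal Python (boto3) sketch, where the table name, key format, and shard count are illustrative assumptions:

```python
# Sketch of write sharding: spreading writes for one logical key across several
# partition keys by appending a random suffix, so no single partition exceeds the
# per-partition limits (~1K WCU / 3K RCU). Names and shard count are illustrative.
import random

import boto3

table = boto3.resource("dynamodb").Table("Events")
NUM_SHARDS = 10


def put_event(customer_id, event):
    shard = random.randint(0, NUM_SHARDS - 1)
    table.put_item(
        Item={
            "pk": f"{customer_id}#{shard}",  # e.g. "CUST42#7"
            **event,
        }
    )

# Note: reads then have to query all NUM_SHARDS keys and merge the results.
```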
As of 5/2019 the answer to this question has changed. I'd like to preface my answer by saying I have not validated this against a production workload. Also my answer assumes you, like the OP, are using on-demand pricing.
First, a general understanding of how DynamoDB (DDB) adaptive capacity works can be gleaned by reading this. Adaptive capacity is on by default. In short, when a hot partition exceeds its throughput capacity, DDB "moves the rudder" and instantly increases throughput capacity on the partition.
Before 5/2019 you'd get 300 seconds of instant burst capacity, then you'd be throttled until adaptive capacity fully "kicked in" (5-30 minutes).
On 5/23/2019 AWS announced that adaptive capacity is now instant. This means no more 5-30 minute wait.
Does this mean if you use DDB on-demand pricing, your hot partition problems go away? Yes and no.
Yes, in that you should not get throttled.
No, in that your bank account will take the hit. In order to not get throttled, DDB will scale up on-demand (now instantly). You will pay for the RCUs and WCUs needed to handle the throughput.
So the moral is, you still have to think about hot partitions and design your application to avoid them as much as possible. You won't be paying for it in downtime/unhappy customers, you'll be paying for it out of profits.
@Glenn First of all, thank you for the great question. After some research, I have reached the conclusion that the hot-partition problem is still important, but only for 5-30 minutes: as soon as DynamoDB detects that you have hot partitions, it uses mechanisms like adaptive capacity and automatic resharding. DynamoDB has improved a lot since its launch, and AWS now handles hot partitions with something called automatic resharding. I think automatic resharding works in both the on-demand and provisioned models, but I could not find any proof of that; I will update the answer as soon as I find it. For reference, you can watch this keynote:
AWS re:Invent 2018 keynote

DynamoDB: when is ProvisionedThroughputExceededException raised

I'm using the AWS Java SDK in an Apache Spark job to populate a DynamoDB table with data extracted from S3. The Spark job just writes data using single PutItem calls, with a very intense flow (three m3.xlarge nodes used only for writing) and without any retry policy.
The DynamoDB docs state that the AWS SDK has a backoff policy, but that eventually, if the rate is too high, a ProvisionedThroughputExceededException can be raised. My Spark job ran for three days and was constrained only by DynamoDB throughput (equal to 500 units), so I expect the rate was extremely high and the queue was extremely long; however, I didn't see any signs of thrown exceptions or lost data.
So, my question is: when is it possible to get an exception when writing to DynamoDB at a very high rate?
You can also get a throughput exception if you have a hot partition. Because throughput is divided between partitions, each partition has a lower limit than the total provisioned throughput, so if you write to the same partition often, you can hit that limit even if you are not using the full provisioned throughput.
Another thing to consider is that DynamoDB accumulates unused throughput and uses it to burst above your provisioned throughput for a short duration if you exceed your limit briefly.
Edit: DynamoDB now has a new adaptive capacity feature, which somewhat solves the problem of hot partitions by redistributing total throughput unequally across partitions.