DynamoDB: when does ProvisionedThroughputExceededException get raised?

I'm using the AWS Java SDK in an Apache Spark job to populate a DynamoDB table with data extracted from S3. The Spark job just writes data using single PutItem requests at a very intense rate (three m3.xlarge nodes used only to write) and without any retry policy.
The DynamoDB docs state that the AWS SDK has a backoff policy, but that eventually, if the rate is too high, ProvisionedThroughputExceededException can be raised. My Spark job ran for three days and was constrained only by the DynamoDB throughput (equal to 500 units), so I expect the request rate was extremely high and the queue was extremely long, yet I didn't see any signs of thrown exceptions or lost data.
So, my question is: when is it actually possible to get an exception when writing to DynamoDB at a very high rate?

You can also get a throughput exception if you have a hot partition. Because throughput is divided between partitions, each partition has a lower limit than the total provisioned throughput, so if you write to the same partition often, you can hit the limit even if you are not using your full provisioned throughput.
Another thing to consider is that DynamoDB accumulates unused throughput and uses it as burst capacity, making extra throughput available for a short duration if you briefly go above your limit.
Edit: DynamoDB now has an adaptive capacity feature, which somewhat solves the problem of hot partitions by redistributing total throughput unequally across partitions.
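The question uses the AWS Java SDK, but the behaviour is easiest to illustrate with a short boto3 sketch (table name, item, and retry counts are hypothetical): the SDK retries throttled writes with exponential backoff, and ProvisionedThroughputExceededException only reaches your code once those retries are exhausted.

```python
import boto3
from botocore.config import Config
from botocore.exceptions import ClientError

# The SDK retries throttled requests with backoff; the exception only
# surfaces once max_attempts is exhausted. Table name is hypothetical.
dynamodb = boto3.client(
    "dynamodb",
    config=Config(retries={"max_attempts": 10, "mode": "adaptive"}),
)

try:
    dynamodb.put_item(
        TableName="my-table",
        Item={"pk": {"S": "user#123"}, "payload": {"S": "example"}},
    )
except ClientError as e:
    if e.response["Error"]["Code"] == "ProvisionedThroughputExceededException":
        # All SDK retries were throttled; back off further or buffer the write.
        raise
```

With retries disabled (as in the question), the exception would instead surface on the first throttled request.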

Related

Why does DynamoDB set read/write capacity for an on-demand table?

I created an on-demand DynamoDB table, and as far as I know, DynamoDB automatically scales read/write capacity in on-demand mode.
But the AWS Glue job gives an error: "An error occurred while calling o201.pyWriteDynamicFrame. DynamoDB write exceeds max retry 10" because of the write capacity. How is this possible if the table is in on-demand mode? I didn't set any read/write capacity, and the table isn't even in provisioned mode.
DynamoDB table:
AWS Glue job output:
DynamoDB table throttled:
Thanks.
Here is what you need to know about On-Demand mode tables
On Demand
If you recently switched an existing table to on-demand capacity mode for the first time, or if you created a new table with on-demand capacity mode enabled, the table has the following previous peak settings, even though the table has not served traffic previously using on-demand capacity mode:
Following are examples of possible scenarios.
A provisioned table configured as 100 WCU and 100 RCU. When this table is switched to on-demand for the first time, DynamoDB will ensure it is scaled out to instantly sustain at least 4,000 write units/sec and 12,000 read units/sec.
A provisioned table configured as 8,000 WCU and 24,000 RCU. When this table is switched to on-demand, it will continue to be able to sustain at least 8,000 write units/sec and 24,000 read units/sec at any time.
A provisioned table configured with 8,000 WCU and 24,000 RCU, that consumed 6,000 write units/sec and 18,000 read units/sec for a sustained period. When this table is switched to on-demand, it will continue to be able to sustain at least 8,000 write units/sec and 24,000 read units/sec. The previous traffic may further allow the table to sustain much higher levels of traffic without throttling.
A table previously provisioned with 10,000 WCU and 10,000 RCU, but currently provisioned with 10 RCU and 10 WCU. When this table is switched to on-demand, it will be able to sustain at least 10,000 write units/sec and 10,000 read units/sec.
Important
If you need more than double your previous peak on a table, DynamoDB automatically allocates more capacity as your traffic volume increases to help ensure that your workload does not experience throttling. However, throttling can occur if you exceed double your previous peak within 30 minutes. For example, if your application's traffic pattern varies between 25,000 and 50,000 strongly consistent reads per second, where 50,000 reads per second is the previously reached traffic peak, DynamoDB recommends spacing your traffic growth over at least 30 minutes before driving more than 100,000 reads per second.
The information above is taken directly from the AWS docs (src).
Glue Workers
Now, when you begin to write using AWS Glue, you will very quickly exceed the 4,000 WCU limit, which means you have broken the rule of not exceeding double your previous peak (4,000) within 30 minutes. So what now?
Pre-warming your table
DynamoDB provides you capacity in the form of partitions, where each partition is capable of providing 1,000 WCU and 3,000 RCU. DynamoDB only ever scales partitions out; it never merges them back in.
For that reason, we can "pre-warm" our DynamoDB tables by creating them in provisioned mode and allocating our peak WCU. For example, let's imagine we expect Glue to consume 40,000 WCU; then we make sure our table can handle that by following these steps (a boto3 sketch follows below):
Create table in provisioned mode
No Autoscaling
40,000 WCU
40,000 RCU
When table is marked as Active (1-2 mins)
Switch capacity mode to On-Demand
Now you have a new DynamoDB table in on-demand mode which is capable of providing 40,000 WCU out of the gate, not the 4,000 WCU provided by default. This will eliminate throttling from Glue.
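A minimal boto3 sketch of the pre-warming steps above; the table name and key schema are hypothetical, and the 40,000-unit figures are the ones used in the example:

```python
import boto3

dynamodb = boto3.client("dynamodb")
table_name = "my-glue-target"  # hypothetical table name

# 1. Create the table in provisioned mode at the expected peak throughput.
dynamodb.create_table(
    TableName=table_name,
    AttributeDefinitions=[{"AttributeName": "pk", "AttributeType": "S"}],
    KeySchema=[{"AttributeName": "pk", "KeyType": "HASH"}],
    ProvisionedThroughput={"ReadCapacityUnits": 40000, "WriteCapacityUnits": 40000},
)

# 2. Wait until the table is ACTIVE (usually 1-2 minutes).
dynamodb.get_waiter("table_exists").wait(TableName=table_name)

# 3. Switch to on-demand; the partitions created above are retained.
dynamodb.update_table(TableName=table_name, BillingMode="PAY_PER_REQUEST")
```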
DynamoDB sets read/write capacity for its on-demand tables in order to balance performance and cost. The read/write capacity units determine the rate at which DynamoDB can read and write data to the table, with a larger number of units allowing for a higher rate of read/write operations. By setting these values, users can control the performance of their DynamoDB table and ensure that it meets the demands of their application. Additionally, setting the capacity units helps DynamoDB automatically manage the distribution of data and traffic, ensuring low latency and high reliability.

DynamoDB throttling and hot partition?

I have a DynamoDB table that has provisioned read capacity well above the consumed read capacity; the utilization percentage is 70%.
I'm still getting throttling on the table and I couldn't figure out why. One thing I'm suspecting is a hot partition, but I'm unable to verify it.
In the case of a hot partition, does it throttle only reads to the hot partition, or reads to all partitions?
A single partition can serve 1,000 WCU and 3,000 RCU. If you consume more capacity than that on a single partition, you will be throttled.
If you have a number of keys sharing the same partition, then throttling will affect all keys sharing that partition. DynamoDB does have an adaptive capacity feature that can isolate hot keys, but some throttling may occur before that kicks in.
Two things you can try (a boto3 sketch follows the list):
Enable CloudWatch contributor insights to understand which keys are throttling allowing you to make necessary changes.
Scale the table up to a large WCU manually; once complete, re-enable your autoscaling as before. This will give you more partitions in the backend, which may reduce your throttling by spreading keys across partitions.
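A hedged boto3 sketch of both suggestions; the table name and capacity figures are placeholders, and the scale-up step assumes the table is currently in provisioned mode:

```python
import boto3

dynamodb = boto3.client("dynamodb")
table_name = "my-table"  # placeholder

# 1. Enable CloudWatch Contributor Insights to see which keys are throttled.
dynamodb.update_contributor_insights(
    TableName=table_name,
    ContributorInsightsAction="ENABLE",
)

# 2. Temporarily scale the table to a large WCU to force more partitions.
#    10,000 WCU / 3,000 RCU are example values, not recommendations.
dynamodb.update_table(
    TableName=table_name,
    ProvisionedThroughput={"ReadCapacityUnits": 3000, "WriteCapacityUnits": 10000},
)
# Once the update completes, re-enable autoscaling as before.
```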

How do DynamoDB consumed capacity units in on-demand mode compare to provisioned capacity?

Right now I am using on-demand mode for my DynamoDB tables, as I didn't know how much data to expect. But now that the application has run a while, I can see the metrics for ConsumedReadCapacityUnits and ConsumedWriteCapacityUnits for my tables in CloudWatch.
In on-demand mode I pay per request, whereas in provisioned capacity mode I have to pay for the provisioned capacity. If I simply take the metrics for (max) consumed capacity units and compare the prices of those in provisioned capacity mode to my current costs, I believe provisioned capacity mode would be a lot cheaper for me.
My question is, can I simply take the metrics and take the max (plus some buffer) of the consumed capacity units and configure them as provisioned capacity, or is that an error in reasoning on my part?
There are two other things you need to consider:
How 'bursty' is your throughput?
Are you using SDKs to connect to your database?
Setting your provision to the maximum throughput you ever see will ensure you don't get throttled requests, but you will probably be setting the provision too high. DynamoDB can actually consume more provision than you have set, using burst capacity. This accommodates short bursts of high throughput over the space of 5 minutes. If you see sustained peaks, for example your database is busy during the day but not at night, you might consider setting your tables to autoscale. In this case you can set the provisioned throughput lower, and DynamoDB will automatically scale up the provision as required. Note that autoscaling is good for workloads that vary over the course of hours (e.g. for handling daily peak hours). It's not good for reacting to events that occur in less than about 30 minutes.
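If you go the autoscaling route, a minimal sketch using the Application Auto Scaling API is below; the table name, the 50–1,000 capacity range, and the 70% target are illustrative assumptions:

```python
import boto3

autoscaling = boto3.client("application-autoscaling")
resource_id = "table/my-table"  # placeholder table name

# Register the table's write capacity as a scalable target.
autoscaling.register_scalable_target(
    ServiceNamespace="dynamodb",
    ResourceId=resource_id,
    ScalableDimension="dynamodb:table:WriteCapacityUnits",
    MinCapacity=50,
    MaxCapacity=1000,
)

# Track ~70% utilization; capacity is added or removed to stay near it.
autoscaling.put_scaling_policy(
    PolicyName="write-target-tracking",
    ServiceNamespace="dynamodb",
    ResourceId=resource_id,
    ScalableDimension="dynamodb:table:WriteCapacityUnits",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "DynamoDBWriteCapacityUtilization"
        },
    },
)
```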
If you are using the official SDKs, they will handle throttle responses and retry any failed requests. This gives DynamoDB some time to scale without your application failing requests.

Avoid throttle dynamoDB

I am new to cloud computing, but I had a question about whether a mechanism like the one I am about to describe exists or is possible to create.
DynamoDB has provisioned throughput (e.g. 100 writes/second). Of course, in a real-world application the actual throughput is very dynamic and will almost never be exactly your provisioned amount of 100 writes/second. I was thinking it would be great to have some type of queue for DynamoDB. For example, my table during peak hours may receive 500 write requests per second (5 times what I have allocated), which would return errors. Is there some queue I can put between the client and the database, so that client requests go to the queue, the client gets an acknowledgement that its request has been handled, and the queue then feeds requests to DynamoDB at a rate of exactly 100 writes per second? That way no errors are returned and I don't need to raise the throughput, which would raise my costs.
Putting AWS SQS in front of DynamoDB would solve this problem for you, and it is not an uncommon design pattern. SQS is already well suited to scale as big as it needs to and to ingest a large volume of messages with unpredictable flow patterns.
You could either put all the messages into SQS first, or use SQS as an overflow buffer when you exceed the designed throughput of your DynamoDB table.
One or more worker instances can then read messages from the SQS queue and put them into DynamoDB at exactly the pace you decide.
If the order of the messages coming in is extremely important, Kinesis is another option for you to ingest the incoming messages and then insert them into DynamoDB, in the same order they arrived, at a pace you define.
IMO, SQS will be easier to work with, but Kinesis will give you more flexibility if your needs are more complicated.
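A rough sketch of such a worker loop, assuming a hypothetical queue URL and table name and simple sleep-based pacing at 100 writes per second (a real worker would batch writes and handle partial failures):

```python
import time
import boto3

sqs = boto3.client("sqs")
dynamodb = boto3.client("dynamodb")

queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/write-buffer"  # placeholder
table_name = "my-table"  # placeholder
writes_per_second = 100  # match your provisioned WCU

while True:
    # Long-poll the queue for up to 10 messages at a time.
    resp = sqs.receive_message(
        QueueUrl=queue_url, MaxNumberOfMessages=10, WaitTimeSeconds=20
    )
    for msg in resp.get("Messages", []):
        dynamodb.put_item(
            TableName=table_name,
            Item={"pk": {"S": msg["MessageId"]}, "body": {"S": msg["Body"]}},
        )
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
        time.sleep(1.0 / writes_per_second)  # crude pacing at ~100 writes/sec
```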
This cannot be accomplished using DynamoDB alone. DynamoDB is designed for uniform, scalable, predictable workloads. If you want to put a queue in front of DynamoDB, you have to do that yourself.
DynamoDB does have a little tolerance in the form of burst capacity, but that is not for sustained use. You should read the best practices section Consider Workload Uniformity When Adjusting Provisioned Throughput, but here are a few paragraphs that I think are important, with a few things emphasized by me:
For applications that are designed for use with uniform workloads, DynamoDB's partition allocation activity is not noticeable. A temporary non-uniformity in a workload can generally be absorbed by the bursting allowance, as described in Use Burst Capacity Sparingly. However, if your application must accommodate non-uniform workloads on a regular basis, you should design your table with DynamoDB's partitioning behavior in mind (see Understand Partition Behavior), and be mindful when increasing and decreasing provisioned throughput on that table.
If you reduce the amount of provisioned throughput for your table, DynamoDB will not decrease the number of partitions. Suppose that you created a table with a much larger amount of provisioned throughput than your application actually needed, and then decreased the provisioned throughput later. In this scenario, the provisioned throughput per partition would be less than it would have been if you had initially created the table with less throughput.
There are tools that help with auto-scaling DynamoDB, such as sebdah/dynamic-dynamodb which may be worth looking into.
An update for those seeing this recently: to handle bursty workloads without capacity planning, DynamoDB launched the On-Demand capacity mode in 2018.
You don't need to decide on the capacity beforehand; it scales read and write capacity to follow the demand.
See:
https://aws.amazon.com/blogs/aws/amazon-dynamodb-on-demand-no-capacity-planning-and-pay-per-request-pricing/

When does DynamoDB throttle requests?

In the answer to "How is Amazon DynamoDB throughput calculated and limited?" it is suggested that DynamoDB throttles requests whenever you exceed the provisioned throughput on a per-second basis. However, this contradicts my experience.
I have a table to which I post multiple rows, often with the number of rows far exceeding the provisioned write capacity. This happens in short bursts. At one point I even got a 5-minute average above the provisioned capacity. On the other hand, the 15-minute average is below capacity. I didn't get any throttled requests in that period.
The 5-minute average peaks at 8.053 with a provisioned capacity of 6:
The 15-minute average peaks well below the provisioned capacity:
So when does DynamoDB throttle requests? What kind of average does it take into account? How far above the provisioned capacity can a burst go before it gets throttled?
DynamoDB is designed to ensure that your provisioned capacity is available on a per-second basis. If you provision a table for ten 1kB reads per second then DynamoDB will give you enough capacity to handle that throughput rate. In addition, DynamoDB will sometimes allow you to achieve limited bursting above your provisioned throughput for a short period of time. This is intended to absorb natural variations in customer workloads. This bursting is not guaranteed and it is not always available (and the nature of the available bursting may change over time). As is currently described in the best practices documentation, in order to get the best performance you should have an evenly distributed workload that does not exceed your provisioned capacity and distributes the load evenly over the key space. However, if the reality of production behavior for your application deviates from an evenly distributed workload then DynamoDB may absorb some of the bursts.
As for how much to provision your table, it depends a lot on your workload. You could start with provisioning to something like 80% of your peaks and then adjust your table capacity depending on how many throttles you receive (which you can see in your CloudWatch graphs) and your application’s tolerance for latency induced by retries. Keep in mind that DynamoDB does not allow unlimited bursts above your provisioned capacity. You may be able to absorb short bursts but you cannot sustain a throughput rate above your provisioned capacity level for an extended period of time. The general guidance we can give is to provision for something close to your peaks and then dial down while watching for throttles.
This answer was posted in AWS forums
Disclaimer: I work for Amazon, DynamoDB team.
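As one concrete way to watch for throttles, the per-table WriteThrottleEvents metric in CloudWatch can be queried; a minimal boto3 sketch with a placeholder table name and time window:

```python
from datetime import datetime, timedelta
import boto3

cloudwatch = boto3.client("cloudwatch")

# Sum of throttled write events per 5-minute period over the last day.
resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/DynamoDB",
    MetricName="WriteThrottleEvents",
    Dimensions=[{"Name": "TableName", "Value": "my-table"}],  # placeholder table
    StartTime=datetime.utcnow() - timedelta(days=1),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Sum"],
)

for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Sum"])
```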
There's a hint in the DynamoDB documentation that explains how bursting works:
When you are not fully utilizing a partition's throughput, DynamoDB retains a portion of your unused capacity for later bursts of throughput usage. DynamoDB currently retains up to five minutes (300 seconds) of unused read and write capacity.
But it also says that you cannot rely on this behavior:
However, do not design your application so that it depends on burst capacity being available at all times: DynamoDB can and does use burst capacity for background maintenance and other tasks without prior notice.
At least that would explain why it was possible to have a 5-minute average above the provisioned capacity. With the explanation above, it would even be possible for 15-minute averages (or longer timespans) to be above the provisioned capacity, if you had a spike at the very beginning of the interval and low usage within the 300 seconds before the start of the interval.
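As a rough worked example under the quoted 300-second rule (the numbers are illustrative, taken from the question): a table provisioned at 6 WCU that has been largely idle can bank up to 6 × 300 = 1,800 unused write units. Spending that bucket on top of the ongoing 6 units/second means a 5-minute window could absorb up to (1,800 + 6 × 300) / 300 = 12 writes/second on average without throttling, which is consistent with the observed 5-minute peak of about 8 against a provisioned capacity of 6.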
DynamoDB provides some flexibility in your per-partition throughput provisioning by providing burst capacity. Whenever you're not fully using a partition's throughput, DynamoDB reserves a portion of that unused capacity for later bursts of throughput to handle usage spikes.
DynamoDB currently retains up to 5 minutes (300 seconds) of unused read and write capacity. During an occasional burst of read or write activity, these extra capacity units can be consumed quickly—even faster than the per-second provisioned throughput capacity that you've defined for your table.
DynamoDB can also consume burst capacity for background maintenance and other tasks without prior notice.