DynamoDB fill empty table with tons of data capped at 1000 WCU - amazon-web-services

I'm writing a script that should fill a new table with data as quickly as possible (a ~650 GB table).
The partition (hash) key is different for every record, so I can't imagine a better key.
I've set the provisioned WCU for this table to 4,000.
While the script runs, 16 independent threads put different data into the table at a high rate. During execution I receive ProvisionedThroughputExceededException, and the CloudWatch graphs show that consumed WCU is capped at 1,000.
That could happen if all data is being put into one partition.
As I understand it, DynamoDB creates a new partition when the data size exceeds the 10 GB limit. Is that so?
So during this data-fill operation I have only 1 partition, and the 1,000 WCU cap is understandable.
I've checked the https://aws.amazon.com/ru/premiumsupport/knowledge-center/dynamodb-table-throttled/
But it seems those suggestions apply to tables that are already filled, to which you are trying to add a lot of new data.
So I have 3 questions:
1. How can I speed up the process of inserting data into the new empty table?
2. When does DynamoDB decide to create a new partition?
3. Can I set a minimum number of partitions (e.g. 4) to use the full provisioned WCU (4,000)?
UPD: CloudWatch graph:
UPD2: the HASH key is a long number. It is actually not strictly unique, but at most 2 rows share the same HASH key (with different RANGE keys).

You can't manually specify the number of partitions used by DDB. It's automatically handled behind the scenes.
However, the way it's handled is laid out in the link provided by F_SO_K.
1 for every 10 GB of data
1 for every 3,000 RCU and/or 1,000 WCU provisioned.
If you've provisioned 4,000 WCU, then you should have at least 4 partitions and you should be seeing 4,000 WCU consumed. Especially given that you said your hash key is unique for every record, your data should be spread out uniformly, so you should not be running into a "hot" partition.
You mentioned CloudWatch showing consumed WCU at 1,000; does CloudWatch also show provisioned capacity at 4,000 WCU?
If so, I'm not sure what's going on; you may have to call AWS.
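To make that arithmetic concrete, here is a quick back-of-the-envelope check in Python (assuming the per-partition rules quoted above, which are an approximation of DynamoDB's internal behaviour rather than a documented guarantee):

```python
# Per-partition write limit quoted above (approximate internals).
MAX_WCU_PER_PARTITION = 1_000

provisioned_wcu = 4_000                                               # from the question
expected_partitions = provisioned_wcu // MAX_WCU_PER_PARTITION        # 4
expected_write_ceiling = expected_partitions * MAX_WCU_PER_PARTITION  # 4,000 WCU

# The observed ceiling of ~1,000 WCU is exactly one partition's worth,
# which is why the symptoms look like a single hot partition.
print(expected_partitions, expected_write_ceiling)                    # 4 4000
```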

Related

How does partition capacity limit relate to table's total capacity in DynamoDB?

In a DynamoDB table, each partition is subject to a hard limit of 1,000 write capacity units and 3,000 read capacity units. What I don't understand is how these limits relate to the table's total RCU/WCU.
For example, if I configure a table's RCU to 6,000 and WCU to 3,000, is this capacity used evenly by all partitions in the table? Or do all partitions fight for the total capacity?
I can't find a way to know how many partitions the DynamoDB table is using. Is there a metric to tell me that?
The single-partition limit will only matter if your workload is so terribly imbalanced that a significant percentage of requests go to the same partition. In a better-designed data model, you have a large number of different partition keys, which allows DynamoDB to use a large number of different partitions, so you never see a significant percentage of your requests going to the same partition.
That does not mean, however, that the load on all partitions is equal. It might very well be that one partition sees twice the number of requests of another. A few years ago this meant your performance suffered: DynamoDB split the provisioned capacity (RCU/WCU) equally between partitions, so as the busier partition got throttled sooner, the total capacity you got from DynamoDB was less than what you paid for. However, they fixed this a few years ago with what they call adaptive capacity: DynamoDB now detects when your workload's total throughput is under what you paid for, and increases the capacity limits on individual partitions.
For example, if you provision 10,000 RCU and DynamoDB divides your data into 10 partitions, each of those starts out with 1,000 RCU. However, if one partition gets double the requests of the others, the workload achieves only 1,000 + 9*500 = 5,500 RCU, significantly less than the 10,000 you are paying for. DynamoDB quickly recognizes this and increases the busy partition's limit from 1,000 to 1,818 RCU - and now the total performance is 1,818 + 9*909 = 9,999 RCU. DynamoDB does this automatically for you - you don't need to do anything special. All you need to do is make sure that your workload has enough different partition keys, and that no significant percentage of requests goes to one specific partition key - otherwise DynamoDB will not be able to achieve high total RCU; it will always be limited by that single-partition limit of 3,000.
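That worked example can be reproduced with a few lines of Python (a minimal sketch of the scenario just described: 10 partitions, one of them receiving twice the traffic of each of the others):

```python
provisioned_rcu = 10_000
partitions = 10
even_share = provisioned_rcu / partitions             # 1,000 RCU per partition

# Without adaptive capacity: the busy partition is capped at its even share,
# and the nine quieter partitions run at half that rate.
before = even_share + 9 * (even_share / 2)             # 1,000 + 9*500 = 5,500 RCU

# With adaptive capacity: the busy partition's limit is raised until the
# whole workload fits, i.e. busy + 9*(busy/2) == provisioned_rcu.
busy = provisioned_rcu / 5.5                           # ~1,818 RCU
after = busy + 9 * (busy / 2)                          # ~1,818 + 9*909 ≈ 10,000 RCU

print(round(before), round(busy), round(after))        # 5500 1818 10000
```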
Regarding your last question, I don't know if there is such a metric (maybe another responder will know), but the important thing to check is that you have a lot of partition keys. If that's the case, and your workload doesn't access one specific key for a large percentage of the requests, you should be safe.

AWS DynamoDB: What does the graph imply? What needs to be done? A few of my batchwrite (delete) requests failed

Can somebody tell me what needs to be done?
I'm facing a few issues when I have 1000+ events.
A few of them are not getting deleted after my process runs.
I'm doing a batch delete through BatchWriteItem.
Each partition on a DynamoDB table is subject to a hard limit of 1,000 write capacity units and 3,000 read capacity units. If your workload is unevenly distributed across partitions, or if the workload relies on short periods of time with high usage (a burst of read or write activity), the table might be throttled.
It seems you are relying on DynamoDB adaptive capacity, which automatically boosts throughput capacity for high-traffic partitions. However, each partition is still subject to the hard limit. This means that adaptive capacity can't solve larger issues with your table or partition design. To avoid hot partitions and throttling, optimize your table and partition structure.
https://aws.amazon.com/premiumsupport/knowledge-center/dynamodb-table-throttled/
One way to better distribute writes across a partition key space in Amazon DynamoDB is to expand the space. You can do this in several different ways: you can add a random number to the partition key values to distribute the items among partitions, or you can use a number that is calculated based on something that you're querying on.
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-partition-key-sharding.html
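As an illustration of that suffix-sharding idea, here is a minimal Python sketch (the table name, key names and shard count are hypothetical, and boto3 is assumed as the client library):

```python
import random
import zlib

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("events")          # hypothetical table name

NUM_SHARDS = 10                           # hypothetical shard count

def put_event_random_shard(event_id, payload):
    """Spread writes by appending a random shard suffix to the partition key."""
    shard = random.randint(1, NUM_SHARDS)
    table.put_item(Item={
        "pk": f"{event_id}#{shard}",      # e.g. "order-123#7"
        "payload": payload,
    })

def put_event_calculated_shard(event_id, payload):
    """Use a deterministic suffix instead, so the item can be read back
    without querying every shard."""
    shard = zlib.crc32(event_id.encode()) % NUM_SHARDS + 1
    table.put_item(Item={
        "pk": f"{event_id}#{shard}",
        "payload": payload,
    })
```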

Dynamically add shards to DynamoDB and remap old data

I know that DynamoDB supports shards. I wanted to know whether it is possible to add shards dynamically.
Suppose I provisioned 4 shards and the shard key is customerID.
Now, in the future, I want to provision 6 more shards; is it possible to add them?
If we can add 6 more shards, how will the old data get remapped to the new shards, and will availability or consistency take a hit?
For remapping, my guess is that they must be using consistent hashing.
No, there is no way to manually provision as many partitions as you want.
The number of DynamoDB partitions is decided by specific criteria.
These are the criteria:
Partitions by capacity = (RCUs / 3000) + (WCUs / 1000)
This depends on how much capacity you provision for the table.
Partitions by size = TableSizeInGB / 10
This depends on how large the table is.
Total partitions = take the larger of partitions-by-capacity and partitions-by-size, and round it up to an integer.
For more information, I recommend you read the post.
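A minimal Python sketch of that calculation (treating the formulas above as an approximation of DynamoDB's internal behaviour, not a documented guarantee):

```python
import math

def estimate_partitions(rcu, wcu, table_size_gb):
    """Estimate the DynamoDB partition count from the rules quoted above."""
    by_capacity = rcu / 3000 + wcu / 1000
    by_size = table_size_gb / 10
    return math.ceil(max(by_capacity, by_size))

# Example: 6,000 RCU, 3,000 WCU, 100 GB of data
print(estimate_partitions(6000, 3000, 100))   # max(2 + 3, 10) -> 10 partitions
```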

handling dynamo db read and write units

I am using DynamoDB as the back-end database in my project. I am storing items in the table, each of size 80 KB or more (they contain nested JSON), and my partition key is a unique-valued column (unique for each item). Now I want to perform pagination on this table, i.e. my UI will provide (start - integer, limit - integer, and type - 2 string constants) and my API should retrieve the items from DynamoDB based on the provided query parameters from the UI. I am using the SCAN method from boto3 (Python SDK), but this scan reads all the items from my table before applying my filters, causing a provisioned throughput error, and I cannot afford to either increase my table's throughput or opt for table auto-scaling. Is there any way my problem can be solved? Please give your suggestions.
Do you have a limit set on your scan call? If not, DynamoDB will return 1 MB of data by default. You could try using Limit and some kind of sleep or delay in your code, so that you process your table at a slower rate and stay within your provisioned read capacity. If you do this, you'll have to use the LastEvaluatedKey returned to you to page through your table.
Keep in mind that just reading a single one of your 80 KB items consumes 10 read capacity units (with the default eventually consistent Scan), and perhaps more if your items are larger.
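Here is a minimal boto3 sketch of that approach (the table name, page size and delay are hypothetical; tune them so the scan stays within your provisioned read capacity):

```python
import time

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("my-table")       # hypothetical table name

def slow_scan(page_size=25, delay_seconds=1.0):
    """Scan the whole table in small pages, pausing between pages."""
    kwargs = {"Limit": page_size}
    while True:
        response = table.scan(**kwargs)
        for item in response.get("Items", []):
            yield item
        last_key = response.get("LastEvaluatedKey")
        if not last_key:
            break                        # reached the end of the table
        kwargs["ExclusiveStartKey"] = last_key
        time.sleep(delay_seconds)        # throttle ourselves, not DynamoDB

for item in slow_scan():
    pass  # process each item here
```

Note that a Scan with a FilterExpression still consumes read capacity for every item it examines, not just the items it returns, which is why filtering alone does not avoid the throughput error.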

Do Global Secondary Indexes (GSI) in DynamoDB impact the table's provisioned capacity?

I have queries for 2 use cases with different throughput needs being directed to one DynamoDB table.
The first use case needs reads/writes only via the primary key, but needs at least 1,700 writes/sec and 8,000 reads/sec.
The second use case utilizes every GSI, but queries that use a GSI are few and far between - less than 10 queries per minute.
So my provisioned capacity for the GSIs will be far less than what is provisioned for the primary key. Does this mean that when I do a write on the table, the performance upper bound is what I have provisioned for the GSI?
I asked AWS Support the same question; below is their answer:
Your question is worth asking. In the scenario you mention, your read/write requests on the GSI will be throttled, and 10 writes/min will be the effective limit. This will create issues whenever you update your primary table, because the updates get mirrored to the GSI. So either provision similar write capacity for the GSI, or do not keep attributes in the GSI that will be updated frequently.
Here is a link to our documentation that will help you:
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GSI.html#GSI.ThroughputConsiderations
I think so. When you add new items, they will need to be added to the GSI as well, so the same capacity is needed there too.
In order for a table write to succeed, the provisioned throughput settings for the table and all of its global secondary indexes must have enough write capacity to accommodate the write; otherwise, the write to the table will be throttled.
There are more details and use cases here:
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GSI.html#GSI.ThroughputConsiderations
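For illustration, here is a hedged boto3 sketch of creating a table whose GSI gets write capacity in the same range as the base table, as the support answer above suggests (table, index and attribute names are hypothetical):

```python
import boto3

client = boto3.client("dynamodb")

client.create_table(
    TableName="orders",                                   # hypothetical
    AttributeDefinitions=[
        {"AttributeName": "order_id", "AttributeType": "S"},
        {"AttributeName": "status", "AttributeType": "S"},
    ],
    KeySchema=[{"AttributeName": "order_id", "KeyType": "HASH"}],
    ProvisionedThroughput={"ReadCapacityUnits": 8000, "WriteCapacityUnits": 1700},
    GlobalSecondaryIndexes=[
        {
            "IndexName": "status-index",                  # hypothetical
            "KeySchema": [{"AttributeName": "status", "KeyType": "HASH"}],
            "Projection": {"ProjectionType": "KEYS_ONLY"},
            # If the GSI's write capacity lags behind the table's, writes to
            # the table that touch "status" will be throttled by the GSI.
            "ProvisionedThroughput": {
                "ReadCapacityUnits": 10,        # GSI queries are rare
                "WriteCapacityUnits": 1700,     # mirror the table's write rate
            },
        },
    ],
)
```

With on-demand (PAY_PER_REQUEST) billing the explicit numbers go away, but the principle stands: every table write that touches a GSI key attribute also consumes write capacity on that GSI.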