Do Global Secondary Indexes (GSIs) in DynamoDB impact a table's provisioned capacity? - amazon-web-services

I have queries for two use cases with different throughput needs directed at one DynamoDB table.
The first use case reads and writes only by primary key, but needs at least 1700 writes/sec and 8000 reads/sec.
The second use case uses every GSI, but queries against the GSIs are few and far between: fewer than 10 queries per minute.
So my provisioned capacity for the GSIs will be far less than what is provisioned for the table. Does this mean that when I write to the table, the performance upper bound is what I have provisioned for the GSIs?

I asked AWS Support the same question. Below is their answer:
Your question is worth asking. In the scenario you mention, your read/write requests on the GSI will be throttled, and 10 writes/min will be the effective limit. This will create issues whenever you update your primary table, because the updates get mirrored to the GSI. So you should either provision similar write capacity for the GSI or not keep attributes in the GSI that get updated frequently.
Here is a link to our documentation that will help you:
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GSI.html#GSI.ThroughputConsiderations

I think so. When you add new items, they need to be added to the GSI as well, so the same write capacity is needed there.
In order for a table write to succeed, the provisioned throughput settings for the table and all of its global secondary indexes must have enough write capacity to accommodate the write; otherwise, the write to the table will be throttled.
There are more details and use cases here:
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GSI.html#GSI.ThroughputConsiderations
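To make the rule quoted above concrete, here is a minimal sketch that checks a sustained write rate against the table and each GSI. It assumes every write is at most 1 KB and touches every index, so each write costs one WCU on the table and one on each GSI. The function names, index names, and the 2000 WCU figures are hypothetical; only the 1700 writes/sec comes from the question.

```python
def write_is_throttled(write_rate, table_wcu, gsi_wcus):
    """Return True if a sustained write rate exceeds the table or any GSI.

    Assumption: each write consumes one WCU on the table and one on
    every GSI (items up to 1 KB, every write touches every index).
    """
    limits = [table_wcu, *gsi_wcus.values()]
    return write_rate > min(limits)


def bottlenecks(write_rate, table_wcu, gsi_wcus):
    """List the resources whose provisioned WCU is below write_rate."""
    names = []
    if write_rate > table_wcu:
        names.append("table")
    names.extend(
        name for name, wcu in sorted(gsi_wcus.items()) if write_rate > wcu
    )
    return names


# Hypothetical provisioning: the table can absorb 1700 writes/sec,
# but one of the two indexes is provisioned for light traffic only.
gsis = {"gsi_a": 10, "gsi_b": 2000}
print(write_is_throttled(1700, 2000, gsis))
print(bottlenecks(1700, 2000, gsis))
```

In practice a write only consumes capacity on a GSI when the item carries that index's key attributes or changes a projected attribute, so the real cost can be lower than this worst-case model.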

Related

DynamoDB partitions and custom sharding

Let's assume I have some data I want to store in my DynamoDB table. I want to use as the Primary Key the following structure: {timestamp}_{short_uuid}, e.g. "1643207769_123423-ab31d-12345d-12355". I want to ensure good distribution of these items across the partitions.
I'm wondering if "enforcing" sharding of the data by introducing the hash-range key with a specific range (like 1-20) is a good idea? This means my Primary Key would consist of:
"Partition Key" = "range(1-20)" and "Sort Key": "{timestamp}_{short_uuid}".
In other words, will the hash-range key provide better distribution than a simple partition key (regardless of the high cardinality in my example)? Ultimately, I don't care which partition an item ends up on; I just want to avoid a potential hot partition problem.
With thanks to Alex DeBrie's Everything you need to know about DynamoDB Partitions for much of this information.
Some NoSQL databases expose the partition hashing algorithm and/or the cluster topology, but DynamoDB does not. So, you don't know what it is and you can't control it.
Prior to 2018 you needed to be much more aware of how your items were sharded because DynamoDB shared your table's provisioned read/write capacity evenly across all partitions.
In 2018, AWS introduced adaptive capacity and made it instant in May 2019. Now your table's provisioned read/write capacity shifts to the partitions where it's needed. As well as adding new partitions as needed, DynamoDB will also split highly active partitions to provide consistent performance.
The upshot is that as long as you stay within an individual partition's size and throughput limits, you should not need to worry about primary keys too much.
DynamoDB's hash function (which AWS hasn't disclosed) will distribute items better than you can, as it is topology-aware (plus, your proposed partition key has low cardinality).
I'm not sure about your usage, but if you want sorting, then use a sort key.

AWS DynamoDB: What does the graph imply? What needs to be done? A few of my batch writes (delete requests) failed

Can somebody tell me what needs to be done?
I'm facing a few issues when I have 1000+ events.
A few of them are not getting deleted after my process runs.
I'm doing a batch delete through BatchWriteItem.
Each partition on a DynamoDB table is subject to a hard limit of 1,000 write capacity units and 3,000 read capacity units. If your workload is unevenly distributed across partitions, or if the workload relies on short periods of time with high usage (a burst of read or write activity), the table might be throttled.
It seems you are using DynamoDB adaptive capacity, which automatically boosts throughput capacity to high-traffic partitions. However, each partition is still subject to the hard limit. This means that adaptive capacity can't solve larger issues with your table or partition design. To avoid hot partitions and throttling, optimize your table and partition structure.
https://aws.amazon.com/premiumsupport/knowledge-center/dynamodb-table-throttled/
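One thing worth checking when some deletes silently go missing: when individual requests in a batch are throttled, BatchWriteItem does not raise an error for them; it returns them under UnprocessedItems, and the caller is expected to resubmit. Below is a minimal sketch of that retry loop. The function name, the retry count, and the backoff delays are assumptions, and `client` is anything that behaves like a boto3 DynamoDB client.

```python
import time


def batch_delete_with_retry(client, table_name, keys,
                            max_attempts=5, base_delay=0.05):
    """Delete keys in chunks of 25, resubmitting anything left unprocessed.

    Returns the delete requests that were still unprocessed after
    max_attempts, so the caller can log or retry them later.
    """
    leftovers = []
    # BatchWriteItem accepts at most 25 put/delete requests per call.
    for start in range(0, len(keys), 25):
        chunk = keys[start:start + 25]
        pending = [{"DeleteRequest": {"Key": key}} for key in chunk]
        for attempt in range(max_attempts):
            response = client.batch_write_item(
                RequestItems={table_name: pending}
            )
            pending = response.get("UnprocessedItems", {}).get(table_name, [])
            if not pending:
                break
            # Exponential backoff before resubmitting the remainder.
            time.sleep(base_delay * 2 ** attempt)
        leftovers.extend(pending)
    return leftovers
```

With a real client this would be called as `batch_delete_with_retry(boto3.client("dynamodb"), "my_table", keys)`, where `my_table` is a placeholder and each key uses the low-level attribute format, for example `{"id": {"S": "123"}}`.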
One way to better distribute writes across a partition key space in Amazon DynamoDB is to expand the space. You can do this in several different ways. You can add a random number to the partition key values to distribute the items among partitions. Or you can use a number that is calculated based on something that you're querying on.
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-partition-key-sharding.html
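As a rough illustration of the calculated-suffix variant, the sketch below derives a shard number from an attribute you already know at read time, so the same item always maps to the same partition key. The function name, the `#` separator, the use of SHA-256, and the default of 20 shards are all assumptions, not anything DynamoDB prescribes.

```python
import hashlib


def sharded_partition_key(base_key, item_id, shard_count=20):
    """Append a deterministic shard suffix (1..shard_count) to base_key.

    The suffix is calculated from item_id, so a reader who knows the
    item_id can recompute the exact partition key without scanning
    every shard.
    """
    digest = hashlib.sha256(item_id.encode("utf-8")).digest()
    shard = int.from_bytes(digest[:8], "big") % shard_count + 1
    return f"{base_key}#{shard}"


# Hypothetical values: a date as the base key, an order id as the input.
print(sharded_partition_key("2024-01-15", "order-123"))
```

A random suffix spreads writes just as well but cannot be recomputed, so reading one item back means querying every shard.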

DynamoDB: filling an empty table with tons of data is capped at 1000 WCU

I'm writing a script that should fill a new table with data in the shortest possible time (~650 GB table).
The partition (hash) key is different for every record, so I can't imagine a better key.
I've set the provisioned WCU for this table to 4k.
When the script runs, 16 independent threads put different data into the table at a high rate. During execution, I receive ProvisionedThroughputExceededException. The CloudWatch graphs show that consumed WCU is capped at 1000 WCU.
That could happen if all the data were put into one partition.
As I understand it, DynamoDB creates a new partition when the data size exceeds the 10 GB limit. Is that so?
So, during this data fill operation, I have only 1 partition, and the limit of 1000 WCU is understandable.
I've checked https://aws.amazon.com/ru/premiumsupport/knowledge-center/dynamodb-table-throttled/
But it seems those suggestions apply to tables that are already filled when you try to add a lot of new data to them.
So I have 3 questions:
1. How can I speed up the process of inserting data into the new empty table?
2. When does DynamoDB decide to create a new partition?
3. Can I set up a minimum number of partitions (for ex. 4), to use all the power of provisioned WCU (4k)?
UPD Cloudwatch graph:
UPD2: the HASH key is a long number. Actually, it's not strictly unique, but at most 2 rows share the same HASH key with different RANGE keys.
You can't manually specify the number of partitions used by DDB. It's handled automatically behind the scenes.
However, the way it's handled is laid out in the link provided by F_SO_K:
1 partition for every 10 GB of data
1 partition for every 3000 RCU and/or 1000 WCU provisioned
If you've provisioned 4000 WCU, then you should have at least 4 partitions, and you should be seeing 4000 WCU consumed. Especially given that you said your hash key is unique for every record, your data should be spread uniformly and you should not be running into a "hot" partition.
You mentioned CloudWatch showing consumed WCU at 1000. Does CloudWatch also show provisioned capacity at 4000 WCU?
If so, I'm not sure what's going on; you may have to contact AWS.
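The rule of thumb above can be turned into a quick estimate. This is only a sketch: DynamoDB does not expose its partition count, and combining the two throughput terms as a sum and then taking the larger of the throughput and size results is an assumption based on how this rule has commonly been described, not a guaranteed formula. The function name is hypothetical.

```python
import math


def estimate_partitions(rcu, wcu, size_gb):
    """Estimate the partition count for a provisioned table.

    Uses the rule of thumb: one partition per 3000 RCU / 1000 WCU of
    provisioned throughput, one per 10 GB of data, whichever is larger.
    """
    by_throughput = math.ceil(rcu / 3000 + wcu / 1000)
    by_size = math.ceil(size_gb / 10)
    return max(by_throughput, by_size, 1)


# The table from the question: 4000 WCU provisioned, empty at first,
# roughly 650 GB once the load has finished.
print(estimate_partitions(rcu=0, wcu=4000, size_gb=0))
print(estimate_partitions(rcu=0, wcu=4000, size_gb=650))
```

Under these assumptions the empty table would start with about 4 partitions, which is why a 1000 WCU ceiling on a 4000 WCU table looks surprising.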

How can I implement two sort keys in Dynamo DB?

I’m building a database using DynamoDB on AWS.
I am using variable X as a partition key, and variable Y as a sort key.
I also have a variable Z, which I need as a second sort key.
Is there a way to do this?
Generally, in DynamoDB you can create Local Secondary Indexes if you need an alternative sort key:
To give your application a choice of sort keys, you can create one or
more local secondary indexes on an Amazon DynamoDB table and issue
Query or Scan requests against these indexes.
Important things to note are that LSIs can only be created when you create your main table, and they can't be deleted later.
You can have at most 5 LSIs per table.
This is where you might use an LSI or GSI, but if these are queries you don't run often or at high velocity, or if the set returned for the PK and SK is small enough, you could just apply a filter expression to the query results. Yes, this is a little wasteful, but the cost of that waste could be less than the cost of having and maintaining a secondary index. I cannot say whether this is the case for your workload, though.
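A minimal sketch of the filter-expression approach follows, building the parameters for a Query call in the low-level client format. It assumes X and Z are string attributes; the function name and the table name `my_table` are placeholders. The attribute names X and Z come from the question.

```python
def build_query_with_filter(table_name, x_value, z_value):
    """Build Query parameters: key condition on X, filter on Z.

    The filter is applied after the items are read, so the query still
    consumes read capacity for every item under the partition key.
    """
    return {
        "TableName": table_name,
        "KeyConditionExpression": "#x = :x",
        "FilterExpression": "#z = :z",
        # Placeholders avoid clashes with DynamoDB reserved words.
        "ExpressionAttributeNames": {"#x": "X", "#z": "Z"},
        "ExpressionAttributeValues": {
            ":x": {"S": x_value},
            ":z": {"S": z_value},
        },
    }


params = build_query_with_filter("my_table", "some-x", "some-z")
print(params["FilterExpression"])
# With boto3 installed and credentials configured, this would run as:
#   boto3.client("dynamodb").query(**params)
```

A sort-key condition on Y can be added to the key condition expression if the query should also narrow by Y; only the Z check has to live in the filter.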

How do I avoid hot partitions when using DynamoDB row-level access control?

I’m looking at adding row-level permissions to a DynamoDB table using dynamodb:LeadingKeys to restrict access per Provider ID. Currently I only have one provider ID, but I know I will have more. However, the providers will vary in size, and those sizes will be very unbalanced.
If I use Provider ID as my partition key, it seems to me that my DB will end up with very hot partitions for the large providers and mostly unused ones for the smaller providers. Prior to adding the row-level access control, I was using deviceId as the partition key, since it is a more random value and so partitions well, but now I think I have to move that to the sort key.
Current partitioning that works well:
HASHKEY: DeviceId
With permissions I think I need to go to:
HASHKEY: ProviderID (only a handful of them)
RangeKey: DeviceId
Any suggestions as to a better way to set this up?
In general, you no longer need to worry about hot partitions in DynamoDB, especially if the partition keys which are being requested the most remain relatively constant.
More Info: https://aws.amazon.com/blogs/database/how-amazon-dynamodb-adaptive-capacity-accommodates-uneven-data-access-patterns-or-why-what-you-know-about-dynamodb-might-be-outdated/
Expanding on Michael's comment...
If you don't need a range key now...why add one?
The only reason to have a range key is that you need to Query DDB and return multiple records.
If all you ever need is a single record using GetItem, then you don't need a range key.
Simply concatenate ${ProviderId}.${DeviceId} together to make up your hash key.
Edit
Since you want to be able to list device Ids for a single provider, then you do need providerID as the partition key and deviceID as the range key.
As Icehorn's answer mentions, "hot partitions" aren't as big a deal as they used to be. Unless you expect the data for a single providerID to go over 10 GB, I'd start with the simple implementation of hashKey(providerID).
If you expect more than 10 GB of data, or you end up with a hot partition, then consider concatenating an integer (1..n) to the providerID.
This will mean that you'd have to query multiple partitions to get all the deviceIDs.
This approach is detailed in Multi Tenant SaaS Storage Strategies.
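To show what querying multiple partitions looks like, here is a small sketch of the read side. The `#` separator and the function names are assumptions, and `query_partition` is a hypothetical callable standing in for a DynamoDB Query against one partition key.

```python
def shard_keys(provider_id, shard_count):
    """Return every sharded partition key for one provider (1..n)."""
    return [f"{provider_id}#{n}" for n in range(1, shard_count + 1)]


def list_devices(provider_id, shard_count, query_partition):
    """Collect device IDs for a provider by querying each shard in turn.

    query_partition(key) is assumed to return the device IDs stored
    under that one partition key.
    """
    devices = []
    for key in shard_keys(provider_id, shard_count):
        devices.extend(query_partition(key))
    return sorted(devices)


# In-memory stand-in for the table, keyed by sharded partition key.
fake_table = {"acme#1": ["dev-3"], "acme#2": ["dev-1", "dev-2"]}
print(list_devices("acme", 3, lambda key: fake_table.get(key, [])))
```

The shard count becomes part of the read path, so raising it later means readers must query the new shards as well.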