Let's assume I have some data I want to store in my DynamoDB table. I want to use as the Primary Key the following structure: {timestamp}_{short_uuid}, e.g. "1643207769_123423-ab31d-12345d-12355". I want to ensure good distribution of these items across the partitions.
I'm wondering if "enforcing" sharding of the data by introducing the hash-range key with a specific range (like 1-20) is a good idea? This means my Primary Key would consist of:
"Partition Key" = "range(1-20)" and "Sort Key": "{timestamp}_{short_uuid}".
In other words, will the hash-range key provide better distribution than just a simple partition key (regardless of the high cardinality, as in my example)? Ultimately, I'm not interested in which partition an item ends up on; I just want to avoid a potential hot partition problem.
With thanks to Alex DeBrie's Everything you need to know about DynamoDB Partitions for much of this information.
Some NoSQL databases expose the partition hashing algorithm and/or the cluster topology, but DynamoDB does not. So, you don't know what it is and you can't control it.
Prior to 2018 you needed to be much more aware of how your items were sharded because DynamoDB shared your table's provisioned read/write capacity evenly across all partitions.
In 2018, AWS introduced adaptive capacity and made it instant in May 2019. So, now your table's provisioned read/write capacity shifts to the partitions where it's needed and, as well as being able to add new partitions as needed, DynamoDB will also split highly-active partitions to provide consistent performance.
The upshot is that as long as you stay within an individual partition's size and throughput limits, you should not worry about primary keys too much.
DynamoDB's hash function (which AWS has not disclosed) will distribute your items better than you can, because the service is topology aware (and your proposed 1-20 partition key would have low cardinality).
Not sure about your usage but if you want sorting, then use a sort key.
Related
Can somebody tell me what needs to be done?
I'm facing a few issues when I have 1000+ events.
A few of them are not getting deleted after my process runs.
I'm doing a batch delete through BatchWriteItem.
Each partition on a DynamoDB table is subject to a hard limit of 1,000 write capacity units and 3,000 read capacity units. If your workload is unevenly distributed across partitions, or if the workload relies on short periods of time with high usage (a burst of read or write activity), the table might be throttled.
It seems you are relying on DynamoDB adaptive capacity, which automatically boosts throughput capacity for high-traffic partitions. However, each partition is still subject to the hard limit. This means that adaptive capacity can't solve larger issues with your table or partition design. To avoid hot partitions and throttling, optimize your table and partition structure.
https://aws.amazon.com/premiumsupport/knowledge-center/dynamodb-table-throttled/
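One more thing worth checking when some deletes go missing under throttling: BatchWriteItem can return a non-empty UnprocessedItems map, and resubmitting those requests is the caller's job. Below is a minimal sketch of such a retry loop. It assumes a boto3-style client object exposing `batch_write_item`; the function name, the retry count and the backoff values are hypothetical choices for illustration.

```python
import time


def batch_delete_with_retry(client, table_name, keys, max_attempts=5, base_delay=0.05):
    """Delete `keys` in batches of 25, resubmitting anything DynamoDB hands back.

    `client` is assumed to behave like a boto3 DynamoDB client, i.e. it has
    batch_write_item(RequestItems=...) returning a dict with "UnprocessedItems".
    Returns the delete requests still unprocessed after max_attempts.
    """
    leftovers = []
    for start in range(0, len(keys), 25):  # BatchWriteItem takes at most 25 requests
        pending = [{"DeleteRequest": {"Key": key}} for key in keys[start:start + 25]]
        for attempt in range(max_attempts):
            response = client.batch_write_item(RequestItems={table_name: pending})
            pending = response.get("UnprocessedItems", {}).get(table_name, [])
            if not pending:
                break
            time.sleep(base_delay * (2 ** attempt))  # exponential backoff before retrying
        leftovers.extend(pending)
    return leftovers
```

With boto3 this would be called as `batch_delete_with_retry(boto3.client("dynamodb"), "my-table", keys)`, where each key is in DynamoDB attribute-value form such as `{"id": {"S": "123"}}`. Anything the function returns was never deleted and should be logged or retried later.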
One way to better distribute writes across a partition key space in Amazon DynamoDB is to expand the space. You can do this in several different ways. You can add a random number to the partition key values to distribute the items among partitions. Or you can use a number that is calculated based on something that you're querying on.
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-partition-key-sharding.html
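The two sharding strategies described above (a random suffix, or a suffix calculated from something you query on) can be sketched roughly as follows. The helper names, the `.` separator and the shard count of 10 are assumptions made for illustration, not anything prescribed by DynamoDB.

```python
import hashlib
import random

SHARD_COUNT = 10  # assumed number of shards; tune for your write volume


def random_shard_key(base_key, shard_count=SHARD_COUNT):
    """Spread writes by appending a random suffix, e.g. '2014-07-09.7'."""
    return f"{base_key}.{random.randint(1, shard_count)}"


def calculated_shard_key(base_key, attribute, shard_count=SHARD_COUNT):
    """Derive the suffix from an attribute that is known again at read time."""
    digest = hashlib.sha256(attribute.encode("utf-8")).digest()
    shard = int.from_bytes(digest[:8], "big") % shard_count + 1
    return f"{base_key}.{shard}"


def all_shard_keys(base_key, shard_count=SHARD_COUNT):
    """Every partition key value a reader must query when suffixes are random."""
    return [f"{base_key}.{n}" for n in range(1, shard_count + 1)]
```

The trade-off shows up on the read side: with random suffixes a reader has to query every value from `all_shard_keys`, whereas a calculated suffix lets a reader who knows the attribute go straight to a single partition key.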
I’m building a database using DynamoDB on AWS.
I am using variable X as a partition key, and variable Y as a sort key.
I also have a variable Z which I need as a second sort key.
Is there a way to do this?
Generally, in DynamoDB you can create Local Secondary Indexes if you need an alternative sort key:
To give your application a choice of sort keys, you can create one or
more local secondary indexes on an Amazon DynamoDB table and issue
Query or Scan requests against these indexes.
Important things to note are that LSIs can only be created when you create your main table, and they can't be deleted later.
You can have a maximum of 5 LSIs per table.
This is where you might use an LSI or GSI, but if these are queries you are not running often and/or not at a high velocity, or the set returned for the PK and SK is small enough, you could just apply a filter expression to the query results. Yes, this is a little wasteful, but the cost of that waste could be less than the literal cost of having and maintaining a secondary index. I cannot say whether this is the case for your workload, though.
I’m looking at adding row-level permissions to a DynamoDB table using dynamodb:LeadingKeys to restrict access per Provider ID. Currently I only have one provider ID, but I know I will have more. However, the providers will vary in size, and those sizes will be very unbalanced.
If I use Provider ID as my partition key, it seems to me like my DB will end up with very hot partitions for the large providers and mostly unused ones for the smaller providers. Prior to adding the row-level access control I was using deviceId as the partition key since it is a more random name, so partitions well, but now I think I have to move that to the sort key.
Current partitioning that works well:
HASHKEY: DeviceId
With permissions I think I need to go to:
HASHKEY: ProviderID (only a handful of them)
RangeKey: DeviceId
Any suggestions as to a better way to set this up?
In general, you no longer need to worry about hot partitions in DynamoDB, especially if the partition keys which are being requested the most remain relatively constant.
More Info: https://aws.amazon.com/blogs/database/how-amazon-dynamodb-adaptive-capacity-accommodates-uneven-data-access-patterns-or-why-what-you-know-about-dynamodb-might-be-outdated/
Expanding on Michael's comment...
If you don't need a range key now...why add one?
The only reason to have a range key is that you need to Query DDB and return multiple records.
If all you ever need is a single record using GetItem, then you don't need a range key.
Simply concatenate ${ProviderId}.${DeviceId} together to make up your hash key.
Edit
Since you want to be able to list device Ids for a single provider, then you do need providerID as the partition key and deviceID as the range key.
As Icehorn's answer mentions, "hot partitions" aren't as big a deal as they used to be. Unless you expect the data for a single providerID to go over 10GB, I'd start with the simple implementation of hashKey(providerID).
If you expect more than 10GB of data, or you end up with a hot partition, then consider appending an integer (1..n) to the providerID.
This will mean that you'd have to query multiple partitions to get all the deviceIDs.
This approach is detailed in Multi Tenant SaaS Storage Strategies
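As a rough sketch of what querying multiple partitions looks like on the read side: the code below fans out over every sharded partition key value for one provider and merges the results. The `#` separator, the function names and the `query_partition` callable (standing in for a wrapper around a DynamoDB Query on one partition key value) are all hypothetical.

```python
def shard_partition_keys(provider_id, shard_count):
    """All partition key values used for one provider, e.g. 'acme#1' .. 'acme#n'."""
    return [f"{provider_id}#{n}" for n in range(1, shard_count + 1)]


def list_device_ids(provider_id, shard_count, query_partition):
    """Collect device IDs for a provider by querying each shard in turn.

    `query_partition` is a hypothetical callable that takes one partition key
    value and returns the device IDs stored under it.
    """
    device_ids = []
    for partition_key in shard_partition_keys(provider_id, shard_count):
        device_ids.extend(query_partition(partition_key))
    return sorted(device_ids)
```

The number of Query calls grows linearly with the shard count, so n is best kept as small as the data size and traffic allow.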
We have to work on an IoT system. Basically sensors sending data to the cloud, and users being able to access the data belonging to them.
The amount of data can be pretty substantial so we need to ensure something that covers both security and heavy load.
The type of data is pretty straightforward: basically a measure and its value at a specified time.
The idea was to use DynamoDB for this, having a table with :
[id of sensor-array]
[id of sensor]
[type of measure]
[value of measure]
[date of measure]
The idea was for the IoT system to put directly (in python) data into the database.
Our questions are :
In terms of performance :
will DynamoDB be able to handle a lot of insertions on a daily basis (we may be talking about hundreds of thousands of insertions per minute)?
will querying the table by the id of the sensor array and a minimum date retrieve the data efficiently?
In terms of security is it okay to proceed this way?
We used to use NoSQL databases like MongoDB, so we're finding it hard to apply our notions to DynamoDB, where the data seems to be arranged in a pretty simple fashion.
Thanks.
will DynamoDB be able to handle a lot of insertions on a daily basis (we may be talking about hundreds of thousands of insertions per minute)?
Yes, DynamoDB will sustain (and charge for) all the write throughput you provision for the table. As your data is small, aggregating before writing in batches (BatchWriteItem) is probably more cost efficient than individual writes.
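A minimal sketch of the batching step, assuming readings are buffered in memory before being sent. BatchWriteItem accepts at most 25 put/delete requests per call, so the buffer is cut into chunks of that size; the helper names and the table name used in the usage note are hypothetical.

```python
from itertools import islice

MAX_BATCH_SIZE = 25  # BatchWriteItem accepts at most 25 put/delete requests per call


def chunked(items, size=MAX_BATCH_SIZE):
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def to_put_requests(table_name, batch):
    """Shape one batch as the RequestItems argument of BatchWriteItem.

    Items are expected in DynamoDB attribute-value form, e.g.
    {"sensor_id": {"S": "s1"}}.
    """
    return {table_name: [{"PutRequest": {"Item": item}} for item in batch]}
```

Each chunk would then be sent with something like `client.batch_write_item(RequestItems=to_put_requests("measures", batch))`. Note that every item in a batch still consumes write capacity on its own; batching mainly saves round trips, while packing several readings into one item is what reduces the number of writes.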
will querying the table by the id of the sensor array and a minimum date retrieve the data efficiently?
Yes, queries by hash (id) and range keys (date) would be very efficient. You may need secondary indexes for more complex queries though.
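To make the hash-plus-range query concrete, here is a sketch that builds the parameters for a Query on the sensor array id with a lower bound on the date. The attribute names `sensor_array_id` and `measure_date` and the function name are assumptions, and dates are assumed to be stored as ISO 8601 strings so that string order matches chronological order.

```python
def build_sensor_query(table_name, sensor_array_id, min_date):
    """Return keyword arguments for a boto3-style client.query(**kwargs) call."""
    return {
        "TableName": table_name,
        # Equality on the partition key, range condition on the sort key.
        "KeyConditionExpression": "#sa = :sa AND #d >= :min_date",
        "ExpressionAttributeNames": {
            "#sa": "sensor_array_id",  # assumed partition key name
            "#d": "measure_date",      # assumed sort key name
        },
        "ExpressionAttributeValues": {
            ":sa": {"S": sensor_array_id},
            ":min_date": {"S": min_date},
        },
    }
```

Because both conditions are on key attributes, DynamoDB reads only the matching items in that one partition rather than scanning the table.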
In terms of security is it okay to proceed this way?
Although data is encrypted in transit and client-side encryption is straightforward, there is a lot left uncovered here. For example, AWS IoT provides TLS mutual authentication with certificates, IAM or Cognito, among other security features. AWS IoT can store data in DynamoDB using a simple rule.
I have queries for 2 use cases with different throughput needs being directed to one DynamoDB table.
First use case needs read/write only using primary key, but needs at least 1700/sec write and 8000/sec read
Second Use case utilizes every GSI, but queries that use GSI are few and far between. Less than 10 queries per minute.
So my provisioned capacity for GSI will be far less than what is provisioned for primary key. Does this mean when I do a write on the table, the performance upper bound is what I have provisioned for GSI?
I asked AWS Support the same question. Below is their answer:
Your question is worth asking. In the scenario you mention, your read/write requests on the GSI will be throttled, and 10 writes / min will be the effective limit. This will create issues whenever you update your primary table, because the updates get mirrored to the GSI. So either you should provision similar write capacity for the GSI, or not keep attributes in the GSI that get updated frequently.
Here is link to our documentation that will help you :
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GSI.html#GSI.ThroughputConsiderations
I think so. When you add new items, they need to be added to the GSI as well, so the same write capacity is needed there.
In order for a table write to succeed, the provisioned throughput settings for the table and all of its global secondary indexes must have enough write capacity to accommodate the write; otherwise, the write to the table will be throttled.
There are more details and use cases here:
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GSI.html#GSI.ThroughputConsiderations
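As a back-of-the-envelope illustration of that rule, the sketch below estimates the write rate a table can sustain. It is a deliberately simplified model: it assumes every table write also updates every GSI and consumes the same number of write capacity units in each, and it ignores burst capacity. The function name and the example GSI capacities are hypothetical; 1700 is the write rate from the question.

```python
def sustainable_writes_per_second(table_wcu, gsi_wcus, wcu_per_write=1):
    """Rough upper bound on writes/sec when every write also updates every GSI.

    The smallest provisioned write capacity among the table and its GSIs
    becomes the bottleneck under this simplified model.
    """
    return min([table_wcu, *gsi_wcus]) / wcu_per_write
```

For example, `sustainable_writes_per_second(1700, [5, 1700])` is limited by the 5 WCU index, which is why an under-provisioned GSI can throttle writes to the base table.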