I need to be able to run some range-based queries on my DynamoDB table, such as int_attribute > 5, or starts_with(string_attribute, "foo"). These can all be answered by creating a global or local secondary index and then submitting a Query to those indexes. However, running a Query requires that you also provide a single value of the partition key to restrict the result set. Neither of these queries has a strict equality condition, so I am considering giving all the items in my Dynamo table the same partition key, and distinguishing them only with the sort key. My dataset is well within the 10 GB partition size limit.
Are there any catastrophic issues that might occur if I do this?
Yes, you can create a GSI where every item goes under the same partition key. The thing to be aware of is that you'll generally be putting all those writes into the same physical partition, which has a maximum update rate of 1,000 WCU.
If your update rate is below that, proceed. If your update rate is above that, you'll want to follow a pattern of sharding the GSI partition key value so it spreads across more partitions.
Say you require 10,000 WCU for the GSI. You can assign each item's GSI PK a value of value-{x}, where x is a random number from 0 to 9. Then yes, at query time you do 10 queries and merge the results back together yourself. This approach can scale as large as you need.
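A minimal sketch of that write-sharding pattern with boto3, assuming hypothetical names throughout (a table my_table and a GSI gsi_sharded with partition key gsi_pk and sort key sort_attr); pagination is omitted for brevity:

```python
import random

import boto3
from boto3.dynamodb.conditions import Key

NUM_SHARDS = 10  # sized so each shard stays comfortably under 1,000 WCU
table = boto3.resource("dynamodb").Table("my_table")  # hypothetical table name

def put_item_sharded(item):
    # Spread writes across shards by randomizing the GSI partition key.
    item["gsi_pk"] = f"value-{random.randrange(NUM_SHARDS)}"
    table.put_item(Item=item)

def query_all_shards(threshold):
    # Fan out one Query per shard and combine the results client-side.
    results = []
    for shard in range(NUM_SHARDS):
        resp = table.query(
            IndexName="gsi_sharded",  # hypothetical GSI name
            KeyConditionExpression=Key("gsi_pk").eq(f"value-{shard}")
            & Key("sort_attr").gt(threshold),
        )
        results.extend(resp["Items"])
    return results
```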
Related
Would there be any problem if I create my table without a RANGE value (ie, HASH value only)? I've heard that DynamoDB tables store entries in partitions, and these partitions are determined using the HASH key. If I use only HASH key on my table, would it cause too many partitions? Also, would it increase the time to seek data in my table using queries?
There's no drawback. The hash key decides the partition the data will live on, and partitions are designed to handle up to 3,000 RCU and 1,000 WCU, or up to 10 GB of data.
Performance "guarantees" are based on them delivering that, so it shouldn't matter.
Having unique partition keys may actually help you with scaling later down the line, as chances are that requests can be spread out more evenly; you also aren't giving up anything, since there is no Query operation that works on multiple items in an item collection anyway.
would it cause too many partitions?
There's no such thing in Dynamo DB.
would it increase the time to seek data in my table using queries
You can't Query() a DDB table unless it has a range key. With only a hash key you can only use GetItem(). Scan() is also allowed, but you really shouldn't be using that regularly.
You'd have to add a Global Secondary Index (GSI) with a hash & range key in order to Query() your data.
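To illustrate the difference, a small sketch with boto3 (table, index, and attribute names are all hypothetical):

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("my_table")  # hypothetical name

# Hash-key-only table: GetItem() is the lookup you get.
item = table.get_item(Key={"id": "abc-123"}).get("Item")

# Query() needs a hash & range key, e.g. via a GSI defined with
# hash key 'type' and range key 'created':
resp = table.query(
    IndexName="my_gsi",  # hypothetical GSI name
    KeyConditionExpression=Key("type").eq("order") & Key("created").gt(1700000000),
)
```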
I'd like to list records from my DDB table ordered by creation date.
My table has an attribute DateCreated.
All examples I can find describe ordering within some partition.
But I want global ordering.
Am I supposed to create an artificial attribute which will have the same value across all records, just to use it as a partition key? E.g. add new attribute GlobalPartition with value 1 to every record in the table, and create a GSI with partition key GlobalPartition and sort key DateCreated. Isn't there a better way?
Thx!
As you noticed, DynamoDB indeed does not have an option to sort items "globally". In other words, there is no way to Scan the database in sorted partition-key order. You can only sort items inside one partition, sorted by the "sort key".
When you have a small amount of data, you can indeed do what you said: have a single partition with everything in it. However, it's not clear how practical this approach remains as your single partition grows - to gigabytes or terabytes - or how well DynamoDB can load-balance when you have just a single partition (I never saw any DynamoDB documentation that answers this question).
So another option is not to have a single partition but rather a number of them. For example, suppose you want to sort items by date. Instead of having a single partition, have a partition per month, i.e., the partition key is the month number. Now, if you want to sort everything within a month, you can do it directly, but if you want a sorted list for a full year, you need to Query twelve partitions in order, getting a sorted list from each one and combining them into a sorted list for the full year. So-called time-series databases are often modeled this way.
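A minimal sketch of that per-month layout with boto3, assuming a hypothetical events table whose partition key is month (e.g. "2023-04") and whose sort key is the date; pagination is omitted for brevity:

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("events")  # hypothetical table name

def query_year_sorted(year):
    # Query each month's partition in calendar order. Each Query returns
    # its items already sorted by the sort key, so simple concatenation
    # preserves the global order for the year.
    items = []
    for month in range(1, 13):
        resp = table.query(
            KeyConditionExpression=Key("month").eq(f"{year}-{month:02d}")
        )
        items.extend(resp["Items"])
    return items
```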
If you want to sort data in DynamoDB, the attribute you sort by must be the sort key of some index. If the value is not in the attribute that maps to the table's sort key, or the table does not have a sort key, then you need to create a GSI whose sort key is that attribute. You can use an LSI too. Any attribute that maps to the "Sort Key" of any index will do: table, LSI, or GSI.
For more details, check the ScanIndexForward parameter of the Query request:
If ScanIndexForward is true, DynamoDB returns the results in the order in which they are stored (by sort key value). This is the default behavior. If ScanIndexForward is false, DynamoDB reads the results in reverse order by sort key value, and then returns the results to the client.
https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_Query.html#API_Query_RequestSyntax
The console UI has a checkbox for this too.
"Global sort" is not possible, while "global" would mean scan operation and it just runs through all rows in database and filters by filters, yet it does not have sorting option. On query on attribute mapped to sort key has ScanIndexForward option to change sort direction.
I have a DynamoDB table with a good partition key (PK=uuid) but there is a GSI (PK=type, SK=created) where type has only 6 unique values and created is epoch time.
My question is: if I start to do a lot of reads with this GSI, will that affect the performance of the whole table? I see that the read capacity for the table and the GSI is not shared, according to the AWS docs, but what will happen behind the scenes if we start to use this GSI a lot? Will DynamoDB writes get impacted?
A global secondary index has separate capacity units, to avoid impacting the performance of the table itself.
A global secondary index is stored in its own partition space away from the base table and scales separately from the base table.
Performance on your table can only be impacted if its own credits are depleted. The global secondary index sits in its own partition space, which can be treated as if it has its own boundaries.
In addition, as DynamoDB uses separate credits for read (RCU) and write (WCU), these 2 actions would never have a performance impact on each other.
Is it possible to increase the index limit from 5 to 15?
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Limits.html
Secondary Indexes Per Table: You can define a maximum of 5 local secondary indexes and 5 global secondary indexes per table.
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Query.html
The Query operation finds items based on primary key values. You can query any table or secondary index that has a composite primary key (a partition key and a sort key).
If I understood this correctly, you can set 1 main hash key, 5 local secondary indexes and another 5 global ones... and you can only query against an index.
We are thinking about using DynamoDB for a NoSQL database but we are completely stumped by this. In Mongo or Elastic or Solr... you can query by pretty much any doc attr you want.
In this app we already have 15 attributes we know we will want to query against, but DynamoDB only offers the ability to index 5... unless I am mistaken... is there another way to query aside from against a preset index?
You can define a maximum of 5 local secondary indexes.
There is an initial quota of 20 global secondary indexes per table. To request a service quota increase, see https://aws.amazon.com/support.
Source: Secondary Indexes # Developer Guide
Unfortunately, the 5 local secondary indexes (LSI) service quota cannot be extended.
When you have more attributes to query (>20) than DynamoDB can index, querying them efficiently is not possible. You would have to use Scan, which evaluates all the items in the table and is not efficient. At that point you have to either move to a different database or use AWS Elasticsearch to index the attributes for searching.
The limit for number of global secondary index per table has been increased to 20.
You can cut a support case in case you need to create more than 20 global secondary indexes for a DynamoDB table.
https://aws.amazon.com/about-aws/whats-new/2018/12/amazon-dynamodb-increases-the-number-of-global-secondary-indexes-and-projected-index-attributes-you-can-create-per-table/
It turns out that the answer was to wait. DynamoDB now supports 20 global secondary indexes. According to:
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Limits.html
There is an initial limit of 20 global secondary indexes per table. To request a service limit increase, see https://aws.amazon.com/support.
Docker image still has the 5 GSI limit :(
https://hub.docker.com/r/amazon/dynamodb-local
There is a way to circumvent the limit on the number of indexes by overloading an indexed column. For example, you may store multiple data attributes in the same partition key or sort key. A problem may arise when the values of those different attributes can overlap. In that case you can, say, prepend a prefix that distinguishes between the different attributes.
Let's take a look at an example. Say we have a data set with attributes like user-name, employee-name, and company-name, and we want to store them all in the same indexed column (say, the sort key of a global secondary index). Some values of the attributes may overlap, so we "tag" our attributes with a prefix: user#name, employee#name and company#name.
This allows us to query with conditions like begins_with(sk, "user#") without mixing up different attributes, while still having them all indexed.
More information is in the official AWS documentation: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-gsi-overloading.html
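A small sketch of the overloaded-key pattern with boto3 (all names are hypothetical; shown on a table's own key, but the same applies to a GSI key):

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("directory")  # hypothetical name

# One overloaded sort key carries several logical attributes,
# distinguished by a prefix.
table.put_item(Item={"pk": "org-1", "sk": "user#alice"})
table.put_item(Item={"pk": "org-1", "sk": "employee#alice"})
table.put_item(Item={"pk": "org-1", "sk": "company#acme"})

# begins_with on the sort key selects just one logical attribute.
resp = table.query(
    KeyConditionExpression=Key("pk").eq("org-1") & Key("sk").begins_with("user#")
)
```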
Hope this helps.
I am working on a migration from MS SQL to DynamoDB and I'm not sure what's the best hash key for my purpose. In MS SQL I have an item table where I store some product information for different customers, so the primary key is actually two columns, customer_id and item_no. In application code I need to query both specific items and all items for a customer id, so my first idea was to set up the customer id as hash key and the item no as range key. But is this the best concept in terms of partitioning? I need to import product data daily, with 50,000-100,000 products for some larger customers, and as far as I know it would be better to have a random hash key. Otherwise the import job will run on one partition only.
Can somebody give me a hint what's the best data model in this case?
Bye,
Peter
It sounds like you need item_no as the partition key, with customer_id as the sort key. Also, in order to query all items for a customer_id efficiently you will want to create a Global Secondary Index on customer_id.
This configuration should give you a good distribution while allowing you to run the queries you have specified.
You are on the right track; you should really be careful about how you handle write operations, since you are executing an import job on a daily basis. Also avoid adding indexes unnecessarily, as they will only multiply your write operations.
Using customer_id as hash key and item_no as range key is the best option not only for querying but also for uploading your data.
As you mentioned, randomizing your customer ids would be very helpful to optimize the use of resources and prevent the possibility of a hot partition. In your case, I would follow the exact example contained in the DynamoDB documentation:
[...] One way to increase the write throughput of this application would be to randomize the writes across multiple partition key values. Choose a random number from a fixed set (for example, 1 to 200) and concatenate it as a suffix [...]
So when you are writing your customer information, just randomly assign a suffix to your customer ids, making sure you distribute them evenly (e.g. CustomerXYZ.1, CustomerXYZ.2, ..., CustomerXYZ.200).
To read all of the items you would need to obtain the items for each suffix. For example, you would first issue a Query request for the partition key value CustomerXYZ.1, then another Query for CustomerXYZ.2, and so on through CustomerXYZ.200. Because you know the suffix range (in this case 1...200), you only need to query the records appending each suffix to the customer id.
Each query by the hash key CustomerXYZ.n should return a set of items (specified by the range key) from that specific customer; your application would need to merge the results from all of the Query requests.
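A minimal sketch of that fan-out read with boto3, under the assumptions above (hypothetical items table with hash key customer_id and range key item_no; pagination omitted for brevity):

```python
import heapq

import boto3
from boto3.dynamodb.conditions import Key

NUM_SUFFIXES = 200
table = boto3.resource("dynamodb").Table("items")  # hypothetical table name

def get_all_items_for_customer(customer_id):
    # One Query per suffix; each result set comes back already sorted by
    # item_no, so a k-way merge yields one sorted list for the customer.
    per_suffix = []
    for n in range(1, NUM_SUFFIXES + 1):
        resp = table.query(
            KeyConditionExpression=Key("customer_id").eq(f"{customer_id}.{n}")
        )
        per_suffix.append(resp["Items"])
    return list(heapq.merge(*per_suffix, key=lambda item: item["item_no"]))
```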
This will certainly make reading the records harder (in terms of the additional requests needed); however, the benefits of optimized throughput and performance will pay off. Remember that a hot partition will not only increase your overall financial cost, but will also drastically impact your performance.
If you have a well designed partition key your queries will always return very quickly with minimum cost.
Additionally, make sure your import job does not execute write operations grouped by customer. For example, instead of writing all items from a specific customer in series, sort the write operations so they are distributed across all customers (see the sketch after the quoted guidance below). Even though your customers will be distributed across several partitions (due to the id randomization process), you are better off taking this additional safety measure to prevent a burst of write activity in a single partition. More details below:
From the 'Distribute Write Activity During Data Upload' section of the official DynamoDB documentation:
To fully utilize all of the throughput capacity that has been provisioned for your tables, you need to distribute your workload across your partition key values. In this case, by directing an uneven amount of upload work toward items all with the same partition key value, you may not be able to fully utilize all of the resources DynamoDB has provisioned for your table. You can distribute your upload work by uploading one item from each partition key value first. Then you repeat the pattern for the next set of sort key values for all the items until you upload all the data [...]
Source:
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html
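A sketch of that interleaving with boto3 (hypothetical items table; items_by_customer is an assumed in-memory dict mapping customer_id to a list of item dicts):

```python
from itertools import chain, zip_longest

import boto3

table = boto3.resource("dynamodb").Table("items")  # hypothetical table name

def upload_interleaved(items_by_customer):
    # Write one item per customer per round, instead of one customer at a
    # time, so the upload touches many partition key values concurrently.
    rounds = zip_longest(*items_by_customer.values())
    with table.batch_writer() as batch:
        for item in chain.from_iterable(rounds):
            if item is not None:  # zip_longest pads shorter lists with None
                batch.put_item(Item=item)
```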
I hope that helps. Regards.