3 fields composite primary key (unique item) in Dynamodb - amazon-web-services

I am trying to create a table to store invoice line items in DynamoDB. Let's say the item is defined by CompanyCode, InvoiceNumber and LineItemId, amount and other line item details.
A unique item is defined by the combination of the first 3 attributes. Any 2 of those attributes can be same for the different items. What should I select as the Hash Attribute and the Range Attribute?

Some Intro
For efficiency I would propose totally different design. With NoSQL databases (and DynamoDB is not different) we always need to consider the access patterns first. Also, if possible we should strive to fit all our data within same table and several indexes. From what we have from OP and his comments, these are the two access patterns:
For a company X, get complete invoice Y (including all items or range of items) [based on this comment ]
Get all invoices for company X [ based on this comment ]
We now wonder what is a good Primary Key? Translates to question what is a good Partition Key (PK) and what is a good Sort Key (SK) and which secondary indexes do we need to create and of what kind (local or global)? Some reminders:
Primary Key can be on one column or composite
Composite primary key consists of Partition Key and Sort Key
Partition key is used as input to the hashing function that will determine partition of the items
Sort key can also be composite, which allows us to model one-to-many relationships in DynamoDB as given in one of the comments links: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-sort-keys.html
When creating query on the table or index, you always need to use '=' operator on the Partition Key
When querying ranges on Sort Key you have option for KeyConditionExpression which provides you with set of operators for sorting and everything in between (one of them being function begins_with (a, substr) )
You are also allowed to use FilterExpression if you need to further refine the Query results (filter on the projected attributes)
Local Secondary Indexes (LSI) have same Partition Key but different Sort Key than your original table and give you different view of your data, organized according to an alternative Sort Key
Global Secondary Indexes (GSI) have different Partition Key and different Sort Key than your original table and give you completely different view on data
All items with the same partition key are stored together, and for composite Primary keys, are ordered by the sort key value. DynamoDB splits partitions by sort key if the collection size grows bigger than 10 GB.
Back To Modeling
It is obvious that we are dealing with multiple entities that need to be modeled and fit into the same table. To satisfy condition of Partition Key being unique on the table, CompanyCode comes as a natural Partition Key - so I would ensure that is unique. If not then you need to ask yourself how can you model the second access pattern?
Assuming we have established uniqueness on the CompanyCode let's simplify and say that it comes in the form of an e-mail (or could be domain or just a code, but I will use email for demonstration).
Relationship between Company and Invoices is always 1:many.
Relationship between Invoice and Items is always 1:many.
I propose design as in the image below:
With PK being CompanyCode and SK being InvoiceNumber can store all attributes about that invoice for that company.
Nothing prevents me to also add record where the SK is Customer which allows me to store all attributes about the company.
With GSI1 , we will create reverse lookup where GSI1PK is my tables SK (InvoiceNumber) and my GSI1SK is my tables PK (CompanyCode).
I am using same table to store line items with PK being LineItemId and SK being CompanyCode (still unique)
For Item entity items my GSI1PK is still InvoiceNumber and my GSI1SK is LineItemId which is tables PK so its same as for Invoice entity items.
Now the access patterns supported with this:
If I want to get invoice Y for company X and all the items (access pattern 1): Query the table where CompanyCode=X and use KeyConditionExpression with = operator on the Sort Key InvoiceNumber. If I want to get all the items tied to that invoice, I will project Items attribute using ProjectionExpression.
By retrieving all the items with previous query for company X and invoice Y, I can now run BatchGetItem API call (using my unique composite key LineItemId+CompanyCode) on table to get all items belonging to that particular invoice of that particular customer. (this comes with some constraints of BatchGetItem API)
To support access pattern 2, I will do a query with CompanyCode=X on PK and use KeyConditionExpression on the SK with begins_with (a, substr) function/operator to get only invoices for company X and not the metadata about that company. That will give me all invoices for given company/customer.
Additionally, with above GSI1, for any given InvoiceNumber I can easily select all the line items that belong to that particular invoice. REMEMBER: The key values in a global secondary index do not need to be unique - so in my GSI1 I could have had easily invoice_1 -> (item_1, item_2) and then another invoice_1 -> (item_1,item_2) but the difference between two items in GSI would be in the SK (it would be associated with different CompanyCode (but for demonstration purposes I used invoice_1 and invoice_2).

I believe the first option offered by #georgeaf99 won't work, because if you do it that way, then CompanyCode has to be unique in the table. Therefore, there would only be one item allowed per company. I think the second solution is the only real way to do it.
You can use CompanyCode as the Hash Key, and then all other fields that combine to make the item unique (in this case InvoiceNumber and LineItemId) need to be somehow combined into one value (such as concatenation with a field delimiter), which would be your Range Key. Unfortunately that is kind of ugly, but that's the nature of a NoSQL database like DynamoDB. However, it will allow you to successfully store records with the correct uniqueness. When reading the records back, if you don't want to parse the combined field back out to its individual parts, then you'll have to add additional separate fields for InvoiceNumber and LineItemID.
If you don't have a large number of invoices per company, you can query by only the Hash Key and do the filtering on the client side. If you have a large number of invoices per company and need to be able to query only the items for a single invoice, then I would create a secondary index on CompanyCode and InvoiceNumber.

As I'm sure you have figured out you cannot have more than two attributes form your primary key (hash+range). Thus, depending on the type of queries you will be performing and the size of your data you can structure your table in different ways.
(Optimized for the query type you mentioned above: only CompanyCode & all 3)
Best sol'n for small/medium size data sets:
Hash Key: CompanyCode
Perform the query using only CompanyCode and
then filter your results on the other two attributes
Optimal solution for large data sets:
Hash Key: CompanyCode
Range Key: InvoiceNumber+LineItemId
This allows you to query only on an index, but the table structure is pretty ugly

Related

How to structure DynamoDB index to allow retrieval by multiple fields

I'm new to DynamoDB and trying to figure out how to structure my data/table/index. My schema includes an itemid (unique) and an orderid (multiple items per order), along with some other arbitrary attributes. I want to be able to retrieve a single item by its itemid, but also retrieve a set of items by their OrderId.
My initial instinct was to set the itemid as the primary key and the orderid as the sort key, but that didn't allow me to query by orderid only. However the same problem occurs if I reverse those.
Example data:
ItemId
OrderId
abc-123
1234
def-345
1234
ghi-678
5678
jkl-901
5678
I think I may need a Global Se but not quite understanding where those fit.
If your question is really whether you "are able" to do this, then with ItemId as the partition key, you can still retrieve by OrderId, with the Scan operation, which will let you filter by any attribute.
However Scan will perform full table scans, so the real question is probably whether you can retrieve by OrderId efficiently. In that case, you would indeed need a Global Secondary Index with OrderId and ItemId as the composite attribute key.
This is typically achieved using what's called a "single table design". What this means, is that you store all your data in one table, and store it normalized, i.e. duplicate your data so that it fits your access patterns.
Generally speaking, if you do not know your access patterns beforehand, dynamodb might not be a good fit. For many systems, a good solution is to have the "main" access patterns in dynamo and then offloading some not so performance critical ad-hoc queries by replicating data to something like elasticsearh.
If you have a table with the hash key PK (String) and the sort key SK (String), you can store your data like this. Use transactions to keep the multiple items up to date and consistent etc.
PK
SK
shippingStatus
totalPrice
cartQuantity
order_1234
order_status
PENDING
123123
order_1234
item_abc-123
3
order_1234
item_def-345
1
order_5678
order_status
SHIPPED
54321
order_5678
item_jkl-901
5
item_abc-123
order_1234
item_abc-123
order_9876
item_abc-123
order_5656
This table illustrates the schemaless nature of a dynamo table (except from the PK/SK). With this setup, you can store "metadata" about the order in the order_1234/order_status item. Then, you can query for items with PK order_1234 and SK starts_with "item_" to get all the items for that order. You can do the same to get all the orders for an item - query for PK item_abc-123 and SK starting with "order_" to get all the orders.
I highly recommend this talk by Rick Houlihan to get into single table design and data modelling in dynamo :)
https://www.youtube.com/watch?v=HaEPXoXVf2k

Retrieving arrays of nested information in AppSync schema

I have worked out a fairly complex chain of DynamoDB resolvers on a GraphQL AppSync query. What I am curious to know is if I could have possibly designed this in a way to require fewer DynamoDB queries.
Here is my GraphQL Schema:
type Tag {
PartitionKey: ID!
SortKey: ID!
TagName: String!
TagType: String
}
type Model {
PartitionKey: ID!
Name: String
Version: Int
FBX: String
# ms since epoch
CreatedAt: AWSTimestamp
Description: String
Tags: [String]
}
type Query {
GetAllModels(count: Int, nextToken: String): PaginatedModels!
}
This is the query that I am doing:
query GetAllModels{
GetAllModels {
Models {
PartitionKey
Name
Version
CreatedAt
Description
Tags {
TagName
TagType
}
}
}
}
My DynamoDB table is set up as so:
PartionKey | SortKey | TagName | TagType | ModelName | Description
Model-0 | Model-0 | ModelZero | Blah Blah
Model-0 | Tag-Pine |
Model-0 | Tag-Apple |
Tag-Pine | Tag-Pine | Pine | Tree
Tag-Apple | Tag-Apple | Apple | Fruit
So in my resolvers I am going:
GetAllModels will scan with two filters. One filter for PartitionKey beginning with 'Model-' and another filter for SortKey begining with 'Model-'. This is to get all Models.
Next there is a resolver attached to 'Tags' in the Model object. This will query with two expressions. One for PartitionKey = source.Parition and a second for SortKey begin_with 'Tag-' this gets me all of the tags on a model.
Next there are two resolvers on the Tag object. One on TagName and another on TagType. These do a direct GetItem to get their appropriate value with PartitionKey = source.Sort and SortKey = source.SortKey set as the keys.
So each scanned Model ends up firing off 3 more queries to DynamoDB. This just seems a bit excessive to me. But I cannot see any other way to do this. Is there some way to be able to get both TagName and TagType in one query?
Is there a better way to approach this?
I see a few things that I would personally change. The first is that I would avoid the nested DynamoDB scan operations. At least one of these can be replaced with a much faster query operation. The second is that I would consider rethinking how you are storing the data. Currently, there is no good way to list model objects.
Why is there no good way to list model objects?
Assuming each model object will have multiple tags then you are going to have a table that is sparsely populated by model objects. i.e. out of 100 rows you may have 20 - 50 models depending on how many tags the average model has. In DynamoDB, a table is split up based on the partition key causing rows that share the same partition key to be stored near each other to speed up query operations. With your setup where the Partition Key is essentially the unique id of a single model object this means that we can easily get a single model object. You can also quickly get the tags for a single object since those records are nearby as well.
The issue.
The DynamoDB scan operation looks at each partition one at a time, reads as many records as the requests limit allows or all of them if the limit is sufficiently large, and then, only after reading the records from the individual partitions, applies the filter expression before returning the final result. This means you may ask for the first 10 models but since the limit is applied before the scan filter, you may very well only get back 1 model (if that one model had 9 or more tags which would exhaust the limit while DynamoDB was reading the first partition). This may seem strange when coming from many different database systems and is an important consideration of its design.
Here are two solutions to address this concern:
1. Store Models in one table and Tags in another.
NoSQL databases like DynamoDB allow you to store many types of data in the same table but there is nothing wrong with splitting them out. Traditionally it can be a pain to work with multiple tables in a NoSQL database that lacks a join operation or something similar, but fortunately for us we can use GraphQL to "join" data for us. With the approach, the Model table has a single partition key named "id" and your GetAllModels resolver is still a scan but this time on the model table. This way the table is not sparse and you will get 10 models when you ask for 10 models. The Tag table should have a partition key of modelId and a sort key of tagId. You would then have a resolver on the Model.tags field that does a query against the Tag table and looks for rows with the modelId == $ctx.source.id.
This is essentially how #model and #connection work in the new graphql transform tooling launched as part of the amplify cli. You can see more here although the docs are as of writing still being improved. https://aws-amplify.github.io/amplify-js/media/api_guide
2. Store Models and Tags in the same table but change the key structure.
This approach works if you can reliably say that you will have less than 10GB of data per data type (e.g. Model & Tag). For this approach you have a single table with a PartitionKey of Type and Sort Key of id. When you create objects you create them with a Type e.g "Tag" or "Model" etc and a unique id (like a uuid). To list objects of the same type you do a DynamoDB query operation on the partition key of the type to list e.g. "Tag" or "Model". You can then use GSIs to efficiently look up related objects. In your case you would store a "modelId" is every Tag object. You would then make a GSI using the "modelId" as the Partition Key. To list all the tags for a given model you could then do a DynamoDB query operation against that GSI.
I'm sure there are many more ways to do this but hopefully this helps point in the right direction.

Database design in DynamoDB: Bookmark storage

I'm interested in best practices on how set up tables and indices for given query requirements. I have a basic understanding of related concepts such partition and sort keys or LSI and GSI secondary indices, but have problems putting it all together and designing one or more table with indices that supports a palpable example.
The example i'm looking at is a "Bookmark storage", where multiple users can store bookmarks to URLs and annotate those with a number of tags. A User has multiple Urls (= bookmark). Each Url has a date and can have one or more Tags.
A bookmark might have the following basic structure:
{
"user": "watQuadrat",
"url": "http://stackoverflow.com",
"date": 1494161436362,
"tags": [ "forum", "programming" ]
}
My biggest question at this point is how to setup the table structure so that I can accommodate the various different ways the data could be queried, e.g.:
List all Tags for a User, sorted by how often the user used a tag
List all Tags for a User, sorted alphabetically
List all Tags for an Url, sorted by how often this tag was given for the url
List all Tags matching a given search string, sorted by how often the tag was used (e.g. search for "shop", return all tags that match such as "shopping" order by how often they were used)
List all Urls for a User, sorted by date
List all Urls for a User and a Tag, sorted by date
List all Urls for a Tag, sorted by how often the tag was given to each url
List all Users for an Url, sorted by date
How would this be designed so i can perform all of these queries in a performant way? Would you design this any different when additionally trying to reduce cost?
Considering the scenario you have described, I would design the table as mentioned below. Here I have assumed that one user can create only one bookmark from a given url. And also I have used a new derived attribute called TagCount which denotes the count of tags for that bookmark.
Table Structure
Primary partition key : UserID
Primary sort key : Url
Local Secondary Indexes
Index 1
Partition key : UserID
Sort key : Date
Index 2
Partition key : UserID
Sort key : TagCount
Global Secondary Indexes
Index 1
Partition key : Url
Sort key : Date
Index 2
Partition key : Url
Sort key : TagCount
With this design you can do your queries in the following manner.
List all Tags for a User, sorted by count
Query using LSI UserID-TagCount
List all Tags for an Url, sorted by count
Query using GSI Url-TagCount
List all Tags matching a given string, sorted by count
I assume the string you meant here belongs to url. If so you will have to perform a scan
List all Urls for a User, sorted by date
Query using LSI UserId-Date
List all Urls for a User and a Tag, sorted by date
Query LSI UserId-Date table with a filter expression for searching tag
List all Urls for a Tag, sorted by count
You will have to do a scan here
List all Users for an Url, sorted by date
Query GSI Url-Date
If you are concerned about the cost. You can loose some GSIs based on the query patterns you would expect.
Update 1
Considering the updated requirement, since there are many queries based on the tag, I think there should be a second table with the following structure
Primary partition key : TagName
Primary sort key : UserID
Global Secondary Indexe
Partition key : UserID
Sort key : Usage - Derived attribute similar to tag count, total usage of the tag

DynamoDB : Global Secondary Index utilisation in queries

I am coming from RDMS background and I started using DynamoDB recently.
I have following DyamoDB table with three Global Secondary Indexes (GSI)
Id (primary key), user_id(GSI), event_type (GSI), product_id (GSI)
, rate, create_date
I have following three query patterns:
a) WHERE event_type=?
b) WHERE event_type=? AND product_id=?
c) WHERE product_id=?
d) WHERE product_id=? AND user_id=?
I know in MySQL I need to create following indexes to optimize above queries :
composite index (event_type,product_id) : for queries "a" and "b"
composite index (product_id,user_id) : for queries "c" and "d"
My question is , if I create three GSIs for 'event_type', 'product_id' and 'user_id' fields in DyanomoDB, do the query patterns "b" and "d" utilize these three independent GSIs ?
Firstly, unlike in RDBMS, the Dynamodb doesn't choose the GSI based on the fields used in filter expression (I meant there is no SQL optimizer to choose the appropriate index based on the fields used in SQL).
You will have to query the GSI directly to get the data. You can refer the GSI query page to understand more on this.
You can create two GSIs:-
1) Event type
2) Product id
You make sure to include the other required fields in the GSI especially product id, user id and any other required fields. This way when you query the GSI, you get all the fields required to fulfill the use case. As long as you have one field from GSI, you can include other fields in Filter expression to filter the data. This ensures that you dont create unnecessary GSIs which requires additional space and cost.

DynamoDB query table for two indexed fields

In my dynamoDB table I have two fields which I intend to query for a specific value. One will returns several results, the second will return one result (aka one row).
I already have a range key field which is the date that the raw inserted.
What are my options if I want to have two fields to query by a single field?
From my understanding, there is an option to create a secondary index, but I must query it alongside with the hash key, which is not what I need. Is there any option to achieve that in a single table instead of creating a second table for that query?
Example Folders table:
user_id (hash key)
date_created (range key)
folder_id
First query: Select all folders Id's for a specific user order by date descending
Second query: Select single raw for a specific folder_id
If I use secondary index on the folder_id, I will have to query it alongside the user_id (hash key), which is as you can see, not what I want to achieve.
Thanks.
You need a Global Secondary Index.
folder_id (hash key)
date_created (range) - optional
Then you can also query by the folder_id
As you said yourself, you can't use LSI in this scenario.
You will have to create another table where folder_id is the hash key and the rest of the data as attributes.
You can also query for one primary key that will return the least amount of results than filter out the rest in code.