Retrieving arrays of nested information in AppSync schema - amazon-web-services

I have worked out a fairly complex chain of DynamoDB resolvers on a GraphQL AppSync query. What I am curious to know is if I could have possibly designed this in a way to require fewer DynamoDB queries.
Here is my GraphQL Schema:
type Tag {
  PartitionKey: ID!
  SortKey: ID!
  TagName: String!
  TagType: String
}
type Model {
  PartitionKey: ID!
  Name: String
  Version: Int
  FBX: String
  # ms since epoch
  CreatedAt: AWSTimestamp
  Description: String
  Tags: [String]
}
type Query {
  GetAllModels(count: Int, nextToken: String): PaginatedModels!
}
This is the query that I am doing:
query GetAllModels {
  GetAllModels {
    Models {
      PartitionKey
      Name
      Version
      CreatedAt
      Description
      Tags {
        TagName
        TagType
      }
    }
  }
}
My DynamoDB table is set up as so:
PartitionKey | SortKey   | TagName | TagType | ModelName | Description
Model-0      | Model-0   |         |         | ModelZero | Blah Blah
Model-0      | Tag-Pine  |         |         |           |
Model-0      | Tag-Apple |         |         |           |
Tag-Pine     | Tag-Pine  | Pine    | Tree    |           |
Tag-Apple    | Tag-Apple | Apple   | Fruit   |           |
So in my resolvers I am going:
GetAllModels does a Scan with two filters: one for PartitionKey beginning with 'Model-' and another for SortKey beginning with 'Model-'. This gets all Models.
Next, there is a resolver attached to 'Tags' in the Model object. It does a Query with two key conditions: PartitionKey = source.PartitionKey and begins_with(SortKey, 'Tag-'). This gets me all of the tags on a model.
Next, there are two resolvers on the Tag object, one on TagName and another on TagType. Each does a direct GetItem for its value, with PartitionKey = source.SortKey and SortKey = source.SortKey as the key.
So each scanned Model ends up firing off 3 more queries to DynamoDB. This just seems a bit excessive to me. But I cannot see any other way to do this. Is there some way to be able to get both TagName and TagType in one query?
Is there a better way to approach this?

I see a few things that I would personally change. The first is that I would avoid the nested DynamoDB scan operations. At least one of these can be replaced with a much faster query operation. The second is that I would consider rethinking how you are storing the data. Currently, there is no good way to list model objects.
Why is there no good way to list model objects?
Assuming each model object has multiple tags, you are going to have a table that is sparsely populated by model objects, i.e. out of 100 rows you may have 20 - 50 models depending on how many tags the average model has. In DynamoDB, a table is split up based on the partition key, so rows that share the same partition key are stored near each other to speed up query operations. With your setup, where the Partition Key is essentially the unique id of a single model object, we can easily get a single model object. You can also quickly get the tags for a single object, since those records are nearby as well.
The issue.
The DynamoDB scan operation reads partitions one at a time, reading as many records as the request's limit allows (or all of them if the limit is large enough), and only then applies the filter expression before returning the result. This means you may ask for the first 10 models but, since the limit is applied before the scan filter, get back only 1 model (for example, if that model has 9 or more tags, reading its partition alone exhausts the limit). This may seem strange if you are coming from other database systems, but it is an important consequence of DynamoDB's design.
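To make the "limit first, filter second" behaviour concrete, here is a toy model of it in plain Python. It is not the DynamoDB API: the `scan` function, the integer paging token and the extra Model-1/Model-2 rows are made up for illustration, and only the ordering of the two steps is meant to match.

```python
# Toy model of Scan: Limit caps the number of items *read*,
# and the filter is applied to that page afterwards.
TABLE = [
    {"PartitionKey": "Model-0", "SortKey": "Model-0"},
    {"PartitionKey": "Model-0", "SortKey": "Tag-Apple"},
    {"PartitionKey": "Model-0", "SortKey": "Tag-Pine"},
    {"PartitionKey": "Model-1", "SortKey": "Model-1"},  # hypothetical extra rows
    {"PartitionKey": "Model-1", "SortKey": "Tag-Oak"},
    {"PartitionKey": "Model-2", "SortKey": "Model-2"},
]


def scan(items, limit, start=0):
    """Read up to `limit` items, then keep only the model rows."""
    page = items[start:start + limit]  # step 1: the limit bounds what is read
    models = [i for i in page if i["SortKey"].startswith("Model-")]  # step 2: filter
    next_start = start + limit if start + limit < len(items) else None
    return models, next_start


first_page, token = scan(TABLE, limit=3)
print(len(first_page), token)  # 3 items were read, but only 1 of them is a model
```

Asking for 3 items returns a single model here, because the other two items read were that model's tags.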
Here are two solutions to address this concern:
1. Store Models in one table and Tags in another.
NoSQL databases like DynamoDB allow you to store many types of data in the same table, but there is nothing wrong with splitting them out. Traditionally it can be a pain to work with multiple tables in a NoSQL database that lacks a join operation or something similar, but fortunately for us we can use GraphQL to "join" data for us. With this approach, the Model table has a single partition key named "id" and your GetAllModels resolver is still a scan, but this time on the Model table. This way the table is not sparse and you will get 10 models when you ask for 10 models. The Tag table should have a partition key of modelId and a sort key of tagId. You would then have a resolver on the Model.tags field that does a query against the Tag table and looks for rows with modelId == $ctx.source.id.
This is essentially how @model and @connection work in the new GraphQL transform tooling launched as part of the Amplify CLI. You can see more here, although the docs are, as of writing, still being improved. https://aws-amplify.github.io/amplify-js/media/api_guide
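As a rough sketch of the two-table layout, here is the "join" done in plain Python, with lists standing in for the two tables. The function names and sample rows are made up for illustration; in AppSync the `resolve_tags` step would be a DynamoDB Query with the key condition `modelId = :modelId`, run once per model.

```python
# In-memory stand-ins for the Model table (key: id) and the
# Tag table (partition key: modelId, sort key: tagId).
MODELS = [{"id": "Model-0", "Name": "ModelZero", "Description": "Blah Blah"}]
TAGS = [
    {"modelId": "Model-0", "tagId": "Tag-Apple", "TagName": "Apple", "TagType": "Fruit"},
    {"modelId": "Model-0", "tagId": "Tag-Pine", "TagName": "Pine", "TagType": "Tree"},
]


def resolve_tags(source):
    """Stand-in for the Model.tags resolver: one lookup by modelId."""
    return [t for t in TAGS if t["modelId"] == source["id"]]


def get_all_models():
    """Stand-in for GetAllModels: list the models, then attach their tags."""
    return [{**m, "Tags": resolve_tags(m)} for m in MODELS]


result = get_all_models()
print([(t["TagName"], t["TagType"]) for t in result[0]["Tags"]])
```

Because each tag row carries both TagName and TagType, a single query per model returns both fields, and no per-field GetItem is needed.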
2. Store Models and Tags in the same table but change the key structure.
This approach works if you can reliably say that you will have less than 10 GB of data per data type (e.g. Model and Tag). For this approach you have a single table with a Partition Key of Type and a Sort Key of id. When you create objects, you create them with a Type, e.g. "Tag" or "Model", and a unique id (like a UUID). To list objects of the same type, you do a DynamoDB query operation on the partition key with the type to list, e.g. "Tag" or "Model". You can then use GSIs to efficiently look up related objects. In your case you would store a "modelId" in every Tag object. You would then make a GSI using "modelId" as the Partition Key. To list all the tags for a given model, you could then do a DynamoDB query operation against that GSI.
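The two queries in option 2 could look roughly like this when written as low-level DynamoDB Query request parameters. The table name `AppTable` and index name `byModelId` are placeholders I made up; the parameter names are those of the Query API. This only builds the request dictionaries and does not call AWS.

```python
def list_by_type(type_name, limit):
    """Query parameters to list all objects of one type from the base table."""
    return {
        "TableName": "AppTable",  # hypothetical table name
        # A name placeholder is used so the expression is valid
        # whether or not "Type" collides with a reserved word.
        "KeyConditionExpression": "#type = :type",
        "ExpressionAttributeNames": {"#type": "Type"},
        "ExpressionAttributeValues": {":type": {"S": type_name}},
        "Limit": limit,
    }


def tags_for_model(model_id):
    """Query parameters to list a model's tags through the modelId GSI."""
    return {
        "TableName": "AppTable",   # hypothetical table name
        "IndexName": "byModelId",  # hypothetical GSI name
        "KeyConditionExpression": "modelId = :modelId",
        "ExpressionAttributeValues": {":modelId": {"S": model_id}},
    }


print(list_by_type("Model", 10)["KeyConditionExpression"])
print(tags_for_model("Model-0")["IndexName"])
```

Since every item in the "Model" partition is a model, a Limit of 10 here should yield 10 models whenever at least 10 exist, unlike the filtered scan.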
I'm sure there are many more ways to do this but hopefully this helps point in the right direction.

Related

Dynamodb One table design - scan and filter with limit approach?

So I'm following the one table design and the PK keys that are with the below format
P#product1
P#product2
P#product3
W#warehouse1
W#warehouse2
P#product4
....
For the query pattern "get all products", I need to run a scan to get all records where the PK begins with "P#", and I'm not sure if this is the ideal approach.
I understand that Scan is resource-consuming (and I would love not to have to rely on it).
Not to mention that if I want to add a limit and pagination, the scenario becomes even more cumbersome, as the limit is applied before the filter. E.g. the first scan with a limit of 10 may return only 3 products, the next one may return only 2, etc.
Is there a more straightforward approach? With, say, 87 products among 1000 records, I was hoping to still be able to get 9 pages of up to 10 products each.
I've come across other forum topics and found this solution that we can utilise Dynamodb Global Secondary Index
Basically:
We'll set up an attribute, say entitytype (values can be product, warehouse, ...)
And create a Global Secondary Index with
GSI PK : to set to that entitytype
GSI SK : set to the original PK
We'll end up having the below in this GSI
product P#product1
product P#product2
warehouse W#warehouse1
We can then query against this GSI using Query entitytype=product
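To check that this gives the paging behaviour I was hoping for, here is a small in-memory imitation of the GSI in Python. The `build_gsi` and `query_gsi` helpers and the integer paging token are made up for this sketch; a real Query would return a LastEvaluatedKey instead.

```python
from collections import defaultdict


def build_gsi(items):
    """Group base-table PKs by entitytype, sorted like a GSI sort key."""
    index = defaultdict(list)
    for item in items:
        if "entitytype" in item:  # items without the attribute are not indexed
            index[item["entitytype"]].append(item["PK"])
    for keys in index.values():
        keys.sort()
    return index


def query_gsi(index, entitytype, limit, start=0):
    """Return one page of PKs for an entity type plus the next start offset."""
    keys = index.get(entitytype, [])
    page = keys[start:start + limit]
    next_start = start + limit if start + limit < len(keys) else None
    return page, next_start


# 87 products among 1000 records, as in the example above.
items = [{"PK": f"P#product{n:03d}", "entitytype": "product"} for n in range(87)]
items += [{"PK": f"W#warehouse{n:03d}", "entitytype": "warehouse"} for n in range(913)]
index = build_gsi(items)

pages, token = [], 0
while token is not None:
    page, token = query_gsi(index, "product", limit=10, start=token)
    pages.append(page)

print(len(pages), len(pages[-1]))  # 9 pages; the last page holds the remaining 7
```

Every page except the last is full, because the limit now applies to products only rather than to all records.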

How to structure DynamoDB index to allow retrieval by multiple fields

I'm new to DynamoDB and trying to figure out how to structure my data/table/index. My schema includes an itemid (unique) and an orderid (multiple items per order), along with some other arbitrary attributes. I want to be able to retrieve a single item by its itemid, but also retrieve a set of items by their OrderId.
My initial instinct was to set the itemid as the primary key and the orderid as the sort key, but that didn't allow me to query by orderid only. However the same problem occurs if I reverse those.
Example data:
ItemId  | OrderId
abc-123 | 1234
def-345 | 1234
ghi-678 | 5678
jkl-901 | 5678
I think I may need a Global Secondary Index, but I'm not quite understanding where those fit.
If your question is really whether you "are able" to do this, then with ItemId as the partition key, you can still retrieve by OrderId, with the Scan operation, which will let you filter by any attribute.
However, Scan performs a full table scan, so the real question is probably whether you can retrieve by OrderId efficiently. In that case, you would indeed need a Global Secondary Index with OrderId as the partition key and, optionally, ItemId as the sort key.
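A sketch of what that could look like as CreateTable and Query request parameters, written as plain Python dictionaries. The table name `Items`, the index name `OrderIdIndex`, on-demand billing and storing OrderId as a string are all assumptions on my part; nothing here calls AWS.

```python
# CreateTable parameters: ItemId is the table's partition key,
# and a GSI keyed on OrderId (with ItemId as its sort key) serves the second lookup.
TABLE_DEFINITION = {
    "TableName": "Items",  # hypothetical table name
    "AttributeDefinitions": [
        {"AttributeName": "ItemId", "AttributeType": "S"},
        {"AttributeName": "OrderId", "AttributeType": "S"},  # assumed to be a string
    ],
    "KeySchema": [{"AttributeName": "ItemId", "KeyType": "HASH"}],
    "GlobalSecondaryIndexes": [
        {
            "IndexName": "OrderIdIndex",  # hypothetical index name
            "KeySchema": [
                {"AttributeName": "OrderId", "KeyType": "HASH"},
                {"AttributeName": "ItemId", "KeyType": "RANGE"},
            ],
            "Projection": {"ProjectionType": "ALL"},
        }
    ],
    "BillingMode": "PAY_PER_REQUEST",  # assumption: on-demand capacity
}


def items_for_order(order_id):
    """Query parameters to fetch every item of one order through the GSI."""
    return {
        "TableName": TABLE_DEFINITION["TableName"],
        "IndexName": "OrderIdIndex",
        "KeyConditionExpression": "OrderId = :orderId",
        "ExpressionAttributeValues": {":orderId": {"S": order_id}},
    }


print(items_for_order("1234")["KeyConditionExpression"])
```

Fetching a single item stays a GetItem on ItemId against the base table; only the by-order lookup goes through the index.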
This is typically achieved using what's called a "single table design". This means that you store all your data in one table and store it denormalized, i.e. you duplicate your data so that it fits your access patterns.
Generally speaking, if you do not know your access patterns beforehand, DynamoDB might not be a good fit. For many systems, a good solution is to keep the "main" access patterns in DynamoDB and offload less performance-critical ad-hoc queries by replicating data to something like Elasticsearch.
If you have a table with the hash key PK (String) and the sort key SK (String), you can store your data like this. Use transactions to keep the multiple items up to date and consistent etc.
PK           | SK           | shippingStatus | totalPrice | cartQuantity
order_1234   | order_status | PENDING        | 123123     |
order_1234   | item_abc-123 |                |            | 3
order_1234   | item_def-345 |                |            | 1
order_5678   | order_status | SHIPPED        | 54321      |
order_5678   | item_jkl-901 |                |            | 5
item_abc-123 | order_1234   |                |            |
item_abc-123 | order_9876   |                |            |
item_abc-123 | order_5656   |                |            |
This table illustrates the schemaless nature of a DynamoDB table (apart from the PK/SK). With this setup, you can store "metadata" about the order in the order_1234/order_status item. Then, you can query for items with PK order_1234 and SK begins_with "item_" to get all the items for that order. You can do the same to get all the orders for an item: query for PK item_abc-123 and SK beginning with "order_".
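Here is a small in-memory imitation of those two queries in Python, using the rows from the table above. The `query` helper is made up for illustration; against DynamoDB the equivalent key condition would be `PK = :pk AND begins_with(SK, :prefix)`.

```python
# (PK, SK) pairs from the example table; other attributes are omitted.
ROWS = [
    ("order_1234", "order_status"),
    ("order_1234", "item_abc-123"),
    ("order_1234", "item_def-345"),
    ("order_5678", "order_status"),
    ("order_5678", "item_jkl-901"),
    ("item_abc-123", "order_1234"),
    ("item_abc-123", "order_9876"),
    ("item_abc-123", "order_5656"),
]


def query(pk, sk_prefix=""):
    """Return the sort keys under one partition key that start with a prefix."""
    return sorted(sk for row_pk, sk in ROWS if row_pk == pk and sk.startswith(sk_prefix))


print(query("order_1234", "item_"))     # the items in order 1234
print(query("item_abc-123", "order_"))  # the orders containing item abc-123
```

The results come back sorted by sort key, which mirrors how items within a partition are ordered.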
I highly recommend this talk by Rick Houlihan to get into single table design and data modelling in dynamo :)
https://www.youtube.com/watch?v=HaEPXoXVf2k

Querying nested attributes in Amazon DynamoDB

How can I efficiently query on nested attributes in Amazon DynamoDB?
I have a document structure as below, which lets me store related information in the document itself (rather than referencing it).
It makes sense to store the seminars nested in the course, since they will likely be queried alongside the course (they are all course-specific, i.e. a course has many seminars, and a seminar belongs to a course).
In CouchDB, which I’m migrating from, I could write a View that would project some nested attributes for querying. I understand that I can’t use anything that isn’t a top-level attribute as a key in a DynamoDB secondary index, so this approach doesn’t seem to work.
This brings me back to the question: how can I efficiently query on nested attributes without scanning, if I can’t use them as keys in an index?
For example, if I want to get average attendance at Nelson Mandela Theatre, how can I query for the values of registrations and attendees in all seminars that have a location of “Nelson Mandela Theatre” without resorting to a scan?
{
  "course_id": "ABC-1234567",
  "course_name": "Statistics 101",
  "tutors": ["Cognito-sub-1", "Cognito-sub-2"],
  "seminars": [
    {
      "seminar_id": "XXXYYY-12345",
      "epoch_time": "123456789",
      "duration": "5400",
      "location": "Nelson Mandela Theatre",
      "name": "How to lie with statistics",
      "registrations": "92",
      "attendees": "61"
    },
    {
      "seminar_id": "BBBCCC-44444",
      "epoch_time": "155555555",
      "duration": "5400",
      "location": "Nelson Mandela Theatre",
      "name": "Statistical significance for dog owners",
      "registrations": "244",
      "attendees": "240"
    },
    {
      "seminar_id": "XXXAAA-54321",
      "epoch_time": "223456789",
      "duration": "4000",
      "location": "Starbucks",
      "name": "Is feral cat population growth a leading indicator for the S&P 500?",
      "registrations": "40"
    }
  ]
}
{
  "course_id": "CJX-5553389",
  "course_name": "Cat Health 101",
  "tutors": ["Cognito-sub-4", "Cognito-sub-9"],
  "seminars": [
    {
      "seminar_id": "TTRHJK-43278",
      "epoch_time": "123456789",
      "duration": "5400",
      "location": "Catwoman Hall",
      "name": "Emotional support octopi for cats",
      "registrations": "88",
      "attendees": "87"
    },
    {
      "seminar_id": "BBBCCC-44444",
      "epoch_time": "123666789",
      "duration": "5400",
      "location": "Nelson Mandela Theatre",
      "name": "Statistical significance for cat owners",
      "registrations": "44",
      "attendees": "44"
    }
  ]
}
An index cannot be created on nested attributes (i.e. document data types in DynamoDB).
Document Types – A document type can represent a complex structure
with nested attributes—such as you would find in a JSON document. The
document types are list and map.
Query Api:-
A query operation searches only primary key attribute values and supports a subset of comparison operators on key attribute values to refine the search process.
Scan API:-
A scan operation scans the entire table. You can specify filters to apply to the results to refine the values returned to you, after the complete scan.
In order to use the Query API, the hash key value is required. The OP gives no indication that the hash key value is available. As per the OP, the data needs to be queried by the location attribute, which is inside a DynamoDB List data type. So the next option is to look at a GSI.
Kindly read more about GSIs. One of the rules is that a GSI can be created using top-level attributes only, so location can't be used to create the index.
So creating a GSI in order to use the Query API has been ruled out as well.
The index key attributes can consist of any top-level String, Number,
or Binary attributes from the base table; other scalar types, document
types, and set types are not allowed.
For the reasons mentioned above, the Query API can't be used to get the data based on the location attribute, assuming the hash key value is not available.
If the hash key value is available, a FilterExpression can be used to filter the data. The only way to filter data inside a complex list data type is the CONTAINS function. To use CONTAINS, all the attributes of the list element are required to match (i.e. seminar_id, location, duration and all other attributes). So it is definitely not possible to fulfil the use case mentioned in the OP using the current data model.
Proposed alternate solution:-
Re-modeling the data structure as mentioned below could be an option to resolve the problem. There is definitely no other solution available to fulfil the use case using Query API.
Main Table :-
Course Id - Hash Key
seminar_id - Sort Key
GSI :-
Seminar location - Hash Key
Course Id - Sort Key
In a DynamoDB table, each key value must be unique. However, the key
values in a global secondary index do not need to be unique.
Now you can use the Query API on the GSI to get the data where the seminar location is equal to Nelson Mandela Theatre. You can also use the course id in the query if you know the value. The query will potentially return multiple items in the result set. You can use a FilterExpression if you would like to further filter the data based on some non-key attributes.
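To illustrate the remodelled layout, here is a plain-Python sketch that flattens the OP's course documents into one item per seminar, groups them by location the way the GSI would, and answers the attendance question. The helper names are made up, only the fields needed for the calculation are kept, and skipping seminars without an attendees value is my own assumption about how to treat the Starbucks record.

```python
from collections import defaultdict

# Trimmed copies of the OP's documents (values kept as strings, as in the OP).
COURSES = [
    {"course_id": "ABC-1234567", "seminars": [
        {"seminar_id": "XXXYYY-12345", "location": "Nelson Mandela Theatre",
         "registrations": "92", "attendees": "61"},
        {"seminar_id": "BBBCCC-44444", "location": "Nelson Mandela Theatre",
         "registrations": "244", "attendees": "240"},
        {"seminar_id": "XXXAAA-54321", "location": "Starbucks", "registrations": "40"},
    ]},
    {"course_id": "CJX-5553389", "seminars": [
        {"seminar_id": "TTRHJK-43278", "location": "Catwoman Hall",
         "registrations": "88", "attendees": "87"},
        {"seminar_id": "BBBCCC-44444", "location": "Nelson Mandela Theatre",
         "registrations": "44", "attendees": "44"},
    ]},
]


def to_items(courses):
    """One item per seminar: course_id as hash key, seminar_id as sort key."""
    for course in courses:
        for seminar in course["seminars"]:
            yield {"course_id": course["course_id"], **seminar}


# Stand-in for the GSI: items grouped by the top-level location attribute.
location_index = defaultdict(list)
for item in to_items(COURSES):
    location_index[item["location"]].append(item)


def attendance_rate(location):
    """Attendees divided by registrations over seminars that report both."""
    seminars = [s for s in location_index.get(location, []) if "attendees" in s]
    registered = sum(int(s["registrations"]) for s in seminars)
    attended = sum(int(s["attendees"]) for s in seminars)
    return attended / registered if registered else None


print(attendance_rate("Nelson Mandela Theatre"))
```

Note that the same seminar_id appears under two courses; that is fine for the index, since GSI key values do not have to be unique.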
This is an example from here where you use a filter expression, it is with a scan operation, but maybe you can apply something similar for query instead of scan (take a look at the API):
{
  "TableName": "MyTable",
  "FilterExpression": "#k_Compatible.#k_RAM = :v_Compatible_RAM",
  "ExpressionAttributeNames": {
    "#k_Compatible": "Compatible",
    "#k_RAM": "RAM"
  },
  "ExpressionAttributeValues": {
    ":v_Compatible_RAM": "RAM1"
  }
}
You can do one thing to make it work with Scan: store the object in stringified form, like
{
  "language": "[{\"language\":\"Male\",\"proficiency\":\"Female\"}]"
}
and then perform the scan operation with
language: {
  contains: "Male"
}
On the client side you can then call JSON.parse(language).
I don't have much experience with DynamoDB yet, but I started studying it since I'm planning to use it for my next project.
As far as I could understand from the AWS documentation, the answer to your question is: it's not possible to efficiently query on nested attributes.
Looking at Best Practices, especially Best Practices for Using Secondary Indexes in DynamoDB, it's possible to understand that the right approach is to use different item types under the same Partition Key, as shown here. Under the same course_id you would then have a generic sort key (sk). The first record would have sk = 'Details' with the course's data, followed by other records like "seminar-1" with its data, and so on.
You would then set the seminar properties you would like to query as a GSI (Global Secondary Index), bearing in mind that the number of GSIs per table is limited (5 when this was written; the default quota has since been raised to 20).
Hope it helps.
You can use document paths to filter the values. Use seminars.location as the document path.

DynamoDB query using DynamoDBMapper

Say if I had a DynamoDB table:
UserId (hash): S
BookName (range): S
BorrowedTime: S
ReturnedTime: S
UserId is the primary key (hash), and I needed to set BookName as sort key (range) because another item being added to the database was overwriting the previous with the same UserId.
How would I go about creating a query using DynamoDBMapper, but the fields being queried are the time fields (which are non-keys)? For instance, say if I wanted to return the UserId and BookName of any book borrowed over 2 weeks ago that hasn't been returned yet?
Do I need to setup a GSI on both BorrowedTime and ReturnedTime fields?
Yes, you can create a GSI using BorrowedTime and ReturnedTime, or you can use a Scan instead of a Query. With a Scan you don't need a GSI, but a Scan reads the whole table, so it is not recommended for large tables or frequent use.
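For the Scan route, the filter could look roughly like the sketch below, written as low-level Scan request parameters in Python rather than with DynamoDBMapper; the expression strings should carry over. It assumes the times are stored as ISO-8601 UTC strings (so string comparison matches time order), that ReturnedTime is simply absent until the book is returned, and that the table is called `BorrowedBooks`; none of that is stated in the question.

```python
from datetime import datetime, timedelta, timezone


def overdue_scan_params(now):
    """Scan parameters for books borrowed over two weeks ago and not yet returned."""
    cutoff = (now - timedelta(weeks=2)).isoformat()
    return {
        "TableName": "BorrowedBooks",  # hypothetical table name
        # Assumes ISO-8601 strings and a missing ReturnedTime for unreturned books.
        "FilterExpression": "BorrowedTime < :cutoff AND attribute_not_exists(ReturnedTime)",
        "ExpressionAttributeValues": {":cutoff": {"S": cutoff}},
        "ProjectionExpression": "UserId, BookName",
    }


def is_overdue(item, cutoff_iso):
    """The same predicate applied locally, for checking the logic."""
    return item["BorrowedTime"] < cutoff_iso and "ReturnedTime" not in item


params = overdue_scan_params(datetime(2024, 1, 15, tzinfo=timezone.utc))
cutoff = params["ExpressionAttributeValues"][":cutoff"]["S"]
print(cutoff)
print(is_overdue({"BorrowedTime": "2023-12-20T00:00:00+00:00"}, cutoff))
```

The filter only trims what is returned; the Scan still reads every item, which is the cost mentioned above.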

3 fields composite primary key (unique item) in Dynamodb

I am trying to create a table to store invoice line items in DynamoDB. Let's say the item is defined by CompanyCode, InvoiceNumber and LineItemId, amount and other line item details.
A unique item is defined by the combination of the first 3 attributes. Any 2 of those attributes can be same for the different items. What should I select as the Hash Attribute and the Range Attribute?
Some Intro
For efficiency I would propose totally different design. With NoSQL databases (and DynamoDB is not different) we always need to consider the access patterns first. Also, if possible we should strive to fit all our data within same table and several indexes. From what we have from OP and his comments, these are the two access patterns:
For a company X, get complete invoice Y (including all items or a range of items) [based on this comment]
Get all invoices for company X [based on this comment]
We now wonder what a good Primary Key is. That translates to the questions: what is a good Partition Key (PK), what is a good Sort Key (SK), and which secondary indexes do we need to create, and of what kind (local or global)? Some reminders:
Primary Key can be on one column or composite
Composite primary key consists of Partition Key and Sort Key
Partition key is used as input to the hashing function that will determine partition of the items
Sort key can also be composite, which allows us to model one-to-many relationships in DynamoDB as given in one of the comments links: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-sort-keys.html
When creating query on the table or index, you always need to use '=' operator on the Partition Key
When querying ranges on Sort Key you have option for KeyConditionExpression which provides you with set of operators for sorting and everything in between (one of them being function begins_with (a, substr) )
You are also allowed to use FilterExpression if you need to further refine the Query results (filter on the projected attributes)
Local Secondary Indexes (LSI) have same Partition Key but different Sort Key than your original table and give you different view of your data, organized according to an alternative Sort Key
Global Secondary Indexes (GSI) have different Partition Key and different Sort Key than your original table and give you completely different view on data
All items with the same partition key are stored together, and for composite Primary keys, are ordered by the sort key value. DynamoDB splits partitions by sort key if the collection size grows bigger than 10 GB.
Back To Modeling
It is obvious that we are dealing with multiple entities that need to be modeled and fit into the same table. To satisfy condition of Partition Key being unique on the table, CompanyCode comes as a natural Partition Key - so I would ensure that is unique. If not then you need to ask yourself how can you model the second access pattern?
Assuming we have established uniqueness on the CompanyCode let's simplify and say that it comes in the form of an e-mail (or could be domain or just a code, but I will use email for demonstration).
Relationship between Company and Invoices is always 1:many.
Relationship between Invoice and Items is always 1:many.
I propose design as in the image below:
With PK being CompanyCode and SK being InvoiceNumber, I can store all attributes about that invoice for that company.
Nothing prevents me from also adding a record where the SK is Customer, which allows me to store all attributes about the company.
With GSI1, we create a reverse lookup where GSI1PK is my table's SK (InvoiceNumber) and GSI1SK is my table's PK (CompanyCode).
I am using the same table to store line items, with PK being LineItemId and SK being CompanyCode (still unique).
For Item entity items, GSI1PK is still InvoiceNumber and GSI1SK is LineItemId, which is the table's PK, so the pattern is the same as for Invoice entity items.
Now the access patterns supported with this:
If I want to get invoice Y for company X and all the items (access pattern 1): Query the table where CompanyCode=X and use KeyConditionExpression with = operator on the Sort Key InvoiceNumber. If I want to get all the items tied to that invoice, I will project Items attribute using ProjectionExpression.
By retrieving all the items with previous query for company X and invoice Y, I can now run BatchGetItem API call (using my unique composite key LineItemId+CompanyCode) on table to get all items belonging to that particular invoice of that particular customer. (this comes with some constraints of BatchGetItem API)
To support access pattern 2, I will do a query with CompanyCode=X on PK and use KeyConditionExpression on the SK with begins_with (a, substr) function/operator to get only invoices for company X and not the metadata about that company. That will give me all invoices for given company/customer.
Additionally, with the above GSI1, for any given InvoiceNumber I can easily select all the line items that belong to that particular invoice. REMEMBER: the key values in a global secondary index do not need to be unique, so in my GSI1 I could easily have had invoice_1 -> (item_1, item_2) and then another invoice_1 -> (item_1, item_2); the difference between the two entries in the GSI would be in the SK, as they would be associated with a different CompanyCode (for demonstration purposes I used invoice_1 and invoice_2).
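As a rough sketch, access pattern 2 and the GSI1 lookup could be expressed as Query request parameters like this. The table name `Invoices`, the generic attribute names PK/SK/GSI1PK and the `INV#` prefix on invoice sort keys are assumptions I am adding for illustration; the design above does not spell out how invoice rows are told apart from the Customer row.

```python
def invoices_for_company(company_code, invoice_prefix="INV#"):
    """Access pattern 2: all invoices of one company, skipping the Customer row."""
    return {
        "TableName": "Invoices",  # hypothetical table name
        "KeyConditionExpression": "PK = :pk AND begins_with(SK, :prefix)",
        "ExpressionAttributeValues": {
            ":pk": {"S": company_code},
            ":prefix": {"S": invoice_prefix},  # assumed naming scheme for invoice SKs
        },
    }


def line_items_for_invoice(invoice_number):
    """Reverse lookup through GSI1: everything keyed under one InvoiceNumber."""
    return {
        "TableName": "Invoices",  # hypothetical table name
        "IndexName": "GSI1",
        "KeyConditionExpression": "GSI1PK = :invoice",
        "ExpressionAttributeValues": {":invoice": {"S": invoice_number}},
    }


print(invoices_for_company("billing@example.com")["KeyConditionExpression"])
print(line_items_for_invoice("INV#0001")["IndexName"])
```

Both requests use '=' on the partition key, as required, and only the first one narrows the sort key.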
I believe the first option offered by @georgeaf99 won't work, because if you do it that way, then CompanyCode has to be unique in the table. Therefore, there would only be one item allowed per company. I think the second solution is the only real way to do it.
You can use CompanyCode as the Hash Key, and then all other fields that combine to make the item unique (in this case InvoiceNumber and LineItemId) need to be somehow combined into one value (such as concatenation with a field delimiter), which would be your Range Key. Unfortunately that is kind of ugly, but that's the nature of a NoSQL database like DynamoDB. However, it will allow you to successfully store records with the correct uniqueness. When reading the records back, if you don't want to parse the combined field back out to its individual parts, then you'll have to add additional separate fields for InvoiceNumber and LineItemID.
If you don't have a large number of invoices per company, you can query by only the Hash Key and do the filtering on the client side. If you have a large number of invoices per company and need to be able to query only the items for a single invoice, then I would create a secondary index on CompanyCode and InvoiceNumber.
As I'm sure you have figured out you cannot have more than two attributes form your primary key (hash+range). Thus, depending on the type of queries you will be performing and the size of your data you can structure your table in different ways.
(Optimized for the query type you mentioned above: only CompanyCode & all 3)
Best solution for small/medium size data sets:
Hash Key: CompanyCode
Perform the query using only CompanyCode and
then filter your results on the other two attributes
Optimal solution for large data sets:
Hash Key: CompanyCode
Range Key: InvoiceNumber+LineItemId
This allows you to query only on an index, but the table structure is pretty ugly
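A minimal sketch of the concatenated range key in Python, assuming "#" as the field delimiter and that neither InvoiceNumber nor LineItemId ever contains it. The helper names are made up for illustration.

```python
DELIMITER = "#"  # assumed delimiter; it must never occur inside either field


def make_range_key(invoice_number, line_item_id):
    """Combine InvoiceNumber and LineItemId into a single range key value."""
    for part in (invoice_number, line_item_id):
        if DELIMITER in part:
            raise ValueError(f"{part!r} contains the delimiter {DELIMITER!r}")
    return f"{invoice_number}{DELIMITER}{line_item_id}"


def parse_range_key(range_key):
    """Split a combined range key back into (InvoiceNumber, LineItemId)."""
    invoice_number, line_item_id = range_key.split(DELIMITER, 1)
    return invoice_number, line_item_id


def invoice_prefix(invoice_number):
    """Prefix to use with begins_with to fetch every line item of one invoice."""
    return f"{invoice_number}{DELIMITER}"


key = make_range_key("INV-7", "3")
print(key)
print(parse_range_key(key))
print(invoice_prefix("INV-7"))
```

Including the delimiter in the prefix keeps a query for INV-7 from also matching INV-70. Range keys sort as strings, so line item "10" would come before "2" unless the ids are zero-padded.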