Querying nested attributes in Amazon DynamoDB

How can I efficiently query on nested attributes in Amazon DynamoDB?
I have a document structure as below, which lets me store related information in the document itself (rather than referencing it).
It makes sense to store the seminars nested in the course, since they will likely be queried alongside the course (they are all course-specific, i.e. a course has many seminars, and a seminar belongs to a course).
In CouchDB, which I'm migrating from, I could write a View that would project some nested attributes for querying. I understand that I can't project anything that isn't a top-level attribute into a DynamoDB secondary index, so this approach doesn't seem to work.
This brings me back to the question: how can I efficiently query on nested attributes without scanning, if I can’t use them as keys in an index?
For example, if I want to get average attendance at Nelson Mandela Theatre, how can I query for the values of registrations and attendees in all seminars that have a location of “Nelson Mandela Theatre” without resorting to a scan?
{
  "course_id": "ABC-1234567",
  "course_name": "Statistics 101",
  "tutors": ["Cognito-sub-1", "Cognito-sub-2"],
  "seminars": [
    {
      "seminar_id": "XXXYYY-12345",
      "epoch_time": "123456789",
      "duration": "5400",
      "location": "Nelson Mandela Theatre",
      "name": "How to lie with statistics",
      "registrations": "92",
      "attendees": "61"
    },
    {
      "seminar_id": "BBBCCC-44444",
      "epoch_time": "155555555",
      "duration": "5400",
      "location": "Nelson Mandela Theatre",
      "name": "Statistical significance for dog owners",
      "registrations": "244",
      "attendees": "240"
    },
    {
      "seminar_id": "XXXAAA-54321",
      "epoch_time": "223456789",
      "duration": "4000",
      "location": "Starbucks",
      "name": "Is feral cat population growth a leading indicator for the S&P 500?",
      "registrations": "40"
    }
  ]
}
{
  "course_id": "CJX-5553389",
  "course_name": "Cat Health 101",
  "tutors": ["Cognito-sub-4", "Cognito-sub-9"],
  "seminars": [
    {
      "seminar_id": "TTRHJK-43278",
      "epoch_time": "123456789",
      "duration": "5400",
      "location": "Catwoman Hall",
      "name": "Emotional support octopi for cats",
      "registrations": "88",
      "attendees": "87"
    },
    {
      "seminar_id": "BBBCCC-44444",
      "epoch_time": "123666789",
      "duration": "5400",
      "location": "Nelson Mandela Theatre",
      "name": "Statistical significance for cat owners",
      "registrations": "44",
      "attendees": "44"
    }
  ]
}

An index cannot be created on nested attributes (i.e. the document data types in DynamoDB).
Document Types – A document type can represent a complex structure
with nested attributes—such as you would find in a JSON document. The
document types are list and map.
Query API:
A Query operation searches only primary key attribute values, and supports a subset of comparison operators on key attribute values to refine the search.
Scan API:
A Scan operation reads the entire table. You can specify filters to refine the values returned to you, but they are applied only after the complete scan.
To use the Query API, a hash key value is required, and nothing in the OP indicates that one is available. Per the OP, the data needs to be queried by the location attribute, which sits inside a DynamoDB List. The next option to consider is a GSI.
However, one of the GSI rules is that an index can only be created on top-level attributes, so location can't be used as an index key.
Creating a GSI in order to use the Query API is therefore ruled out as well.
The index key attributes can consist of any top-level String, Number,
or Binary attributes from the base table; other scalar types, document
types, and set types are not allowed.
For the reasons above, the Query API can't be used to get the data by the location attribute, assuming no hash key value is available.
If a hash key value is available, a FilterExpression can be used to filter the data. The only way to filter data inside the complex list type is the CONTAINS function, and CONTAINS requires every attribute of the occurrence to match (i.e. seminar_id, location, duration, and all the other attributes). So the use case in the OP definitely cannot be fulfilled with the current data model.
Proposed alternative solution:
Re-modeling the data structure as below could be an option to resolve the problem. There is no other way to fulfil the use case with the Query API.
Main table:
Course Id - Hash Key
seminar_id - Sort Key
GSI:
Seminar location - Hash Key
Course Id - Sort Key
In a DynamoDB table, each key value must be unique. However, the key
values in a global secondary index do not need to be unique.
Now you can use the Query API on the GSI to get all seminars whose location is equal to Nelson Mandela Theatre. You can also include the course id in the query if you know its value. The query will potentially return multiple items; use a FilterExpression if you would like to further filter the data on non-key attributes.
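As a rough sketch, this is what that query could look like with the JavaScript DocumentClient; the table name, index name, and exact key schema are assumptions for illustration, not something given in the OP:

var AWS = require("aws-sdk");
var docClient = new AWS.DynamoDB.DocumentClient();

var params = {
  TableName: "Seminars",              // assumed table per the re-modeled design
  IndexName: "location-course-index", // assumed GSI: hash = location, range = course_id
  // "location" is on DynamoDB's reserved-word list, hence the placeholder name
  KeyConditionExpression: "#loc = :loc",
  ExpressionAttributeNames: { "#loc": "location" },
  ExpressionAttributeValues: { ":loc": "Nelson Mandela Theatre" }
};

docClient.query(params, function (err, data) {
  if (err) return console.error(err);
  // Each returned item carries its own registrations/attendees, so the
  // average attendance can be computed client-side from data.Items.
  console.log(data.Items);
});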

Here is an example of using a filter expression on a nested map attribute. It is shown with a Scan operation, but maybe you can apply something similar for Query instead of Scan (take a look at the API):
{
  "TableName": "MyTable",
  "FilterExpression": "#k_Compatible.#k_RAM = :v_Compatible_RAM",
  "ExpressionAttributeNames": {
    "#k_Compatible": "Compatible",
    "#k_RAM": "RAM"
  },
  "ExpressionAttributeValues": {
    ":v_Compatible_RAM": "RAM1"
  }
}
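For instance, here is a hedged sketch of the Query variant; the partition key name ("Id") and its value are assumptions:

// Same nested-map filter, but applied after a key condition, so only
// one partition is read instead of the whole table.
var params = {
  TableName: "MyTable",
  KeyConditionExpression: "Id = :id", // assumed partition key
  FilterExpression: "#k_Compatible.#k_RAM = :v_Compatible_RAM",
  ExpressionAttributeNames: {
    "#k_Compatible": "Compatible",
    "#k_RAM": "RAM"
  },
  ExpressionAttributeValues: {
    ":id": "some-item-id", // placeholder value
    ":v_Compatible_RAM": "RAM1"
  }
};
// Pass this to docClient.query(params, callback). Note the dotted path
// works for a Map attribute; reaching inside a List still needs a
// numeric index such as list[0].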

One thing you can do to make it work with Scan is to store the object in stringified form, like:
{
  "language": "[{\"language\":\"Male\",\"proficiency\":\"Female\"}]"
}
and then perform the Scan operation with a contains filter:
language: {
  contains: "Male"
}
On the client side you can then run JSON.parse(language) to get the object back.
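A minimal sketch of this approach with the DocumentClient; the table name, key, and attribute values are made up for illustration:

var AWS = require("aws-sdk");
var docClient = new AWS.DynamoDB.DocumentClient();

// Write: serialize the nested list into a plain string attribute.
var item = {
  id: "user-1",
  language: JSON.stringify([{ language: "Male", proficiency: "Female" }])
};
docClient.put({ TableName: "Users", Item: item }, function (err) {
  if (err) console.error(err);
});

// Read: substring match against the serialized string, then parse.
var params = {
  TableName: "Users",
  FilterExpression: "contains(#lang, :needle)",
  ExpressionAttributeNames: { "#lang": "language" },
  ExpressionAttributeValues: { ":needle": "Male" }
};
docClient.scan(params, function (err, data) {
  if (err) return console.error(err);
  data.Items.forEach(function (it) {
    it.language = JSON.parse(it.language); // restore the nested objects
  });
});

Bear in mind this still scans the whole table, and a plain substring match can produce false positives; it trades query efficiency for schema flexibility.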

I don't have much experience with DynamoDB yet, but I have started studying it since I'm planning to use it for my next project.
As far as I can understand from the AWS documentation, the answer to your question is: it's not possible to efficiently query on nested attributes.
Looking at Best Practices, especially Best Practices for Using Secondary Indexes in DynamoDB, it's possible to understand that the right approach is to use different item types under the same partition key. Then under the same course_id you would have a generic sort key (sk). The first record would have sk = 'Details' with the course's data, then other records like "seminar-1" with that seminar's data, and so on.
You would then expose the seminar properties you want to query through a GSI (Global Secondary Index), bearing in mind the per-table limit on the number of GSIs (5 at the time of writing).
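A rough sketch of that item layout, reusing the OP's data; the sk values and flattened attribute names are assumptions:

// One partition per course; the sort key distinguishes the item types.
var items = [
  { course_id: "ABC-1234567", sk: "Details", course_name: "Statistics 101" },
  { course_id: "ABC-1234567", sk: "seminar-XXXYYY-12345",
    location: "Nelson Mandela Theatre", registrations: 92, attendees: 61 },
  { course_id: "ABC-1234567", sk: "seminar-BBBCCC-44444",
    location: "Nelson Mandela Theatre", registrations: 244, attendees: 240 }
];
// A GSI with "location" as its hash key would then answer "average
// attendance at Nelson Mandela Theatre" with a single Query.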
Hope it helps.

You can use document paths to filter the values. Since seminars is a list, the path has to index into it, e.g. seminars[0].location; a bare seminars.location path will not match.

Related

How to structure DynamoDB index to allow retrieval by multiple fields

I'm new to DynamoDB and trying to figure out how to structure my data/table/index. My schema includes an itemid (unique) and an orderid (multiple items per order), along with some other arbitrary attributes. I want to be able to retrieve a single item by its itemid, but also retrieve a set of items by their OrderId.
My initial instinct was to set the itemid as the partition key and the orderid as the sort key, but that didn't allow me to query by orderid alone. The same problem occurs in reverse if I swap them.
Example data:
ItemId  | OrderId
abc-123 | 1234
def-345 | 1234
ghi-678 | 5678
jkl-901 | 5678
I think I may need a Global Secondary Index, but I'm not quite understanding where those fit.
If your question is really whether you "are able" to do this, then with ItemId as the partition key, you can still retrieve by OrderId, with the Scan operation, which will let you filter by any attribute.
However Scan will perform full table scans, so the real question is probably whether you can retrieve by OrderId efficiently. In that case, you would indeed need a Global Secondary Index with OrderId and ItemId as the composite attribute key.
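A hedged sketch of that lookup with the JavaScript DocumentClient; the table and index names are assumptions:

var AWS = require("aws-sdk");
var docClient = new AWS.DynamoDB.DocumentClient();

var params = {
  TableName: "Items",         // assumed table name
  IndexName: "OrderId-index", // assumed GSI: hash = OrderId, range = ItemId
  KeyConditionExpression: "OrderId = :oid",
  ExpressionAttributeValues: { ":oid": "1234" }
};

docClient.query(params, function (err, data) {
  if (err) return console.error(err);
  console.log(data.Items); // abc-123 and def-345 from the example data
});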
This is typically achieved using what's called a "single table design". What this means is that you store all your data in one table and store it denormalized, i.e. you duplicate your data so that it fits your access patterns.
Generally speaking, if you do not know your access patterns beforehand, DynamoDB might not be a good fit. For many systems, a good solution is to serve the "main" access patterns from DynamoDB and offload the less performance-critical ad-hoc queries by replicating data to something like Elasticsearch.
If you have a table with the hash key PK (String) and the sort key SK (String), you can store your data like this. Use transactions to keep the multiple items up to date and consistent etc.
PK           | SK           | shippingStatus | totalPrice | cartQuantity
order_1234   | order_status | PENDING        | 123123     |
order_1234   | item_abc-123 |                |            | 3
order_1234   | item_def-345 |                |            | 1
order_5678   | order_status | SHIPPED        | 54321      |
order_5678   | item_jkl-901 |                |            | 5
item_abc-123 | order_1234   |                |            |
item_abc-123 | order_9876   |                |            |
item_abc-123 | order_5656   |                |            |
This table illustrates the schemaless nature of a DynamoDB table (apart from the PK/SK). With this setup, you can store "metadata" about the order in the order_1234/order_status item. Then, you can query for items with PK order_1234 and SK begins_with "item_" to get all the items for that order. You can do the same to get all the orders for an item: query for PK item_abc-123 and SK begins_with "order_". A sketch of both queries follows below.
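A minimal sketch of that query with the DocumentClient; the table name is an assumption:

// Forward lookup: all items in order_1234.
var params = {
  TableName: "MyTable", // assumed
  KeyConditionExpression: "PK = :pk AND begins_with(SK, :prefix)",
  ExpressionAttributeValues: {
    ":pk": "order_1234",
    ":prefix": "item_"
  }
};
// docClient.query(params, ...) returns item_abc-123 and item_def-345.
// The reverse lookup is the same query with ":pk" set to "item_abc-123"
// and ":prefix" set to "order_".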
I highly recommend this talk by Rick Houlihan to get into single table design and data modelling in dynamo :)
https://www.youtube.com/watch?v=HaEPXoXVf2k

Is there a way to address nested properties in AWS DynamoDB for the purpose of a documentClient.query() call?

I am currently testing how to design a query with the AWS.DynamoDB.DocumentClient query() call, which takes params: DocumentClient.QueryInput and is used for retrieving a collection of data from a table in DynamoDB.
Querying seems simple and works fine with indexes of type String or Number. What I am not able to build is a query that uses a valid index and filters on an attribute that is nested (see my data structure, please).
I am using FilterExpression, where the filtering logic can be defined - and that seems to work fine in all cases except when trying to filter on a nested attribute.
These are the parameters I am currently feeding the query with:
const parameters = {
  TableName: 'myTable',
  ProjectionExpression: 'HashKey, RangeKey, Artist, #SpecialStatus, Message, Track, Statistics',
  ExpressionAttributeNames: { '#SpecialStatus': 'Status' },
  IndexName: 'Artist-index',
  KeyConditionExpression: 'Artist = :ArtistName',
  ExpressionAttributeValues: {
    ':ArtistName': 'Blind Guardian',
    ':Track': 'Mirror Mirror'
  },
  FilterExpression: 'Track = :Track'
};
Data structure in DynamoDB's table:
{
  'Artist': 'Blind Guardian',
  ..
  'Track': 'Mirror Mirror',
  'Statistics': [
    {
      'Sales': 42,
      'WrittenBy': 'Kursch'
    }
  ]
}
Let's assume we want to filter entries from the DB by using Artist in the KeyConditionExpression. We can achieve this by feeding Artist with :ArtistName. Now the question: how do I retrieve records filtered on WrittenBy, which is nested in Statistics?
To the best of my knowledge, we can only use the String, Number, or Binary types for secondary indexes. I've been experimenting with secondary indexes and sort keys as well, but without luck.
I've tried documentClient.scan(); same story. Still no luck accessing nested attributes in a List (the FilterExpression just won't accept it).
I am aware of the possibility of filtering the results on the "application" side once the records are retrieved (by Artist, for instance), but I am interested in filtering within the FilterExpression.
If I understand your problem correctly, you'd like to create a query that filters on the value of a complex attribute (in this case, a list of objects).
You can filter on the contents of a list by indexing into the list:
var params = {
  TableName: "myTable",
  FilterExpression: "Statistics[0].WrittenBy = :writtenBy",
  ExpressionAttributeValues: {
    ":writtenBy": "Kursch"
  }
};
Of course, if you don't know the specific index, this won't really help you.
Alternatively, you could use the CONTAINS function to test if the object exists in your list. The CONTAINS function will require all the attributes in the object to match the condition. In this case, you'd need to provide Sales and WrittenBy, which probably doesn't solve your problem here.
The shape of your data is making your access pattern difficult to implement, but that is often the case with DDB. You are asking DDB to support a query of a list of objects, where the object has a specific attribute with a specific value. As you've seen, this is quite tricky to do. As you know, getting the data model to correctly support your access patterns is critical to your success with DDB. It can also be difficult to get right!
A couple of ideas that would make your access pattern easier to implement:
Move WrittenBy out of the complex attribute and put it alongside the other top-level attributes. This would allow you to use a simple FilterExpression on the WrittenBy attribute (see the sketch after this list).
If the WrittenBy attribute must stay within the Statistics list, make it stand alone (e.g. [{writtenBy: Kursch}, {Sales: 42},...]). This way, you'd be able to use the CONTAINS keyword in your search.
Create a secondary index with the WrittenBy field in either the PK or SK (whichever makes sense for your data model and access patterns).
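A hedged sketch of the first idea, reusing the names from the question; whether the Artist-index projects WrittenBy is an assumption:

// With WrittenBy promoted to a top-level attribute, a plain
// FilterExpression works alongside the existing key condition.
var params = {
  TableName: "myTable",
  IndexName: "Artist-index",
  KeyConditionExpression: "Artist = :artist",
  FilterExpression: "WrittenBy = :writtenBy",
  ExpressionAttributeValues: {
    ":artist": "Blind Guardian",
    ":writtenBy": "Kursch"
  }
};
// docClient.query(params, callback) then filters server-side without
// needing to index into the Statistics list.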

Retrieving arrays of nested information in AppSync schema

I have worked out a fairly complex chain of DynamoDB resolvers on a GraphQL AppSync query. What I am curious to know is if I could have possibly designed this in a way to require fewer DynamoDB queries.
Here is my GraphQL Schema:
type Tag {
  PartitionKey: ID!
  SortKey: ID!
  TagName: String!
  TagType: String
}
type Model {
  PartitionKey: ID!
  Name: String
  Version: Int
  FBX: String
  # ms since epoch
  CreatedAt: AWSTimestamp
  Description: String
  Tags: [String]
}
type Query {
  GetAllModels(count: Int, nextToken: String): PaginatedModels!
}
This is the query that I am doing:
query GetAllModels {
  GetAllModels {
    Models {
      PartitionKey
      Name
      Version
      CreatedAt
      Description
      Tags {
        TagName
        TagType
      }
    }
  }
}
My DynamoDB table is set up as so:
PartitionKey | SortKey   | TagName | TagType | ModelName | Description
Model-0      | Model-0   |         |         | ModelZero | Blah Blah
Model-0      | Tag-Pine  |         |         |           |
Model-0      | Tag-Apple |         |         |           |
Tag-Pine     | Tag-Pine  | Pine    | Tree    |           |
Tag-Apple    | Tag-Apple | Apple   | Fruit   |           |
So in my resolvers I am going:
GetAllModels will scan with two filters: one filter for PartitionKey beginning with 'Model-' and another for SortKey beginning with 'Model-'. This gets all Models.
Next there is a resolver attached to 'Tags' on the Model object. This will query with two expressions: one for PartitionKey = source.PartitionKey and a second for SortKey begins_with 'Tag-'. This gets me all of the tags on a model.
Next there are two resolvers on the Tag object, one on TagName and another on TagType. These do a direct GetItem to fetch their value, with PartitionKey = source.SortKey and SortKey = source.SortKey set as the keys.
So each scanned Model ends up firing off 3 more queries to DynamoDB. This just seems a bit excessive to me. But I cannot see any other way to do this. Is there some way to be able to get both TagName and TagType in one query?
Is there a better way to approach this?
I see a few things that I would personally change. The first is that I would avoid the nested DynamoDB scan operations. At least one of these can be replaced with a much faster query operation. The second is that I would consider rethinking how you are storing the data. Currently, there is no good way to list model objects.
Why is there no good way to list model objects?
Assuming each model object will have multiple tags then you are going to have a table that is sparsely populated by model objects. i.e. out of 100 rows you may have 20 - 50 models depending on how many tags the average model has. In DynamoDB, a table is split up based on the partition key causing rows that share the same partition key to be stored near each other to speed up query operations. With your setup where the Partition Key is essentially the unique id of a single model object this means that we can easily get a single model object. You can also quickly get the tags for a single object since those records are nearby as well.
The issue.
The DynamoDB scan operation looks at each partition one at a time, reads as many records as the request's limit allows (or all of them if the limit is sufficiently large), and only then, after reading the records from the individual partitions, applies the filter expression before returning the final result. This means you may ask for the first 10 models but, since the limit is applied before the scan filter, you may very well get back only 1 model (if that one model had 9 or more tags, which would exhaust the limit while DynamoDB was reading the first partition). This may seem strange when coming from other database systems, but it is an important consideration of DynamoDB's design.
Here are two solutions to address this concern:
1. Store Models in one table and Tags in another.
NoSQL databases like DynamoDB allow you to store many types of data in the same table, but there is nothing wrong with splitting them out. Traditionally it can be a pain to work with multiple tables in a NoSQL database that lacks a join operation or something similar, but fortunately for us we can use GraphQL to "join" data. With this approach, the Model table has a single partition key named "id" and your GetAllModels resolver is still a scan, but this time on the Model table. This way the table is not sparse and you will get 10 models when you ask for 10 models. The Tag table should have a partition key of modelId and a sort key of tagId. You would then have a resolver on the Model.tags field that does a query against the Tag table and looks for rows where modelId == $ctx.source.id.
This is essentially how @model and @connection work in the new GraphQL transform tooling launched as part of the Amplify CLI. You can see more here, although the docs are, as of writing, still being improved: https://aws-amplify.github.io/amplify-js/media/api_guide
2. Store Models and Tags in the same table but change the key structure.
This approach works if you can reliably say that you will have less than 10GB of data per data type (e.g. Model & Tag). For this approach you have a single table with a partition key of Type and a sort key of id. When you create objects you create them with a Type, e.g. "Tag" or "Model", and a unique id (like a UUID). To list objects of the same type you do a DynamoDB query operation on the partition key of the type to list, e.g. "Tag" or "Model". You can then use GSIs to efficiently look up related objects. In your case you would store a "modelId" in every Tag object. You would then make a GSI using the "modelId" as the partition key. To list all the tags for a given model you could then do a DynamoDB query operation against that GSI.
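A minimal sketch of that GSI lookup; the table and index names are assumptions:

var AWS = require("aws-sdk");
var docClient = new AWS.DynamoDB.DocumentClient();

var params = {
  TableName: "AppTable",      // assumed single table keyed by (Type, id)
  IndexName: "modelId-index", // assumed GSI: hash = modelId
  KeyConditionExpression: "modelId = :mid",
  ExpressionAttributeValues: { ":mid": "Model-0" }
};

docClient.query(params, function (err, data) {
  if (err) return console.error(err);
  console.log(data.Items); // every Tag attached to Model-0, no table scan
});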
I'm sure there are many more ways to do this but hopefully this helps point in the right direction.

3 fields composite primary key (unique item) in DynamoDB

I am trying to create a table to store invoice line items in DynamoDB. Let's say an item is defined by CompanyCode, InvoiceNumber, and LineItemId, plus an amount and other line item details.
A unique item is defined by the combination of the first 3 attributes. Any 2 of those attributes can be same for the different items. What should I select as the Hash Attribute and the Range Attribute?
Some Intro
For efficiency I would propose a totally different design. With NoSQL databases (and DynamoDB is no different) we always need to consider the access patterns first. Also, if possible, we should strive to fit all our data within the same table plus several indexes. From what we have from the OP and his comments, these are the two access patterns:
For a company X, get complete invoice Y (including all items or a range of items) [based on this comment]
Get all invoices for company X [based on this comment]
We now wonder: what is a good primary key? That translates to: what is a good Partition Key (PK), what is a good Sort Key (SK), and which secondary indexes do we need to create and of what kind (local or global)? Some reminders:
Primary Key can be on one column or composite
Composite primary key consists of Partition Key and Sort Key
The Partition Key is used as input to a hashing function that determines the partition where the items are stored
The Sort Key can also be composite, which allows us to model one-to-many relationships in DynamoDB, as given in one of the comments' links: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-sort-keys.html
When querying the table or an index, you always need to use the '=' operator on the Partition Key
When querying ranges on the Sort Key, KeyConditionExpression provides a set of operators for sorting and everything in between (one of them being the function begins_with(a, substr))
You are also allowed to use a FilterExpression if you need to further refine the Query results (filtering on the projected attributes)
Local Secondary Indexes (LSI) have the same Partition Key but a different Sort Key than your original table, and give you a different view of your data, organized according to the alternative Sort Key
Global Secondary Indexes (GSI) have a different Partition Key and a different Sort Key than your original table, and give you a completely different view of your data
All items with the same partition key are stored together, and for composite Primary keys, are ordered by the sort key value. DynamoDB splits partitions by sort key if the collection size grows bigger than 10 GB.
Back To Modeling
It is obvious that we are dealing with multiple entities that need to be modeled and fit into the same table. To satisfy the condition of the Partition Key being unique on the table, CompanyCode comes as a natural Partition Key - so I would ensure that it is unique. If it is not, you need to ask yourself how you can model the second access pattern.
Assuming we have established uniqueness of the CompanyCode, let's simplify and say that it comes in the form of an e-mail (it could also be a domain or just a code, but I will use email for demonstration).
Relationship between Company and Invoices is always 1:many.
Relationship between Invoice and Items is always 1:many.
I propose the design described below:
With PK being CompanyCode and SK being InvoiceNumber, we can store all attributes about that invoice for that company.
Nothing prevents me from also adding a record whose SK is Customer, which allows me to store all attributes about the company itself.
With GSI1, we create a reverse lookup, where GSI1PK is my table's SK (InvoiceNumber) and GSI1SK is my table's PK (CompanyCode).
I am using the same table to store line items, with PK being LineItemId and SK being CompanyCode (still unique).
For Item entity records, GSI1PK is still InvoiceNumber, and GSI1SK is LineItemId (the table's PK), so it works the same as for Invoice entity records.
Now the access patterns supported with this:
If I want to get invoice Y for company X and all its items (access pattern 1): query the table where CompanyCode = X and use KeyConditionExpression with the = operator on the Sort Key InvoiceNumber. If I want to get all the items tied to that invoice, I project the Items attribute using ProjectionExpression.
By retrieving all the items with the previous query for company X and invoice Y, I can now run a BatchGetItem API call (using my unique composite key LineItemId+CompanyCode) on the table to get all items belonging to that particular invoice of that particular customer (this comes with some constraints of the BatchGetItem API).
To support access pattern 2, I do a query with CompanyCode = X on the PK and use KeyConditionExpression on the SK with the begins_with(a, substr) function/operator to get only the invoices for company X and not the metadata about that company. That gives me all invoices for a given company/customer; a sketch of this query follows below.
Additionally, with the above GSI1, for any given InvoiceNumber I can easily select all the line items that belong to that particular invoice. REMEMBER: the key values in a global secondary index do not need to be unique - so in my GSI1 I could easily have invoice_1 -> (item_1, item_2) for one company and invoice_1 -> (item_1, item_2) for another; the difference between the two entries in the GSI would be in the SK (they would be associated with different CompanyCodes).
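A minimal sketch of the access pattern 2 query; the table name and the invoice SK prefix are assumptions consistent with the design above:

var params = {
  TableName: "InvoiceTable", // assumed
  KeyConditionExpression: "CompanyCode = :c AND begins_with(InvoiceNumber, :p)",
  ExpressionAttributeValues: {
    ":c": "company_x", // placeholder CompanyCode
    ":p": "invoice_"   // skips the Customer metadata record
  }
};
// docClient.query(params, ...) returns every invoice record for the
// company, but not the metadata item whose SK is "Customer".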
I believe the first option offered by @georgeaf99 won't work, because if you do it that way, then CompanyCode has to be unique in the table. Therefore, there would only be one item allowed per company. I think the second solution is the only real way to do it.
You can use CompanyCode as the Hash Key, and then all other fields that combine to make the item unique (in this case InvoiceNumber and LineItemId) need to be somehow combined into one value (such as concatenation with a field delimiter), which would be your Range Key. Unfortunately that is kind of ugly, but that's the nature of a NoSQL database like DynamoDB. However, it will allow you to successfully store records with the correct uniqueness. When reading the records back, if you don't want to parse the combined field back out to its individual parts, then you'll have to add additional separate fields for InvoiceNumber and LineItemID.
If you don't have a large number of invoices per company, you can query by only the Hash Key and do the filtering on the client side. If you have a large number of invoices per company and need to be able to query only the items for a single invoice, then I would create a secondary index on CompanyCode and InvoiceNumber.
As I'm sure you have figured out, you cannot have more than two attributes forming your primary key (hash + range). Thus, depending on the types of queries you will be performing and the size of your data, you can structure your table in different ways.
(Optimized for the query types you mentioned above: only CompanyCode, and all 3 attributes)
Best solution for small/medium size data sets:
Hash Key: CompanyCode
Perform the query using only CompanyCode, and then filter your results on the other two attributes.
Optimal solution for large data sets:
Hash Key: CompanyCode
Range Key: InvoiceNumber+LineItemId
This allows you to query efficiently on the key alone, but the table structure is pretty ugly. A sketch of the concatenated range key follows below.
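A hedged sketch of that concatenated range key; the delimiter and attribute names are assumptions:

// One line item, with the composite range key plus the raw parts
// duplicated so reads don't have to parse the delimiter back out.
var item = {
  CompanyCode: "ACME",
  InvoiceLine: "INV-001#LINE-07", // InvoiceNumber + "#" + LineItemId
  InvoiceNumber: "INV-001",
  LineItemId: "LINE-07",
  amount: 42.5
};

// All line items of one invoice for one company:
var params = {
  TableName: "InvoiceLines", // assumed
  KeyConditionExpression: "CompanyCode = :c AND begins_with(InvoiceLine, :inv)",
  ExpressionAttributeValues: { ":c": "ACME", ":inv": "INV-001#" }
};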

Deduplication / matching in CouchDB?

I have documents in CouchDB. The schema looks like this:
userId
email
personal_blog_url
telephone
I assume two users are actually the same person as long as they have an identical
email, or
personal_blog_url, or
telephone.
I have 3 views created, which basically map email/blog_url/telephone to userIds and then combine the userIds into groups under the same key, e.g.:
_view/by_email:
----------------------------------
key                  values
a_email@gmail.com    [123, 345]
b_email@gmail.com    [23, 45, 333]
_view/by_blog_url:
----------------------------------
key                  values
http://myblog.com    [23, 45]
http://mysite.com/ss [2, 123, 345]
_view/by_telephone:
----------------------------------
key                  values
232-932-9088         [2, 123]
000-111-9999         [45, 1234]
999-999-0000         [1]
My questions:
How can I merge the results from the 3 different views into a final user table/view which contains no duplicates?
Is it good practice to do such deduplication in CouchDB at all?
If so, what would be a good way to do the deduplication in CouchDB?
P.S. In the final view, suppose that for all dupes we only keep the smallest userId.
Thanks.
Good question. Perhaps you could listen to _changes and, for each changed doc, search the views you suggested (by_*) for the fields that should identify a unique real user.
Merge the views into one (emit the different fields in one map function):
function (doc) {
  // Emit each identity field only if it exists, so one view covers all three
  // and documents missing a field are not skipped entirely.
  if (doc.email) emit([1, doc.email], [doc._id]);
  if (doc.personal_blog_url) emit([2, doc.personal_blog_url], [doc._id]);
  if (doc.telephone) emit([3, doc.telephone], [doc._id]);
}
Merge the lists of ids in a reduce function.
When a new doc arrives in the changes feed, you can query the view with keys=[[1, email], [2, personal_blog_url], [3, telephone]] and merge the three lists. If the minimal id found is smaller than the changed doc's id, update its realId field; otherwise update the documents in the list with the changed doc's id.
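A minimal sketch of that lookup over CouchDB's HTTP API; the database, design document, and view names are assumptions:

// POSTing a "keys" body to a view is standard CouchDB; here we collect
// the candidate duplicate ids for one changed document.
async function findRealId(doc) {
  const res = await fetch(
    "http://localhost:5984/users/_design/dedup/_view/by_identity",
    {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        keys: [
          [1, doc.email],
          [2, doc.personal_blog_url],
          [3, doc.telephone]
        ]
      })
      // Append ?reduce=false to the URL if the view defines a reduce.
    }
  );
  const { rows } = await res.json();
  // Merge the emitted lists and keep the smallest userId as the realId.
  const ids = rows.flatMap(function (row) { return row.value; }).concat(doc._id);
  return ids.sort()[0]; // lexicographic; adapt if ids are numeric
}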
I suggest using a separate document to store the { userId, realId } relation.
You can't create new documents by just using a view. You'd need a task of some sort to do the actual merging.
Here's one idea.
Instead of creating 3 views, you could create one view (that indexes the data if it exists):
Key                 Values
---                 ------
[userId, 'phone']   777-555-1212
[userId, 'email']   username@example.com
[userId, 'url']     favorite.url.example.com
I wouldn't store anything else except the raw value, as you'd end up with lots of unnecessary duplication of data (if you stored the full object for example).
Then, to query, you could do something like:
...startkey=[userId]&endkey=[userId,{}]
That would give you all of the duplicate information as a series of docs for that user Id. You'd still need to parse it apart to see if there were duplicates. But, this way, the results would be nicely merged into a single CouchDB call.
Here's a nice example of using arrays as keys on StackOverflow.
You'd still probably load the original "user" document if it had other data that wasn't part of the de-duplication process.
Once discovered, you could consider cleaning up the data on the fly and prevent new duplicates from occurring as new data is entered into your application.