Examples for DynamoDB Materialized Graph Pattern - amazon-web-services

I started looking into DynamoDB, but got stuck reading this part about the materialized graph pattern: Best Practices for Managing Many-to-Many Relationships.
I guess I get some ideas, but don't understand the whole thing yet.
As far as I understand the pattern, the main table stores edges, and each edge can have properties (the data attribute).
For example (taken from the shown tables):
Node 1 (PK 1) has an edge to Node 2 which is of type DATE, and the edge is of type BIRTH (SK DATE|2|BIRTH).
I guess this would be roughly the same as ()-[:BIRTH]->(:DATE { id: 2 }) in Cypher, right?
But after this it becomes unclear how everything fits together.
For example:
Can the data attribute be a map?
Does the data attribute have to be written to two places on writes? E.g. under (1, DATE|2|BIRTH) and (2, DATE|2)?
If I want to add a new person that is born 1980-12-19, do I have to look up the corresponding node first?
How can I get all properties associated with a node? How to get all properties associated with an edge?
How can I query adjacent nodes?
...
Can someone explain to me how everything fits together by walking through a few use cases?
Thanks in advance.

Hopefully this answers all of your questions. A couple of introductory notes first: I'll be using a generic table for all of my examples. The hash key is node_a and the sort key is node_b, and there is a reverse lookup GSI where node_b is the hash key and node_a is the sort key.
1. Can the data attribute be a map?
The data attribute can be any of the supported data types in DynamoDB, including a map.
2. Does the data attribute have to be written to two places on writes?
The data attribute should be written to only one place. For the example of birthdate, you could do either one of these DynamoDB entries:
node_a | node_b | data
----------|-----------|---------------
user-1 | user-1 | {"birthdate":"2000-01-01", "firstname": "Bob", ...}
user-1 | birthdate | 2000-01-01
In the first row, we created an edge from the user-1 node that loops back on itself. In the second row, we created an edge from user-1 to birthdate. Either way is fine, and the best choice depends on how you will be accessing your data. If you need to be able to find users with a birthdate in a given range, then you should create a birthdate node. If you just need to look up a user's information from their user ID, then you can use either strategy, but the first row will usually be a more efficient use of your table's throughput.
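Here is a rough boto3 sketch of that single write, using the self-loop variant (the table name is a placeholder, and the map contents are just an example):

import boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('my_table')  # placeholder table name

# one write, one place: the user's profile data lives on the self-loop edge
table.put_item(Item={
    'node_a': 'user-1',
    'node_b': 'user-1',
    'data': {'birthdate': '2000-01-01', 'firstname': 'Bob'},  # a map is fine (see #1)
})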
3. If I want to add a new person that is born 1980-12-19, do I have to look up the corresponding node first?
No. Just insert one of the rows from the example above.
You only have to look up the node if there is a more complex access pattern, such as "update the name of the person who was born on 1980-12-19". In that case, you would need to look up by birthdate to get the person node, and then modify something related to the person node. However, that use case is really two different operations. You can rephrase that sentence as "Find the person who was born on 1980-12-19, and update the name", which makes the two operations more apparent.
4.(a) How can I get all properties associated with a node?
Suppose you want to find all the edges for "myNode". You would query the main table with the key condition expression of node_a="myNode" and query the reverse lookup GSI with the key condition expression of node_b="myNode". This is the equivalent of SELECT * FROM my_table WHERE node_a="myNode" OR node_b="myNode".
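In boto3, that pair of queries might look roughly like this (the table and GSI names are assumptions; use whatever you called your reverse lookup index):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource('dynamodb').Table('my_table')  # placeholder table name

# edges where myNode is the source node (main table)
out_edges = table.query(
    KeyConditionExpression=Key('node_a').eq('myNode')
)['Items']

# edges where myNode is the target node (reverse lookup GSI)
in_edges = table.query(
    IndexName='node_b-node_a-index',  # assumed GSI name
    KeyConditionExpression=Key('node_b').eq('myNode')
)['Items']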
4.(b) How to get all properties associated with an edge?
All of the properties of an edge are stored directly in the attributes of the edge, but you may still run into a situation where you don't know exactly where the data is. For example:
node_a | node_b | data
----------|-----------|---------------
thing-1 | thing-2 | Is the data here?
thing-2 | thing-1 | Or here?
If you know the ordering of the edge nodes (i.e. which node is node_a and which is node_b), then you need only one GetItem operation to retrieve the data. If you don't know which order the nodes are in, then you can use BatchGetItem to look up both of the rows in the table (only one of the rows should exist unless you're doing something particularly complex involving a directed graph).
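A minimal sketch of that BatchGetItem call when you don't know the ordering (table name is a placeholder):

import boto3

dynamodb = boto3.resource('dynamodb')

# ask for both orderings; only the row that actually exists comes back
resp = dynamodb.batch_get_item(RequestItems={
    'my_table': {
        'Keys': [
            {'node_a': 'thing-1', 'node_b': 'thing-2'},
            {'node_a': 'thing-2', 'node_b': 'thing-1'},
        ]
    }
})
edge = resp['Responses']['my_table']  # a list with zero or one item in the usual case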
5. How can I query adjacent nodes?
Adjacent nodes are simply two nodes that have an edge connecting them. You would use the same query as 4a, except that instead of being interested in the data attribute, you're interested in the IDs of the other nodes.
Some more examples
Using a graph pattern to model a simple social network
Using a graph pattern to model user-owned resources
How to model a circular relationship between actors and films in DynamoDB (answer uses a graph pattern)
Modeling many-to-many relationships in DynamoDB
From relational DB to single DynamoDB table: a step-by-step exploration. This is a killer piece. It's got an AWS re:Invent talk embedded in it, and the author of this blog post adds his own further explanation on top of it.

Related

Gremlin load data format

I am having difficulty understanding the Gremlin data load format (for use with Amazon Neptune).
Say I have a CSV with the following columns:
date_order_created
customer_no
order_no
zip_code
item_id
item_short_description
The requirements for the Gremlin load format are that the data is in an edge file and a vertex file.
The edge file must have the following columns: id, label, from and to.
The vertex file must have: id and label columns.
I have been referring to this page for guidance: https://docs.aws.amazon.com/neptune/latest/userguide/bulk-load-tutorial-format-gremlin.html
It states that in the edge file, the from column must equate to "the vertex ID of the from vertex."
And that (in the edge file) the to column must equate to "the vertex ID of the to vertex."
My questions:
Which columns need to be renamed to id, label, from and to? Or, should I add new columns?
Do I only need one vertex file or multiple?
You can have one or more of each CSV file (nodes, edges) but it is recommended to use fewer large files rather than many smaller ones. This allows the bulk loader to split the file up and load it in a parallel fashion.
As to the column headers, let's say you had a node (vertex) file of the form:
~id,~label,name,breed,age:Int
dog-1,Dog,Toby,Retriever,11
dog-2,Dog,Scamp,Spaniel,12
The edge file (for dogs that are friends) might look like this:
~id,~label,~from,~to
e-1,FRIENDS_WITH,dog-1,dog-2
In Amazon Neptune, so long as they are unique, any user provided string can be used as a node or edge ID. So in your example, if customer_no is guaranteed to be unique, rather than store it as a property called customer_no you could instead make it the ~id. This can help later with efficient lookups. You can think of the ID as being a bit like a Primary Key in a relational database.
So in summary, you always need to provide the required fields like ~id and ~label. They are accessed differently once the data is loaded, using Gremlin steps such as hasLabel and hasId. Columns with names from your domain, like order_no, will become properties on the node or edge they are defined with, and will be accessed using Gremlin steps such as has('order_no', 'ABC-123').
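For example, once the data is loaded, those lookups might look something like this with gremlinpython (the endpoint is a placeholder):

from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal

# placeholder Neptune endpoint
conn = DriverRemoteConnection('wss://your-neptune-endpoint:8182/gremlin', 'g')
g = traversal().withRemote(conn)

by_id = g.V('dog-1').valueMap(True).next()        # look up a vertex by its ~id
dogs = g.V().hasLabel('Dog').toList()             # all vertices with a given ~label
order = g.V().has('order_no', 'ABC-123').next()   # filter on a domain property
conn.close()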
To follow on from Kelvin's response and provide some further detail around data modeling...
Before getting to the point of loading the data into a graph database, you need to determine what the graph data model will look like. This is done by first deriving a "naive" approach of how you think the entities in the data are connected and then validating this approach by asking the relevant questions (which will turn into queries) that you want to ask of the data.
By way of example, I notice that your dataset has information related to customers, orders, and items. It also has some relevant attributes related to each. Knowing nothing about your use case, I may derive a "naive" model that looks like: (Customer)-[has_ordered]->(Order)-[contains]->(Item).
What you have with your original dataset appears similar to what you might see in a relational database as a Join Table. This is a table that contains multiple foreign keys (the ids/no's fields) and maybe some related properties for those relationships. In a graph, relationships are materialized through the use of edges. So in this case, you are expanding this join table into the original set of entities and the relationships between each.
To validate that we have the correct model, we then want to look at the model and see if we can answer relevant questions that we would want to ask of this data. By example, if we wanted to know all items purchased by a customer, we could trace our finger from a customer vertex to the item vertex. Being able to see how to get from point A to point B ensures that we will be able to easily write graph queries for these questions later on.
After you derive this model, you can then determine how best to transform the original source data into the CSV bulk load format. So in this case, you would take each row in your original dataset and convert that to:
For your vertices:
~id, ~label, zip_code, date_order_created, item_short_description
customer001, Customer, 90210, ,
order001, Order, , 2023-01-10,
item001, Item, , , "A small, non-descript black box"
Note that I'm reusing the no's/ids for the customer, item, and order as the ID for their related vertices. This is always good practice as you can then easily lookup a customer, order, or item by that ID. Also note that the CSV becomes a sparse 2-dimensional array of related entities and their properties. I'm only providing the properties related to each type of vertex. By leaving the others blank, they will not be created.
For your edges, you then need to materialize the relationships between each entity based on the fact that they are related by being in the same row of your source "join table". These relationships did not previously have a unique identifier, so we can create one (it can be arbitrary or based on other parts of the data; it just needs to be unique). I like using the vertex IDs of the two related vertices and the label of the relationship when possible. For the ~from and ~to fields, we include the vertex the relationship comes from and the vertex it applies to, respectively:
~id, ~label, ~from, ~to
customer001-has_ordered-order001, has_ordered, customer001, order001
order001-contains-item001, contains, order001, item001
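As a rough sketch of that transformation (file and column names are assumptions based on your description), expanding each source row into sparse vertex rows plus edge rows could look like this:

import csv

V_HEADER = ['~id', '~label', 'zip_code', 'date_order_created', 'item_short_description']
E_HEADER = ['~id', '~label', '~from', '~to']

with open('source.csv', newline='') as src, \
     open('vertices.csv', 'w', newline='') as vf, \
     open('edges.csv', 'w', newline='') as ef:
    vertices, edges = csv.writer(vf), csv.writer(ef)
    vertices.writerow(V_HEADER)
    edges.writerow(E_HEADER)
    seen = set()
    for row in csv.DictReader(src):
        cust, order, item = row['customer_no'], row['order_no'], row['item_id']
        # sparse vertex rows: only the properties relevant to each label are filled in
        for vid, label, props in [
            (cust,  'Customer', {'zip_code': row['zip_code']}),
            (order, 'Order',    {'date_order_created': row['date_order_created']}),
            (item,  'Item',     {'item_short_description': row['item_short_description']}),
        ]:
            if vid not in seen:  # ~id must be unique, so emit each vertex only once
                seen.add(vid)
                vertices.writerow([vid, label] + [props.get(col, '') for col in V_HEADER[2:]])
        # materialize the relationships implied by the row
        edges.writerow([f'{cust}-has_ordered-{order}', 'has_ordered', cust, order])
        edges.writerow([f'{order}-contains-{item}', 'contains', order, item])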
I hope that adds some further color and reasoning around how to get from your source data and into the format that Kelvin shows above.

Is this a reasonable way to design this DynamoDB table? Alternatives?

Our team has started to use AWS and one of our projects will require storing approval statuses of various recommendations in a table.
There are various things that identify a single recommendation, let's say they're : State, ApplicationDate, LocationID, and Phase. And then a bunch of attributes corresponding to the recommendation (title, volume, etc. etc.)
The use case will often require grabbing all entries for a given State and ApplicationDate (and then we will look at all the LocationId and Phase items that correspond to it) for review from a UI. Items are added to the table one at a time for a given State, ApplicationDate, LocationId, and Phase, and updated frequently.
A dev with a little more AWS experience mentioned we should probably use State+ApplicationDate as the partition key and LocationId+Phase as the sort key. These two pieces combined would make the primary key. I generally understand this, but how does that work if we start getting multiple recommendations for the same primary key? I figure we either are OK with just overwriting what was previously there, OR we have to add some other attribute so we can write a recommendation for the same State+ApplicationDate/LocationId+Phase multiple times and get all previous values if we need to... but that would require adding something to the primary key, right? Would that be like adding some kind of unique value to the sort key? Or, for example, if we need to track status and want to record different values at different statuses, would we just need to add status to the sort key?
Does this sound like a reasonable approach, or should I be exploring a different AWS offering for storing this data?
Use a time-based id property, such as a ULID or KSUID. This will provide randomness to avoid overwriting data, but will also provide time-based sorting of your data when used as part of a sort key.
Because the id value is random, you will want to add it to your sort key for the table or index where you perform your list operations, and reserve the pk for known values that can be specified exactly.
It sounds like the 'State' is a value that can change. You can't update an item's key attributes on the table, so it is more common to use these attributes in a key for a GSI if they are needed to list data.
Given the above, an alternative design is to use the LocationId as the pk, the random id value as the sk, and a GSI with 'State' as the pk and the random id as the sk. Or, if you want to list the items by State -> Phase -> date, the GSI sk could be a concatenation of the Phase and the id property. The above pattern also gives you another list mechanism using the LocationId + the timestamp of the recommendation's create time.
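A minimal sketch of that write, with a home-grown time-sortable id standing in for a real ULID/KSUID library (table, attribute names, and values are all placeholders):

import time
import uuid

import boto3

def time_sortable_id():
    # poor man's KSUID: millisecond timestamp prefix plus a random suffix,
    # so string-sorting the ids also sorts them by creation time
    return f"{int(time.time() * 1000):013d}-{uuid.uuid4().hex[:10]}"

table = boto3.resource('dynamodb').Table('recommendations')  # placeholder table name

rec_id = time_sortable_id()
table.put_item(Item={
    'pk': 'LOC#12345',               # LocationId
    'sk': rec_id,                    # random, time-ordered id
    'gsi1pk': 'WA',                  # State, partition key of the GSI
    'gsi1sk': f'PHASE-2#{rec_id}',   # Phase + id, to list by State -> Phase -> date
    'title': 'Example recommendation',
})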

DynamoDB query by 3 fields

Hi, I am struggling to construct my schema with three search fields.
The two main queries I will use are:
Get all files from a user within a specific folder ordered by date.
Get all files from a user ordered by date.
Maybe there will be an additional query where I want:
All files from a user within a folder ordered by date and itemType == X
All files from a user ordered by date and itemType == X
So because of that, the userID has to be the primary key.
But what should I use as my sortKey? I tried to use a composite sortKey like FOLDER${folderID}#FILE{itemID}#TIME{$timestamp}. As I don't know the itemID, I can't use the beginsWith expression, right?
What I could do is filter with beginsWith on the folderID, but then a descending sort by date would not work.
Or should I move away from DynamoDB to a relational DB with those query requirements in mind?
DynamoDB data modeling can be tough at first, but it sounds like you're off to a good start!
When you find yourself requiring an ID and sorting by time, you should know about KSUIDs. KSUIDs are unique IDs that can be lexicographically sorted by time. That means that if you sort KSUIDs, they will be ordered by creation time. This is super useful in DynamoDB. Let's check out an example.
When modeling the one-to-many relationship between Users and Folders, you might do something like this:
In this example, User with ID 1 has three folders with IDs 1, 2, and 3. But how do we sort by time? Let's see what this same table looks like with KSUIDs for the Folder ID.
In this example, I replaced the plain ol' ID with a KSUID. Not only does this give me a unique identifier, but it also ensures my Folder items are sorted by creation date. Pretty neat!
There are several solutions to filtering by itemType, but I'd probably start with a global secondary index with a partition key of USER#user_id#itemType and FOLDER#folder_id as the sort key. Your base table would then look like this
and your index would look like this
This index allows you to fetch all items or a specific folder for a given user and itemType.
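As a sketch, the queries against that kind of layout might look like this in boto3 (table, index, and attribute names are assumptions):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource('dynamodb').Table('files')  # placeholder table name

# everything for a user, newest first: the KSUID suffix makes the sort key time-ordered
all_items = table.query(
    KeyConditionExpression=Key('pk').eq('USER#123'),
    ScanIndexForward=False,
)['Items']

# items of one itemType for a user, optionally narrowed to a folder, via the GSI
images_in_folder = table.query(
    IndexName='GSI1',  # assumed index name
    KeyConditionExpression=Key('gsi1pk').eq('USER#123#image')
                           & Key('gsi1sk').begins_with('FOLDER#folder-1'),
)['Items']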
These examples might not perfectly match your access patterns, but I hope they can get your data modeling process un-stuck! I don't see any reason why your access patterns can't be implemented in DynamoDB.
If you are sure about using DynamoDB, you should analyze the access patterns to this table in advance and choose the partition key and sort key based on the most frequent pattern. For the other patterns, you should add a GSI for each one. See https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GSI.html
Usually, if the access patterns are unknown, an RDBMS is a better fit; for high-load systems, use NoSQL for the high-load workloads and periodically upload the data to something like AWS Redshift.

how to design schema in dynamodb for a reading comprehension quiz application where data would be heavy?

Please check the UML diagram.
What I want to know is: if there are 30 questions and their options in section 1, 20 questions in section 2, and 30 questions in section 3, how should I store them in the table? The RC passages would have 300-400 words, and with the questions and options it would be around 700-800 words per question.
So should each question have one row in the table, or should I have, per test, different columns for each section, with all the questions and options saved in JSON format in one column (item, in DynamoDB terms)?
I would follow these rules for DynamoDB table design:
Definitely keep everything in one table. It's rare for one application to need multiple tables. It is OK to have different items (rows) in DynamoDB represent different kinds of objects.
Start by identifying your access patterns, that is, what are the questions you need to ask of your data? This will determine your choice of partition key, sort key, and indexes.
Try to pick a partition key that will result in object accesses being spread somewhat evenly over your different partitions.
If you will have lots of different tests, with accesses spread somewhat evenly over the tests, then TestID could be a good partition key. You will probably want to pull up all the tests for a given instructor, so you could have a column InstructorID with a global secondary index pointing back to the primary key attributes.
Your sort key could be heterogeneous--it could be different depending on whether the item is a question or a student's answer. For questions, the sort key could be QuestionID, with the content of the question stored as other attributes. For question options it could be QuestionID#OptionID, with something like an OptionDescription attribute for the content of the option. Keep in mind that it's OK to have sparse attributes--not every item needs something populated for every attribute, and it's OK to have attributes that are meaningless for many items. For answers, your sort key could be QuestionID#OptionID#StudentID, with the content of the student's answer stored as a StudentAnswer attribute.
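For example (a sketch; the TestID and SK attribute names, table name, and key values are assumptions), those sort key prefixes turn the common lookups into simple Query calls:

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource('dynamodb').Table('quiz')  # placeholder table name

# everything under a test: questions, options, and answers share one partition
test_items = table.query(
    KeyConditionExpression=Key('TestID').eq('TEST-42')
)['Items']

# only the items for one question, via the QuestionID prefix of the sort key
question_items = table.query(
    KeyConditionExpression=Key('TestID').eq('TEST-42')
                           & Key('SK').begins_with('Q-7#'),
)['Items']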
Here is a guide on DynamoDB best practices. For something more digestible, search in YouTube for "aws reinvent dynamo rick houlihan." Rick Houlihan has some good talks about data modeling in DynamoDB. Here are a couple, and one more on data modeling:
https://www.youtube.com/watch?v=6yqfmXiZTlM&list=PL_EDAAla3DXWy4GW_gnmaIs0PFvEklEB7
https://www.youtube.com/watch?v=HaEPXoXVf2k
https://www.youtube.com/watch?v=DIQVJqiSUkE
The better approach is to store each question and its options as a row in the DynamoDB table. The second approach, storing the questions and answers as JSON in a single column, is not advisable because the maximum size of a DynamoDB item is 400 KB. In scenarios that need large nested documents, a document database is much more helpful.
Also, try to come up with the types of queries that you will be running. Some typical ones are:
Get all questions in a section by SectionID
Get the details of a Question by Question Id
Get all questions
If you can provide some more information, I could guide you on the data modelling.
Also, I did not see the UML diagram.
The following is my suggestion. Create the DynamoDB table as follows:
Store each sectionId, question, and its options as a row in the DynamoDB table.
Partition Key: SectionID, Sort Key: QuestionId.
Create a GSI on the table with Partition Key: QuestionId, Sort Key: OptionId.
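A quick sketch of the two lookups that design supports (the table and index names are placeholders):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource('dynamodb').Table('questions')  # placeholder table name

# get all questions (and their options) in a section by SectionID
section_items = table.query(
    KeyConditionExpression=Key('SectionID').eq('SECTION-1')
)['Items']

# get the details of a question by QuestionId, via the GSI
question_details = table.query(
    IndexName='QuestionId-OptionId-index',  # assumed GSI name
    KeyConditionExpression=Key('QuestionId').eq('Q-17')
)['Items']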

DynamoDB database model for storing different objects

I'm trying to learn DynamoDB just for didactic purposes; for that reason I set out to create a small project to sell vehicles (cars, bikes, quad bikes, etc.) in order to learn and get some experience with NoSQL databases. I read a lot of documentation about creating the right model, but I still cannot figure out the best way to store my data.
I want to get all the vehicles by filters like:
get all the cars not older than 3 months.
get all the cars not older than 3 months by brand, year and model.
And so on: the same queries as above for bikes, quad bikes, etc.
After reading the official documentation and other pages with examples (https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-general-nosql-design.html#bp-general-nosql-design-approach , https://medium.com/swlh/data-modeling-in-aws-dynamodb-dcec6798e955 , Separate tables vs map lists - DynamoDB), which say that the best designs use only one table for storing everything, I ended up with a model like the one below:
-------------------------------------------------------------------------------------
Partition key | Sort key | Specific attributes for each type of vehicle
-------------------------------------------------------------------------------------
cars | date#brand#year#model | {main attributes for the car}
bikes | date#brand#year#model | {main attributes for the bike}
-------------------------------------------------------------------------------------
I've used a composite sort key because they say it is a good practice for searching data (https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-sort-keys.html).
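For example, I expect to query that layout roughly like this (a sketch; the 'pk'/'sk' attribute names, table name, and the brand/year attributes are placeholders):

from datetime import datetime, timedelta

import boto3
from boto3.dynamodb.conditions import Attr, Key

table = boto3.resource('dynamodb').Table('vehicles')  # placeholder table name

# "all the cars not older than 3 months": the sort key starts with the date,
# so a range condition on the sort key does the work
cutoff = (datetime.utcnow() - timedelta(days=90)).strftime('%Y-%m-%d')
recent_cars = table.query(
    KeyConditionExpression=Key('pk').eq('cars') & Key('sk').gte(cutoff),
)['Items']

# narrowing by brand/year/model then needs a filter on plain (non-key) attributes,
# because the composite key is date-first and begins_with can't target the brand
recent_by_brand = table.query(
    KeyConditionExpression=Key('pk').eq('cars') & Key('sk').gte(cutoff),
    FilterExpression=Attr('brand').eq('toyota') & Attr('year').eq(2020),
)['Items']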
But after defining my model, I realized it will have a problem called "hot spotting" or "hot key" (https://medium.com/expedia-group-tech/dynamodb-data-modeling-c4b02729ac08, https://dzone.com/articles/partitioning-behavior-of-dynamodb), because the official documentation recommends having partition keys with high cardinality to avoid the problem.
So at this point, I'm a little stuck on how to define a good and scalable model. Could you provide some help or examples of how to build a model that supports the queries mentioned above?
Note: I also considered creating a specific table for each vehicle type, but that would create other problems, because to find the information I would need to perform a full table scan.
A few things...
Hot partitions only come into play if you have multiple partitions...
Just because you've got multiple partition (hash) key values doesn't automatically mean DDB will need multiple partitions. You'd also need more than 10 GB of data and/or more than 3000 RCU or 1000 WCU in use.
Next, DDB now supports "Adaptive Capacity", so hot partitions aren't as big a deal as they used to be; see "Why what you know about DynamoDB might be outdated".
In combination with the even newer "Instantaneous Adaptive Capacity", you've got DDB On-Demand.
One final note: you may be under the impression that a given partition (hash) key can only have a maximum of 10 GB of data under it. This is true if your table uses Local Secondary Indexes (LSIs), but it is not true otherwise. Thus, consider using Global Secondary Indexes (GSIs). There's extra cost associated with GSIs, so it's a trade-off to consider.