I am having difficulty understanding the Gremlin data load format (for use with Amazon Neptune).
Say I have a CSV with the following columns:
date_order_created
customer_no
order_no
zip_code
item_id
item_short_description
The requirements for the Gremlin load format are that the data is in an edge file and a vertex file.
The edge file must have the following columns: id, label, from and to.
The vertex file must have: id and label columns.
I have been referring to this page for guidance: https://docs.aws.amazon.com/neptune/latest/userguide/bulk-load-tutorial-format-gremlin.html
It states that in the edge file, the from column must equate to "the vertex ID of the from vertex."
And that (in the edge file) the to column must equate to "the vertex ID of the to vertex."
My questions:
Which columns need to be renamed to id, label, from and to? Or, should I add new columns?
Do I only need one vertex file or multiple?
You can have one or more of each CSV file (nodes, edges) but it is recommended to use fewer large files rather than many smaller ones. This allows the bulk loader to split the file up and load it in a parallel fashion.
As to the column headers, let's say you had a node (vertex) file of the form:
~id,~label,name,breed,age:Int
dog-1,Dog,Toby,Retriever,11
dog-2,Dog,Scamp,Spaniel,12
The edge file (for dogs that are friends) might look like this:
~id,~label,~from,~to
e-1,FRIENDS_WITH,dog-1,dog-2
In Amazon Neptune, any user-provided string can be used as a node or edge ID, so long as it is unique. So in your example, if customer_no is guaranteed to be unique, rather than store it as a property called customer_no you could instead make it the ~id. This can help later with efficient lookups. You can think of the ID as being a bit like a primary key in a relational database.
So in summary, you always need to provide the required fields such as ~id and ~label. They are accessed differently once the data is loaded, using Gremlin steps such as hasLabel and hasId. Columns with names from your domain, such as order_no, will become properties on the node or edge they are defined with, and will be accessed using Gremlin steps such as has('order_no', 'ABC-123').
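To make that concrete, here is a minimal sketch using the gremlin_python client showing how the ~id, ~label, and property columns are queried after the load. The endpoint URL is a placeholder, and the IDs and values are just the ones used in the examples in this thread.
# A rough sketch, assuming a Neptune cluster reachable at the placeholder endpoint below.
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal

conn = DriverRemoteConnection('wss://your-neptune-endpoint:8182/gremlin', 'g')
g = traversal().withRemote(conn)

# Look up a vertex directly by its ~id (similar to a primary-key lookup)
customer = g.V('customer001').valueMap(True).next()

# Filter by ~label, then by an ordinary property column such as order_no
order = g.V().hasLabel('Order').has('order_no', 'ABC-123').valueMap(True).next()

conn.close()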
To follow on from Kelvin's response and provide some further detail around data modeling...
Before getting to the point of loading the data into a graph database, you need to determine what the graph data model will look like. This is done by first deriving a "naive" approach of how you think the entities in the data are connected and then validating this approach by asking the relevant questions (which will turn into queries) that you want to ask of the data.
By way of example, I notice that your dataset has information related to customers, orders, and items. It also has some relevant attributes related to each. Knowing nothing about your use case, I might derive a "naive" model in which a Customer vertex connects to the Orders it places, and each Order connects to the Items it contains (Customer -> Order -> Item).
What you have with your original dataset appears similar to what you might see in a relational database as a Join Table. This is a table that contains multiple foreign keys (the ids/no's fields) and maybe some related properties for those relationships. In a graph, relationships are materialized through the use of edges. So in this case, you are expanding this join table into the original set of entities and the relationships between each.
To validate that we have the correct model, we then want to look at the model and see if we can answer relevant questions that we would want to ask of this data. By example, if we wanted to know all items purchased by a customer, we could trace our finger from a customer vertex to the item vertex. Being able to see how to get from point A to point B ensures that we will be able to easily write graph queries for these questions later on.
After you derive this model, you can then determine how best to transform the original source data into the CSV bulk load format. So in this case, you would take each row in your original dataset and convert that to:
For your vertices:
~id, ~label, zip_code, date_order_created, item_short_description
customer001, Customer, 90210, ,
order001, Order, , 2023-01-10,
item001, Item, , , "A small, non-descript black box"
Note that I'm reusing the no's/ids for the customer, item, and order as the ID for their related vertices. This is always good practice as you can then easily lookup a customer, order, or item by that ID. Also note that the CSV becomes a sparse 2-dimensional array of related entities and their properties. I'm only providing the properties related to each type of vertex. By leaving the others blank, they will not be created.
For your edges, you then need to materialize the relationships between each entity based on the fact that they are related by being in the same row of your source "join table". These relationships did not previously have a unique identifier, so we can create one (it can be arbitrary or based on other parts of the data; it just needs to be unique). I like using the vertex IDs of the two related vertices and the label of the relationship when possible. For the ~from and ~to fields, we supply the vertex the relationship starts from and the vertex it points to, respectively:
~id, ~label, ~from, ~to
customer001-has_ordered-order001, has_ordered, customer001, order001
order001-contains-item001, contains, order001, item001
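If it helps to see the mechanics, here is a rough Python sketch of that row-by-row conversion. It assumes a source file named orders.csv with the columns listed in the question; the file names and the de-duplication approach are my own assumptions for illustration, not a definitive converter.
import csv

# Read the original "join table" CSV and emit Neptune bulk-load vertex and edge rows.
with open("orders.csv", newline="") as src, \
     open("vertices.csv", "w", newline="") as vf, \
     open("edges.csv", "w", newline="") as ef:
    vertices = csv.writer(vf)
    edges = csv.writer(ef)
    vertices.writerow(["~id", "~label", "zip_code", "date_order_created", "item_short_description"])
    edges.writerow(["~id", "~label", "~from", "~to"])

    seen = set()  # avoid writing duplicate vertices when the same ID appears in several rows
    for row in csv.DictReader(src):
        c, o, i = row["customer_no"], row["order_no"], row["item_id"]
        if c not in seen:
            vertices.writerow([c, "Customer", row["zip_code"], "", ""])
            seen.add(c)
        if o not in seen:
            vertices.writerow([o, "Order", "", row["date_order_created"], ""])
            seen.add(o)
        if i not in seen:
            vertices.writerow([i, "Item", "", "", row["item_short_description"]])
            seen.add(i)
        edges.writerow([f"{c}-has_ordered-{o}", "has_ordered", c, o])
        edges.writerow([f"{o}-contains-{i}", "contains", o, i])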
I hope that adds some further color and reasoning around how to get from your source data and into the format that Kelvin shows above.
I'm trying to implement personalization and am having problems with the Items schema.
Imagine I'm Amazon: I have products, their brands, and their categories. How should I include this information in the Items schema?
Should I include the brand name as a string categorical field? Should I instead include the brand ID, as a string or as a number? Or should I include both?
What about categories? I have the same questions there.
Metadata Fields
Metadata includes string or non-string fields that aren't required or don't use a reserved keyword. Metadata schemas have the following restrictions:
Users and Items schemas require at least one metadata field.
Users and Interactions datasets can contain up to five metadata fields. An Items dataset can contain up to 50 metadata fields.
If you add your own metadata field of type string, it must include the categorical attribute. Otherwise, Amazon Personalize won't use the field when training a model.
https://docs.aws.amazon.com/personalize/latest/dg/how-it-works-dataset-schema.html
There are simply 2 ways to include your metadata in Items/Users datasets:
If it can be represented as a numeric value, then provide the actual number where that makes sense.
If it can be represented as a string, then provide the string value and make sure that categorical is set to true.
But let's take a look at the question "Why do they need me to categorize my string metadata?". The answer is pretty simple.
Let's start with an example.
If your Items were Amazon.com products and you wanted to provide a ratings metadata field, then:
You could take all of the ratings, including the full review text sent by customers, and simply put them in a metadata field.
You could take just the star ratings, calculate the average, and put that in a metadata field.
The second option probably makes more sense in general. Having random, long product reviews as metadata changes pretty much nothing. Personalize doesn't understand whether the review itself is good or bad, or whether the author also recommends another product, so it doesn't really add anything to the recommendations.
However, if you reduce your dataset and calculate the average rating, as in the second option, it makes a lot more sense. Maybe some of your customers like low-quality products? Maybe they want to buy them because they are famous YouTubers who make videos about them? Based on their previous interactions and much more, Personalize will be able to perform slightly better, because now it knows that this product has a rating of 5/5 or 3/5.
I wanted to show you that in some cases providing Items metadata as a free-form string makes no sense. That's why your string metadata must be categorical: it should be a finite set of values, so that it adds some knowledge for Personalize about a given Item and why some people might want to interact with it.
Going back to your question:
Should I include the brand name as a string categorical field? Should I instead include the brand ID, as a string or as a number? Or should I include both?
I would simply go with the brand ID as a string. You could also go with the brand name, but a brand can be renamed while remaining the same brand, so the ID is more stable. Also, two different brands could have the same name because they operate in different markets, so using the ID avoids that problem as well.
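As a rough sketch of what that looks like in practice, here is a hypothetical Items schema registered with boto3. The field name BRAND_ID and the schema name are my own placeholders (only ITEM_ID is a required name); treat this as an illustration rather than a definitive schema.
import json
import boto3

# Hypothetical Items schema: ITEM_ID is required; BRAND_ID is a custom string
# metadata field, so it must be marked "categorical": true to be used in training.
items_schema = {
    "type": "record",
    "name": "Items",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {"name": "ITEM_ID", "type": "string"},
        {"name": "BRAND_ID", "type": "string", "categorical": True},
    ],
    "version": "1.0",
}

personalize = boto3.client("personalize")
response = personalize.create_schema(
    name="items-schema-with-brand",  # placeholder name
    schema=json.dumps(items_schema),
)
print(response["schemaArn"])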
The "categorical": true switch in your schema just tells Personalize:
Hey, do you see that string field? It's categorised, finite set of values. If you train a model for me, please include this one during the training, it's important!
And as it's said in documentation, if you will provide string metadata field, which is not marked as categorical, then Personalize will "think" that:
Hmm.. this field is a string, it has pretty random values and it's not marked as categorical. It's probably just a leftover from Items export job. Let's ignore that.
I'm trying to build a web application where users can upload a file (specifically the MDF file format) and view the data in the form of various charts. The files can contain any number of time-based signals (of various numeric data types), and users may name the signals arbitrarily.
My thought on saving the data involves 2 steps:
Maintain a master table as an index, to save such meta information as file names, who uploaded it, when, etc. Records (rows) are added each time a new file is uploaded.
Create a new table (I'll refer to these as data tables) for each file uploaded; within each table, each column will be one signal (the first column being timestamps).
This brings the problem that I can't pre-define the Model for the data tables because the number, name, and datatype of the fields will differ among virtually all uploaded files.
I'm aware of some libs that help to build runtime dynamic models but they're all dated and questions about them on SO basically get zero answers. So despite the effort to make it work, I'm not even sure my approach is the optimal way to do what I want to do.
I also came across this Postgres-specific model field which can take nested arrays (which I believe fits the 2-D time-based signal lists). In theory I could parse the raw uploaded file, construct such an array, and basically save all the data in one field. Since I don't know how large the data can get, this could also become a nightmare for queries later on: creating the charts usually takes only a few columns of signals at a time, out of a total of up to hundreds of signals.
So my question is:
Is there a better way to organize the storage of data? And how?
Any insight is greatly appreciated!
If the name, number, and datatypes of the fields will differ for each user, then you do not need an ORM. What you need is a query builder or SQL string composition, such as Psycopg's sql module. You will be programmatically creating a table for each combination of user and uploaded file (if they are different) and programmatically inserting the records.
Using PostgreSQL might be a good choice; you could also create a GIN index on the arrays to speed up queries.
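As a minimal sketch of that idea with psycopg2's sql composition (the connection string, table name, and column names below are made up for illustration):
import psycopg2
from psycopg2 import sql

# Connection parameters are placeholders
conn = psycopg2.connect("dbname=signals user=app")

def create_data_table(table_name, signal_columns):
    # Build CREATE TABLE programmatically; sql.Identifier safely quotes the dynamic names
    columns = [sql.SQL("ts timestamp")] + [
        sql.SQL("{} double precision").format(sql.Identifier(name))
        for name in signal_columns
    ]
    query = sql.SQL("CREATE TABLE {} ({})").format(
        sql.Identifier(table_name),
        sql.SQL(", ").join(columns),
    )
    with conn, conn.cursor() as cur:
        cur.execute(query)

# One table per uploaded file, one column per signal found in that file
create_data_table("file_42_data", ["engine_rpm", "vehicle_speed"])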
However, if you are primarily working with time-series data, then using a time-series database such as InfluxDB or Prometheus makes more sense.
I started looking into DynamoDB, but got stuck reading this part about the materialized graph pattern: Best Practices for Managing Many-to-Many Relationships.
I guess I get some ideas, but don't understand the whole thing yet.
As far as I understand the pattern the main table stores edges and each edge can have properties (the data-attribute).
For example (taken from the shown tables):
Node 1 (PK 1) has an edge to Node 2 which is of type DATE, and the edge is of type BIRTH (SK DATE|2|BIRTH).
I guess this would be roughly the same as ()-[:BIRTH]->(:DATE { id: 2 }) in Cypher, right?
But after this it becomes unclear how everything fits together.
For example:
Can the data attribute be a map?
Does the data attribute have to be written to two places on writes? E.g. under (1, DATE|2|BIRTH) and (2, DATE|2)?
If I want to add a new person that is born 1980-12-19, do I have to look up the corresponding node first?
How can I get all properties associated with a node? How to get all properties associated with an edge?
How can I query adjacent nodes?
...
Can someone explain to me how everything fits together by walking through a few use cases?
Thanks in advance.
Hopefully this answers all of your questions. Here are a couple of introductory things. I'll be using a generic table for all of my examples. The hash key is node_a and the sort key is node_b. There is a reverse-lookup GSI where node_b is the hash key and node_a is the sort key.
1. Can the data attribute be a map?
The data attribute can be any of the supported data types in DynamoDB, including a map.
2. Does the data attribute have to be written to two places on writes?
The data attribute should be written to only one place. For the example of birthdate, you could do either one of these DynamoDB entries:
node_a | node_b | data
----------|-----------|---------------
user-1 | user-1 | {"birthdate":"2000-01-01", "firstname": "Bob", ...}
user-1 | birthdate | 2000-01-01
In the first row, we created an edge from the user-1 node that loops back on itself. In the second row, we created an edge from user-1 to birthdate. Either way is fine, and the best choice depends on how you will be accessing your data. If you need to be able to find users with a birthdate in a given range, then you should create a birthdate node. If you just need to look up a user's information from their user ID, then you can use either strategy, but the first row will usually be a more efficient use of your table's throughput.
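For example, a quick boto3 sketch of writing those rows (the table name is a placeholder, and the reverse-lookup GSI is assumed to exist as described above):
import boto3

# Assumed table: hash key node_a, sort key node_b, plus a reverse-lookup GSI on (node_b, node_a)
table = boto3.resource("dynamodb").Table("graph-table")

# Second row of the example: an edge from the user node to a birthdate node
table.put_item(Item={
    "node_a": "user-1",
    "node_b": "birthdate",
    "data": "2000-01-01",
})

# First row of the example: a self-loop edge holding the user's own properties as a map
table.put_item(Item={
    "node_a": "user-1",
    "node_b": "user-1",
    "data": {"birthdate": "2000-01-01", "firstname": "Bob"},
})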
3. If I want to add a new person that is born 1980-12-19, do I have to look up the corresponding node first?
No. Just insert one of the rows from the example above.
You only have to look up the node if there is a more complex access pattern, such as "update the name of the person who was born on 1980-12-19". In that case, you would need to look up by birthdate to get the person node, and then modify something related to the person node. However, that use case is really two different operations. You can rephrase that sentence as "Find the person who was born on 1980-12-19, and update the name", which makes the two operations more apparent.
4.(a) How can I get all properties associated with a node?
Suppose you want to find all the edges for "myNode". You would query the main table with the key condition expression of node_a="myNode" and query the reverse lookup GSI with the key condition expression of node_b="myNode". This is the equivalent of SELECT * FROM my_table WHERE node_a="myNode" OR node_b="myNode".
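A minimal boto3 sketch of those two queries (the table and index names are assumptions):
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("graph-table")

# Edges where myNode is on the node_a side (main table)
outgoing = table.query(KeyConditionExpression=Key("node_a").eq("myNode"))["Items"]

# Edges where myNode is on the node_b side (reverse-lookup GSI)
incoming = table.query(
    IndexName="reverse-lookup",  # placeholder GSI name
    KeyConditionExpression=Key("node_b").eq("myNode"),
)["Items"]

edges = outgoing + incoming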
4.(b) How to get all properties associated with an edge?
All of the properties of an edge are stored directly in the attributes of the edge, but you may still run into a situation where you don't know exactly where the data is. For example:
node_a | node_b | data
----------|-----------|---------------
thing-1 | thing-2 | Is the data here?
thing-2 | thing-1 | Or here?
If you know the ordering of the edge nodes (i.e. which node is node_a and which is node_b), then you need only one GetItem operation to retrieve the data. If you don't know which order the nodes are in, then you can use BatchGetItem to look up both of the rows in the table (only one of the rows should exist unless you're doing something particularly complex involving a directed graph).
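As a sketch, when the direction is unknown you could fetch both candidate keys in one round trip (the table name is a placeholder):
import boto3

dynamodb = boto3.resource("dynamodb")

# Request both possible orderings of the edge; only one of them should exist
response = dynamodb.batch_get_item(RequestItems={
    "graph-table": {
        "Keys": [
            {"node_a": "thing-1", "node_b": "thing-2"},
            {"node_a": "thing-2", "node_b": "thing-1"},
        ]
    }
})
edge_items = response["Responses"]["graph-table"]  # a list with (at most) one matching item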
5. How can I query adjacent nodes?
Adjacent nodes are simply two nodes that have an edge connecting them. You would use the same query as 4a, except that instead of being interested in the data attribute, you're interested in the IDs of the other nodes.
Some more examples
Using a graph pattern to model a simple social network
Using a graph pattern to model user-owned resources
How to model a circular relationship between actors and films in DynamoDB (answer uses a graph pattern)
Modeling many-to-many relationships in DynamoDB
From relational DB to single DynamoDB table: a step-by-step exploration. This is a killer piece. It's got an AWS re:Invent talk embedded in it, and the author of this blog post adds his own further explanation on top of it.
How do people deal with index data (the data usually shown on index pages, like a customer list) vs. the model detail data?
When somebody goes to the customer/index route, they only need access to a small subset of the full customer resource. Since I am dealing with legacy data, my customer model has > 10 relationships. It seems wasteful to have the API return a complete and full customer representation for every customer just to render a list/select/index view.
I know those relationships are somewhat lazy-loaded, but it still takes effort on the backend to pull all those relationships in. For some relationships (such as customer->invoices) this could be a large list of ids.
I feel answers to this can be very opinionated. But my two cents:
The API you are drawing on for your data should have an end-point to fetch the subset of data you're interested in, e.g. /api/mini-customer vs /api/customer.
You can then either define two separate models (one to represent the model in the list and one to represent the detailed view), or simply populate the original model with the subset of data and merge the extra data in at a later point.
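To illustrate that, here is a small sketch of the two-endpoint idea (written with Flask; the endpoint paths, field names, and in-memory data are all hypothetical stand-ins for your real backend):
from flask import Flask, jsonify

app = Flask(__name__)

# Placeholder data standing in for the real customer store and its relationships
CUSTOMERS = {
    1: {"id": 1, "name": "Acme Ltd", "invoices": [101, 102, 103], "address": "1 Main St"},
    2: {"id": 2, "name": "Globex", "invoices": [104], "address": "2 Side St"},
}

LIST_FIELDS = ("id", "name")  # the subset the index/list view actually needs

@app.route("/api/mini-customer")
def mini_customers():
    # Return only the subset of fields needed to render the list
    return jsonify([{k: c[k] for k in LIST_FIELDS} for c in CUSTOMERS.values()])

@app.route("/api/customer/<int:customer_id>")
def customer_detail(customer_id):
    # Return the full representation, including relationships such as invoices
    return jsonify(CUSTOMERS[customer_id])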
That said, I've also seen plenty of cases such as the one you describe, where you load all data initially and just display the subset to begin with. If it's reasonable that the data will eventually be used and your page-load constraints can handle it, then this can be an acceptable approach.
I can't seem to find out what the attribute selection filter does in the Preprocess tab. Could someone please explain it in simple language, as I'm new to Weka?
When I apply it to my dataset it seems to remove a couple of attributes, but I'm unsure why.
A real data set may contain many attributes. Applying any data mining process to this data set (e.g. finding clusters, generating a classification model, ...) may take a very long time.
Instead, we can select a subset of attributes (dimensions), often called the most discriminative attributes. These attributes describe the data set almost as well with fewer attributes, and this speeds up any process run on the data.
The attribute selection filter offers many different methods for choosing these attributes. One of them is CFS subset evaluation (CfsSubsetEval). It keeps attributes that are highly correlated with the class label (while being weakly correlated with each other), which makes them discriminative attributes.
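Weka configuration aside, here is a rough, analogous sketch in Python using scikit-learn rather than Weka's CFS, just to show the general idea of score-based attribute selection (the dataset and the choice of SelectKBest/f_classif are my own illustrative assumptions, not what Weka does internally):
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Score each attribute against the class label and keep the 2 most discriminative ones
selector = SelectKBest(score_func=f_classif, k=2)
X_reduced = selector.fit_transform(X, y)

print("kept attribute indices:", selector.get_support(indices=True))
print("reduced shape:", X_reduced.shape)  # fewer columns, same number of rows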