I'm working in the Ab Initio GDE and have a specific set of IDs in different columns that, when put together, should describe the levels of a tree structure.
The two columns are TARGET_ID and SOURCE_ID (shown in the attached screenshot).
Both IDs describe operators (each with a specific ID number). The TARGET_ID is the root, while the SOURCE_ID is the follow-up operator one level under the previous one.
So the chained IDs form the tree structure shown in the second screenshot.
I can't seem to find a way in the GDE to reference the previously read value in these columns when transforming the data.
The data is sorted by a timestamp key, because for each timestamp there is one tree structure.
After the Sort component I added a Levels column filled with NULLs to get the target format.
My data now consists of the Time column, the two ID columns, and the Level column.
I have tried partitioning the data by SOURCE_ID value, so that all entries of a partition belong to the next iterated level.
I can't seem to get the code in "Text View" right, with a temp_level that I iterate upwards and the function that loads the previous or next record's value (I read it was the metacomp function, but I can't find an example implementation).
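In plain terms, the logic I'm trying to implement is roughly the sketch below (plain Python only to illustrate the intent, not Ab Initio DML; the column names and root-at-level-0 numbering are my own assumptions):

from collections import defaultdict

# For each timestamp, walk the (TARGET_ID -> SOURCE_ID) edges from the root
# and assign each row the level of its SOURCE_ID (child) side.
def assign_levels(rows):
    by_time = defaultdict(list)
    for row in rows:
        by_time[row["Time"]].append(row)

    for edges in by_time.values():
        children = defaultdict(list)                    # parent -> children
        sources = {e["SOURCE_ID"] for e in edges}
        for e in edges:
            children[e["TARGET_ID"]].append(e["SOURCE_ID"])

        # The root is a TARGET_ID that never appears as a SOURCE_ID.
        roots = set(children) - sources
        level = {r: 0 for r in roots}

        queue = list(roots)                             # breadth-first walk
        while queue:
            parent = queue.pop(0)
            for child in children.get(parent, []):
                level[child] = level[parent] + 1
                queue.append(child)

        for e in edges:
            e["Level"] = level.get(e["SOURCE_ID"])
    return rows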
Can someone help me solve this? Maybe I'm overcomplicating it and a different combination of components can do the trick as well.
Thanks in advance everyone
I am having difficulty understanding the Gremlin data load format (for use with Amazon Neptune).
Say I have a CSV with the following columns:
date_order_created
customer_no
order_no
zip_code
item_id
item_short_description
The requirements for the Gremlin load format are that the data is in an edge file and a vertex file.
The edge file must have the following columns: id, label, from and to.
The vertex file must have: id and label columns.
I have been referring to this page for guidance: https://docs.aws.amazon.com/neptune/latest/userguide/bulk-load-tutorial-format-gremlin.html
It states that in the edge file, the from column must equate to "the vertex ID of the from vertex."
And that (in the edge file) the to column must equate to "the vertex ID of the to vertex."
My questions:
Which columns need to be renamed to id, label, from and to? Or, should I add new columns?
Do I only need one vertex file or multiple?
You can have one or more of each CSV file (nodes, edges) but it is recommended to use fewer large files rather than many smaller ones. This allows the bulk loader to split the file up and load it in a parallel fashion.
As to the column headers, let's say you had a node (vertex) file of the form:
~id,~label,name,breed,age:Int
dog-1,Dog,Toby,Retriever,11
dog-2,Dog,Scamp,Spaniel,12
The edge file (for dogs that are friends), might look like this
~id,~label,~from,~to
e-1,FRIENDS_WITH,dog-1,dog-2
In Amazon Neptune, so long as they are unique, any user provided string can be used as a node or edge ID. So in your example, if customer_no is guaranteed to be unique, rather than store it as a property called customer_no you could instead make it the ~id. This can help later with efficient lookups. You can think of the ID as being a bit like a Primary Key in a relational database.
So in summary, you always need to provide the required fields like ~id and ~label. They are accessed differently using Gremlin steps such as hasLabel and hasId once the data is loaded. Columns with names from your domain like order_no will become properties on the node or edge they are defined with, and will be accessed using Gremlin steps such as has('order_no', 'ABC-123').
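As a hedged illustration (assuming the gremlin_python client and a placeholder Neptune endpoint), querying that loaded data might look roughly like this:

from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

# The endpoint is a placeholder; substitute your Neptune cluster address.
conn = DriverRemoteConnection('wss://your-neptune-endpoint:8182/gremlin', 'g')
g = traversal().withRemote(conn)

# Look a vertex up directly by the ~id supplied in the CSV.
toby = g.V('dog-1').valueMap(True).next()

# Find vertices by ~label.
dogs = g.V().hasLabel('Dog').toList()

# Filter on a domain property such as order_no.
orders = g.V().has('order_no', 'ABC-123').toList()

conn.close()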
To follow on from Kelvin's response and provide some further detail around data modeling...
Before getting to the point of loading the data into a graph database, you need to determine what the graph data model will look like. This is done by first deriving a "naive" approach of how you think the entities in the data are connected and then validating this approach by asking the relevant questions (which will turn into queries) that you want to ask of the data.
By way of example, I notice that your dataset has information related to customers, orders, and items, along with some relevant attributes for each. Knowing nothing about your use case, I might derive a "naive" model in which Customer, Order, and Item are separate vertices, connected by edges such as has_ordered (Customer to Order) and contains (Order to Item).
What you have with your original dataset appears similar to what you might see in a relational database as a Join Table. This is a table that contains multiple foreign keys (the ids/no's fields) and maybe some related properties for those relationships. In a graph, relationships are materialized through the use of edges. So in this case, you are expanding this join table into the original set of entities and the relationships between each.
To validate that we have the correct model, we then want to look at the model and see if we can answer relevant questions that we would want to ask of this data. By example, if we wanted to know all items purchased by a customer, we could trace our finger from a customer vertex to the item vertex. Being able to see how to get from point A to point B ensures that we will be able to easily write graph queries for these questions later on.
After you derive this model, you can then determine how best to transform the original source data into the CSV bulk load format. So in this case, you would take each row in your original dataset and convert that to:
For your vertices:
~id, ~label, zip_code, date_order_created, item_short_description
customer001, Customer, 90210, ,
order001, Order, , 2023-01-10,
item001, Item, , , "A small, non-descript black box"
Note that I'm reusing the no's/ids for the customer, item, and order as the ID for their related vertices. This is always good practice as you can then easily lookup a customer, order, or item by that ID. Also note that the CSV becomes a sparse 2-dimensional array of related entities and their properties. I'm only providing the properties related to each type of vertex. By leaving the others blank, they will not be created.
For your edges, you then need to materialize the relationships between each entity based on the fact that they are related by being in the same row of your source "join table". These relationships did not previously have a unique identifier, so we can create one (it can be arbitrary or based on other parts of the data; it just needs to be unique). I like using the vertex IDs of the two related vertices and the label of the relationship when possible. For the ~from and ~to fields, we are including the vertices from which the relationship is deriving and what it is applying to, respectively:
~id, ~label, ~from, ~to
customer001-has_ordered-order001, has_ordered, customer001, order001
order001-contains-item001, contains, order001, item001
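If it helps, here is a rough Python sketch of that row-by-row expansion from the original "join table" CSV into the two bulk load files (the source column names come from the question; the file names, ID scheme, and everything else are assumptions):

import csv

# Assumed input: the original "join table" CSV described in the question.
with open('orders.csv', newline='') as src, \
     open('vertices.csv', 'w', newline='') as vf, \
     open('edges.csv', 'w', newline='') as ef:
    vertices = csv.writer(vf)
    edges = csv.writer(ef)
    vertices.writerow(['~id', '~label', 'zip_code', 'date_order_created', 'item_short_description'])
    edges.writerow(['~id', '~label', '~from', '~to'])

    seen = set()  # avoid writing the same vertex twice

    def write_vertex(vid, label, zip_code='', created='', description=''):
        if vid not in seen:
            vertices.writerow([vid, label, zip_code, created, description])
            seen.add(vid)

    for row in csv.DictReader(src):
        cust, order, item = row['customer_no'], row['order_no'], row['item_id']
        write_vertex(cust, 'Customer', zip_code=row['zip_code'])
        write_vertex(order, 'Order', created=row['date_order_created'])
        write_vertex(item, 'Item', description=row['item_short_description'])
        # Materialize the relationships implied by the row.
        edges.writerow([f'{cust}-has_ordered-{order}', 'has_ordered', cust, order])
        edges.writerow([f'{order}-contains-{item}', 'contains', order, item])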
I hope that adds some further color and reasoning around how to get from your source data and into the format that Kelvin shows above.
I have a list of institutions with IDs, and I have the same list again but not in the same order (see the first screenshot).
Since the lists are not in the same order, the next issue is that these institutions have different areas; some institutes have certain areas and others don't (see the second screenshot).
I have this formula, =IF(ISNUMBER(MATCH(C$1,$J2:$O2,0)),"Found","Not Found"), which looks for the area in that row, but because the rows are not in the same order it gives me errors. So I would like to first look up the institute name in the range and then check which areas are available in that institution's row.
Here is a screenshot of the error: you can see that the Found/Not Found results are not correct.
It can be achieved using a helper table where you first sort your J2:Q14 range using
=SORT(J2:Q14,1,1)
You then use the following in cell C2
=IFERROR(IF(HLOOKUP(C$1,$J17:$Q$29,1,0)=C$1,"Found",),"Not Found")
Drag the cell down and right to fill all other cells.
(Of course the sorted table can be in another sheet altogether)
I want to reduce the selected rows by their distance from a point using MarkLogic Optic.
I have a view with data that includes lat and long columns:
const geoData = op.fromView("namespace","coordinate");
geoData.where(op.le(distance(-28.13, 153.4, geoData.col("lat"), geoData.col("long")), 100))
I have already written the distance function, which uses geo.distance(cts.point(lat,long), cts.point(lat,long)), but geoData.col("lat") passes an object that describes the fully qualified name of the column rather than its value, i.e. something like:
op.schemaCol('namespace', 'coordinate', 'long')
I suspect I need to use a map/reduce function, but the MarkLogic documentation gives the usual simplistic examples that are next to useless.
I would appreciate some help.
FURTHER INFORMATION
I have mostly solved this problem, except that some columns have null values. The data is sparse, and not all rows have a lat/long.
So when cts.point() runs in the where statement and two null values are passed, it raises an exception.
How do I coalesce, or prevent execution of cts.point(), when the data columns are null? I don't want to reduce the data set, as the records with null values still need to be returned; they will just have a null distance.
Where possible, it's best to do filtering by passing a constraining cts.query() to where().
A constraining query matches the indexed documents and filters the set of rows to the rows that TDE projected from those documents before retrieving the filtered rows from the indexes.
If the lat and long columns are each distinct JSON properties or XML elements in the indexed documents, it may be possible to express the distance constraint using techniques similar to those summarized here:
http://docs.marklogic.com/guide/search-dev/geospatial#id_42017
In general, it's best to use map/reduce SJS functions for postprocessing on the filtered result set because the rows have to be retrieved to the enode to process in SJS.
Hoping that helps,
I started looking into DynamoDB, but got stuck reading this part about the materialized graph pattern: Best Practices for Managing Many-to-Many Relationships.
I guess I get some ideas, but don't understand the whole thing yet.
As far as I understand the pattern the main table stores edges and each edge can have properties (the data-attribute).
For example (taken from the shown tables):
Node 1 (PK 1) has an edge to Node 2 which is of type DATE, and the edge is of type BIRTH (SK DATE|2|BIRTH).
I guess this would be roughly the same as ()-[:BIRTH]->(:DATE { id: 2 }) in Cypher, right?
But after this it becomes unclear how everything fits together.
For example:
Can the data attribute be a map?
Does the data attribute have to be written to two places on writes? E.g. under (1, DATE|2|BIRTH) and (2, DATE|2)?
If I want to add a new person that is born 1980-12-19, do I have to look up the corresponding node first?
How can I get all properties associated with a node? How to get all properties associated with an edge?
How can I query adjacent nodes?
...
Can someone explain to me how everything fits together by walking through a few use cases?
Thanks in advance.
Hopefully this answers all of your questions. Here's a couple of introductory things. I'll be using a generic table for all of my examples. The hash key is node_a and the sort key is node_b. There is a reverse lookup GSI where node_b is the hash key and node_a is the sort key.
1. Can the data attribute be a map?
The data attribute can be any of the supported data types in DynamoDB, including a map.
2. Does the data attribute have to be written to two places on writes?
The data attribute should be written to only one place. For the example of birthdate, you could do either one of these DynamoDB entries:
node_a | node_b | data
----------|-----------|---------------
user-1 | user-1 | {"birthdate":"2000-01-01", "firstname": "Bob", ...}
user-1 | birthdate | 2000-01-01
In the first row, we created an edge from the user-1 node that loops back on itself. In the second row, we created an edge from user-1 to birthdate. Either way is fine, and the best choice depends on how you will be accessing your data. If you need to be able to find users with a birthdate in a given range, then you should create a birthdate node. If you just need to look up a user's information from their user ID, then you can use either strategy, but the first row will usually be a more efficient use of your table's throughput.
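As a hedged boto3 sketch (the table name 'graph' is an assumption; the key schema matches the generic table described above), either row is a single PutItem, and the data attribute can just as well be a map:

import boto3

# 'graph' is a placeholder table name; hash key node_a, sort key node_b,
# as in the generic table described above.
table = boto3.resource('dynamodb').Table('graph')

# Strategy 1: a self-edge on user-1 holding a map of properties.
table.put_item(Item={
    'node_a': 'user-1',
    'node_b': 'user-1',
    'data': {'birthdate': '2000-01-01', 'firstname': 'Bob'},
})

# Strategy 2: an edge from user-1 to a birthdate node.
table.put_item(Item={
    'node_a': 'user-1',
    'node_b': 'birthdate',
    'data': '2000-01-01',
})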
3. If I want to add a new person that is born 1980-12-19, do I have to look up the corresponding node first?
No. Just insert one of the rows from the example above.
You only have to look up the node if there is a more complex access pattern, such as "update the name of the person who was born on 1980-12-19". In that case, you would need to look up by birthdate to get the person node, and then modify something related to the person node. However, that use case is really two different operations. You can rephrase that sentence as "Find the person who was born on 1980-12-19, and update the name", which makes the two operations more apparent.
4.(a) How can I get all properties associated with a node?
Suppose you want to find all the edges for "myNode". You would query the main table with the key condition expression of node_a="myNode" and query the reverse lookup GSI with the key condition expression of node_b="myNode". This is the equivalent of SELECT * FROM my_table WHERE node_a="myNode" OR node_b="myNode".
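A hedged boto3 sketch of that pair of queries (the table name 'graph' and the GSI name 'reverse-lookup' are assumptions):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource('dynamodb').Table('graph')

# Edges where myNode is node_a, from the main table...
out_edges = table.query(
    KeyConditionExpression=Key('node_a').eq('myNode')
)['Items']

# ...plus edges where myNode is node_b, from the reverse lookup GSI.
in_edges = table.query(
    IndexName='reverse-lookup',
    KeyConditionExpression=Key('node_b').eq('myNode'),
)['Items']

all_edges = out_edges + in_edges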
4.(b) How to get all properties associated with an edge?
All of the properties of an edge are stored directly in the attributes of the edge, but you may still run into a situation where you don't know exactly where the data is. For example:
node_a | node_b | data
----------|-----------|---------------
thing-1 | thing-2 | Is the data here?
thing-2 | thing-1 | Or here?
If you know the ordering of the edge nodes (i.e. which node is node_a and which is node_b), then you need only one GetItem operation to retrieve the data. If you don't know which order the nodes are in, then you can use BatchGetItem to look up both of the rows in the table (only one of the rows should exist unless you're doing something particularly complex involving a directed graph).
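For the unknown-ordering case, a hedged boto3 sketch (table name 'graph' assumed) might look like:

import boto3

dynamodb = boto3.resource('dynamodb')

# Ask for both possible orderings; only one row should exist.
response = dynamodb.batch_get_item(RequestItems={
    'graph': {
        'Keys': [
            {'node_a': 'thing-1', 'node_b': 'thing-2'},
            {'node_a': 'thing-2', 'node_b': 'thing-1'},
        ]
    }
})
edge_rows = response['Responses']['graph']  # a list with at most one item here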
5. How can I query adjacent nodes?
Adjacent nodes are simply two nodes that have an edge connecting them. You would use the same query as 4a, except that instead of being interested in the data attribute, you're interested in the IDs of the other nodes.
Some more examples
Using a graph pattern to model a simple social network
Using a graph pattern to model user-owned resources
How to model a circular relationship between actors and films in DynamoDB (answer uses a graph pattern)
Modeling many-to-many relationships in DynamoDB
From relational DB to single DynamoDB table: a step-by-step exploration. This is a killer piece. It's got an AWS re:Invent talk embedded in it, and the author of this blog post adds his own further explanation on top of it.
Pardon the dumb question, but where can I find a view for Django to display "tables" in a spreadsheet-like format (i.e. one record per row, one field per column)?
(The fact that I have not been able to find one after some searching tells me that maybe my noob-level understanding of Django views is way off-base.)
Here I'm intending the term "table" fairly generically; basically, think of a table as an "list of lists", with the additional constraint that all the internal lists have the same length.
(It would be great if this table had spreadsheet features, such as the possibility of sorting the rows by clicking on the column headers, or of cutting and pasting arbitrary rectangular subsets of the cells, but I realize that this may be hoping for too much.)
Another way to do it is to use tabular inlines, but a list view is what you are looking for.
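As a hedged sketch (the Record model, app paths, and field names are placeholders), a generic ListView plus a template that renders one table row per record could look like this; packages such as django-tables2 can add the clickable column-header sorting mentioned in the question:

# views.py -- minimal sketch; Record and its fields are placeholder names.
from django.views.generic import ListView
from .models import Record

class RecordTableView(ListView):
    model = Record
    template_name = 'records/table.html'
    context_object_name = 'records'

# records/table.html (Django template, shown here as comments for brevity):
#   <table>
#     <tr><th>Name</th><th>Value</th></tr>
#     {% for record in records %}
#       <tr><td>{{ record.name }}</td><td>{{ record.value }}</td></tr>
#     {% endfor %}
#   </table>

# urls.py
from django.urls import path
urlpatterns = [path('records/', RecordTableView.as_view())]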