I am having difficulty understanding the Gremlin data load format (for use with Amazon Neptune).
Say I have a CSV with the following columns:
date_order_created
customer_no
order_no
zip_code
item_id
item_short_description
The requirements for the Gremlin load format are that the data is in an edge file and a vertex file.
The edge file must have the following columns: id, label, from and to.
The vertex file must have: id and label columns.
I have been referring to this page for guidance: https://docs.aws.amazon.com/neptune/latest/userguide/bulk-load-tutorial-format-gremlin.html
It states that in the edge file, the from column must equate to "the vertex ID of the from vertex."
And that (in the edge file) the to column must equate to "the vertex ID of the to vertex."
My questions:
Which columns need to be renamed to id, label, from and to? Or, should I add new columns?
Do I only need one vertex file or multiple?
You can have one or more of each CSV file (nodes, edges) but it is recommended to use fewer large files rather than many smaller ones. This allows the bulk loader to split the file up and load it in a parallel fashion.
As to the column headers, let's say you had a node (vertex) file of the form:
~id,~label,name,breed,age:Int
dog-1,Dog,Toby,Retriever,11
dog-2,Dog,Scamp,Spaniel,12
The edge file (for dogs that are friends) might look like this:
~id,~label,~from,~to
e-1,FRIENDS_WITH,dog-1,dog-2
In Amazon Neptune, so long as they are unique, any user-provided string can be used as a node or edge ID. So in your example, if customer_no is guaranteed to be unique, then rather than store it as a property called customer_no you could instead make it the ~id. This can help later with efficient lookups. You can think of the ID as being a bit like a primary key in a relational database.
So in summary, you always need to provide the required fields like ~id and ~label. Once the data is loaded, they are accessed differently, using Gremlin steps such as hasLabel and hasId. Columns with names from your domain, like order_no, will become properties on the node or edge they are defined with, and will be accessed using Gremlin steps such as has('order_no', 'ABC-123').
To follow on from Kelvin's response and provide some further detail around data modeling...
Before getting to the point of loading the data into a graph database, you need to determine what the graph data model will look like. This is done by first deriving a "naive" approach of how you think the entities in the data are connected and then validating this approach by asking the relevant questions (which will turn into queries) that you want to ask of the data.
By way of example, I notice that your dataset has information related to customers, orders, and items. It also has some relevant attributes related to each. Knowing nothing about your use case, I may derive a "naive" model in which a Customer has_ordered an Order, and an Order contains Items.
What you have with your original dataset appears similar to what you might see in a relational database as a Join Table. This is a table that contains multiple foreign keys (the ids/no's fields) and maybe some related properties for those relationships. In a graph, relationships are materialized through the use of edges. So in this case, you are expanding this join table into the original set of entities and the relationships between each.
To validate that we have the correct model, we then want to look at the model and see if we can answer relevant questions that we would want to ask of this data. By example, if we wanted to know all items purchased by a customer, we could trace our finger from a customer vertex to the item vertex. Being able to see how to get from point A to point B ensures that we will be able to easily write graph queries for these questions later on.
After you derive this model, you can then determine how best to transform the original source data into the CSV bulk load format. So in this case, you would take each row in your original dataset and convert that to:
For your vertices:
~id,~label,zip_code,date_order_created,item_short_description
customer001,Customer,90210,,
order001,Order,,2023-01-10,
item001,Item,,,"A small, non-descript black box"
Note that I'm reusing the no's/ids for the customer, item, and order as the ID for their related vertices. This is always good practice as you can then easily lookup a customer, order, or item by that ID. Also note that the CSV becomes a sparse 2-dimensional array of related entities and their properties. I'm only providing the properties related to each type of vertex. By leaving the others blank, they will not be created.
For your edges, you then need to materialize the relationships between each entity based on the fact that they are related by being in the same row of your source "join table". These relationships did not previously have a unique identifier, so we can create one (it can be arbitrary or based on other parts of the data; it just needs to be unique). I like using the vertex IDs of the two related vertices and the label of the relationship when possible. For the ~from and ~to fields, we are including the vertices from which the relationship is deriving and what it is applying to, respectively:
~id,~label,~from,~to
customer001-has_ordered-order001, has_ordered, customer001, order001
order001-contains-item001, contains, order001, item001
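A small script can mechanize this expansion of the join table into the two load files. Here is a minimal sketch in Python using only the standard library (the single sample row and the exact property layout are assumptions for illustration):

```python
import csv
import io

# One sample row shaped like the original "join table" (values are made up).
source_rows = [
    {
        "date_order_created": "2023-01-10",
        "customer_no": "customer001",
        "order_no": "order001",
        "zip_code": "90210",
        "item_id": "item001",
        "item_short_description": "A small, non-descript black box",
    }
]

VERTEX_HEADER = ["~id", "~label", "zip_code", "date_order_created", "item_short_description"]
EDGE_HEADER = ["~id", "~label", "~from", "~to"]

def to_load_files(rows):
    vertices, edges = io.StringIO(), io.StringIO()
    vw, ew = csv.writer(vertices), csv.writer(edges)
    vw.writerow(VERTEX_HEADER)
    ew.writerow(EDGE_HEADER)
    seen = set()  # avoid duplicate vertices when the same ID repeats across rows
    for r in rows:
        for vid, label, props in [
            (r["customer_no"], "Customer", [r["zip_code"], "", ""]),
            (r["order_no"], "Order", ["", r["date_order_created"], ""]),
            (r["item_id"], "Item", ["", "", r["item_short_description"]]),
        ]:
            if vid not in seen:
                seen.add(vid)
                vw.writerow([vid, label] + props)
        # Edge IDs built from the two vertex IDs plus the label, so they stay unique.
        ew.writerow([f'{r["customer_no"]}-has_ordered-{r["order_no"]}',
                     "has_ordered", r["customer_no"], r["order_no"]])
        ew.writerow([f'{r["order_no"]}-contains-{r["item_id"]}',
                     "contains", r["order_no"], r["item_id"]])
    return vertices.getvalue(), edges.getvalue()

v_csv, e_csv = to_load_files(source_rows)
print(v_csv)
print(e_csv)
```

The csv module also takes care of quoting values that contain commas, like the item description above.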
I hope that adds some further color and reasoning around how to get from your source data and into the format that Kelvin shows above.
I am currently using pandas (0.22.0) with read_table and the names parameter.
How can I address when my underlying data schema changes?
For example, my read_table is reading 5 columns and the data file has 5 columns. How would I tackle changes in the data? When a new column is added to the data file, does that mean I have to update the schema every time the data format changes? Is there a way to ignore the columns not mentioned via names in pandas read_table?
There is a usecols parameter that you can pass to read_table to read only a subset of the available columns. As long as the 5 columns that you are concerned with are always present, you should be able to name them explicitly in the call:
cols_of_interest = ['col1', 'col2', 'col3', 'col4', 'col5']
df = pd.read_table(file_path, usecols=cols_of_interest)
Documentation for pd.read_table here - https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_table.html
Note that you can also pass a callable which can decide which columns to parse, or specify column indices instead of named columns (depends on the underlying data I guess).
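For example, a callable usecols lets you keep a fixed set of columns even when the file grows new ones (the column names and the inline data below are placeholders):

```python
import io
import pandas as pd

# Simulated tab-separated data file that has grown an extra column
# the reading code doesn't know about.
data = io.StringIO("col1\tcol2\tcol3\tnew_col\n1\t2\t3\t4\n5\t6\t7\t8\n")

cols_of_interest = {"col1", "col2", "col3"}

# The callable receives each header name and returns True to keep that
# column, so unexpected extra columns are simply ignored.
df = pd.read_table(data, usecols=lambda name: name in cols_of_interest)
print(df)
```

This way the call never needs updating when unrelated columns are added to the file.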
The problem I have here is that I am iterating over data files with a set schema using read_table and names. I do not want to update the schema every time the underlying data changes.
I found a work-around (more of a hack) for now: I added a few 'dummy' columns to the names array.
I have a Microsoft Foundation Class (MFC) CMap object where each instance stores ~160K entries of long data.
I need to store it on Oracle SQL.
We decided to save it as a BLOB since we do not want to make an additional table. We thought about saving it as local file and point the SQL column to that file, but we'd rather just keep it as BLOB on the server and clear the table every couple of weeks.
The table has a sequential key ID, and 2 columns of date/time. I need to add the BLOB column in order to store the CMap object.
Can you recommend a guide to do so (read/write Map to blob or maybe a clob)?
How do I create a BLOB field in Oracle, and how can I read and write my object to the BLOB? Perhaps using a CLOB?
A CMap cannot be inserted into a BLOB/CLOB directly, because internally it uses pointers, which are meaningless outside the process that created them. Serialize the contents first: copy the entries into a flat structure such as an array or vector, and store that. A CLOB is the better fit if you serialize to text; use a BLOB for a raw binary dump.
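The underlying idea, flattening the map into a contiguous buffer before storing it, is language-independent. Here is a rough Python sketch of the round trip (the fixed-width key/value layout is an assumption; in MFC you would do the equivalent with CArchive or a manual loop over the CMap):

```python
import struct

def map_to_blob(d):
    """Flatten {long: long} pairs into bytes suitable for a BLOB column."""
    buf = struct.pack("<I", len(d))              # 4-byte entry-count header
    for key, value in d.items():
        buf += struct.pack("<qq", key, value)    # 8-byte key, 8-byte value
    return buf

def blob_to_map(buf):
    """Rebuild the map from the flat buffer."""
    (count,) = struct.unpack_from("<I", buf, 0)
    d = {}
    offset = 4
    for _ in range(count):
        key, value = struct.unpack_from("<qq", buf, offset)
        d[key] = value
        offset += 16
    return d

original = {1: 100, 2: 200, 3: -300}
blob = map_to_blob(original)
assert blob_to_map(blob) == original
```

Whatever layout you choose, make sure the reader and writer agree on byte order and field widths, since the BLOB is just opaque bytes to Oracle.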
In that picture I have colored one part. I have an attribute called "deviceModel" that contains more than one value. I want a query over my domain that returns the itemName() of every item whose deviceModel attribute holds more than one value.
Thanks,
Senthil Raja
There is no direct approach to get what you are asking. You need to handle it by writing your own piece of code: running a SELECT query gives you the item attribute-value pairs, and from there you traverse each itemName() and count the values of the attribute you care about.
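As a sketch of that client-side filtering (the item data below is made up; in practice it would come from the SELECT response, where a multi-valued attribute appears as one pair per value):

```python
# Hypothetical SELECT result: itemName() -> list of (attribute, value) pairs.
items = {
    "item1": [("deviceModel", "A100"), ("deviceModel", "B200"), ("color", "red")],
    "item2": [("deviceModel", "A100"), ("color", "blue")],
}

def items_with_multivalued(items, attribute):
    """Return the names of items whose given attribute carries more than one value."""
    result = []
    for name, pairs in items.items():
        count = sum(1 for attr, _ in pairs if attr == attribute)
        if count > 1:
            result.append(name)
    return result

print(items_with_multivalued(items, "deviceModel"))
```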
I think what you are referring to is called multi-valued attributes. When you put a value into an attribute without replacing the existing value, the values accumulate, giving you an array of values stored under that attribute name.
How you create them will depend on the SDK/language you are using for your REST calls; look for the Replace=true/false flag when you set the attribute's value.
Here is the documentation page on retrieving them: http://docs.amazonwebservices.com/AmazonSimpleDB/latest/DeveloperGuide/ (look under Using Amazon SimpleDB -> Using Select to Create Amazon SimpleDB Queries -> Queries on Attributes with Multiple Values)
How can I query the Sitecore archive and what can be queried?
For instance, can I query on the values of the fields of an archived item?
I suppose you're talking about the API to query Archive data. If that's the case, take a look at the Sitecore.Data.Archiving.SqlArchive class, and its method GetEntries() in particular. One of the parameters it accepts is an ArchiveQuery instance.
If you look closer at ArchiveQuery class, you'll see that it is possible to query by item ID, Parent ID, Name, archive date range, original location and "archived by" data of the item being archived.
There seems to be no ad-hoc API to use field data in such queries, but the data of archived fields is still stored in ArchivedFields SQL table. And you can try to address it directly to accomplish what you need (at your own risk, of course).
Hope this helps.