I am having difficulty understanding the Gremlin data load format (for use with Amazon Neptune).
Say I have a CSV with the following columns:
date_order_created
customer_no
order_no
zip_code
item_id
item_short_description
The Gremlin load format requires the data to be split into an edge file and a vertex file.
The edge file must have the following columns: id, label, from and to.
The vertex file must have: id and label columns.
I have been referring to this page for guidance: https://docs.aws.amazon.com/neptune/latest/userguide/bulk-load-tutorial-format-gremlin.html
It states that in the edge file, the from column must equate to "the vertex ID of the from vertex."
And that (in the edge file) the to column must equate to "the vertex ID of the to vertex."
My questions:
Which columns need to be renamed to id, label, from and to? Or, should I add new columns?
Do I only need one vertex file or multiple?
You can have one or more of each type of CSV file (vertex files and edge files), but it is recommended to use fewer large files rather than many small ones, as this allows the bulk loader to split each file up and load it in parallel.
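Once the CSV files are staged in S3, kicking off the load itself is a single HTTP call to the cluster's loader endpoint. A rough Python sketch is below; the endpoint, bucket, and IAM role ARN are placeholders, and the Neptune loader documentation has the full parameter list:

import json
import urllib.request

# Placeholder values - substitute your own cluster endpoint, S3 prefix, role and region
loader_request = {
    "source": "s3://my-bucket/neptune-load/",   # prefix holding the vertex and edge CSVs
    "format": "csv",
    "iamRoleArn": "arn:aws:iam::123456789012:role/NeptuneLoadFromS3",
    "region": "us-east-1",
}

req = urllib.request.Request(
    "https://my-neptune-endpoint:8182/loader",
    data=json.dumps(loader_request).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    print(resp.read().decode())  # the response includes a loadId you can poll for status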
As to the column headers, let's say you had a node (vertex) file of the form:
~id,~label,name,breed,age:Int
dog-1,Dog,Toby,Retriever,11
dog-2,Dog,Scamp,Spaniel,12
The edge file (for dogs that are friends) might look like this:
~id,~label,~from,~to
e-1,FRIENDS_WITH,dog-1,dog-2
In Amazon Neptune, any user-provided string can be used as a node or edge ID, so long as it is unique. So in your example, if customer_no is guaranteed to be unique, rather than storing it as a property called customer_no you could instead make it the ~id. This can help later with efficient lookups. You can think of the ID as being a bit like a primary key in a relational database.
In summary, you always need to provide the required fields ~id and ~label. Once the data is loaded, those are accessed using dedicated Gremlin steps such as hasId and hasLabel. Columns with names from your domain, like order_no, become properties on the node or edge they are defined with, and are accessed using Gremlin steps such as has('order_no', 'ABC-123').
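For example, once the sample data above is loaded, lookups using those steps might look like this; a sketch with the gremlinpython client, where the Neptune endpoint is a placeholder:

from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

conn = DriverRemoteConnection("wss://my-neptune-endpoint:8182/gremlin", "g")
g = traversal().withRemote(conn)

# Look a vertex up directly by the ~id supplied in the CSV
toby = g.V("dog-1").elementMap().toList()

# Filter by ~label and by an ordinary property column
spaniels = g.V().hasLabel("Dog").has("breed", "Spaniel").valueMap().toList()

# Properties from your own columns work the same way, e.g. has('order_no', 'ABC-123')
# Walk an edge defined in the edge file
friends = g.V("dog-1").out("FRIENDS_WITH").values("name").toList()

conn.close()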
To follow on from Kelvin's response and provide some further detail around data modeling...
Before getting to the point of loading the data into a graph database, you need to determine what the graph data model will look like. This is done by first deriving a "naive" approach of how you think the entities in the data are connected and then validating this approach by asking the relevant questions (which will turn into queries) that you want to ask of the data.
By way of example, I notice that your dataset has information related to customers, orders, and items, along with some relevant attributes for each. Knowing nothing else about your use case, I might derive a "naive" model in which a Customer is connected to the Orders they placed, and each Order is connected to the Items it contains (this is the model the CSV examples below follow).
What you have with your original dataset appears similar to what you might see in a relational database as a Join Table. This is a table that contains multiple foreign keys (the ids/no's fields) and maybe some related properties for those relationships. In a graph, relationships are materialized through the use of edges. So in this case, you are expanding this join table into the original set of entities and the relationships between each.
To validate that we have the correct model, we then look at it and check whether we can answer the relevant questions we would want to ask of this data. For example, if we wanted to know all items purchased by a customer, we could trace a path from a customer vertex to the item vertices. Being able to see how to get from point A to point B ensures that we will be able to easily write graph queries for these questions later on.
After you derive this model, you can then determine how best to transform the original source data into the CSV bulk load format. So in this case, you would take each row in your original dataset and convert that to:
For your vertices:
~id, ~label, zip_code, date_order_created, item_short_description
customer001, Customer, 90210, ,
order001, Order, , 2023-01-10,
item001, Item, , , "A small, non-descript black box"
Note that I'm reusing the customer, order, and item numbers from the source data as the IDs for their related vertices. This is good practice, as you can then easily look up a customer, order, or item by that ID. Also note that the CSV becomes a sparse two-dimensional array of related entities and their properties: I'm only providing the properties that belong to each type of vertex, and by leaving the others blank, those properties will not be created.
For your edges, you then need to materialize the relationships between the entities, based on the fact that they are related by being in the same row of your source "join table". These relationships did not previously have a unique identifier, so we need to create one (it can be arbitrary or derived from other parts of the data; it just needs to be unique). I like using the vertex IDs of the two related vertices plus the label of the relationship when possible. The ~from and ~to fields hold the vertex the relationship starts from and the vertex it points to, respectively:
~id, ~label, ~from, ~to
customer001-has_ordered-order001, has_ordered, customer001, order001
order001-contains-item001, contains, order001, item001
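Once loaded, the earlier validation question ("all items purchased by a customer") becomes a short traversal. A sketch, assuming a gremlinpython traversal source g set up as in the earlier example:

items = (
    g.V("customer001")
     .out("has_ordered")                  # Customer -> Order
     .out("contains")                     # Order -> Item
     .values("item_short_description")
     .toList()
)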
I hope that adds some further color and reasoning around how to get from your source data and into the format that Kelvin shows above.
I currently have a Postgres database in which I store data about photos, along with each photo's location as JSON (using Django). The location is obtained through the Google Places API:
https://developers.google.com/places/documentation/search (search for "Search Responses" for example responses)
So currently, every photo has a location column which contains the JSON information for the place, as obtained from the Google Places API.
Now I would like to use Postgres' spatial capabilities to query based on the location, but I am not sure how to do that or what schema changes are required. The Postgres documentation seems to indicate that a new table would be required containing the location's name, lat, lng, and other information. Does that mean that every location will be saved in a separate table and referenced via a foreign key?
And so the JSON will need to be essentially flattened to be stored in that table?
If so, is there a recommended table format for storing locations, so that I can take location data from other sources (say Foursquare, Facebook, etc.) and convert it to that format before storing?
Geometry is not special: it's just another data type. So add a geometry column to your existing table. Assuming you have installed and enabled PostGIS 2.x:
ALTER TABLE mytable ADD COLUMN geom geometry(Point,4326);
Then populate the geometry data by extracting the data out of the location JSON column (which really depends on how the data are structured within this amorphous column):
UPDATE mytable
SET geom = ST_SetSRID(ST_MakePoint((location->>'lng')::numeric,
                                   (location->>'lat')::numeric), 4326);
And that should be a start. Later steps would be to build GiST indices and do spatial queries with other tables or points of interests. You may also want to consider the geography data type, instead of the geometry data type, depending on your needs.
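A rough sketch of those later steps with psycopg2 is below; the table and column names follow the example above, and the sample coordinates and 500 m radius are made up:

import psycopg2

conn = psycopg2.connect("dbname=photos user=me")
cur = conn.cursor()

# GiST index so spatial predicates can use an index scan instead of a full scan
cur.execute("CREATE INDEX mytable_geom_idx ON mytable USING GIST (geom);")

# Photos within ~500 m of a point: casting to geography makes the distance metres
cur.execute(
    """
    SELECT id
    FROM mytable
    WHERE ST_DWithin(geom::geography,
                     ST_SetSRID(ST_MakePoint(%s, %s), 4326)::geography,
                     500)
    """,
    (-122.4194, 37.7749),  # lng, lat
)
photo_ids = [row[0] for row in cur.fetchall()]

conn.commit()
cur.close()
conn.close()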
Hi Stackoverflow people,
I am working on an organisation registration feature, in which orgs can register their project areas (like all of Nevada, the entire US, or simply a city, e.g. Boston), and users should be able to find all organisations whose coverage area includes their lat & lng.
What is the best way to connect the organisation information with the user searches?
Is the following process ok or do you have any suggestions:
1. I load the shapefiles of all necessary states, counties, etc. into my PostGIS database
2. If an organisation adds "New York state" to their coverage area, I would look up the polygon shape for the state (or the ID of that shape) and save it in my coverage table
3. When I search for org coverage, I would find all projects whose coverage polygon contains the user's lat & lng
Is that process above ok to connect the user information to the shapefile information?
How can I look up polygons in the shape files? Can I reference them with an ID?
How would the lookup work with cities, since most lists of city name, lat, lng only give the center point of the city? Or is there a dataset even for city boundaries?
Thank you for your help and suggestions!
When you import a shapefile into PostGIS with shp2pgsql, all the other columns (state name, city name, etc.) are imported too, so you can search by name or by any other property the shapefile has. You can also search by geometry: if you have a point and want to find polygons or points (cities in your case) near that point, the query is very simple:
SELECT * FROM myTable WHERE ST_DWithin(users_point, the_geom, 0.002);
-- the distance units of ST_DWithin are the units of the geometry
P.S. shp2pgsql automatically creates a serial column (a unique id, named gid by default).
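To make steps 2 and 3 of the plan concrete, here is a rough psycopg2 sketch. The states table and its name/gid/the_geom columns are what a shp2pgsql import typically produces, the coverage table is hypothetical, and the geometries are assumed to be in SRID 4326:

import psycopg2

conn = psycopg2.connect("dbname=orgs user=me")
cur = conn.cursor()

# Step 2: when an org registers "New York state", store a reference to the polygon
cur.execute("SELECT gid FROM states WHERE name = %s;", ("New York",))
state_gid = cur.fetchone()[0]
cur.execute("INSERT INTO coverage (org_id, state_gid) VALUES (%s, %s);", (42, state_gid))

# Step 3: at search time, find every registered area containing the user's point
cur.execute(
    """
    SELECT c.org_id
    FROM coverage c
    JOIN states s ON s.gid = c.state_gid
    WHERE ST_Contains(s.the_geom, ST_SetSRID(ST_MakePoint(%s, %s), 4326))
    """,
    (-73.97, 40.78),  # lng, lat of the user
)
org_ids = [row[0] for row in cur.fetchall()]

conn.commit()
cur.close()
conn.close()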
I am designing a database to store details of hotels, where I have to classify them according to country, state, city, and region, each defined as a separate table. The hotel table has foreign keys to these, plus the hotel's latitude and longitude.
But I also have to define each country, state, city, and region with its own latitude and longitude. A simple min/max latitude and longitude (a bounding box) isn't enough, as some cities may be roughly round, or it may not be possible that way without significant error.
How do I define the global position of a city? I need a reasonable error rate (say 20%).
I think the concept you are looking for is the centroid. Calculating these on your own would be quite difficult, so you should probably use a geocoding API like the one provided by Google.
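That said, if you do end up with boundary polygons for each city or region, computing a centroid yourself is straightforward. A tiny sketch with shapely, where the boundary coordinates are made up:

from shapely.geometry import Polygon

# A made-up city boundary as (lng, lat) pairs; real boundaries would come from a
# shapefile or a boundary/geocoding API
city_boundary = Polygon([
    (-115.30, 36.00), (-115.00, 36.30), (-114.90, 36.10), (-115.10, 35.90),
])

center = city_boundary.centroid
print(center.x, center.y)  # one representative lng/lat to store for the city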
I'm trying to create a Django app that would take an inputted address and return a list of political races that person would vote in. I have maps of all the districts (PDFs). And I know that I can use geopy to convert an inputted address into coordinates. How do I define the voter districts in Django so that I can run a query to see what districts those coordinates fall in?
This is a non-trivial problem too large in scope to answer in specific detail here. In short, you'll need to use GeoDjango (part of contrib). There is a section dedicated to importing spatial data.
Once you have your data loaded, you can use spatial lookups to find what district a particular coordinate intersects.
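As a rough sketch of what that could look like, where the model, field, and function names are hypothetical and the district boundaries are assumed to have been imported already (e.g. with GeoDjango's LayerMapping):

from django.contrib.gis.db import models
from django.contrib.gis.geos import Point

class District(models.Model):
    name = models.CharField(max_length=100)
    race = models.CharField(max_length=100)          # the political race tied to this district
    boundary = models.MultiPolygonField(srid=4326)   # polygon(s) imported from the shapefile

def races_for_coordinates(lng, lat):
    """Return the districts (and hence races) whose boundary contains the point."""
    point = Point(lng, lat, srid=4326)
    return District.objects.filter(boundary__contains=point)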
As to where to get the voter district data, you might start with www.data.gov's geodata catalog.