Does partition strategy help Gremlin traversal performance? - amazon-web-services

I tried out PartitionStrategy as described at https://tinkerpop.apache.org/docs/current/reference/. Initially, I expected that defining a specific partition key for a zone and writing some vertices to it would index that zone and improve vertex lookup. Eventually, I realized that the partition key is just another property value defined on a vertex. In other words, this code amounts to nothing more than a property value lookup, which leads to a full graph scan:
g.withStrategies(new PartitionStrategy(partitionKey: "_partition", writePartition: "a",
readPartitions: ["a"]));
I'm not sure what the underlying logic of PartitionStrategy is, but it does not seem to improve lookups if it really does a full graph scan. Correct me if I'm wrong.

From TinkerPop's perspective, PartitionStrategy just automatically modifies your Gremlin to take advantage of a particular property in the graph. TinkerPop doesn't know anything about your graph database's underlying indexing features, nor does it implement any. It is up to your graph to optimize such things. Some graphs might do that on their own, some might offer you the opportunity to create indices that would help improve the speed of PartitionStrategy, and others might do nothing at all, which means PartitionStrategy will not work well for every use case.
Going back to TinkerPop's perspective, the goal of PartitionStrategy (and SubgraphStrategy for that matter) is more to ease the manner with which Gremlin is written for use cases where parts of the graph need to be hidden. Without it, you would have lots and lots of repetitive filters mixed into your traversal which would muddy its readability.
Consider this bit of code:
graph = TinkerGraph.open()
strategy = new PartitionStrategy(partitionKey: "_partition", writePartition: "a", readPartitions: ["a"])
g = traversal().withEmbedded(graph).withStrategies(strategy)
g.addV().addE('link')
g.V().out().out().out()
The traversal is quite readable and straightforward. It is easy to understand the intent - a three step hop. But that's not really the traversal that executed. What executed was:
g.V().out().has('_partition',within("a")).
out().has('_partition',within("a")).
out().has('_partition',within("a"))
If you are using PartitionStrategy then you need to be sure it suits your graph database as well as your use case.
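To make the point above concrete, here is a toy Python model of the rewritten traversal. It is not TinkerPop code, and the names `VERTICES`, `out`, `in_partition` and `three_hops` are made up for illustration. It only shows that the partition check is an ordinary per-vertex property filter applied after each hop, so any speed-up has to come from the graph database's own indexing.

```python
# Toy in-memory graph: vertex id -> (properties, outgoing neighbour ids).
# This is NOT TinkerPop; it only mimics the shape of the rewritten traversal.
VERTICES = {
    1: ({"_partition": "a"}, [2, 3]),
    2: ({"_partition": "a"}, [4]),
    3: ({"_partition": "b"}, [4]),
    4: ({"_partition": "a"}, [5]),
    5: ({"_partition": "a"}, []),
}


def out(frontier):
    """One out() step: follow every outgoing edge of every vertex in the frontier."""
    return [n for v in frontier for n in VERTICES[v][1]]


def in_partition(frontier, read_partitions):
    """Stand-in for has('_partition', within(...)): a plain property check per vertex."""
    return [v for v in frontier if VERTICES[v][0]["_partition"] in read_partitions]


def three_hops(start, read_partitions):
    """Three hops, with the partition filter applied after each one."""
    frontier = [start]
    for _ in range(3):
        frontier = in_partition(out(frontier), read_partitions)
    return frontier


print(three_hops(1, {"a"}))       # vertex 3 is filtered out after the first hop
print(three_hops(1, {"a", "b"}))  # both branches survive, so vertex 5 is reached twice
```

Nothing in this model is faster because of the partition key; the filter simply discards vertices after they have been visited.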

Related

What is the best way to represent a map or train station line as a graph in code?

I am trying to represent a certain transport system and apply some search algorithms to it. The system consists of a number of stations, so I think they can act as vertices, while the lines between them are good candidates for the edges. I have a high-level idea of what I want to do and how the search might work, but I am not able to translate that into code.
Do I use a class to represent the stations, with each station an object? A tuple to store the coordinates? I would love any guidance on how to translate this into an implementation of the algorithms and the program itself.
I am thinking about using C++ for this.
You need at least a class Node/Station. Then you need either (1) a class Graph that contains a list of weighted connections between two Nodes, or (2) a list of nodes in which each node has a list of weighted connections holding node pointers.
Usually you want your graph API to be able to return the neighbors of a node* sorted by weight, so you would probably sort them by calling graph.build() after adding all the weights.
There is no real gain in adding coordinates for the stations unless it is a real application: it is much easier to set costs between stations than to come up with good positions for them. Otherwise, just draw yourself a station map and label the edges yourself.
I'm guessing you want something like graph.path(a, b); with Dijkstra's algorithm you can do that easily. I would recommend setting up the code that calls the algorithm before coding too much; that way, if you are about to represent the data in a bad way, you will find out earlier.
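As a rough sketch of option 2 (each node keeps its own list of weighted connections), here is a small Python version with a `path(a, b)` method based on Dijkstra's algorithm. The question asks about C++, but the same structure carries over directly. The class layout, the method names and the station names are assumptions made for illustration.

```python
import heapq
from collections import defaultdict


class Graph:
    def __init__(self):
        # station -> list of (neighbouring station, cost)
        self.adj = defaultdict(list)

    def add_edge(self, a, b, cost):
        """Add an undirected connection between stations a and b."""
        self.adj[a].append((b, cost))
        self.adj[b].append((a, cost))

    def path(self, start, goal):
        """Return (total_cost, [stations]), or (None, []) if goal is unreachable."""
        dist = {start: 0}
        prev = {}
        heap = [(0, start)]
        while heap:
            d, node = heapq.heappop(heap)
            if d > dist.get(node, float("inf")):
                continue  # stale heap entry, a cheaper route was already found
            if node == goal:
                break
            for nxt, cost in self.adj[node]:
                nd = d + cost
                if nd < dist.get(nxt, float("inf")):
                    dist[nxt] = nd
                    prev[nxt] = node
                    heapq.heappush(heap, (nd, nxt))
        if goal not in dist:
            return None, []
        # Walk the predecessor links back from the goal to rebuild the route.
        route = [goal]
        while route[-1] != start:
            route.append(prev[route[-1]])
        return dist[goal], route[::-1]


g = Graph()
g.add_edge("A", "B", 4)
g.add_edge("B", "C", 3)
g.add_edge("A", "C", 10)
g.add_edge("C", "D", 2)
print(g.path("A", "D"))  # (9, ['A', 'B', 'C', 'D'])
```

Writing the `path` call first, as suggested above, makes it obvious early on whether the chosen representation gives the algorithm what it needs (here: cheap access to a node's neighbours and their costs).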

What is the best data structure to model a path through an undirected graph?

I'm working on modeling a path search and deduction board game, to practice some concepts I am learning in school. This is a first attempt at analyzing graphs for me, and I would appreciate some advice on what kind of data structure might be appropriate for what I am trying to do.
The game I am modeling presents as a series of ~200 interconnected nodes, as shown below. Given a known starting position for the adversary (node 84, for example, in the figure below), the goal is to identify possible locations of the adversary's hideout. The adversary's moves away from 84 are, naturally, unknown.
Fig 1 - Illustrative Sub-Graph with Adversary Initial Position at Node 84
Initially, this leads to a situation like the one below. Given the adversary started at 84, he/she can only be at 66, 86 or 99 after taking their first turn. And so on.
Fig 2 - Possible Locations for Adversary after 1, 2 and 3 Turns (Based on Fig 1 Graph)
So far, I have modeled the board itself as an undirected graph - using an implementation of OCaml's ocamlgraph library. What I am now trying to do is to model the path taken by the adversary through the graph, so as to identify potential locations of the adversary after each turn.
While convenient for illustration purposes, the tree representation implied by the figure above has several drawbacks:
First, keeping track of all possible paths through the network is unnecessary (I care only about terminal location of the adversary's hideout, not the path taken) as well as burdensome: each node is connected to ~7 other nodes, on average. By the time we hit the end of the game's 15 turns, that's a lot of branches!
Second, I suspect pruning would become an issue as well. Indeed, part of the exercise here is to maximally exploit the limited information about the adversary's movements that is revealed as the game goes on. This information either states that the adversary "has never been to node X" or "has previously visited node X."
Information of the first type (e.g. "adversary has never been to node 65") would lead me to want to prune the tree "from above" by traveling down through the branches and cutting off any branch that is invalidated by the revealed information.
Fig 3 - Pruning from the Top ("Adversary Has Never Been to Node 65")
Information of the second type (e.g. "Adversary has Visited Node 100") would, however, invite pruning "from below" to cut off any branch that was not consistent with the information.
Fig 4 - Pruning from the Bottom (e.g. "Adversary Has Visited Node 100")
It seems to me that a naive tree approach would be a messy proposition, so I thought I would ask for any suggestions or advice on the best data structure to use here, or how to better approach the problem.
It's really hard to give advice for your case, as any optimization should be preceded by profiling. It sounds like you need a bitset of some sort and/or incidence matrix. For BitSet you can either use Batteries implementation or just implement your own using OCaml arbitrary precision numbers with Zarith library. For incidence matrix, you can opt into trivial _ array array, use the Bigarray module, or, again, use Zarith and implement your own efficient representation using bitwise operations.
And if I were you, I would start with defining the abstraction that you need (i.e., the interface) then start with a drop in implementation, and later optimize based on the real input, by substituting implementations.
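Here is one way the bitset idea could look, sketched in Python (whose integers are arbitrary precision, much like Zarith's) rather than OCaml. It tracks only the set of possible positions per turn instead of a tree of paths. The assumptions are mine, not the game's rules: the adversary must move every turn, may revisit nodes, and at most one "has visited X" clue is handled at a time. All function names and the small test graph are hypothetical.

```python
def neighbour_masks(edges, n):
    """One bitmask per node: bit j of masks[i] is set when i and j are adjacent."""
    masks = [0] * n
    for a, b in edges:
        masks[a] |= 1 << b
        masks[b] |= 1 << a
    return masks


def step(frontier, masks, banned=0):
    """Advance one turn: union of the neighbours of every node in the frontier,
    with banned nodes removed."""
    nxt = 0
    node = 0
    while frontier:
        if frontier & 1:
            nxt |= masks[node]
        frontier >>= 1
        node += 1
    return nxt & ~banned


def possible_locations(masks, start, turns, never=(), visited=None):
    """Bitset of nodes the adversary may occupy after `turns` moves.

    never:   nodes the adversary has never been to (removed from every turn).
    visited: a node the adversary is known to have passed through, or None.
    """
    banned = 0
    for x in never:
        banned |= 1 << x
    target = 0 if visited is None else 1 << visited
    start_bit = 1 << start
    # `seen`: positions reachable having already passed through `visited`.
    # `unseen`: positions reachable without having done so yet.
    seen, unseen = (start_bit, 0) if start_bit & target else (0, start_bit)
    for _ in range(turns):
        moved = step(unseen, masks, banned)
        seen = step(seen, masks, banned) | (moved & target)
        unseen = moved & ~target
    return seen if visited is not None else unseen


def nodes(bits):
    """Expand a bitset into a sorted list of node numbers."""
    return [i for i in range(bits.bit_length()) if bits >> i & 1]


masks = neighbour_masks([(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)], 5)
print(nodes(possible_locations(masks, 0, 3)))              # [1, 2, 4]
print(nodes(possible_locations(masks, 0, 3, never=[1])))   # [2, 4]
print(nodes(possible_locations(masks, 0, 2, visited=1)))   # [0, 3]
```

Because each turn costs only a handful of bitwise operations, a newly revealed clue can be applied by recomputing from the start, which sidesteps pruning "from above" or "from below" altogether.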

Boost Graph Library dynamic edges weights

I'm wondering if it's possible to have dynamic edge weights in BGL. I'm writing a public transport navigator, so besides using time as the weight, it would be nice to favor actually staying on a line instead of changing at every stop, even if changing would be 3 minutes faster; constant changes are just inconvenient.
Thanks for your help
edit:
Or maybe there is a better library that can do this which I should use?
I'm not entirely clear on what you mean by dynamic... the weights are presumably stored in edge properties; there's nothing to stop you updating the properties with new values as required.
If you mean that you want the edge weights to be a function-object (or "functor", if you must) rather than "just a value", then see this thread on the BGL users list; haven't tried it myself. Makes me wonder how well various graph algorithms using edge weights deal with the weights changing while they're in progress (if the functor is called more than once and returns a different value each time)...
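For the specific goal in the question (discouraging line changes), one alternative to weights that change during the search is to keep them fixed and make the current line part of the search state. The sketch below does that in plain Python rather than BGL. The edge format, the function name `best_route` and the `transfer_penalty` parameter are assumptions for illustration.

```python
import heapq


def best_route(edges, start, goal, transfer_penalty):
    """edges: list of (from_stop, to_stop, line, minutes), treated as directed.
    Line names are assumed to be non-empty strings.

    Returns (cost, legs), where cost includes the penalties and legs is a list
    of (stop reached, line used), or None if goal is unreachable.
    """
    out = {}
    for a, b, line, minutes in edges:
        out.setdefault(a, []).append((b, line, minutes))

    # The search state is (stop, line we arrived on), so the cost of the next
    # edge can depend on whether taking it means changing lines.
    # "" stands for "not on any line yet" at the start.
    heap = [(0, start, "")]
    best = {(start, ""): 0}
    prev = {}
    while heap:
        cost, stop, line = heapq.heappop(heap)
        if cost > best.get((stop, line), float("inf")):
            continue  # stale heap entry
        if stop == goal:
            legs = []
            state = (stop, line)
            while state in prev:
                legs.append(state)
                state = prev[state]
            return cost, legs[::-1]
        for nxt, nxt_line, minutes in out.get(stop, []):
            changing = line != "" and nxt_line != line
            new_cost = cost + minutes + (transfer_penalty if changing else 0)
            if new_cost < best.get((nxt, nxt_line), float("inf")):
                best[(nxt, nxt_line)] = new_cost
                prev[(nxt, nxt_line)] = (stop, line)
                heapq.heappush(heap, (new_cost, nxt, nxt_line))
    return None


edges = [("A", "B", "red", 5), ("B", "C", "red", 10), ("B", "C", "blue", 6)]
print(best_route(edges, "A", "C", transfer_penalty=0))  # changes to blue at B
print(best_route(edges, "A", "C", transfer_penalty=5))  # stays on red
```

Since every edge cost is a fixed, non-negative function of the state it leaves from, the usual Dijkstra guarantees still hold, which avoids the concern raised above about weights changing while the algorithm is in progress.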

How can one create cyclic (and immutable) data structures in Clojure without extra indirection?

I need to represent directed graphs in Clojure. I'd like to represent each node in the graph as an object (probably a record) that includes a field called :edges that is a collection of the nodes that are directly reachable from the current node. Hopefully it goes without saying, but I would like these graphs to be immutable.
I can construct directed acyclic graphs with this approach as long as I do a topological sort and build each graph "from the leaves up".
This approach doesn't work for cyclic graphs, however. The one workaround I can think of is to have a separate collection (probably a map or vector) of all of the edges for an entire graph. The :edges field in each node would then have the key (or index) into the graph's collection of edges. Adding this extra level of indirection works because I can create keys (or indexes) before the things they (will) refer to exist, but it feels like a kludge. Not only do I need to do an extra lookup whenever I want to visit a neighboring node, but I also have to pass around the global edges collection, which feels very clumsy.
I've heard that some Lisps have a way of creating cyclic lists without resorting to mutation functions. Is there a way to create immutable cyclic data structures in Clojure?
You can wrap each node in a ref to give it a stable handle to point at (and allow you to modify the reference, which can start as nil). It is then possible to build cyclic graphs that way. This does have "extra" indirection, of course.
I don't think this is a very good idea though. Your second idea is a more common implementation. We built something like this to hold an RDF graph and it is possible to build it out of the core data structures and layer indices over the top of it without too much effort.
I've been playing with this the last few days.
I first tried making each node hold a set of refs to edges, and each edge hold a set of refs to the nodes. I set them equal to each other in a (dosync... (ref-set...)) type of operation. I didn't like this because changing one node requires a large amount of updates, and printing out the graph was a bit tricky. I had to override the print-method multimethod so the repl wouldn't stack overflow. Also any time I wanted to add an edge to an existing node, I had to extract the actual node from the graph first, then do all sorts of edge updates and that sort of thing to make sure everyone was holding on to the most recent version of the other thing. Also, because things were in a ref, determining whether something was connected to something else was a linear-time operation, which seemed inelegant. I didn't get very far before determining that actually performing any useful algorithms with this method would be difficult.
Then I tried another approach which is a variation of the matrix referred to elsewhere. The graph is a clojure map, where the keys are the nodes (not refs to nodes), and the values are another map in which the keys are the neighboring nodes and single value of each key is the edge to that node, represented either as a numerical value indicating the strength of the edge, or an edge structure which I defined elsewhere.
It looks like this, sort of, for 1->2, 1->3, 2->5, 5->2
(def graph {node-1 {node-2 edge12, node-3 edge13},
            node-2 {node-5 edge25},
            node-3 nil ;; no edge leaves from node 3
            node-5 {node-2 edge52}}) ;; nodes 2 and 5 have an undirected edge
To access the neighbors of node-1 you go (keys (graph node-1)) or call the function defined elsewhere (neighbors graph node-1), or you can say ((graph node-1) node-2) to get the edge from 1->2.
Several advantages:
Constant time lookup of a node in the graph and of a neighboring node, or return nil if it doesn't exist.
Simple and flexible edge definition. A directed edge exists implicitly when you add a neighbor to a node entry in the map, and its value (or a structure for more information) is provided explicitly, or nil.
You don't have to look up the existing node to do anything to it. It's immutable, so you can define it once before adding it to the graph and then you don't have to chase it around getting the latest version when things change. If a connection in the graph changes, you change the graph structure, not the nodes/edges themselves.
This combines the best features of a matrix representation (the graph topology is in the graph map itself not encoded in the nodes and edges, constant time lookup, and non-mutating nodes and edges), and the adjacency-list (each node "has" a list of its neighboring nodes, space efficient since you don't have any "blanks" like a canonical sparse matrix).
You can have multiple edges between nodes, and if you accidentally define an edge which already exists exactly, the map structure takes care of making sure you are not duplicating it.
Node and edge identity is kept by clojure. I don't have to come up with any sort of indexing scheme or common reference point. The keys and values of the maps are the things they represent, not a lookup elsewhere or ref. Your node structure can be all nils, and as long as it's unique, it can be represented in the graph.
The only big-ish disadvantage I see is that for any given operation (add, remove, any algorithm), you can't just pass it a starting node. You have to pass the whole graph map and a starting node, which is probably a fair price to pay for the simplicity of the whole thing. Another minor disadvantage (or maybe not) is that for an undirected edge you have to define the edge in each direction. This is actually okay because sometimes an edge has a different value for each direction and this scheme allows you to do that.
The only other thing I see here is that because an edge is implicit in the existence of a key-value pair in the map, you cannot define a hyperedge (i.e. one which connects more than 2 nodes). I don't think this is a big deal necessarily, since most graph algorithms I've come across (all?) only deal with an edge that connects 2 nodes.
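For readers who do not use Clojure, the map-of-maps layout described above can be sketched in Python. This is an illustration of the idea, not a translation of the author's code. The helper names are made up, an empty dict stands in for nil, and immutability is kept by convention: `add_edge` copies instead of mutating.

```python
def add_edge(graph, src, dst, edge):
    """Return a new graph with src -> dst added; the input graph is left untouched."""
    updated = dict(graph)
    updated[src] = {**graph.get(src, {}), dst: edge}
    updated.setdefault(dst, {})  # make sure the target node has an entry too
    return updated


def neighbors(graph, node):
    """Nodes directly reachable from `node`."""
    return list(graph.get(node, {}))


graph = {}
for src, dst, edge in [(1, 2, "edge12"), (1, 3, "edge13"),
                       (2, 5, "edge25"), (5, 2, "edge52")]:
    graph = add_edge(graph, src, dst, edge)

print(neighbors(graph, 1))  # [2, 3]
print(graph[1][2])          # edge12
print(neighbors(graph, 3))  # []
```

The 2 -> 5 -> 2 cycle exists only as keys in the outer map, so no object ever refers to itself and nothing has to be patched in place to close the loop.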
I ran into this challenge before and concluded that it isn't possible using truly immutable data structures in Clojure at present.
However you may find one or more of the following options acceptable:
Use deftype with ":unsynchronized-mutable" to create a mutable :edges field in each node that you change only once during construction. You can treat it as read-only from then on, with no extra indirection overhead. This approach will probably have the best performance but is a bit of a hack.
Use an atom to implement :edges. There is a bit of extra indirection, but I've personally found reading atoms to be extremely efficient.

Data structure for a random world

So, I was thinking about making a simple random world generator. This generator would create a starting "cell" that would have between one and four random exits (in the cardinal directions, something like a maze). After deciding those exits, I would generate a new random "cell" at each of those exits, and repeat whenever a player gets near a part of the world that has not yet been generated. This concept would allow an "infinite" world of sorts, all randomly generated; however, I am unsure of how best to represent this internally.
I am using C++ (which doesn't really matter, I could implement any sort of data structure necessary). At first I thought of using a sort of directed graph in which each node would have directed edges to each cell surrounding it, but this probably won't work well if a user finds a spot in the world, backtracks, and comes back to that spot from another direction. The world might do some weird things, such as generate two cells at one location.
Any ideas on what kind of data structure might be the most effective for such a situation? Or am I doing something really dumb with my random world generation?
Any help would be greatly appreciated.
Thanks,
Chris
I recommend you read about graphs. This is exactly an application of random graph generation. Instead of 'cell' and 'exit' you are describing 'node' and 'edge'.
Plus, then you can do things like shortest path analysis, cycle detection and all sorts of other useful graph theory application.
This will help you understand about the nodes and edges:
and here is a finished application of these concepts. I implemented this in an OOP way - each node knew about its edges to other nodes. A popular alternative is to implement this using an adjacency list. I think the adjacency list concept is basically what user470379 described in his answer. However, his map solution allows for infinite graphs, while a traditional adjacency list does not. I love graph theory, and this is a perfect application of it.
Good luck!
-Brian J. Stianr-
A map< pair<int,int>, cell> would probably work well; the pair would represent the x,y coordinates. If there's not a cell in the map at those coordinates, create a new cell. If you wanted to make it truly infinite, you could replace the ints with an arbitrary-length integer class that you would have to provide (such as a bigint).
If the world's cells are arranged in a grid, you can easily give them cartesian coordinates. If you keep a big list of existing cells, then before determining exits from a given cell, you can check that list to see if any of its neighbors already exist. If they do, and you don't want to have 1-way doors (directed graph?) then you'll have to take their exits into account. If you don't mind having chutes in your game, you can still choose exits randomly, just make sure that you link to existing cells if they're there.
Optimization note: checking a hash table to see if it contains a particular key is O(1).
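The coordinate-keyed map and the "check existing neighbours first" rule from the answers above can be combined in a short sketch. It is written in Python for brevity (a C++ `std::map` or `std::unordered_map` keyed by coordinates would play the same role). The `World` class, its method names and the exit-selection policy are assumptions, not anything from the original posts.

```python
import random

DIRECTIONS = {"N": (0, 1), "E": (1, 0), "S": (0, -1), "W": (-1, 0)}
OPPOSITE = {"N": "S", "E": "W", "S": "N", "W": "E"}


class World:
    def __init__(self, seed=0):
        self.cells = {}  # (x, y) -> set of exit directions
        self.rng = random.Random(seed)

    def cell_at(self, x, y):
        """Return the exits of the cell at (x, y), generating it on first visit."""
        if (x, y) not in self.cells:
            self.cells[(x, y)] = self._generate(x, y)
        return self.cells[(x, y)]

    def _generate(self, x, y):
        exits = set()
        undecided = []
        for d, (dx, dy) in DIRECTIONS.items():
            neighbour = self.cells.get((x + dx, y + dy))
            if neighbour is None:
                undecided.append(d)   # nothing there yet, free to choose
            elif OPPOSITE[d] in neighbour:
                exits.add(d)          # match the neighbour's existing exit
            # otherwise the neighbour has a wall on this side, so we keep one too
        if undecided:
            # Pick random exits toward unexplored space; insist on at least one
            # if the cell would otherwise have no exits at all.
            count = self.rng.randint(0 if exits else 1, len(undecided))
            exits.update(self.rng.sample(undecided, count))
        return exits


world = World(seed=42)
start = world.cell_at(0, 0)
print(sorted(start))
```

Because a cell is looked up by its coordinates, returning to the same spot from a different direction finds the existing cell instead of generating a second one, and doors stay two-way because each new cell copies the decisions its neighbours have already made.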
Couldn't you have a hash (or STL set) that stored a collection of all grid coordinates that contain occupied cells?
Then when you are looking at creating a new cell, you can quickly check to see if the candidate cell location is already occupied.
(if you had finite space, you could use a 2d array - I think I saw this in a Byte magazine article back in ~1980-ish, but if I understand correctly, you want a world that could extend indefinitely)