So I am currently working on a quick-and-dirty Python project built around a dictionary whose keys are GO IDs from the Open Biological Ontology format. Each key maps to another dictionary containing lists of parent nodes (terms) and child nodes (terms), which lets me build lists of all children or all ancestors for a given node in the ontology (I am working with the GO .obo file, if that helps anyone).
My problem is that I have been looking for an algorithm that will return all nodes on the same level as a given node ID. "Level" has to be relative, because there can be more than one path to a node (the ontology is a directed acyclic graph, so a node can have multiple parents). Essentially, I need to look up the parents of a node, store all children of those parents in a common list, and then repeat the process on every node added, without repeating nodes or slowing the computation down significantly.
I think this can easily be done using a set to prevent duplicate entries, keeping track of which parents I have visited until all parents of the siblings have been visited and no new parent can be added, but my suspicion is that this might be terribly inefficient. If anyone has experience with this kind of algorithm, any insights would be highly appreciated! Hope this is clear enough for a response.
Thanks!
OK guys, this is what I have developed so far, but it seems to keep giving me wrong values for some strange reason. Is there a minor error anyone can spot where I am accidentally not terminating correctly?
# A helper function to find the generation (all same-level nodes) of a given node
def getGenerationals(self, goid):
    visitedParents = set()
    generation = set([goid])
    frontier = set([goid])  # nodes added in the previous pass
    while frontier:
        newNodes = set()
        for g in frontier:
            for p in self._terms[g]['p']:
                if p not in visitedParents:  # expand each parent only once
                    visitedParents.add(p)
                    newNodes |= set(self._terms[p]['c'])
        # only nodes we have not seen yet go into the next pass, so the
        # loop terminates once no new parents can be reached
        frontier = newNodes - generation
        generation |= newNodes
    return generation
# Working function
def getGeneration(self, goid):
    generation = self.getGenerationals(goid)
    generation.remove(goid)  # exclude the query node itself
    return list(generation)
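To illustrate, here is a minimal sketch of how these two functions behave on a toy _terms table (the structure is as described above: each ID maps to a dict with a parent list 'p' and a child list 'c'; the Ontology wrapper class and the single-letter IDs are made up for the example):
# Toy DAG:      A
#              / \
#             B   C
#            / \ / \
#           D   E   F
class Ontology(object):
    def __init__(self, terms):
        self._terms = terms
    # attach the two functions defined above as methods
    getGenerationals = getGenerationals
    getGeneration = getGeneration

onto = Ontology({
    'A': {'p': [],         'c': ['B', 'C']},
    'B': {'p': ['A'],      'c': ['D', 'E']},
    'C': {'p': ['A'],      'c': ['E', 'F']},
    'D': {'p': ['B'],      'c': []},
    'E': {'p': ['B', 'C'], 'c': []},
    'F': {'p': ['C'],      'c': []},
})
print onto.getGeneration('D')  # ['E', 'F'] (order may vary)
Note that F ends up in D's generation even though D and F share no parent: F is reached through the sibling E's other parent C, which is exactly the "repeat the process on every node added" requirement.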
I'm having problems with insertions into Neptune using Gremlin.
I am trying to insert many nodes and edges - potentially hundreds of thousands of each - while checking for existence.
Currently we are using inject to insert the nodes, and the problem is that it is slow.
After running the explain command, we figured out that the problem is the coalesce and where steps - they take more than 99.9% of the run duration.
I want to insert each node and edge only if it doesn't already exist; that's why I am using the coalesce and where steps.
For example, the query we use to insert nodes with inject:
properties_list = [{'uid': '1642'}, {'uid': '1322'}, …]
g.inject(properties_list).unfold().as_('node')
 .sideEffect(__.V().where(P.eq('node')).by('uid').fold()
 .coalesce(__.unfold(), __.addV(label).property(Cardinality.single, 'uid', '1')))
With 1,000 nodes in the graph and a properties_list of 100 elements, running the query above takes around 30 seconds, and it gets slower as the number of nodes in the graph increases.
Running a naive injection in the same environment, without the coalesce and where steps, takes less than 1 second.
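For reference, the naive injection is essentially the same traversal with the existence check stripped out - roughly something like this (a sketch, not our exact code; the select-based property copy is illustrative):
from gremlin_python.process.graph_traversal import __
from gremlin_python.process.traversal import Cardinality

# Rough sketch of the naive baseline: inject and create, no existence check.
(g.inject(properties_list).unfold().as_('node')
  .addV(label)
  .property(Cardinality.single, 'uid', __.select('node').select('uid'))
  .iterate())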
I'd like to hear your suggestions and to learn the best practices for inserting many nodes and edges (with existence checks).
Thank you very much.
If you have a set of IDs that you want to check for existence, you can speed up the query significantly by also passing the query a plain list of those IDs and computing, up front, which of them already exist. Having calculated the set that needs updates, you can then apply them all in one go. This will make a big difference. The reason you are running into problems is that the mid-traversal V() has a lot of work to do. In general it would be better to use actual vertex IDs rather than properties (uid in your case), but if that is not an option, the same technique works for property-based IDs. The steps are:
Using inject or sideEffect, pass in the IDs to be checked as one list and, separately, the map containing the changes to be conditionally applied.
Compute which of the given IDs already exist, and from that, which do not.
Using the set of non-existing IDs, apply the updates, using each ID to index into your map.
Here is a concrete example. I used the graph-notebook for this but you can do the same thing in code:
Given:
ids = "['1','2','9998','9999']"
and
data = "[['id':'1','value':'XYZ'],['id':'9998','value':'ABC'],['id':'9999','value':'DEF']]"
we can do something like this:
g.V().hasId(${ids}).id().fold().as('exist').
constant(${data}).
unfold().as('d').
where(without('exist')).by('id').by()
which correctly finds the ones that do not already exist:
{'id': 9998, 'value': 'ABC'}
{'id': 9999, 'value': 'DEF'}
You can use this pattern to construct your conditional inserts a lot more efficiently (I hope :-) ). So to add the new vertices you might do:
g.V().hasId(${ids}).id().fold().as('exist').
constant(${data}).
unfold().as('d').
where(without('exist')).by('id').by().
addV('test').
property(id,select('d').select('id')).
property('value',select('d').select('value'))
v[9998]
v[9999]
As a side note, we are adding two new steps to Gremlin - mergeV and mergeE - that will allow this to be done much more easily and in a more declarative style. Those new steps should be part of the TinkerPop 3.6 release.
This follows exercise 2.8 from the book "Cracking the Coding Interview". There, they ask you to find a loop inside a linked list. They propose a fast runner/slow runner approach, but I found a much shorter solution and would like to confirm whether there is any problem with it.
I decided to create a hash table, initially all False, that keeps track of whether each node has been visited. Then I run a loop that ends once the current node has already been visited:
class Node():
    def __init__(self, data=None, next=None):
        self.data = data
        self.next = next

def find_loop(head, hash_table):
    # walk the list until we revisit a node; that node starts the loop
    node = head
    while not hash_table[node]:
        hash_table[node] = True
        node = node.next
    return node.data

if __name__ == "__main__":
    # build the list 9 -> 2 -> 6 -> 5 -> 11 -> (back to the 6 node)
    node3 = Node()
    node5 = Node(11, node3)
    node4 = Node(5, node5)
    node3.data = 6
    node3.next = node4
    node2 = Node(2, node3)
    node1 = Node(9, node2)
    hash_table = {}
    for i in range(1, 6):
        hash_table[globals()['node%s' % i]] = False
    print(find_loop(node1, hash_table))
Your solution will work; however, it isn't an 'efficient' solution in terms of space. The fast/slow runner is O(n) time and O(1) space, while your solution is O(n) time and O(n) space. Also, you should be using a HashSet (a set in Python), not a HashMap (a dictionary).
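For reference, here is a minimal sketch of the fast runner/slow runner approach the book proposes (Floyd's cycle detection), reusing the Node class from the question; it returns the data at the start of the loop just like find_loop, but in O(1) space:
def find_loop_floyd(head):
    slow = fast = head
    # Phase 1: fast moves two steps for every slow step; if there is
    # a loop, the two pointers must eventually meet inside it.
    while fast is not None and fast.next is not None:
        slow, fast = slow.next, fast.next.next
        if slow is fast:
            # Phase 2: restart slow at the head; stepping both pointers
            # one node at a time, they meet at the first node of the loop.
            slow = head
            while slow is not fast:
                slow, fast = slow.next, fast.next
            return slow.data
    return None  # fast ran off the end: no loop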
I have tried to set up a graph-tool BFS search on an Ubuntu system. Due to a bug, I have been forced to use graph_tool.search.bfs_search() instead of graph_tool.search.bfs_iterator().
So I set up a minimal class, as shown in the examples, inheriting from gt.BFSVisitor.
The goal is to track all (edge source node, edge action value) pairs reachable from a specific node in a graph and store them in a NumPy array with dimensions (num_nodes, num_actions), where num_actions is equal to the maximum out-degree in the graph.
The function does its job, but accessing the edge PropertyMap self.edge_action[edge] to retrieve the edge action is a huge bottleneck that significantly slows down my code. Since I turned to graph-tool for its speed in the first place, I am a little bit stuck right now.
Am I missing something about the graph-tool library, or is there no way to speed this up? Otherwise I might as well go back to networkx and find the fastest way there.
I simply cannot think of a way to avoid this Python loop over edge actions and exploit the C++ power of graph-tool.
Here is my simple class:
class SetIterator(gt.BFSVisitor):
    def __init__(self, action, safe_set):
        """
        Parameters
        ----------
        action: gt.PropertyMap
            edge property representing the edge action
        safe_set: np.array
            array used to store safe (node, action) pairs
        """
        self.edge_action = action
        self.ea = self.edge_action
        self.safe_set = safe_set

    def discover_vertex(self, u):
        """
        Invoked on the first encounter of a vertex.

        Parameters
        ----------
        u: gt.Vertex
        """
        self.safe_set[int(u), 0] = True

    def examine_edge(self, e):
        """
        Called when an edge is checked.

        Parameters
        ----------
        e: gt.Edge
        """
        # TODO this one line is the bottleneck of the code
        self.safe_set[int(e.source()), self.ea[e]] = True
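One possible direction, sketched below under the assumption of an unweighted graph and an integer-valued action property (untested; fill_safe_set is a made-up helper name, while gt.shortest_distance and Graph.get_edges are real graph-tool calls): drop the per-edge Python callback entirely and compute the same safe_set update with bulk NumPy operations.
import numpy as np
import graph_tool.all as gt

def fill_safe_set(g, action, safe_set, root):
    # Unweighted BFS distances from root; unreachable vertices get a
    # sentinel distance >= num_vertices, so this mask marks exactly the
    # vertices the BFS would discover.
    dist = gt.shortest_distance(g, source=root)
    reachable = dist.a < g.num_vertices()
    safe_set[reachable, 0] = True

    # One row per edge: [source, target, action value].
    edges = g.get_edges(eprops=[action])
    src = edges[:, 0].astype(int)
    act = edges[:, 2].astype(int)

    # Keep the edges whose source the BFS would examine, then set all
    # (source, action) pairs with a single fancy-indexing assignment.
    mask = reachable[src]
    safe_set[src[mask], act[mask]] = True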
I want to create a graph with nodes and edges, where each node contains n values. We are given the n values of the starting node, from which we need to generate other nodes, where each value in each new node is of the form either:
t_n = t_(n-1) + 2
or
t_n = t_(n-1) - 1
When such a node is generated, it should create an edge from the old node to the new node.
I know this might be a very trivial job, but I have very limited programming knowledge. I have been advised to use classes or structs in C++ to represent the nodes. Please help me create a graph with nodes that hold multiple values, where the next nodes are generated from the parent node following the above rule. Some C++ code would be very helpful.
Thanks in advance.
I don't really fully understand your task yet, so let me restate what I think you are asking:
- graph with nodes and edges
- each node has n number of values
- we are given n values of the starting point
- need to generate other nodes where each value in each node would be either
- t_n=t_(n-1)+2
- t_n=t_(n-1)-1
- when such node is generated, it creates an edge from the old node to the new node.
About this starting point: do we have to generate a graph from it? What about the creation of the edge between the old node and the new node - is the "old node" here the starting point?
Does "n number of values" refer to what the node is connected to (as a chain of the other edges to which this edge is connected)? For example, if we are given a node with a chain of numbers (6, 4, 5), does this mean we need to generate extra edges connected x times (the first one, linked to our starting point, would be linked to 6 edges, one of them being the starting point)?
I will edit my answer when I have more information. Could you please draw an example in Paint, upload it online, and provide the link? It would be easier to imagine.
I have a question concerning abstract syntax trees generated with the Boost Spirit library.
I've found plenty of information about deleting nodes and subtrees in a binary search tree, but I cannot find the same for ASTs. I have a node in an AST that is the root of a subtree of the complete tree. Now I want to delete this node and all of its children.
I don't know how to do it, and the Boost Spirit documentation didn't help either.
Has anyone got any tips for me?
The tree is generated with (Boost 1.46.1):
tree_parse_info<> info = ast_parse(expression.c_str(), parser, skipparser);
And the Expression is something like this:
(variable_17 OR variable_18) AND function( variable_17) <= 30 OR function( subkey_18) <= 30
I use
tree_match<iterator_t>::tree_iterator tree_it = info.trees.begin();
to get the beginning of the tree, and then I check whether one of the subtrees is redundant (this doesn't have anything to do with the deleting itself). Then I traverse through the tree using
tree_match<iterator_t>::tree_iterator children_it = tree_it->children.begin()
and call the same function on its children (recursively). I can't post the complete code, but that's the most important part of it. I thought that I could traverse to the leaf nodes of a redundant subtree and set them to null, or something like that, and then go back up the tree and delete all the other children one after another. However, nothing has worked so far.
An example of traversing the tree can be found in the linked answer, "The Traversing".
If I can't delete any nodes, does anyone have an idea how to create a new tree, based on the existing one, that skips the redundant parts?