Simple path queries on large graphs - data-mining

I have a question about large graph data. Suppose we have a large graph with nearly 100 million edges and around 5 million nodes. What is the best graph-mining platform you know of that can return all simple paths of length <=k (for k=3,4,5) between any two given nodes? The main concern is the speed of getting those paths. Another point: the graph is directed, but we would like the program to ignore the directions when computing the paths, yet still return the actual directed edges once it spots those paths.
For instance:
a -> c <- d -> b is a valid path between nodes 'a' and 'b' of length 3.
Thanks in advance.

So this is a way to do it in networkx. It's roughly based on the solution I gave here. I'm assuming that a->b and a<-b are two distinct paths you want. I'm going to return this as a list of lists. Each sublist is the (ordered) edges of a path.
import networkx as nx
import itertools
def getPaths(G, source, target, maxLength, excludeSet=None):
    #print source, target, maxLength, excludeSet
    if excludeSet is None:
        excludeSet = set([source])
    else:
        excludeSet.add(source)  # won't allow a path starting at source to go through source again.
    if maxLength == 0:
        excludeSet.remove(source)
        return []
    else:
        if G.has_edge(source, target):
            paths = [[(source, target)]]
        else:
            paths = []
        if G.has_edge(target, source):
            paths.append([(target, source)])
        # neighbors_iter is a big iterator that will give (neighbor, edge) for each
        # successor of source and then for each predecessor of source.
        neighbors_iter = itertools.chain(
            ((neighbor, (source, neighbor)) for neighbor in G.successors_iter(source) if neighbor != target),
            ((neighbor, (neighbor, source)) for neighbor in G.predecessors_iter(source) if neighbor != target))
        # note that if a neighbor is both a predecessor and a successor, it shows up twice in this iteration.
        paths.extend([[edge] + path for (neighbor, edge) in neighbors_iter if neighbor not in excludeSet
                      for path in getPaths(G, neighbor, target, maxLength - 1, excludeSet)])
        excludeSet.remove(source)  # when we move back up the recursion, don't want to exclude this source any more
        return paths
G=nx.DiGraph()
G.add_edges_from([(1,2),(2,3),(1,3),(1,4),(3,4),(4,3)])
print getPaths(G,1,3,2)
>[[(1, 3)], [(1, 2), (2, 3)], [(1, 4), (4, 3)], [(1, 4), (3, 4)]]
I would expect that by modifying the Dijkstra algorithm in networkx you'll arrive at a more efficient algorithm (note that networkx's Dijkstra has a cutoff, but by default it only returns the shortest path, and it follows edge direction).
Here's an alternative version of the whole paths.extend thing:
paths.extend( [[edge] + path for (neighbor,edge) in neighbors_iter if neighbor not in excludeSet for path in getPaths(G,neighbor,target,maxLength-1,excludeSet) if len(path)>0 ] )
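On a current networkx, a minimal alternative sketch: search an undirected copy of the graph with the built-in all_simple_paths (whose cutoff bounds the path length in edges), then map every hop back to a directed edge that exists in the original graph. The helper name undirected_paths is mine, not an existing networkx function:
import networkx as nx

def undirected_paths(G, source, target, k):
    # Walk an undirected copy, so edge direction is ignored during the search.
    U = G.to_undirected()
    for path in nx.all_simple_paths(U, source, target, cutoff=k):
        # Recover an original directed edge for each hop; if both directions
        # exist in G, this arbitrarily reports (u, v).
        yield [(u, v) if G.has_edge(u, v) else (v, u)
               for u, v in zip(path, path[1:])]
Note one difference from getPaths: where a->b and a<-b both exist, this sketch reports a single path rather than two distinct ones.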

I would recommend Gephi: it is easy to handle and learn. If you find it tough, Neo4j will do what you need with a little bit of coding.

Related

How to check for membership in a list in CLINGO?

Having a background in Prolog, I'm struggling to convert this DLV program (DLV has builtin predicates that handle lists much as Prolog does) into CLINGO.
path(X,Y,[X|[Y]]) :- rel(X,Y).
path(X,Y,[X|T]) :- rel(X,Z), path(Z,Y,T), not #member(X,T).
rel(X,Y) :- edge(X,Y).
rel(X,Y) :- edge(Y,X).
edge(a,b).
edge(a,c).
edge(b,a).
edge(b,c).
edge(e,c).
So far I managed to achieve this:
path(X,Y,cons(X,cons(Y,empty))) :- edge(X,Y).
path(X,Y,cons(X,L)) :- edge(X,Z), path(Z,Y,L), not member(X,path(Z,Y,L)).
member(X,path(X,T,cons(X,Y))) :- path(X,T,cons(X,Y)).
member(X,path(S,X,cons(S,L))) :- path(S,X,cons(S,L)).
member(X,path(S,T,cons(S,cons(Z,L)))) :- member(X,path(Z,T,cons(Z,L))).
% same edges
but I get the error unsafe variables in file - at line 7 and column 1-72, and I don't fully understand why. I was wondering if anyone could help.
You never defined what S could be.
Add edge(S,Z) to the rule body on line 7 to get rid of that error. Or, if you want to define a vertex predicate, you could use vertex(S) as well.
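Applied to the offending rule (the last member rule, where S occurs only in the head), the repaired line would read, for example:
member(X,path(S,T,cons(S,cons(Z,L)))) :- member(X,path(Z,T,cons(Z,L))), edge(S,Z).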
So I fixed your code with the cons-lists. This approach is unusual because lists are not a key feature in ASP the way they are in Prolog. OK, here is my solution (it works for directed graphs):
edge(a,b).
edge(a,c).
edge(b,c).
edge(c,a).
edge(c,d).
edge(d,b).
node(X) :- edge(X,_). % get nodes
node(X) :- edge(_,X). % get nodes
num(N):- {node(X)} == N. % count nodes
step(1..N) :- num(N). % mark possible steps
path(X,Y,2,cons(X,cons(Y,empty))) :- edge(X,Y).
path(A,C,NN+1,cons(A,L)) :- edge(A,B), path(B,C,NN,L), step(NN+1).
member(X,path(X,Y,N,cons(X,L))) :- path(X,Y,N,cons(X,L)).
member(Y,path(X,Y,N,cons(X,L))) :- path(X,Y,N,cons(X,L)).
member(M,path(S,T,NN+1,cons(S,cons(Z,L)))) :- member(M,path(Z,T,NN,cons(Z,L))), path(S,T,NN+1,cons(S,cons(Z,L))).
track(Y,Z,N,L):- {member(X,path(Y,Z,N,L)):node(X)} == N, path(Y,Z,N,L).
#show track/4.
At first you need to know all of the vertices to calculate their number. I also introduced the predicate step to bound the depth of a path, and path now carries a depth counter as its third argument. All possible paths are stored within path/4; cycles are allowed. The member/2 predicate is generated to show all the member vertices within a path/4. In a last step, a path is forwarded to the predicate track/4 if and only if the number of distinct member vertices equals the path length. Since duplicates are not counted, this condition makes sure that only paths without cycles are forwarded. Please note that all of the above steps are forced (there are no choice rules), so there is exactly one answer set for every graph.
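To try it, a usage sketch, assuming the encoding above is saved as paths.lp (the trailing 0 asks clingo to enumerate all answer sets, of which there is exactly one here):
clingo paths.lp 0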
So let's have a look at a more ASP-like solution. Normally you would ask a specific question ('path from a to b with length n') instead of a generic one ('all possible paths from all nodes to all nodes with every possible length'). The following code needs a start node (start/1) and an end node (end/1). The program guesses a (random) order by assigning exactly one index number to each vertex within the predicate order/2. The order predicate is copied to the path predicate as long as the index is not larger than the index of the end node (path(S,N):- order(S,N), maxZ(Z), S<=Z.). The only constraint is that within the order of path, every 2 neighboring vertices must be connected by an edge. The constraint line reads: it cannot be the case that there is a node S1 at position N within a path, a node S2 at position N+1 within a path, and no edge from S1 to S2.
edge(a,b).
edge(a,c).
edge(b,c).
edge(c,a).
edge(c,d).
edge(d,b).
start(a).
end(d).
node(X) :- edge(X,_). % get nodes
node(X) :- edge(_,X). % get nodes
num(N):- {node(X)} == N. % count nodes
step(1..N) :- num(N). % mark possible steps
order(1,A):- start(A). % start has index 1
maxZ(Z):- end(E), order(Z,E), step(Z). % end has index maxZ
{order(S,N):node(N)} == 1 :- step(S). % exactly one assignment per step
{order(S,N):step(S)} == 1 :- node(N). % exactly one assignment per node
path(S,N):- order(S,N), maxZ(Z), S<=Z. % copy order when index is not larger than end node index
:- path(N, S1), path(N+1, S2), not edge(S1,S2). % successing indices are connected through edges
#show path/2.

determining whether two convex hulls overlap

I'm trying to find an efficient algorithm for determining whether two convex hulls intersect. The hulls consist of data points in N-dimensional space, where N is 3 up to 10 or so. One elegant algorithm was suggested here using linprog from scipy, but you have to loop over all points in one hull, and it turns out the algorithm is very slow in low dimensions (I tried it, and so did one of the respondents). It seems to me the algorithm could be generalized to answer the question I am posting here, and I found what I think is a solution here. The authors say that the general linear programming problem takes the form Ax + tp >= 1, where the A matrix contains the points of both hulls, t is some constant >= 0, and p = [1,1,1,1...1] (it's equivalent to finding a solution to Ax > 0 for some x). As I am new to linprog(), it isn't clear to me whether it can handle problems of this form. If A_ub is defined as on page 1 of the paper, then what is b_ub?
There is a nice explanation of how to do this problem, with an algorithm in R, on this website. My original post referred to the scipy.optimize.linprog library, but this proved to be insufficiently robust. I found that the SCS algorithm in the cvxpy library worked very nicely, and based on this I came up with the following python code:
import numpy as np
import cvxpy as cvxpy
# Determine feasibility of Ax <= b
# cloud1 and cloud2 should be numpy.ndarrays
def clouds_overlap(cloud1, cloud2):
    # build the A matrix
    cloud12 = np.vstack((-cloud1, cloud2))
    vec_ones = np.r_[np.ones((len(cloud1), 1)), -np.ones((len(cloud2), 1))]
    A = np.r_['1', cloud12, vec_ones]
    # make b vector
    ntot = len(cloud1) + len(cloud2)
    b = -np.ones(ntot)
    # define the x variable and the equation to be solved
    x = cvxpy.Variable(A.shape[1])
    constraints = [A*x <= b]
    # since we're only determining feasibility there is no minimization
    # so just set the objective function to a constant
    obj = cvxpy.Minimize(0)
    # SCS was the most accurate/robust of the non-commercial solvers
    # for my application
    problem = cvxpy.Problem(obj, constraints)
    problem.solve(solver=cvxpy.SCS)
    # Any 'inaccurate' status indicates ambiguity, so you can
    # return True or False as you please
    if problem.status == 'infeasible' or problem.status.endswith('inaccurate'):
        return True
    else:
        return False
cube = np.array([[1,1,1],[1,1,-1],[1,-1,1],[1,-1,-1],[-1,1,1],[-1,1,-1],[-1,-1,1],[-1,-1,-1]])
inside = np.array([[0.49,0.0,0.0]])
outside = np.array([[1.01,0,0]])
print("Clouds overlap?", clouds_overlap(cube, inside))
print("Clouds overlap?", clouds_overlap(cube, outside))
# Clouds overlap? True
# Clouds overlap? False
The area of numerical instability is when the two clouds just touch, or are arbitrarily close to touching such that it isn't possible to definitively say whether they overlap or not. That is one of the cases where you will see this algorithm report an 'inaccurate' status. In my code I chose to consider such cases overlapping, but since it is ambiguous you can decide for yourself what to do.

Construct an Adjacency List from a List of edges?

In the context of graph algorithms, we are usually given a convenient representation of a graph (usually as an adjacency list or an adjacency matrix) to operate on.
My question is, what is an efficient way to construct an Adjacency list from a given list of all edges?
For the purpose of the question, assume that edges are a list of tuples (as in python) and (a,b) denotes a directed edge from a to b.
A combination of itertools.groupby (docs), sorting and dict comprehension could get you started:
from itertools import groupby
edges = [(1, 2), (2, 3), (1, 3)]
adj = {k: [v[1] for v in g] for k, g in groupby(sorted(edges), lambda e: e[0])}
# adj: {1: [2, 3], 2: [3]}
This sorts and groups the edges by their source node and stores a list of target nodes for each source node. You can then access all nodes adjacent to 1 via adj[1]. Note that a node with no outgoing edges (like 3 here) gets no key in adj.
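If the sort bothers you, a minimal alternative sketch with collections.defaultdict builds the same mapping in a single O(E) pass (my addition, not part of the original answer):
from collections import defaultdict

edges = [(1, 2), (2, 3), (1, 3)]
adj = defaultdict(list)
for a, b in edges:
    adj[a].append(b)
# adj: {1: [2, 3], 2: [3]}
As a defaultdict, adj[3] also returns an empty list instead of raising a KeyError.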

igraph invalid vertex Id

I'm trying to run igraph's fast greedy community detection algorithm using the following code:
G = Graph()
L = []
V = []
for row in cr:
    try:
        l = []
        source = int((row[0]).strip())
        target = int((row[1]).strip())
        weight = int((row[2]).strip())
        l.append(source)
        l.append(target)
        if l not in L:
            L.append(l)
        if source not in V:
            V.append(source)
        if target not in V:
            V.append(target)
    except ValueError:
        print "Value Error"
        continue
    if weight == 1:
        continue
G.add_vertices(max(V))
G.add_edges(L)
cl = G.community_fastgreedy(weights=weight).as_clustering(10)
But this is the error I'm getting:
igraph._igraph.InternalError: Error at type_indexededgelist.c:272: cannot add edges, Invalid vertex id
I found this: Cannot add edges, Invalid vertex ID in IGraph so I tried adding all the vertices and then all the edges but I still get an error.
Does the above code do the same thing as:
tupleMapping = []
for row in cr:
    if int(row[2]) < 10:
        continue
    l = [row[0], row[1], row[2]]
    tupleMapping.append(tuple(l))
g = Graph.TupleList(tupleMapping)
cl = g.community_fastgreedy().as_clustering(20)
I don't have to explicitly say G.community_fastgreedy(weights=weight), right?
Also, another problem I was having: when I try to request more clusters in the following way:
cl = g.community_fastgreedy().as_clustering(10)
cl = g.community_fastgreedy().as_clustering(20)
I get two large clusters, and the remaining clusters each consist of a single element. This happens whether I set the cluster count to 5, 10, or 20. Is there any way to make the clusters more equally divided? I need more than 2 clusters for my dataset.
This is a small snippet of the data I'm trying to read from the csv file so that I can generate a graph and then run the community detection algorithm:
202,580,11
87,153,7
227,459,6
263,524,11
Thanks.
That's right, the second code does the same thing. In the first example, the problem is that when you add edges, you refer to igraph's internal vertex IDs, which always start from 0 and go up to N-1. It does not matter that your own vertex names are integers; you need to translate them into igraph vertex IDs.
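For illustration, a minimal sketch of that translation for the first example; the name2vid dictionary here is my own scaffolding, and it becomes unnecessary once you switch to TupleList:
name2vid = {}
for source, target in L:
    for name in (source, target):
        if name not in name2vid:
            name2vid[name] = len(name2vid)  # assign ids 0..N-1 in order of appearance
G = Graph(len(name2vid))  # create all N vertices first
G.add_edges([(name2vid[s], name2vid[t]) for s, t in L])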
The igraph.Graph.TupleList() method is much more convenient here. However, you need to specify that the third element of the tuple is the weight. You can do that either with the weights = True or the edge_attrs = ['weight'] argument:
import igraph
data = '''1;2;34
1;3;41
1;4;87
2;4;12
4;5;22
5;6;33'''
L = set([])
for row in data.split('\n'):
    row = row.split(';')
    L.add(
        (row[0].strip(), row[1].strip(), int(row[2].strip()))
    )
G = igraph.Graph.TupleList(L, edge_attrs = ['weight'])
You can then create dictionaries to translate between igraph vertex IDs and your original names:
vid2name = dict(zip(xrange(G.vcount()), G.vs['name']))
name2vid = dict((name, vid) for vid, name in vid2name.iteritems())
However, the first is not so much needed, as you can always use G.vs[vid]['name'].
For fastgreedy, I think you should specify the weights; at least the documentation does not say whether an edge attribute named weight is used automatically if it exists.
fg = G.community_fastgreedy(weights = 'weight')
fg_clust_10 = fg.as_clustering(10)
fg_clust_20 = fg.as_clustering(20)
If fastgreedy gives you only 2 large clusters, I can only recommend trying other community detection methods. You could actually try all of those that run within reasonable time (it depends on the size of your graph) and compare their results. Also, because you have a weighted graph, you could take a look at the ModuLand method family, which is not implemented in igraph but has good documentation and supports quite sophisticated settings.
Edit: the comments from the OP suggest that the original data describe a directed graph. The fastgreedy algorithm is unable to consider directions and gives an error if called on a directed graph; that's why I created an undirected igraph.Graph() object in my example. If you want to run other methods, some of which might be able to deal with directed networks, you should first create a directed graph:
G = igraph.Graph.TupleList(L, directed = True, edge_attrs = ['weight'])
G.is_directed()
# returns True
To run fastgreedy, convert the graph to undirected. As you have a weight attribute on the edges, you need to specify what igraph should do when 2 edges of opposite direction between the same pair of vertices are collapsed into one undirected edge. You can do many things with the weights, like taking the mean, the larger, or the smaller one, etc. For example, to make each combined edge carry the mean weight of the original edges:
uG = G.as_undirected(combine_edges = 'mean')
fg = uG.community_fastgreedy(weights = 'weight')
Important: be aware that during this operation, and also when you add or remove vertices or edges, igraph reindexes the vertices and edges. So if you know that vertex id x corresponds to your original id y, that may no longer be valid after reindexing, and you need to recreate the name2vid and vid2name dictionaries.
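For example, after the as_undirected() call above, you would rebuild the dictionaries exactly as before, just from uG (a sketch mirroring the earlier snippet):
vid2name = dict(zip(xrange(uG.vcount()), uG.vs['name']))
name2vid = dict((name, vid) for vid, name in vid2name.iteritems())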

Can you get the selected leaf from a DecisionTreeRegressor in scikit-learn

just reading this great paper and trying to implement this:
... We treat each individual
tree as a categorical feature that takes as value the
index of the leaf an instance ends up falling in. We use 1-
of-K coding of this type of features. For example, consider
the boosted tree model in Figure 1 with 2 subtrees, where
the first subtree has 3 leafs and the second 2 leafs. If an
instance ends up in leaf 2 in the first subtree and leaf 1 in
second subtree, the overall input to the linear classifier will
be the binary vector [0, 1, 0, 1, 0], where the first 3 entries
correspond to the leaves of the first subtree and last 2 to
those of the second subtree ...
Does anyone know how I can predict a bunch of rows and, for each of those rows, get the selected leaf for each tree in the ensemble? For this use case I don't really care what the node represents, just its index. I had a look at the source and could not quickly see anything obvious. I can see that I need to iterate over the trees and do something like this:
for sample in X_test:
    for tree in gbc.estimators_:
        leaf = tree.leaf_index(sample)  # This is the function I need but don't think exists.
        ...
Any pointers appreciated.
The following function goes beyond identifying the selected leaf from the decision tree and implements the application in the referenced paper. Its use is the same as in the referenced paper, where I use the GBC for feature engineering.
import numpy as np

def makeTreeBins(gbc, X):
    '''
    Takes in a GradientBoostingClassifier object (gbc) and a data frame (X).
    Returns a numpy array of dim (rows(X), num_estimators), where each row represents the set of terminal nodes
    that the record X[i] falls into across all estimators in the GBC.
    Note, each tree produces 2^max_depth terminal nodes. I append a prefix to the terminal node id in each incremental
    estimator so that I can use these as feature ids in other classifiers.
    '''
    for i, dt_i in enumerate(gbc.estimators_):
        prefix = (i + 2) * 100  # Must be an integer
        nds = prefix + dt_i[0].tree_.apply(np.array(X).astype(np.float32))
        if i == 0:
            nd_mat = nds.reshape(len(nds), 1)
        else:
            nd_mat = np.hstack((nd_mat, nds.reshape(len(nds), 1)))  # was hstack((nd, ...)), a NameError
    return nd_mat
DecisionTreeRegressor has a tree_ property which gives you access to the underlying decision tree. Its apply method returns the leaf id each sample ends up in:
dt.tree_.apply(X)
Note that apply expects its input to have dtype float32.
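Putting the two together for the loop sketched in the question: a minimal sketch, assuming gbc is a fitted binary GradientBoostingClassifier (so each row of gbc.estimators_ holds a single DecisionTreeRegressor) and X_test is a 2-D array:
import numpy as np

X32 = np.asarray(X_test, dtype=np.float32)  # apply() requires float32 input
# One column of leaf ids per boosting stage.
leaf_ids = np.column_stack([stage[0].tree_.apply(X32) for stage in gbc.estimators_])
# leaf_ids[i, j] is the index of the leaf that row i falls into in tree j.
From there, the paper's 1-of-K encoding is one pass through a one-hot encoder away.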