igraph invalid vertex Id - python-2.7

I'm trying to run igraph's fast greedy community detection algorithm using the following code:
G = Graph()
L = []
V = []
for row in cr:
    try:
        l = []
        source = int((row[0]).strip())
        target = int((row[1]).strip())
        weight = int((row[2]).strip())
        l.append(source)
        l.append(target)
        if l not in L:
            L.append(l)
        if source not in V:
            V.append(source)
        if target not in V:
            V.append(target)
    except ValueError:
        print "Value Error"
        continue
    if weight == 1:
        continue
G.add_vertices(max(V))
G.add_edges(L)
cl = G.community_fastgreedy(weights=weight).as_clustering(10)
But this is the error I'm getting:
igraph._igraph.InternalError: Error at type_indexededgelist.c:272: cannot add edges, Invalid vertex id
I found this: Cannot add edges, Invalid vertex ID in IGraph, so I tried adding all the vertices first and then all the edges, but I still get an error.
Does the above code do the same thing as:
tupleMapping = []
for row in cr:
    if int(row[2]) < 10:
        continue
    l = [row[0], row[1], row[2]]
    tupleMapping.append(tuple(l))
g = Graph.TupleList(tupleMapping)
cl = g.community_fastgreedy().as_clustering(20)
I don't have to explicitly say G.community_fastgreedy(weights=weight), right?
Another problem I was having: when I try to create more clusters in the following way:
cl = g.community_fastgreedy().as_clustering(10)
cl = g.community_fastgreedy().as_clustering(20)
I get two large clusters and the rest of the clusters consist of a single element each. This happens whether I ask for 5, 10, or 20 clusters. Is there any way for me to make the clusters more evenly divided? I need more than 2 clusters for my dataset.
This is a small snippet of the data I'm trying to read from the csv file so that I can generate a graph and then run the community detection algorithm:
202,580,11
87,153,7
227,459,6
263,524,11
Thanks.

That's right, the second piece of code does the same thing. In the first example, the problem is that when you add edges, you refer to igraph's internal vertex IDs, which always start from 0 and go up to N-1. It does not matter that your own vertex names are integers; you still need to translate them to igraph vertex IDs.
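A minimal sketch of that translation for your first approach might look like this (assuming V is the list of distinct vertex names and L is the edge list collected in your loop):
# Hypothetical fix for the first approach: map each original vertex
# name to a 0-based igraph vertex ID before adding edges.
name2vid = dict((name, vid) for vid, name in enumerate(V))
G = Graph()
G.add_vertices(len(V))   # creates vertex IDs 0 .. len(V)-1
G.vs['name'] = V         # keep the original names as a vertex attribute
G.add_edges([(name2vid[s], name2vid[t]) for s, t in L])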
The igraph.Graph.TupleList() method is much more convenient here. However, you need to specify that the third element of each tuple is the weight. You can do that either with the weights=True argument or with edge_attrs=['weight']:
import igraph
data = '''1;2;34
1;3;41
1;4;87
2;4;12
4;5;22
5;6;33'''
L = set([])
for row in data.split('\n'):
    row = row.split(';')
    L.add(
        (row[0].strip(), row[1].strip(), int(row[2].strip()))
    )
G = igraph.Graph.TupleList(L, edge_attrs = ['weight'])
You can then create dictionaries to translate between igraph vertex IDs and your original names:
vid2name = dict(zip(xrange(G.vcount()), G.vs['name']))
name2vid = dict((name, vid) for vid, name in vid2name.iteritems())
However, the first one is rarely needed, as you can always use G.vs[vid]['name'].
For fastgreedy, I think you should specify the weights explicitly; at least the documentation does not say whether it automatically uses an edge attribute named weight if one exists.
fg = G.community_fastgreedy(weights = 'weight')
fg_clust_10 = fg.as_clustering(10)
fg_clust_20 = fg.as_clustering(20)
If fastgreedy gives you only 2 large clusters, I can only recommend trying other community detection methods. You could actually try all of them that run within reasonable time (this depends on the size of your graph) and then compare their results, as sketched below. Also, because you have a weighted graph, you could take a look at the ModuLand method family, which is not implemented in igraph, but it has good documentation and allows quite sophisticated settings.
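As a rough sketch (assuming G is the weighted graph built above), you could run a few methods and compare their modularity scores:
# Compare a few community detection methods by modularity;
# higher modularity generally indicates stronger community structure.
methods = {
    'fastgreedy': G.community_fastgreedy(weights='weight').as_clustering(),
    'walktrap': G.community_walktrap(weights='weight').as_clustering(),
    'multilevel': G.community_multilevel(weights='weight'),
}
for name, clustering in methods.iteritems():
    print name, len(clustering), clustering.modularity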
Edit: The comments from the OP suggest that the original data describes a directed graph. The fastgreedy algorithm cannot take directions into account, and it raises an error if called on a directed graph. That's why in my example I created an undirected igraph.Graph() object. If you want to run other methods, some of which might be able to deal with directed networks, you should first create a directed graph:
G = igraph.Graph.TupleList(L, directed = True, edge_attrs = ['weight'])
G.is_directed()
# returns True
To run fastgreedy, convert the graph to undirected. As you have a weight attribute on the edges, you need to specify what igraph should do when two edges of opposite direction between the same pair of vertices are collapsed into one undirected edge. You can combine the weights in many ways, e.g. taking the mean, the larger, or the smaller one. For example, to give each combined edge the mean weight of the original edges:
uG = G.as_undirected(combine_edges = 'mean')
fg = uG.community_fastgreedy(weights = 'weight')
Important: be aware that during this operation, and also when you add or remove vertices or edges, igraph reindexes the vertices and edges. So if you know that vertex ID x corresponds to your original ID y, this is no longer valid after reindexing; you need to recreate the name2vid and vid2name dictionaries.
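For example, after the conversion above, the mappings can simply be rebuilt from uG:
# Rebuild the mappings after any operation that reindexes the graph:
vid2name = dict(zip(xrange(uG.vcount()), uG.vs['name']))
name2vid = dict((name, vid) for vid, name in vid2name.iteritems())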


Can I use vtkCellLocator to find intersected cells by line

I have a mesh grid and there is a line passing through several cells. I want to get all the cells which are intersected by the line.
I have the start and end points of the line, and I have the mesh vertex coordinates.
What is the fastest way to compute this? Is there a VTK class I could use?
You can check the VTK class vtkOBBTree and its method IntersectWithLine. The following example uses the vedo Python library, but it gives you an idea of how it works:
from vedo import dataurl, Mesh, Line, show
m = Mesh(dataurl+'bunny.obj')
l = Line([-0.15,0.1,0], [0.1,0.1,0], c='k', lw=3)
pts_cellids = m.intersectWithLine(l, returnIds=True)
print(pts_cellids)
show(m, l, axes=1)
You should get the cell IDs: [3245 1364]
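For reference, a minimal sketch with the raw VTK API might look like this (assuming mesh is a vtkPolyData and p1, p2 are the 3D endpoints of the line):
import vtk

# Build an OBB tree locator over the mesh.
obb = vtk.vtkOBBTree()
obb.SetDataSet(mesh)
obb.BuildLocator()

# Collect the intersection points and the ids of the intersected cells.
points = vtk.vtkPoints()
cell_ids = vtk.vtkIdList()
obb.IntersectWithLine(p1, p2, points, cell_ids)

for i in range(cell_ids.GetNumberOfIds()):
    print(cell_ids.GetId(i))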

Difficulty Understanding TensorFlow Computations

I'm new to TensorFlow and have difficulty understanding how the computations work. I could not find the answer to my question on the web.
For the following piece of code, the last time I print "d" in the for loop of the train_neural_net() function, I expect the values to be identical to those printed by test_distance.eval. But they are very different. Can anyone tell me why this is happening? Isn't TensorFlow supposed to cache the Variable results learned in the for loop and use them when I run test_distance.eval?
def neural_network_model1(data):
    nn1_hidden_1_layer = {'weights': tf.Variable(tf.random_normal([5, n_nodes_hl1])), 'biasses': tf.Variable(tf.random_normal([n_nodes_hl1]))}
    nn1_hidden_2_layer = {'weights': tf.Variable(tf.random_normal([n_nodes_hl1, n_nodes_hl2])), 'biasses': tf.Variable(tf.random_normal([n_nodes_hl2]))}
    nn1_output_layer = {'weights': tf.Variable(tf.random_normal([n_nodes_hl2, vector_size])), 'biasses': tf.Variable(tf.random_normal([vector_size]))}
    nn1_l1 = tf.add(tf.matmul(data, nn1_hidden_1_layer["weights"]), nn1_hidden_1_layer["biasses"])
    nn1_l1 = tf.sigmoid(nn1_l1)
    nn1_l2 = tf.add(tf.matmul(nn1_l1, nn1_hidden_2_layer["weights"]), nn1_hidden_2_layer["biasses"])
    nn1_l2 = tf.sigmoid(nn1_l2)
    nn1_output = tf.add(tf.matmul(nn1_l2, nn1_output_layer["weights"]), nn1_output_layer["biasses"])
    return nn1_output

def neural_network_model2(data):
    nn2_hidden_1_layer = {'weights': tf.Variable(tf.random_normal([5, n_nodes_hl1])), 'biasses': tf.Variable(tf.random_normal([n_nodes_hl1]))}
    nn2_hidden_2_layer = {'weights': tf.Variable(tf.random_normal([n_nodes_hl1, n_nodes_hl2])), 'biasses': tf.Variable(tf.random_normal([n_nodes_hl2]))}
    nn2_output_layer = {'weights': tf.Variable(tf.random_normal([n_nodes_hl2, vector_size])), 'biasses': tf.Variable(tf.random_normal([vector_size]))}
    nn2_l1 = tf.add(tf.matmul(data, nn2_hidden_1_layer["weights"]), nn2_hidden_1_layer["biasses"])
    nn2_l1 = tf.sigmoid(nn2_l1)
    nn2_l2 = tf.add(tf.matmul(nn2_l1, nn2_hidden_2_layer["weights"]), nn2_hidden_2_layer["biasses"])
    nn2_l2 = tf.sigmoid(nn2_l2)
    nn2_output = tf.add(tf.matmul(nn2_l2, nn2_output_layer["weights"]), nn2_output_layer["biasses"])
    return nn2_output
def train_neural_net():
    prediction1 = neural_network_model1(x1)
    prediction2 = neural_network_model2(x2)
    distance = tf.sqrt(tf.reduce_sum(tf.square(tf.subtract(prediction1, prediction2)), reduction_indices=1))
    cost = tf.reduce_mean(tf.multiply(y, distance))
    optimizer = tf.train.AdamOptimizer().minimize(cost)
    hm_epochs = 500
    test_result1 = neural_network_model1(x3)
    test_result2 = neural_network_model2(x4)
    test_distance = tf.sqrt(tf.reduce_sum(tf.square(tf.subtract(test_result1, test_result2)), reduction_indices=1))
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for epoch in range(hm_epochs):
            _, d = sess.run([optimizer, distance], feed_dict={x1: train_x1, x2: train_x2, y: train_y})
            print("Epoch", epoch, "distance", d)
        print("test distance", test_distance.eval({x3: train_x1, x4: train_x2}))

train_neural_net()
Each time you call the functions neural_network_model1() or neural_network_model2(), you create a new set of variables, so there are four sets of variables in total.
The call to sess.run(tf.global_variables_initializer()) initializes all four sets of variables.
When you train in the for loop, you only update the first two sets of variables, created with these lines:
prediction1 = neural_network_model1(x1)
prediction2 = neural_network_model2(x2)
When you evaluate with test_distance.eval(), the tensor test_distance depends only on the last two sets of variables, which were created with these lines:
test_result1 = neural_network_model1(x3)
test_result2 = neural_network_model2(x4)
These variables were never updated in the training loop, so the evaluation results will be based on the random initial values.
TensorFlow does include some code for sharing weights between multiple calls to the same function, using with tf.variable_scope(...): blocks. For more information on how to use these, see the tutorial on variables and sharing on the TensorFlow website.
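A minimal sketch of that pattern (with a hypothetical single-layer model, and assuming x1 and x3 are placeholders of compatible shape) could look like this:
def shared_model(data, scope, reuse=False):
    # tf.get_variable creates the variable on the first call and
    # reuses the existing one when reuse=True.
    with tf.variable_scope(scope, reuse=reuse):
        w = tf.get_variable('weights', shape=[5, n_nodes_hl1],
                            initializer=tf.random_normal_initializer())
        b = tf.get_variable('biases', shape=[n_nodes_hl1],
                            initializer=tf.random_normal_initializer())
        return tf.sigmoid(tf.matmul(data, w) + b)

train_out = shared_model(x1, 'net1')              # creates the variables
test_out = shared_model(x3, 'net1', reuse=True)   # reuses the same variables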
You don't need to define two functions for generating the models; you can use tf.name_scope and pass a model name to the function to use it as a prefix for the variable declarations. On the other hand, you defined two distance tensors: the first is distance and the second is test_distance. But your model learns from the training data to minimize cost, which is related only to the first distance tensor, so test_distance is never trained and the model connected to it never learns anything! Again, there is no need for two distance computations; you only need one. When you want to calculate the train distance, feed it with train data, and when you want to calculate the test distance, feed it with test data.
Anyway, if you want the second distance to work, you would have to declare another optimizer for it and train it as you did with the first one. Also consider that models learn based on their initial values and the training data: even if you feed both models exactly the same training batches, you cannot expect them to end up with exactly the same characteristics, since the initial weights differ, which can lead to different local minima of the error surface. Finally, note that whenever you call neural_network_model1 or neural_network_model2 you generate new weights and biases, because tf.Variable creates new variables for you.
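As a rough sketch of the single-function idea (hypothetical, using the layer sizes from the question):
def neural_network_model(data, model_name):
    # tf.name_scope prefixes the ops and tf.Variable names created
    # inside it, so one function can build several distinct models.
    with tf.name_scope(model_name):
        hidden_1 = {'weights': tf.Variable(tf.random_normal([5, n_nodes_hl1])),
                    'biases': tf.Variable(tf.random_normal([n_nodes_hl1]))}
        output = {'weights': tf.Variable(tf.random_normal([n_nodes_hl1, vector_size])),
                  'biases': tf.Variable(tf.random_normal([vector_size]))}
        l1 = tf.sigmoid(tf.add(tf.matmul(data, hidden_1['weights']), hidden_1['biases']))
        return tf.add(tf.matmul(l1, output['weights']), output['biases'])

prediction1 = neural_network_model(x1, 'model1')
prediction2 = neural_network_model(x2, 'model2')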

How to apply a mask to a DataFrame in Python?

My dataset, named ds_f, is an 840x57 matrix which contains NaN values. I want to forecast a variable with a linear regression model, but when I try to fit the model, I get the message "SVD did not converge":
X = ds_f[ds_f.columns[:-1]]
y = ds_f['target_o_tempm']
model = sm.OLS(y,X) #stackmodel
f = model.fit() #ERROR
So I've been searching for a way to apply a mask to a DataFrame. I was thinking of creating a mask to "ignore" NaN values and then converting it back into a DataFrame, but I get the same DataFrame as ds_f; nothing changes:
m = ma.masked_array(ds_f, np.isnan(ds_f))
m_ds_f = pd.DataFrame(m,columns=ds_f.columns)
EDIT: I've solved the problem by writing model = sm.OLS(y, X, missing='drop'), but a new problem appears: when I display the results, I get only NaN.
Are you using statsmodels? If so, you could specify sm.OLS(y, X, missing='drop'), to drop the NaN values prior to estimation.
Alternatively, you may want to consider interpolating the missing values, rather than dropping them.
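For instance (assuming ds_f, X, and y as defined in the question):
import statsmodels.api as sm

# Drop rows containing NaN at estimation time:
model = sm.OLS(y, X, missing='drop')
results = model.fit()
print(results.summary())

# Alternatively, fill the gaps before fitting, e.g. by linear interpolation:
ds_f_interp = ds_f.interpolate()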

Top n outliers in ResultWriter

I am dealing with a high-dimensional and large dataset, so I need to get just the top N outliers from the output of ResultWriter.
Is there an option in ELKI to get only the top N outliers from this output?
The ResultWriter is some of the oldest code in ELKI, and it needs to be rewritten. It is rather generic: it tries to figure out how to best serialize output as text.
If you want some specific format, or a specific subset, the proper way is to write your own ResultHandler. There is a tutorial for writing a ResultHandler.
If you want to find the input coordinates in the result,
Database db = ResultUtil.findDatabase(baseResult);
Relation<NumberVector> rel = db.getRelation(TypeUtil.NUMBER_VECTOR_VARIABLE_LENGTH);
will return the first relation containing numeric vectors.
To iterate over the objects sorted by their outlier score, use:
OrderingResult order = outlierResult.getOrdering();
DBIDs ids = order.order(order.getDBIDs());
for (DBIDIter it = ids.iter(); it.valid(); it.advance()) {
    // Output as desired.
}

Can you get the selected leaf from a DecisionTreeRegressor in scikit-learn

Just reading this great paper and trying to implement this:

... We treat each individual tree as a categorical feature that takes as value the index of the leaf an instance ends up falling in. We use 1-of-K coding of this type of features. For example, consider the boosted tree model in Figure 1 with 2 subtrees, where the first subtree has 3 leafs and the second 2 leafs. If an instance ends up in leaf 2 in the first subtree and leaf 1 in the second subtree, the overall input to the linear classifier will be the binary vector [0, 1, 0, 1, 0], where the first 3 entries correspond to the leaves of the first subtree and the last 2 to those of the second subtree ...
Does anyone know how I can predict a bunch of rows and, for each of those rows, get the selected leaf for each tree in the ensemble? For this use case I don't really care what the node represents, just its index. I had a look at the source and could not quickly see anything obvious. I can see that I need to iterate over the trees and do something like this:
for sample in X_test:
    for tree in gbc.estimators_:
        leaf = tree.leaf_index(sample)  # This is the function I need but don't think exists.
        ...
Any pointers appreciated.
The following function goes beyond identifying the selected leaf from the Decision Tree and implements the application in the referenced paper. Its use is the same as the referenced paper, where I use the GBC for feature engineering.
def makeTreeBins(gbc, X):
    '''
    Takes in a GradientBoostingClassifier object (gbc) and a data frame (X).
    Returns a numpy array of dim (rows(X), num_estimators), where each row represents the set of terminal nodes
    that the record X[i] falls into across all estimators in the GBC.
    Note, each tree produces up to 2^max_depth terminal nodes. I append a prefix to the terminal node id in each
    incremental estimator so that I can use these as feature ids in other classifiers.
    '''
    for i, dt_i in enumerate(gbc.estimators_):
        prefix = (i + 2) * 100  # Must be an integer
        nds = prefix + dt_i[0].tree_.apply(np.array(X).astype(np.float32))
        if i == 0:
            nd_mat = nds.reshape(len(nds), 1)
        else:
            nd_mat = np.hstack((nd_mat, nds.reshape(len(nds), 1)))
    return nd_mat
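A hypothetical usage, following the paper's 1-of-K encoding (assuming gbc is a fitted GradientBoostingClassifier and X_test holds the rows to transform):
from sklearn.preprocessing import OneHotEncoder

# Leaf indices per estimator, then one-hot encoded for a linear model.
leaf_ids = makeTreeBins(gbc, X_test)
X_leaves = OneHotEncoder().fit_transform(leaf_ids)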
DecisionTreeRegressor has a tree_ property which gives you access to the underlying decision tree. It has a method apply, which seemingly finds the corresponding leaf id:
dt.tree_.apply(X)
Note that apply expects its input to have type float32.
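For instance, a small sketch (with hypothetical X_train, y_train, and X_test):
import numpy as np
from sklearn.tree import DecisionTreeRegressor

dt = DecisionTreeRegressor(max_depth=3).fit(X_train, y_train)
# Leaf id each test row ends up in; apply expects float32 input.
leaf_ids = dt.tree_.apply(np.asarray(X_test, dtype=np.float32))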