First, thanks for reading and possibly responding to this.
Now, the question:
I am on python 2.7, and I am getting this error when attempting to find communities in my graph using the fastgreedy algorithm:
---------------------------------------------------------------------------
InternalError Traceback (most recent call last)
<ipython-input-180-3b8456851658> in <module>()
----> 1 dendrogram = g_summary.community_fastgreedy(weights=edge_frequency.values())
/usr/local/lib/python2.7/site-packages/igraph/__init__.pyc in community_fastgreedy(self, weights)
959 in very large networks. Phys Rev E 70, 066111 (2004).
960 """
--> 961 merges, qs = GraphBase.community_fastgreedy(self, weights)
962
963 # qs may be shorter than |V|-1 if we are left with a few separated
InternalError: Error at fast_community.c:553: fast-greedy community finding works only on graphs without multiple edges, Invalid value
This is how I created my graph:
import igraph as ig
import numpy as np
from itertools import combinations
vertices = words  # about 600 words from a number of news articles: ['palestine', 'israel', 'hamas', 'nasa', 'mercury', 'water', ...]
gen = ig.UniqueIdGenerator()
[gen[word] for word in vertices]  # generate word-to-integer mapping as each edge has to be between integer ids (words)
edges = []
for ind in xrange(articles.shape[0]):  # articles is a pandas dataframe; each row corresponds to an article; one column is 'top_words' which includes the top few words of each article. The above list *words* is the unique union set of top_words for all articles.
    words_i = articles['top_words'].values[ind]  # for one article, this looks like ['palestine','israel','hamas']
    edges.extend([(gen[x[0]],gen[x[1]]) for x in combinations(words_i,2)])  # basically there is an edge for each pair of top_words in a given article. For the example article above, we get edges between israel-palestine, israel-hamas, palestine-hamas.
unique_edges = list(set(edges))
unique_edge_frequency = {}
for e in unique_edges:
    unique_edge_frequency[e] = edges.count(e)
g = ig.Graph(vertex_attrs={"label": vertices}, edges=unique_edges, directed=False)
g.es['width'] = np.asarray([unique_edge_frequency[e] for e in unique_edge_frequency.keys()])*1.0/max(unique_edge_frequency.values())
And this is what throws the error:
dendrogram = g.community_fastgreedy(weights=g.es['width'])
What am I doing wrong?
Your graph contains multiple edges (i.e. more than one edge between the same pair of nodes). Deduplicating with set(edges) most likely does not prevent this, because the tuples (a, b) and (b, a) are distinct in Python but describe the same edge in an undirected graph. The fast greedy community detection won't work on such graphs; you have to collapse the multiple edges into single ones with g.simplify().
It also seems like you are trying to set the "width" attribute of the edges based on how many edges there are between the same pair of vertices. Instead of constructing unique_edges and then unique_edge_frequency, you can simply do this:
g = ig.Graph(edges=edges, directed=False)
g.es["width"] = 1
g.simplify(combine_edges={ "width": "sum" })
This will simply create a graph with multiple edges first, then assign a width of 1 to each edge, and finally collapse the multiple edges into single ones while summing up their widths.
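To make the collapsing step concrete, here is a minimal pure-Python sketch of the same idea. It is not igraph's implementation; the helper name collapse_multi_edges is hypothetical, and it assumes undirected edges given as pairs of integer ids. Each pair is normalised so that (a, b) and (b, a) count as the same edge, and a unit width is summed per occurrence.

```python
from collections import Counter


def collapse_multi_edges(edges):
    """Merge parallel undirected edges, summing a width of 1 per occurrence."""
    widths = Counter()
    for a, b in edges:
        if a == b:
            continue  # skip self-loops, which simplify() also removes by default
        # Normalise the pair so (a, b) and (b, a) map to the same key.
        widths[(min(a, b), max(a, b))] += 1
    return dict(widths)


edges = [(0, 1), (1, 0), (1, 2), (0, 1)]
print(collapse_multi_edges(edges))
```

The resulting widths play the role of g.es["width"] after simplify(), and those are the values you would then pass as weights to community_fastgreedy.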
I am trying to combine multiple survfit objects on the same plot, using the function ggsurvplot_combine from the package survminer. When I make a list of 2 survfit objects, it works perfectly. But when I combine 3 survfit objects in different ways, I receive the error:
Error in `levels<-`(`*tmp*`, value = as.character(levels)) : factor level [3] is duplicated
I've read similar posts on combining survival plots (https://cran.r-project.org/web/packages/survminer/survminer.pdf, https://github.com/kassambara/survminer/issues/195, R plotting multiple survival curves in the same plot, https://rpkgs.datanovia.com/survminer/reference/ggsurvplot_combine.html) and on this specific error, for which solutions using 'unique' have been provided. However, I do not even understand which factor variable this error refers to. I do not have the right to share my data or figures, so I'll try to replicate it:
Data:
time: follow-up time until event or end of follow-up
endpoints: 1= event, 0=no event or censor
Null models:
KM1 <- survfit(Surv(data$time1,data$endpoint1)~1,
type="kaplan-meier", conf.type="log", data=data)
KM2 <- survfit(Surv(data$time2,data$endpoint2)~1, type="kaplan-meier",
conf.type="log", data=data)
KM3 <- survfit(Surv(data$time3,data$endpoint3)~1, type="kaplan-meier",
conf.type="log", data=data)
List null models:
list_that_works <- list(KM1,KM3)
list_that_fails <- list(KM1,KM2,KM3)
It seems as if the list consists of just two arguments: list(PFS=, OS=)
Combine >2 null models in one plot:
ggsurvplot_combine(list_that_works, data=data, conf.int=TRUE, fun="event", combine=TRUE)
This gives the plot I'm looking for, but with 2 cumulative incidence curves.
ggsurvplot_combine(list_that_fails, data=data, conf.int=TRUE, fun="event", combine=TRUE)
This gives the error 'Error in `levels<-`(`*tmp*`, value = as.character(levels)) : factor level [3] is duplicated'.
When I try combining 3 plots using
ggsurvplot(c(KM1,KM2,KM3), data=data, conf.int=TRUE, fun="event", combine=TRUE), it gives the error:
Error: Problem with `mutate()` column `survsummary`.
survsummary = purrr::map2(grouped.d$fit, grouped.d$name, .surv_summary, data = data). x $ operator is invalid for atomic vectors.
Any help is highly appreciated!
Also another way to combine surv fits is very welcome!
My best bet is that it has something to do with the 'list' function, which consists of only two arguments: list(PFS=, OS=)
I fixed it! Instead of removing the post, I'll share my solution, as it may help others:
I made a list of the formulas instead of the null models, so:
formulas <- list(
KM1 = Surv(time1, endpoint1)~1,
KM2 = Surv(time2, endpoint2)~1,
KM3 = Surv(time3, endpoint3)~1)
I made a null model of the 3 formulas at once:
fit <- surv_fit(formulas, data=data)
Then I made a plot with this survival fit:
ggsurvplot_combine(fit, data=data)
I'm trying to loop through a number of text documents and create a feature set by recording:
Position list in text
Part of speech of keyphrase
Length of each keyphrase (number of words in it)
Frequency of each keyphrase
Code snippet for extracting features:
#Take list of Keywords
keyword_list = [line.split(':')[1].lower().strip() for line in keywords.splitlines() if ':' in line ]
#Position
position_list = [ [m.start()/float(len(document)) for m in re.finditer(re.escape(kw),document,flags=re.IGNORECASE)] for kw in keyword_list]
#Part of Speech
pos_list = []
for key in keyword_list:
    pos_list.append([pos for w,pos in nltk.pos_tag(nltk.word_tokenize(key))])
#Length of each keyword
len_list = [ len(k.split(' ')) for k in keyword_list]
#Text Frequency
freq_list = [ len(pos)/float(len(document)) for pos in position_list]
target.extend(keyword_list)
for i in range(0,len(keyword_list)):
    data.append([position_list[i],pos_list[i],len_list[i],freq_list[i]])
Where
target : list of keywords
data : list of features
I passed this through a classifier :
from sklearn.cross_validation import train_test_split
X_train,X_test,y_train,y_test = train_test_split(data,target,test_size=0.25,random_state = 42)
import numpy as np
X_train = np.array(X_train)
y_train = np.array(y_train)
from sklearn import svm
cls = svm.SVC(gamma=0.001,C=100) # Parameter values Matter!
cls.fit(X_train,y_train)
predictions = cls.predict(X_test)
But I get an error :
Traceback (most recent call last):
File "supervised_3.py", line 113, in <module>
cls.fit(X_train,y_train)
File "/Library/Python/2.7/site-packages/sklearn/svm/base.py", line 150, in fit
X = check_array(X, accept_sparse='csr', dtype=np.float64, order='C')
File "/Library/Python/2.7/site-packages/sklearn/utils/validation.py", line 373, in check_array
array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: setting an array element with a sequence
So, I removed all the list items by changing
data.append([position_list[i],pos_list[i],len_list[i],freq_list[i]])
to
data.append([len_list[i],freq_list[i]])
It worked.
But I need to include position_list and pos_list
I thought it wasn't working because these 2 are lists. So, I tried converting them to arrays :
data.append([np.array(position_list[i]),np.array(pos_list[i]),len_list[i],freq_list[i]])
but I still get the same error.
In the last for loop of the feature extraction code you are trying to append to data a list of four elements, namely position_list[i], pos_list[i], len_list[i], freq_list[i]. The problem is that the first two elements are lists themselves, but individual features have to be scalars (this is why the issue is not solved by converting the sublists to numpy arrays). Each of them requires a different workaround:
position_list[i]
This is a list of float numbers. You could replace this list by some statistics computed from it, for example the mean and the standard deviation.
pos_list[i]
This is a list of tags extracted from the list of tuples of the form (token, tag)* yielded by nltk.pos_tag. The tags (which are strings) can be converted into numbers in a straightforward way by counting their number of occurrences. To keep things simple, I will just add the frequency of 'NN' and 'NNS' tags**.
To get your code working you just need to change the last for loop to:
for i in range(0, len(keyword_list)):
    positions_i = position_list[i]
    tags_i = pos_list[i]
    len_tags_i = float(len(tags_i))
    m = np.mean(positions_i)
    s = np.std(positions_i)
    nn = tags_i.count('NN')/len_tags_i
    nns = tags_i.count('NNS')/len_tags_i
    data.append([m, s, nn, nns, len_list[i], freq_list[i]])
By doing so the resulting feature vector becomes 6-dimensional. Needless to say, you could use a higher or lower number of statistics and/or tag frequencies, or even a different tagset.
* The identifiers w,pos you use in the for loop that creates pos_list are a bit misleading.
** You could utilize a collections.Counter to count the number of occurrences of each tag more efficiently.
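As a rough illustration of the collections.Counter suggestion, the sketch below flattens one keyphrase's data into a fixed-length list of floats using only the standard library. The function name keyphrase_features, the tagset parameter and the choice to use zeros when a keyphrase has no positions are assumptions made for this example, not part of the original code.

```python
from collections import Counter
import math


def keyphrase_features(positions, tags, n_words, freq, tagset=("NN", "NNS")):
    """Turn variable-length positions and tags into a fixed-length feature list."""
    if positions:
        mean = sum(positions) / float(len(positions))
        # Population standard deviation, the same default as np.std.
        std = math.sqrt(sum((p - mean) ** 2 for p in positions) / float(len(positions)))
    else:
        mean = std = 0.0  # assumption: a keyphrase never found in the text gets zeros
    counts = Counter(tags)
    total = float(len(tags)) or 1.0  # avoid dividing by zero for an empty tag list
    tag_freqs = [counts[t] / total for t in tagset]
    return [mean, std] + tag_freqs + [float(n_words), float(freq)]


row = keyphrase_features([0.1, 0.3], ["NN", "NNS", "NN"], 3, 0.002)
print(row)
```

Every row has the same length (2 + len(tagset) + 2), so a list of such rows converts cleanly to a 2-D float array for the classifier. Extending tagset adds one column per tag without touching the rest of the code.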
I am learning Python programming and machine learning for my academic project, and I became interested in number plate recognition.
When I execute the code below, I get the error shown after the code.
values=['0','1','2','3','4','5','6','7','8','9','A','B','C','D','E','F','G','H','J','K','L','M','N','P','R','S','T','U','V','W','X','Z']
keys=range(32)
data_map=dict(zip(keys, values))
def get_ann(data_map):
    feature_mat=[]
    label_mat=[]
    for keys in data_map:
        path_train="/home/sagar/Project data set/ANPR/ann/%s"%data_map[keys]
        filenames=get_imlist(path_train)
        perfeature_mat=[]
        perlabel_mat=[]
        for image in filenames[0]:
            raw_image=cv2.imread(image)
            raw_image=cv2.cvtColor(raw_image, cv2.COLOR_BGR2GRAY)
            #resize the image into 5 cols(width) and 10 rows(height)
            raw_image=cv2.resize(raw_image,(5,10), interpolation=cv2.INTER_AREA)
            #Do a hard thresholding.
            _,th2=cv2.threshold(raw_image, 70, 255, cv2.THRESH_BINARY)
            #generate features
            horz_hist=np.sum(th2==255, axis=0)
            vert_hist=np.sum(th2==255, axis=1)
            sample=th2.flatten()
            #concatenate these features together
            feature=np.concatenate([horz_hist, vert_hist, sample])
            # append these features together along with their respective labels
            perfeature_mat.append(feature)
            perlabel_mat.append(keys)
        feature_mat.append(perfeature_mat)
        label_mat.append(perlabel_mat)
    # These are the final product.
    bigfeature_mat=np.vstack(feature_mat)
    biglabel_mat=np.hstack(label_mat)
    # As usual. We need to convert them into double type for Shogun.
    bigfeature_mat=np.array(bigfeature_mat, dtype='double')
    biglabel_mat=np.array(biglabel_mat, dtype='double')
    #shogun works in a way in which columns are samples and rows are features.
    #Hence we need to transpose the observation matrix
    obs_matrix=bigfeature_mat.T
    #convert the observation matrix and the labels into Shogun RealFeatures and MulticlassLabels structures resp. .
    sg_features=RealFeatures(obs_matrix)
    sg_labels=MulticlassLabels(biglabel_mat)
    #initialize a simple ANN in Shogun with one hidden layer.
    layers=DynamicObjectArray()
    layers.append_element(NeuralInputLayer(65))
    layers.append_element(NeuralLogisticLayer(65))
    layers.append_element(NeuralSoftmaxLayer(32))
    net=NeuralNetwork(layers)
    net.quick_connect()
    net.initialize()
    net.io.set_loglevel(MSG_INFO)
    net.l1_coefficient=3e-4
    net.epsilon = 1e-6
    net.max_num_epochs = 600
    net.set_labels(sg_labels)
    net.train(sg_features)
    return net
The errors:
AttributeError Traceback (most recent call last)
<ipython-input-28-30225c91fe73> in <module>()
----> 1 net=get_ann(data_map)
<ipython-input-27-809f097ce563> in get_ann(data_map)
59 net=NeuralNetwork(layers)
60 net.quick_connect()
---> 61 net.initialize()
62
63 net.io.set_loglevel(MSG_INFO)
AttributeError: 'NeuralNetwork' object has no attribute 'initialize'
Platform used: Ubuntu 14.04, Python 2.7, opencv-2.4.9, iPython notebook and shogun toolbox.
Can anyone please help me resolve this error? Thanks in advance.
The other code samples are as follows, which have been executed before the above code.
from modshogun import *
def get_vstacked_data(path):
    filenames=np.array(get_imlist(path))
    #read the image
    #convert the image into grayscale.
    #change its data-type to double.
    #flatten it
    vmat=[]
    for i in range(filenames[0].shape[0]):
        temp=cv2.imread(filenames[0][i])
        temp=cv2.cvtColor(temp, cv2.COLOR_BGR2GRAY)
        temp=cv2.equalizeHist(temp)
        temp=np.array(temp, dtype='double')
        temp=temp.flatten()
        vmat.append(temp)
    vmat=np.vstack(vmat)
    return vmat
def get_svm():
    #set path for positive training images
    path_train='/home/sagar/resized/'
    pos_trainmat=get_vstacked_data(path_train)
    #set path for negative training images
    path_train='/home/sagar/rezize/'
    neg_trainmat=get_vstacked_data(path_train)
    #form the observation matrix
    obs_matrix=np.vstack([pos_trainmat, neg_trainmat])
    #shogun works in a way in which columns are samples and rows are features.
    #Hence we need to transpose the observation matrix
    obs_matrix=obs_matrix.T
    #get the labels. Positive training images are marked with +1 and negative with -1
    labels=np.ones(obs_matrix.shape[1])
    labels[pos_trainmat.shape[0]:obs_matrix.shape[1]]*=-1
    #convert the observation matrix and the labels into Shogun RealFeatures and BinaryLabels structures resp. .
    sg_features=RealFeatures(obs_matrix)
    sg_labels=BinaryLabels(labels)
    #Initialise a basic LibSVM in Shogun.
    width=2
    #kernel=GaussianKernel(sg_features, sg_features, width)
    kernel=LinearKernel(sg_features, sg_features)
    C=1.0
    svm=LibSVM(C, kernel, sg_labels)
    _=svm.train()
    _=svm.apply(sg_features)
    return svm
ocr classification
def validate_ann(cnt):
    rect=cv2.minAreaRect(cnt)
    box=cv2.cv.BoxPoints(rect)
    box=np.int0(box)
    output=False
    width=rect[1][0]
    height=rect[1][1]
    if ((width!=0) & (height!=0)):
        if (((height/width>1.12) & (height>width)) | ((width/height>1.12) & (width>height))):
            if((height*width<1700) & (height*width>100)):
                if((max(width, height)<64) & (max(width, height)>35)):
                    output=True
    return output
This is probably a method deprecation issue in Shogun: the installed version does not seem to provide initialize() any more.
Try replacing:
net.initialize()
With:
net.initialize_neural_network()
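If the same notebook has to run against more than one Shogun version, one option is a small shim that calls whichever of the two method names exists. This is only a sketch: initialize_network is a hypothetical helper, and DummyNet is a stand-in class used so the example runs without Shogun installed.

```python
def initialize_network(net):
    """Call whichever initialisation method the given network object provides."""
    for name in ("initialize_neural_network", "initialize"):
        method = getattr(net, name, None)
        if callable(method):
            return method()
    raise AttributeError("network object has no known initialisation method")


class DummyNet(object):
    """Hypothetical stand-in for a Shogun NeuralNetwork, used only for this demo."""

    def __init__(self):
        self.ready = False

    def initialize_neural_network(self):
        self.ready = True


net = DummyNet()
initialize_network(net)
print(net.ready)
```

In get_ann, the call net.initialize() would then become initialize_network(net).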
I am trying to calculate the perplexity for the data I have. The code I am using is:
import sys
sys.path.append("/usr/local/anaconda/lib/python2.7/site-packages/nltk")
from nltk.corpus import brown
from nltk.model import NgramModel
from nltk.probability import LidstoneProbDist, WittenBellProbDist
estimator = lambda fdist, bins: LidstoneProbDist(fdist, 0.2)
lm = NgramModel(3, brown.words(categories='news'), True, False, estimator)
print lm
But I am receiving the error,
File "/usr/local/anaconda/lib/python2.7/site-packages/nltk/model/ngram.py", line 107, in __init__
cfd[context][token] += 1
TypeError: 'int' object has no attribute '__getitem__'
I have already performed Latent Dirichlet Allocation for the data I have and I have generated the unigrams and their respective probabilities (they are normalized as the sum of total probabilities of the data is 1).
My unigrams and their probability looks like:
Negroponte 1.22948976891e-05
Andreas 7.11290670484e-07
Rheinberg 7.08255885794e-07
Joji 4.48481435106e-07
Helguson 1.89936727391e-07
CAPTION_spot 2.37395965468e-06
Mortimer 1.48540253778e-07
yellow 1.26582575863e-05
Sugar 1.49563800878e-06
four 0.000207196011781
This is just a fragment of the unigrams file I have. The same format is followed for thousands of lines. The probabilities (second column) sum to 1.
I am a budding programmer. This ngram.py belongs to the nltk package and I am confused as to how to rectify this. The sample code I have here is from the nltk documentation and I don't know what to do now. Please help on what I can do. Thanks in advance!
Perplexity is the inverse probability of the test set, normalized by the number of words. In the case of unigrams, for a test set W = w_1 w_2 ... w_N:
PP(W) = \left( \prod_{i=1}^{N} \frac{1}{P(w_i)} \right)^{1/N}
Now you say you have already constructed the unigram model, meaning, for each word you have the relevant probability. Then you only need to apply the formula. I assume you have a big dictionary unigram[word] that would provide the probability of each word in the corpus. You also need to have a test set. If your unigram model is not in the form of a dictionary, tell me what data structure you have used, so I could adapt it to my solution accordingly.
perplexity = 1
N = 0
for word in testset:
    if word in unigram:
        N += 1
        perplexity = perplexity * (1/unigram[word])
perplexity = pow(perplexity, 1/float(N))
UPDATE:
As you asked for a complete working example, here's a very simple one.
Suppose this is our corpus:
corpus ="""
Monty Python (sometimes known as The Pythons) were a British surreal comedy group who created the sketch comedy show Monty Python's Flying Circus,
that first aired on the BBC on October 5, 1969. Forty-five episodes were made over four series. The Python phenomenon developed from the television series
into something larger in scope and impact, spawning touring stage shows, films, numerous albums, several books, and a stage musical.
The group's influence on comedy has been compared to The Beatles' influence on music."""
Here's how we construct the unigram model first:
import collections, nltk
# we first tokenize the text corpus
tokens = nltk.word_tokenize(corpus)
#here you construct the unigram language model
def unigram(tokens):
    model = collections.defaultdict(lambda: 0.01)
    for f in tokens:
        try:
            model[f] += 1
        except KeyError:
            model[f] = 1
            continue
    N = float(sum(model.values()))
    for word in model:
        model[word] = model[word]/N
    return model
Our model here is smoothed. For words outside the scope of its knowledge, it assigns a low probability of 0.01. I already told you how to compute perplexity:
#computes perplexity of the unigram model on a testset
def perplexity(testset, model):
    testset = testset.split()
    perplexity = 1
    N = 0
    for word in testset:
        N += 1
        perplexity = perplexity * (1/model[word])
    perplexity = pow(perplexity, 1/float(N))
    return perplexity
Now we can test this on two different test sets:
testset1 = "Monty"
testset2 = "abracadabra gobbledygook rubbish"
model = unigram(tokens)
print perplexity(testset1, model)
print perplexity(testset2, model)
for which you get the following result:
>>>
49.09452736318415
99.99999999999997
Note that when dealing with perplexity, we try to reduce it. A language model that has less perplexity with regards to a certain test set is more desirable than one with a bigger perplexity. In the first test set, the word Monty was included in the unigram model, so the respective number for perplexity was also smaller.
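One practical caveat with the product form: on a long test set, multiplying many values of 1/p can overflow to infinity in floating point. A common way around this is to sum log probabilities instead. The sketch below does that and also reads a "word probability" file like the one in the question; the names load_unigrams, log_perplexity and unknown_prob are made up for this example, and the two-word model is toy data.

```python
import math


def load_unigrams(lines):
    """Parse 'word probability' lines into a dict mapping word -> float."""
    model = {}
    for line in lines:
        parts = line.split()
        if len(parts) == 2:
            model[parts[0]] = float(parts[1])
    return model


def log_perplexity(words, model, unknown_prob=None):
    """Perplexity computed as exp of the negative mean log probability.

    Words missing from the model are skipped unless unknown_prob is given.
    """
    log_sum = 0.0
    n = 0
    for word in words:
        p = model.get(word, unknown_prob)
        if p is None:
            continue
        log_sum += math.log(p)
        n += 1
    if n == 0:
        raise ValueError("no scorable words in the test set")
    return math.exp(-log_sum / n)


model = load_unigrams(["yellow 0.5", "four 0.125"])
print(log_perplexity(["yellow", "four", "unseen"], model))
```

Mathematically this is the same quantity as the product formula, since the N-th root of a product equals the exponential of the mean of the logs.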
Thanks for the code snippet! Shouldn't:
for word in model:
    model[word] = model[word]/float(sum(model.values()))
rather be:
v = float(sum(model.values()))
for word in model:
    model[word] = model[word]/v
Oh ... I see this was already answered ...
I have a map with a lot of markers on it, and I have two non-intersecting polygons (boxes). I want to get all markers that are covered by these polygons.
qb_1 = Polygon.from_bbox((-35.19153, -5.84512, -35.24054, -5.78552))
qb_2 = Polygon.from_bbox((64.16016, 50.26125, 61.80359, 52.04911))
q_box = MultiPolygon(qb_1, qb_2)
test1 = Marker.objects.filter(point__contained=qb_1)
test2 = Marker.objects.filter(point__contained=qb_2)
test = Marker.objects.filter(point__contained=q_box)
print "Count of Polygon 1 = %s" % test1.count()
print "Count of Polygon 2 = %s" % test2.count()
print "Count of MultiPolygon = %s" % test.count()
But the results are:
Count of Polygon 1 = 4
Count of Polygon 2 = 12
Count of MultiPolygon = 237
Why is Polygon 1 + Polygon 2 not equal to MultiPolygon?
The secret lies in the words "bounding box" in the description of this lookup (from the GeoQuerySet documentation):
contained
Availability: PostGIS, MySQL, SpatiaLite
Tests if the geometry field’s bounding box is completely contained by the lookup geometry’s bounding box
The two polygons you have created happen to have small areas, and the multipolygon you have created also has a small area, but the same cannot be said about its bounding box.
qb_1.envelope.area # 0.0029209960000001417
qb_2.envelope.area # 4.213217240200014
q_box.envelope.area # 5754.726987961
As you can see, the last one is huge in comparison, and it covers a lot more points than the two polygons taken on their own. Thus the whole is greater than the sum of its parts.
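The envelope arithmetic can be reproduced without GeoDjango. The sketch below treats each box as a plain (x0, y0, x1, y1) tuple with the coordinates from the question and computes the area of the smallest axis-aligned rectangle enclosing one or both boxes; bbox_area is a hypothetical helper, not a GEOS or Django function.

```python
def bbox_area(boxes):
    """Area of the smallest axis-aligned rectangle enclosing all given boxes."""
    xs = [x for box in boxes for x in (box[0], box[2])]
    ys = [y for box in boxes for y in (box[1], box[3])]
    return (max(xs) - min(xs)) * (max(ys) - min(ys))


qb_1 = (-35.19153, -5.84512, -35.24054, -5.78552)
qb_2 = (64.16016, 50.26125, 61.80359, 52.04911)

print(bbox_area([qb_1]))        # envelope of the first box alone
print(bbox_area([qb_2]))        # envelope of the second box alone
print(bbox_area([qb_1, qb_2]))  # envelope spanning both boxes
```

The combined envelope stretches from one box to the other (roughly 99 by 58 degrees), which is why a bounding-box lookup against the MultiPolygon matches so many more markers.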
You should be able to get the actual points covered by the two polygons as follows:
from django.db.models import Q
Marker.objects.filter(Q(point__contained=qb_1) | Q(point__contained=qb_2))
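But perhaps an exact spatial test, rather than a bounding-box comparison, is what you are really looking for? Note that contains and contains_properly (the latter is available only in PostGIS) test whether the geometry field contains the lookup geometry, which is the wrong direction for a point field and a polygon. The lookup for this direction is within.
within
Availability: PostGIS, Oracle, MySQL, SpatiaLite
Tests if the geometry field is spatially within the lookup geometry.
Then your query becomes
Marker.objects.filter(Q(point__within=qb_1) | Q(point__within=qb_2))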