DL4J transition probability matrix in Word2vec

In DL4J, how can I calculate the transition probability between two words (e.g., in word2vec), which leads to a transition probability matrix over a set of words?

Related

Silhouette value increasing while the number of clusters increases

I have a matrix in which the rows are brands and the columns are the features of each brand.
First, I calculate the affinity matrix with scikit-learn and then apply spectral clustering on the affinity matrix to do the clustering.
When I calculate the silhouette value for each number of clusters, the silhouette value keeps increasing as the number of clusters increases.
Eventually, when the number of clusters gets large enough, the silhouette calculation returns NaN.
# coding: utf-8
import pandas as pd
import sklearn.cluster as sk
from sklearn.cluster import SpectralClustering
from sklearn.metrics import silhouette_score

# Load the brand/feature matrix (rows = brands, columns = features).
data_event = pd.read_csv(r'\Data\data_of_events.csv', header=0, index_col=0, parse_dates=True)
feature_columns = ['Furniture', 'Food & Drinks', 'Technology', 'Architecture', 'Show',
                   'Fashion', 'Travel', 'Art', 'Graphics', 'Product Design']
data_event_matrix = data_event[feature_columns].values

# Compute the affinity matrix.
data_event_affinitymatrix = SpectralClustering().fit(data_event_matrix).affinity_matrix_

# Cluster for an increasing number of clusters and report the silhouette score.
for n_clusters in range(2, 100, 2):
    labels = sk.spectral_clustering(data_event_affinitymatrix, n_clusters=n_clusters,
                                    n_components=None, eigen_solver=None, random_state=None,
                                    n_init=10, eigen_tol=0.0, assign_labels='kmeans')
    silhouette_avg = silhouette_score(data_event_affinitymatrix, labels)
    print("For n_clusters =", n_clusters,
          "the average silhouette_score of event clustering is:", silhouette_avg)
If your intention is to find the optimal number of clusters, you can try the Elbow method. Multiple variations of this method exist, but the main idea is this: for different values of K (the number of clusters, say 1 to 8), compute a cost function that is appropriate for your application (for example, the sum of squared distances from all points in a cluster to its centroid, or any other error/cost/variance function). If it is a distance-based function, you will notice that beyond a certain number of clusters the differences along the y-axis become negligible. In the plot of the number of clusters (x-axis) against your metric (y-axis), choose the value of K at the point where the y-axis value changes abruptly (the "elbow").
In the example elbow plot referenced here, the optimal value of K is 4 (image source: Wikipedia).
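As a concrete illustration, here is a minimal scikit-learn sketch of the Elbow method, using k-means inertia as the cost function and toy make_blobs data standing in for your brand/feature matrix (both are my own choices for illustration):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data standing in for the brand/feature matrix.
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Cost for each K: inertia = sum of squared distances to the nearest centroid.
ks = range(1, 9)
costs = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

# Look for the "elbow": the K after which the cost stops dropping sharply.
for k, cost in zip(ks, costs):
    print(k, cost)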
Another measure that you can use to validate your clusters is the V-measure score. It is a symmetric measure, defined as the harmonic mean of homogeneity and completeness. Here is an example in scikit-learn for your reference.
EDIT: The V-measure is basically used to compare two different cluster assignments to each other (for example a clustering against reference labels, or against another clustering).
Finally, if you are interested, you can take a look at Normalized Mutual Information Score to validate your results as well.
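For a quick illustration of these two scores (with made-up label vectors):

from sklearn.metrics import v_measure_score, normalized_mutual_info_score

# Two cluster assignments for the same six samples (toy example).
labels_a = [0, 0, 1, 1, 2, 2]
labels_b = [1, 1, 0, 0, 2, 2]   # same partition, different label ids

print(v_measure_score(labels_a, labels_b))                 # 1.0: identical partitions
print(normalized_mutual_info_score(labels_a, labels_b))    # 1.0 as well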
References :
Biclustering Scikit-Learn
Elbow Method : Coursera
Research Paper on V-Measure
Choosing the right number of clusters
Update : I recently came across this Self Tuning Spectral Clustering. You can give it a try.

How to calculate efficiently and accurately the Fourier transform of a radial function in Fortran

As my question states, I want to calculate the Fourier transform F(q) of a radial function f(r) (defined on [0, infinity) and decaying like an exponential exp(-A r + b) at large r) as accurately as possible in Fortran. The function values come from a data file (which I can easily interpolate, through cubic interpolation for example, and extrapolate, since the behaviour at large r is known).
I'm using the "physics" definition of the Fourier transform in 3D, which gives (because f is radial):
F(q) = (4*pi/q) * integral from 0 to infinity of r*f(r)*sin(q*r) dr
I first tried to calculate this integral for some chosen values of q using Gauss-Legendre quadrature, generating some 60 or 100 abscissas and weights via the NAG routine D01BCF (D01BCF link). With Gauss-Legendre quadrature, the problem is choosing the interval [0,B] on which to integrate. While the function f loses 4 to 5 orders of magnitude from r=10 to r=20 (for example), the choice of B has a strong influence on the result of the calculation... When I compared my result to a "nearly exact" calculation (made with MATLAB but with a very long computation time), I saw that it was in fact only valid for small values of q (of the order of 5, whereas I have to deal with values as large as 150). A Gauss-Laguerre quadrature does not give any better result, probably because of the oscillatory part of the integrand.
I then tried to compute this Fourier transform for some given values of q with the routine D01ASF (D01ASF link). It is a "one-dimensional quadrature, adaptive, semi-infinite interval, weight function cos(ωx) or sin(ωx)", which is exactly what I need. The results are quite convincing for q up to 80 or 100 if I request an absolute error tolerance of 10E-5. The problems are: I would need to go to larger q, and the Fourier transform F(q) oscillates with a magnitude of ~10E-6 at such q's. Lowering the tolerance to 10E-5 already takes some time and even makes the subroutine output an error message, so I don't know if 10E-6 would be feasible.
I'm thus currently wondering whether calculating this Fourier transform with an FFT would be a good idea. The problems I face are that I don't know how to handle a radial function with an FFT, and that I have never used an FFT before, so I don't even know how to use it properly (the definition of the transform is not even the same: exponent sign and argument).
Would you have any ideas? :)
EDIT 2: I tried with an FFT (using the routine C06FAF from the NAG library). It works quite well up to some large values of q. The problem I face is that there is always some constant normalising factor to account for, and I don't understand why. This normalising factor varies with the number N of points used in the mesh. It has the form of a power law: normalising factor F = N^(-0.5) * exp(9.9), approximately (see the figure, where the black line is the "exact" Fourier transform and the green, magenta, blue, red and yellow lines are the FFT results for different values of N).
EDIT 3: I found the factor to be A*N^(-0.5), where A is the length of the integration mesh.
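In case it helps, below is a minimal NumPy sketch (Python rather than Fortran, and using a toy exp(-A r + b) in place of the tabulated data) of the direct sine-weighted quadrature for F(q). It also makes the normalisation question explicit: a plain discrete sum approximates the integral only after multiplication by the mesh spacing dr, and an FFT routine that uses a symmetric 1/sqrt(N) convention (which I believe C06FAF does) then needs the extra factor dr*sqrt(N) = A/sqrt(N), consistent with EDIT 3.

import numpy as np

# Toy stand-in for the tabulated radial function: f(r) = exp(-A*r + b).
A_decay, b = 2.0, 0.0
r = np.linspace(0.0, 40.0, 40001)        # uniform mesh, spacing dr
dr = r[1] - r[0]
f = np.exp(-A_decay * r + b)

def radial_ft(q):
    # F(q) = (4*pi/q) * int_0^inf r f(r) sin(q r) dr, by trapezoidal quadrature.
    # At large q the mesh must satisfy q*dr << 1, otherwise the oscillations are
    # under-sampled (the same issue as with a fixed Gauss-Legendre rule).
    return 4.0 * np.pi / q * np.trapz(r * f * np.sin(q * r), dx=dr)

q_values = np.array([1.0, 5.0, 50.0, 150.0])
F_numeric = np.array([radial_ft(q) for q in q_values])

# Closed form for exp(-A*r + b): F(q) = exp(b) * 8*pi*A / (A^2 + q^2)^2, used as a check.
F_exact = np.exp(b) * 8.0 * np.pi * A_decay / (A_decay**2 + q_values**2) ** 2
print(F_numeric)
print(F_exact)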

Confusion regarding Conditional Random Fields

http://i.imgur.com/dspFhlO.png
I am trying to label objects in an image using Conditional Random Fields, but I am stuck understanding this formula.
Can anyone tell me the meaning of the terms of the formula and how to calculate them?
I am using the MS-COCO dataset, which has labelled images, i.e., I have segmented images.
Here Z(.) is the partition function, P(ci | Sj) is the probability that segment Sj of image I belongs to class ci, and q is the number of pairwise spatial relations.
This is in fact the conditional probability distribution of the labeling c = {c1, c2, ..., ck} for the image segments, given the segment features S = {S1, S2, ..., Sk}. p(ci|Si) is the probability of assigning class label ci to segment i, which can be computed with various classifiers such as logistic regression, a neural network, or an SVM. The term B represents the aggregate pairwise function that determines how likely it is for each adjacent pair {i,j} to take labels {ci,cj}. This term can be realized by computing the co-occurrence statistics of different class pairs in the dataset, which is described in detail in this paper:
Object Categorization using Co-Occurrence, Location and Appearance
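To make the structure concrete, here is a small toy sketch (the names, potentials and numbers are all made up for illustration and do not come from the paper) of how the unary terms p(ci|Si), the pairwise term B and the partition function Z combine, with Z computed by brute-force enumeration, which is only feasible at toy sizes:

import numpy as np
from itertools import product

# unary[i, c]    : classifier output p(c_i = c | S_i) for segment i and class c
# pairwise[a, b] : co-occurrence potential for an adjacent label pair (a, b)
def unnormalized_score(labels, unary, pairwise, edges):
    score = 1.0
    for i, c in enumerate(labels):
        score *= unary[i, c]                     # unary (appearance) terms
    for i, j in edges:
        score *= pairwise[labels[i], labels[j]]  # pairwise (spatial/co-occurrence) terms
    return score

def crf_posterior(labels, unary, pairwise, edges):
    # p(c | S) = score(c) / Z, with Z summed over all possible labelings.
    n_seg, n_cls = unary.shape
    Z = sum(unnormalized_score(l, unary, pairwise, edges)
            for l in product(range(n_cls), repeat=n_seg))
    return unnormalized_score(labels, unary, pairwise, edges) / Z

# Toy example: 3 segments, 2 classes; segments (0,1) and (1,2) are adjacent.
unary = np.array([[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]])
pairwise = np.array([[1.0, 0.5], [0.5, 1.0]])
edges = [(0, 1), (1, 2)]
print(crf_posterior((0, 0, 1), unary, pairwise, edges))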

Spatial pyramid matching (SPM) for SIFT then input to SVM in C++

I am trying to classify MRI images of brain tumors into benign and malignant using C++ and OpenCV. I am planning on using the bag-of-words (BoW) method after clustering SIFT descriptors with k-means. That is, I will represent each image as a histogram, with the whole "codebook"/dictionary on the x-axis and each word's occurrence count in the image on the y-axis. These histograms will then be the input for my SVM (with RBF kernel) classifier.
However, the disadvantage of using BoW is that it ignores the spatial information of the descriptors in the image. Someone suggested to use SPM instead. I read about it and came across this link giving the following steps:
1. Compute K visual words from the training set and map every local feature to its visual word.
2. For each image, initialize K multi-resolution coordinate histograms to zero. Each coordinate histogram consists of L levels, and each level i has 4^i cells that evenly partition the current image.
3. For each local feature (say its visual word ID is k) in this image, pick out the k-th coordinate histogram and accumulate one count in each of the L corresponding cells of this histogram, according to the coordinate of the local feature. The L cells are the cells the local feature falls into at the L different resolutions.
4. Concatenate the K multi-resolution coordinate histograms to form a final "long" histogram of the image. When concatenating, the k-th histogram is weighted by the probability of the k-th visual word.
5. To compute the kernel value over two images, sum up all the cells of the intersection of their "long" histograms.
(A small sketch of steps 2-4 appears right after this list.)
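As a rough illustration of steps 2-4, here is a small NumPy sketch (Python rather than the C++/OpenCV stack, toy sizes, and a per-level weighting that is just one common choice rather than the per-word weighting mentioned in step 4):

import numpy as np

def spm_long_histogram(points, word_ids, K, L, img_w, img_h):
    # Steps 2-4: multi-resolution "coordinate" histograms, concatenated per image.
    # points  : (N, 2) array of (x, y) feature locations
    # word_ids: (N,) array of visual-word ids in [0, K)
    # Level l uses a 2^l x 2^l grid, i.e. 4^l cells.
    parts = []
    for l in range(L):
        n = 2 ** l
        hist = np.zeros((K, n, n))
        cx = np.minimum((points[:, 0] * n / img_w).astype(int), n - 1)
        cy = np.minimum((points[:, 1] * n / img_h).astype(int), n - 1)
        for k, x, y in zip(word_ids, cx, cy):
            hist[k, y, x] += 1              # one count in this level's cell
        weight = 2.0 ** (l - (L - 1))       # coarser levels down-weighted (one common choice)
        parts.append(weight * hist.ravel())
    return np.concatenate(parts)

# Toy usage: K=5 visual words, L=3 levels, 100 random SIFT locations in a 640x480 image.
rng = np.random.default_rng(0)
pts = rng.uniform(0, 1, (100, 2)) * [640, 480]
ids = rng.integers(0, 5, 100)
print(spm_long_histogram(pts, ids, K=5, L=3, img_w=640, img_h=480).shape)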
Now, I have the following questions:
What is a coordinate histogram? Doesn't a histogram just show the counts for each grouping in the x-axis? How will it provide information on the coordinates of a point?
How would I compute the probability of the k-th visual word?
What will be the use of the "kernel value" that I will get? How will I use it as input to the SVM? If I understand it right, is the kernel value used in the testing phase and not in the training phase? If yes, then how will I train my SVM?
Or do you think I don't need to burden myself with the spatial info and should just stick with normal BoW for my situation (benign and malignant tumors)?
Someone please help this poor little undergraduate. You'll have my forever gratefulness if you do. If you have any clarifications, please don't hesitate to ask.
Here is the link to the actual paper, http://www.csd.uwo.ca/~olga/Courses/Fall2014/CS9840/Papers/lazebnikcvpr06b.pdf
MATLAB code is provided here http://web.engr.illinois.edu/~slazebni/research/SpatialPyramid.zip
A co-ordinate histogram (mentioned in your post) is just a histogram computed over a sub-region of the image. These slides explain it visually: http://web.engr.illinois.edu/~slazebni/slides/ima_poster.pdf.
You have multiple histograms here, one for each region of the image. The probability (or the count) depends on the SIFT points that fall in that sub-region.
I think you need to define your pyramid kernel as mentioned in the slides.
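To make that concrete, here is a hedged sketch (again in Python with scikit-learn rather than C++/OpenCV, just to show the mechanics) of step 5 plus SVM training: the histogram-intersection (pyramid) kernel values are passed to the SVM as a precomputed kernel, both at training time (train-vs-train Gram matrix) and at test time (test-vs-train matrix):

import numpy as np
from sklearn.svm import SVC

def intersection_kernel_matrix(A, B):
    # K[i, j] = sum of element-wise minima of the "long" histograms A[i] and B[j].
    return np.array([[np.minimum(a, b).sum() for b in B] for a in A])

# Toy "long" histograms: 20 training images and 5 test images, 50-dim each (made up).
rng = np.random.default_rng(1)
X_train = rng.random((20, 50))
y_train = rng.integers(0, 2, 20)          # stand-in for benign vs malignant labels
X_test = rng.random((5, 50))

clf = SVC(kernel='precomputed')
clf.fit(intersection_kernel_matrix(X_train, X_train), y_train)   # training Gram matrix
pred = clf.predict(intersection_kernel_matrix(X_test, X_train))  # test-vs-train kernels
print(pred)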
A Convolutional Neural Network may be better suited for your task if you have enough training samples. You can probably have a look at Torch or Caffe.

What is an ROC curve?

Can someone be kind enough to explain what an ROC curve actually represents with respect to tracking in a test sequence please? An example of an ROC curve is shown in the figure below.
The comments to the original question contain some very useful links to understand ROC curves in general and the discrimination threshold in question. Here is an attempt to understand the reference (Ref. 1) used by the OP and further information specific to the problem of Detecting Pedestrians.
How the ROC Curves are obtained in (Ref. 1) and what is the discrimination threshold in this case:
1. Motion filters and appearance filters ("f_i" in eq. (2), p. 156) are evaluated using the "integral image" of various temporal/spatial difference images from the video sequences.
2. Using all these filters, the learning algorithm builds the best classifier (C in eq. (1), p. 156) separating positive examples (e.g., pedestrians) from negative examples (e.g., a selection of non-pedestrian examples). The classifier C is a thresholded sum of features F, as given in eq. (1). A feature F is a filter "f_i" thresholded by a feature threshold "t_i".
3. The thresholds involved (i.e., the filter thresholds "t_i" and the classifier threshold "Theta") are computed during AdaBoost training, which chooses the features with the lowest weighted error on the training examples.
4. As in (Ref. 2), a cascade of such classifiers is used to make the detector extremely efficient. During training, each stage of the cascade (a boosted classifier) is trained using 2250 positive examples (example in Fig. 5, p. 158) and 2250 negative examples.
5. The final cascade detector is run over validation sequences to obtain the true positive rate and the false positive rate. This is based on comparing the output of the cascade detector (e.g., presence or absence of a pedestrian) to the ground truth (presence or absence of a pedestrian in the same region, based on ground truth or manual review of the video sequence). For a given set of threshold values for the entire cascade ("t_i" and "Theta" over all classifiers in the cascade), a certain true positive rate and false positive rate are obtained. This makes one point on the ROC curve.
A simple MATLAB example for measuring True Positive Rate and False Positive Rate from a given set of classifier outputs and ground truth can be found here: http://www.mathworks.com/matlabcentral/fileexchange/21212-confusion-matrix---matching-matrix-along-with-precision--sensitivity--specificity-and-model-accuracy
So in this case, each point on the ROC curve depends on the thresholds chosen for all the cascade layers (hence the discrimination threshold is not a single number here). By adjusting these thresholds, one at a time, the true positive rate and false positive rate change (when step 5 is repeated), giving other points on the ROC curve.
This process is repeated for both cases of dynamic and static detectors to obtain the two ROC curves on the figure.
More generally: ROC curves can be used to compare the performance of classifiers in distinguishing between classes, for example pedestrian versus non-pedestrian input samples. The area under the ROC curve is used as a measure of a classifier's ability to distinguish between the classes.
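As a small generic illustration of how the (FPR, TPR) points are produced by sweeping a discrimination threshold (made-up detector scores, not the cascade of Ref. 1):

import numpy as np

# Ground truth (1 = pedestrian present) and detector scores, made up for illustration.
y_true = np.array([1, 1, 1, 0, 0, 1, 0, 0, 1, 0])
scores = np.array([0.9, 0.8, 0.7, 0.65, 0.6, 0.55, 0.4, 0.3, 0.2, 0.1])

# Sweep the discrimination threshold; each threshold yields one ROC point.
for t in np.linspace(0.0, 1.0, 11):
    pred = scores >= t
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    tpr = tp / np.sum(y_true == 1)     # true positive rate (sensitivity)
    fpr = fp / np.sum(y_true == 0)     # false positive rate (1 - specificity)
    print(f"threshold={t:.1f}  FPR={fpr:.2f}  TPR={tpr:.2f}")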
Quotes are from:
(Ref. 1) P. Viola, M. J. Jones, and D. Snow, "Detecting Pedestrians Using Patterns of Motion and Appearance", International Journal of Computer Vision 63(2), 153-161, 2005. [online: as of April 2015] http://lear.inrialpes.fr/people/triggs/student/vj/viola-ijcv05.pdf
(Ref. 2) P. Viola and M. J. Jones, "Rapid object detection using a boosted cascade of simple features", IEEE Conference on Computer Vision and Pattern Recognition, 2001. More information at Viola-Jones Algorithm - "Sum of Pixels"?