Word2Vec Output Vectors - word2vec

As I understand it, Word2Vec builds a word dictionary (or vocabulary) based on a training corpus, and outputs a K-dim vector for each word in the dictionary. My question is, what exactly is the source of those K-dim vectors? I'm assuming each vector is either a row or column in one of the weight matrices between the input and hidden layer, or the hidden and output layer. However, I haven't been able to find any sources to back this up, and I'm not literate enough in programming languages to examine the source code and figure it out myself. Any clarifying remarks on this topic would be greatly appreciated!

what exactly is the source of those K-Dim vectors? I'm assuming each vector is either a row or column in one of the weight matrices between the input and hidden layer, or the hidden and output layer.
The word2vec model (CBOW or skip-gram) outputs a feature matrix of words. This matrix is the first weight matrix, between the input layer and the projection layer (the projection layer in word2vec is linear: it has no activation function). When we train on a word in its context (in the CBOW model, for example), this weight matrix is updated. The second matrix, between the projection layer and the output layer, is also updated, but it is not normally used as the word vectors.
In the first matrix, each row corresponds to a vocabulary word and each column to a feature of the word (K dimensions).
For more information, see:
http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/

word2vec uses machine learning to obtain word representations. It predicts a word using its context (CBOW) or vice versa (skip-gram).
In machine learning, you have a loss function that represents the error your model makes. This error depends on the model's parameters.
Training a model means minimizing the error with respect to the model's parameters.
In word2vec, the embedding matrices are the model's parameters that are updated during training. I hope this helps you understand where they come from: they are first initialized randomly and then changed during the training process.
You can take a look at this picture from this paper:
The W matrix, which maps the input one-hot word representations to k-dimensional vectors, and the W' matrix, which maps a k-dimensional representation to the output, are both model parameters that we optimize during training.
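To make the role of W concrete, here is a minimal sketch in plain Python showing that multiplying a one-hot input by W simply selects one row of W, which is why the rows of W can be read off as the word vectors. The vocabulary, the 3-dimensional size, and all the numbers are made up for illustration; they do not come from a trained model.

```python
# Toy vocabulary and a hypothetical embedding matrix W
# (one row per word, one column per feature). Values are invented.
vocab = ["king", "queen", "apple"]
W = [
    [0.5, 0.1, -0.3],   # king
    [0.4, 0.2, -0.1],   # queen
    [-0.6, 0.9, 0.0],   # apple
]

def one_hot(index, size):
    """Return a one-hot list with 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def vec_mat(x, M):
    """Multiply a row vector x (length V) by a matrix M (V x K)."""
    return [sum(x[i] * M[i][j] for i in range(len(M))) for j in range(len(M[0]))]

word_to_index = {w: i for i, w in enumerate(vocab)}
idx = word_to_index["queen"]

# Projecting the one-hot input through W ...
projected = vec_mat(one_hot(idx, len(vocab)), W)

# ... gives exactly the row of W that belongs to the word.
print(projected)
print(W[idx])
```

In practice implementations skip the multiplication and index the row directly, since the result is the same.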

Related

Principal component analysis on proportional data

Is it valid to run a PCA on data that is comprised of proportions? For example, I have data on the proportion of various food items in the diet of different species. Can I run a PCA on this type of data or should I transform the data or do something else beforehand?
I had a similar question. You should search for "compositional data analysis". There are transformations you can apply to proportions in order to analyze them with multivariate techniques such as PCA. You can also find "robust" PCA algorithms to run your analysis in R. Let us know if you find an appropriate solution to your specific problem.
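As one example of such a transformation, the centered log-ratio (clr) transform from compositional data analysis takes the log of each part and subtracts the mean of the logs. The sketch below is only an illustration in plain Python: the function names and the diet values are made up, and it assumes every proportion is strictly positive (zeros need separate handling, because log(0) is undefined).

```python
import math

def clr(proportions):
    """Centered log-ratio transform of one composition (all parts > 0)."""
    logs = [math.log(p) for p in proportions]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

def clr_inverse(values):
    """Map clr coordinates back to proportions that sum to 1."""
    exps = [math.exp(v) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical diet of one species: three food items.
diet = [0.5, 0.3, 0.2]
transformed = clr(diet)          # unconstrained real values, summing to 0
recovered = clr_inverse(transformed)

print(transformed)
print(recovered)
```

PCA would then be run on the transformed rows (one row per species) rather than on the raw proportions.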
I don't think so.
PCA will give you "impossible" answers. You might get principal components with values that proportions can't have, like negative values or values greater than 1. How would you interpret this component?
In technical terms, the support of your data is a subset of the support of PCA. Say you have $k$ classes. Then:
the support for PCA vectors is $\mathbb{R}^k$
the support for your proportion vectors is the simplex in $\mathbb{R}^k$. By simplex I mean the set of vectors $p$ of length $k$ such that:
$0 \le p_i \le 1$ where $i = 1, ..., k$
$\sum_{i=1}^k{p_i} = 1$
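One way around this is if there's a one-to-one mapping between the simplex and all of $\mathbb{R}^k$ (strictly, the simplex has only $k-1$ free dimensions, so the natural target is $\mathbb{R}^{k-1}$). If so, you could map from your proportions to that space, do PCA there, then map the PCA vectors back to the simplex.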
But the simplex is not a linear space in its own right, because it is not closed under addition: if you add two elements of the simplex, you don't get an element of the simplex :/
A better approach, I think, is clustering, eg with Gaussian mixtures, or spectral clustering. This is related to PCA. But a nice property of clustering is you can express any element of your data as a "convex combination" of the clusters. If you analyze your proportion data and find clusters, they (unlike PCA vectors) will be within the simplex space, and any mixture of them will be, too.
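The difference between adding two proportion vectors and taking a convex combination of them is easy to check numerically. This is a small sketch in plain Python; the vectors, the weight, and the helper name in_simplex are made up for illustration.

```python
def in_simplex(p, tol=1e-9):
    """True if every entry is in [0, 1] and the entries sum to 1."""
    return all(-tol <= x <= 1 + tol for x in p) and abs(sum(p) - 1.0) <= tol

# Two hypothetical proportion vectors (for example, two cluster centres).
a = [0.7, 0.2, 0.1]
b = [0.1, 0.3, 0.6]

# Plain addition leaves the simplex: the entries now sum to 2.
added = [x + y for x, y in zip(a, b)]

# A convex combination (weights w and 1 - w, both in [0, 1]) stays inside.
w = 0.25
mixed = [w * x + (1 - w) * y for x, y in zip(a, b)]

print(in_simplex(added))
print(in_simplex(mixed))
```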
I also recommend looking into non-negative matrix factorization (NMF). This is like PCA but, as the name suggests, it constrains both the components and their weights to be non-negative. It's very useful for inferring structure in non-negative data, like proportions. But NMF does not give you a basis for the simplex.

Hidden Markov Model using multiple continuous observation variables

I am trying to use HMM for location prediction. I have the coordinates (x,y), speed and direction of motion. I have discretized the entire space into small blocks, that I use as states. The objective is to predict the location (state) of the object after time t, 2t, 3t and so on.
I have read multiple articles on HMM. I still have 2 questions:
Can I use some trajectories to create the transition matrix? My mapping from coordinates to block (i.e. the state) is straightforward, so I can use a few samples to create an initial transition matrix.
How do I define the emission matrix with the continuous observables (i.e. position, speed and direction)? If I assume them to be Gaussian with mean 0, how do I create the initial emission matrix?
Can I use Viterbi to predict the location after time t, 2t etc?
I have read too many articles and am really confused now. I would appreciate some help to know if I am going in the right direction.
Also, what would be a good C++ library to use for the purpose?
mlpack (http://www.mlpack.org/) is a very good and simple C++ library.
I couldn't understand what your observations are and what your hidden states are. If you have a simple mapping between them, then maybe you don't need an HMM in the first place.
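If the block (state) can be read directly from the coordinates, as the question describes, the transition matrix can be estimated by counting observed block-to-block moves, and the location after t, 2t, 3t steps follows from repeatedly applying that matrix. The sketch below is a plain Markov chain under that assumption, not an HMM; the block names, trajectories, and function names are made up for illustration.

```python
from collections import Counter, defaultdict

def estimate_transitions(trajectories):
    """Estimate P(next block | current block) by counting observed moves."""
    counts = defaultdict(Counter)
    for traj in trajectories:
        for current, nxt in zip(traj, traj[1:]):
            counts[current][nxt] += 1
    return {
        state: {nxt: c / sum(row.values()) for nxt, c in row.items()}
        for state, row in counts.items()
    }

def step(distribution, transitions):
    """Advance a distribution over blocks by one time step."""
    out = defaultdict(float)
    for state, prob in distribution.items():
        # Blocks never seen as a starting point have no outgoing row,
        # so their probability mass is dropped here.
        for nxt, p in transitions.get(state, {}).items():
            out[nxt] += prob * p
    return dict(out)

def predict(start_state, transitions, n_steps):
    """Distribution over blocks n_steps after starting in start_state."""
    dist = {start_state: 1.0}
    for _ in range(n_steps):
        dist = step(dist, transitions)
    return dist

# Hypothetical trajectories, already mapped from coordinates to blocks.
trajectories = [
    ["A", "B", "C"],
    ["A", "B", "B"],
    ["A", "C", "C"],
]
T = estimate_transitions(trajectories)
two_steps = predict("A", T, 2)
print(T)
print(two_steps)
```

With only a few trajectories many transitions will have zero counts, so some smoothing of the counts is usually worth considering.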

How to train a svm for classifying images of english alphabet?

My objective is to detect text in an image and recognize it.
I have managed to detect characters using the stroke width transform.
What should I do to recognize them?
As far as I know, I should train the SVM with my dataset of letters in different fonts (images) by detecting feature points and extracting feature vectors from each image. (I have used SIFT feature vectors and built the dictionary using k-means clustering.)
Once I have detected a character, I will extract the SIFT feature vector for this character, and I thought of feeding this into the SVM prediction function.
I don't know how to do the recognition using the SVM. I am confused! Please help me and correct me wherever I went wrong with the concept.
I followed this tutorial for the recognition part. Is this tutorial applicable to recognizing characters?
http://www.codeproject.com/Articles/619039/Bag-of-Features-Descriptor-on-SIFT-Features-with-O
SVM is a supervised classifier. To use it, you will need to have training data that is of the type of objects you are trying to recognize.
Step 1 - Prepare training data
The training data consists of pairs of feature vectors and their corresponding class labels. In your case, it appears that you have extracted a SIFT-based "bag-of-words" (BOW) feature vector for the characters you detected. So, for your training data, you will need to find many examples of the different characters, extract this feature vector for each of them, and associate each with a label (sometimes called a class label, and typically an integer), which you can then map to a textual description (e.g., the number 0 could be mapped to the character 'a', and so on).
Step 2 - Training the classifier
The SVM classifier takes in as input an array/Mat of feature vectors (one per row) and their associated labels. Tune the parameters of the SVM (i.e., the regularization parameter C, and if applicable, any other parameters for kernels) on a separate validation set.
Step 3 - Predict for unseen data
At test time, given a sample that was not seen by the SVM during training, you compute a feature vector (your SIFT-based BOW vector) for the sample. Pass this feature vector to the SVM's predict function, and it will return an integer. Remember that when preparing your training data you associated an integer with each class? This integer is the label predicted by the SVM for this sample. You can then map this label to a character. For example, if you have associated 0 with 'a', 1 with 'b', etc., you can use a vector/hashmap to map the integer to its textual counterpart.
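The data-handling side of these steps can be sketched in plain Python: turning the local descriptors of one character image into a bag-of-words histogram, and keeping the mapping between integer labels and characters. This sketch does not train or call an SVM. The function name bow_histogram, the tiny 2-D "visual words", and the descriptors are all made up for illustration (real SIFT descriptors have 128 dimensions).

```python
import string

def bow_histogram(descriptors, vocabulary):
    """Normalized histogram of nearest visual words for one image."""
    counts = [0] * len(vocabulary)
    for d in descriptors:
        # Squared Euclidean distance from this descriptor to every word.
        distances = [
            sum((a - b) ** 2 for a, b in zip(d, word)) for word in vocabulary
        ]
        counts[distances.index(min(distances))] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(vocabulary)

# Integer class labels <-> characters: 0 -> 'a', 1 -> 'b', ...
label_to_char = dict(enumerate(string.ascii_lowercase))
char_to_label = {c: i for i, c in label_to_char.items()}

# Toy dictionary of 3 visual words and 4 descriptors from one image.
vocabulary = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
descriptors = [[0.1, 0.0], [0.9, 1.2], [1.1, 0.8], [4.8, 5.1]]

feature_vector = bow_histogram(descriptors, vocabulary)  # one training row
label = char_to_label["a"]                               # its class label
print(feature_vector, label)
```

Each training image contributes one such (feature_vector, label) pair, and the integer returned by the classifier at test time is looked up in label_to_char.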
Additional Notes
You can check out OpenCV's SVM tutorial here for details.
NOTE: Often, for beginners, the hardest part (after getting the data) is tuning the classifier. My advice is to first try a simple classifier with few parameters to tune. A decent choice is the linear SVM, which only requires you to adjust one parameter, C. Once you manage to get somewhat decent results (which gives some assurance that the rest of your code is working), you can move on to more "sophisticated" classifiers.
Lastly, the training data and the feature vectors you extract are very important. The training data must be "similar" to the test data you are trying to predict. For example, if you are predicting characters found in road signs, which come with different fonts, lighting conditions, and pose differences, then using training data consisting of characters taken from, say, a newspaper/book archive may not give you good results. This is an issue of domain adaptation in machine learning.

What is `query` and `train` in openCV features2D

Everywhere in features2D classes I see terms query and train. For example matches have trainIdx and queryIdx, and Matchers have train() method.
I know the definition of the words train and query in English, but I can't understand the meaning of these properties or methods.
P.S. I understand that it's a very silly question, but maybe it's because English is not my native language.
To complete sansuiso's answer: I suppose the reason for choosing these names is that in some applications we have a set of images (training images) beforehand, for example 10 images taken inside your office. The features can be extracted and the feature descriptors computed for these images. Then, at run-time, an image is given to the system to query the trained database; this image is the query image. I really don't like the way they have named these parameters. When you have a pair of stereo images and you want to match the features, these names don't make sense, but you have to choose a convention, say always calling the left image the query image and the right image the training image. I did my PhD in computer vision and some naming conventions in OpenCV seem really confusing/silly to me. So if you find these confusing or silly, you're not alone.
train: this function builds the classifier inner state in order to make it operational. For example, think of training an SVM, or building a kd-tree from the reference data. Maybe you are confused because this step is often referred to as learning in the literature.
query is the action of finding the nearest neighbors to a set of points, and by extension it also refers to the whole set of points for which you want a nearest neighbor. Recall that you can ask for the neighbors of one point, or of many points in the same function call (by stacking the feature points in a matrix).
trainIdx and queryIdx refer to the index of a point in the reference and query sets respectively, i.e. you ask the matcher for the nearest point (stored at the trainIdx position) to some other point (stored at the queryIdx position). Of course, trainIdx is only known after the function call. If your points are stored in a matrix, the index is the row of the considered feature.
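The relationship between the two indices can be illustrated with a tiny brute-force matcher in plain Python. This is only a sketch that imitates the shape of OpenCV's match records (queryIdx, trainIdx, distance); it is not the library's implementation, and the Match tuple, the function name, and the 2-D descriptors are made up for illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for a match record: which query row matched
# which train (reference) row, and how far apart they are.
Match = namedtuple("Match", ["queryIdx", "trainIdx", "distance"])

def brute_force_match(query_descriptors, train_descriptors):
    """For every query descriptor, find the nearest train descriptor."""
    matches = []
    for q_idx, q in enumerate(query_descriptors):
        best_idx, best_dist = None, float("inf")
        for t_idx, t in enumerate(train_descriptors):
            dist = sum((a - b) ** 2 for a, b in zip(q, t)) ** 0.5
            if dist < best_dist:
                best_idx, best_dist = t_idx, dist
        matches.append(Match(q_idx, best_idx, best_dist))
    return matches

# Reference ("train") set, prepared beforehand, one descriptor per row.
train = [[0.0, 0.0], [10.0, 10.0], [5.0, 0.0]]
# Descriptors from the new image we ask about ("query"), one per row.
query = [[9.0, 10.0], [0.5, 0.0]]

matches = brute_force_match(query, train)
for m in matches:
    print(m.queryIdx, "->", m.trainIdx, "distance", m.distance)
```

Here queryIdx is a row number in query and trainIdx is a row number in train, so each match reads as "query row i is closest to train row j".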
I understand "query" and "train" in a very naive but useful way:
"train": a data or image is preprocessed to get a database
"query": an input data or image that will be queried in the database which we trained before.
Hope it helps u as well.

Analyzing gaze tracking data

I have an image which was shown to groups of people with different domain knowledge of its content. I then recorded gaze fixation data of them viewing the image.
I now kind of want to compare the results of the two groups - so what I need to know is, if there is a correlation of the positions of the sampling data between the two groups or not.
I have the original image as well as the fixation coords. Do you have any good idea how to start analyzing the data?
It's more about the idea or the plan so you don't have to be too technical on that one.
Thanks
Simple idea: render all the coordinates on the original image in a 'heat map' like way, one image for each group. You can then visually compare the images for correlation, and you have some nice graphics for your paper.
There is something like the two-dimensional correlation coefficient. With software like R or Matlab you can do the number crunching for the correlation.
Matlab has a function for this, corr2, which computes the two-dimensional correlation coefficient between two matrices. The matrices must be of the same size:
r = corr2(A, B)
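For readers without Matlab, the same idea can be sketched in plain Python: bin each group's fixations into a coarse grid (a simple heat map) and compute the two-dimensional correlation coefficient between the two grids. The function names mirror the text but are this sketch's own, and the image size, cell size, and fixation coordinates are made up for illustration.

```python
import math

def heat_map(fixations, width, height, cell):
    """Count fixations per grid cell; returns a list of rows."""
    cols = -(-width // cell)    # ceiling division
    rows = -(-height // cell)
    grid = [[0] * cols for _ in range(rows)]
    for x, y in fixations:
        if 0 <= x < width and 0 <= y < height:
            grid[int(y // cell)][int(x // cell)] += 1
    return grid

def corr2(a, b):
    """Two-dimensional correlation coefficient of two equal-size grids."""
    flat_a = [v for row in a for v in row]
    flat_b = [v for row in b for v in row]
    if len(flat_a) != len(flat_b):
        raise ValueError("grids must have the same size")
    mean_a = sum(flat_a) / len(flat_a)
    mean_b = sum(flat_b) / len(flat_b)
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(flat_a, flat_b))
    den = math.sqrt(
        sum((x - mean_a) ** 2 for x in flat_a)
        * sum((y - mean_b) ** 2 for y in flat_b)
    )
    return num / den if den else float("nan")

# Hypothetical fixations (x, y) in pixels on a 4 x 4 image, 2-pixel cells.
group_a = [(0.5, 0.5), (1, 1), (3, 3)]
group_b = [(0.2, 1.5), (3.5, 3.2), (3, 2.5)]

map_a = heat_map(group_a, width=4, height=4, cell=2)
map_b = heat_map(group_b, width=4, height=4, cell=2)
r = corr2(map_a, map_b)
print(map_a, map_b, r)
```

The cell size controls how close two fixations must be to count as "the same place", so it is worth trying a few values.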
In gaze tracking, the most interesting data lies in two areas.
Where people look: for that you can use the heat map Daan suggests. Make a heat map for all people, and heat maps for the separate groups of people.
When people look there: for that I would recommend you start by making heat maps as above, but for short time intervals starting from the time the picture was first shown. Again, for all people, and for the separate groups you have.
The resulting set of heat-maps, perhaps animated for the ones from the second point, should give you some pointers for further analysis.
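Splitting the fixations into short time intervals is a small bookkeeping step. The sketch below, in plain Python, assumes each fixation is stored as (time since the picture appeared in seconds, x, y); that record layout, the function name, and the sample values are assumptions made for illustration.

```python
from collections import defaultdict

def split_by_time(fixations, window):
    """Group (t, x, y) fixations into consecutive windows of `window` seconds."""
    windows = defaultdict(list)
    for t, x, y in fixations:
        windows[int(t // window)].append((x, y))
    return dict(windows)

# Hypothetical fixations: (seconds since onset, x, y).
fixations = [(0.1, 10, 10), (0.4, 12, 11), (0.7, 50, 60), (1.2, 52, 61)]

windows = split_by_time(fixations, window=0.5)
for index in sorted(windows):
    print(index, windows[index])
```

Each entry of windows can then be rendered as its own heat map, per group or for everyone together.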