Skip-Gram implementation in tensorflow/models - Subsampling of Frequent Words - c++

I have some experiments in mind related to skipgram model. So I have started to study and modify the optimized implementation in tensorflow/models repository in tutorials/embedding/word2vec_kernels.cc. Suddenly I came above the part where corpus subsampling is done.
According to Tomáš Mikolov paper (https://arxiv.org/abs/1310.4546, eq.5), the word should be kept with probability
where t denotes threshold parameter (according to paper chosen as 10^-5), and f(w) frequency of the word w,
but the code in word2vec_kernels.cc is following:
float keep_prob = (std::sqrt(word_freq / (subsample_ * corpus_size_)) + 1) *
(subsample_ * corpus_size_) / word_freq;
which can be transformed into previously presented notation as
What is the motivation behind this change? Is it just to model 'some kind of relation' to corpus size into this formula? Or is it some transformation of the original formula? Was it chosen empirically?
Edit: link to the mentioned file on github
https://github.com/tensorflow/models/blob/master/tutorials/embedding/word2vec_kernels.cc

Okay so I guess that without corpus_size, the graph looks somewhat the same as the original formula. Corpus size adds a relation to the corpus size to the formula and also "it works with the large numbers", so we can compute discard/keep probability without normalizing word frequency to proper distribution.

Related

Applying word2vec to find all words above a similarity threshold

The command model.most_similar(positive=['france'], topn=100) gives the top 100 most similar words to "france". However, I would like to know if there is a method which will output the most similar words above a similarity threshold to a given word. Is there a method like the following?:
model.most_similar(positive=['france'], threshold=0.9)
No, you'd have to request a large number (or all, with topn=0) then apply the cutoff yourself.
What you request could theoretically be added as an option.
However, the cosine-similarity absolute magnitudes don't necessarily have a stable meaning, like "90% similar" across different model runs. Their distribution can vary based on model training parameters, such as the vector size, and they are most-often interpreted only in ranked-comparison to other pairwise values from the same model.
For example, the composition of the top-100 most-similar words for 'cold' may be very similar in models with different training parameters, but the range of absolute similarity values for the #1 to #100 words can be quite different. So if you were picking an absolute threshold, you'd likely want to vary the cutoff based on observing the model, or along with other model training metaparameters.
Well, let's say you can. Try the following code:
def find_most_similar(model, wrd, threshold=0.75):
res = [item for item in model.wv.most_similar(wrd, topn=len(model.wv.vocab)) if item[1] > threshold]
return res

Random Forest with more features than data points

I am trying to predict whether a particular service ticket raised by client needs a code change.
I have training data.
I have around 17k data points with problem description and tag (Y for code change required and N for no code change)
I did TF-IDF and it gave me 27k features. So I tried to fit RandomForestClassifier (sklearn python) with this 17k x 27k matrix.
I am getting very low scores on test set while training accuracy is very high.
Precision on train set: 89%
Precision on test set: 21%
Can someone suggest any workarounds?
I am using this model now:
sklearn.RandomForestClassifier(n_jobs=3,n_estimators=100,class_weight='balanced',max_features=None,oob_score=True)
Please help!
EDIT:
I have 11k training data with 900 positives (skewed). I tried LinearSVC sparsify but didn't work as well as Truncated SVD (Latent Semantic Indexing). maxFeatures=None performs better on the test set than without it.
I have also tried SVM, logistic (l2 and l1), ExtraTrees. RandonForest still is working best.
Right now, going at 92% precision on positives but recall is 3% only
Any other suggestions would be appreciated!
Update:
Feature engineering helped a lot. I pulled features out of the air (len of chars, len of words, their, difference, ratio, day of week the problem was of reported, day of month, etc) and now I am at 19-20% recall with >95% accuracy.
Food for your thoughts on using word2vec average vectors as deep features for the free text instead of tf-idf or bag of words ???
[edited]
Random forest handles more features than data points quite fine. RF is e.g. used for micro-array studies with e.g. a 100:5000 data point/feature ratio or in single-nucleotide_polymorphism(SNP) studies with e.g 5000:500,000 ratio.
I do disagree with the diagnose provided by #ncfirth, but the suggested treatment of variable selection may help anyway.
Your default random forest is not badly overfitted. It is just not meaningful to pay any attention to a non-cross validated training set prediction performance for a RF model, because any sample will end in the terminal nodes/leafs it has itself defined. But the overall ensemble model is still robust.
[edit] If you would change the max_depth or min_samples_split, the training precision would probably drop, but that is not the point. The non-cross validated training error/precision of a random forest model or many other ensemble models simply does not estimate anything useful.
[I did before edit confuse max_features with n_estimators, sry I mostly use R]
Setting max_features="none" is not random forest, but rather 'bagged trees'. You may benefit from a somewhat lower max_features which improve regularization and speed, maybe not. I would try lowering max_features to somewhere between 27000/3 and sqrt(27000), the typical optimal range.
You may achieve better test set prediction performance by feature selection. You can run one RF model, keep the top ~5-50% most important features and then re-run the model with fewer features. "L1 lasso" variable selection as ncfirth suggests may also be a viable solution.
Your metric of prediction performance, precision, may not be optimal in case unbalanced data or if the cost of false-negative and false-positive is quite different.
If your test set is still predicted much worse than the out-of-bag cross-validated training set, you may have problems with your I.I.D. assumptions that any supervised ML model rely on or you may need to wrap the entire data processing in an outer cross-validation loop, to avoid over optimistic estimation of prediction performance due to e.g. the variable selection step.
Seems like you've overfit on your training set. Basically the model has learnt noise on the data rather than the signal. There are a few ways to combat this, but it seems fairly obvious that you're model has overfit because of the incredibly large number of features you're feeding it.
EDIT:
It seems I was perhaps too quick to jump to the conclusion of overfitting, however this may still be the case (left as an exercise to the reader!). However feature selection may still improve the generalisability and reliability of your model.
A good place to start for removing features in scikit-learn would be here. Using sparsity is a fairly common way to perform feature selection:
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel
import numpy as np
# Create some data
X = np.random.random((1800, 2700))
# Boolean labels as the y vector
y = np.random.random(1800)
y = y > 0.5
y = y.astype(bool)
lsvc = LinearSVC(C=0.05, penalty="l1", dual=False).fit(X, y)
model = SelectFromModel(lsvc, prefit=True)
X_new = model.transform(X)
print X_new.shape
Which returns a new matrix of shape (1800, 640). You can tune the number of features selected by altering the C parameter (called the penalty parameter in scikit-learn but sometimes called the sparsity parameter).

How to get vector for a sentence from the word2vec of tokens in sentence

I have generated the vectors for a list of tokens from a large document using word2vec. Given a sentence, is it possible to get the vector of the sentence from the vector of the tokens in the sentence.
There are differet methods to get the sentence vectors :
Doc2Vec : you can train your dataset using Doc2Vec and then use the sentence vectors.
Average of Word2Vec vectors : You can just take the average of all the word vectors in a sentence. This average vector will represent your sentence vector.
Average of Word2Vec vectors with TF-IDF : this is one of the best approach which I will recommend. Just take the word vectors and multiply it with their TF-IDF scores. Just take the average and it will represent your sentence vector.
There are several ways to get a vector for a sentence. Each approach has advantages and shortcomings. Choosing one depends on the task you want to perform with your vectors.
First, you can simply average the vectors from word2vec. According to Le and Mikolov, this approach performs poorly for sentiment analysis tasks, because it "loses the word order in the same way as the standard bag-of-words models do" and "fail[s] to recognize many sophisticated linguistic phenomena, for instance sarcasm". On the other hand, according to Kenter et al. 2016, "simply averaging word embeddings of all words in a text has proven to be a strong baseline or feature across a multitude of tasks", such as short text similarity tasks. A variant would be to weight word vectors with their TF-IDF to decrease the influence of the most common words.
A more sophisticated approach developed by Socher et al. is to combine word vectors in an order given by a parse tree of a sentence, using matrix-vector operations. This method works for sentences sentiment analysis, because it depends on parsing.
It is possible, but not from word2vec. The composition of word vectors in order to obtain higher-level representations for sentences (and further for paragraphs and documents) is a really active research topic. There is not one best solution to do this, it really depends on to what task you want to apply these vectors. You can try concatenation, simple summation, pointwise multiplication, convolution etc. There are several publications on this that you can learn from, but ultimately you just need to experiment and see what fits you best.
It depends on the usage:
1) If you only want to get sentence vector for some known data. Check out paragraph vector in these papers:
Quoc V. Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. Eprint Arxiv,4:1188–1196.
A. M. Dai, C. Olah, and Q. V. Le. 2015. DocumentEmbedding with Paragraph Vectors. ArXiv e-prints,July.
2) If you want a model to estimate sentence vector for unknown(test) sentences with unsupervised approach:
You could check out this paper:
Steven Du and Xi Zhang. 2016. Aicyber at SemEval-2016 Task 4: i-vector based sentence representation. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval 2016), San Diego, US
3)Researcher are also looking for the output of certain layer in RNN or LSTM network, recent example is:
http://www.aaai.org/ocs/index.php/AAAI/AAAI16/paper/view/12195
4)For the gensim doc2vec, many researchers could not get good results, to overcome this problem, following paper using doc2vec based on pre-trained word vectors.
Jey Han Lau and Timothy Baldwin (2016). An Empirical Evaluation of doc2vec with Practical Insights into Document Embedding Generation. In Proceedings of the 1st Workshop on Representation Learning for NLP, 2016.
5) tweet2vec or sent2vec
.
Facebook has SentEval project for evaluating the quality of sentence vectors.
https://github.com/facebookresearch/SentEval
6) There are more information in the following paper:
Neural Network Models for Paraphrase Identification, Semantic Textual Similarity, Natural Language Inference, and Question Answering
And for now you can use 'BERT':
Google release the source code as well as pretrained models.
https://github.com/google-research/bert
And here is an example to run bert as a service:
https://github.com/hanxiao/bert-as-service
You can get vector representations of sentences during training phase (join the test and train sentences in a single file and run word2vec code obtained from following link).
Code for sentence2vec has been shared by Tomas Mikolov here.
It assumes first word of a line to be sentence-id.
Compile the code using
gcc word2vec.c -o word2vec -lm -pthread -O3 -march=native -funroll-loops
and run it using
./word2vec -train alldata-id.txt -output vectors.txt -cbow 0 -size 100 -window 10 -negative 5 -hs 0 -sample 1e-4 -threads 40 -binary 0 -iter 20 -min-count 1 -sentence-vectors 1
EDIT
Gensim (development version) seems to have a method to infer vectors of new sentences. Check out model.infer_vector(NewDocument) method in https://github.com/gojomo/gensim/blob/develop/gensim/models/doc2vec.py
I've had good results from:
Summing the word vectors (with tf-idf weighting). This ignores word order, but for many applications is sufficient (especially for short documents)
Fastsent
Google's Universal Sentence Encoder embeddings are an updated solution to this problem. It doesn't use Word2vec but results in a competing solution.
Here is a walk-through with TFHub and Keras.
Deep averaging network (DAN) can provide sentence embeddings in which word bi-grams are averaged and passed through feedforward deep neural network(DNN).
It is found that transfer learning using sentence embeddings tends to outperform word level transfer as it preserves the semantic relationship.
You don't need to start the training from scratch, the pretrained DAN models are available for perusal ( Check Universal Sentence Encoder module in google hub).
let suppose this is current sentence
import gensim
from gensim.models import Word2Vec
from gensim import models
model = gensim.models.KeyedVectors.load_word2vec_format('path of your trainig
dataset', binary=True)
strr = 'i am'
strr2 = strr.split()
print(strr2)
model[strr2] //this the the sentance embeddings.

tanimoto coefficient in the book of Programming Collective Intelligence

I have read the book of Programming Collective Intelligence. For the after-class exercise 1 of chapter 2, could someone please tell me how to calculate the tanimoto coefficient? A specific mathematical formula will be really appreciated.
An extensive search on a related question has given me two formulas:
T(a,b) = N_intersection / (N_a + N_b - N_intersection) found here, which is the same as on Wikipedia in a slightly more readable fashion.
EDIT: As per your comment, this is the one the OP was looking for.
(n_11+n_00) / [n_11+2(n_10+n_01)+n_00], where
n_11: both have attribute,
n_00: both don't have attribute,
n_01 or n_10: only second/first object has the attribute.
For the source of the second equation have a look at http://reference.wolfram.com/language/ref/RogersTanimotoDissimilarity.html and calculate the similarity index from the dissimilarity index as (1-dissimilarity).
I believe that the second formula is commonly used in applied statistics and applied marketing.

Improving classification results with Weka J48 and Naive Bayes Multinomial classifiers

I have been using Weka’s J48 and Naive Bayes Multinomial (NBM) classifiers upon
frequencies of keywords in RSS feeds to classify the feeds into target
categories.
For example, one of my .arff files contains the following data extracts:
#attribute Keyword_1_nasa_Frequency numeric
#attribute Keyword_2_fish_Frequency numeric
#attribute Keyword_3_kill_Frequency numeric
#attribute Keyword_4_show_Frequency numeric
…
#attribute RSSFeedCategoryDescription {BFE,FCL,F,M, NCA, SNT,S}
#data
0,0,0,34,0,0,0,0,0,40,0,0,0,0,0,0,0,0,0,0,24,0,0,0,0,13,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,BFE
0,0,0,12,0,0,0,0,0,20,0,0,0,0,0,0,0,0,0,0,25,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,BFE
0,0,0,10,0,0,0,0,0,11,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,BFE
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,BFE
…
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,FCL
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,F
…
20,0,64,19,0,162,0,0,36,72,179,24,24,47,24,40,0,48,0,0,0,97,24,0,48,205,143,62,7
8,0,0,216,0,36,24,24,0,0,24,0,0,0,0,140,24,0,0,0,0,72,176,0,0,144,48,0,38,0,284,
221,72,0,72,0,SNT
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,SNT
0,0,0,0,0,0,11,0,0,0,0,0,0,0,19,0,0,0,0,0,0,0,0,0,0,10,0,0,0,0,0,0,0,0,0,0,0,0,0
,0,0,0,0,0,0,0,0,0,17,0,0,0,0,0,0,0,0,0,0,0,0,0,20,0,S
And so on: there’s a total of 570 rows where each one is contains with the
frequency of a keyword in a feed for a day. In this case, there are 57 feeds for
10 days giving a total of 570 records to be classified. Each keyword is prefixed
with a surrogate number and postfixed with ‘Frequency’.
I am using 10 fold x validation for both the J48s and NBM classifiers on a
'black box' basis. Other parameters used are also defaults, i.e. 0.25 confidence
and min number of objects is 2 for the J48s.
So far, my classification rates for an instance of varying numbers of days, date
ranges and actual keyword frequencies with both J28 and NBM results being
consistent in the 50 - 60% range. But, I would like to improve this if possible.
I have reduced the decision tree confidence level, sometimes as low as 0.1 but
the improvements are very marginal.
Can anyone suggest any other way of improving my results?
To give more information, the basic process here involves a diverse collection of RSS feeds where each one belongs to a single category.
For a given date range, e.g. 01 - 10 Sep 2011, the text of each feed's item elements are combined. The text is then validated to remove words with numbers, accents and so on, and stop words (a list of 500 stop words from MySQL is used). The remaining text is then indexed in Lucene to work out the most popular 64 words.
Each of these 64 words is then searched for in the description elements of the feeds for each day within the given date range. As part of this, the description text is also validated in the same way as the title text and again indexed by Lucene. So a popular keyword from the title such as 'declines' is stemmed to 'declin': then if any similar words are found in the description elements which also stem to 'declin', such as 'declined', the frequency for 'declin' is taken from Lucene's indexing of the word from the description elements.
The frequencies shown in the .arff file match on this basis, i.e. on the first line above, 'nasa', 'fish', 'kill' are not found in the description items of a particular feed in the BFE category for that day, but 'show' is found 34 times. Each line represents occurrences in the description items of a feed for a day for all 64 keywords.
So I think that the low frequencies are not due to stemming. Rather I see it as the inevitable result of some keywords being popular in feeds of one category, but which don't appear in other feeds at all. Hence the spareness shown in the results. Generic keywords may also be pertinent here as well.
The other possibilities are differences in the numbers of feeds per category where more feeds are in categories like NCA than S, or the keyword selection process itself is at fault.
You don't mention anything about stemming. In my opinion you could have better results if you were performing word stemming and the WEKA evaluation was based on the keyword stems.
For example let's suppose that your WEKA model is built given a keyword surfing and a new rss feed contains the word surf. There should be a match between these two words.
There are many free available stemmers for several languages.
For the English language some available options for stemming are:
The Porter's stemmer
Stemming based on the WordNet's dictionary
In case you would like to perform stemming using the WordNet's dictionary, there are libraries & frameworks that perform integration with WordNet.
Below you can find some of them:
MIT Java WordNet interface (JWI)
Rita
Java WorNet Library (JWNL)
EDITED after more information was provided
I believe that the keypoint in the specified case is the selection of the "most popular 64 words". The selected words or phrases should be keywords or keyphrases. So the challenge here is the keywords or keyphrases extraction.
There are several books, papers and algorithms written about keywords/keyphrases extraction. The university of Waikato has implemented in JAVA, a famous algorithm called Keyword Extraction Algorithm (KEA). KEA extracts keyphrases from text documents and can be either used for free indexing or for indexing with a controlled vocabulary. The implementation is distributed under the GNU General Public License.
Another issue that should be taken into consideration is the (Part of Speech)POS tagging. Nouns contain more information than the other POS tags. Therefore may you would have better results if you were checking the POS tag and the selected 64 words were mostly nouns.
In addition according to the Anette Hulth's published paper Improved Automatic Keyword Extraction Given More Linguistic Knowledge, her experiments showed that the keywords/keyphrases mostly have or are contained in one of the following five patterns:
ADJECTIVE NOUN (singular or mass)
NOUN NOUN (both sing. or mass)
ADJECTIVE NOUN (plural)
NOUN (sing. or mass) NOUN (pl.)
NOUN (sing. or mass)
In conclusion a simple action that in my opinion could improve your results is to find the POS tag for each word and select mostly nouns in order to evaluate the new RSS feeds. You can use WordNet in order to find the POS tag for each word and as I mentioned above there are many libraries on the web that perform integration with the WordNet's dictionary. Of course stemming is also essential for the classification process and has to be maintained.
I hope this helps.
Try turning off stemming altogether. The Stanford Intro to IR authors provide a rough justification of why stemming hurts, and at the very least does not help, in text classification contexts.
I have tested stemming myself on a custom multinomial naive Bayes text classification tool (I get accuracies of 85%). I tried the 3 Lucene stemmers available from org.apache.lucene.analysis.en version 4.4.0, which are EnglishMinimalStemFilter, KStemFilter and PorterStemFilter, plus no stemming, and I did the tests on small and larger training document corpora. Stemming significantly degraded classification accuracy when the training corpus was small, and left accuracy unchanged for the larger corpus, which is consistent with the Intro to IR statements.
Some more things to try:
Why only 64 words? I would increase that number by a lot, but preferably you would not have a limit at all.
Try tf-idf (term frequency, inverse document frequency). What you're using now is just tf. If you multiply this by idf you can mitigate problems arising from common and uninformative words like "show". This is especially important given that you're using so few top words.
Increase the size of the training corpus.
Try shingling to bi-grams, tri-grams, etc, and combinations of different N-grams (you're now using just unigrams).
There's a bunch of other knobs you could turn, but I would start with these. You should be able to do a lot better than 60%. 80% to 90% or better is common.