word2vec with different grammar

What is the effect of word2vec if implemented on a different language with different grammar? I mean, word2vec was implemented on an English corpus first; is there any effect if we use a corpus in another language?

Word2vec has been applied to many languages – and also as one part of language-to-language translation strategies, where word2vec models are learned on each language separately.
Word2vec is not dependent on any specifics of English grammar. Rather, it simply requires as input sequences of words in their natural ordering.
(Languages where words aren't clearly indicated with intervening whitespace/punctuation may require more complicated tokenization before their word-sequences are passed to word2vec training, but that's outside the word2vec algorithm itself, and once given proper word-tokens word2vec should still be able to learn word-vectors that have the usual desirable arrangements.)
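For example, here's a minimal gensim sketch with toy German sentences, just to illustrate that word2vec only ever sees token sequences (assumes gensim 4.x, where the parameter is vector_size rather than the older size):
from gensim.models import Word2Vec
# word2vec only sees sequences of tokens; which language they come from doesn't matter.
sentences = [
    ["das", "ist", "ein", "kurzer", "Satz"],
    ["hier", "ist", "noch", "ein", "Satz"],
]
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)
print(model.wv.most_similar("Satz"))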

You can't compare word embeddings from models that have been trained on different corpora. You'll get nonsensical results, because model A has no knowledge of the contexts that the words from model B have been seen in.
It is possible though to convert the word embeddings from model A into a new vector that exists in the vector space of model B. This is akin to translation, and the new word vector should be close to words in the other model's corpus that are similar in meaning to your original word (assuming that both models have been trained on data covering a similar range of contexts).
I've written a small package, transvec, which will do this for you in Python. It allows you to take word vectors from a pre-trained model for one language and translate them into word vectors in a completely different language. Here's an example where words are converted between English and Russian, using pre-trained Word2Vec models:
import gensim.downloader
from transvec.transformers import TranslationWordVectorizer
# Pretrained models in two different languages.
ru_model = gensim.downloader.load("word2vec-ruscorpora-300")
en_model = gensim.downloader.load("glove-wiki-gigaword-300")
# Training data: pairs of English words with their Russian translations.
# The more you can provide, the better.
train = [
    ("king", "царь_NOUN"), ("tsar", "царь_NOUN"),
    ("man", "мужчина_NOUN"), ("woman", "женщина_NOUN"),
]
bilingual_model = TranslationWordVectorizer(en_model, ru_model).fit(train)
# Find words with similar meanings across both languages.
bilingual_model.similar_by_word("царица_NOUN", 1) # "queen"
# [('king', 0.7763221263885498)]


Document classification: Preprocessing and multiple labels

I have a question about the word representation algorithms:
Which one of the algorithms word2vec, doc2vec, and TF-IDF is more suitable for handling text classification tasks?
The corpus used in my supervised learning classification is composed of a list of multiple sentences, with both short and long ones. As discussed in this thread, the doc2vec vs. word2vec choice is a matter of document length. As for TF-IDF vs. word embeddings, it's more a matter of text representation.
My other question is: what if, for the same corpus, I had more than one label to link to the sentences in it? If I create multiple entries/labels for the same sentence, it affects the decision of the final classification algorithm. How can I tell the model that every label counts equally for every sentence of the document?
Thank you in advance,
You should try multiple methods of turning your sentences into 'feature vectors'. There are no hard-and-fast rules; what works best for your project will depend a lot on your specific data, problem-domains, & classification goals.
(Don't extrapolate guidelines from other answers – such as the one you've linked that's about document-similarity rather than classification – as best practices for your project.)
To get initially underway, you may want to focus on some simple 'binary classification' aspect of your data, first. For example, pick a single label. Train on all the texts, merely trying to predict if that one label applies or not.
When you have that working, so you have an understanding of each step – corpus prep, text processing, feature-vectorization, classification-training, classification-evaluation – then you can try extending/adapting those steps to either single-label classification (where each text should have exactly one unique label) or multi-label classification (where each text might have any number of combined labels).
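For instance, here's a rough sketch of that binary starting point, using scikit-learn's TF-IDF vectorizer and logistic regression as stand-ins for whichever feature-vectorization and classifier you end up choosing (the texts and labels below are toy placeholders):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
# Toy data: 1 if the single chosen label (say, 'travel') applies, else 0.
texts = ["cheap flights to rome", "pasta recipe with basil",
         "hotel deals in paris", "how to bake sourdough"]
has_label = [1, 0, 1, 0]
vec = TfidfVectorizer()
X = vec.fit_transform(texts)
clf = LogisticRegression().fit(X, has_label)
print(clf.predict(vec.transform(["weekend trip to berlin"])))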

word2vec guessing word embeddings

Can word2vec be used for guessing words from just context?
Having trained the model with a large data set, e.g. Google News, how can I use word2vec to predict a similar word from only context, e.g. with input ", who dominated chess for more than 15 years, will compete against nine top players in St Louis, Missouri."? The output should be Kasparov or maybe Carlsen.
I've seen only the similarity APIs, but I can't make sense of how to use them for this. Is this not how word2vec was intended to be used?
It is not the intended use of word2vec. The word2vec algorithm internally tries to predict exact words, using surrounding words, as a roundabout way to learn useful vectors for those surrounding words.
But even so, it's not forming exact predictions during training. It's just looking at a single narrow training example – context words and target word – and performing a very simple comparison and internal nudge to make its conformance to that one example slightly better. Over time, that self-adjusts towards useful vectors – even if the predictions remain of wildly-varying quality.
Most word2vec libraries don't offer a direct interface for showing ranked predictions given context words. The Python gensim library has, for the last few versions (as of current version 2.2.0 in July 2017), offered a predict_output_word() method that roughly shows what the model would predict given context-words, for some training modes. See:
https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec.predict_output_word
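For example, a minimal sketch (toy corpus for illustration only; predict_output_word requires a model trained with negative sampling, and parameter names here follow gensim 4.x):
from gensim.models import Word2Vec
sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "sat", "on", "the", "rug"]] * 50
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, negative=5)
# Rank candidate center words given the surrounding context words.
print(model.predict_output_word(["the", "sat", "on", "the"], topn=3))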
However, considering your fill-in-the-blank query (also called a 'cloze deletion' in related education or machine-learning contexts):
_____, who dominated chess for more than 15 years, will compete against nine top players in St Louis, Missouri
A vanilla word2vec model is unlikely to get that right. It has little sense of the relative importance of words (except when some words are more narrowly predictive of others). It has no sense of grammar/ordering, or of the compositional meaning of connected phrases (like 'dominated chess' as opposed to the separate words 'dominated' and 'chess'). Even though words describing the same sorts of things are usually near each other, it doesn't know categories, so it can't determine that the blank must be a 'person' and a 'chess player', and the fuzzy similarities of word2vec don't guarantee that words of a class will necessarily all be nearer to each other than to other words.
There has been a bunch of work to train word/concept vectors (aka 'dense embeddings') to be better at helping at such question-answering tasks. A random example might be "Creating Causal Embeddings for Question Answering with Minimal Supervision" but queries like [word2vec question answering] or [embeddings for question answering] will find lots more. I don't know of easy out-of-the-box libraries for doing this, with or without a core of word2vec, though.

How to get vector for a sentence from the word2vec of tokens in sentence

I have generated the vectors for a list of tokens from a large document using word2vec. Given a sentence, is it possible to get the vector of the sentence from the vectors of the tokens in the sentence?
There are different methods to get sentence vectors:
Doc2Vec: you can train your dataset using Doc2Vec and then use the sentence vectors.
Average of Word2Vec vectors: you can just take the average of all the word vectors in a sentence. This average vector will represent your sentence vector.
Average of Word2Vec vectors with TF-IDF: this is one of the best approaches, and the one I recommend. Just take the word vectors, multiply them by their TF-IDF scores, take the average, and it will represent your sentence vector.
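A minimal sketch of that weighted average (an illustration, not the only way to do it: here the weights are IDF values from scikit-learn's TfidfVectorizer fit on a toy corpus; within a sentence, repeated words then contribute their term frequency implicitly):
import numpy as np
from gensim.models import Word2Vec
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["the cat sat on the mat", "the dog sat on the rug"]
model = Word2Vec([s.split() for s in corpus], vector_size=50, min_count=1)
tfidf = TfidfVectorizer().fit(corpus)
idf = dict(zip(tfidf.get_feature_names_out(), tfidf.idf_))  # needs scikit-learn >= 1.0

def sentence_vector(sentence):
    # Weight each word's vector by its IDF, then take the weighted average.
    words = [w for w in sentence.split() if w in model.wv and w in idf]
    weights = np.array([idf[w] for w in words])
    vectors = np.array([model.wv[w] for w in words])
    return (vectors * weights[:, None]).sum(axis=0) / weights.sum()

print(sentence_vector("the cat sat"))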
There are several ways to get a vector for a sentence. Each approach has advantages and shortcomings. Choosing one depends on the task you want to perform with your vectors.
First, you can simply average the vectors from word2vec. According to Le and Mikolov, this approach performs poorly for sentiment analysis tasks, because it "loses the word order in the same way as the standard bag-of-words models do" and "fail[s] to recognize many sophisticated linguistic phenomena, for instance sarcasm". On the other hand, according to Kenter et al. 2016, "simply averaging word embeddings of all words in a text has proven to be a strong baseline or feature across a multitude of tasks", such as short text similarity tasks. A variant would be to weight word vectors with their TF-IDF to decrease the influence of the most common words.
A more sophisticated approach, developed by Socher et al., is to combine word vectors in an order given by a parse tree of the sentence, using matrix-vector operations. This method works for sentence-level sentiment analysis, because it relies on the sentence's parse structure.
It is possible, but not from word2vec alone. The composition of word vectors to obtain higher-level representations for sentences (and further for paragraphs and documents) is a really active research topic. There is no single best way to do this; it really depends on the task you want to apply these vectors to. You can try concatenation, simple summation, pointwise multiplication, convolution, etc. There are several publications on this that you can learn from, but ultimately you just need to experiment and see what fits your case best.
It depends on the usage:
1) If you only want to get sentence vectors for some known data, check out the paragraph vector approach in these papers:
Quoc V. Le and Tomas Mikolov. 2014. Distributed Representations of Sentences and Documents. Eprint arXiv, 4:1188–1196.
A. M. Dai, C. Olah, and Q. V. Le. 2015. Document Embedding with Paragraph Vectors. arXiv e-prints, July.
2) If you want a model to estimate sentence vectors for unknown (test) sentences with an unsupervised approach, you could check out this paper:
Steven Du and Xi Zhang. 2016. Aicyber at SemEval-2016 Task 4: i-vector based sentence representation. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval 2016), San Diego, US
3) Researchers are also looking at the output of certain layers in RNN or LSTM networks; a recent example is:
http://www.aaai.org/ocs/index.php/AAAI/AAAI16/paper/view/12195
4) For gensim's doc2vec, many researchers could not get good results; to overcome this problem, the following paper uses doc2vec based on pre-trained word vectors:
Jey Han Lau and Timothy Baldwin (2016). An Empirical Evaluation of doc2vec with Practical Insights into Document Embedding Generation. In Proceedings of the 1st Workshop on Representation Learning for NLP, 2016.
5) tweet2vec or sent2vec.
Facebook has the SentEval project for evaluating the quality of sentence vectors:
https://github.com/facebookresearch/SentEval
6) There is more information in the following paper:
Neural Network Models for Paraphrase Identification, Semantic Textual Similarity, Natural Language Inference, and Question Answering
And for now you can use BERT:
Google released the source code as well as pretrained models.
https://github.com/google-research/bert
And here is an example to run bert as a service:
https://github.com/hanxiao/bert-as-service
You can get vector representations of sentences during the training phase (join the test and train sentences in a single file and run the word2vec code obtained from the following link).
Code for sentence2vec has been shared by Tomas Mikolov here.
It assumes first word of a line to be sentence-id.
Compile the code using
gcc word2vec.c -o word2vec -lm -pthread -O3 -march=native -funroll-loops
and run it using
./word2vec -train alldata-id.txt -output vectors.txt -cbow 0 -size 100 -window 10 -negative 5 -hs 0 -sample 1e-4 -threads 40 -binary 0 -iter 20 -min-count 1 -sentence-vectors 1
EDIT
Gensim (development version) seems to have a method to infer vectors of new sentences. Check out model.infer_vector(NewDocument) method in https://github.com/gojomo/gensim/blob/develop/gensim/models/doc2vec.py
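In current gensim releases that method lives on the Doc2Vec class; here's a minimal sketch (toy corpus for illustration only):
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [TaggedDocument(words=["the", "cat", "sat"], tags=["d0"]),
          TaggedDocument(words=["the", "dog", "ran"], tags=["d1"])]
model = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=40)
# Infer a vector for a new, unseen sentence.
print(model.infer_vector(["the", "cat", "ran"]))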
I've had good results from:
Summing the word vectors (with tf-idf weighting). This ignores word order, but for many applications is sufficient (especially for short documents)
FastSent
Google's Universal Sentence Encoder embeddings are an updated solution to this problem. It doesn't use word2vec, but it offers a competitive alternative.
Here is a walk-through with TFHub and Keras.
A Deep Averaging Network (DAN) can provide sentence embeddings in which word bi-grams are averaged and passed through a feedforward deep neural network (DNN).
It has been found that transfer learning using sentence embeddings tends to outperform word-level transfer, as it preserves the semantic relationship.
You don't need to start the training from scratch; pretrained DAN models are available for perusal (check the Universal Sentence Encoder module on TensorFlow Hub).
Suppose this is the current sentence:
import gensim
import numpy as np
model = gensim.models.KeyedVectors.load_word2vec_format('path of your training dataset', binary=True)
strr = 'i am'
strr2 = strr.split()
print(strr2)
word_vectors = model[strr2]  # one vector per token
sentence_vector = np.mean(word_vectors, axis=0)  # averaging the word vectors gives a simple sentence embedding

How to train p(category|title) model with word2vec

Using word2vec, the goal is to maximize the corpus probability p(word|context), and context comes in the form of words.
Suppose we are given a corpus of titles and their categories (such as sport, food, ...); how can we use word2vec to train a model to predict p(category|title)?
You could try to do your own naive compositionality by adding the words in the title together to get a vector that "describes" the whole sentence. Once you have that vector, you can train any classifier on it (SVM, logistic regression, k-nearest neighbors, etc).
This method may be simple enough to work, depending on how long these titles are. Word2vec embeddings have been shown to exhibit some compositionality under simple vector addition for short phrases (in the word2vec paper, Mikolov et al. show that vec("Germany") + vec("capital") gets pretty close to vec("Berlin")). So maybe that'll be good enough for you.
Alternatively, if the titles are more like sentences, you could consider using the sentence-level extension of word2vec from Quoc Le & Tomas Mikolov's paper. Gensim has a pretty straightforward-to-use implementation of it called doc2vec:
http://rare-technologies.com/doc2vec-tutorial/
Just like the simpler vector addition idea, doc2vec will produce a fixed-length representation of your title, which you can then feed into standard ML libraries for classification.
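As a rough sketch of the simpler vector-addition route (assuming pre-trained vectors from gensim's downloader and a handful of toy labeled titles; logistic regression's predict_proba then serves as the p(category|title) estimate):
import numpy as np
import gensim.downloader
from sklearn.linear_model import LogisticRegression

wv = gensim.downloader.load("glove-wiki-gigaword-50")  # any pre-trained word vectors work

def title_vector(title):
    # Add up the vectors of in-vocabulary words to "describe" the whole title.
    words = [w for w in title.lower().split() if w in wv]
    return np.sum([wv[w] for w in words], axis=0)

titles = ["local team wins championship", "new pasta restaurant opens",
          "striker scores twice in final", "chef shares secret recipe"]
labels = ["sport", "food", "sport", "food"]
clf = LogisticRegression().fit([title_vector(t) for t in titles], labels)
print(clf.predict_proba([title_vector("goalkeeper saves penalty")]))  # p(category|title)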

Improving classification results with Weka J48 and Naive Bayes Multinomial classifiers

I have been using Weka's J48 and Naive Bayes Multinomial (NBM) classifiers upon frequencies of keywords in RSS feeds to classify the feeds into target categories.
For example, one of my .arff files contains the following data extracts:
@attribute Keyword_1_nasa_Frequency numeric
@attribute Keyword_2_fish_Frequency numeric
@attribute Keyword_3_kill_Frequency numeric
@attribute Keyword_4_show_Frequency numeric
…
@attribute RSSFeedCategoryDescription {BFE,FCL,F,M,NCA,SNT,S}
@data
0,0,0,34,0,0,0,0,0,40,0,0,0,0,0,0,0,0,0,0,24,0,0,0,0,13,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,BFE
0,0,0,12,0,0,0,0,0,20,0,0,0,0,0,0,0,0,0,0,25,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,BFE
0,0,0,10,0,0,0,0,0,11,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,BFE
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,BFE
…
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,FCL
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,F
…
20,0,64,19,0,162,0,0,36,72,179,24,24,47,24,40,0,48,0,0,0,97,24,0,48,205,143,62,78,0,0,216,0,36,24,24,0,0,24,0,0,0,0,140,24,0,0,0,0,72,176,0,0,144,48,0,38,0,284,221,72,0,72,0,SNT
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,SNT
0,0,0,0,0,0,11,0,0,0,0,0,0,0,19,0,0,0,0,0,0,0,0,0,0,10,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,17,0,0,0,0,0,0,0,0,0,0,0,0,0,20,0,S
And so on: there's a total of 570 rows, where each one contains the frequency of a keyword in a feed for a day. In this case, there are 57 feeds over 10 days, giving a total of 570 records to be classified. Each keyword is prefixed with a surrogate number and postfixed with 'Frequency'.
I am using 10-fold cross-validation for both the J48 and NBM classifiers on a 'black box' basis. Other parameters used are also defaults, i.e. 0.25 confidence and a minimum of 2 objects for J48.
So far, my classification rates for instances of varying numbers of days, date ranges, and actual keyword frequencies, with both J48 and NBM, have been consistently in the 50-60% range. But I would like to improve this if possible. I have reduced the decision tree confidence level, sometimes as low as 0.1, but the improvements are very marginal.
Can anyone suggest any other way of improving my results?
To give more information, the basic process here involves a diverse collection of RSS feeds where each one belongs to a single category.
For a given date range, e.g. 01 - 10 Sep 2011, the text of each feed's item elements is combined. The text is then validated to remove words with numbers, accents and so on, and stop words (a list of 500 stop words from MySQL is used). The remaining text is then indexed in Lucene to work out the 64 most popular words.
Each of these 64 words is then searched for in the description elements of the feeds for each day within the given date range. As part of this, the description text is also validated in the same way as the title text and again indexed by Lucene. So a popular keyword from the title such as 'declines' is stemmed to 'declin': then if any similar words are found in the description elements which also stem to 'declin', such as 'declined', the frequency for 'declin' is taken from Lucene's indexing of the word from the description elements.
The frequencies shown in the .arff file match on this basis, i.e. on the first line above, 'nasa', 'fish', 'kill' are not found in the description items of a particular feed in the BFE category for that day, but 'show' is found 34 times. Each line represents occurrences in the description items of a feed for a day for all 64 keywords.
So I think that the low frequencies are not due to stemming. Rather, I see it as the inevitable result of some keywords being popular in feeds of one category but not appearing in other feeds at all; hence the sparseness shown in the results. Generic keywords may also be pertinent here as well.
The other possibilities are differences in the numbers of feeds per category, where more feeds are in categories like NCA than S, or that the keyword selection process itself is at fault.
You don't mention anything about stemming. In my opinion you could get better results if you performed word stemming and the WEKA evaluation was based on the keyword stems.
For example, let's suppose that your WEKA model is built given a keyword 'surfing', and a new RSS feed contains the word 'surf'. There should be a match between these two words.
There are many free available stemmers for several languages.
For the English language some available options for stemming are:
Porter's stemmer
Stemming based on the WordNet's dictionary
In case you would like to perform stemming using the WordNet's dictionary, there are libraries & frameworks that perform integration with WordNet.
Below you can find some of them:
MIT Java WordNet Interface (JWI)
RiTa
Java WordNet Library (JWNL)
EDITED after more information was provided
I believe that the key point in the specified case is the selection of the "most popular 64 words". The selected words or phrases should be keywords or keyphrases, so the challenge here is keyword/keyphrase extraction.
There are several books, papers, and algorithms written about keyword/keyphrase extraction. The University of Waikato has implemented, in Java, a well-known algorithm called the Keyword Extraction Algorithm (KEA). KEA extracts keyphrases from text documents and can be used either for free indexing or for indexing with a controlled vocabulary. The implementation is distributed under the GNU General Public License.
Another issue that should be taken into consideration is Part-of-Speech (POS) tagging. Nouns carry more information than the other POS tags, so you may get better results if you check the POS tag and the selected 64 words are mostly nouns.
In addition, according to Anette Hulth's published paper Improved Automatic Keyword Extraction Given More Linguistic Knowledge, her experiments showed that keywords/keyphrases mostly have, or are contained in, one of the following five patterns:
ADJECTIVE NOUN (singular or mass)
NOUN NOUN (both sing. or mass)
ADJECTIVE NOUN (plural)
NOUN (sing. or mass) NOUN (pl.)
NOUN (sing. or mass)
In conclusion, a simple action that in my opinion could improve your results is to find the POS tag for each word and select mostly nouns when evaluating the new RSS feeds. You can use WordNet to find the POS tag for each word, and as I mentioned above, there are many libraries on the web that integrate with WordNet's dictionary. Of course stemming is also essential for the classification process and has to be maintained.
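A rough sketch of that noun-filtering step, with NLTK's POS tagger standing in for a WordNet-based lookup (an assumption on my part; resource names can vary between NLTK versions):
import nltk
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

text = "NASA launches a new mission to study distant planets"
tokens = nltk.word_tokenize(text)
# Keep only nouns (Penn Treebank tags starting with 'NN') as keyword candidates.
nouns = [word for word, tag in nltk.pos_tag(tokens) if tag.startswith("NN")]
print(nouns)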
I hope this helps.
Try turning off stemming altogether. The Stanford Intro to IR authors provide a rough justification of why stemming hurts, and at the very least does not help, in text classification contexts.
I have tested stemming myself on a custom multinomial naive Bayes text classification tool (I get accuracies of 85%). I tried the 3 Lucene stemmers available from org.apache.lucene.analysis.en version 4.4.0, which are EnglishMinimalStemFilter, KStemFilter and PorterStemFilter, plus no stemming, and I did the tests on small and larger training document corpora. Stemming significantly degraded classification accuracy when the training corpus was small, and left accuracy unchanged for the larger corpus, which is consistent with the Intro to IR statements.
Some more things to try:
Why only 64 words? I would increase that number by a lot, but preferably you would not have a limit at all.
Try tf-idf (term frequency, inverse document frequency). What you're using now is just tf. If you multiply this by idf you can mitigate problems arising from common and uninformative words like "show". This is especially important given that you're using so few top words.
Increase the size of the training corpus.
Try shingling to bi-grams, tri-grams, etc, and combinations of different N-grams (you're now using just unigrams).
There's a bunch of other knobs you could turn, but I would start with these. You should be able to do a lot better than 60%. 80% to 90% or better is common.
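As a rough illustration of the tf-idf and N-gram suggestions together, here's a sketch in Python with scikit-learn rather than Weka (toy feeds and categories; the same ideas carry over to Weka's StringToWordVector options):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

feeds = ["nasa shows new footage of mars", "fish stocks decline in north sea",
         "team shows poor form in cup final", "satellite launch show delayed"]
categories = ["S", "F", "SNT", "S"]
# tf-idf instead of raw counts, unigrams + bigrams, and no hard 64-word cap.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), MultinomialNB())
clf.fit(feeds, categories)
print(clf.predict(["new nasa footage delayed"]))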