When applying word2vec, should I standardize the cosine values within each year, when comparing them across years?

I'm a researcher, and I'm trying to apply NLP to understand temporal changes in the meaning of some words.
So far I have obtained trained embeddings (word2vec, SGNS) for several years, using identical training parameters.
For example, if I want to test the change of cosine similarity of word A and word B over 5 years, should I just compute them and plot the cosine values?
The reason I'm asking this is that I found the overall cosine values (the mean of all possible pairs within a year) differ across the 5 years. For example: 1990: 0.21, 1991: 0.19, 1992: 0.31, 1993: 0.22, 1994: 0.31. Does that mean that in some years all words are more similar to each other than in other years?
Based on my limited understanding, I think the vectors relate to odds in a logistic function, so they shouldn't be significantly affected by the size of the corpus? Is it necessary for me to standardize the cosine values (of all pairs within each year) so I can compare the relative ranking change across years, or should I just trust the raw cosine values and compare them across years?

In general you should not think of cosine-similarities as an absolute measure that'd be comparable between models. That is, you should not think of "0.7" cosine-similarity as anything like "70%" similar, and choose some arbitrary "70%" threshold to be used across models.
Instead, it's only a measure within a single model's induced space - with its effective 'scale' affected by all the parameters & the training data.
One small exercise that may help illustrate this: with the exact same data, train a 100d model, then a 200d model. Then look at some word pairs, or words alongside their nearest-neighbors ranked by cosine-similarity.
With enough training/data, generally the same highly-related words will be nearest-neighbors of each other. But the effective ranges of cosine-similarity values will be very different. If you chose a specific threshold in one model as meaning "close enough to feed some other analysis", the same threshold would not mean the same thing in the other. Every model is its own world, induced by the training data & parameters, as well as some sources of explicit or implicit randomness during training. (Several parts of the word2vec algorithm use random sampling, but also any efficient multi-threaded training will encounter arbitrary differences in training-order via host OS thread-scheduling vagaries.)
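As a minimal sketch of that exercise – assuming gensim 4.x (where the dimensionality parameter is vector_size), a placeholder corpus variable of tokenized sentences, and probe words that are well-represented in it:

from gensim.models import Word2Vec

# `corpus` is a placeholder: an iterable of tokenized sentences from one year's data
m100 = Word2Vec(corpus, vector_size=100, seed=1, workers=1)
m200 = Word2Vec(corpus, vector_size=200, seed=1, workers=1)

for probe in ['war', 'politics']:   # placeholder probe words; use any frequent words you care about
    print(probe)
    print('  100d:', m100.wv.most_similar(probe, topn=5))
    print('  200d:', m200.wv.most_similar(probe, topn=5))
# The neighbor *lists* tend to agree; the absolute cosine values usually do not.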
If your parameters are identical, & the corpora are very alike in every measurable internal proportion, these effects might be minimized, but never eliminated.
For example, even if people's intended word meanings were perfectly identical, one year's training data might include more discussion of 'war' or 'politics' or some medical topic than another's. In that case, the iterative, interleaved tug-of-war of training updates means words from that overrepresented domain have far more push-pull influence on the final word positions – essentially warping subregions of the final space toward finer distinctions in some places, and thus coarser distinctions in the less-updated zones.
That is, you shouldn't expect any global-per-model scaling factor (as you've implied might apply) to correct for any model-to-model differences. The influences of different data & training runs are far more subtle, and might affect different 'neighborhoods' of words differently.
Instead, when comparing different models, a more stable basis for comparison is relative rankings, or relative proportions, of words with respect to their closeness-to-others. Did words move into, or out of, each other's top-N neighbors? Did A move closer to B than C did to D? Etc.
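For instance, a sketch of such a rank-based comparison – the filenames and the words 'A_word'/'B_word' are placeholders, and it assumes each year's vectors were saved as gensim KeyedVectors:

from gensim.models import KeyedVectors

kv_1990 = KeyedVectors.load('vectors_1990.kv')   # placeholder filenames
kv_1994 = KeyedVectors.load('vectors_1994.kv')

def rank_of(kv, word, other, topn=100):
    """Return other's rank among word's topn nearest neighbors, or None if absent."""
    neighbors = [w for w, _ in kv.most_similar(word, topn=topn)]
    return neighbors.index(other) + 1 if other in neighbors else None

print(rank_of(kv_1990, 'A_word', 'B_word'), rank_of(kv_1994, 'A_word', 'B_word'))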
Even there, you might want to be careful about differences in the full vocabulary: if A & B were each other's closest neighbors in year 1, but 5 other words squeeze between them in year 2, did any word's meaning really change? Or might it simply be because those other words weren't even suitably represented in year 1 to receive any position, or previously had somewhat 'noisier' positions nearby? (As words get rarer, their positions from run to run will be more idiosyncratic, based on their few usage examples, and the influences of those other sources of run-to-run 'noise'.)
Limiting all such analyses to very-well-represented words will minimize misinterpreting noise-in-the-models as something meaningful. Re-running models more than once – with the same parameters, slightly-different ones, or slightly-different training-data subsets – and seeing which comparisons hold up across such changes, may also help determine which observed changes are robust, versus methodological artifacts such as run-to-run jitter or other sampling effects.
A few previous answers on similar questions about comparing word-vectors across different source corpora may have other useful ideas or caveats for you:
how calculate distance between 2 node2vec model
Word embeddings for the same word from two different texts
How to compare cosine similarities across three pretrained models?


Sentiment analysis feature extraction

I am new to NLP and feature extraction. I wish to create a machine learning model that can determine the sentiment of stock-related social media posts. For feature extraction of my dataset I have opted to use Word2Vec. My question is:
Is it important to train my word2vec model on a corpus of stock-related social media posts? The datasets that are available for this are not very large. Should I just use much larger pretrained word vectors?
The only way to tell what will work better for your goals, within your constraints of data/resources/time, is to try alternate approaches & compare the results on a repeatable quantitative evaluation.
Having training texts that are properly representative of your domain-of-interest can be quite important. You may need your representation of the word 'interest', for example, to reflect its stock/financial sense, rather than the more general sense of the word.
But quantity of data is also quite important. With smaller datasets, none of your words may get great vectors, and words important to evaluating new posts may be missing or of very-poor quality. In some cases taking some pretrained set-of-vectors, with its larger vocabulary & sharper (but slightly-mismatched to domain) word-senses may be a net help.
Because these pull in different directions, there's no general answer. It will depend on your data, goals, limits, & skills. Only trying a range of alternative approaches, and comparing them, will tell you what should be done for your situation.
This iterative, comparative experimental pattern repeats endlessly as your projects & knowledge grow – it's what the experts do! – so it's also important to learn & practice it. There's no authority you can ask for a certain answer to many of these tradeoff questions.
Other observations on what you've said:
If you don't have a large dataset of posts, and well-labeled 'ground truth' for sentiment, your results may not be good. All these techniques benefit from larger training sets.
Sentiment analysis is often approached as a classification problem (assigning texts to bins of 'positive' or 'negative' sentiment, perhaps of multiple intensities) or a regression problem (assigning texts a value on a numerical scale). There are many simpler ways to create features for such processes that do not involve word2vec vectors – a somewhat more-advanced technique, which adds complexity. (In particular, word-vectors only give you features for individual words, not for texts of many words, unless you add some other choices/steps.) If you're new to the sentiment-analysis domain, I would recommend against starting with word-vector features. Only consider adding them later, after you've achieved some initial baseline results without their extra complexity/choices. At that point, you'll also be able to tell whether they're helping or not.
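For instance, one simple baseline along those lines – a sketch only, where texts and labels are placeholders for your labeled posts and their sentiment classes – uses TF-IDF features with a plain classifier:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# `texts` (list of strings) and `labels` (list of classes) are placeholders
X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.2, random_state=0)
vec = TfidfVectorizer(min_df=2, ngram_range=(1, 2))   # unigram+bigram TF-IDF features
clf = LogisticRegression(max_iter=1000)
clf.fit(vec.fit_transform(X_train), y_train)
print(classification_report(y_test, clf.predict(vec.transform(X_test))))

Only once such a baseline is working and evaluated does it make sense to test whether word-vector-derived features improve on it.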

How does word2vec learn word relations?

Which part of the algorithm specifically makes the embeddings have the king - boy + girl = queen ability? Did they just do this by accident?
Edit:
Take CBOW as an example. I know that they use embeddings instead of one-hot vectors to encode the words, and that the embeddings are trainable – unlike the one-hot approach, where the data itself is not trainable. Then the output is a one-hot vector for the target word. They just average all the surrounding word embeddings at some point and then put some lego-like layers afterwards. So did they find the mentioned property by surprise at the end, or is there a training procedure or network structure that gives the embeddings that property?
The algorithm simply works to train (optimize) a shallow neural-network model that's good at predicting words, from other nearby words.
That's the only internal training goal – subject to the neural network's constraints on how the words are represented (N floating-point dimensions), or combined with the model's internal weights to render an interpretable prediction (forward propagation rules).
There's no other 'coaching' about what words 'should' do in relation to each other. All words are still just opaque tokens to word2vec. It doesn't even consider their letters: the whole-token is just a lookup key for a whole-vector. (Though, the word2vec variant FastText varies that somewhat by also training vectors for subwords – & thus can vaguely simulate the same intuitions that people have for word-roots/suffixes/etc.)
The interesting 'neighborhoods' of nearby words, and relative orientations that align human-interpretable aspects to vague directions in the high-dimensional coordinate space, fall out of the prediction task. And those relative orientations are what gives rise to the surprising "analogical arithmetic" you're asking about.
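For instance, in gensim that analogy arithmetic is queried like this – a sketch only, assuming model is a Word2Vec model trained on a corpus large enough that these words are all well-represented:

# positive words are added, negative words subtracted: king - boy + girl
result = model.wv.most_similar(positive=['king', 'girl'], negative=['boy'], topn=1)
print(result)   # with enough good training data, 'queen' tends to rank at or near the top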
Internally, there's a tiny internal training cycle applied over and over: "nudge this word-vector to be slightly better at predicting these neighboring words". Then, repeat with another word, and other neighbors. And again & again, millions of times, each time only looking at a tiny subset of the data.
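As an illustration only – not gensim's actual code – here is a heavily simplified numpy sketch of one such nudge, in the skip-gram negative-sampling variant:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical single update: w is the center word-vector, c_pos a true context
# vector, c_negs a few randomly-sampled 'negative' context vectors.
def sgns_update(w, c_pos, c_negs, lr=0.025):
    g = 1.0 - sigmoid(np.dot(w, c_pos))   # how wrong the 'true pair' prediction is
    w_grad = g * c_pos
    c_pos += lr * g * w                   # pull the true context slightly toward w
    for c_neg in c_negs:
        g_neg = -sigmoid(np.dot(w, c_neg))
        w_grad += g_neg * c_neg
        c_neg += lr * g_neg * w           # push each sampled negative slightly away
    w += lr * w_grad                      # nudge the center word-vector
    return w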
But the updates that contradict each other cancel out, and those that represent reliable patterns in the source training texts reinforce each other.
From one perspective, it's essentially trying to "compress" some giant vocabulary – tens of thousands, to millions, of unique words – into a smaller N-dimensional representation, usually 100-400 dimensions when you have enough training data. The dimensional values that become as-good-as-possible (but never necessarily great) at predicting neighbors turn out to exhibit the other desirable positionings, too.

Understanding model.similarity in word2vec

Hello, I am fairly new to word2vec. I wrote a small program to teach myself:
from gensim.models import Word2Vec

sentence = [['Yellow', 'Banana'], ['Red', 'Apple'], ['Green', 'Tea']]
model = Word2Vec(sentence, min_count=1, size=300, workers=4)
print(model.similarity('Yellow', 'Banana'))
The similarity came out to be:
-0.048776340629810115
My question is: why is the similarity between 'Banana' and 'Yellow' not closer to 1, like 0.70 or something? What am I missing? Kindly guide me.
Word2Vec doesn't work well on toy-sized examples – it's the subtle push-and-pull of many varied examples of the same words that moves word-vectors to useful relative positions.
But also, and especially, in your tiny example you've given the model 300-dimensional vectors to work with, and only a 6-word vocabulary. With so many parameters and so little to learn, it can essentially 'memorize' the training task, quickly becoming nearly-perfect at its internal prediction goal – and further, it can do that in many, many alternate ways that may not involve much change from the word-vectors' random initialization. So it is never forced to move the vectors to useful positions that provide generalized info about the words.
You can sometimes get somewhat meaningful results from small datasets by shrinking the vectors, and thus the model's free parameters, and giving the model more training iterations. So you could try size=2, iter=20. But you'd still want more examples than just a few, and more than a single occurrence of each word. (Even in larger datasets, the vectors for words with just a small number of examples tend to be poor - hence the default min_count=5, which should be increased even higher in larger datasets.)
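A sketch of that tweak follows – note that in current gensim 4.x the parameter names are vector_size & epochs (older releases used size & iter), and similarity is queried via model.wv:

from gensim.models import Word2Vec

sentence = [['Yellow', 'Banana'], ['Red', 'Apple'], ['Green', 'Tea']]
# tiny vectors + many passes, so the model can't trivially memorize the task
tiny_model = Word2Vec(sentence, min_count=1, vector_size=2, epochs=20, workers=1)
print(tiny_model.wv.similarity('Yellow', 'Banana'))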
To really see word2vec in action, aim for a training corpus of millions of words.

Does Mikolov 2014 Paragraph2Vec models assume sentence ordering?

In the Mikolov 2014 paper on paragraph vectors, https://arxiv.org/pdf/1405.4053v2.pdf, do the authors assume, for both PV-DM and PV-DBOW, that the ordering of sentences needs to make sense?
Imagine I am handling a stream of tweets, and each tweet is a paragraph. The paragraphs/tweets do not necessarily have ordering relations. After training, does the vector embedding for paragraphs still make sense?
Each document/paragraph is treated as a single unit for training – and there’s no explicit way that the neighboring documents directly affect a document’s vector. So the ordering of documents doesn’t have to be natural.
In fact, you generally don’t want all similar text-examples to be clumped together – for example, all those on a certain topic, or using a certain vocabulary, in the front or back of all training examples. That’d mean those examples are all trained with a similar alpha learning rate, and affect all related words without interleaved offsetting examples with other words. Either of those could make a model slightly less balanced/general, across all possible documents. For this reason, it can be good to perform at least one initial shuffle of the text examples before training a gensim Doc2Vec (or Word2Vec) model, if your natural ordering might not spread all topics/vocabulary words evenly through the training corpus.
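For instance, a minimal sketch of that one-time shuffle, where raw_texts is a placeholder for your list of tokenized tweets:

import random
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [TaggedDocument(words=tokens, tags=[i]) for i, tokens in enumerate(raw_texts)]
random.shuffle(docs)            # shuffle whole documents, never the words inside each one
model = Doc2Vec(docs, vector_size=100, epochs=20, min_count=5, workers=4)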
The PV-DM modes (default dm=1 mode in gensim) do involve sliding context-windows of nearby words, so word proximity within each example matters. (Don’t shuffle the words inside each text!)

Train doc2vec for company name similarity

I am trying to deduplicate a huge list of companies (40M+) using name similarities. I have 500K company-name pairs labelled same/not-same (like I.B.M. = International Business Machines). A model built with logistic regression on the vector difference of name pairs has a great F-score (0.98), but inference (finding the most similar names) is too slow (almost 2 seconds per name).
Is it possible to train a doc2vec model using name-similarity pairs (positive and negative), so that similar names get similar vectors and I can then use fast vector-similarity algorithms like Annoy?
Searching for the top-N nearest-neighbors in high-dimensional spaces is hard. To get a perfectly accurate top-N typically requires an exhaustive search, which is probably the reason for your disappointing performance.
When some indexing can be applied, as with the ANNOY library, some extra indexing time and index-storage is required, and accuracy is sacrificed because some of the true top-N neighbors can be missed.
You haven't mentioned how your existing vectors are created. You don't need to adopt a new vector-creation method (like doc2vec) to use indexing; you can apply indexing libraries to your existing vectors.
If your existing vectors are sparse (as for example if they are big bag-of-character-n-grams representations, with many dimensions but most 0.0), you might want to look into Facebook's PySparNN library.
If they're dense, in addition to the ANNOY you mentioned, Facebook FAISS can be considered.
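For instance, a sketch of indexing already-computed dense vectors with ANNOY – here vectors is a placeholder for whatever equal-length per-name float vectors you already have:

from annoy import AnnoyIndex

dim = len(vectors[0])
index = AnnoyIndex(dim, 'angular')       # 'angular' ~ cosine distance
for i, v in enumerate(vectors):
    index.add_item(i, v)
index.build(50)                          # more trees: better recall, larger index
nearest = index.get_nns_by_item(0, 10)   # approximate top-10 neighbors of item 0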
But also, even the exhaustive search-for-neighbors is highly parallelizable: split the data into M shards on M different systems, and finding the top-N on each is often close to 1/Mth the time of the same operation on the full index, while merging the M top-N lists is relatively quick. So if finding the most-similar is your key bottleneck, and you need the top-N most-similar in, say, 100ms, throw 20 machines at 20 shards of the problem.
(Similarly, the top-N results for all may be worth batch-calculating. If you're using cloud resources, rent 500 machines to do 40 million 2-second operations, and you'll be done in under two days.)