Train doc2vec for company name similarity - word2vec

I am trying to deduplicate a huge list of companies (40M+) using the name similarities. I have a 500K of company name pairs labelled same/not-same (like I.B.M.=International Business Machines). Model built by logistic regression on vector difference of name pairs has a great f-score (0.98) but the inference (finding the most similar names) is too slow (almost 2 secs per name).
Is it possible to train doc2vec model using name similarity pairs (positive and negative), resulting in similar names has similar vectors so that I can use fast vector similarities algorithms like Annoy?

Searching for the top-N nearest-neighbors in high-dimensional spaces is hard. To get a perfectly accurate top-N typically requires an exhaustive search, which is probably the reason for your disappointing performance.
When some indexing can be applied, as with the ANNOY library, some extra indexing time and index-storage is required, and accuracy is sacrificed because some of the true top-N neighbors can be missed.
You haven't mentioned how your existing vectors are created. You don't need to adopt a new vector-creation method (like doc2vec) to use indexing; you can apply indexing libraries to your existing vectors.
If your existing vectors are sparse (as for example if they are big bag-of-character-n-grams representations, with many dimensions but most 0.0), you might want to look into Facebook's PySparNN library.
If they're dense, in addition to the ANNOY you mentioned, Facebook FAISS can be considered.
But also, even the exhaustive search-for-neighbors is highly parallelizable: split the data into M shards on M different systems, and finding the top-N on each is often close to 1/Nth the time of the same operation on the full index, then merging the M top-N lists relatively quick. So if finding the most-similar is your key bottleneck, and you need the top-N most-similar in say 100ms, throw 20 machines at 20 shards of the problem.
(Similarly, the top-N results for all may be worth batch-calculating. If you're using cloud resources, rent 500 machines to do 40 million 2-second operations, and you'll be done in under two days.)

Related

How does word2vec learn word relations?

Which part of the algorithm specifically makes the embeddings to have the king - boy + girl = queen ability? Did they just did this by accident?
Edit :
Take the CBOW as an example. I know about they use embeddings instead of one-hot vectors to encode the words and made the embeddings trainable instead of how we do when using one hot vectors that the data itself is not trainable. Then the output is a one-hot vector for target word. They just average all the surrounding word embeddings at some point then put some lego layers afterwards. So at the end they find the mentioned property by surprise, or is there a training procedure or network structure that gave the embeddings that property?
The algorithm simply works to train (optimize) a shallow neural-network model that's good at predicting words, from other nearby words.
That's the only internal training goal – subject to the neural network's constraints on how the words are represented (N floating-point dimensions), or combined with the model's internal weights to render an interpretable prediction (forward propagation rules).
There's no other 'coaching' about what words 'should' do in relation to each other. All words are still just opaque tokens to word2vec. It doesn't even consider their letters: the whole-token is just a lookup key for a whole-vector. (Though, the word2vec variant FastText varies that somewhat by also training vectors for subwords – & thus can vaguely simulate the same intuitions that people have for word-roots/suffixes/etc.)
The interesting 'neighborhoods' of nearby words, and relative orientations that align human-interpretable aspects to vague directions in the high-dimensional coordinate space, fall out of the prediction task. And those relative orientations are what gives rise to the surprising "analogical arithmetic" you're asking about.
Internally, there's a tiny internal training cycle applied over and over: "nudge this word-vector to be slightly better at predicting these neighboring words". Then, repeat with another word, and other neighbors. And again & again, millions of times, each time only looking at a tiny subset of the data.
But the updates that contradict each other cancel out, and those that represent reliable patterns in the source training texts reinforce each other.
From one perspective, it's essentially trying to "compress" some giant vocabulary – tens of thousands, to millions, of unique words – into a smaller N-dimensional representation - usually 100-400 dimensions when you have enough training data. The dimensional-values that become as-good-as-possible (but never necessary great) at predicting neighbors turn out to exhibit the other desirable positionings, too.

When applying word2vec, should I standardize the cosine values within each year, when comparing them across years?

I'm a researcher, and I'm trying to apply NPL to understand the temporal changes of the meaning of some words.
So far I have obtained the trained embeddings (word2vec, sgn) of several years with identical parameters in the training.
For example, if I want to test the change of cosine similarity of word A and word B over 5 years, should I just compute them and plot the cosine values?
The reason I'm asking this is that I found the overall cosine values (mean of all possible pairs within that year) differ across the 5 years. **For example, 1990:0.21, 1991:0.19, 1992:0.31, 1993:0.22, 1994:0.31. Does it mean in some years, all words are more similar to each other than other years??
Base on my limited understanding, I think the vectors are odds in logistic functions, so they shouldn't be significantly affected by the size of the corpus? Is it necessary for me to standardize the cosine values (of all pairs within each year) so I can compare the relative ranking change across years? Or just trust the raw cosine values and compare them across years?
In general you should not think of cosine-similarities as an absolute measure that'd be comparable between models. That is, you should not think of "0.7" cosine-similarity as anything like "70%" similar, and choose some arbitrary "70%" threshold to be used across models.
Instead, it's only a measure within a single model's induced space - with its effective 'scale' affected by all the parameters & the training data.
One small exercise that may help illustrate this: with the exact same data, train a 100d model, then a 200d model. Then look at some word pairs, or words alongside their nearest-neighbors ranked by cosine-similarity.
With enough training/data, generally the same highly-related words will be nearest-neighbors of each other. But the effective ranges of cosine-similarity values will be very different. If you chose a specific threshold in one model as meaning, "close enough to feed some other analysis", the same threshold would not be sufficient in the other. Every model is its own world, induced by the training data & parameters, as well as some sources of explicit or implicit randomness during training. (Several parts of the word2vec algorithm use random sampling, but also any efficient multi-threaded training will encounter arbitray differences in training-order via host OS thread-scheduling vagaries.)
If your parameters are identical, & the corpora very-alike in every measurable internal proportion, these effects might be minimized, but never eliminated.
For example, even if people's intended word meanings were perfectly identical, one year's training data might include more discussion of 'war' or 'politics' or some medical-topic, than another. In that case, the iterative, interleaved tug-of-war in training updates will mean words from that overrepresented domain have far more push-pull influence on the final model word positions – essentially warping subregions of the final space for finer distinctions some places, and thus *coarser distinctions in the less-updated zones.
That is, you shouldn't expect any global-per-model scaling factor (as you've implied might apply) to correct for any model-to-model differences. The influences of different data & training runs are far more subtle, and might affect different 'neighborhoods' of words differently.
Instead, when comparing different models, a more stable grounds for comparison is relative rankings or relative-proportions of words with respect to their closeness-to-others. Did words move into, or out of, each others' top-N neighbors? Did A move more closely to B than C did to D? etc.
Even there, you might want to be careful about differences in the full vocabulary: if A & B were each others' closest neighbors year 1, but 5 other words squeeze between them in year 2, did any word's meaning really change? Or might it simply be because those other words weren't even suitably represented in year 1 to receive any position, or previously had somewhat 'noisier' positions nearby? (As words get rarer their positions from run to run will be more idiosyncratic, based on their few usage examples, and the influences of those other sources of run-to-run 'noise'.)
Limiting all such analyses to very-well-represented words will minimize misinterpreting noise-in-the-models as something meaningful. Re-running models more than once, either with same parameters or slightly-different ones, or slightly-different training data subsets, and seeing which comparisons hold up across such changes, may also help determine which observed changes are robust, versus methodological artifacts such as jitter from run-to-run, or other sampling effects.
A few previous answers on similar questions about comparing word-vectors across different source corpora may have other useful ideas or caveats for you:
how calculate distance between 2 node2vec model
Word embeddings for the same word from two different texts
How to compare cosine similarities across three pretrained models?

vocab size versus vector size in word2vec

I have a data with 6200 sentences(which are triplets of form "sign_or_symptoms diagnoses Pathologic_function"), however the unique words(vocabulary) in these sentence is 181, what would be the appropriate vector size to train a model on the sentences with such low vocabulary. Is there any resource or research on appropriate vector size depending on vocabulary size?
The best practice is to test it against your true end-task.
That's an incredibly small corpus and vocabulary-size for word2vec. It might not be appropriate at all, as it gets its power from large, varied training sets.
But on the bright side, you can run lots of trials with different parameters very quickly!
You absolutely can't use a vector dimensionality as large as your vocabulary (181), or even really very close. In such a case, the model is certain to 'overfit' – just memorizing the effects of each word in isolation, with none of the necessary trading-off 'tug-of-war', forcing words to be nearer/farther to each other, that creates the special value/generality of word2vec models.
My very loose rule-of-thumb would be to investigate dimensionalities around the square-root of the vocabulary size. And, multiples-of-4 tend to work best in the underlying array routines (at least when performance is critical, which it might not be with such a tiny data set). So I'd try 12 or 16 dimensions first, and then explore other lower/higher values based on some quantitative quality evaluation on your real task.
But again, you're working with a dataset so tiny, unless your 'sentences' are actually really long, word2vec may be a very weak technique for you without more data.

Real reason for speed up in fasttext

What is the real reason for speed-up, even though the pipeline mentioned in the fasttext paper uses techniques - negative sampling and heirerchichal softmax; in earlier word2vec papers. I am not able to clearly understand the actual difference, which is making this speed up happen ?
Is there that much of a speed-up?
I don't think there are any algorithmic breakthroughs which make the word2vec-equivalent word-vector training in FastText significantly faster. (And if you're using the character-ngrams option in FastText, to allow post-training synthesis of vectors for unseen words based on substrings shared with training-words, I'd expect the training to be slower, because every word requires training of its substring vectors as well.)
Any speedups in FastText are likely just because the code is well-tuned, with the benefit of more implementation experience.
To be efficient on datasets with a very large number of categories, Fast text uses a hierarchical classifier instead of a flat structure, in which the different categories are organized in a tree (think binary tree instead of list). This reduces the time complexities of training and testing text classifiers from linear to logarithmic with respect to the number of classes. FastText also exploits the fact that classes are imbalanced (some classes appearing more often than other) by using the Huffman algorithm to build the tree used to represent categories. The depth in the tree of very frequent categories is, therefore, smaller than for infrequent ones, leading to further computational efficiency.
Reference link: https://research.fb.com/blog/2016/08/fasttext/

Clarification needed about min/sim hashing + LSH

I have a reasonable understanding of a technique to detect similar documents
consisting in first computing their minhash signatures (from their shingles, or
n-grams), and then use an LSH-based algorithm to cluster them efficiently
(i.e. avoid the quadratic complexity which would entail a naive pairwise
exhaustive search).
What I'm trying to do is to bridge three different algorithms, which I think are
all related to this minhash + LSH framework, but in non-obvious ways:
(1) The algorithm sketched in Section 3.4.3 of Chapter 3 of the book Mining of Massive Datasets (Rajaraman and Ullman), which seems to be the canonical description of minhashing
(2) Ryan Moulton's Simple Simhashing article
(3) Charikar's so-called SimHash algorihm, described in this article
I find this confusing because what I assume is that although (2) uses the term
"simhashing", it's actually doing minhashing in a way similar to (1), but with
the crucial difference that a cluster can only be represented by a single
signature (even tough multiple hash functions might be involved), while two
documents have more chances of being similar with (1), because the signatures
can collide in multiple "bands". (3) seems like a different beast altogether, in
that the signatures are compared in terms of their Hamming distance, and the LSH
technique implies multiple sorting of the signatures, instead of banding them. I
also saw (somewhere else) that this last technique can incorporate a notion of
weighting, which can be used to put more emphasis on certain document parts, and
which seems to lack in (1) and (2).
So at last, my two questions:
(a) Is there a (satisfying) way in which to bridge those three algorithms?
(b) Is there a way to import this notion of weighting from (3) into (1)?
Article 2 is actually discussing minhash, but has erroneously called it simhash. That's probably why it is now deleted (it's archived here). Also, confusingly, it talks about "concatenating" multiple minhashes, which as you rightly observe reduces the chance of finding two similar documents. It seems to prescribe an approach that yields only a single "band", which will give very poor results.
Can the algorithms be bridged/combined?
Probably not. To see why, you should understand what the properties of the different hashes are, and how they are used to avoid n2 comparisons between documents.
Properties of minhash and simhash:
Essentially, minhash generates multiple hashes per document, and when there are two similar documents it is likely that a subset of these hashes will be identical. Simhash generates a single hash per document, and where there are two similar documents it is likely that their simhashes will be similar (having a small hamming distance).
How they solve the n2 problem:
With minhash you index all hashes to the documents that contain them; so if you are storing 100 hashes per document, then for each new incoming document you need to look up each of its 100 hashes in your index and find all documents that share at least (e.g.) 50% of them. This could mean building large temporary tallies, as hundreds of thousands of documents could share at least one hash.
With simhash there is a clever technique of storing each document's hash in multiple permutations in multiple sorted tables, such that any similar hash up to a certain hamming distance (3, 4, 5, possibly as high as 6 or 7 depending on hash size and table structure) is guaranteed to be found nearby in at least one of these tables, differing only in the low order bits. This makes searching for similar documents efficient, but restricts you to only finding very similar documents. See this simhash tutorial.
Because the use of minhash and simhash are so different, I don't see a way to bridge/combine them. You could theoretically have a minhash that generates 1-bit hashes and concatenate them into something like a simhash, but I don't see any benefit in this.
Can weighting be used in minhash?
Yes. Think of the minhashes as slots: if you store 100 minhashes per document, that's 100 slots you can fill. These slots don't all have to be filled from the same selection of document features. You could devote a certain number of slots to hashes calculated just from specific features (care should be taken, though, to always select features in such a way that the same features will be selected in similar documents).
So for example if you were dealing with journal articles, you could assign some slots for minhashes of document title, some for minhashes of article abstract, and the remainder of the slots for minhashes of the document body. You can even set aside some individual slots for direct hashes of things such as author's surname, without bothering about the minhash algorithm.
It's not quite as elegant as how simhash does it, where in effect you're just creating a bitwise weighted average of all the individual feature hashes, then rounding each bit up to 1 or down to 0. But it's quite workable.