doc2vec: any way to fetch closest matching terms for a given vector?

The use case I have is a collection of "upvoted" documents and "downvoted" documents, which I want to use to re-order a set of results in a search.
I am using gensim doc2vec and am able to run most_similar queries for a word or words and fetch matching words. But how would I fetch the matching words given a vector, such as the vector sum of the document vectors above?

Ohh silly me, I found the answer staring me right in the face; posting here in case anyone else has the issue:
similar_by_vector(vector, topn=10, restrict_vocab=None)
This is, however, found not in the Doc2Vec class but in the KeyedVectors class.
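For anyone landing here, a minimal sketch (gensim 4.x API assumed; the model path and document tags are placeholders) of summing document vectors and fetching the nearest words:

from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec.load("my_doc2vec.model")  # hypothetical path

# Sum the vectors of two "upvoted" documents (tags are placeholders).
query_vec = model.dv["doc_up_1"] + model.dv["doc_up_2"]

# similar_by_vector lives on KeyedVectors, so call it on model.wv
# (the word vectors) rather than on the Doc2Vec object itself.
print(model.wv.similar_by_vector(query_vec, topn=10))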

Related

Kibana search with regular expression not working

I am trying to find some logs in Kibana by using regular expressions. I am aware that Kibana doesn't support "classical" regex, but rather Lucene query syntax. I have read through its documentation (https://www.elastic.co/guide/en/elasticsearch/reference/6.7/query-dsl-regexp-query.html#regexp-syntax) and in my opinion my queries should work, but they don't.
Here is an example log entry that I want to target with my query:
Timings are: sync started at 2019-02-12 19:15:09.402; accounts downloaded:+760ms/760ms; accounts data downloaded:+1221ms/1981ms; categorization pushed:+0ms/1981ms; categorization started:+131ms/2112ms; categorization completed:+123ms/2235ms; in total:2235ms.
What I want to find, in the end, is all log entries where the "categorization started" time exceeds a certain threshold. However, my queries already fail at much simpler steps on the way to that final query.
I get results when I query:
message:"/categorization started/"
But when I modify it to:
message:/categorization started/
I get nothing. The following attempts also return nothing:
message:/categorization\sstarted/
message:/.*categorization\sstarted.*/
message:/.*categorization.*started.*/
At this point I'm already lost - why do all these queries not match anything?
In my mind, the query that should get what I want is the following (finding all entries where the "categorization started" time was 10,000 ms or more):
message:/.*categorization started:\+<10000-99999>ms.*/
It goes without saying that this also returns nothing, which doesn't surprise me given that the simpler queries above already fail.
Can anyone explain to me what I am doing wrong?
Thank you
I suggest you use:
message:*categorization started*
The regexp queries above most likely fail because Lucene runs a regular expression against the individual terms stored in the index, not against the whole message string. Since the message field is analyzed, "categorization" and "started" are indexed as separate tokens, so a pattern containing a space can never match a single term. The quoted variant returns results only because the quotes turn it into an ordinary phrase query, with the slashes stripped during analysis.
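If you really need the regexp route, here is a hedged sketch (hypothetical endpoint, index, and field names; it assumes your mapping defines an unanalyzed message.keyword sub-field) of running the original pattern against the whole string via the Elasticsearch search API:

import json
import urllib.request

query = {
    "query": {
        "regexp": {
            # A regexp against the raw keyword sub-field matches the whole
            # string, whereas against "message" it only sees single tokens.
            "message.keyword": ".*categorization started:\\+<10000-99999>ms.*"
        }
    }
}
req = urllib.request.Request(
    "http://localhost:9200/logs/_search",  # hypothetical endpoint/index
    data=json.dumps(query).encode(),
    headers={"Content-Type": "application/json"},
)
print(urllib.request.urlopen(req).read().decode())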

How can I get a vector from the output matrix in FastText?

In this study the authors found that Word2Vec generates two kinds of embeddings (IN & OUT):
https://arxiv.org/abs/1602.01137
Well, you can easily get those using the syn1 attribute in gensim's word2vec. In gensim's fastText the syn1 attribute does exist too, but since fastText is subword-based, it's not possible to get a vector for a word from the output matrix by matching indexes. Do you know any other way to calculate a word's vector using the output matrix?
In FastText, the vector for a word is the combination of:
the full-word vector, if it exists; and
all the subword vectors
You can view the gensim method that returns a vector, composed from subwords if necessary, at:
https://github.com/RaRe-Technologies/gensim/blob/2ccc82bf50bcfbee44932c160db076a873cf893e/gensim/models/keyedvectors.py#L1970
(I think this method might have a bug, in comparison to the original FastText approach, in that this gensim method perhaps should also add the subword vectors to the whole-word-vector, even when a whole-word-vector is available.)
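As a quick illustration of that composition on the input side, a minimal sketch (gensim 4.x assumed, toy corpus):

from gensim.models import FastText

sentences = [["hello", "world"], ["hello", "there"]]
model = FastText(sentences, vector_size=32, min_count=1, min_n=3, max_n=5)

# In-vocabulary word: composed as described above.
vec_known = model.wv["hello"]

# Out-of-vocabulary word: no full-word vector exists, so the result is
# built purely from its subword (n-gram) vectors.
vec_oov = model.wv["helo"]
print(vec_known.shape, vec_oov.shape)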

How to reduce semantically similar words?

I have a large corpus of words extracted from documents. The corpus contains words that might mean the same thing.
For example, "command" and "order" mean the same, while "apple" and "apply" do not.
I would like to merge the similar words, say "command" and "order", into "command".
I have tried word2vec, but it doesn't check for the semantic similarity of words (it outputs a good similarity for "apple" and "apply", since four characters in the words are the same). And when I try WUP similarity, it gives a good similarity score only if the words have matching synonyms, and the results are not that impressive.
What would be the best approach to reduce semantically similar words, so as to get rid of redundant data and merge similar data?
I believe one of the options here is using WordNet. It gives you a list of synonyms for a word, so you can merge them together, given you know its part of speech.
However, I'd like to point out that "order" and "command" are not always the same, e.g. you do not command food in a restaurant, and such sense ambiguity holds for many, many words.
I'd also like to point out that for word2vec spelling is irrelevant and is not taken into consideration at all; the algorithm considers only how words are used in context. I suppose you might be mixing it up with FastText.
That said, there may well be a problem with your model, because in a standard set of embeddings the distance between these concepts should be large. The MUSE FastText similarity between "apple" and "apply" is only 0.15, which is quite low.
I use Gensim's function
model.similarity("apply", "apple")
So you might need to fix learning parameters or just use a pretrained model.
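To make the WordNet suggestion concrete, here is a minimal sketch (assumes nltk with the wordnet corpus downloaded) that groups two words when they share a synset:

import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)

def share_synset(w1, w2, pos=wn.NOUN):
    # True if the two words appear together in at least one synset.
    s1 = set(wn.synsets(w1, pos=pos))
    s2 = set(wn.synsets(w2, pos=pos))
    return bool(s1 & s2)

# Candidate pairs to merge vs. keep apart; the outputs depend on the
# WordNet senses involved, so inspect them before merging for real.
print(share_synset("command", "order"))
print(share_synset("apple", "apply"))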

Clear approach for assigning semantic tags to each sentence (or short documents) in python

I am looking for a good approach using python libraries to tackle the following problem:
I have a dataset with a column that holds product descriptions. The values in this column can be very messy and contain a lot of words that are not related to the product. I want to know which rows are about the same product, so I need to tag each description with its main topics. For example, if I have the following:
"500 units shoe green sport tennis import oversea plastic", I would like the tags to be something like: "shoe", "sport". So I am looking to build an approach for semantic tagging of sentences, not part of speech tagging. Assume I don't have labeled (tagged) data for training.
Any help would be appreciated.
Lack of labeled data means you cannot apply any semantic classification method using word vectors, which would be the optimal solution to your problem. An alternative however could be to construct the document frequencies of your token n-grams and assume importance based on some smoothed variant of idf (i.e. words that tend to appear often in descriptions probably carry some semantic weight). You can then inspect your sorted-by-idf list of words and handpick(/erase) words that you deem important(/unimportant). The results won't be perfect, but it's a clean and simple solution given your lack of training data.
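A minimal sketch of that suggestion (toy descriptions, hypothetical data), ranking tokens by how many descriptions they appear in so you can hand-pick tags from the top of the list:

from collections import Counter

descriptions = [
    "500 units shoe green sport tennis import oversea plastic",
    "shoe sport running blue import",
    "plastic bottle import oversea",
]

# Document frequency: in how many descriptions does each token occur?
docs = [set(d.lower().split()) for d in descriptions]
df = Counter(tok for doc in docs for tok in doc)

# Tokens recurring across many descriptions ("shoe", "import", ...) are
# candidate tags; hand-erase the ones you deem noise (e.g. "import").
for tok, count in df.most_common():
    print(tok, count)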

What exactly is Pairwise Matching and how does it work?

I'm working on multiple-image stitching and I came across the term Pairwise Matching. I've searched almost every site, but I am unable to get a CLEAR description of what it exactly is and how it works.
I'm working in Visual Studio 2012 with OpenCV. I have modified stitching_detailed.cpp according to my requirements and have been very successful in maintaining quality in significantly less time, except for pairwise matching. I'm using ORB to find feature points. BestOf2NearestMatcher is used in stitching_detailed.cpp for pairwise matching.
What I know about Pairwise Matching and BestOf2NearestMatcher:
(Correct me if I'm wrong somewhere)
1) Pairwise Matching works similarly to other matchers such as the Brute-Force Matcher, the FLANN-Based Matcher, etc.
2) Unlike those matchers, Pairwise Matching works with multiple images; with them, you have to go one by one if you want to handle multiple images.
3) In Pairwise Matching, the features of one image are matched with those of every other image in the data set.
4) BestOf2NearestMatcher finds the two best matches for each feature and keeps the best one only if the ratio between the two descriptor distances passes the match_conf threshold.
What I want to know:
1) I want to know more details about pairwise matching, in case I'm missing something.
2) I want to know HOW pairwise matching works, the actual flow of it in detail.
3) I want to know HOW BestOf2NearestMatcher works, the actual flow of it in detail.
4) Where can I find code for BestOf2NearestMatcher? OR Where can I get similar code to BestOf2NearestMatcher?
5) Is there any alternative I can use for pairwise matching (or BestOf2NearestMatcher) which takes less time than the current one?
Why I want to know and what I'd do with it:
1) As I stated in the introduction, I want to reduce the time pairwise matching takes. If I can understand what pairwise matching actually is and how it works, I can create my own version according to my requirements, or modify the existing one.
Here's where I posted a question about reducing time for the entire program: here. I'm not asking the same question again; there I wanted to know how to reduce time in pairwise matching as well as other code sections, while here I'm asking about specifics: what pairwise matching is and how it works.
Any help is much appreciated!
EDIT: I found the code for pairwise matching in matchers.cpp. I created my own function in the main code to optimize the time. It works well.
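For reference, a rough Python/OpenCV sketch (toy image paths; a plain BFMatcher standing in for BestOf2NearestMatcher) of the pairwise-matching loop with the ratio test described above:

import itertools
import cv2

paths = ["img0.jpg", "img1.jpg", "img2.jpg"]  # placeholder inputs
orb = cv2.ORB_create()
features = []
for p in paths:
    img = cv2.imread(p, cv2.IMREAD_GRAYSCALE)
    features.append(orb.detectAndCompute(img, None))  # (keypoints, descriptors)

matcher = cv2.BFMatcher(cv2.NORM_HAMMING)
match_conf = 0.3  # plays the role of OpenCV's match_conf threshold
for i, j in itertools.combinations(range(len(paths)), 2):
    knn = matcher.knnMatch(features[i][1], features[j][1], k=2)
    # Keep a match only when the best distance is clearly smaller than
    # the second best (Lowe-style ratio test).
    good = [mn[0] for mn in knn
            if len(mn) == 2 and mn[0].distance < (1 - match_conf) * mn[1].distance]
    print(f"pair ({i}, {j}): {len(good)} good matches")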