Gensim: Word2Vec Recommender Accuracy Improvement

I am trying to implement something similar to https://arxiv.org/pdf/1603.04259.pdf using the awesome gensim library, however I am having trouble improving the quality of the results when I compare to Collaborative Filtering.
I have two models, one built on Apache Spark and the other using gensim Word2Vec, both trained on the GroupLens 20 million ratings dataset. My Apache Spark model is hosted on AWS at http://sparkmovierecommender.us-east-1.elasticbeanstalk.com
and I am running the gensim model locally. However, when I compare the results I see superior results with the CF model 9 out of 10 times (as in the example below, where its results are more similar to the searched movie, with an affinity towards Marvel movies).
e.g., if I search for the movie Thor I get the results below:
Gensim
Captain America: The First Avenger (2011)
X-Men: First Class (2011)
Rise of the Planet of the Apes (2011)
Iron Man 2 (2010)
X-Men Origins: Wolverine (2009)
Green Lantern (2011)
Super 8 (2011)
Tron: Legacy (2010)
Transformers: Dark of the Moon (2011)
CF
Captain America: The First Avenger
Iron Man 2
Thor: The Dark World
Iron Man
The Avengers
X-Men: First Class
Iron Man 3
Star Trek
Captain America: The Winter Soldier
Below is my model configuration. So far I have tried playing with the window, min_count and size parameters, but without much improvement.
import gensim

word2vec_model = gensim.models.Word2Vec(
    seed=1,
    size=100,
    min_count=50,
    window=30)
# vocabulary must be built before training when no corpus is passed to the constructor
word2vec_model.build_vocab(movie_list)
word2vec_model.train(movie_list, total_examples=len(movie_list), epochs=10)
Any help in this regard is appreciated.

You don't mention which Collaborative Filtering algorithm you're trying, but maybe it's just better than Word2Vec for this purpose. (Word2Vec is not doing awfully here; why do you expect it to be better?)
Alternate meta-parameters might do better.
For example, window is the maximum distance between tokens that might affect each other, but the effective window used for each target-token training example is randomly chosen from 1 to window, as a way to give nearby tokens more weight. Thus when some training texts are much larger than the window (as in your example row), some of the correlations will be ignored (or underweighted). Unless ordering is very significant, a giant window (MAX_INT?) might do better, or even a related method where ordering is irrelevant (such as Doc2Vec in pure PV-DBOW dm=0 mode, with every token used as a doc-tag).
Depending on how much data you have, your size might be too large or small. Different min_count, negative count, greater 'iter'/'epochs', or sample level might work much better. (And perhaps even things you've already tinkered with would only help after other changes are in place.)
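As a rough, untuned sketch of those two options (the parameter values are placeholders, movie_list is assumed to be your existing per-user sequences of movie IDs, and the pre-4.0 gensim parameter names from your snippet are used):
import gensim
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Option 1: plain Word2Vec, but with a huge window so every movie in a user's
# history can affect every other, largely ignoring ordering
w2v = gensim.models.Word2Vec(
    seed=1,
    size=100,
    min_count=50,
    window=100000,   # effectively "the whole list" for any user
    sample=1e-4,     # placeholder; worth experimenting with
    negative=10)     # placeholder; worth experimenting with
w2v.build_vocab(movie_list)
w2v.train(movie_list, total_examples=len(movie_list), epochs=20)

# Option 2: Doc2Vec in pure PV-DBOW mode (dm=0), with every movie ID also used
# as a doc-tag, so each movie's vector is trained against all co-watched movies
# regardless of position
docs = [TaggedDocument(words=movies, tags=movies) for movies in movie_list]
d2v = Doc2Vec(dm=0, size=100, min_count=50, seed=1)
d2v.build_vocab(docs)
d2v.train(docs, total_examples=len(docs), epochs=20)
# similar movies could then be queried with, e.g., d2v.docvecs.most_similar(movie_id)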

Related

Choose the appropriate way to deal with weights in svyset in Stata

I decided to post here a request for support that I put on Statalist yesterday. I have not yet received a hint and thought it could be useful to extend the audience by posting it here.
The link to the original post is the following:
https://www.statalist.org/forums/forum/general-stata-discussion/general/1659627-choose-the-appropriate-way-to-deal-with-weights-in-svyset?view=thread
Dear Members,
I designed a questionnaire to gather respondents' willingness to get vaccinated against COVID-19 via a discrete choice experiment. I relied on a company specialized in political opinion polls and market research to administer the survey. The company computed a weight for each respondent based on 1) the geographical location where the respondent lives (five macro-areas of Italy), 2) whether the respondent has a bachelor's degree or not, and 3) which age group she/he belongs to (five classes are considered).
The sum of the weights is equal to the number of individuals in the database. The individuals belonging to the age classes 30-39 and 40-49 are oversampled, as per our request (related to a research hypothesis). The proportion of these two classes within the sample is larger than the actual proportion in the Italian population. The weights are computed in order to account for this feature and guarantee that the sample is representative of the characteristics of the Italian population.
I will use the data to estimate a logit model, multinomial logit models and mixed logit models.
The issue I am facing is the proper path to follow to declare the nature of the weights. I have no experience in using Stata to deal with this issue.
I am using Stata 17 on a PC with Windows 10 Pro 64 bit.
Combining the information from the video, the svyset manual and the results from the help for "weight", I tried to think about what the most appropriate solution is.
I tried to add the code here multiple times as well, but I kept receiving an error message about how I formatted it. My apologies.

word2vec guessing word embeddings

Can word2vec be used for guessing words with just context?
Having trained the model with a large data set (e.g. Google News), how can I use word2vec to predict a similar word with only context, e.g. with the input ", who dominated chess for more than 15 years, will compete against nine top players in St Louis, Missouri."? The output should be Kasparov or maybe Carlsen.
I've seen only the similarity APIs, but I can't make sense of how to use them for this. Is this not how word2vec was intended to be used?
It is not the intended use of word2vec. The word2vec algorithm internally tries to predict exact words, using surrounding words, as a roundabout way to learn useful vectors for those surrounding words.
But even so, it's not forming exact predictions during training. It's just looking at a single narrow training example – context words and target word – and performing a very simple comparison and internal nudge to make its conformance to that one example slightly better. Over time, that self-adjusts towards useful vectors – even if the predictions remain of wildly-varying quality.
Most word2vec libraries don't offer a direct interface for showing ranked predictions, given context words. The Python gensim library, for the last few versions (as of current version 2.2.0 in July 2017), has offered a predict_output_word() method that roughly shows what the model would predict, given context-words, for some training modes. See:
https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec.predict_output_word
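For illustration, usage looks roughly like this (a sketch only: the model path is hypothetical, and the method only works for models trained with negative sampling):
from gensim.models import Word2Vec

model = Word2Vec.load('my_news_model')  # hypothetical model you trained yourself

context = ['dominated', 'chess', 'years', 'compete', 'top', 'players', 'St', 'Louis', 'Missouri']
# returns a ranked list of (word, probability) guesses for the missing word
print(model.predict_output_word(context, topn=10))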
However, considering your fill-in-the-blank query (also called a 'cloze deletion' in related education or machine-learning contexts):
_____, who dominated chess for more than 15 years, will compete against nine top players in St Louis, Missouri
A vanilla word2vec model is unlikely to get that right. It has little sense of the relative importance of words (except when some words are more narrowly predictive of others). It has no sense of grammar/ordering, or of the compositional meaning of connected phrases (like 'dominated chess' as opposed to the separate words 'dominated' and 'chess'). Even though words describing the same sorts of things are usually near each other, it doesn't know categories well enough to determine that the blank must be a 'person' and a 'chess player', and the fuzzy similarities of word2vec don't guarantee words-of-a-class will necessarily all be nearer each other than other words.
There has been a bunch of work to train word/concept vectors (aka 'dense embeddings') to be better at helping at such question-answering tasks. A random example might be "Creating Causal Embeddings for Question Answering with Minimal Supervision" but queries like [word2vec question answering] or [embeddings for question answering] will find lots more. I don't know of easy out-of-the-box libraries for doing this, with or without a core of word2vec, though.

Re-Training spaCy's NER v1.8.2 - Training Volume and Mix of Entity Types

I'm in the process of (re-) training spaCy's Named Entity Recognizer and have a couple of doubts that I hope a more experienced researcher/practitioner can help me figure out:
If a few hundred examples are considered 'a good starting point', then what would be a reasonable number to aim for? Is 100,000 entities per label excessive?
If I introduce a new label, is it best if the numbers of entities for each label are roughly the same (balanced) during training?
Regarding the mixing in 'examples of other entity types':
do I just add random known categories/labels to my training set eg: ('The Business Standard published in its recent issue on crude oil and natural gas ...', [(4,21, 'ORG')], )?
can I use the same text for various labels? e.g. ('The Business Standard published in its recent issue on crude oil and natural gas ...', [(55,64, 'COMMODITY')], )?
on a similar note, let's assume I want spaCy to also recognize a second COMMODITY: could I then just use the same sentence and label a different region, e.g. ('The Business Standard published in its recent issue on crude oil and natural gas ...', [(69,80, 'COMMODITY')], )? Is that how it's supposed to be done?
what ratio between new and other (old) labels is considered reasonable?
Thanks
PS: I'm working with Python 2.7 on Ubuntu 16.04 using spaCy 1.8.2
For a full answer by Matthew Honnibal check out issue 1054 on spaCy's github page. Below are the most important points as they relate to my questions:
Question (Q) 1: If a few hundred examples are considered 'a good starting point', then what would be a reasonable number to aim for? Is 100,000 entities per label excessive?
Answer (A): Every machine learning problem will have a different examples/accuracy curve. You can get an idea for this by training with less data than you have, and seeing what the curve looks like. If you have 1,000 examples, then try training with 500, 750, etc., and see how that affects your accuracy.
Q 2: If I introduce a new label, is it best if the number of entities of that label is roughly the same (balanced) during training?
A: There's a trade-off between making the gradients too sparse and making the learning problem too unrepresentative of what the actual examples will look like.
Q 3: Regarding the mixing in 'examples of other entity types':
do I just add random known categories/labels to my training set:
A: No, one should annotate all the entities in that text, so the example above: ('The Business Standard published in its recent issue on crude oil and natural gas ...', [(4,21, 'ORG')], ) should be ('The Business Standard published in its recent issue on crude oil and natural gas ...', [(4,21, 'ORG'), (55,64, 'COMMODITY'), (69,80, 'COMMODITY')], )
can I use the same text for various labels?:
A: Not in the way the examples were given. See previous answer.
what ratio between new and other (old) labels is considered reasonable?:
A: See answer Q 2.
PS: Double citations are direct quotes from the github issue answer.
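To make the fully-annotated training example from Q3 concrete, here is a rough sketch in the style of the spaCy 1.x training examples (the exact calls may differ slightly in 1.8.2, and the offsets are taken from the question rather than re-verified):
import spacy
from spacy.gold import GoldParse

nlp = spacy.load('en')
nlp.entity.add_label('COMMODITY')  # a new label must be registered before training

text = 'The Business Standard published in its recent issue on crude oil and natural gas ...'
# every known entity in the sentence is annotated, not just the new label
entities = [(4, 21, 'ORG'), (55, 64, 'COMMODITY'), (69, 80, 'COMMODITY')]

for _ in range(10):  # several passes over the (ideally shuffled) training data
    doc = nlp.make_doc(text)
    nlp.tagger(doc)  # the v1.x examples run the tagger before updating the NER
    gold = GoldParse(doc, entities=entities)
    nlp.entity.update(doc, gold)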

Distinguishing between terms of different domains

What I am trying to do:
I am trying to take a list of terms and distinguish which domain they are coming from. For example "intestine" would be from the anatomical domain while the term "cancer" would be from the disease domain. I am getting these terms from different ontologies such as DOID and FMA (they can be found at bioportal.bioontology.org)
The problem:
I am having a hard time figuring out the best way to implement this. Currently I am naively taking the terms from the DOID and FMA ontologies and taking the difference: removing any term that is in the FMA list (which we know is anatomical) from the DOID list (which contains terms that may be partly anatomical, such as colon carcinoma, where colon is anatomical and carcinoma is disease).
Thoughts:
I was thinking that I could collect root words, prefixes, and suffixes for the different term domains and try to match them against the terms in the list. Another idea is to take more information from their ontologies, such as metadata, and use this to distinguish between the terms.
Any ideas are welcome.
As a first run, you'll probably have the best luck with bigrams. As an initial hypothesis, diseases are usually noun phrases, and usually have a very English-specific structure where NP -> N N, like "liver cancer", which means roughly the same thing as "cancer of the liver." Doctors tend not to use the latter, while the former should be caught with bigrams quite well.
Use the two ontologies you have there as starting points to train some kind of bigram model. Like Rcynic suggested, you can count them up and derive probabilities. A Naive Bayes classifier would work nicely here. The features are the bigrams; classes are anatomy or disease. sklearn has Naive Bayes built in. The "naive" part means, in this case, that all your bigrams are independent of each other. This assumption is fundamentally false, but it works well in a lot of circumstances, so we pretend it's true.
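A minimal sketch of that idea, assuming the two ontologies have been flattened into plain lists of term strings (the tiny lists here are only placeholders):
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# placeholder training data; in practice, flatten the full FMA and DOID term lists
anatomy_terms = ['liver', 'small intestine', 'colon']
disease_terms = ['colon carcinoma', 'liver cancer', 'carcinoma']

X = anatomy_terms + disease_terms
y = ['anatomy'] * len(anatomy_terms) + ['disease'] * len(disease_terms)

# word unigrams + bigrams of each term as features (unigrams kept so that
# single-word terms still get features), with Naive Bayes on top
clf = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),
    MultinomialNB())
clf.fit(X, y)

print(clf.predict(['colon cancer']))  # e.g. ['disease'], once trained on real data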
This won't work perfectly. As it's your first pass, you should be prepared to probe the output to understand how it derived the answer it came up with and to find cases it failed on. When you find trends in the errors, tweak your model and try again.
I wouldn't recommend WordNet here. It wasn't written by doctors, and since what you're doing relies on precise medical terminology, it's probably going to add bizarre meanings. Consider, from nltk.corpus.wordnet:
>>> from nltk.corpus import wordnet as wn
>>> from pprint import pprint
>>> livers = wn.synsets('liver')
>>> pprint([l.definition() for l in livers])
[u'large and complicated reddish-brown glandular organ located in the upper right portion of the abdominal cavity; secretes bile and functions in metabolism of protein and carbohydrate and fat; synthesizes substances involved in the clotting of the blood; synthesizes vitamin A; detoxifies poisonous substances and breaks down worn-out erythrocytes',
u'liver of an animal used as meat',
u'a person who has a special life style',
u'someone who lives in a place',
u'having a reddish-brown color']
Only one of these is really of interest to you. As a null hypothesis, there's an 80% chance WordNet will add noise, not knowledge.
The naive approach - what precision and recall is it getting you? If you set up a test case now, then you can track your progress as you apply more sophisticated methods.
I don't know what initial set you are dealing with - but one thing to try is to get your hands on annotated documents (maybe use Mechanical Turk). The documents need to be tagged with the domains you're looking for - anatomical or disease.
Then counting and dividing will tell you how likely a word you encounter is to belong to a domain. With that, the next step would be to tweak some weights.
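A minimal sketch of the count-and-divide idea, assuming you already have tokenized documents tagged with a domain (the example data is only a placeholder):
from collections import Counter, defaultdict

# tagged_docs: (domain, list_of_words) pairs from your annotated documents
tagged_docs = [
    ('anatomy', ['the', 'liver', 'filters', 'blood']),
    ('disease', ['colon', 'carcinoma', 'is', 'a', 'cancer']),
]

counts = defaultdict(Counter)
for domain, words in tagged_docs:
    counts[domain].update(words)

def domain_probability(word, domain):
    # P(domain | word), estimated by simple count-and-divide
    total = sum(counts[d][word] for d in counts)
    return counts[domain][word] / float(total) if total else 0.0

print(domain_probability('carcinoma', 'disease'))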
Another approach (going in a whole other direction) is using WordNet. I don't know if it will be useful for exactly your purposes, but it's a massive ontology - so it might help.
Python has bindings to use Wordnet via nltk.
from nltk.corpus import wordnet as wn
wn.synsets('cancer')
which gives the output: [Synset('cancer.n.01'), Synset('cancer.n.02'), Synset('cancer.n.03'), Synset('cancer.n.04'), Synset('cancer.n.05')]
http://wordnetweb.princeton.edu/perl/webwn
Let us know how it works out.

Web service or mechanism to detect Person, Place or an Object

Is there a web service or a tool to detect whether a given piece of text is the name of a person, a place or an object (device)?
eg:
Input: Bill Clinton Output: Person
Input: Blackberry Output: Device
Input: New york Output: Place
Accuracy can be low. I have looked at OpenCyc but I couldn't get it to work. Is there a way I can use Wikipedia for this?
For a start, separating a person from a thing would be great.
I think Wikipedia would be a very good source. Given the input, you could try to find an entry in Wikipedia and scrape the resulting page (if it exists).
Persons and places should have fairly distinct sets of data - birthdates, locations, etc. - in the article that you could use to tell them apart, and anything else is an object.
It's worth a shot anyway.
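One rough way to try this (a sketch only: the category heuristics are just guesses, and a real implementation would need error handling and better matching):
import requests

def classify_via_wikipedia(name):
    # fetch the categories of the best-matching Wikipedia article via the MediaWiki API
    resp = requests.get('https://en.wikipedia.org/w/api.php', params={
        'action': 'query', 'prop': 'categories', 'titles': name,
        'cllimit': 'max', 'redirects': 1, 'format': 'json'})
    cats = []
    for page in resp.json()['query']['pages'].values():
        cats += [c['title'].lower() for c in page.get('categories', [])]
    # crude heuristics: people tend to have birth/death categories, places geographic ones
    if any('births' in c or 'living people' in c for c in cats):
        return 'Person'
    if any('populated places' in c or 'cities' in c or 'countries' in c for c in cats):
        return 'Place'
    return 'Object'

print(classify_via_wikipedia('Bill Clinton'))  # hopefully 'Person'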
Looking at the output of Wolfram Alpha, it seems that you can possibly identify a person by searching Bill Clinton Birthday or just Bill Clinton, or you can identify a location by searching New York GPS coordinates or just New York, for even better results. Blackberry seems like a tough word for Alpha, because it keeps wanting to interpret it as a fruit. You might have luck searching Froogle to identify a device.
It seems like WA will give you a fairly decent accuracy, at least if you're using famous people/places.
How about using a search engine? Google would be good, and I think Yahoo! has tools for building your own search.
I googled:
Results 1 - 10 of about 27,100,000 for "bill clinton" person
Results 1 - 10 of about 6,050,000 for "bill clinton" place
Results 1 - 10 of about 601,000 for "bill clinton" device
He's a person!
Results 1 - 10 of about 391,000,000 for "new york" place.
Results 1 - 10 of about 280,000,000 for "new york" person.
Results 1 - 10 of about 84,100,000 for "new york" device.
It's a place!
Results 1 - 10 of about 11,000,000 for "blackberry" person
Results 1 - 10 of about 36,600,000 for "blackberry" place
Results 1 - 10 of about 28,000,000 for "blackberry" device
Unfortunately, blackberry is a place as well. :-/
Note that only in the case of 'blackberry' did "device" even get close. Maybe you need to weight the page hit values. What is your application? Do you have any idea which "devices" you'd have to classify? What is the possible range of inputs?
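The hit-count heuristic could be wrapped up roughly like this; get_hit_count() is a hypothetical helper you would have to back with whatever search API you have access to (the result counts quoted above came from plain Google searches):
def get_hit_count(query):
    # hypothetical: return the approximate number of results for the query
    raise NotImplementedError('implement against a search API of your choice')

def classify(name, classes=('person', 'place', 'device'), weights=None):
    # optionally weight the raw hit counts per class, as suggested above
    weights = weights or dict((c, 1.0) for c in classes)
    scores = dict((c, weights[c] * get_hit_count('"%s" %s' % (name, c))) for c in classes)
    return max(scores, key=scores.get)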
Maybe you want to combine the results you get from different sources.
I think the basic task you're trying to accomplish is more formally known as named entity recognition. This task is nontrivial, and by only inputting the name stripped of any context, you're making it even harder.
For example, we'd like to think examples such as "Bill Clinton" and "New York" are obviously unambiguous, but looking at their disambiguation pages in Wikipedia shows that there are several potential entities they may refer to. "New York" is a state, a city, and a movie title. "Bill Clinton" is a bit less ambiguous if you're only looking at Wikipedia, but I'm sure you'll find dozens of Bill Clintons in any phonebook. It might also be the name of someone's sailboat or pet dog. What if someone inputs "Washington"? That could be a U.S. President, a state, a district, a city, a lake, a street, an island, a movie, one of several U.S. Navy ships, a bridge, as well as other things. Determining which is the "correct" usage you'd want the web service to return could become very complicated.
As much as Cyc knows, I think you'll find it's still not as comprehensive as Wikipedia. However, the main downside to Wikipedia is that it's essentially unstructured. Personally, I find Cyc's API so convoluted and poorly documented, that parsing Wikipedia's natural language almost seems easier.
If I had to implement such a web service from scratch, I'd start by downloading a snapshot of Wikipedia, and then writing a parser that would read through all the articles and generate a named-entity index based on article titles. You could manually "classify" a few dozen examples as person/place/object, and train a classifier (Bayesian, Maxent, SVM) to automatically classify other examples based on the word frequencies of their articles.
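A rough sketch of that last classification step, assuming the article texts have already been extracted and a few dozen of them hand-labelled (the data here is only a placeholder, and an SVM stands in for the Bayesian/Maxent/SVM options mentioned):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# (article_text, label) pairs hand-classified from the Wikipedia snapshot
labelled = [
    ('William Jefferson Clinton is an American politician who served as ...', 'person'),
    ('New York is a state in the northeastern United States ...', 'place'),
    ('The BlackBerry is a line of smartphones and related services ...', 'object'),
]

texts, labels = zip(*labelled)
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(texts, labels)

# classify an unseen entity by the text of its Wikipedia article
print(clf.predict(['Washington is a state in the Pacific Northwest region ...']))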