How to measure semantic relationship between two webpages - data-mining

Let's assume I am visiting a university website. There are many teacher profile pages there. Though these pages are not syntactically related, they are semantically related. How can I measure this type of relationship? On which parameters should I focus to find the relation?

This SO post covers how to compute semantic similarity between phrases.
In your case you just need to represent the different pages as documents and follow the same approach.
Your algorithm can also exploit additional information, such as the links between pages or publications (in the case of researchers). I hope the link helps a bit...

Here is a simple but very effective algorithm:
The page for each teacher, and the pages it links to, surely contain text that semantically characterizes that professor.
Suppose you build a collection of words by concatenating the text on the professor's page and on the linked pages (you can keep concatenating text by following links up to an arbitrary depth).
Now you can cluster professors on the basis of the extracted information using a vector space model:
each professor is represented by a vector whose components correspond to the words contained in the extracted pages, with values given by their term frequencies.
The cosine-similarity will do the rest of the job.
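If it helps, here is a minimal sketch of that pipeline (the professor names and page texts are made up for illustration; it uses scikit-learn's TfidfVectorizer and cosine_similarity, but plain term-frequency vectors would work the same way):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical input: one concatenated text blob per professor, built from the
# profile page plus the pages it links to (up to your chosen crawl depth).
professor_texts = {
    "prof_a": "machine learning data mining clustering publications ...",
    "prof_b": "databases query optimization indexing publications ...",
    "prof_c": "supervised learning neural networks clustering ...",
}

names = list(professor_texts)
# Vector space model: one weighted term vector per professor.
vectors = TfidfVectorizer().fit_transform(professor_texts[n] for n in names)

# Pairwise cosine similarity; feed this matrix to any clustering algorithm.
similarity = cosine_similarity(vectors)
for i, a in enumerate(names):
    for j, b in enumerate(names):
        if i < j:
            print(a, b, round(similarity[i, j], 3))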

Related

Google Natural Language Sentiment Analysis incorrect result

We have Google Natural Language AI integrated into our product for Sentiment Analysis (https://cloud.google.com/natural-language). One of our customers complained that when they write "BAD", it shows a positive sentiment.
On further investigation, we found that when the Google Natural Language Sentiment Analysis API is called with the input "BAD" or "Bad" (note that it is in all caps or with the first letter capitalized), it identifies the text as an entity (a location or consumer good) and sends back a Positive result, while when we write "bad" in all lowercase, it sends back Negative.
Has anyone faced a similar problem? How did you solve it?
One obvious fix looks like converting the text to lowercase, but that may break other use cases (perhaps cases where entities do not get analyzed because the text is lowercase). Another approach we are building is to use our own dictionary of words with sentiments before calling the Google APIs, but that doesn't solve the underlying problem, which may occur with any other text.
Inputs will help us. Thank you!
The NLP API uses an underlying model that is neural in nature. The knowledge comes from training on real world text. It is normal to get different results for different capitalizations as they can relate to different uses of the same trigram, e.g. Mike (person), mike (microphone, slang), MIKE (military alphabet entry).
The second key aspect is that the model is tuned for, and meant to be used on, larger pieces of text rather than single words, so good results cannot be expected in this case.
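For what it's worth, here is a minimal way to observe the difference yourself (a sketch assuming the google-cloud-language v1 Python client and valid credentials; exact method names can vary between client versions):

from google.cloud import language_v1

client = language_v1.LanguageServiceClient()

for text in ("BAD", "Bad", "bad"):
    document = language_v1.Document(
        content=text, type_=language_v1.Document.Type.PLAIN_TEXT
    )
    sentiment = client.analyze_sentiment(
        request={"document": document}
    ).document_sentiment
    # Single words are well below the text length the model is tuned for,
    # so these scores should be taken with a grain of salt.
    print(text, sentiment.score, sentiment.magnitude)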

How to get most similar words to a document in gensim doc2vec?

I have built a gensim Doc2vec model. Let's call it doc2vec. Now I want to find the most relevant words to a given document according to my doc2vec model.
For example, I have a document about "java" with the tag "doc_about_java". When I ask for similar documents, I get documents about other programming languages and topics related to java. So my document model works well.
Now I want to find the most relevant words to "doc_about_java".
I followed the solution from the closed question "How to find most similar terms/words of a document in doc2vec?", but it gives me seemingly random words; the word "java" is not even among the first 100 most similar words:
docvec = doc2vec.docvecs['doc_about_java']
print(doc2vec.most_similar(positive=[docvec], topn=100))
I also tried it like this:
print(doc2vec.wv.similar_by_vector(doc2vec["doc_about_java"]))
but it didn't change anything. How can I find the most similar words to a given document?
Not all Doc2Vec modes even train word-vectors. In particular, the PV-DBOW mode (dm=0), which often works very well for doc-vector comparisons, leaves word-vectors at randomly-assigned (and unused) positions.
So that may explain why the results of your initial attempt to get a list-of-related-words seem random.
To get word-vectors, you'd need to use PV-DM mode (dm=1), or add optional concurrent word-vector training to PV-DBOW (dm=0, dbow_words=1).
(If this isn't the issue, there may be other problems in your training setup, so you should show more detail about your data source, size, and code.)
(Separately, your alternate code line, by using doc2vec["doc_about_java"], retrieves the word-vector for "doc_about_java" (which may not be present at all). To get the doc-vector, use doc2vec.docvecs["doc_about_java"], as in your first code block.)
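If word-vectors are what you need, a minimal training sketch for the PV-DBOW + word-vectors setup might look like this (the toy corpus and parameters are made up; recent gensim versions expose doc-vectors as model.dv, older ones as model.docvecs):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Hypothetical toy corpus; in your case the java document would carry the tag "doc_about_java".
corpus = [
    TaggedDocument(words=["java", "virtual", "machine", "bytecode"], tags=["doc_about_java"]),
    TaggedDocument(words=["python", "interpreter", "scripting"], tags=["doc_about_python"]),
]

# dm=0 selects PV-DBOW; dbow_words=1 adds concurrent word-vector training.
model = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=40, dm=0, dbow_words=1)

docvec = model.dv["doc_about_java"]   # model.docvecs["doc_about_java"] in older gensim
print(model.wv.most_similar(positive=[docvec], topn=10))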

Compare two strings and find how closely they are related by meaning

Problem:
I have two strings, say, "Billie Jean" and "Thriller". I need to programmatically compare them and find how closely they are related. Those are both songs of the same artist, hence, they should give a higher score (probability, percentage etc) than say, "Brad Pitt" and "Jamaican Farewell".
One way of doing this is with an open-source Java tool named WikipediaMiner, which compares them using the Wikipedia data dump, checking links, descriptions, etc.
Question:
Please suggest a better alternative that uses any or all of Wikipedia, DBpedia, Freebase and their cousins, or combines a different approach. I would really prefer open-source software that can be downloaded and set up on a server (e.g. Apache Mahout), rather than a paid web service.
It's not so much a matter of programming, but of data.
So it's not really a question for StackOverflow.
What you really want, I guess, is to use WordNet. That is really meant as a database for reasoning about the meaning of words. For example, the data explicitly states that data mining is a form of data processing, which in turn has its own hypernyms, all the way up to a top-level entity.
You see, the reasoning will only be as good as your data.
DBpedia may also include a mapping from WordNet to Wikipedia.
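As a concrete illustration of the WordNet route, here is a minimal sketch using NLTK's WordNet interface (the word pairs are arbitrary; note that titles like "Thriller" or "Billie Jean" are not WordNet entries, so this only works for common vocabulary):

from nltk.corpus import wordnet as wn  # requires a one-time nltk.download('wordnet')

def relatedness(word_a, word_b):
    # Best Wu-Palmer similarity over all synset pairs, or 0.0 if there is none.
    scores = [
        s1.wup_similarity(s2) or 0.0
        for s1 in wn.synsets(word_a)
        for s2 in wn.synsets(word_b)
    ]
    return max(scores, default=0.0)

print(relatedness("song", "album"))   # related musical concepts -> higher score
print(relatedness("song", "actor"))   # unrelated concepts -> lower score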
You can't tell that "Thriller" is a song, rather than a music video, a film genre, or a Lambchop album, without additional context.
After you've identified what your items are, it's "simply" a matter of traversing the graph of connections in Freebase, MusicBrainz, or whatever other information sources you are using.
You'll need to decide how you're going to weight things for scoring, though. Are two Michael Jackson songs more closely related because they share the same type, or are they more closely related to the artist Michael Jackson because they're directly connected to him?
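To make that weighting question concrete, here is a toy sketch (a hand-built graph with made-up nodes, using networkx; a real system would populate the graph from Freebase, MusicBrainz, or whatever sources you choose) that scores relatedness by counting shared neighbours, with a shared artist weighted higher than a shared generic type:

import networkx as nx

# Toy knowledge graph with made-up nodes; an edge means "is connected to".
G = nx.Graph()
G.add_edges_from([
    ("Billie Jean", "Michael Jackson"), ("Thriller", "Michael Jackson"),
    ("Billie Jean", "song"), ("Thriller", "song"),
    ("Brad Pitt", "actor"), ("Jamaican Farewell", "song"),
])

def relatedness(a, b, artist_weight=2.0, type_weight=1.0):
    # Hypothetical scoring: shared specific neighbours (e.g. an artist) count
    # more than shared generic types (e.g. both items being songs).
    score = 0.0
    for n in nx.common_neighbors(G, a, b):
        score += type_weight if n in ("song", "actor") else artist_weight
    return score

print(relatedness("Billie Jean", "Thriller"))         # shared artist + type -> 3.0
print(relatedness("Brad Pitt", "Jamaican Farewell"))  # nothing shared -> 0.0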

Latex, having two lists of something on the one page

I am writing my thesis right now and I have the following problem: I have a couple of lists, a list of figures, a list of algorithms, a list of listings, etc. Most of them are very short, but each of them takes up a whole page. So I have several lists that each show only a few entries on a page, and the rest of the page is blank.
How can I put two lists on one page?
I have found the answer here:
http://www.cse.iitd.ernet.in/~anup/homepage/UNIX/latex.html#lotandlof
So I wrote something like this:
\begin{minipage}[b]{1\linewidth}
\listofalgorithms
\end{minipage}
\begin{minipage}[b]{1\linewidth}
\listoffigures
\end{minipage}
Now, if there is enough space, the two lists can be put on one page.
Just write the lists one after the other. If enough space is available, they will be put on the same page: lists don’t include page breaks.
You say this is for a thesis? Most thesis styles require each list to be on a separate page. I would first double-check your thesis style guide. Further, many universities provide thesis class files (some_thesis_name.cls) which automatically follow the style guide. Check it out; you may save a lot of time and worry.

Is there a way to build an easy related posts app in django

It has been my nightmare for the last 4 weeks:
I can't come up with a solution for a "related posts" app in Django/Python that takes the user's input and returns a related post that closely matches the original input. I've tried using LIKE statements, but it seems they are not sensitive enough.
I also need typos to be taken into consideration.
Is there a library that could save me from all my pain and suffering?
Well, I suppose there are a few different ways to normalize the user input to produce desirable results (although I'm not sure to what extent libraries exist for them). One of the easiest ways to get related posts would be to compare the tags present on that post (granted your posts have tags). If you wanted to go another route, I would take the following steps: remove stop words from the subject, use some kind of stemmer on the remainder, and finally treat the remaining words as "tags" to compare with other posts. For the sake of efficiency, it would probably be a good idea to run these steps as a batch process on all of your current posts and store off the resulting "tags". As far as typos go, I'm sure a multitude of spelling-corrector libraries exist (I found this one after a few seconds with Google).
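A rough sketch of those steps (stop-word removal, stemming, then tag-overlap scoring) using NLTK; the post data, field layout, and scoring are all hypothetical and deliberately naive:

from nltk.corpus import stopwords   # requires a one-time nltk.download('stopwords')
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def extract_tags(text):
    # Lowercase, drop stop words, stem the rest -> a set of pseudo-tags.
    words = (w.strip(".,!?") for w in text.lower().split())
    return {stemmer.stem(w) for w in words if w and w not in stop_words}

def related_posts(query, posts, limit=5):
    # posts: iterable of (title, body) pairs; ranked by tag overlap with the query.
    query_tags = extract_tags(query)
    scored = [
        (len(query_tags & extract_tags(title + " " + body)), title)
        for title, body in posts
    ]
    return [title for score, title in sorted(scored, reverse=True)[:limit] if score > 0]

# Hypothetical usage with made-up posts:
posts = [("Fixing Django migrations", "How to repair broken migrations ..."),
         ("Deploying Django", "Notes on deployment and servers ...")]
print(related_posts("my django migration is broken", posts))

In practice you would precompute and store the tag set for each post in a batch process (as suggested above) rather than re-extracting it on every request.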