How does Yelp create the "Review Highlights" section? - web-services

Take the following link as an example: http://www.yelp.com/biz/chef-yu-new-york.
In the section called 'Review Highlights', there are 3 phrases (spicy diced chicken, happy hour, lunch specials) that are highlighted based on reviews submitted by users. Obviously, these are the phrases that appeared most often, or the longest phrases that appeared often, or phrases chosen by some other logic.
Their official explanation is this:
In their reviews, Yelpers mentioned the linked phrases below a lot.
And these aren't any old common phrases, they're also the ones that
our Yelp Robots have determined are unique and good, quick ways to
describe this business. Click any of the phrases to see all the
reviews that mention it.
My question is, what did they use to mine the text input to get these data points? Is it some algorithm based on Lempel-Ziv, or some kind of map-reduce? I was not a CS major, so I am probably missing something foundational here. Would love some help, theories, etc.
Thanks!

I don't have any insight into the exact algorithm Yelp is using, but this is a common problem in natural language processing. Essentially you want to extract the most relevant collocations (http://en.wikipedia.org/wiki/Collocation).
A simple way to do this is to extract a list of n-grams with the highest PMI (pointwise mutual information). This SO question explains how to do this using Python and the nltk library:
How to extract common / significant phrases from a series of text entries
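For a rough illustration of that PMI approach (not Yelp's actual code), here is a minimal sketch using nltk's collocation finder; the review text and the frequency threshold are invented for the example:

```python
import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Hypothetical corpus: all review texts for one business, concatenated.
reviews_text = (
    "the spicy diced chicken was amazing and the lunch specials are cheap "
    "we came back for happy hour and ordered the spicy diced chicken again"
)

tokens = nltk.word_tokenize(reviews_text.lower())  # requires the 'punkt' tokenizer data
finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(2)  # ignore bigrams that appear only once

# The ten bigrams with the highest pointwise mutual information.
bigram_measures = BigramAssocMeasures()
print(finder.nbest(bigram_measures.pmi, 10))
```

The same pattern works with TrigramCollocationFinder if you want longer phrases.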

Lempel-Ziv is a data compression algorithm, and map-reduce is a technique for data processing. The former is probably not involved, and the latter is generally useful but not relevant here.
Without knowing the details of Yelp's code, it's impossible to say for sure, but it seems likely that their "review highlights" are simply based on tabulating all phrases that appear in reviews for this business, then displaying ones which are more common in reviews for this business than for other businesses. Some amount of natural language processing is likely to be involved to ensure that it picks noun phrases.
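As a toy sketch of that tabulate-and-compare idea (purely illustrative, not Yelp's implementation), one could count candidate phrases in a business's reviews and keep the ones that are disproportionately frequent compared with a background corpus of all reviews. The naive bigram tokenization and the thresholds below are arbitrary assumptions:

```python
from collections import Counter

def bigrams(text):
    # Naive tokenization into lowercase word pairs; a real system would extract noun phrases.
    words = text.lower().split()
    return [" ".join(pair) for pair in zip(words, words[1:])]

def highlights(business_reviews, all_reviews, top_n=3):
    local = Counter(p for review in business_reviews for p in bigrams(review))
    background = Counter(p for review in all_reviews for p in bigrams(review))
    total_local = sum(local.values()) or 1
    total_background = sum(background.values()) or 1

    def lift(phrase):
        # How much more common the phrase is here than in reviews overall (+1 smoothing).
        local_rate = local[phrase] / total_local
        background_rate = (background[phrase] + 1) / total_background
        return local_rate / background_rate

    candidates = [p for p, count in local.items() if count >= 2]  # mentioned "a lot"
    return sorted(candidates, key=lift, reverse=True)[:top_n]
```

Dividing by the background frequency is what keeps generic phrases like "the food" from outranking phrases that are distinctive to this particular business.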

Related

How to refine text data?

I built many spiders to get news articles from different websites, and I have an API to convert the text to audio clips, but I need a framework or Python tools to refine the articles' text, such as:
- removing anything related to the source;
- removing any date formats;
- removing URLs;
- changing acronyms, for example CEO to chief executive officer;
- removing special characters and typos;
- making sure that each sentence is still written correctly after all the edits;
- using the previously edited articles as a reference for new articles.
I am using Python, nltk and re, but it's exhausting: each time I think I have covered all the cases, I find new ones to add, and I feel like I am stuck in an infinite loop.
Any suggestions?
First of all, expanding acronyms to their full form is non-trivial and should probably not be considered part of scraping but rather part of a second step of processing (cf. IBM's The Art of Tokenization).
Cleaning scraped data is unfortunately tedious: there is no magical solution, because everyone is interested in scraping something different from what you are (some might be interested only in URLs, for example). Nevertheless, have you tried BeautifulSoup? It's a Python library that offers a very nice API for handling many common scraping-related tasks.
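For instance, a minimal BeautifulSoup sketch for stripping markup, scripts and source links out of a scraped article (the HTML string is just a stand-in for whatever your spiders fetch):

```python
from bs4 import BeautifulSoup

html = (
    "<article><h1>Headline</h1>"
    "<p>Body text. <a href='http://example.com'>source</a></p>"
    "<script>junk()</script></article>"
)

soup = BeautifulSoup(html, "html.parser")
for tag in soup(["script", "style", "a"]):  # drop scripts, styles and source links
    tag.decompose()

text = soup.get_text(separator=" ", strip=True)
print(text)  # "Headline Body text."
```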

How to decide if a webpage is about a specific topic or not? [closed]

I am trying to write code that can take the source HTML of a web page and decide what kind of page it is. I am interested in deciding whether the page is about academic courses or not. A naive first approach is to check whether the text contains related words (course, instructor, teach, ...) and decide that it is about an academic course if there are enough hits.
However, I need some ideas on how to achieve this more effectively.
Any ideas would be appreciated.
Thanks in advance :)
Sorry for my English.
There are many approaches to classifying a text, but first the web page should be converted to plain text, either in a dumb way (removing all the HTML tags and reading what's left) or in smarter ways that identify the main parts of the page containing the useful text. In the latter case you can use some HTML5 elements like <article>; read about the HTML5 structural elements here.
Then you can try any of the following methods, depending on really how far you are willing to go with your implementation:
As you mentioned, a simple search for related words, but that would give you a very low success rate.
Improve the solution above by passing the tokens of the text to a lexical analyzer and focusing on the nouns; nouns usually carry the most value (I will try to find the source for this, but I'm sure I read it somewhere while implementing a similar project). This might improve the rate a little.
Improve further by looking at the origin of each word; you can use a morphological analyzer to do so, and that way you can tell that the word "papers" is the same as "paper". That can improve things a little more.
You can also use an ontology of words like WordNet, and then check whether the words in the document are descendants of one of the words you're looking for, or the other way around (though going up means generalizing, which hurts precision). E.g. you can tell that the word "kitten" is related to the word "cat", and so you can assume that since the document talks about "kittens" it also talks about "cats".
All of the above depends on you defining a fixed list of keywords to base your decision on, but life usually doesn't work that way, which is why we use machine learning. The basic idea is that you collect a set of documents and manually tag/categorize/classify them, feed those documents to your program as a training set, and let it learn from them; afterwards your program will be able to apply what it learned to tag other, untagged documents (see the sketch after this answer). If you decide to go with this option, you can check this SO question and this Quora question; the possibilities are endless.
And assuming you speak Arabic I would share a paper of the project I worked on here if you're interested, but it is in Arabic and deals with the challenges of classifying Arabic text.
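To make the machine-learning option concrete, here is a minimal sketch using nltk's Naive Bayes classifier with bag-of-words features; the tiny hand-labelled training set is invented purely for illustration and would need to be far larger (and built from cleaned page text) in practice:

```python
import nltk

def features(text):
    # Bag-of-words features: which words occur in the page text.
    return {word: True for word in nltk.word_tokenize(text.lower())}

# Hypothetical, hand-labelled examples (in reality: many cleaned web pages).
train = [
    ("This course is taught by the instructor every fall semester", "academic"),
    ("Syllabus, lectures and homework for the introductory course", "academic"),
    ("Buy the best running shoes at discount prices today", "other"),
    ("Latest celebrity gossip and entertainment news", "other"),
]

train_set = [(features(text), label) for text, label in train]
classifier = nltk.NaiveBayesClassifier.train(train_set)

print(classifier.classify(features("Course schedule and instructor office hours")))
```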
I know nothing about web programming, as a C language programmer, but I would make sure it checks for different domain name suffixes: .edu is the one most universities use, .gov is for government pages, and so on; then there is no need to scan the page at all. But surely the way to achieve the highest accuracy is to use these methods and also create a way for users to correct the app; this info can be hosted on a web server, and a page can be cross-referenced against that database. It's always great to use your customers as an improvement tool!
Another way would be to see if you can cross-reference it with search engines that categorise pages in their index. For example, Google collates academic abstracts in Google Scholar. You could check whether the web page is present in that database?
Hope this helped! If I have any other ideas you will be the first to know!
Run the text through a sequence-finding algorithm.
Basics of the algorithm: take some number of web pages that are definitely related to academic courses, clean them, and search them for frequently occurring word sequences (2-5 words). Then, by hand, remove the common word sequences that are not directly related to academic courses. By examining how many of those sequences occur in some web page, you can determine with some precision whether its contents are closely related to the source of the test word sequences.
Note: tested web pages must be properly cleaned up. Strip the page contents of anything unrelated: delete link and script tags along with their contents, remove the tags themselves (but keep the text in images' alt/title attributes), and so on. The context to examine should be the title, the meta keywords and description, plus the cleaned page contents. The next step is to stem the text.
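A rough sketch of that scheme (the sequence length and thresholds are invented, and the manual pruning step is only noted in a comment):

```python
from collections import Counter

def ngrams(text, n=3):
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def build_reference(academic_pages, min_pages=2, n=3):
    # Keep word sequences that occur in at least `min_pages` known academic-course pages;
    # the manual step of pruning unrelated sequences would happen after this.
    counts = Counter(seq for page in academic_pages for seq in ngrams(page, n))
    return {seq for seq, count in counts.items() if count >= min_pages}

def score(page_text, reference, n=3):
    # Fraction of reference sequences that also appear in the candidate page.
    if not reference:
        return 0.0
    return len(ngrams(page_text, n) & reference) / len(reference)
```

A page whose score exceeds some tuned threshold would then be classified as course-related.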

Classification using text mining - by values versus keywords

I have a classification problem that is highly correlated to economics by city. I have unstructured data in free text, such as population, median income, employment, etc. Is it possible to use text mining to understand the values in the text and make a classification? Most text mining articles I have read use keyword or phrase counts to make a classification. I would like to be able to classify by the meaning of the text rather than by the frequency of the text. Is this possible?
BTW, I currently use RapidMiner and R. Not sure if this would work with either of these?
Thanks in advance,
John
Yes, this probably is possible.
But no, I cannot give you a simple solution, you will have to collect a lot of experience and experiment yourself. There is no push-button magic solution that works for everybody.
As your question is overly broad, I don't think there will be a better answer than "Yes, this might be possible", sorry.
You could think of these as two separate problems:
(1) Extracting information from unstructured data.
(2) Classification.
There are several approaches to mining specific features from the text. On the other hand, you could also use a bag-of-words approach for classification directly and see the results. Depending on your problem, a classifier could potentially learn from the text features alone.
You could also use PCA or something similar to find the important features and then run a mining process to extract those features.
All of this depends on your problem, which is too broad and vague.
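For the first problem, one illustrative option (independent of RapidMiner or R) is to pull the numeric values out of the free text with regular expressions and feed them to a classifier as ordinary numeric features; the patterns and field names below are assumptions about what the text might look like:

```python
import re

# Hypothetical free-text snippet about a city.
text = "The city has a population of 120,000, a median income of $54,300 and 4.2% unemployment."

def extract_features(text):
    patterns = {
        "population": r"population of ([\d,]+)",
        "median_income": r"median income of \$([\d,]+)",
        "unemployment_pct": r"([\d.]+)% unemployment",
    }
    features = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, text, re.IGNORECASE)
        if match:
            features[name] = float(match.group(1).replace(",", ""))
    return features

print(extract_features(text))
# {'population': 120000.0, 'median_income': 54300.0, 'unemployment_pct': 4.2}
```

Once the values are numeric, any standard classifier (in R, RapidMiner or Python) can work on them directly, which gets closer to classifying "by the values" rather than by word frequency.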

how to design full text search algorithm where keyword quantity is huge (like Google Alerts)?

I am building something very similar to Google Alerts. If you don't know what it is, consider the following scenario,
Thousands of new textual articles, blog posts influx everyday
Each user has a list of favorite "keywords" that he'd like to subscribe to
There are a million users with a million keywords
We scan every article/blog post looking for every keyword
Notify each user if a specific keyword matches.
For one keyword, doing a basic full-text search against thousands of articles is easy, but how do you make a full-text search effective with a million keywords?
Since I don't have a strong CS background, the only idea I came up with is compiling all the keywords into a regex, or an automaton (like Google's re2); will this work well?
I think I am missing something important here, like compiling those keywords into some advanced data structure, since many keywords are alike (e.g. plural forms, simple AND/NOT logic, etc.). Is there any prior theory I need to know before heading into this?
All suggestions are welcome, thanks in advance!
I can think of the following: (1) Make sure each search query is really fast. Millisecond performance is very important. (2) Group multiple queries with the same keywords and do a single query for each group.
Since different queries use different keywords and AND/OR operations, I don't see other ways to group them.
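To illustrate the "compile all keywords into one pattern" idea from the question, here is a minimal Python sketch that scans each article once against a single combined regex and maps matched keywords back to subscribed users. The data layout is hypothetical, and at the scale described a dedicated automaton-based engine (the question mentions Google's re2) would be a better fit than the plain re module:

```python
import re
from collections import defaultdict

# Hypothetical subscriptions: keyword -> users who want alerts for it.
subscriptions = {
    "solar energy": {"alice", "bob"},
    "electric cars": {"bob"},
    "quantum computing": {"carol"},
}

# One combined, case-insensitive alternation over all keywords.
pattern = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in subscriptions) + r")\b",
    re.IGNORECASE,
)

def match_article(article_text):
    # Returns {user: set of keywords that matched in this article}.
    alerts = defaultdict(set)
    for match in pattern.finditer(article_text):
        keyword = match.group(0).lower()
        for user in subscriptions.get(keyword, ()):
            alerts[user].add(keyword)
    return alerts

print(match_article("New breakthroughs in solar energy and quantum computing announced."))
```

The article is scanned once regardless of how many keywords exist, which is the main point; handling plurals and AND/NOT logic would sit on top of this matching step.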

Fuzzy queries to database

I'm curious about how a feature found on many social sites today works.
For example, you enter a list of movies you like and the system suggests other movies you may like (based on the movies liked by other people who like the same movies as you). I think doing it the straight-SQL way (join my list of movies with movies-users, join with user-movies, group by movie title and apply a count) on large datasets would be just about impossible due to the "heaviness" of such a query.
At the same time, we don't need an exact solution; an approximate one would be enough. I wonder whether there is a way to implement something like a fuzzy query on a traditional RDBMS that would be fast to execute but somewhat imprecise. Or how are such features implemented in real systems?
That's collaborative filtering, or recommendation.
Unless you need something really complex, the Slope One predictor is one of the simpler ones; it's about 50 lines of Python. See Bryan O'Sullivan's "Collaborative filtering made easy" and the paper by Daniel Lemire et al. introducing "Slope One Predictors for Online Rating-Based Collaborative Filtering".
It has a way of updating just one user at a time when their ratings change, whereas some other approaches need to reprocess the whole database just to update.
I used that Python code to predict the counts of words not occurring in documents, but I ran into memory issues and such, and I think I might write an out-of-memory version, maybe using SQLite.
Also, the matrix used there is triangular: the two sides along the diagonal mirror each other, so only one half of the matrix needs to be stored.
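For reference, a minimal (weighted) Slope One sketch in Python, in the spirit of the roughly-50-lines version mentioned above; the in-memory dict-of-dicts rating layout is an assumption for the example:

```python
from collections import defaultdict

def slope_one(ratings, target_user, target_item):
    """Predict target_user's rating of target_item with weighted Slope One.

    ratings: {user: {item: rating}} -- hypothetical in-memory layout.
    """
    diffs = defaultdict(float)  # summed rating differences (item_j minus item_i)
    counts = defaultdict(int)   # number of users who rated both items
    for user_ratings in ratings.values():
        for i, r_i in user_ratings.items():
            for j, r_j in user_ratings.items():
                if i != j:
                    diffs[(j, i)] += r_j - r_i
                    counts[(j, i)] += 1

    # Weighted average of (user's rating for i + average deviation of target_item from i).
    num = 0.0
    den = 0
    for i, r_i in ratings[target_user].items():
        c = counts.get((target_item, i), 0)
        if c:
            num += (r_i + diffs[(target_item, i)] / c) * c
            den += c
    return num / den if den else None

# Toy usage: predict how "carol" might rate "item_b".
data = {
    "alice": {"item_a": 5, "item_b": 3},
    "bob":   {"item_a": 4, "item_b": 2, "item_c": 5},
    "carol": {"item_a": 4, "item_c": 5},
}
print(slope_one(data, "carol", "item_b"))  # roughly 2.0
```

For simplicity this stores both orientations of each item pair; as noted above, one triangle is enough, since the deviation of (i, j) is just the negative of (j, i).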
The term you are looking for is "collaborative filtering"
Read Programming Collective Intelligence, by O'Reilly Press
The simplest methods use Bayesian networks. There are libraries that can take care of most of the math for you.