I am interested in doing a project on document classification. I have been looking for books that cover the relevant text mining theory, or for articles that describe the process of going from training data (documents already classified, with subcategories) to a system that predicts the class of a new document. There seem to be some (rather expensive!) titles available, but these are conference proceedings with articles on small, very specific topics. Can someone suggest books from the data mining literature that provide a good theoretical basis for a project on text mining, specifically document classification, or articles that give an overview of this process?
Christopher D. Manning, Prabhakar Raghavan & Hinrich Schütze have a free information retrieval book, Introduction to Information Retrieval. Try chapter 13, "Text classification and Naive Bayes".
See also the companion site for Manning and Schütze's NLP book, specifically the links for the text categorization chapter.
Fabrizio Sebastiani wrote a useful tutorial on text categorization (PDF) and a review paper on machine learning for text categorization (PDF).
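To give a feel for the "training data in, predicted class out" process that the Naive Bayes chapter describes, here is a minimal sketch of a multinomial Naive Bayes classifier with add-one smoothing. It is only an illustration under assumptions: whitespace tokenization, no stop-word handling, flat classes (no subcategories), and made-up function names (`train_nb`, `predict_nb`). It is not code from any of the references above.

```python
import math
from collections import Counter, defaultdict


def train_nb(labeled_docs):
    """Collect the counts needed for multinomial Naive Bayes.

    labeled_docs is a list of (text, label) pairs (hypothetical input format).
    """
    class_doc_counts = Counter()        # number of training documents per class
    term_counts = defaultdict(Counter)  # per-class token frequencies
    vocab = set()
    for text, label in labeled_docs:
        tokens = text.lower().split()   # assumption: naive whitespace tokenizer
        class_doc_counts[label] += 1
        term_counts[label].update(tokens)
        vocab.update(tokens)
    return class_doc_counts, term_counts, vocab


def predict_nb(model, text):
    """Return the class with the highest log prior plus log likelihood."""
    class_doc_counts, term_counts, vocab = model
    total_docs = sum(class_doc_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_doc_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        denom = sum(term_counts[label].values()) + len(vocab)
        for tok in text.lower().split():
            if tok in vocab:  # tokens never seen in training are skipped
                # add-one (Laplace) smoothing avoids log(0) for unseen pairs
                score += math.log((term_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Usage would look like `model = train_nb(pairs)` followed by `predict_nb(model, "some new document")`. Subcategories could be handled by training one such model per level of the hierarchy, but that is a design choice this sketch does not cover.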
I'm working on a news app. On the home page, the user sees a list of headlines and then he can click one to read the article and comment.
I would like to offer an option for "recommended articles" based on his history. For example, if he reads an article, I'll feed the algorithm the headline keywords so it learns what this user likes to read.
My problem with what I've read about Bayesian filters is that you need to train them with both good input and bad input (such as good emails and spam emails). The difference in my case is that there are no bad examples. If the user didn't read an article, it doesn't mean it's a bad example (he might still read it in the future). Only when he has read one do I know something: he is more likely to read similar articles in the future.
Basically, I'm looking for an algorithm to help me recommend articles to a specific user - based on what he read in the past. It will run on a mobile device, so any implementation (C/C++/Obj-C) will work.
Thanks.
You can treat this as a binary classification problem. It is either an article he likes to read or an article he possibly doesn't like to read.
You can use the dlib C++ library for the binary classifier algorithm.
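If you want to avoid negative examples altogether, one alternative to a binary classifier is plain content-based ranking: build a keyword profile from the headlines the user has read and rank unread headlines by cosine similarity to it. The sketch below is a different technique from the dlib suggestion above, and it is written in Python only to keep it short (the same logic ports directly to C++ or Obj-C). The helper names and the "words longer than three letters" keyword rule are assumptions for illustration.

```python
import math
from collections import Counter


def keywords(headline):
    """Crude keyword extraction: lowercase words longer than 3 characters (assumption)."""
    return Counter(w for w in headline.lower().split() if len(w) > 3)


def cosine(a, b):
    """Cosine similarity between two sparse keyword-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def recommend(read_headlines, candidates, top_n=3):
    """Rank unread candidate headlines by similarity to what the user has read."""
    profile = Counter()
    for h in read_headlines:
        profile.update(keywords(h))  # only positive examples are needed
    ranked = sorted(candidates, key=lambda h: cosine(profile, keywords(h)), reverse=True)
    return ranked[:top_n]
```

Since unread articles are never treated as "bad", an article the user skipped today can still be recommended later as his profile grows.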
Hi, I'm a newbie to data mining. My task is to automatically classify text documents using the n-gram method.
I could not find proper resources on this topic. Kindly help me with how to proceed, and where I can find tutorials on n-gram classification.
I need Java source code on this topic for my understanding.
Thanks in advance.
I highly recommend Stanford's online NLP course by Dan Jurafsky & Chris Manning. Chapter 4 addresses n-grams, but all the chapters before it give a great background.
Stanford also has some great open source software you can use for text classification, from tokenizing to part-of-speech tagging.
I found a better tutorial, with documentation, at:
http://textcat.sourceforge.net/README.txt
http://textcat.sourceforge.net/doc/index.html
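For orientation, the n-gram approach commonly associated with TextCat is the Cavnar and Trenkle rank-order method: build a ranked profile of the most frequent character n-grams per category, then assign a document to the category whose profile is closest under an "out-of-place" distance. Below is a rough sketch of that idea. It is in Python rather than Java purely for brevity, the function names and the `max_n`/`top_k` defaults are my own assumptions, and it is not TextCat's actual code.

```python
from collections import Counter


def ngram_profile(text, max_n=3, top_k=300):
    """Ranked list of the most frequent character n-grams (n = 1..max_n)."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # mark word boundaries with underscores
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    return [gram for gram, _ in counts.most_common(top_k)]


def out_of_place(doc_profile, cat_profile):
    """Sum of rank differences; n-grams missing from the category get a max penalty."""
    rank = {gram: r for r, gram in enumerate(cat_profile)}
    max_penalty = len(cat_profile)
    return sum(
        abs(r - rank[gram]) if gram in rank else max_penalty
        for r, gram in enumerate(doc_profile)
    )


def classify(text, category_profiles):
    """Pick the category whose profile has the smallest out-of-place distance."""
    doc = ngram_profile(text)
    return min(category_profiles, key=lambda c: out_of_place(doc, category_profiles[c]))
```

You would build `category_profiles` as a dict mapping each category name to `ngram_profile(training_text_for_that_category)`. With very little training text the rankings are noisy, so treat results on toy inputs with caution.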
Are there any open source concept mining tools available today? So far I have only come across tools like Leximancer, which seems to fit the role but is not open source and is quite expensive for an undergraduate student. My searches have been unsuccessful because the word 'concept' on both Google and Google Scholar doesn't seem to match what I want.
It seems to me you need a text mining tool for clustering. RapidMiner has an open-source, Java-based Community Edition with several extensions (Text Mining, R, etc.). You can also develop and integrate your own algorithms.
Rexer Analytics also runs a comprehensive data mining survey every year; you can request the reports for free.
I come from a multimedia background as opposed to a pure-CS background so I would find a heavy CS-paper about subjects like algorithms hard to review.
I'm interested mainly in web technologies, particularly areas like web standards, push technologies (Comet, webhooks, etc.), social graphing, and online data portability. Other topic suggestions are welcome too.
The problem is that any papers that I can find on these topics couldn't really be called seminal because they are quite recent and consequently haven't been cited by very many other papers.
I'd love to hear suggestions about research topics or recommended papers in a chosen topic.
Probably too late, but you never know...
Trust metrics are a fairly abstract and algorithmic topic, but one with important applications to current trends in web services, particularly recommender networks. I wrote a brief annotated bibliography of research on trust metrics that is relevant to Raph Levien's idea of attack-resistant trust metrics.
The original Google papers, "The PageRank Citation Ranking: Bringing Order to the Web" ( http://dbpubs.stanford.edu:8090/pub/showDoc.Fulltext?lang=en&doc=1999-66&format=pdf&compression= ) and "The Anatomy of a Large-Scale Hypertextual Web Search Engine", are excellent for discussion. Levien has explained how PageRank can be seen as an attack-resistant trust metric.
I think the seminal paper for the hypertext concept is As We May Think, by Vannevar Bush.
Another highly influential...system? paper? is the Xanadu approach.
Of course, the Web's seminal paper is Tim Berners-Lee's initial proposal.
I think there's a vast...unacademicalness (made that word up) here. Who wrote the academic paper for Facebook or other social media? What about Mosaic? Livejournal is rumored to have been written in bits and pieces for student projects initially. Maybe there are papers that got written for them - I don't know. Doesn't seem like it tho'.
None of your areas of interest is really "foundational", which is probably why you have received so few responses. But here are a couple of ideas of topics on which you should be able to find seminal papers with some help from a reference librarian, and they have some connection to networks and computing:
Dijkstra's seminal algorithm for finding shortest paths.
Minimum spanning tree—I'd have a look at Bernard Chazelle's paper on doing MST in inverse-Ackermann time; he'll have some older references.
One suggestion that may be further from your interests, but which I've actually read and which is a stunningly good paper:
Don Knuth's original paper on LR parsing.
I've noticed an increasing number of jobs that are asking for experience with data mining and business intelligence technologies. This sounds like an incredibly broad topic but where would one go if they wanted to develop at least a partial understanding of this stuff if it were to come up in an interview?
A very good book with practical examples is Programming Collective Intelligence: Building Smart Web 2.0 Applications by Toby Segaran.
Go read Data Mining: Practical Machine Learning Tools and Techniques (Second Edition).
Then use Weka on a pet project. Despite the name, this is a good book, and the Weka package has several levels of entry into the data-m... er machine-learning world.
Consider reading Ralph Kimball's books for an introduction to Business Intelligence.
Also, try not to stick to one technology vendor. Every company has its own biased vision of BI, so you'll need a 360-degree overview.
Maybe you can also try to work with a real BI system. It is almost impossible to get hands-on access to a running, data-filled SAS, MS, Oracle, etc. installation. I work in a team which integrates the BellaDati BI tool for enterprises. For try-out and personal purposes it is free, with some datastore limitations ( http://www.trgiman.eu/en/belladati/product/personal ).
BellaDati is also used as a learning tool at technical universities that focus on the practical application of data mining and analysis. Examples of finished manager-level BellaDati dashboards can be seen at http://mercato.belladati.com/bi/mercato/show/worldexchanges
There you can work and play with SQL data sources, flat files, and web services. From my own experience, being able to show real samples of market analysis practice (such as a case study) is good for an interview.
I wish you luck,
Peter