Automatic text classification using n-gram model - data-mining

Hi, I'm a newbie to data mining. My task is to automatically classify text documents using the n-gram method.
I could not find proper resources on this topic. Kindly help me with how to proceed and where I can find tutorials on n-gram classification.
I need Java source code on this topic for my understanding.
Thanks in advance.

I highly recommend Stanford's online NLP course by Dan Jurafsky & Chris Manning. Chapter 4 addresses n-grams, but all the chapters before it give a great background.
Stanford also has some great open source software you can use for text classification, from tokenizing to part of speech tagging.

I found a better tutorial, with documentation, at:
http://textcat.sourceforge.net/README.txt
http://textcat.sourceforge.net/doc/index.html
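
For anyone who wants to see the idea in code: as far as I know, TextCat is based on the rank-order n-gram profile method of Cavnar and Trenkle. Below is a minimal sketch of that idea, written in Python rather than Java for brevity. The function names, the toy training texts, and the defaults (n-grams up to length 3, profiles of 300 entries) are my own assumptions for illustration, not TextCat's actual code or settings.

```python
from collections import Counter


def ngram_profile(text, n_max=3, top_k=300):
    """Return the top_k most frequent character n-grams (n = 1..n_max), most frequent first."""
    counts = Counter()
    for word in text.lower().split():
        padded = "_" + word + "_"  # underscores mark word boundaries
        for n in range(1, n_max + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place(doc_profile, cat_profile, max_penalty=300):
    """Sum of rank differences; n-grams missing from the category cost max_penalty."""
    rank = {gram: r for r, gram in enumerate(cat_profile)}
    return sum(abs(r - rank[g]) if g in rank else max_penalty
               for r, g in enumerate(doc_profile))


def classify(text, category_profiles):
    """Pick the category whose profile is closest to the document's profile."""
    doc = ngram_profile(text)
    return min(category_profiles,
               key=lambda c: out_of_place(doc, category_profiles[c]))


# Toy training data (made up); real profiles need far more text per category.
training = {
    "english": "the quick brown fox jumps over the lazy dog and the cat sat on the mat",
    "german": "der schnelle braune fuchs springt über den faulen hund und die katze sitzt auf der matte",
}
profiles = {label: ngram_profile(text) for label, text in training.items()}

print(classify("the dog and the fox", profiles))
print(classify("der hund und der fuchs", profiles))
```

The same scheme works for topic categories instead of languages: build one profile per category from its training documents and assign a new document to the category with the smallest distance.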

Related

Similar article suggestions based on the article a user is reading

I am looking for the best algorithm to use for article suggestions in my project. We have a set of 1000 articles. I would like to recommend similar articles to a user based on the article they are currently reading. Which algorithm suits this best? I tried content-based recommendation, which involves training a model. In my case it can be simple text-based similarity to the article the user is reading, rather than the user's reading history.
Maybe look at what Karpathy has done with arxiv-sanity:
https://github.com/karpathy/arxiv-sanity-preserver
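
If plain text similarity is enough, one common baseline is TF-IDF vectors compared with cosine similarity. This is a minimal sketch of that baseline, not a description of how the project above is implemented. The function names and the four toy articles are made up for illustration, and for 1000 real articles you would probably want a library implementation with stop-word handling.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase, split on whitespace, keep purely alphabetic tokens."""
    return [w for w in text.lower().split() if w.isalpha()]


def tfidf_vectors(docs):
    """Map each document to a {term: weight} dict using tf * log(N / df)."""
    tokenized = [tokenize(d) for d in docs]
    df = Counter(term for tokens in tokenized for term in set(tokens))
    n = len(docs)
    return [{t: c * math.log(n / df[t]) for t, c in Counter(tokens).items()}
            for tokens in tokenized]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0


def most_similar(index, vectors, k=2):
    """Indices of the k articles most similar to the article at `index`."""
    scores = [(cosine(vectors[index], v), i)
              for i, v in enumerate(vectors) if i != index]
    return [i for _, i in sorted(scores, reverse=True)[:k]]


articles = [
    "neural networks for image recognition",
    "deep neural networks improve speech recognition",
    "stock markets fall as interest rates rise",
    "central bank raises interest rates again",
]
vectors = tfidf_vectors(articles)
print(most_similar(0, vectors, k=1))  # article 1 shares "neural", "networks", "recognition"
print(most_similar(2, vectors, k=1))  # article 3 shares "interest", "rates"
```

No training on user history is needed: the vectors are computed once from the article texts, and the lookup happens when the user opens an article.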

embedding rnn seq2seq and basic rnn seq2seq

After reading the documentation and digging around the web for a couple of days, I am still confused:
What is the difference between basic_rnn_seq2seq and embedding_rnn_seq2seq?
And when should each one be used? Thanks in advance.
Unfortunately, TensorFlow lacks detailed documentation and examples for its APIs.
I made a brief example for the functions in legacy_seq2seq:
https://github.com/j-min/tf_tutorial_plus/tree/master/RNN_seq2seq/legacy_seq2seq
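
To my understanding, the main difference is on the input side: basic_rnn_seq2seq expects inputs that are already float vectors, while embedding_rnn_seq2seq takes integer token ids and performs an embedding lookup internally (which is why it also needs the vocabulary sizes and the embedding size as arguments). So the embedding variant is the natural choice for word or character ids, and the basic one for data that is already numeric. The sketch below does not use TensorFlow at all; it only illustrates that extra lookup step in plain Python, and `embedding_table` and `embed` are hypothetical names.

```python
import random

random.seed(0)
vocab_size, embedding_size = 5, 3

# Hypothetical embedding table: one small float vector per token id.
# In a real model these values are learned during training.
embedding_table = [[random.uniform(-1, 1) for _ in range(embedding_size)]
                   for _ in range(vocab_size)]


def embed(token_ids):
    """Turn a sequence of integer ids into a sequence of float vectors."""
    return [embedding_table[i] for i in token_ids]


token_ids = [2, 0, 4]       # the kind of input the embedding variant accepts
vectors = embed(token_ids)  # the kind of input the basic variant accepts
print(len(vectors), len(vectors[0]))
```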

Bayesian classification or similar technique for recommendation system

I'm working on a news app. On the home page, the user sees a list of headlines and then he can click one to read the article and comment.
I would like to offer an option for "recommended articles" based on his history. For example, if he read an article - I'll feed the algorithm with the headline keywords so it will learn what this user likes to read.
My problem with what I've read about Bayesian filters is that you need to train them with both good input and bad input (such as good emails and spam emails). The difference in my case is that there are no bad examples. If the user didn't read an article, that doesn't make it a bad classification (he might still read it in the future); but if he did read one, it's more likely that he'll read similar articles in the future.
Basically, I'm looking for an algorithm to help me recommend articles to a specific user - based on what he read in the past. It will run on a mobile device, so any implementation (C/C++/Obj-C) will work.
Thanks.
You can treat this as a binary classification problem. It is either an article he likes to read or an article he possibly doesn't like to read.
You can use the dlib C++ library for the binary classifier algorithm.
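
To make the binary framing concrete, here is a minimal sketch in Python. It is not dlib and not C++: it uses a small hand-rolled logistic regression over headline words, with articles the user read as positives and unread ones as (weak) negatives. The headlines, function names, and training settings are all made up for illustration.

```python
import math


def features(headline):
    """Binary bag-of-words features: the set of lowercase words."""
    return set(headline.lower().split())


def predict(words, weights, bias):
    """Probability that the user would read an article with these words."""
    z = bias + sum(weights.get(w, 0.0) for w in words)
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
    return 1.0 / (1.0 + math.exp(-z))


def train(examples, epochs=200, lr=0.5):
    """Logistic regression trained with stochastic gradient descent."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for headline, label in examples:
            words = features(headline)
            error = label - predict(words, weights, bias)
            bias += lr * error
            for w in words:
                weights[w] = weights.get(w, 0.0) + lr * error
    return weights, bias


# 1 = the user read the article, 0 = shown but not read (a weak negative).
history = [
    ("team wins football cup final", 1),
    ("football transfer news roundup", 1),
    ("parliament votes on tax bill", 0),
    ("new tax rules for small firms", 0),
]
weights, bias = train(history)

sports_score = predict(features("football cup draw announced"), weights, bias)
politics_score = predict(features("tax bill delayed again"), weights, bias)
print(sports_score > politics_score)
```

Because unread articles are only weak evidence of dislike, it is probably safer to use the scores for ranking candidates than to treat a low score as a hard "no".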

Book and article references sought for starting out with document classification

I am interested in doing a project on document classification. I have been looking for books that cover the theoretical side of text mining related to this, or for articles that describe the process of going from training data (documents classified into categories and subcategories) to a system that predicts the class of a document. There seem to be some (rather expensive!) titles available, but these are conference proceedings with articles on small, very specific topics. Can someone suggest books from the data mining literature that provide a good theoretical basis for a project on text mining, specifically document classification, or articles that give an overview of this process?
Christopher D. Manning, Prabhakar Raghavan & Hinrich Schütze have a free information retrieval book. Try chapter 13 - Text classification & Naive Bayes.
See also the companion site for Manning and Schütze's NLP book, specifically the links for the text categorization chapter.
Fabrizio Sebastiani wrote a useful tutorial about text categorization (PDF) and a review paper on machine learning for text categorization (PDF).
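
As a taste of what the Naive Bayes material covers, here is a minimal sketch of a multinomial Naive Bayes classifier with add-one smoothing, going from labeled training documents to a predicted class. The function names and the toy documents are my own, chosen only for illustration; a real project would add proper tokenization, feature selection, and evaluation on held-out data.

```python
import math
from collections import Counter, defaultdict


def train_nb(labeled_docs):
    """Count documents per class and term occurrences per class."""
    class_docs = Counter()
    term_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labeled_docs:
        tokens = text.lower().split()
        class_docs[label] += 1
        term_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, term_counts, vocab


def predict_nb(text, model):
    """Return the class with the highest log posterior score."""
    class_docs, term_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, float("-inf")
    for label in class_docs:
        score = math.log(class_docs[label] / total_docs)       # log prior
        denom = sum(term_counts[label].values()) + len(vocab)  # add-one smoothing
        for token in text.lower().split():
            if token in vocab:  # terms never seen in training are ignored
                score += math.log((term_counts[label][token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


training_docs = [
    ("chinese beijing chinese", "china"),
    ("chinese chinese shanghai", "china"),
    ("chinese macao", "china"),
    ("tokyo japan chinese", "japan"),
]
model = train_nb(training_docs)
print(predict_nb("chinese chinese chinese tokyo japan", model))
print(predict_nb("tokyo japan", model))
```

Subcategories can be handled the same way, either by treating each subcategory as its own class or by classifying the top-level category first and then the subcategory within it.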

How to implement speech recognition and text-to-speech in C++?

I want to know about various techniques to do speech recognition and text to speech conversion.
Also, please let me know about any resources such as links, tutorials, e-books, etc. on it.
Which is the most efficient technique to achieve it?
I'm going to answer the part about speech recognition (since I don't know much about text-to-speech):
The book "Statistical Methods for Speech Recognition" is a classic that explains the mathematical foundations of statistical speech recognition, written by the founder of that area, Frederick Jelinek.
The most important concept you have to know is Hidden Markov Models. People have been using them in speech recognition for decades. A recent approach uses Conditional Random Fields, see the paper (PDF) and the associated software toolkit SCARF.
It is fairly hard to write your own speech recognizer. It's an active research area with several scientific conferences, e.g. ASRU, Interspeech, ICASSP.
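
To give a feel for what working with Hidden Markov Models involves, here is a minimal sketch of Viterbi decoding, which finds the most likely hidden state sequence for a sequence of observations. The two-state model (silence vs. speech, observed through low or high frame energy) and all its probabilities are invented for illustration; a real recognizer uses phone-level states and continuous acoustic features.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden state sequence for the observations."""
    # best[t][s] = (probability of the best path ending in s at time t, previous state)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            prob, prev = max((best[-1][p][0] * trans_p[p][s], p) for p in states)
            column[s] = (prob * emit_p[s][obs], prev)
        best.append(column)
    # Trace back from the most probable final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]


states = ("silence", "speech")
start_p = {"silence": 0.8, "speech": 0.2}
trans_p = {
    "silence": {"silence": 0.7, "speech": 0.3},
    "speech": {"silence": 0.2, "speech": 0.8},
}
emit_p = {
    "silence": {"low": 0.9, "high": 0.1},
    "speech": {"low": 0.2, "high": 0.8},
}

frames = ["low", "low", "high", "high", "low"]
print(viterbi(frames, states, start_p, trans_p, emit_p))
# ['silence', 'silence', 'speech', 'speech', 'silence']
```

For long sequences the products of probabilities underflow, so practical implementations work with log probabilities and sums instead.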
Both are very wide areas.
About recognition: In this schema you will find how to build a basic automatic speech recognition system. It isn't by any means close to the state of the art, but it is something achievable and it works. If you want to do something more advanced, read about cepstral coefficients and Hidden Markov Models. Have a look at HTK; it is a widely used toolkit for Hidden Markov Models.
About text to speech: I'd have a look at Festival.
There are multiple versions of Sphinx. The main active ones are pocketsphinx and sphinx4.
Sphinx4 is written in Java. It is better for desktop and web applications.
Pocketsphinx is written in C. It is better for embedded devices. There are iPhone/Android apps that use it.
Sounds like you want pocketsphinx. Try out this tutorial:
http://www.speech.cs.cmu.edu/sphinx/tutorial.html
A better place to ask pocketsphinx/sphinx4 questions is on CMU's sourceforge forum.
Also you should provide more info like what you intend to make.
As for books, the bible of speech recognition is "Spoken Language Processing".
Since you mentioned MS -
You should just look at the Microsoft Speech site. It contains many resources for dealing with speech, including TTS and speech recognition.
If you're looking for some actual code, check out Sphinx, an open source speech recognition project from CMU. It's not written in C++, but if you're interested in algorithms, it implements a lot that you can learn from. (I'd like to echo @dehmann's point, too: read up on hidden Markov models.)
If you are curious about what to do with your fancy speech recognition you should read:
Voice Interaction Design by Randy Allen Harris
It provides some great advice about when to use Voice and how to use it in an application.