how to make a vectorized file in python. I need to convert tweets to vector form inorder to run a code in bayesian network - python-2.7

Is it possible to make a dataset atleast? I am doing sentiment analysis and is getting polarity of the message
I was following this tutorial. But it is not the data set required.
http://machinelearningmastery.com/naive-bayes-classifier-scratch-python/
It would be great if anyone could explain the csv file given here.

Basically, the process of converting a collection of text documents into numerical feature vectors is called vectorization. There are several techniques or concepts that can be used to vectorize text documents(for eg. word embeddings, bag of words, etc.).
Bag of words is one of the simplest ways to vectorize text into numerical features. TfIdf is an effective vectorization technique based on the bag of words concept.
On a very basic level, TfIdf uses a set of unigrams or bigrams(n-grams in general) from the entire text corpus and uses them as the features for all your text documents(tweets in your case). So if you imagine your text corpus as a table of numerical values then each row would be a text document(a tweet in your case) and each column would be a unigram(which is basically a word) and the value of each cell (i,j) in the table would depend on the term frequency of unigram j in the tweet i(the number of times that the particular unigram occurs in the tweet) and the inverse of the document frequency of the unigram j(the number of tweets that the particular unigram occurs in all the tweets combined). Hence, you would have a list of tweets as vectors which would have a numerical tfidf values corresponding to each feature(unigram).
For more information on how to implement tfidf look at the following links:
http://scikit-learn.org/stable/modules/feature_extraction.html#the-bag-of-words-representation
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

Related

Document classification: Preprocessing and multiple labels

I have a question about the word representation algorithms:
Which one of the algorithms word2Vec, doc2Vec and Tf-IDF is more suitable for handling text classification tasks ?
The corpus used in my supervised learning classification is composed of a list of multiple sentences, with both short length sentences and long length ones. As discussed in this thread, doc2vec vs word2vec choice is a matter of document length. As for Tf-Idf vs. word embedding, it's more a matter of text representation.
My other question is, what if for the same corpus I had more than one label to link to the sentences in it ? If I create multiple entries/labels for the same sentence, it affects the decision of the final classification algorithm. How can I tell the model that every label counts equal for every sentence of the document ?
Thank you in advance,
You should try multiple methods of turning your sentences into 'feature vectors'. There are no hard-and-fast rules; what works best for your project will depend a lot on your specific data, problem-domains, & classification goals.
(Don't extrapolate guidelines from other answers – such as the one you've linked that's about document-similarity rather than classification – as best practices for your project.)
To get initially underway, you may want to focus on some simple 'binary classification' aspect of your data, first. For example, pick a single label. Train on all the texts, merely trying to predict if that one label applies or not.
When you have that working, so you have a understanding of each step – corpus prep, text processing, feature-vectorization, classification-training, classification-evaluation – then you can try extending/adapting those steps to either single-label classification (where each text should have exactly one unique label) or multi-label classification (where each text might have any number of combined labels).

Clear approach for assigning semantic tags to each sentence (or short documents) in python

I am looking for a good approach using python libraries to tackle the following problem:
I have a dataset with a column that has product description. The values in this column can be very messy and would have a lot of other words that are not related to the product. I want to know which rows are about the same product, so I would need to tag each description sentence with its main topics. For example, if I have the following:
"500 units shoe green sport tennis import oversea plastic", I would like the tags to be something like: "shoe", "sport". So I am looking to build an approach for semantic tagging of sentences, not part of speech tagging. Assume I don't have labeled (tagged) data for training.
Any help would be appreciated.
Lack of labeled data means you cannot apply any semantic classification method using word vectors, which would be the optimal solution to your problem. An alternative however could be to construct the document frequencies of your token n-grams and assume importance based on some smoothed variant of idf (i.e. words that tend to appear often in descriptions probably carry some semantic weight). You can then inspect your sorted-by-idf list of words and handpick(/erase) words that you deem important(/unimportant). The results won't be perfect, but it's a clean and simple solution given your lack of training data.

Building Speech Dataset for LSTM binary classification

I'm trying to do binary LSTM classification using theano.
I have gone through the example code however I want to build my own.
I have a small set of "Hello" & "Goodbye" recordings that I am using. I preprocess these by extracting the MFCC features for them and saving these features in a text file. I have 20 speech files(10 each) and I am generating a text file for each word, so 20 text files that contains the MFCC features. Each file is a 13x56 matrix.
My problem now is: How do I use this text file to train the LSTM?
I am relatively new to this. I have gone through some literature on it as well but not found really good understanding of the concept.
Any simpler way using LSTM's would also be welcome.
There are many existing implementation for example Tensorflow Implementation, Kaldi-focused implementation with all the scripts, it is better to check them first.
Theano is too low-level, you might try with keras instead, as described in tutorial. You can run tutorial "as is" to understand how things goes.
Then, you need to prepare a dataset. You need to turn your data into sequences of data frames and for every data frame in sequence you need to assign an output label.
Keras supports two types of RNNs - layers returning sequences and layers returning simple values. You can experiment with both, in code you just use return_sequences=True or return_sequences=False
To train with sequences you can assign dummy label for all frames except the last one where you can assign the label of the word you want to recognize. You need to place input and output labels to arrays. So it will be:
X = [[word1frame1, word1frame2, ..., word1framen],[word2frame1, word2frame2,...word2framen]]
Y = [[0,0,...,1], [0,0,....,2]]
In X every element is a vector of 13 floats. In Y every element is just a number - 0 for intermediate frames and word ID for final frame.
To train with just labels you need to place input and output labels to arrays and output array is simpler. So the data will be:
X = [[word1frame1, word1frame2, ..., word1framen],[word2frame1, word2frame2,...word2framen]]
Y = [[0,0,1], [0,1,0]]
Note that output is vectorized (np_utils.to_categorical) to turn it to vectors instead of just numbers.
Then you create network architecture. You can have 13 floats for input, a vector for output. In the middle you might have one fully connected layer followed by one lstm layer. Do not use too big layers, start with small ones.
Then you feed this dataset into model.fit and it trains you the model. You can estimate model quality on heldout set after training.
You will have a problem with convergence since you have just 20 examples. You need way more examples, preferably thousands to train LSTM, you will only be able to use very small models.

Is is possible to boost some words using StringToWordVector

I'm using StringToWordVector Naive Bayes and StringToWordVector to classify some text.
I'm also using TD/IDF to put score on words.
Is there a simple way to increase the score of some words (chosen by myself) during the training to increase the weight of this words in the model for a given class?
So if this words are present in a new document the classifier would know there is more chance that the document belongs to this class.
Thanks!
You want to increase the probability that documents containing certain words will be classified as a certain kind of document.
What you can do, is to simply train your classifier with "hand made" documents that contain exactly these words, and then mark these documents as belonging to a specific class.

Improving classification results with Weka J48 and Naive Bayes Multinomial classifiers

I have been using Weka’s J48 and Naive Bayes Multinomial (NBM) classifiers upon
frequencies of keywords in RSS feeds to classify the feeds into target
categories.
For example, one of my .arff files contains the following data extracts:
#attribute Keyword_1_nasa_Frequency numeric
#attribute Keyword_2_fish_Frequency numeric
#attribute Keyword_3_kill_Frequency numeric
#attribute Keyword_4_show_Frequency numeric
…
#attribute RSSFeedCategoryDescription {BFE,FCL,F,M, NCA, SNT,S}
#data
0,0,0,34,0,0,0,0,0,40,0,0,0,0,0,0,0,0,0,0,24,0,0,0,0,13,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,BFE
0,0,0,12,0,0,0,0,0,20,0,0,0,0,0,0,0,0,0,0,25,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,BFE
0,0,0,10,0,0,0,0,0,11,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,BFE
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,BFE
…
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,FCL
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,F
…
20,0,64,19,0,162,0,0,36,72,179,24,24,47,24,40,0,48,0,0,0,97,24,0,48,205,143,62,7
8,0,0,216,0,36,24,24,0,0,24,0,0,0,0,140,24,0,0,0,0,72,176,0,0,144,48,0,38,0,284,
221,72,0,72,0,SNT
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,SNT
0,0,0,0,0,0,11,0,0,0,0,0,0,0,19,0,0,0,0,0,0,0,0,0,0,10,0,0,0,0,0,0,0,0,0,0,0,0,0
,0,0,0,0,0,0,0,0,0,17,0,0,0,0,0,0,0,0,0,0,0,0,0,20,0,S
And so on: there’s a total of 570 rows where each one is contains with the
frequency of a keyword in a feed for a day. In this case, there are 57 feeds for
10 days giving a total of 570 records to be classified. Each keyword is prefixed
with a surrogate number and postfixed with ‘Frequency’.
I am using 10 fold x validation for both the J48s and NBM classifiers on a
'black box' basis. Other parameters used are also defaults, i.e. 0.25 confidence
and min number of objects is 2 for the J48s.
So far, my classification rates for an instance of varying numbers of days, date
ranges and actual keyword frequencies with both J28 and NBM results being
consistent in the 50 - 60% range. But, I would like to improve this if possible.
I have reduced the decision tree confidence level, sometimes as low as 0.1 but
the improvements are very marginal.
Can anyone suggest any other way of improving my results?
To give more information, the basic process here involves a diverse collection of RSS feeds where each one belongs to a single category.
For a given date range, e.g. 01 - 10 Sep 2011, the text of each feed's item elements are combined. The text is then validated to remove words with numbers, accents and so on, and stop words (a list of 500 stop words from MySQL is used). The remaining text is then indexed in Lucene to work out the most popular 64 words.
Each of these 64 words is then searched for in the description elements of the feeds for each day within the given date range. As part of this, the description text is also validated in the same way as the title text and again indexed by Lucene. So a popular keyword from the title such as 'declines' is stemmed to 'declin': then if any similar words are found in the description elements which also stem to 'declin', such as 'declined', the frequency for 'declin' is taken from Lucene's indexing of the word from the description elements.
The frequencies shown in the .arff file match on this basis, i.e. on the first line above, 'nasa', 'fish', 'kill' are not found in the description items of a particular feed in the BFE category for that day, but 'show' is found 34 times. Each line represents occurrences in the description items of a feed for a day for all 64 keywords.
So I think that the low frequencies are not due to stemming. Rather I see it as the inevitable result of some keywords being popular in feeds of one category, but which don't appear in other feeds at all. Hence the spareness shown in the results. Generic keywords may also be pertinent here as well.
The other possibilities are differences in the numbers of feeds per category where more feeds are in categories like NCA than S, or the keyword selection process itself is at fault.
You don't mention anything about stemming. In my opinion you could have better results if you were performing word stemming and the WEKA evaluation was based on the keyword stems.
For example let's suppose that your WEKA model is built given a keyword surfing and a new rss feed contains the word surf. There should be a match between these two words.
There are many free available stemmers for several languages.
For the English language some available options for stemming are:
The Porter's stemmer
Stemming based on the WordNet's dictionary
In case you would like to perform stemming using the WordNet's dictionary, there are libraries & frameworks that perform integration with WordNet.
Below you can find some of them:
MIT Java WordNet interface (JWI)
Rita
Java WorNet Library (JWNL)
EDITED after more information was provided
I believe that the keypoint in the specified case is the selection of the "most popular 64 words". The selected words or phrases should be keywords or keyphrases. So the challenge here is the keywords or keyphrases extraction.
There are several books, papers and algorithms written about keywords/keyphrases extraction. The university of Waikato has implemented in JAVA, a famous algorithm called Keyword Extraction Algorithm (KEA). KEA extracts keyphrases from text documents and can be either used for free indexing or for indexing with a controlled vocabulary. The implementation is distributed under the GNU General Public License.
Another issue that should be taken into consideration is the (Part of Speech)POS tagging. Nouns contain more information than the other POS tags. Therefore may you would have better results if you were checking the POS tag and the selected 64 words were mostly nouns.
In addition according to the Anette Hulth's published paper Improved Automatic Keyword Extraction Given More Linguistic Knowledge, her experiments showed that the keywords/keyphrases mostly have or are contained in one of the following five patterns:
ADJECTIVE NOUN (singular or mass)
NOUN NOUN (both sing. or mass)
ADJECTIVE NOUN (plural)
NOUN (sing. or mass) NOUN (pl.)
NOUN (sing. or mass)
In conclusion a simple action that in my opinion could improve your results is to find the POS tag for each word and select mostly nouns in order to evaluate the new RSS feeds. You can use WordNet in order to find the POS tag for each word and as I mentioned above there are many libraries on the web that perform integration with the WordNet's dictionary. Of course stemming is also essential for the classification process and has to be maintained.
I hope this helps.
Try turning off stemming altogether. The Stanford Intro to IR authors provide a rough justification of why stemming hurts, and at the very least does not help, in text classification contexts.
I have tested stemming myself on a custom multinomial naive Bayes text classification tool (I get accuracies of 85%). I tried the 3 Lucene stemmers available from org.apache.lucene.analysis.en version 4.4.0, which are EnglishMinimalStemFilter, KStemFilter and PorterStemFilter, plus no stemming, and I did the tests on small and larger training document corpora. Stemming significantly degraded classification accuracy when the training corpus was small, and left accuracy unchanged for the larger corpus, which is consistent with the Intro to IR statements.
Some more things to try:
Why only 64 words? I would increase that number by a lot, but preferably you would not have a limit at all.
Try tf-idf (term frequency, inverse document frequency). What you're using now is just tf. If you multiply this by idf you can mitigate problems arising from common and uninformative words like "show". This is especially important given that you're using so few top words.
Increase the size of the training corpus.
Try shingling to bi-grams, tri-grams, etc, and combinations of different N-grams (you're now using just unigrams).
There's a bunch of other knobs you could turn, but I would start with these. You should be able to do a lot better than 60%. 80% to 90% or better is common.