I'm using Naive Bayes together with the StringToWordVector filter to classify some text.
I'm also using TF-IDF to score the words.
Is there a simple way to increase the score of some words (chosen by myself) during training, so that these words carry more weight in the model for a given class?
That way, if these words are present in a new document, the classifier would know there is a higher chance that the document belongs to that class.
Thanks!
You want to increase the probability that documents containing certain words will be classified as a certain kind of document.
What you can do is simply train your classifier with additional "hand-made" documents that contain exactly these words, and mark those documents as belonging to the specific class.
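As an illustration of that idea, here is a minimal sketch using scikit-learn as a stand-in for Weka; the class names, boost words and repeat count are made-up values, not anything from your setup:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny toy training set (two classes); all values here are invented.
train_docs = ["cheap flights and hotel deals", "football match results today"]
train_labels = ["travel", "sports"]

# "Hand-made" documents containing only the words we want to push towards
# the "travel" class; adding several copies strengthens their weight.
boost_words = ["airfare", "itinerary"]
hand_made = [" ".join(boost_words)] * 3
train_docs += hand_made
train_labels += ["travel"] * len(hand_made)

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_docs, train_labels)
print(model.predict(["my itinerary and airfare receipt"]))  # -> ['travel']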
I have a question about the word representation algorithms:
Which of the algorithms word2vec, doc2vec and TF-IDF is most suitable for handling text classification tasks?
The corpus used in my supervised learning classification is composed of a list of multiple sentences, with both short length sentences and long length ones. As discussed in this thread, doc2vec vs word2vec choice is a matter of document length. As for Tf-Idf vs. word embedding, it's more a matter of text representation.
My other question is: what if, for the same corpus, I had more than one label to attach to the sentences in it? If I create multiple entries/labels for the same sentence, it affects the decision of the final classification algorithm. How can I tell the model that every label counts equally for every sentence of the document?
Thank you in advance,
You should try multiple methods of turning your sentences into 'feature vectors'. There are no hard-and-fast rules; what works best for your project will depend a lot on your specific data, problem-domains, & classification goals.
(Don't extrapolate guidelines from other answers – such as the one you've linked that's about document-similarity rather than classification – as best practices for your project.)
To get initially underway, you may want to focus on some simple 'binary classification' aspect of your data, first. For example, pick a single label. Train on all the texts, merely trying to predict if that one label applies or not.
When you have that working, so that you have an understanding of each step – corpus prep, text processing, feature-vectorization, classification-training, classification-evaluation – then you can try extending/adapting those steps to either single-label classification (where each text should have exactly one unique label) or multi-label classification (where each text might have any number of combined labels), as in the sketch below.
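Here is a rough sketch of that progression in scikit-learn; the texts and labels are purely placeholder values:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

texts = ["great plot and acting", "boring and far too long", "funny but shallow"]

# Step 1: binary - does the single label "positive" apply or not?
binary_y = [1, 0, 1]
binary_clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
binary_clf.fit(texts, binary_y)

# Step 2: multi-label - each text may carry any number of labels at once.
multi_y = MultiLabelBinarizer().fit_transform(
    [{"positive"}, {"negative"}, {"positive", "negative"}])
multi_clf = make_pipeline(TfidfVectorizer(),
                          OneVsRestClassifier(LogisticRegression()))
multi_clf.fit(texts, multi_y)
print(multi_clf.predict(["funny with great acting"]))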
I am looking for a good approach using python libraries to tackle the following problem:
I have a dataset with a column that contains product descriptions. The values in this column can be very messy and often include many words unrelated to the product. I want to know which rows are about the same product, so I need to tag each description sentence with its main topics. For example, if I have the following:
"500 units shoe green sport tennis import oversea plastic", I would like the tags to be something like: "shoe", "sport". So I am looking to build an approach for semantic tagging of sentences, not part of speech tagging. Assume I don't have labeled (tagged) data for training.
Any help would be appreciated.
Lack of labeled data means you cannot apply any semantic classification method using word vectors, which would be the optimal solution to your problem. An alternative, however, is to compute the document frequencies of your token n-grams and assume importance based on some smoothed variant of IDF (i.e. words that tend to appear often in descriptions probably carry some semantic weight). You can then inspect your sorted-by-IDF list of words and handpick (or erase) the words that you deem important (or unimportant). The results won't be perfect, but it's a clean and simple solution given your lack of training data.
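A small sketch of that document-frequency/IDF idea, in plain Python with invented descriptions, could look like this:

import math
from collections import Counter

descriptions = [
    "500 units shoe green sport tennis import oversea plastic",
    "shoe sport running blue 200 units import",
    "plastic bottle 1000 units import oversea",
]

# Document frequency: in how many descriptions does each token appear?
doc_freq = Counter()
for doc in descriptions:
    doc_freq.update(set(doc.lower().split()))

n_docs = len(descriptions)
# Smoothed IDF; inspect the sorted list and handpick/erase terms by hand.
idf = {w: math.log((1 + n_docs) / (1 + df)) + 1 for w, df in doc_freq.items()}
for word, score in sorted(idf.items(), key=lambda kv: kv[1]):
    print(f"{score:.2f}  {word}")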
I am using H2O machine learning package to do natural language predictions, including the functions h2o.word2vec and h2o.transform. I need sentence level aggregation, which is provided by the AVERAGE parameter value:
h2o.transform(word2vec, words, aggregate_method = c("NONE", "AVERAGE"))
However, in my case I strongly wish to avoid equal weighting of "the" and "platypus" for example.
Here's a scheme I concocted to achieve custom word weightings. If H2O's word2vec "AVERAGE" option uses all the words, including any duplicates that appear, then I could effect a custom word weighting when calling h2o.transform by adding extra duplicates of certain words to my sentences whenever I want to weight them more heavily than other words.
Can any H2O experts confirm that the word2vec AVERAGE parameter uses all the words, rather than just the unique words, when computing the average over a sentence?
Alternatively, is there a better way? I tried, but I cannot see any correct way to multiply the sentence average by some factor after it has already been computed.
Yes, h2o.transform will consider each occurrence of a word for the averaging, not just the unique words. Your trick will therefore work.
There is currently no direct way to provide user-defined weights. You could probably do an ugly hack and weight the word embeddings directly, but that is not a straightforward solution I could recommend.
We can add this feature to H2O. I would love to hear what API would work for you (how would you like to provide the weights).
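In the meantime, here is a toy illustration (plain NumPy with made-up vectors, not the H2O API) of why the duplication trick amounts to a weighted average, and what an explicit weighting would compute:

import numpy as np

# Two made-up 2-dimensional word vectors.
emb = {"the": np.array([0.1, 0.0]), "platypus": np.array([0.9, 0.4])}

plain = np.mean([emb["the"], emb["platypus"]], axis=0)

# Repeating "platypus" gives its vector 2/3 of the weight in the mean ...
duplicated = np.mean([emb["the"], emb["platypus"], emb["platypus"]], axis=0)

# ... which is exactly an explicit weighted average with weights 1 and 2.
weights = np.array([1.0, 2.0])
weighted = (weights[0] * emb["the"] + weights[1] * emb["platypus"]) / weights.sum()

print(plain, duplicated, weighted)  # duplicated equals weighted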
Is it possible to make a dataset at least? I am doing sentiment analysis and getting the polarity of the message.
I was following this tutorial, but it is not the data set I require.
http://machinelearningmastery.com/naive-bayes-classifier-scratch-python/
It would be great if anyone could explain the csv file given here.
Basically, the process of converting a collection of text documents into numerical feature vectors is called vectorization. There are several techniques that can be used to vectorize text documents (e.g. word embeddings, bag of words, etc.).
Bag of words is one of the simplest ways to vectorize text into numerical features. TfIdf is an effective vectorization technique based on the bag of words concept.
On a very basic level, TF-IDF takes the set of unigrams or bigrams (n-grams in general) from the entire text corpus and uses them as the features for all your text documents (tweets in your case). If you imagine your text corpus as a table of numerical values, then each row is a text document (a tweet) and each column is a unigram (basically a word). The value of each cell (i, j) in the table depends on the term frequency of unigram j in tweet i (the number of times that unigram occurs in the tweet) and the inverse of the document frequency of unigram j (based on the number of tweets in which that unigram occurs across all the tweets combined). Hence, each tweet becomes a vector with a numerical TF-IDF value for each feature (unigram).
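As a short example of that process using scikit-learn (the tweets below are placeholders):

from sklearn.feature_extraction.text import TfidfVectorizer

tweets = ["the cat sat on the mat", "the dog chased the cat", "dogs and cats"]

vectorizer = TfidfVectorizer(ngram_range=(1, 1))  # unigram features
X = vectorizer.fit_transform(tweets)              # rows = tweets, columns = unigrams

print(vectorizer.get_feature_names_out())
print(X.toarray())  # cell (i, j) is the TF-IDF value of unigram j in tweet i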
For more information on how to implement tfidf look at the following links:
http://scikit-learn.org/stable/modules/feature_extraction.html#the-bag-of-words-representation
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
What is StringToWordVector? All I know about it is that it converts a string attribute into multiple attributes. But what is the advantage of doing so, and how does an object of the StringToWordVector class serve as a filter for FilteredClassifier? How does it become a filter?
StringToWordVector is a filter class in Weka which turns strings into n-grams using the WordTokenizer class. This lets us provide strings as n-grams to the classifier. Besides tokenizing, it also provides other functionality such as removing stopwords, weighting words with TF-IDF, outputting word counts rather than just indicating whether a word is present, pruning, stemming, lowercase conversion of words, etc. A detailed explanation of this class can be found at http://weka.sourceforge.net/doc.dev/weka/filters/unsupervised/attribute/StringToWordVector.html So basically it provides the functionality that helps us fine-tune the training set according to our requirements before training.
However, anyone who wants to perform testing along with training must use batch filtering or FilteredClassifier to ensure compatibility of the train and test sets. This is because if we pass the train and test sets separately through StringToWordVector, it will generate a different vocabulary for each. To decide which of the two techniques to choose, batch filtering or FilteredClassifier, follow the post by Nihil Obstat at http://jmgomezhidalgo.blogspot.in/2013/01/text-mining-in-weka-chaining-filters.html
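The same train/test compatibility concern exists outside Weka too. As a rough scikit-learn analogue of batch filtering (with placeholder texts), you would fit the vectorizer on the training set only and reuse the same fitted object on the test set:

from sklearn.feature_extraction.text import TfidfVectorizer

train_texts = ["spam offer now", "meeting at noon"]
test_texts = ["free offer", "lunch meeting"]

vec = TfidfVectorizer()
X_train = vec.fit_transform(train_texts)  # learns the vocabulary from training data only
X_test = vec.transform(test_texts)        # same columns; unseen words are dropped

print(X_train.shape, X_test.shape)        # both have the same number of columns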
Hope this helps.