What is StringToWordVector? All I know is that it converts a string attribute into multiple attributes. But what is the advantage of doing so, and how does an object of the StringToWordVector class serve as a filter for FilteredClassifier? How does it become a filter?
StringToWordVector is a filter class in Weka that converts string attributes into word attributes. By default it splits the text with the WordTokenizer class, and an n-gram tokenizer can be configured instead. This lets us present strings to a classifier as words or N-grams. Besides tokenizing, it also provides other functionality such as removing stopwords, weighting words with TF-IDF, outputting word counts rather than just indicating whether a word is present, pruning rate, stemming, lowercase conversion of words, etc. The documentation at http://weka.sourceforge.net/doc.dev/weka/filters/unsupervised/attribute/StringToWordVector.html explains this class in detail. So basically it provides functionality that helps us fine-tune the training set according to our requirements before training.
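To make the idea concrete, here is a minimal Python sketch of turning strings into word attributes. It is not Weka's implementation: the stopword list, the tokenizing regex, and the names tokenize and to_word_vector are all assumptions made up for illustration. It only shows tokenizing, lowercasing, stopword removal, and the choice between word presence and word counts.

```python
import re
from collections import Counter

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"the", "a", "is", "of", "and"}

def tokenize(text, lowercase=True):
    """Split on non-alphanumeric characters, optionally lowercasing first."""
    if lowercase:
        text = text.lower()
    return [tok for tok in re.split(r"[^A-Za-z0-9]+", text) if tok]

def to_word_vector(text, vocabulary, output_counts=False):
    """Return one value per vocabulary word: a count, or 0/1 for presence."""
    counts = Counter(tok for tok in tokenize(text) if tok not in STOPWORDS)
    if output_counts:
        return [counts[word] for word in vocabulary]
    return [1 if counts[word] > 0 else 0 for word in vocabulary]

documents = ["The drug caused a headache", "Headache and nausea, nausea again"]

# One attribute per distinct non-stopword token seen in the documents.
vocabulary = sorted({tok for doc in documents for tok in tokenize(doc)
                     if tok not in STOPWORDS})

presence = [to_word_vector(doc, vocabulary) for doc in documents]
counts = [to_word_vector(doc, vocabulary, output_counts=True) for doc in documents]
```

Each document becomes a fixed-length numeric row, which is the form a classifier can work with.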
However, anyone who wants to test as well as train must use batch filtering or FilteredClassifier to ensure that the training and test sets are compatible. This is because if we pass the training and test sets through StringToWordVector separately, it will generate a different vocabulary for each set. To decide between batch filtering and FilteredClassifier, follow the post by Nihil Obstat at http://jmgomezhidalgo.blogspot.in/2013/01/text-mining-in-weka-chaining-filters.html
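The vocabulary mismatch can be illustrated with a small Python sketch. This is an assumption-laden toy, not Weka code: build_vocabulary and vectorize are hypothetical helpers using plain whitespace tokenizing. The point is only that the vocabulary has to be learned once, on the training set, and then reused for the test set.

```python
def build_vocabulary(texts):
    """Collect the sorted set of lowercase whitespace-separated tokens."""
    return sorted({tok for text in texts for tok in text.lower().split()})

def vectorize(text, vocabulary):
    """Binary word-presence vector over a fixed vocabulary."""
    tokens = set(text.lower().split())
    return [1 if word in tokens else 0 for word in vocabulary]

train_texts = ["severe headache after dose", "no side effects"]
test_texts = ["mild headache and rash"]

# Filtering each set on its own yields two different attribute lists.
train_vocab = build_vocabulary(train_texts)
test_vocab = build_vocabulary(test_texts)
vocabularies_differ = train_vocab != test_vocab

# Reusing the training vocabulary keeps both sets in the same attribute space;
# words that occur only in the test set, such as "rash", are dropped.
train_vectors = [vectorize(t, train_vocab) for t in train_texts]
test_vectors = [vectorize(t, train_vocab) for t in test_texts]
```

Both batch filtering and FilteredClassifier amount to this second approach: the filter is initialized on the training data and then applied unchanged to the test data.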
Hope this helps.
Related
I have a question about the word representation algorithms:
Which of the algorithms word2vec, doc2vec and TF-IDF is most suitable for text classification tasks?
The corpus used in my supervised classification consists of multiple sentences, both short and long. As discussed in this thread, the doc2vec vs. word2vec choice is a matter of document length. As for TF-IDF vs. word embeddings, it's more a matter of text representation.
My other question is: what if, for the same corpus, I had more than one label to link to each sentence? If I create multiple entries/labels for the same sentence, it affects the decision of the final classification algorithm. How can I tell the model that every label counts equally for every sentence of the document?
Thank you in advance,
You should try multiple methods of turning your sentences into 'feature vectors'. There are no hard-and-fast rules; what works best for your project will depend a lot on your specific data, problem-domains, & classification goals.
(Don't extrapolate guidelines from other answers – such as the one you've linked that's about document-similarity rather than classification – as best practices for your project.)
To get initially underway, you may want to focus on some simple 'binary classification' aspect of your data, first. For example, pick a single label. Train on all the texts, merely trying to predict if that one label applies or not.
When you have that working, so you have an understanding of each step – corpus prep, text processing, feature-vectorization, classification-training, classification-evaluation – then you can try extending/adapting those steps to either single-label classification (where each text should have exactly one unique label) or multi-label classification (where each text might have any number of combined labels).
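As a rough sketch of the "pick a single label" step, the snippet below stores each sentence once with a set of labels and derives one binary target list per label. The corpus, the label names, and the helper binary_targets are hypothetical and chosen only for illustration. Keeping one entry per sentence also avoids duplicating sentences for each label.

```python
def binary_targets(label_sets, label):
    """1 if the text carries `label`, else 0 (one binary problem per label)."""
    return [1 if label in labels else 0 for labels in label_sets]

# Hypothetical multi-label corpus: each sentence appears once, with a set of labels.
sentences = [
    "battery drains fast",
    "screen cracked on arrival",
    "battery and screen both fail",
]
label_sets = [{"battery"}, {"screen"}, {"battery", "screen"}]

# One binary target vector per label; a separate classifier can be trained on each.
all_labels = sorted(set().union(*label_sets))
targets_per_label = {label: binary_targets(label_sets, label) for label in all_labels}
```

Starting with just one entry of targets_per_label gives the simple binary problem. Training one classifier per entry is one possible route to the multi-label case.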
I am looking for a good approach using python libraries to tackle the following problem:
I have a dataset with a column that contains product descriptions. The values in this column can be very messy and often contain many other words that are not related to the product. I want to know which rows are about the same product, so I would need to tag each description sentence with its main topics. For example, if I have the following:
"500 units shoe green sport tennis import oversea plastic", I would like the tags to be something like: "shoe", "sport". So I am looking to build an approach for semantic tagging of sentences, not part of speech tagging. Assume I don't have labeled (tagged) data for training.
Any help would be appreciated.
Lack of labeled data means you cannot apply any semantic classification method using word vectors, which would be the optimal solution to your problem. An alternative however could be to construct the document frequencies of your token n-grams and assume importance based on some smoothed variant of idf (i.e. words that tend to appear often in descriptions probably carry some semantic weight). You can then inspect your sorted-by-idf list of words and handpick(/erase) words that you deem important(/unimportant). The results won't be perfect, but it's a clean and simple solution given your lack of training data.
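A minimal sketch of the document-frequency idea in Python follows. The example descriptions are made up, and the smoothing formula log((1 + N) / (1 + df)) + 1 is just one common variant assumed here, not something the answer prescribes. The output is a list sorted by idf that you can inspect by hand.

```python
import math
from collections import Counter

def document_frequencies(descriptions):
    """Count, for each token, the number of descriptions containing it."""
    df = Counter()
    for text in descriptions:
        df.update(set(text.lower().split()))
    return df

def smoothed_idf(df, n_docs):
    """One common smoothed variant: log((1 + N) / (1 + df)) + 1."""
    return {tok: math.log((1 + n_docs) / (1 + count)) + 1
            for tok, count in df.items()}

# Hypothetical product descriptions, for illustration only.
descriptions = [
    "500 units shoe green sport tennis import oversea plastic",
    "shoe sport running 200 units import",
    "plastic bottle 1000 units import oversea",
]

df = document_frequencies(descriptions)
idf = smoothed_idf(df, len(descriptions))

# Sort from most frequent (lowest idf) to rarest, ready for manual inspection.
ranked = sorted(idf.items(), key=lambda item: (item[1], item[0]))
```

In this toy data, generic words such as "units" and "import" appear in every description, so frequency alone will not separate product words from filler. That is where the hand-picking step comes in.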
I have two datasets for deciding whether a sentence mentions a drug adverse event. Both the training and test sets have only two fields: the text and the label {Adverse Event, No Adverse Event}. I used Weka with the StringToWordVector filter to build a Random Forest model on the training set.
I want to test the model by removing the class labels from the test set, applying the StringToWordVector filter to it, and running the model on the result. When I try this, I get an error saying that the training and test sets are not compatible, probably because the filter identifies a different set of attributes for the test set. How do I fix this and output the predictions for the test set?
The easiest way to do this for a one-off test is not to pre-filter the training set, but to use Weka's FilteredClassifier, configured with the StringToWordVector filter and your chosen classifier. This is explained well in this video from the More Data Mining with Weka online course.
For a more general solution, if you want to build the model once then evaluate it on different test sets in future, you need to use InputMappedClassifier:
Wrapper classifier that addresses incompatible training and test data
by building a mapping between the training data that a classifier has
been built with and the incoming test instances' structure. Model
attributes that are not found in the incoming instances receive
missing values, so do incoming nominal attribute values that the
classifier has not seen before. A new classifier can be trained or an
existing one loaded from a file.
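The mapping idea in that description can be sketched in a few lines of Python. This is only a conceptual illustration, not how InputMappedClassifier is implemented: map_instance is a hypothetical helper, and None stands in for a missing value.

```python
def map_instance(model_attributes, incoming_attributes, incoming_values):
    """Reorder incoming values to the model's attribute order.

    Attributes the model expects but the instance lacks become None
    (standing in for a missing value); extra incoming attributes are ignored.
    """
    lookup = dict(zip(incoming_attributes, incoming_values))
    return [lookup.get(name) for name in model_attributes]

# Attributes the model was trained with, versus what a new test instance has.
model_attributes = ["headache", "nausea", "rash"]
incoming_attributes = ["rash", "dizzy", "headache"]
incoming_values = [1, 1, 0]

mapped = map_instance(model_attributes, incoming_attributes, incoming_values)
```

Here "nausea" is unknown to the incoming instance and becomes a missing value, while "dizzy" is unknown to the model and is dropped.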
Weka requires a label even for the test data. It uses the labels, or "ground truth", of the test data to compare the model's results against and measure the model's performance. How would you tell whether a model is performing well if you don't know whether its predictions are right or wrong? Thus, the test data needs to have the very same structure as the training data in Weka, including the labels. No worries, the labels are not used to help the model with its predictions.
The best way to go is to select cross validation (e.g. 10-fold cross validation), which will automatically split your data into 10 parts, using 9 for training and the remaining 1 for testing. This procedure is repeated 10 times so that each of the 10 parts is used once as test data. The final performance verdict is the average of all 10 rounds. Cross validation gives you a fairly realistic estimate of the model's performance on new, unseen data.
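The fold mechanics can be sketched in plain Python as follows. The helpers k_fold_indices and cross_validate are hypothetical names, and the sketch splits by position without shuffling or stratifying, which a real tool would normally take care of. The evaluate_fold callback is a placeholder for "train on these indices, test on those, return a score".

```python
def k_fold_indices(n_instances, k=10):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_instances))
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n_instances // k + (1 if i < n_instances % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

def cross_validate(n_instances, evaluate_fold, k=10):
    """Average the score returned by evaluate_fold(train, test) over k folds."""
    scores = [evaluate_fold(train, test)
              for train, test in k_fold_indices(n_instances, k)]
    return sum(scores) / len(scores)

folds = list(k_fold_indices(25, k=10))
```

Every instance lands in exactly one test fold, so each one is predicted exactly once by a model that never saw it during training.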
What you were trying to do, namely using the exact same data for training and testing, is a bad idea, because the measured performance you end up with is far too optimistic. You'll get very impressive figures like 98% accuracy during testing, but as soon as you use the model on new, unseen data, your accuracy might drop to a much worse level.
I'm using Weka to do some text mining and I'm a little confused, so I'm here to ask: given a set of comments that are in some way classified as notes, status of work, not conformity, or warning, how can I predict whether a new comment belongs to a specific class? I preprocessed all 9551 comments with the "StringToWordVector" filter to obtain a vector of tokens, and then used simple k-means to obtain a number of clusters.
So the question is: if a user posts a new comment, can I predict from this data whether it belongs to a category of comment?
Sorry if my question is a little confusing, but so am I.
Thank you
Trivial Training-validation-test
Create two datasets from your labelled instances. One will be training set and the other will be validation set. The training set will contain about 60% of the labelled data and the validation will contain 40% of the labelled data. There is no hard and fast rule for this split, but a 60-40 split is a good choice.
Use K-means (or any other clustering algorithm) on your training data. Develop a model. Record the model's error on training set. If the error is low and acceptable, you are fine. Save the model.
For now, your validation set will be your test dataset. Apply the model you saved on your validation set. Record the error. What is the difference between training error and validation error? If they both are low, the model's generalization is "seemingly" good.
Prepare a test dataset that has all the features of your training and validation datasets but where the class/cluster is unknown.
Apply the model on the test data.
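The split and the error bookkeeping from the steps above might look like this in Python. It is a sketch under assumptions: the labelled data is made up, train_validation_split and error_rate are hypothetical helpers, and predict stands for whatever model you trained.

```python
import random

def train_validation_split(instances, train_fraction=0.6, seed=42):
    """Shuffle a copy of the instances and cut it at train_fraction."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]

def error_rate(predict, labelled):
    """Fraction of (features, label) pairs that predict gets wrong."""
    wrong = sum(1 for features, label in labelled if predict(features) != label)
    return wrong / len(labelled)

# Hypothetical labelled data: (comment text, class) pairs.
labelled = [(f"comment {i}", "warning" if i % 2 else "notes") for i in range(10)]

train_set, validation_set = train_validation_split(labelled)
```

Calling error_rate with the same model on train_set and then on validation_set gives the two figures to compare.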
10-fold cross validation
Use all of your labelled data instances for this task.
Apply K-means (or any other algorithm of your choice) with a 10-fold CV setup.
Record the training error and CV error. Are they low? Is the difference between the errors low? If yes, then save the model and apply it on the test data whose class/cluster is unknown.
NB: The training/test/validation errors and their differences will give you a "very initial" idea of whether your model overfits or underfits. They are sanity tests. You need to perform other tests, such as learning curves, to see whether your model overfits, underfits, or fits well. If there appears to be an overfitting or underfitting problem, you need to try different techniques to overcome it.
I'm using Naive Bayes and StringToWordVector to classify some text.
I'm also using TF-IDF to score words.
Is there a simple way to increase the score of some words (chosen by myself) during training, to increase the weight of these words in the model for a given class?
So if these words are present in a new document, the classifier would know there is a higher chance that the document belongs to this class.
Thanks!
You want to increase the probability that documents containing certain words will be classified as a certain kind of document.
What you can do is simply train your classifier with "hand made" documents that contain exactly these words, and mark these documents as belonging to the specific class.
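To see why this works, here is a small Python sketch of the word likelihood in a multinomial Naive Bayes model with Laplace smoothing. The documents, the class names, and the helper word_log_prob are hypothetical, and the sketch ignores TF-IDF weighting for simplicity. It only shows that adding hand-made documents raises P(word | class) for the chosen word.

```python
import math
from collections import Counter

def word_log_prob(word, class_docs, vocabulary):
    """Laplace-smoothed log P(word | class) for a multinomial Naive Bayes model."""
    counts = Counter(tok for doc in class_docs for tok in doc.split())
    total = sum(counts.values())
    return math.log((counts[word] + 1) / (total + len(vocabulary)))

spam_docs = ["cheap pills online", "win money now"]
ham_docs = ["meeting at noon", "project status update"]

# Hypothetical hand-made documents containing only the word to emphasize.
handmade_docs = ["refund", "refund refund"]

vocabulary = {tok for doc in spam_docs + ham_docs + handmade_docs
              for tok in doc.split()}

# Likelihood of "refund" under the spam class, before and after adding
# the hand-made documents to that class's training data.
before = word_log_prob("refund", spam_docs, vocabulary)
after = word_log_prob("refund", spam_docs + handmade_docs, vocabulary)
```

Keep in mind that the extra documents also raise the prior of that class and slightly lower the likelihood of its other words, so add them sparingly.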