How can I analyze a nonstructured text? - data-mining

I use TF-IDF to affect weight that can help me to construct my dictionary. but my model is not really good enough because I have unstructured text.
Any suggestions about TF-IDF similar algorithms?

When you say, your model is not good enough, does it mean that your generated dictionary is not good enough? Extracting key terms and constructing the dictionary using TF-IDF weight is actually feature selection step.
To extract or select features for your model, you can follow other approaches like principle component analysis, latent semantic analysis etc. Lot of other feature selection techniques in machine learning can be useful too!
But I truly believe for sentiment classification task, TF-IDF should be a very good approach to construct the dictionary. I rather suggest you to tune your model parameters when you are training it rather than blaming the feature selection approach.
There are many deep learning techniques as well that are applicable for your target task.

Related

Sentiment analysis feature extraction

I am new to NLP and feature extraction, i wish to create a machine learning model that can determine the sentiment of stock related social media posts. For feature extraction of my dataset I have opted to use Word2Vec. My question is:
Is it important to train my word2vec model on a corpus of stock related social media posts - the datasets that are available for this are not very large. Should I just use a much larger pretrained word vector ?
The only way to to tell what will work better for your goals, within your constraints of data/resources/time, is to try alternate approaches & compare the results on a repeatable quantititave evaluation.
Having training texts that are properly representative of your domain-of-interest can be quite important. You may need your representation of the word 'interest', for example, to represent that of stock/financial world, rather than the more general sense of the word.
But quantity of data is also quite important. With smaller datasets, none of your words may get great vectors, and words important to evaluating new posts may be missing or of very-poor quality. In some cases taking some pretrained set-of-vectors, with its larger vocabulary & sharper (but slightly-mismatched to domain) word-senses may be a net help.
Because these pull in different directions, there's no general answer. It will depend on your data, goals, limits, & skills. Only trying a range of alternative approaches, and comparing them, will tell you what should be done for your situation.
As this iterative, comparative experimental pattern repeats endlessly as your projects & knowledge grow – it's what the experts do! – it's also important to learn, & practice. There's no authority you can ask for any certain answer to many of these tradeoff questions.
Other observations on what you've said:
If you don't have a large dataset of posts, and well-labeled 'ground truth' for sentiment, your results may not be good. All these techniques benefit from larger training sets.
Sentiment analysis is often approached as a classification problem (assigning texts to bins of 'positive' or 'negative' sentiment, operhaps of multiple intensities) or a regression problem (assigning texts a value on numerical scale). There are many more-simple ways to create features for such processes that do not involve word2vec vectors – a somewhat more-advanced technique, which adds complexity. (In particular, word-vectors only give you features for individual words, not texts of many words, unless you add some other choices/steps.) If new to the sentiment-analysis domain, I would recommend against starting with word-vector features. Only consider adding them later, after you've achieved some initial baseline results without their extra complexity/choices. At that point, you'll also be able to tell if they're helping or not.

What is considered a good accuracy for trained Word2Vec on an analogy test?

After training Word2Vec, how high should the accuracy be during testing on analogies? What level of accuracy should be expected if it is trained well?
The analogy test is just a interesting automated way to evaluate models, or compare algorithms.
It might not be the best indicator of how well word-vectors will work for your own project-specific goals. (That is, a model which does better on word-analogies might be worse for whatever other info-retrieval, or classification, or other goal you're really pursuing.) So if at all possible, create an automated evaluation that's tuned to your own needs.
Note that the absolute analogy scores can also be quite sensitive to how you trim the vocabulary before training, or how you treat analogy-questions with out-of-vocabulary words, or whether you trim results at the end to just higher-frequency words. Certain choices for each of these may boost the supposed "correctness" of the simple analogy questions, but not improve the overall model for more realistic applications.
So there's no absolute accuracy rate on these simplistic questions that should be the target. Only relative rates are somewhat indicative - helping to show when more data, or tweaked training parameters, seem to improve the vectors. But even vectors with small apparent accuracies on generic analogies might be useful elsewhere.
All that said, you can review a demo notebook like the gensim "Comparison of FastText and Word2Vec" to see what sorts of accuracies on the Google word2vec.c `questions-words.txt' analogy set (40-60%) are achieved under some simple defaults and relatively small training sets (100MB-1GB).

Choosing the best subset of features

I want to choose the best feature subset available that distinguish two classes to be fed into a statistical framework that I have built, where features are not independent.
After looking at the feature selection methods on machine learning it seems that it fall into three different categories: Filter,wrapper and Embedded methods. and the filter methods can be either: univariate or multivariate. It does make sense to use either Filter(multivariate) or wrapper methods because both -as I understood- looks for the best subset, however, as I am not using a classifier how can use it ?
Does it make sense to apply such methods (e.g. Recursive feature
elimination ) to DT or Random Forest classifier where the features
have rules there, and then take the resulted best subset and fed it
into my framework ?**
Also, as most of the algorithms provided by Scikit-learn are
univariate algorithms, Is there any other python-based libraries
that provide more subset feature selection algorithms ?
I think the statement that "most of the algorithms provided by Scikit-learn are univariate algorithms" is false. Scikit-learn handles multi-dimensional data very nicely. The RandomForestClassifier that they provide will give you an estimate of feature importance.
Another way to estimate feature importance is to choose any classifier that you like, train it and estimate performance on a validation set. Record the accuracy and this will be your baseline. Then take that same train/validation split and randomly permute all values along one feature dimension. Then train and validate again. Record the difference of this accuracy from your baseline. Repeat this for all feature dimensions. The results will be a list of numbers for each feature dimension that indicates its importance.
You can extend this to use pairs, or triples of feature dimensions, but the computational cost will grow quickly. If you're features are highly correlated you may benefit from doing this for at least the pairwise case.
Here is the source document of where I learned that trick: http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#varimp
(It should work for classifiers other than Random Forests.)

How to evaluate my own text classifier

I have written my own text classifier, based on some linguistic theory. Final outcome of the classifier is a tuple pair of an article title and the binary category.
I also used the NB classifier on my Golden standard corpus and evaluated its performance with CV, using Sci-kit learn library in Python. However, I am struggling to figure out how to evaluate performance of my own classifier. :S
I would really appreciate your ideas, since I am not experienced machine learner.
Thanks,
Guzdeh
To evaluate a classifier, the most common metric is accuracy, but there is no rule of thumb for all possible scenarios, so I would suggest that you read a bit about evaluation metric for classifiers. Also read about evaluation methodology.
If you are out of time, stick to accuracy and cross validation for now, but be sure to understand what a given metric means, what your methodology means, how to read a confusion matrix, each metric and methodology pros and cons, and specially its limitations.
Scikit Learn's Reference Page for its metrics: Link
Scikit Learn's User Guide for cross-validation: Link
You stated you have your golden standard. You said you have your model. You then only need to choose a metric and an evaluation methodology.
Your model will predict a class/target given an input (a set of features). The prediction will then be compared to your ground truth/golden standard.

dimension reduction in spam filtering

I'm performing an experiment in which I need to compare classification performance of several classification algorithms for spam filtering, viz. Naive Bayes, SVM, J48, k-NN, RandomForests, etc. I'm using the WEKA data mining tool. While going through the literature I came to know about various dimension reduction methods which can be broadly classified into two types-
Feature Reduction: Principal Component Analysis, Latent Semantic Analysis, etc.
Feature Selection: Chi-Square, InfoGain, GainRatio, etc.
I have also read this tutorial of WEKA by Jose Maria in his blog: http://jmgomezhidalgo.blogspot.com.es/2013/02/text-mining-in-weka-revisited-selecting.html
In this blog he writes, "A typical text classification problem in which dimensionality reduction can be a big mistake is spam filtering". So, now I'm confused whether dimensionality reduction is of any use in case of spam filtering or not?
Further, I have also read in the literature about Document Frequency and TF-IDF as being one of feature reduction techniques. But I'm not sure how does it work and come into play during classification.
I know how to use weka, chain filters and classifiers, etc. The problem I'm facing is since I don't have enough idea about feature selection/reduction (including TF-IDF) I am unable to decide how and what feature selection techniques and classification algorithms I should combine to make my study meaningful. I also have no idea about optimal threshold value that I should use with chi-square, info gain, etc.
In StringToWordVector class, I have an option of IDFTransform, so does it makes sence to set it to TRUE and also use a feature selection technique, say InfoGain?
Please guide me and if possible please provide links to resources where I can learn about dimension reduction in detail and can plan my experiment meaningfully!
Well, Naive Bayes seems to work best for spam filtering, and it doesn't play nicely with dimensionality reduction.
Many dimensionality reduction methods try to identify the features of the highest variance. This of course won't help a lot with spam detection, you want discriminative features.
Plus, there is not only one type of spam, but many. Which is likely why naive Bayes works better than many other methods that assume there is only one type of spam.