python k-means clustering text - python-2.7

I am trying to find an example to assist me to cluster some textual data I have. The data is in the form:
A,B,3
C,D,5
A,D,57
The first two entries are the members of a pair; the number is how often that pair occurs in the dataset. I have over 200,000 unique pairs.
Any tips? Thanks!!

Don't use k-means on such data.
It will not work.
What you have is a similarity matrix, not continuous vectors as needed for k-means. You can try hierarchical clustering (with a sparse similarity, not a distance; no, I won't write the code for you).
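For what it's worth, a minimal sketch of that direction (mine, not the answerer's, who deliberately gives no code): read the "item,item,count" rows, build a co-occurrence similarity matrix, convert it into a rough dissimilarity, and run agglomerative clustering on it. The file name pairs.csv, the count-to-distance conversion, and n_clusters=10 are illustrative assumptions, and a dense matrix like this will not scale to the full 200,000-pair dataset; it only shows the shape of the approach.

```python
import csv
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Read the "item,item,count" rows (pairs.csv is an assumed file name).
pairs = []
with open("pairs.csv") as f:
    for a, b, count in csv.reader(f):
        pairs.append((a, b, int(count)))

items = sorted({x for a, b, _ in pairs for x in (a, b)})
index = {item: i for i, item in enumerate(items)}

# Co-occurrence counts as a similarity matrix.
sim = np.zeros((len(items), len(items)))
for a, b, count in pairs:
    sim[index[a], index[b]] = sim[index[b], index[a]] = count

# Crude conversion to a dissimilarity: frequently co-occurring items become "close".
dist = sim.max() - sim
np.fill_diagonal(dist, 0.0)

# Older scikit-learn versions call the `metric` parameter `affinity`.
model = AgglomerativeClustering(n_clusters=10, metric="precomputed", linkage="average")
labels = model.fit_predict(dist)
print(dict(zip(items, labels)))
```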


Classify K-means in Text Mining

The goal is to create a computer-generated news site that aggregates headlines from different news sources around the world:
Looking at the centroid table results, I want to understand the following:
https://ibb.co/n1mvnbk
I used K=5
and I am using TF-IDF
What do those numbers mean?
When an attribute is zero in multiple clusters, what does that mean?
When I sort the centroid table by each cluster in descending order, I find some words or attributes that have a higher value in this cluster but zero values in the other clusters. Does this mean that these words occur more or less frequently in this cluster?
How can I discuss the clustering model?
Do all the clusters make sense, and why?
Do you think k=5 is a good choice for this dataset, or do I need to choose 3? How can I decide that?
K=5 denotes the number of clusters you are asking for in the current dataset. On that basis, 5 centroids will be placed and the data will be grouped around them.
Do you think k=5 is a good choice for this dataset? That is hard to tell up front; the grouping is driven purely by the distance computations, so it comes down to experimentation.
You might use the Elbow Method to identify a suitable number of clusters for a given dataset. It is based on WCSS (Within-Cluster Sum of Squares), the total squared distance between each point and the centroid of its cluster.
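A small sketch of that elbow heuristic (not part of the original answer): scikit-learn's KMeans stores the WCSS in its inertia_ attribute, so you can plot it over a range of k and look for the bend. X is assumed to be the TF-IDF matrix built elsewhere.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

ks = range(2, 11)
wcss = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)              # within-cluster sum of squares for this k

plt.plot(list(ks), wcss, marker="o")      # the "elbow" is where the curve flattens out
plt.xlabel("k")
plt.ylabel("WCSS")
plt.show()
```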
Those numbers are the average tf-idf values of the documents in each cluster. A 0 means the word does not occur in that cluster at all, and the highest-valued words are the most characteristic words of the cluster.
Note that for text you'll want to use spherical k-means rather than regular k-means.
Choosing k is a big problem. Forget the elbow method; it never works except on toy examples. Experiment with different values of k and choose the one that is most convincing or most useful. None of the usual heuristics for choosing k in k-means will work here, I fear (VRC is IMHO the best of them). The main reason is that the data cannot be partitioned well into k clusters. There is no reason to assume there are exactly k topics in the world, nor that every document contains only one topic. Instead, topics form a complex structure themselves. For example there is Trump, but there is also the Trump-Erdogan meeting, and there is the impeachment. These are not disjoint. And you will also have articles that don't fit into any of these topics. The effect is that the true best k would likely be very large, possibly as large as the number of articles (and hence not useful).
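For illustration, a rough sketch of the "average TF-IDF per centroid" reading above (my own code, under assumptions): TfidfVectorizer L2-normalizes rows by default, so plain KMeans on those vectors behaves roughly like spherical k-means, and sorting each centroid gives the most characteristic terms per cluster. docs is an assumed list of raw article texts.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

vec = TfidfVectorizer(stop_words="english")   # rows come out L2-normalized by default
X = vec.fit_transform(docs)                   # docs: assumed list of article texts

km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)

# get_feature_names_out() is get_feature_names() in older scikit-learn versions.
terms = np.array(vec.get_feature_names_out())
for i, centroid in enumerate(km.cluster_centers_):
    top = terms[np.argsort(centroid)[::-1][:10]]   # highest average TF-IDF in cluster i
    print("cluster", i, ":", ", ".join(top))
```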

How does Weka calculate the output predictions in J48 and other classifier?

I have used the output predictions of the J48 classifier in Weka and got results with prediction probabilities. As I need to use these numbers in my research, I need to know how Weka calculates them. What is the formula? Is it specific to each classifier?
In addition to Jan Eglinger's answer:
The J48 classifier is Weka's implementation of the well-known C4.5 decision tree classifier, a classification algorithm, based on ID3, that builds its tree using information entropy.
The training data is a set S = {s_1, s_2, ...} of already classified samples. Each sample s_i consists of a p-dimensional vector (x_{1,i}, x_{2,i}, ..., x_{p,i}), where the x_j represent attribute values or features of the sample, as well as the class in which s_i falls.
At each node of the tree, C4.5 chooses the attribute of the data that most effectively splits its set of samples into subsets enriched in one class or the other. The splitting criterion is the normalized information gain (difference in entropy). The attribute with the highest normalized information gain is chosen to make the decision. The C4.5 algorithm then recurs on the smaller sublists.
This algorithm has a few base cases.
- All the samples in the list belong to the same class. When this happens, C4.5 simply creates a leaf node for the decision tree saying to choose that class.
- None of the features provide any information gain. In this case, C4.5 creates a decision node higher up the tree using the expected value of the class.
- An instance of a previously unseen class is encountered. Again, C4.5 creates a decision node higher up the tree using the expected value.
You can find the information gain and entropy calculations in the Weka API packages. To follow them, you need to debug the Weka Java API and step through each stage.
In general, if you don't want to worry about how the algorithm works internally in terms of the underlying mathematics, try calculating information gain and entropy yourself and explaining them in your research; apart from decision trees, there are standard methods for computing both of these values.
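As a plain-Python illustration of those two quantities (independent of Weka's Java implementation), a minimal sketch; the toy outlook/play data at the end is hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(column, labels):
    """Entropy of the labels minus the weighted entropy after splitting on `column`."""
    n = len(labels)
    gain = entropy(labels)
    for value in set(column):
        subset = [lab for val, lab in zip(column, labels) if val == value]
        gain -= (len(subset) / n) * entropy(subset)
    return gain

# Hypothetical toy data, just to show the call.
outlook = ["sunny", "sunny", "overcast", "rain", "rain"]
play    = ["no",    "no",    "yes",      "yes",  "no"]
print(information_gain(outlook, play))
```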
What is the formula?
Weka's J48 classifier is an implementation of the C4.5 algorithm.
I need to know how Weka calculates these numbers.
You can find implementation details in J48.java and in the weka.classifiers.trees.j48 package.

SGDClassifier with HashingVectorizer and TfidfTransformer

I would like to understand if it is possible to train an online SGDClassifier (with partial_fit) using HashingVectorizer and TfidfTransformer. Simply joining them in a Pipeline will not work as TfidfTransformer is stateful so that would break the online learning process. This post says it's not possible to use tf-idf in an online fashion but a comment on this post suggests that it may somehow be possible: "In particular if you use stateful transformers as TfidfTransformer you will need to do several passes on your data". Is that possible without loading the whole training set into memory? If so, how? If not, is there an alternative solution to combine HashingVectorizer with tf-idf on large datasets?
Is that possible without loading the whole training set into memory?
No. TfidfTransformer needs to have the entire X matrix in memory. You'll need to roll your own tf-idf estimator, use that to compute per-term document frequencies in one pass over the data, then do another pass to produce tf-idf features and fit a classifier to them.
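A minimal two-pass sketch of that idea (my own, under assumptions): the first pass over the stream accumulates per-feature document frequencies from the hashed counts, the second pass applies a hand-rolled idf weighting and feeds SGDClassifier.partial_fit. iter_text_chunks() and iter_labeled_chunks() are hypothetical generators that stream the corpus from disk in batches.

```python
import numpy as np
from scipy.sparse import diags
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

n_features = 2 ** 20
# alternate_sign=False keeps the hashed counts non-negative
# (older scikit-learn versions call this option non_negative=True).
vectorizer = HashingVectorizer(n_features=n_features, alternate_sign=False, norm=None)

# Pass 1: document frequency of every hashed feature.
df = np.zeros(n_features)
n_docs = 0
for texts in iter_text_chunks():              # hypothetical generator of text batches
    X = vectorizer.transform(texts)           # sparse term counts, one row per document
    df += (X > 0).sum(axis=0).A1              # number of documents containing each feature
    n_docs += X.shape[0]

# Smoothed idf, mirroring TfidfTransformer's defaults.
idf = np.log((1.0 + n_docs) / (1.0 + df)) + 1.0
idf_diag = diags(idf)

# Pass 2: weight the counts by idf and train incrementally.
clf = SGDClassifier()
classes = np.array([0, 1])                    # all labels must be declared up front
for texts, y in iter_labeled_chunks():        # hypothetical generator of (texts, labels)
    X = vectorizer.transform(texts) @ idf_diag
    clf.partial_fit(X, y, classes=classes)
```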

Handle multi-label dataset in classification using j48 tree

I'm trying to use a J48 tree to perform a text categorization task. I have read a lot of papers and websites that explain how to use classification with datasets whose instances are single-labeled.
In my case I only have multi-labeled data in my training set. What can I do to handle these data with a single decision tree? Or is the only solution to generate as many trees as there are labels?
You can use a tree with an adapted entropy formula. You must define beforehand whether your dataset has hierarchical labels:
papers and code
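This is not the adapted-entropy tree from the linked material; as a sketch of the other route the question mentions (one tree per label, i.e. binary relevance), here is how it might look with scikit-learn decision trees instead of J48. X is an assumed feature matrix and Y an assumed 0/1 indicator matrix with one column per label.

```python
from sklearn.multioutput import MultiOutputClassifier
from sklearn.tree import DecisionTreeClassifier

# One independent decision tree is fitted per label column of Y.
clf = MultiOutputClassifier(DecisionTreeClassifier())
clf.fit(X, Y)
predicted = clf.predict(X)    # same shape as Y: one 0/1 prediction per label
```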

Regression Tree Forest in Weka

I'm using Weka and would like to perform regression with random forests. Specifically, I have a dataset:
Feature1,Feature2,...,FeatureN,Class
1.0,X,...,1.4,Good
1.2,Y,...,1.5,Good
1.2,F,...,1.6,Bad
1.1,R,...,1.5,Great
0.9,J,...,1.1,Horrible
0.5,K,...,1.5,Terrific
.
.
.
Rather than learning to predict the most likely class, I want to learn the probability distribution over the classes for a given feature vector. My intuition is that using just the RandomForest model in Weka would not be appropriate, since it would be attempting to minimize its absolute error (maximum likelihood) rather than its squared error (conditional probability distribution). Is that intuition right? Is there a better model to be using if I want to perform regression rather than classification?
Edit: I'm actually thinking now that in fact it may not be a problem. Presumably, classifiers are learning the conditional probability P(Class | Feature1,...,FeatureN) and the resulting classification is just finding the c in Class that maximizes that probability distribution. Therefore, a RandomForest classifier should be able to give me the conditional probability distribution. I just had to think about it some more. If that's wrong, please correct me.
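To make that reasoning concrete, a tiny sketch in scikit-learn terms (the question itself is about Weka, where the analogous call is distributionForInstance): a random forest classifier exposes its estimated conditional distribution via predict_proba. X and y are an assumed feature matrix and class-label vector.

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100).fit(X, y)
print(rf.classes_)               # column order of the probabilities below
print(rf.predict_proba(X[:1]))   # estimated P(class | features) for the first row
```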
If you want to predict the probabilities for each class explicitly, you need different input data. That is, you would need to replace the value to predict. Instead of one data set with the class label, you would need n data sets (for n different labels) with aggregated data for each unique feature vector. Your data would look something like
Feature1,...,Good
1.0,...,0.5
0.3,...,1.0
and
Feature1,...,Bad
1.0,...,0.8
0.3,...,0.1
and so on. You would need to learn one model for each class and run them separately on any data to be classified. That is, for each label you learn a model to predict a number that is the probability of being in that class, given a feature vector.
If you don't need the probabilities to be predicted explicitly, have a look at the Bayesian classifiers in Weka, which make use of probabilities in the models that they learn.
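A rough scikit-learn analogue of that per-class setup (the answer above describes a Weka workflow): one regression forest per class, each trained on an aggregated target column holding the observed frequency of that class for a given feature vector. X_agg, freq (a dict mapping class name to its frequency column), and X_new are assumed to come from a prior aggregation step.

```python
from sklearn.ensemble import RandomForestRegressor

models = {}
for label, target in freq.items():          # e.g. {"Good": [...], "Bad": [...], ...}
    models[label] = RandomForestRegressor(n_estimators=100).fit(X_agg, target)

# Per-class scores for new rows; renormalize if the values should sum to 1.
scores = {label: model.predict(X_new) for label, model in models.items()}
```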