How to interpret the output of supervised discretization in Weka?

The dataset is weather.numeric.arff.
Where are the intervals from the discretization, and why does the output just say 'all'?

I do not know why you get this result. I tried other settings with the supervised Discretize filter with the same outcome.
I thought it might be because the data set is small, but I triplicated it (so there are 42 instances, not 14) and the same thing happened.
But if I use the unsupervised Discretize filter, it works.
That line is:
Discretize -B 10 -M -1.0 -R first-last
This gives 4 intervals for the 14 cases.
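For comparison, here is a rough scikit-learn analogue of that unsupervised equal-width binning (my own sketch, not Weka itself; the temperature values are copied from memory from the standard weather.numeric data, so double-check them against your ARFF file):

# Rough analogue of Weka's unsupervised "Discretize -B 10": equal-width binning.
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Temperature column of weather.numeric (14 instances), copied from memory.
temperature = np.array([[85], [80], [83], [70], [68], [65], [64],
                        [72], [69], [75], [75], [72], [81], [71]])

disc = KBinsDiscretizer(n_bins=10, encode="ordinal", strategy="uniform")
bins = disc.fit_transform(temperature).ravel()
print(bins)                # bin index for each instance
print(disc.bin_edges_[0])  # the interval boundaries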

Related

Train program to understand high and low value in machine learning

I am generating alerts by reading a dataset of KPIs (key performance indicators). My algorithm looks at historical data and, based on that, can detect a sudden spike in the data. But I am generating false alarms. For example, KPI1 is historically at .5 but reaches a value of 12, which is a kind of spike.
In the same way, KPI2 also goes from .5 to 12. But I know that KPI1 going from .5 to 12 is not a big deal and I need not capture it, whereas KPI2 going from .5 to 12 is a big deal and I do need to capture it.
I want to train my program to understand what a high, low, or normal value is for each KPI.
Could you tell me which ML algorithm is best for this, and which Python package I should explore?
This is a classification problem. You can use the classic logistic regression algorithm to classify any given sample as a high, low, or normal value.
Quoting from Wikipedia:
In statistics, multinomial logistic regression is a classification
method that generalizes logistic regression to multiclass problems,
i.e. with more than two possible discrete outcomes. That is, it is
a model that is used to predict the probabilities of the different
possible outcomes of a categorically distributed dependent variable,
given a set of independent variables (which may be real-valued,
binary-valued, categorical-valued, etc.)
To perform multi-class classification in Python, the scikit-learn library can be useful:
http://scikit-learn.org/stable/modules/multiclass.html
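As a minimal sketch (my addition, assuming you can hand-label some historical readings per KPI as low/normal/high; the feature layout below is purely illustrative):

# Multinomial logistic regression with scikit-learn on hand-labelled KPI readings.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: [kpi_id, observed_value] -> label.
X_train = np.array([
    [1, 0.1], [1, 0.5], [1, 12.0],   # for KPI1, 12 is still "normal"
    [2, 0.1], [2, 0.5], [2, 12.0],   # for KPI2, 12 is "high"
])
y_train = ["low", "normal", "normal", "low", "normal", "high"]

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# Predict the class of a new reading for KPI2.
print(clf.predict(np.array([[2, 11.0]])))

In practice you would probably train one model per KPI (or engineer per-KPI features), since a single linear model over a raw KPI id cannot learn separate thresholds for each KPI very well.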

Imbalance between errors in data summary and tree visualization in Weka

I tried to run a simple classification on the iris.arff dataset in Weka, using the J48 algorithm. I used cross-validation with 10 folds and - if I'm not wrong - all the default settings for J48.
The result is a 96% accuracy with 6 incorrectly classified instances.
Here's my question: according to this, the second number in the tree visualization is the number of wrongly classified instances in each leaf, but then why is their sum 3 and not 6?
EDIT: running the algorithm with different test options I obtain different results in terms of accuracy (and therefore number of errors), but when I visualize the tree I always get the same tree with the same 3 errors. I still can't explain why.
The second number in the tree visualization is not the number of the wrongly classified instances in each leaf - it's the total weight of those wrongly classified instances.
Did you, by any chance, weigh some of those instances with 0.5 instead of 1?
Another option is that you are actually looking at two different models: one where the full training set is used to build the classifier (classifier.buildClassifier(instances)), and another where you run cross-validation (eval.crossValidateModel(...)) with 10 train/test folds. The first model produces the visualised tree with fewer errors (it sees the larger training set), while the second model from CV produces the output statistics with more errors. This would explain why you get different stats when changing the test options but still the same tree, which is built on the full set.
For the record: if you train (and visualise) the tree on the full dataset, it will appear to make fewer errors, but the model will actually be overfitted and the obtained performance measures will probably not be realistic. As such, your results from CV are much more useful, and you should visualise the tree from that model.
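The same effect is easy to reproduce outside Weka. Here is a small Python/scikit-learn illustration (my own analogue, not the J48 setup itself): the tree fitted on the full dataset makes fewer resubstitution errors than the error count reported by 10-fold cross-validation.

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(random_state=0)

# Model 1: fit on the full training set, count errors on that same set.
full_fit_errors = (tree.fit(X, y).predict(X) != y).sum()

# Model 2: 10-fold cross-validation, count errors on the held-out folds.
cv_errors = (cross_val_predict(tree, X, y, cv=10) != y).sum()

print("errors on full training set:", full_fit_errors)
print("errors under 10-fold CV:", cv_errors)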

Python K-Means Z-Transform

I want to use k-means to cluster my results, but I have a lot of questions.
http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans
My input data looks like this:
ID ABC XYZ UVW MSE
10 A X U 102000
12 B Y V 9000
Is it possible to cluster different types of input data with k-means, like the characters and numbers in my case?
K-means chooses random centres for the clustering process. If I run the clustering repeatedly, will my results change, or is the output stable?
I want to know which ID is in which cluster. How do I get this information out of the software?
EDIT:
If I cluster only on MSE and afterwards check which attributes are affected, is that a solution that makes sense?
K-means tries to minimize variance (=squared errors).
What would the squared error of abc and def be?
Only use it for continuous data. And don't expect it to do magic; what you get is usually only a very bad approximation of what you'd be looking for. Running it multiple times will usually give you different results because there is no single 'good' result.
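To make those points concrete, here is a minimal scikit-learn sketch (my own illustration with made-up data shaped like the table in the question): only the continuous MSE column is clustered, a fixed random_state makes runs repeatable, and labels_ tells you which ID landed in which cluster.

import numpy as np
from sklearn.cluster import KMeans

ids = np.array([10, 12, 17, 21])                             # hypothetical IDs
mse = np.array([[102000.0], [9000.0], [98000.0], [8500.0]])  # continuous feature only

km = KMeans(n_clusters=2, random_state=0, n_init=10)
labels = km.fit_predict(mse)

for i, cluster in zip(ids, labels):
    print("ID", i, "-> cluster", cluster)

If you drop random_state, the assignments (and the cluster numbering) can change from run to run, which is the instability mentioned above.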

Finding a correlation between variable and class variable

I have a dataset which contains 7 numerical attributes and one nominal attribute, which is the class variable. I was wondering how I can find the attribute that is best at predicting the class attribute. Would finding the attribute with the largest information gain be the solution?
So the problem you are asking about falls under the domain of feature selection, and more broadly, feature engineering. There is a lot of literature online regarding this, and there are definitely a lot of blogs/tutorials/resources online for how to do this.
To give you a good link I just read through, here is a blog with a tutorial on some ways to do feature selection in Weka, and the same blog's general introduction on feature selection. Naturally there are a lot of different approaches, as knb's answer pointed out.
To give a short description, though, there are a few ways to go about it: you can assign a score to each of your features (like information gain) and filter out features with 'bad' scores; you can treat finding the best feature subset as a search problem, where you take different subsets of the features and assess the accuracy in turn; or you can use embedded methods, which learn which features contribute most to the accuracy as the model is being built. Examples of embedded methods are regularization algorithms like LASSO and ridge regression.
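For the first ('score each feature') approach, here is a short scikit-learn sketch (my own example on a built-in dataset; mutual information plays a role similar to information gain in Weka):

from sklearn.datasets import load_wine
from sklearn.feature_selection import mutual_info_classif

data = load_wine()
X, y = data.data, data.target

# Score each attribute against the class, highest first.
scores = mutual_info_classif(X, y, random_state=0)
for name, score in sorted(zip(data.feature_names, scores), key=lambda t: -t[1]):
    print(name, round(score, 3))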
Do you just want that attribute's name, or do you also want a quantifiable metric (like a t-value) for this "best" attribute?
For a qualitative approach, you can generate a classification tree with just one split, two leaves.
For example, with Weka's "diabetes.arff" sample dataset (n = 768), which has a similar structure to your dataset (all attributes numeric, but the class attribute has only two distinct categorical outcomes), I can set the minNumObj parameter to, say, 200. This means: create a tree with a minimum of 200 instances in each leaf.
java -cp $WEKA_JAR/weka.jar weka.classifiers.trees.J48 -C 0.25 -M 200 -t data/diabetes.arff
Output:
J48 pruned tree
------------------
plas <= 127: tested_negative (485.0/94.0)
plas > 127: tested_positive (283.0/109.0)
Number of Leaves : 2
Size of the tree : 3
Time taken to build model: 0.11 seconds
Time taken to test model on training data: 0.04 seconds
=== Error on training data ===
Correctly Classified Instances 565 73.5677 %
This creates a tree with one split on the "plas" attribute. For interpretation, this makes sense, because indeed, patients with diabetes have an elevated concentration of glucose in their blood plasma. So "plas" is the most important attribute, as it was chosen for the first split. But it does not tell you how important it is.
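A comparable "one split, two leaves" check can be done in Python with scikit-learn (my own analogue of the Weka command above, on a different but similarly structured binary dataset), using a depth-1 decision stump to see which attribute the first split picks:

from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_breast_cancer(return_X_y=True)
names = list(load_breast_cancer().feature_names)

# A decision stump: one split, two leaves.
stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)
print(export_text(stump, feature_names=names))  # the chosen attribute and its threshold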
For a more quantitative approach, maybe you can use (Multinomial) Logistic Regression. I'm not so familiar with this, but anyway:
In the Explorer GUI tool, choose "Classify" > Functions > Logistic.
Run the model. The odds ratio and the coefficients might contain what you need in a quantifiable manner. Lower odds-ratio (but > 0.5) is better/more significant, but I'm not sure. Maybe read on here, this answer by someone else.
java -cp $WEKA_JAR/weka.jar weka.classifiers.functions.Logistic -R 1.0E-8 -M -1 -t data/diabetes.arff
Here's the command line output
Options: -R 1.0E-8 -M -1
Logistic Regression with ridge parameter of 1.0E-8
Coefficients...
Class
Variable tested_negative
============================
preg -0.1232
plas -0.0352
pres 0.0133
skin -0.0006
insu 0.0012
mass -0.0897
pedi -0.9452
age -0.0149
Intercept 8.4047
Odds Ratios...
Class
Variable tested_negative
============================
preg 0.8841
plas 0.9654
pres 1.0134
skin 0.9994
insu 1.0012
mass 0.9142
pedi 0.3886
age 0.9852
=== Error on training data ===
Correctly Classified Instances 601 78.2552 %
Incorrectly Classified Instances 167 21.7448 %
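A quick way to see how the two tables relate: each odds ratio is just the exponential of the corresponding coefficient, e.g. exp(-0.1232) is about 0.8841 for preg. You can verify this yourself:

import numpy as np

# Coefficients from the Weka output above.
coefficients = {
    "preg": -0.1232, "plas": -0.0352, "pres": 0.0133, "skin": -0.0006,
    "insu": 0.0012, "mass": -0.0897, "pedi": -0.9452, "age": -0.0149,
}
for name, coef in coefficients.items():
    print(name, "odds ratio =", round(float(np.exp(coef)), 4))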

Meaning of correctly classified instances weka

I recently started using Weka and I'm trying to classify tweets as positive or negative using Naive Bayes. So I have a training set with tweets that I labelled myself and a test set of tweets that all have the label "positive". When I run Naive Bayes, I get the following results:
Correctly classified instances: 69 92%
Incorrectly classified instances: 6 8%
Then if I change the labels of the tweets in the test set to "negative" and ran again Naive Bayes, the results are inversed:
Correctly classified instances: 6 8%
Incorrectly classified instances: 69 92%
I thought that the correctly classified instances showed the accuracy of Naive Bayes and that it should be the same no matter what labels the tweets in the test set have. Is there something wrong with my data, or do I misunderstand the meaning of correctly classified instances?
Thanks a lot for your time,
Nantia
The labels on the test set are supposed to be the actual correct classification. Performance is computed by asking the classifier to give its best guess about the classification for each instance in the test set. Then the predicted classifications are compared to the actual classifications to determine accuracy. Therefore, if you flip the 'correct' values that you give it, the results will be flipped as well.
Based on your training set, the classifier labels 69 of your 75 test instances (92%) as positive. If the labels in the test set, i.e. the correct answers, say that they are all positive, then 92% of them are correct. If the test set (and thus the classifications) stays the same but you switch the correct answers, then of course the percentage correct will also be the opposite.
Keep in mind that in order to evaluate a classifier, you need the true labels of the test set. Otherwise you can't compare the classifier's answers with the true answers. It seems to me that you might have misunderstood this. You can obtain the labels for unseen data, if that is what you want, but in that case you can't evaluate classifier accuracy.
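A tiny illustration of that last point (my own sketch, not from the thread): accuracy compares the classifier's predictions against whatever labels you supply as ground truth, so flipping those labels flips the score.

from sklearn.metrics import accuracy_score

predictions = ["pos"] * 69 + ["neg"] * 6   # what the classifier predicted
all_positive = ["pos"] * 75                # test labels all set to "positive"
all_negative = ["neg"] * 75                # test labels all set to "negative"

print(accuracy_score(all_positive, predictions))  # 0.92
print(accuracy_score(all_negative, predictions))  # 0.08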