Consider the following example:
Train:
...
@attribute FOO {a,b}
...
Test:
...
@attribute FOO {a,b,c}
...
We have a model trained on a training set in the format above. It was built from a large amount of data, and we do not want to retrain it for the new test set because that would require considerable computational effort.
For this reason, we are wondering how to manipulate the new test set so that the additional value of the attribute is discarded while the rest of each entry is preserved. What would you do? I have also read about the question mark: if we replace each 'c' with a question mark and update the header by removing 'c' from the enumeration for the attribute FOO, it works, but I am not sure how this affects the outcome.
Thank you
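The question-mark approach described above can be sketched in Python as a small preprocessing step. This is only an illustration, not part of Weka: the function name mask_unseen_values is made up, and it assumes a dense ARFF file whose values contain no quoted commas. How a missing value then affects the prediction depends on the classifier.

```python
def mask_unseen_values(arff_text, attr_name, allowed):
    """Rewrite the header of attr_name and replace unseen values with '?'.

    Hypothetical helper: assumes dense rows and no commas inside values.
    """
    out = []
    attr_index = None  # column position of attr_name
    seen_attrs = 0
    in_data = False
    for line in arff_text.splitlines():
        stripped = line.strip()
        lower = stripped.lower()
        if not in_data:
            if lower.startswith("@attribute"):
                name = stripped.split()[1].strip("'\"")
                if name == attr_name:
                    attr_index = seen_attrs
                    # Restrict the enumeration to the labels the model knows.
                    line = "@attribute %s {%s}" % (attr_name, ",".join(allowed))
                seen_attrs += 1
            elif lower.startswith("@data"):
                if attr_index is None:
                    raise ValueError("attribute not found: " + attr_name)
                in_data = True
            out.append(line)
            continue
        if not stripped or stripped.startswith("%"):
            out.append(line)  # keep blank lines and comments untouched
            continue
        values = [v.strip() for v in stripped.split(",")]
        if values[attr_index] not in allowed:
            values[attr_index] = "?"  # ARFF marker for a missing value
        out.append(",".join(values))
    return "\n".join(out)
```

Calling mask_unseen_values(test_text, "FOO", ["a", "b"]) keeps every row and only blanks out the unknown 'c' entries.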
I want to calculate the confusion matrix, F1 score, ROC, etc., but the Weka output is showing this. How can I get the confusion matrix, F1 score, ROC, etc.?
First of all, your dataset seems to have a numeric class attribute. Correlation coefficient is a statistic generated for regression models. A confusion matrix (which you want) is only computed for classification models.
Secondly, you are using ZeroR as classifier, which is not a very useful classifier (only for determining a baseline). ZeroR either predicts the mean class value (numeric class attribute) or the majority class (nominal class attribute).
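To make the ZeroR behaviour described above concrete, here is a minimal Python sketch of the same idea. It is not Weka's implementation; the function name zero_r is made up for illustration.

```python
from collections import Counter

def zero_r(class_values):
    """Return the baseline prediction: mean for numeric, majority for nominal."""
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool)
        for v in class_values
    )
    if numeric:
        # Numeric class attribute: always predict the mean class value.
        return sum(class_values) / len(class_values)
    # Nominal class attribute: always predict the most frequent label.
    return Counter(class_values).most_common(1)[0][0]
```

Because the prediction ignores all other attributes, it is only useful as a baseline to compare real classifiers against.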
Solutions:
Ensure that you are using the right attribute for your class. Assuming that you are using the Weka Explorer, check that the combobox on the Classify panel has the right attribute selected. On the command line, use the -c flag to specify the index of the class attribute (1-based index; first and last can be used as well).
If you imported your data from a CSV file and the class attribute column contains only numeric values, then Weka will have left it as numeric (it doesn't know that this column represents a nominal attribute). In that case, make sure that you convert your class attribute to a nominal one, e.g., by using the NumericToNominal filter in the Preprocess panel.
Choose a different classifier, like RandomForest or J48, which tend to generate reasonable models with just the default parameters.
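Once the class attribute is nominal, the statistics asked about are defined purely in terms of actual and predicted labels. The following Python sketch shows how a confusion matrix and an F1 score are computed; it is independent of Weka, and the function names are illustrative.

```python
from collections import defaultdict

def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    matrix = defaultdict(int)
    for a, p in zip(actual, predicted):
        matrix[(a, p)] += 1
    return dict(matrix)

def f1_score(actual, predicted, positive):
    """Harmonic mean of precision and recall for the label `positive`."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

With a numeric class there are no discrete label pairs to count, which is why only regression statistics such as the correlation coefficient are reported in that case.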
I have created a new prediction model based on a dataset that was given to me. It predicts a nominal (binary) class attribute (positive/negative) based on a number of numerical attributes.
Now I have been asked to use this prediction model to predict classes for a new dataset. This dataset has all the same attributes except for the class column, which does not exist yet. How do I apply my model to this new data? I have tried adding an empty class column to my new dataset and then doing the following:
Simply loading the new dataset in WEKA's explorer and loading the model. It tells me there is no training data.
Opening my training set in WEKA's explorer and then opening my training model, then choosing my new data as a 'supplied test set'. It runs but does not output any predictions.
I should note that the model works fine when testing on the training data for cross validation. It also works fine with a subset of the training data I separated ages ago for test/eval use. I think it may be a problem with how I am adding a new class column, maybe?
For making predictions, Weka requires the two datasets (the training set and the one to predict on) to have exactly the same structure, down to the order of labels. That also means that you need to have a class attribute with the correct labels present. For the values of your class attribute, simply use the missing value (denoted by a question mark).
See the FAQ How do i make predictions with a trained model? on the Weka wiki for more information on how to make predictions.
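Adding the class attribute with missing values, as suggested above, could be scripted roughly as follows. This is a sketch under assumptions: the function name add_missing_class is made up, the class is the last attribute, and the rows are dense. The labels must be passed in the same order as in the training file's header.

```python
def add_missing_class(arff_text, class_name, labels):
    """Append a nominal class attribute whose value is '?' in every row."""
    out = []
    in_data = False
    for line in arff_text.splitlines():
        stripped = line.strip()
        if not in_data:
            if stripped.lower().startswith("@data"):
                # Declare the class just before the data section,
                # with labels in the same order as in the training set.
                out.append("@attribute %s {%s}" % (class_name, ",".join(labels)))
                in_data = True
            out.append(line)
        elif stripped and not stripped.startswith("%"):
            out.append(stripped + ",?")  # '?' marks the unknown class
        else:
            out.append(line)
    return "\n".join(out)
```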
I have read many solutions to this error, but my problem is different from the others. I have a "train" dataset (ARFF) and a "test" dataset (ARFF), and both have a string attribute "id". Everything works if I remove "id" from both files at the same time (if I don't remove "id" from "test", I get an error). What confuses me is that my friend can do it by removing "id" only from "train", so his output contains the "id".
(Since he didn't remove "id" from "test", the number of attributes is not the same, which contradicts what I read: that the number of attributes must be exactly the same.)
I really need an output that can contain the "id".
Maybe I did something wrong with the "remove"? I read somewhere that the test set may have more attributes than the training set. I also found a paragraph about how to remove the attribute:
"Instead of using a nominal ID attribute, declare it as STRING attribute. With this you don't have to declare each possible value like with NOMINAL attributes and it therefore doesn't matter what strings are used in the test set that you're trying to use the trained model on. In order to be able to work with this STRING ID attribute you have to use the FilteredClassifier in conjunction with the Remove filter (package weka.filters.unsupervised.attribute) and your original base classifier. This setup will remove the ID attribute for the learning process (i.e., the base classifier), but you'll still be able to use it outside for tracking instances."
http://weka.8497.n7.nabble.com/use-saved-model-td22857.html
Does anyone have an idea?
Any help will be appreciated.
My 2 ARFF files, left: train; right: test
Left: output of my friend with id such as test_subject1005; right: my output
Finally I found my solution. Just click "supplied test set" directly and, in the prompt that appears, click "Yes". That's all! (It seems that I did not see this prompt before, so I did not try it.)
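The FilteredClassifier idea from the quoted paragraph (hide the ID from the learner, keep it for tracking) can be illustrated conceptually in Python. This is not the Weka API: the class DropColumnClassifier, the function predict_with_ids, and the toy base classifier are all hypothetical.

```python
class DropColumnClassifier:
    """Wrap a base classifier and hide one column from it (hypothetical helper)."""

    def __init__(self, base_predict, drop_index):
        self.base_predict = base_predict  # function: feature list -> label
        self.drop_index = drop_index      # position of the ID column

    def predict(self, row):
        # The base classifier never sees the ID value.
        features = [v for i, v in enumerate(row) if i != self.drop_index]
        return self.base_predict(features)

def predict_with_ids(classifier, rows, id_index):
    """Pair each row's ID with its prediction so instances can be tracked."""
    return [(row[id_index], classifier.predict(row)) for row in rows]
```

Since the wrapper receives the full row, the ID stays available for the output even though it plays no part in the prediction.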
I don't understand what it means.
In a database, does a tuple mean a field value and an attribute mean a table field?
Am I correct?
And what is a class label in data mining?
Very short answer: class label is the discrete attribute whose value you want to predict based on the values of other attributes. (Do read the rest of the answer.)
The term class label is usually used in the context of supervised machine learning, and in classification in particular, where one is given a set of examples of the form (attribute values, classLabel) and the goal is to learn a rule that computes the label from the attribute values. The class label always takes on a finite (as opposed to infinite) number of different values.
For a concrete example, we might be given a set of adult people and we'd like to predict whether they're homeless or not. Suppose the attributes were highest educational level achieved and origin (examples are of the form (origin, educationalLevel; isHomeless)):
(Manhattan, PhD; no)
(Brooklyn, Primary school; yes)
...
In this particular case, isHomeless is the class label. The goal is to learn a function that computes whether a person with given attribute values is homeless or not. (More specifically, to learn a function that makes as few mistakes as possible under a certain quantification of the number of mistakes.)
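As a rough Python illustration of the example above, each training example can be written as a pair of attribute values and class label, and a candidate rule can be scored by counting its mistakes. The names count_mistakes and always_no are made up, and simply counting wrong labels is only one possible way of quantifying mistakes.

```python
# Each example: ((origin, educationalLevel), isHomeless)
examples = [
    (("Manhattan", "PhD"), "no"),
    (("Brooklyn", "Primary school"), "yes"),
]

def count_mistakes(rule, examples):
    """Number of examples for which the rule predicts the wrong class label."""
    return sum(1 for attributes, label in examples if rule(attributes) != label)

def always_no(attributes):
    """A trivial rule that ignores the attributes entirely."""
    return "no"
```

A learning algorithm searches for a rule that scores better on such a count than trivial rules like always_no.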
The Wikipedia article Supervised learning gives a good description.
Regarding the other question: no, a tuple means the whole set of attribute values in a given row. For example, if you had a table person(id, name, surname), then a tuple representing the first row could be (0, 'Akhil', 'Mohan').
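The distinction can also be shown with a Python namedtuple, used here purely as an analogy for the person table above: the field names play the role of attributes, and one instance plays the role of a tuple.

```python
from collections import namedtuple

# The attributes are the column names of the table.
Person = namedtuple("Person", ["id", "name", "surname"])

# A tuple is one whole row: a value for every attribute.
first_row = Person(0, "Akhil", "Mohan")
```

Here first_row.name is a single field value, whereas first_row as a whole is the tuple.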
Basically a class label (in classification) can be compared to a response variable (in regression): a value we want to predict in terms of other (independent) variables.
The difference is that a class label is usually a discrete/categorical variable (e.g., yes/no, 0/1, etc.), whereas a response variable is normally a continuous/real-number variable.
You can find more about regression and classification in relation to response variables and class labels at https://math.stackexchange.com/questions/141381/regression-vs-classification.
Take the example of an email spam filter: it classifies an email as spam or not spam, for which we define 2 classes, spam (class 1) and not spam (class 2). Both of these are class labels; in other words, if an email has certain attributes, then it belongs either to the spam class or to the not-spam class.
I have created an arff file for a data set that I would like to use in Weka. The file is formatted as a sparse arff file. Anyway, I have successfully loaded in the data. I then switch to the Association tab and set my parameters. However, the Start button won't become enabled, so I can't click it to start the association generation. Why is this? Has anyone run into this issue before and know how to solve it?
Here is a screenshot:
All variables must be nominal; the Apriori method does not work on numeric variables.
You may want to check the attribute types in your arff file. Weka is very particular about types when they are used for associations and even though it may let you set parameters, the routine will not run.
Try using an attribute declaration like the following:
@attribute "attr1" {t}
And define your rows as follows:
{1957 "t", 9163 "t", 10143 "t"}
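A file in this shape can be generated with a few lines of Python. The sketch below is an assumption-laden illustration: the function name sparse_arff, the relation name, and the attribute names attr0, attr1, ... are made up, and attribute indices in the sparse rows are taken to be 0-based.

```python
def sparse_arff(relation, num_attributes, transactions):
    """Build a sparse ARFF string with single-valued nominal attributes {t}."""
    lines = ["@relation %s" % relation]
    for i in range(num_attributes):
        # Every attribute is nominal with the single value t.
        lines.append('@attribute "attr%d" {t}' % i)
    lines.append("@data")
    for items in transactions:
        # List only the attributes that are present, in ascending index order.
        cells = ", ".join('%d "t"' % i for i in sorted(items))
        lines.append("{%s}" % cells)
    return "\n".join(lines)
```

Each transaction is given as a set of attribute indices, so a row such as {0, 2} becomes {0 "t", 2 "t"}.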