Attribute selection with FilteredClassifier for a saved model in Weka

I trained my model with a FilteredClassifier using attribute selection in Weka. Now I am unable to use the serialized model to classify test data; I searched a lot but really couldn't figure it out. This is what I am doing at the moment:
java -cp $CLASSPATH weka.filters.supervised.attribute.AddClassification \
-serialized Working.model \
-classification \
-remove-old-class \
-i full_data.arff \
-c last
It gives me an error saying
weka.core.WekaException: Training header of classifier and filter dataset don't match
But they aren't supposed to match, right? The test data shouldn't have the class in its header. How should I use it? Also, I hope the selected attributes are serialized and saved in the model, since the same attribute selection needs to be applied to the test data.
I would prefer not to use batch classification, since it defeats the purpose of saving the model and would require rerunning the whole training each time.

One easy way to get it to work is to add a nominal class attribute with dummy values to the ARFF file you created, and then remove it with the -remove-old-class option.
Your command stays the same, but this time your ARFF file will contain the class attribute.
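The dummy class column can also be added programmatically with Weka's Add filter. A minimal Java sketch, assuming an input file full_data_no_class.arff and nominal labels "yes,no" (both are placeholders; the labels must match whatever your training data used):

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSink;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Add;

public class AddDummyClass {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("full_data_no_class.arff");

        // Append a nominal class attribute; its name and labels must
        // match the class attribute the model was trained with.
        Add add = new Add();
        add.setAttributeName("class");
        add.setNominalLabels("yes,no");
        add.setAttributeIndex("last");
        add.setInputFormat(data);
        data = Filter.useFilter(data, add);
        // Values of the new attribute are missing ('?') by default,
        // which is exactly what AddClassification expects.

        DataSink.write("full_data.arff", data);
    }
}
```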

Related

Confusion matrix in Weka

I want to calculate the confusion matrix, F1 score, ROC, etc., but the Weka output only shows regression statistics such as the correlation coefficient. How can I get the confusion matrix, F1 score, ROC, etc.?
First of all, your dataset seems to have a numeric class attribute. The correlation coefficient is a statistic generated for regression models; a confusion matrix (which you want) is only computed for classification models.
Secondly, you are using ZeroR as classifier, which is not a very useful classifier (only for determining a baseline). ZeroR either predicts the mean class value (numeric class attribute) or the majority class (nominal class attribute).
Solutions:
Ensure that you are using the right attribute as your class. Assuming that you are using the Weka Explorer, check that the combobox on the Classify panel has the right attribute selected. On the command line, use the -c flag to specify the index of the class attribute (1-based; first and last can be used as well).
If you imported your data from a CSV file and the class attribute column contains only numeric values, then Weka will have left it as numeric (it doesn't know that this column represents a nominal attribute). In that case, make sure that you convert your class attribute to a nominal one, e.g., by using the NumericToNominal filter in the Preprocess panel.
Choose a different classifier, like RandomForest or J48, which tend to generate reasonable models with just the default parameters.
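The two fixes (convert the class to nominal, use a real classifier) can be sketched in Java as follows; dataset.arff is a placeholder, and the class is assumed to be the last attribute:

```java
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.NumericToNominal;

public class ConfusionMatrixDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("dataset.arff");

        // Convert the numeric class column to nominal so that Weka
        // treats the task as classification rather than regression.
        NumericToNominal n2n = new NumericToNominal();
        n2n.setAttributeIndices("last");
        n2n.setInputFormat(data);
        data = Filter.useFilter(data, n2n);
        data.setClassIndex(data.numAttributes() - 1);

        // 10-fold cross-validation with J48 instead of ZeroR.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, 10, new java.util.Random(1));
        System.out.println(eval.toMatrixString());        // confusion matrix
        System.out.println(eval.toClassDetailsString());  // F1, ROC area, etc.
    }
}
```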

How do I apply my model to a new dataset in WEKA?

I have created a new prediction model based on a dataset that was given to me. It predicts a nominal (binary) class attribute (positive/negative) based on a number of numerical attributes.
Now I have been asked to use this prediction model to predict classes for a new dataset. This dataset has all the same attributes except for the class column, which does not exist yet. How do I apply my model to this new data? I have tried adding an empty class column to my new dataset and then doing the following:
Simply loading the new dataset in WEKA's explorer and loading the model. It tells me there is no training data.
Opening my training set in WEKA's explorer and then opening my training model, then choosing my new data as a 'supplied test set'. It runs but does not output any predictions.
I should note that the model works fine when testing on the training data for cross validation. It also works fine with a subset of the training data I separated ages ago for test/eval use. I think it may be a problem with how I am adding a new class column, maybe?
For making predictions, Weka requires the two datasets (training and the one used for predictions) to have exactly the same structure, down to the order of labels. That also means you need to have a class attribute with the correct labels present; for its values, simply use the missing value (denoted by a question mark).
See the FAQ "How do I make predictions with a trained model?" on the Weka wiki for more information on how to make predictions.
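A minimal Java sketch of this workflow, assuming the model was saved as model.ser and the new data (with the class column present) is in new_data.arff; both filenames are placeholders:

```java
import weka.classifiers.Classifier;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.SerializationHelper;
import weka.core.converters.ConverterUtils.DataSource;

public class PredictNewData {
    public static void main(String[] args) throws Exception {
        Classifier model = (Classifier) SerializationHelper.read("model.ser");

        Instances newData = DataSource.read("new_data.arff");
        newData.setClassIndex(newData.numAttributes() - 1);

        for (int i = 0; i < newData.numInstances(); i++) {
            Instance inst = newData.instance(i);
            inst.setClassMissing();  // '?' -- the value we want predicted
            double pred = model.classifyInstance(inst);
            System.out.println(newData.classAttribute().value((int) pred));
        }
    }
}
```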

Alternatives to dynamically creating model fields

I'm trying to build a web application where users can upload a file (specifically in the MDF file format) and view the data in the form of various charts. The files can contain any number of time-based signals (of various numeric data types), and users may name the signals wildly.
My thought on saving the data involves 2 steps:
Maintain a master table as an index, to save such meta information as file names, who uploaded it, when, etc. Records (rows) are added each time a new file is uploaded.
Create a new table (I'll refer to this as data tables) for each file uploaded, within the table each column will be one signal (first column being timestamps).
This brings the problem that I can't pre-define the Model for the data tables because the number, name, and datatype of the fields will differ among virtually all uploaded files.
I'm aware of some libs that help to build runtime dynamic models but they're all dated and questions about them on SO basically get zero answers. So despite the effort to make it work, I'm not even sure my approach is the optimal way to do what I want to do.
I also came across this Postgres-specific model field, which can take nested arrays (which I believe fits the 2-D time-based signal lists). In theory I could parse the raw uploaded file, construct such an array, and save all the data in one field. But since I don't know the limit on the data size, this could also be a nightmare for queries later on: creating the charts usually requires only a few columns of signals at a time, out of a total of up to hundreds.
So my question is:
Is there a better way to organize the storage of data? And how?
Any insight is greatly appreciated!
If the name, number, and datatypes of the fields differ for each user, then you do not need an ORM. What you need is a query builder or SQL string composition, like Psycopg's. You will be programmatically creating a table for each combination of user and uploaded file (if they differ) and programmatically inserting the records.
Using PostgreSQL might be a good choice; you might also create a GIN index on the arrays to speed up queries.
However, if you are primarily working with time-series data, then using a time-series database like InfluxDB or Prometheus makes more sense.

Why does using "weka.filters.supervised.attribute.AttributeSelection" in Weka delete the "##class##" attribute?

I am using the Weka Explorer in a data mining project.
I have too many attributes and want to reduce them using the Ranker in weka.filters.supervised.attribute.AttributeSelection, but when I apply it, the "##Class##" attribute is deleted, and it must be used by the classifier in the next step!
Why is the "##Class##" attribute deleted?
How can I solve this problem?
Any suggestions?
Most likely you have not set the ##Class## attribute as the class when performing attribute selection. Ensure you set the appropriate class index, using -c on the command line or setClassIndex(int classIndex) in your Java code.
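A minimal Java sketch of attribute selection that keeps the class attribute; data.arff, the InfoGain evaluator, and keeping the 10 top-ranked attributes are all assumptions for illustration:

```java
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.attribute.AttributeSelection;

public class RankAttributes {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data.arff");
        data.setClassIndex(data.numAttributes() - 1); // crucial: mark the class

        AttributeSelection as = new AttributeSelection();
        as.setEvaluator(new InfoGainAttributeEval());
        Ranker ranker = new Ranker();
        ranker.setNumToSelect(10); // keep the 10 best-ranked attributes
        as.setSearch(ranker);
        as.setInputFormat(data);

        Instances reduced = Filter.useFilter(data, as);
        // The class attribute survives because it was flagged via setClassIndex.
        System.out.println(reduced.classAttribute().name());
    }
}
```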

Weka setDataset(): do I need to use a full dataset?

Do I need to use my full training dataset, or can I use a dataset containing only the attribute descriptions, built from an ARFF file with exactly the same attributes and, say, one instance?
I am using a classifier on an EC2 instance, so I don't want the entire dataset on the EC2 instance, as it is very large and growing.
Does Weka require the entire dataset, or only the description from the ARFF file?
The setDataset() method only takes the attribute descriptions for your Instance from the Instances object (your dataset) that you defined previously. Therefore it does not matter how large the dataset you refer to via setDataset() is.
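A minimal Java sketch of a header-only dataset (the attribute names and values here are made up for illustration):

```java
import java.util.ArrayList;

import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instance;
import weka.core.Instances;

public class HeaderOnlyExample {
    public static void main(String[] args) {
        // A header-only Instances object: attribute definitions, zero rows.
        ArrayList<Attribute> attrs = new ArrayList<>();
        attrs.add(new Attribute("temperature"));
        attrs.add(new Attribute("humidity"));
        ArrayList<String> labels = new ArrayList<>();
        labels.add("yes");
        labels.add("no");
        attrs.add(new Attribute("play", labels));

        Instances header = new Instances("weather-header", attrs, 0);
        header.setClassIndex(header.numAttributes() - 1);

        // A new instance only needs this header to resolve its attributes.
        Instance inst = new DenseInstance(header.numAttributes());
        inst.setDataset(header);
        inst.setValue(0, 21.5);
        inst.setValue(1, 60.0);
        System.out.println(inst);
    }
}
```

Since the header carries no instances, it stays tiny no matter how large the original training data was.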