How do I apply my model to a new dataset in WEKA?

I have created a new prediction model based on a dataset that was given to me. It predicts a nominal (binary) class attribute (positive/negative) based on a number of numerical attributes.
Now I have been asked to use this prediction model to predict classes for a new dataset. This dataset has all the same attributes except for the class column, which does not exist yet. How do I apply my model to this new data? I have tried adding an empty class column to my new dataset and then doing the following:
Simply loading the new dataset in WEKA's explorer and loading the model. It tells me there is no training data.
Opening my training set in WEKA's explorer and then opening my training model, then choosing my new data as a 'supplied test set'. It runs but does not output any predictions.
I should note that the model works fine when testing on the training data for cross validation. It also works fine with a subset of the training data I separated ages ago for test/eval use. I think it may be a problem with how I am adding a new class column, maybe?

For making predictions, Weka requires the two datasets (the training set and the one you want predictions for) to have exactly the same structure, down to the order of the labels. That also means that the class attribute must be present, declared with the correct labels. For the values of the class attribute, simply use the missing value (denoted by a question mark).
See the FAQ "How do i make predictions with a trained model?" on the Weka wiki for more information on how to make predictions.
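As a rough illustration of the "class attribute with missing values" step, here is a minimal Python sketch that appends a nominal class attribute to an unlabeled ARFF file and fills every row with `?`. The helper name `add_missing_class`, the attribute names, and the sample data are all made up for the example, and it assumes a simple dense ARFF file (no sparse rows). The labels must be listed in the same order as in the training set.

```python
def add_missing_class(arff_text, class_name, labels):
    """Append a nominal class attribute whose value is missing ('?') on every row.

    Assumes a simple dense ARFF file (no sparse-format rows).
    """
    out = []
    in_data = False
    for line in arff_text.splitlines():
        stripped = line.strip()
        if not in_data and stripped.lower() == "@data":
            # Declare the class as the last attribute, labels in training order.
            out.append("@attribute %s {%s}" % (class_name, ",".join(labels)))
            out.append(line)
            in_data = True
        elif in_data and stripped and not stripped.startswith("%"):
            # Data row: add a missing class value.
            out.append(line.rstrip() + ",?")
        else:
            # Header lines, blank lines and comments pass through unchanged.
            out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical unlabeled dataset with two numeric attributes.
new_data = """@relation unlabeled
@attribute width numeric
@attribute height numeric
@data
1.5,2.0
3.1,0.7
"""

labeled = add_missing_class(new_data, "class", ["positive", "negative"])
print(labeled)
```

The resulting file can then be used as the supplied test set, with the class attribute selected as the last attribute.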

Related

Vertex AI batch predictions - getting feature names for DataFrame

I'm using batch predictions with custom trained models. Generally, one would want to write a line like,
df = pd.DataFrame(instances)
...as one of the first steps prior to doing any custom preprocessing of features. However, this doesn't work with batch predictions - the resulting DataFrame will not have column names as expected. It appears to be a numpy array.
Is there a decent or canonical approach to retrieving the feature (column) names, in case the table changes? (It's better not to assume that the table's columns and their positions all stay the same.)
I'm initiating the batch prediction job with the python client. I based my model off of this example.

Confusion matrix in Weka

I want to calculate the confusion matrix, F1 score, ROC, etc. But the Weka output is showing this. How can I get the confusion matrix, F1 score, ROC, etc.?
First of all, your dataset seems to have a numeric class attribute. Correlation coefficient is a statistic generated for regression models. A confusion matrix (which you want) is only computed for classification models.
Secondly, you are using ZeroR as classifier, which is not a very useful classifier (only for determining a baseline). ZeroR either predicts the mean class value (numeric class attribute) or the majority class (nominal class attribute).
Solutions:
Ensure that you are using the right attribute as your class. Assuming that you are using the Weka Explorer, check that the combobox on the Classify panel has the right attribute selected. On the command-line, use the -c flag to specify the index of the class attribute (1-based index; first and last can be used as well).
If you imported your data from a CSV file and the class attribute column contains only numeric values, then Weka will have left it as numeric (it doesn't know that this column represents a nominal attribute). In that case, make sure that you convert your class attribute to a nominal one, e.g., by using the NumericToNominal filter in the Preprocess panel.
Choose a different classifier, like RandomForest or J48, which tend to generate reasonable models with just the default parameters.
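To make the distinction concrete, here is a small Python sketch (not Weka code; all function names are hypothetical) of what a ZeroR-style baseline does for nominal versus numeric classes, and of how a confusion matrix and F1 score are derived once the class is nominal. With a numeric class there are no discrete labels to tabulate, which is why only regression statistics such as the correlation coefficient are reported.

```python
from collections import Counter


def zero_r(train_labels):
    """ZeroR-style constant prediction: mean for numeric, majority for nominal."""
    if all(isinstance(y, (int, float)) for y in train_labels):
        return sum(train_labels) / len(train_labels)
    return Counter(train_labels).most_common(1)[0][0]


def confusion_matrix(actual, predicted, labels):
    """Rows are actual classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


def f1_score(matrix, k):
    """F1 for the class at row/column k of a confusion matrix."""
    tp = matrix[k][k]
    fp = sum(row[k] for row in matrix) - tp   # predicted k, actually another class
    fn = sum(matrix[k]) - tp                  # actually k, predicted another class
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Nominal class: the baseline always predicts the majority label.
actual = ["yes", "yes", "no", "yes", "no"]
baseline = zero_r(actual)
predicted = [baseline] * len(actual)
matrix = confusion_matrix(actual, predicted, ["yes", "no"])
print(baseline, matrix, f1_score(matrix, 0))

# Numeric class: the baseline predicts the mean, so no confusion matrix exists.
print(zero_r([1.0, 2.0, 6.0]))
```

Note how the baseline never predicts the minority class, so its F1 for that class is zero. That is the sense in which ZeroR is only useful as a reference point.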

Does Google BigQuery ML automatically make the time series data stationary?

I am a newbie to Google BigQuery ML and was wondering whether auto.arima automatically makes my time series data stationary.
Suppose I have data that is not stationary and I give it as is to the auto.arima model in Google BigQuery ML: will it first make my data stationary before using it as input?
but that's part of the modeling procedure
From the documentation that explains What's inside a BigQuery ML time series model, it does appear that auto.ARIMA will make the data stationary.
However, I would not expect it to alter the source data table; it won't make that stationary. In the course of building candidate models, though, it may transform the input data prior to the actual model fitting (e.g. a Box-Cox transform, differencing to make the series stationary, etc.).
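For intuition only, here is a minimal Python sketch of differencing, the usual way an ARIMA-type model removes a trend (the "d" in ARIMA(p, d, q)). This is not BigQuery ML's implementation, and the `difference` helper is a made-up name. It only shows that the transform produces a new series and leaves the original data untouched.

```python
def difference(series, order=1):
    """Apply first differences `order` times: y[t] - y[t-1]."""
    for _ in range(order):
        series = [b - a for a, b in zip(series, series[1:])]
    return series


# A linear trend is not stationary in the mean; one difference flattens it.
trend = [3, 5, 7, 9, 11]
print(difference(trend))

# A quadratic trend needs two rounds of differencing.
quadratic = [t * t for t in range(6)]
print(difference(quadratic, order=2))

# The input list is not modified; the model works on a transformed copy.
print(trend)
```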

Attribute selection with Filtered classifier for saved model Weka

I trained my model using a FilteredClassifier with attribute selection in Weka. Now I am unable to use the serialized model to classify test data. I searched a lot but really couldn't figure it out. This is what I am doing at the moment:
java -cp $CLASSPATH weka.filters.supervised.attribute.AddClassification \
-serialized Working.model \
-classification \
-remove-old-class \
-i full_data.arff \
-c last
It gives me an error saying
weka.core.WekaException: Training header of classifier and filter dataset don't match
But they aren't supposed to match, right? The test data shouldn't have the class in its header. How should I use it? Also, I hope the selected attributes are serialized and saved in the model, since the same attribute selection needs to be applied to the test data.
I prefer not using Batch classifier since it defeats the point of a feature of saving the model and needs me to run the whole training each time.
One easy way to get it to work is to add the nominal class attribute to your ARFF file, filled with dummy values (any valid label will do), and then let the -remove-old-class option strip it out again.
So your command stays the same, but this time your ARFF file contains the class attribute.
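If it helps to see where the two headers diverge before rerunning the command, here is a rough Python sketch that compares the attribute declarations of two ARFF files. The function names are hypothetical, it assumes attribute names without spaces or quotes, and Weka's own compatibility check may differ in detail.

```python
def arff_attributes(arff_text):
    """Return the (name, type) pairs declared before @data.

    Simplifying assumption: attribute names contain no spaces or quotes.
    """
    attributes = []
    for line in arff_text.splitlines():
        stripped = line.strip()
        lowered = stripped.lower()
        if lowered == "@data":
            break
        if lowered.startswith("@attribute"):
            _, name, spec = stripped.split(None, 2)
            attributes.append((name, spec.replace(" ", "")))
    return attributes


def header_mismatches(train_text, test_text):
    """List human-readable differences between two ARFF headers."""
    train, test = arff_attributes(train_text), arff_attributes(test_text)
    problems = []
    if len(train) != len(test):
        problems.append("attribute count: %d vs %d" % (len(train), len(test)))
    for i, (a, b) in enumerate(zip(train, test), start=1):
        if a != b:
            problems.append("attribute %d: %s %s vs %s %s" % (i, a[0], a[1], b[0], b[1]))
    return problems


# Hypothetical headers: one test file lacks the class, another reorders labels.
train = "@relation t\n@attribute x numeric\n@attribute class {pos,neg}\n@data\n1,pos\n"
no_class = "@relation t\n@attribute x numeric\n@data\n1\n"
swapped = "@relation t\n@attribute x numeric\n@attribute class {neg,pos}\n@data\n1,neg\n"

print(header_mismatches(train, no_class))
print(header_mismatches(train, swapped))
print(header_mismatches(train, train))
```

Both a missing class attribute and a different label order show up as mismatches here, which is consistent with the dummy-class workaround: the test file needs the same class declaration as the training data, even though its values are thrown away.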

What does the attribute selection filter in the Preprocess tab do in Weka?

I can't seem to find out what the attribute selection filter in the Preprocess tab does. Could someone please tell me in simple language, as I'm new to Weka?
When I apply it to my dataset it seems to remove a couple of attributes, but I'm unsure why.
A real data set may contain many attributes. Applying a data mining process to such a data set (e.g. finding clusters, building a classification model ...) may take a very long time.
Instead, we can select a subset of the attributes (dimensions), namely the most discriminative ones. These attributes describe the data set almost as well with fewer attributes, which speeds up any process run on the data.
Attribute selection offers many different methods for choosing these attributes. One of them is CFS subset evaluation (CfsSubsetEval in Weka). It prefers subsets of attributes that are highly correlated with the class label while having low correlation with each other, which makes them discriminative attributes. The attributes it leaves out of the chosen subset are the ones that get removed from your dataset.
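To give a feel for how a correlation-based evaluator scores a subset, here is a small Python sketch of the CFS merit heuristic, merit = k * r_cf / sqrt(k + k(k-1) * r_ff), where r_cf is the average attribute-class correlation and r_ff the average attribute-attribute correlation. It uses plain Pearson correlation on made-up numbers; Weka's implementation may measure correlation differently, so treat this as an illustration rather than a reimplementation.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy) if vx and vy else 0.0


def cfs_merit(features, target):
    """CFS merit of a feature subset: k*r_cf / sqrt(k + k*(k-1)*r_ff)."""
    k = len(features)
    r_cf = sum(abs(pearson(f, target)) for f in features) / k
    pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]
    r_ff = 0.0
    if pairs:
        r_ff = sum(abs(pearson(features[i], features[j])) for i, j in pairs) / len(pairs)
    return k * r_cf / sqrt(k + k * (k - 1) * r_ff)


# Made-up data: `a` tracks the class, `a_copy` is redundant, `noise` is unrelated.
target = [0, 0, 1, 1, 0, 1]
a = [1, 2, 8, 9, 2, 7]
a_copy = [2, 4, 16, 18, 4, 14]
noise = [5, 1, 4, 2, 3, 3]

print(cfs_merit([a], target))
print(cfs_merit([a, a_copy], target))   # a redundant attribute adds nothing
print(cfs_merit([a, noise], target))    # an unrelated attribute lowers the merit
```

This is why the filter drops some attributes: adding them to the subset either does not improve the score or makes it worse.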