How to create an on-demand cost file in Weka? - data-mining

I am using the Weka data mining tool. In Weka I am trying to use the MetaCost classifier to classify my data, but while executing it an error popup appears saying "On-demand cost file doesn't exist".
Does anyone know how to create an on-demand cost file?

If you are using the GUI:
You need to create a cost matrix first. Select the "Use explicit cost matrix" option and create a cost matrix via the "cost matrix" option. You can use this option directly to train your base classifier.
If you want the on-demand cost file option, save this cost matrix as a file named "<relation name of the training data (ARFF file)>.cost".
When you select "Load cost matrix on demand" in the Weka GUI, set "onDemandDirectory" to the path of the directory where you saved the cost file.
Remember that the name of the cost file must be the relation name of the training data plus ".cost". Now you can train your base classifier with the on-demand cost matrix.
Edit: For more details: http://weka.sourceforge.net/doc/weka/classifiers/meta/MetaCost.html
It is similar in the Weka API too.
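For example, here is a minimal sketch of the explicit-cost-matrix route via the Java API (a hypothetical two-class train.arff; the cell values are placeholders):

    import weka.classifiers.CostMatrix;
    import weka.classifiers.meta.MetaCost;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.SelectedTag;
    import weka.core.converters.ConverterUtils.DataSource;

    public class MetaCostExample {
      public static void main(String[] args) throws Exception {
        // Hypothetical training file; its relation name is what an
        // on-demand cost file would have to be named after (<relation>.cost).
        Instances train = DataSource.read("train.arff");
        train.setClassIndex(train.numAttributes() - 1);

        // 2x2 cost matrix: rows = actual class, columns = predicted class
        CostMatrix costs = new CostMatrix(2);
        costs.setCell(0, 1, 1.0); // cost of predicting 1 when the actual class is 0
        costs.setCell(1, 0, 5.0); // cost of predicting 0 when the actual class is 1

        MetaCost mc = new MetaCost();
        mc.setClassifier(new J48()); // base classifier
        mc.setCostMatrixSource(
            new SelectedTag(MetaCost.MATRIX_SUPPLIED, MetaCost.TAGS_MATRIX_SOURCE));
        mc.setCostMatrix(costs);
        mc.buildClassifier(train);
        System.out.println(mc);
      }
    }

Switching the source tag to MATRIX_ON_DEMAND and calling setOnDemandDirectory(new File(...)) instead should give the on-demand behaviour described above.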

Related

Vertex AI, batch prediction

Does anyone know how I can choose the option "Files on Cloud Storage (file list)" in the program? How can I choose the list format?
You can't enter multiple files in the UI. You point it to one text file in Cloud Storage, and that file should contain the paths of all the images you want to run predictions on. Here is the file format.
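For reference, a file list is just a plain text file stored in Cloud Storage with one input URI per line, along the lines of (bucket and object names are made up):

    gs://my-bucket/images/img_001.jpg
    gs://my-bucket/images/img_002.jpg
    gs://my-bucket/images/img_003.png

You then point the batch prediction job at the list file itself, e.g. gs://my-bucket/batch/file_list.txt.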

Auto labeling for Text Data with Amazon Sagemaker ground truth

What is the minimum number of text rows needed for Ground Truth to do auto-labelling? I have a text file which contains 1,000 rows; is this good enough to get started with auto-labelling by SageMaker Ground Truth?
I'm a product manager on the Amazon SageMaker Ground Truth team, and I'm happy to help you with this question. The minimum system requirement is 1,000 objects. In practice with text classification, we typically see meaningful results (% of data auto-labeled) only once you have 2,000 to 3,000 text objects. Remember performance is variable and depends on your dataset and the complexity of your task.
From the documentation,
You should use automated data labeling only on large datasets. The neural networks used with active learning require a significant amount of data for every new dataset. With larger datasets there is more potential to automatically label the data and therefore reduce the total cost of labeling. We recommend that you use thousands of data objects when using automated data labeling. You must use at least 5,000 data objects.
https://docs.aws.amazon.com/sagemaker/latest/dg/sms-automated-labeling.html

How to use TimeSeriesForecasting in KnowledgeFlow?

The Weka Explorer provides a Time Series Forecasting perspective and it is easy to use.
However, what should I do if I want to use the KnowledgeFlow for time series forecasting?
And what if I want to save the original dataset together with the predictions?
Solution (thanks to the help of people on the Weka list, especially Mark Hall and Eibe Frank):
1. Open the KnowledgeFlow and load the dataset with an ArffLoader.
2. Go to the settings, enable the Time Series Forecasting perspective, then right-click the ArffLoader and send it to all perspectives.
3. Switch to the Time Series Forecasting perspective and set up a model.
4. Run the model and copy it to the clipboard.
5. Back on the Data Mining Processes canvas, press Ctrl+V and click to paste the model.
6. Save the predictions along with the original data using an ArffSaver.
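If you'd rather script this than click through the GUI, the time series forecasting package also has a Java API. Here is a minimal sketch, assuming the timeseriesForecasting package is installed and a hypothetical sales.arff with a date attribute named Date and a numeric target named Sales:

    import java.util.List;
    import weka.classifiers.evaluation.NumericPrediction;
    import weka.classifiers.functions.LinearRegression;
    import weka.classifiers.timeseries.WekaForecaster;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class ForecastExample {
      public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("sales.arff"); // hypothetical dataset

        WekaForecaster forecaster = new WekaForecaster();
        forecaster.setFieldsToForecast("Sales");              // target field(s)
        forecaster.setBaseForecaster(new LinearRegression()); // placeholder learner
        forecaster.getTSLagMaker().setTimeStampField("Date");
        forecaster.getTSLagMaker().setMinLag(1);
        forecaster.getTSLagMaker().setMaxLag(12);

        forecaster.buildForecaster(data, System.out);
        forecaster.primeForecaster(data); // prime with the most recent history

        // Forecast 12 steps beyond the end of the data
        List<List<NumericPrediction>> forecast = forecaster.forecast(12, System.out);
        for (int i = 0; i < 12; i++) {
          NumericPrediction pred = forecast.get(i).get(0);
          System.out.println("step " + (i + 1) + ": " + pred.predicted());
        }
      }
    }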

Weka: Src and Dest differ in # of attributes after I do feature selection on the training set

I am trying to use weka to classify text. What I do is this:
I create one big ARFF file with all of the data: all_of_it.arff.
I split that data into training and test sets: train.arff and test.arff.
I do feature selection on the training set and output a new training file: train_fs.arff.
I build a classifier with only those selected features.
And the problem is...
I don't quite know how to restrict the test set to only the features I selected from the training set; something like creating a new test file from test.arff according to train_fs.arff.
I tried using:
java -cp weka.jar weka.filters.unsupervised.attribute.Standardize -b -i train_fs.arff -o train2.arff -r test.arff -s test2.arff
but I got the infamous "Src and Dest differ in # of attributes" error.
Is there any way to normalize/standardize the sets according to an ARFF file (namely my new training data with fewer features)? I don't see how to do this with the Standardize or StringToWordVector filter.
Batch filtering is one solution to your problem.
Pros:
It will apply the same filter to your test dataset as you apply to your training dataset, so after feature selection the two datasets will be compatible.
Cons:
It is only available from the command line interface or Weka's Java API.
The two datasets must be filtered at the same time.
You can read more about Batch filtering here.
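For feature selection specifically, the Java API version of batch filtering is to initialize an AttributeSelection filter on the training data and then apply the same filter object to the test data. A minimal sketch (the evaluator, search method, and number of attributes to keep are placeholders):

    import weka.attributeSelection.InfoGainAttributeEval;
    import weka.attributeSelection.Ranker;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.supervised.attribute.AttributeSelection;

    public class BatchFeatureSelection {
      public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("train.arff");
        Instances test = DataSource.read("test.arff");
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        AttributeSelection filter = new AttributeSelection();
        filter.setEvaluator(new InfoGainAttributeEval());
        Ranker ranker = new Ranker();
        ranker.setNumToSelect(100); // placeholder: number of attributes to keep
        filter.setSearch(ranker);

        // The filter decides which attributes to keep from the TRAINING data only
        filter.setInputFormat(train);
        Instances trainFs = Filter.useFilter(train, filter);
        // The already-initialized filter reduces the test set the same way,
        // so the two outputs stay compatible
        Instances testFs = Filter.useFilter(test, filter);

        System.out.println(trainFs.numAttributes() + " attributes kept in both sets");
      }
    }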
You may also want to look into InputMappedClassifier. It is a wrapper classifier that addresses incompatible training and testing data.
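A rough sketch of how that wrapper would be used here (J48 is a placeholder base classifier):

    import weka.classifiers.misc.InputMappedClassifier;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class InputMappedExample {
      public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("train_fs.arff"); // reduced training set
        Instances test = DataSource.read("test.arff");      // unreduced test set
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        InputMappedClassifier imc = new InputMappedClassifier();
        imc.setClassifier(new J48()); // placeholder base classifier
        imc.buildClassifier(train);

        // The wrapper maps each test instance's attributes to the training
        // structure by name, so the differently-shaped test set no longer
        // triggers the "Src and Dest differ in # of attributes" error.
        for (int i = 0; i < test.numInstances(); i++) {
          double label = imc.classifyInstance(test.instance(i));
          System.out.println(i + ": " + train.classAttribute().value((int) label));
        }
      }
    }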

Building compatible datasets for Weka for large, evolving data

I have a largish dataset that I am using Weka to explore. It goes like this: today I will analyze as much data as I can, and create a trained classifier. I'll save this model as a file. Then tomorrow I will acquire a new batch of data, and want to use the saved model to predict the class for the new data. This repeats every day. Eventually I will update the saved model, but for now assume that it is static.
Due to the size and frequency of this task, I want to run this automatically, which means the command line or similar. However, my problem exists in the Explorer, as well.
My question has to do with the fact that, as my dataset grows, the list of possible labels for attributes also grows. Weka says such attribute lists cannot change, or the training set and test set are said to be incompatible (see: http://weka.wikispaces.com/Why+do+I+get+the+error+message+%27training+and+test+set+are+not+compatible%27%3F). But in my world there is no way that I could possibly know today all the attribute labels that I will stumble across next week.
To rectify the situation, it is suggested that I run batch filtering (http://weka.wikispaces.com/How+do+I+generate+compatible+train+and+test+sets+that+get+processed+with+a+filter%3F). Okay, that appears to mean that I need to re-build my model with the refiltered training data each day.
At this point the whole thing seems difficult enough that I fear I am making a horrible, simple newbie mistake, and so I ask for help.
DETAILS:
The model was created by
java -Xmx1280M weka.classifiers.meta.FilteredClassifier ^
-t .\training.arff -d .\my.model -c 15 ^
-F "weka.filters.supervised.attribute.Discretize -R first-last" ^
-W weka.classifiers.trees.J48 -- -C 0.25 -M 2
Naively, to predict I would try:
java -Xmx1280M weka.core.converters.DatabaseLoader ^
-url jdbc:odbc:(database) ^
-user (user) ^
-password (password) ^
-Q "exec (my_stored_procedure) '1/1/2012', '1/2/2012' " ^
> .\NextDay.arff
And then:
java -Xmx1280M weka.classifiers.trees.J48 ^
-T .\NextDay.arff ^
-l .\my.model ^
-c 15 ^
-p 0 ^
> .\MyPredictions.txt
This yields:
java.lang.Exception: training and test set are not compatible
at weka.classifiers.Evaluation.evaluateModel(Evaluation.java:1035)
at weka.classifiers.Classifier.runClassifier(Classifier.java:312)
at weka.classifiers.trees.J48.main(J48.java:948)
A related question is asked at kdkeys.net/training-and-test-set-are-not-compatible-weka/
An associated problem is that the command-line version of the database extraction requires generating a temporary .arff file, and it appears that JDBC-generated .arff files do not handle "date" data correctly. My database generates dates in the ISO-8601 format "yyyy-MM-dd'T'HH:mm:ss", but both the Explorer and the .arff files generated from JDBC data represent these as type NOMINAL. And so the list of labels for date attributes in the header is very, very long and never the same from dataset to dataset.
I'm not a java or python programmer, but if that's what it takes, I'll go buy some books! Thanks in advance.
I think you can use incremental classifiers, but only a few classifiers support this option. Classifiers like SMO and J48 don't support it, so you will have to use one of the other classifiers (see the sketch after the links below).
To know more visit
http://weka.wikispaces.com/Classifying+large+datasets
http://wiki.pentaho.com/display/DATAMINING/Handling+Large+Data+Sets+with+Weka
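As a sketch of the incremental route (this is the standard UpdateableClassifier pattern; NaiveBayesUpdateable stands in for whichever updateable classifier suits your data):

    import java.io.File;
    import weka.classifiers.bayes.NaiveBayesUpdateable;
    import weka.core.Instance;
    import weka.core.Instances;
    import weka.core.converters.ArffLoader;

    public class IncrementalExample {
      public static void main(String[] args) throws Exception {
        // Read the header first, then stream the instances one at a time
        ArffLoader loader = new ArffLoader();
        loader.setFile(new File("training.arff"));
        Instances structure = loader.getStructure();
        structure.setClassIndex(structure.numAttributes() - 1);

        NaiveBayesUpdateable nb = new NaiveBayesUpdateable();
        nb.buildClassifier(structure); // initialize from the header only

        Instance current;
        while ((current = loader.getNextInstance(structure)) != null) {
          nb.updateClassifier(current); // fold in each new instance
        }
        // Serialize nb today, reload it tomorrow, and keep calling
        // updateClassifier() on the new day's instances.
      }
    }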
There is a bigger problem with your plan, too, it seems. If you have data from day 1 and use it to build a model, then use it on data from day n that has new, never-before-seen class labels, it will be impossible to predict the new labels because there is no training data for them. Similarly, if you have new attributes, it will be impossible to use those for classification because none of your training data has them to associate with the class labels.
Thus, if you want to use a model trained on data with only a subset of the new data's attributes/classes, then you might as well filter the new data to remove the new classes/attributes, since they wouldn't be used even if you could run Weka without errors on two dissimilar datasets.
If it's not in your training set, exclude it from your test set. Then everything should work. If you need to be able to test/predict on it, then you need to retrain a new model that has examples of the new classes/attributes.
Doing this in your environment might require manually querying data out of the database into ARFF files, so as to pull out only the attributes/classes that were in the training set. Look into SQL and any major scripting language (e.g. Perl, Python) to do this without much fuss.
The university that maintains Weka also created MOA (Massive Online Analysis) to analyse and solve your kind of problem. All of its classifiers are updatable, and you can compare classifier performance over time on your data stream. It also allows you to detect changes of models (concept drift/shift) and to optimize (i.e. limit) your data window over time (a forget-old-data mechanism).
Once you're done with testing and tuning in MOA, you can then use the MOA classifiers from Weka (there is an extension to enable this) and batch your whole process.