How to convert a string type value to numeric in Weka? - weka

I have an issue in WEKA. I have a training set in which the class attribute is numeric (the output is either 0 or 1 for the data). I build a model on this training data.
When I try to use the model on the test data, which has all of the same attributes, the class attribute is treated as a "string", so the two data sets are not compatible. Is there any way for me to convert this "string" attribute in my test data into a "numeric" one so that the two data sets are compatible?
I am not good with the command line, so is there a way to do it within the WEKA GUI?

Related

number expected.read Token[2015-02-02 14:19:00] weka project

I hope you are all doing well!
I have a project for a data mining class. The data consists of numeric data, and many algorithms do not work with it. This is the assignment: "You should compare the performance of the following classification algorithms: RandomForest, C4.5, JRip, Bayesian Network. Where necessary, use Weka filters to replace values for some properties or to create new properties. For the comparison, adopt the Train / Test Percentage Split type with the percentage of training data equal to 80%. Describe your observations by giving tables with the results and presenting the performance of the algorithms. Repeat the experiment with the percentage of training data equal to 70% and 50%, presenting the results."
My first attempt was to transform the data inside Weka with the NumericToNominal preprocessing filter, but a friend of mine suggested that this is statistically wrong. My second attempt was to use Excel to transform all the data, even the date, to numeric, remove the first row (id), and pass the result to Weka (I left double quotes only on the date). But I get the error mentioned in the title. The dataset is: https://archive.ics.uci.edu/ml/datasets/Occupancy+Detection+
Thank you for your time.
If you define date-like data as a DATE attribute in the ARFF file (using the right format for parsing the strings), then WEKA will treat it as a numeric attribute internally (Java epoch, i.e. milliseconds since 1970-01-01).
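To illustrate what such an internal numeric value looks like, here is a minimal Python sketch (not Weka code) that converts the timestamp from the question title to milliseconds since 1970-01-01. It assumes the timestamp is in UTC and follows the pattern yyyy-MM-dd HH:mm:ss; Weka may interpret the string in a different time zone, so its value can differ by the zone offset. The function name to_epoch_millis is made up for this example. In the ARFF file, the matching declaration would be along the lines of @attribute date date "yyyy-MM-dd HH:mm:ss".

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(text, fmt="%Y-%m-%d %H:%M:%S"):
    """Parse a timestamp string and return milliseconds since 1970-01-01.

    Assumption: the timestamp is interpreted as UTC.
    """
    parsed = datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)
    # Integer division by one millisecond avoids floating-point rounding.
    return (parsed - EPOCH) // timedelta(milliseconds=1)


print(to_epoch_millis("2015-02-02 14:19:00"))  # prints 1422886740000
```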
Instead of using NumericToNominal, use either the supervised or unsupervised Discretize filter if the algorithm cannot handle numeric attributes.
Converting nominal attributes to numeric ones is not a recommended approach. Instead, try the supervised or unsupervised NominalToBinary filter.

Weka not display Correctly classified instances as output

I am new to Weka. I have a dataset in CSV with 5000 samples; 20 of them are shown below. When I load this dataset into Weka it looks OK, but when I run the kNN algorithm it gives a result that it is not supposed to give. Here is the sample data:
a,b,c,d
74,85,123,1
73,84,122,1
72,83,121,1
70,81,119,1
70,81,119,1
69,80,118,1
70,81,119,1
70,81,119,1
76,87,125,1
76,87,125,1
82,92,146,2
74,86,140,2
68,80,134,2
64,76,130,2
64,75,132,2
83,96,152,2
72,85,141,2
71,83,141,2
69,81,139,2
65,79,137,2
Here is the result:
=== Cross-validation ===
=== Summary ===
Correlation coefficient 0.6148
Mean absolute error 0.2442
Root mean squared error 0.4004
Relative absolute error 50.2313 %
Root relative squared error 81.2078 %
Total Number of Instances 5000
It is supposed to give a result like this:
Correctly classified instances: 69 92%
Incorrectly classified instances: 6 8%
What could be the problem? What am I missing? I tried all the other algorithms, but they all give the same kind of output. I have used the sample Weka datasets, and they all work as expected.
The IBk algorithm can be used for regression (predicting the value of a numeric response for each instance) as well as for classification (predicting which class each instance belongs to).
It looks like all the values of the class attribute in your dataset (column d in your CSV) are numbers. When you load this data into Weka, Weka therefore guesses that this attribute should be treated as a numeric one, not a nominal one. You can tell this has happened because the histogram in the Preprocess tab looks something like this:
instead of like this (coloured by class):
The result you're seeing when you run IBk is the result of a regression fit (predicting a numeric value of column d for each instance) instead of a classification (selecting the most likely nominal value of column d for each instance).
To get the result you want, you need to tell Weka to treat this attribute as nominal. When you load the csv file in the Preprocess tab, check Invoke options dialog in the file dialog window. Then when you click Open, you'll get this window:
The field nominalAttributes is where you can give Weka a list of which attributes are nominal ones even if they look numeric. Entering 4 here will specify that the fourth attribute (column) in the input is a nominal attribute. Now IBk should behave as you expect.
You could also do this by applying the NumericToNominal unsupervised attribute filter to the already loaded data, again specifying attribute 4; otherwise the filter will apply to all the attributes.
The ARFF format used for the Weka sample datasets includes a specification of which attributes are which type. After you've imported (or filtered) your dataset as above, you can save it as ARFF and you'll then be able to reload it without having to go through the same process.
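As an illustration of how an ARFF header records the attribute types, here is a minimal Python sketch (outside Weka) that turns CSV text like the sample above into ARFF text with the last column declared as nominal. The function name csv_to_arff and its parameters are made up for this example, and it assumes simple column names, no missing values, and no quoted or comma-containing fields.

```python
import csv
import io


def csv_to_arff(csv_text, relation="data", nominal_columns=(-1,)):
    """Build ARFF text from CSV text, declaring chosen columns as nominal.

    Assumption: every other column is numeric and the first row holds names.
    """
    rows = list(csv.reader(io.StringIO(csv_text.strip())))
    header, data = rows[0], rows[1:]
    # Normalise negative indices such as -1 (the last column).
    nominal = {i % len(header) for i in nominal_columns}

    lines = ["@relation " + relation, ""]
    for i, name in enumerate(header):
        if i in nominal:
            # A nominal attribute lists every value that occurs in the column.
            values = sorted({row[i] for row in data})
            lines.append("@attribute " + name + " {" + ",".join(values) + "}")
        else:
            lines.append("@attribute " + name + " numeric")
    lines += ["", "@data"]
    lines += [",".join(row) for row in data]
    return "\n".join(lines) + "\n"


sample = "a,b,c,d\n74,85,123,1\n82,92,146,2\n"
print(csv_to_arff(sample))
```

With the class column declared as {1,2} in the header, a file like this should load with a nominal class without any further options.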

Trained and Test data have different number of attributes that gave an error "Train and test set are not compatible"

I use WEKA for text classification. I have a training data set, to which I apply the StringToWordVector and NumericToNominal filters, and a test data set, to which I applied the same filters.
When I try to apply my model to the test data, it gives me the following error:
Train and test set are not compatible
I searched for a solution. The error occurs because the number of attributes differs between the two sets, and it will always differ because the texts in the two sets are different.
How can I solve this error, please?
The best thing you can do is combine your training and test set into one file and then apply the filter to it all in one go, then split them up again and copy the @attribute declarations from the combined file into both the training and test files. This way the attributes will be consistent across both files.
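The combine-and-split bookkeeping can be sketched in Python (outside Weka) as follows. This is only an illustration under assumptions: both files start with the same header, each instance sits on a single line, and the filter applied in between keeps the number and order of the rows. The helper names split_arff, combine and split_back are made up for this example.

```python
def split_arff(text):
    """Return (header_lines, data_rows) for ARFF text with one row per line."""
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if line.strip().lower() == "@data":
            rows = [
                row for row in lines[i + 1:]
                if row.strip() and not row.lstrip().startswith("%")
            ]
            return lines[: i + 1], rows
    raise ValueError("no @data line found")


def combine(train_text, test_text):
    """Append the test rows to the training file; remember the train size."""
    header, train_rows = split_arff(train_text)
    _, test_rows = split_arff(test_text)
    combined = "\n".join(header + train_rows + test_rows) + "\n"
    return combined, len(train_rows)


def split_back(filtered_text, n_train):
    """Split a filtered combined file into train and test with one header."""
    header, rows = split_arff(filtered_text)
    train = "\n".join(header + rows[:n_train]) + "\n"
    test = "\n".join(header + rows[n_train:]) + "\n"
    return train, test
```

The combined text would be saved, filtered in Weka, saved again, and then passed to split_back together with the remembered row count.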

Weka : training and test set are not compatible

Each row of my training and test datasets has intensity values for the pixels in an image, with the last column holding the label that tells which digit is represented in the image; the label can be any number from 0 to 9 in the training set and is always ? in the test set. I loaded the training dataset in Weka Explorer, passed the data through the NumericToNominal filter, and used the RemovePercentage filter to split the data in a 70-30 ratio, the 30% file being used as the cross-validation set. I built a classifier and saved the model.
Then I loaded the test data, which has ? as the label for each row, applied the NumericToNominal filter, and saved it as an ARFF file. Now, when I load the test data and try to use the model against it, I always get the error message "training and test set are not compatible". Both datasets have undergone the same processing. What could possibly have gone wrong?
As you can read in the ARFF manual (http://www.cs.waikato.ac.nz/ml/weka/arff.html):
Nominal values are defined by providing an <nominal-specification> listing the possible values: {<nominal-name1>, <nominal-name2>, <nominal-name3>, ...}
For example, the class value of the Iris dataset can be defined as follows:
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
So when you apply NumericToNominal to your test file, you can end up with a different number of possible values for one or more attributes in the train and test ARFF files. It really can happen, and it has bothered me many times. One solution is to check your ARFF files manually (if they are not too big), or just copy and paste the attribute declarations of the ARFF file, e.g.
@attribute 'My first binary attribute' {0,1}
(...)
@attribute 'My last binary attribute' {0,1}
from the train file to the test file. That should work.
You can also use batch filtering in Weka.
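The copy-and-paste fix can be sketched in Python (outside Weka) like this. It is a minimal illustration, assuming both files have the same attributes in the same order, each instance sits on a single line, and every value in the test rows is already listed in the training header. The helper names split_arff and use_training_header are made up for this example.

```python
def split_arff(text):
    """Return (header_lines, data_rows) for ARFF text with one row per line."""
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if line.strip().lower() == "@data":
            rows = [
                row for row in lines[i + 1:]
                if row.strip() and not row.lstrip().startswith("%")
            ]
            return lines[: i + 1], rows
    raise ValueError("no @data line found")


def use_training_header(train_text, test_text):
    """Return the test rows under the header taken from the training file."""
    train_header, _ = split_arff(train_text)
    _, test_rows = split_arff(test_text)
    return "\n".join(train_header + test_rows) + "\n"


train = "@relation digits\n@attribute label {0,1}\n@data\n0\n1\n"
test = "@relation digits\n@attribute label {0}\n@data\n0\n"
print(use_training_header(train, test))
```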

Weka: Train and test set are not compatible

I'm trying to classify some web posts using weka and naive bayes classifier.
First I manually classified many posts (about 100 negative and 100 positive) and I created an .arff file with this form:
@relation classtest
@attribute 'post' string
@attribute 'class' {positive,negative}
@data
'RT @burnreporter: Google has now indexed over 30 trillion URLs. Wow. #LeWeb',positive
'A special one for me Soundcloud at #LeWeb ',positive
'RT @dianaurban: Lost Internet for 1/2 hour at a conference called #LeWeb. Ironic, yes?',negative
.
.
.
Then I open Weka Explorer, load that file, and apply the StringToWordVector filter to split the posts into single-word attributes.
Then, after doing the same with my dataset, I select the naive Bayes classifier (in the Classify tab of Weka) and choose the supplied test set option, and it returns "Train and test set are not compatible". What can I do? Thanks!
Probably the ordering of the attributes is different in train and test sets.
You can use batch filtering as described in http://weka.wikispaces.com/Batch+filtering
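To check whether the attribute order (or anything else in the header) really differs, a small Python sketch like the following can compare the @attribute declarations of two ARFF files position by position. It is not part of Weka; the function names attribute_lines and header_differences are made up for this example, and the comparison is purely textual, so declarations that differ only in spacing inside the braces are also reported.

```python
def attribute_lines(text):
    """Collect the @attribute declarations (name and type) of an ARFF header."""
    attrs = []
    for line in text.splitlines():
        stripped = line.strip()
        lowered = stripped.lower()
        if lowered.startswith("@attribute"):
            # Drop the keyword and normalise runs of whitespace.
            attrs.append(" ".join(stripped.split()[1:]))
        elif lowered == "@data":
            break
    return attrs


def header_differences(train_text, test_text):
    """List human-readable differences between two ARFF headers."""
    train = attribute_lines(train_text)
    test = attribute_lines(test_text)
    diffs = []
    if len(train) != len(test):
        diffs.append(
            "attribute count: train=%d, test=%d" % (len(train), len(test))
        )
    for i, (a, b) in enumerate(zip(train, test), start=1):
        if a != b:
            diffs.append("attribute %d: train='%s' vs test='%s'" % (i, a, b))
    return diffs
```

An empty list means the two headers declare the same attributes in the same order; any entry points at a position where the files disagree.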
I used the batch filter but still have a problem. Here is what I did:
java -cp /usr/share/java/weka.jar weka.filters.unsupervised.attribute.NumericToNominal -R last -b -i trainData.arff -o trainDataProcessed.csv.arff -r testData.arff -s testDataProcessed.csv.arff
I then get the error below:
Input file formats differ.
Later, I figured out two ways to make the trained model work on a supplied test set.
Method 1.
Use the Knowledge Flow. For example, something like this:
CSVLoader (for the train set) -> ClassAssigner -> TrainingSetMaker -> (classifier of your choice) -> ClassifierPerformanceEvaluator -> TextViewer
CSVLoader (for the test set) -> ClassAssigner -> TestSetMaker -> (the same classifier instance as above) -> PredictionAppender -> CSVSaver
Then load the data from the CSVLoader or ArffLoader for the training set; the model will be trained. After that, load the data from the loader for the test set. The flow evaluates the model (the classifier, for example) on the supplied test set, and you can see the result in the TextViewer (connected to the ClassifierPerformanceEvaluator) and get the saved result from the CSVSaver or ArffSaver connected to the PredictionAppender. An additional column, "classified as", is added to the output file. In my case, I used "?" for the class column in the supplied test set if the class labels are not available.
Method 2.
Combine the training and test set into one file. Then the exact same filter can be applied to both the training and the test set. After that, you can separate the training set and the test set by applying an instance filter. Since I use "?" as the class label in the test set, it is not visible in the instance filter indices. Hence, just select the indices that you can see in the attribute values to be removed when applying the instance filter. Only the test data will be left. Save it and load it as the supplied test set on the classifier page. This time it will work. I guess it is the class attribute that causes the incompatible train and test set issue, as many classifiers require a nominal class attribute, whose value is converted to an index into the available values of the class attribute, according to http://weka.wikispaces.com/Why+do+I+get+the+error+message+%27training+and+test+set+are+not+compatible%27%3F