The following is the content of my ARFF file:
@relation superstore
@attribute t1 {milk,egg,bread,butter,popcorn,chip,beer}
@attribute t2 {milk,egg,bread,butter,popcorn,chip,beer}
@attribute t3 {milk,egg,bread,butter,popcorn,chip,beer}
@attribute t4 {milk,egg,bread,butter,popcorn,chip,beer}
@attribute t5 {milk,egg,bread,butter,popcorn,chip,beer}
@attribute t6 {milk,egg,bread,butter,popcorn,chip,beer}
@attribute t7 {milk,egg,bread,butter,popcorn,chip,beer}
@attribute t8 {milk,egg,bread,butter,popcorn,chip,beer}
@attribute t9 {milk,egg,bread,butter,popcorn,chip,beer}
@attribute t10 {milk,egg,bread,butter,popcorn,chip,beer}
@data
milk,egg,bread,?,?,chip,?
?,egg,?,?,popcorn,chip,beer
?,egg,bread,?,?,chip,?
milk,egg,bread,?,popcorn,chip,beer
milk,?,bread,?,?,?,beer
?,egg,bread,?,?,?,beer
milk,?,bread,?,?,chip,?
milk,egg,bread,butter,?,chip,?
milk,egg,?,butter,?,chip,?
While loading this data in Weka, I get an EOL error on line 16. I have checked multiple times and have not found anything abnormal there. Kindly help me out.
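You have declared 10 attributes, but each data row contains only 7 values. Weka reaches the end of a row before it has read a value for every declared attribute, hence the EOL error. The number of attributes must match the number of values per row.
Your file should look like this:
@relation superstore
@attribute t1 {milk,egg,bread,butter,popcorn,chip,beer}
@attribute t2 {milk,egg,bread,butter,popcorn,chip,beer}
@attribute t3 {milk,egg,bread,butter,popcorn,chip,beer}
@attribute t4 {milk,egg,bread,butter,popcorn,chip,beer}
@attribute t5 {milk,egg,bread,butter,popcorn,chip,beer}
@attribute t6 {milk,egg,bread,butter,popcorn,chip,beer}
@attribute t7 {milk,egg,bread,butter,popcorn,chip,beer}
@data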
milk,egg,bread,?,?,chip,?
?,egg,?,?,popcorn,chip,beer
?,egg,bread,?,?,chip,?
milk,egg,bread,?,popcorn,chip,beer
milk,?,bread,?,?,?,beer
?,egg,bread,?,?,?,beer
milk,?,bread,?,?,chip,?
milk,egg,bread,butter,?,chip,?
milk,egg,?,butter,?,chip,?
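A quick way to find this kind of mismatch before loading the file is to count the values in each row. The following is a minimal Python sketch, not part of Weka; the function name `check_row_widths` is made up for illustration, and it assumes the standard ARFF keywords `@attribute` and `@data`, a dense (non-sparse) data section, and no quoted values that contain commas.

```python
def check_row_widths(arff_text):
    """Return (line_number, values_found, attributes_declared) for every
    data row whose value count differs from the declared attribute count."""
    n_attributes = 0
    in_data = False
    problems = []
    for line_number, raw in enumerate(arff_text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and ARFF comment lines
        if not in_data:
            keyword = line.split(None, 1)[0].lower()
            if keyword == "@attribute":
                n_attributes += 1
            elif keyword == "@data":
                in_data = True
            continue
        # Naive split: assumes no quoted values containing commas.
        n_values = len(line.split(","))
        if n_values != n_attributes:
            problems.append((line_number, n_values, n_attributes))
    return problems


sample = """@relation superstore
@attribute t1 {milk,egg}
@attribute t2 {milk,egg}
@attribute t3 {milk,egg}
@data
milk,egg
?,egg,milk
"""
print(check_row_widths(sample))
```

An empty list means every row has as many values as there are declared attributes.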
Related
I am trying to use Weka to make decisions using ARFF files. However, when I try to classify, I receive the error "Problem evaluating classifier: Train and test set are not compatible".
This is my weather.arff file:
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
rainy,68,80,FALSE,yes
rainy,65,70,TRUE,no
overcast,64,65,TRUE,yes
sunny,72,95,FALSE,no
sunny,69,70,FALSE,yes
rainy,75,80,FALSE,yes
sunny,75,70,TRUE,yes
overcast,72,90,TRUE,yes
overcast,81,75,FALSE,yes
rainy,71,91,TRUE,no
and this is my weather.nominal.classify.arff
@relation weather.symbolic
@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
overcast,mild,high,TRUE,?
Your attributes are different. In your first file you have
@attribute temperature real
and in your second file you have
@attribute temperature {hot, mild, cool}
The same goes for humidity. The attribute definitions must be completely identical in both files.
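To spot such differences without reading both headers by eye, the two attribute lists can be compared programmatically. This is a rough Python sketch under stated assumptions: the helper names `read_attributes` and `header_differences` are hypothetical, attribute names are assumed to contain no spaces or quotes, and `real`, `integer` and `numeric` are treated as the same type because ARFF handles all three as numeric.

```python
NUMERIC_TYPES = {"real", "integer", "numeric"}


def read_attributes(arff_text):
    """Return a list of (name, type) pairs from an ARFF header.
    Assumes attribute names contain no spaces or quotes."""
    attributes = []
    for raw in arff_text.splitlines():
        parts = raw.strip().split(None, 2)
        if not parts:
            continue
        keyword = parts[0].lower()
        if keyword == "@data":
            break  # the header ends where the data section starts
        if keyword == "@attribute" and len(parts) == 3:
            name = parts[1]
            # Drop whitespace so "{a, b}" and "{a,b}" compare equal.
            type_spec = "".join(parts[2].split())
            if type_spec.lower() in NUMERIC_TYPES:
                type_spec = "numeric"
            attributes.append((name, type_spec))
    return attributes


def header_differences(train_text, test_text):
    """Describe every position where the two headers disagree."""
    train = read_attributes(train_text)
    test = read_attributes(test_text)
    differences = []
    if len(train) != len(test):
        differences.append(f"attribute count: {len(train)} vs {len(test)}")
    for position, (a, b) in enumerate(zip(train, test)):
        if a != b:
            differences.append(
                f"attribute {position}: {a[0]} {a[1]} vs {b[0]} {b[1]}"
            )
    return differences
```

Calling `header_differences` with the text of the training file and the text of the file to classify returns an empty list when the declarations match. Label order inside the braces is compared as written, since the order of nominal labels also has to agree.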
I'm trying to use this data set in weka:
@relation adult
@attribute age: continuous
@attribute workclass: {Private,Self-emp-not-inc,Self-emp-inc,Federal-gov,Local-gov,State-gov,Without-pay,Never-worked}
@attribute fnlwgt: continuous.
@attribute education: {Bachelors,Some-college,11th,HS-grad,Prof-school,Assoc-acdm,Assoc-voc,9th,7th-8th,12th,Masters,1st-4th,10th,Doctorate,5th-6th,Preschool}
@attribute education-num: continuous
@attribute marital-status: {Married-civ-spouse,Divorced,Never-married,Separated,Widowed,Married-spouse-absent,Married-AF-spouse}
@attribute occupation: {Tech-support,Craft-repair,Other-service,Sales,Exec-managerial,Prof-specialty,Handlers-cleaners,Machine-op-inspct,Adm-clerical,Farming-fishing,Transport-moving,Priv-house-serv,Protective-serv,Armed-Forces.
@attribute relationship: {Wife,Own-child,Husband,Not-in-family,Other-relative,Unmarried}
@attribute race: {White,Asian-Pac-Islander,Amer-Indian-Eskimo,Other,Black}
@attribute sex: {Female,Male}
@attribute capital-gain: continuous
@attribute capital-loss: continuous
@attribute hours-per-week: continuous
@attribute native-country: {United-States,Cambodia,England,Puerto-Rico,Canada,Germany,Outlying-US(Guam-USVI-etc),India,Japan,Greece,South,China,Cuba,Iran,Honduras,Philippines,Italy,Poland,Jamaica,Vietnam,Mexico,Portugal,Ireland,France,Dominican-Republic,Laos,Ecuador,Taiwan,Haiti,Columbia,Hungary,Guatemala,Nicaragua,Scotland,Thailand,Yugoslavia,El-Salvador,Trinadad&Tobago,Peru,Hong,Holand-Netherlands}
@data
39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
I keep getting the error:
Unable to determine structure as arff (Reason: java.io.IOException: Keyword @relation expected, read Token ['{'], line 1).
Which doesn't make any sense because there is no '{' in line 1
There are a few things that could be causing the issue. See the ARFF file format specification for the details.
In the example dataset below, attributes are declared in this format (shown here for your fnlwgt attribute):
@attribute 'fnlwgt' real
that is, without colons, and with real or integer instead of continuous.
Also, you have
@attribute hours-per-week: continuous
@attribute native-country: {United-States,Cambodia,England,Puerto-Rico,Canada,Germany,Outlying-US(Guam-USVI-etc),India,Japan,Greece,South,China,Cuba,Iran,Honduras,Philippines,Italy,Poland,Jamaica,Vietnam,Mexico,Portugal,Ireland,France,Dominican-Republic,Laos,Ecuador,Taiwan,Haiti,Columbia,Hungary,Guatemala,Nicaragua,Scotland,Thailand,Yugoslavia,El-Salvador,Trinadad&Tobago,Peru,Hong,Holand-Netherlands}
reversed in your data set:
39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
And you do not have a class attribute, such as
@attribute 'Class' {something, something2, something3}
For comparison, here is vehicle.arff from the SEASR ARFF datasets:
@attribute 'COMPACTNESS' real
@attribute 'CIRCULARITY' real
@attribute 'DISTANCE CIRCULARITY' real
@attribute 'RADIUS RATIO' real
@attribute 'PR.AXIS ASPECT RATIO' real
@attribute 'MAX.LENGTH ASPECT RATIO' real
@attribute 'SCATTER RATIO' real
@attribute 'ELONGATEDNESS' real
@attribute 'PR.AXIS RECTANGULARITY' real
@attribute 'MAX.LENGTH RECTANGULARITY' real
@attribute 'SCALED VARIANCE_MAJOR' real
@attribute 'SCALED VARIANCE_MINOR' real
@attribute 'SCALED RADIUS OF GYRATION' real
@attribute 'SKEWNESS ABOUT_MAJOR' real
@attribute 'SKEWNESS ABOUT_MINOR' real
@attribute 'KURTOSIS ABOUT_MAJOR' real
@attribute 'KURTOSIS ABOUT_MINOR' real
@attribute 'HOLLOWS RATIO' real
@attribute 'Class' {opel,saab,bus,van}
@data
95,48,83,178,72,10,162,42,20,159,176,379,184,70,6,16,187,197,van
91,41,84,141,57,9,149,45,19,143,170,330,158,72,9,14,189,199,van
104,50,106,209,66,10,207,32,23,158,223,635,220,73,14,9,188,196,saab
93,41,82,159,63,9,144,46,19,143,160,309,127,63,6,10,199,207,van
85,44,70,205,103,52,149,45,19,144,241,325,188,127,9,11,180,183,bus
107,57,106,172,50,6,255,26,28,169,280,957,264,85,5,9,181,183,bus
97,43,73,173,65,6,153,42,19,143,176,361,172,66,13,1,200,204,bus
90,43,66,157,65,9,137,48,18,146,162,281,164,67,3,3,193,202,van
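The rewrite of the attribute lines described above (no colon, real instead of continuous) is mechanical, so it can be scripted. Below is a small Python sketch; `fix_attribute_line` is a hypothetical helper, and it assumes each declaration has the form `@attribute name: type` with an optional trailing period. It also closes a label list whose closing brace is missing.

```python
def fix_attribute_line(line):
    """Rewrite '@attribute name: continuous.' as "@attribute 'name' real"."""
    keyword, rest = line.strip().split(None, 1)
    if ":" not in rest:
        return line.strip()  # nothing to fix
    name, type_spec = rest.split(":", 1)
    type_spec = type_spec.strip().rstrip(".")  # drop a trailing period
    if type_spec.lower() == "continuous":
        type_spec = "real"
    elif type_spec.startswith("{") and not type_spec.endswith("}"):
        type_spec += "}"  # close a label list that was left open
    # Quoting the name keeps names with special characters safe.
    return f"{keyword} '{name.strip()}' {type_spec}"


print(fix_attribute_line("@attribute fnlwgt: continuous."))
print(fix_attribute_line("@attribute sex: {Female,Male}"))
```

The class attribute still has to be added by hand, because the header in the question does not declare one for the last value of each row.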
Sorry to bring up another question, but why is Weka graying out some classifiers for my data? A sample of the data file follows:
@relation whatever
@attribute ClearanceFactor numeric
@attribute CrestFactor numeric
@attribute HistogramLB numeric
@attribute HistogramUB numeric
@attribute ImpulseFactor numeric
@attribute KurtVal numeric
@attribute PeakVal numeric
@attribute RMSVal numeric
@attribute Status { Normal }
@data
1 , 0.944758327 , 0.818823375 , 0.835884533 , 0.973802319 , 0.922274575 , 0.836712854 , 0.830582178 , Normal
1 , 0.922118042 , 0.737125289 , 0.762040973 , 0.963101929 , 0.889826729 , 0.762426651 , 0.753675509 , Normal
1 , 0.975667525 , 0.916722849 , 0.924607883 , 0.988490457 , 0.962805959 , 0.925217603 , 0.922378149 , Normal
I got it working with a simple trick. In the declaration of the last attribute, I listed all of the other possible class labels as follows, and it worked! :D
@relation whatever
@attribute ClearanceFactor numeric
@attribute CrestFactor numeric
@attribute HistogramLB numeric
@attribute HistogramUB numeric
@attribute ImpulseFactor numeric
@attribute KurtVal numeric
@attribute PeakVal numeric
@attribute RMSVal numeric
@attribute Status { Normal, faulty0, faulty1}
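The likely reason this works is that a nominal class with a single label gives a classifier nothing to discriminate between, so most classifiers cannot be applied to it. As a hedged illustration, the Python sketch below counts the labels in a nominal declaration; `declared_labels` is a made-up helper, not a Weka function, and it assumes labels contain no commas or braces.

```python
def declared_labels(attribute_line):
    """Return the labels of a nominal attribute declaration,
    or an empty list if the line declares no label set."""
    start = attribute_line.find("{")
    end = attribute_line.rfind("}")
    if start == -1 or end < start:
        return []
    inside = attribute_line[start + 1:end]
    return [label.strip() for label in inside.split(",") if label.strip()]


before = declared_labels("@attribute Status { Normal }")
after = declared_labels("@attribute Status { Normal, faulty0, faulty1}")
print(len(before), len(after))
```

A count below two for the class attribute is a sign that the header needs more labels, even if the sample rows only ever use one of them.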
I have a question about sparse ARFF in Weka.
An example is shown below:
@RELATION example
@ATTRIBUTE an apple
@ATTRIBUTE a cat
@ATTRIBUTE for love
@ATTRIBUTE the end
@ATTRIBUTE class {real, fake}
@DATA
Here is my question.
These are straightforward:
0,1,0,0,real -> {1 1, 4 real}
0,0,0,1,fake -> {3 1, 4 fake}
But how do I write these?
1,1,1,1,real -> ? I need help here
2,1,3,1,fake -> ? I need help here
Thanks in advance guys.
Best Regards
plasma33
You can run the NonSparseToSparse filter and check the results. In your case, from:
@RELATION example
@ATTRIBUTE an numeric
@ATTRIBUTE a numeric
@ATTRIBUTE for numeric
@ATTRIBUTE the numeric
@ATTRIBUTE class {real, fake}
@DATA
0,1,0,0,real
0,0,0,1,fake
1,1,1,1,real
2,1,3,1,fake
You get:
@relation example-weka.filters.unsupervised.instance.NonSparseToSparse
@attribute an numeric
@attribute a numeric
@attribute for numeric
@attribute the numeric
@attribute class {real,fake}
@data
{1 1}
{3 1,4 fake}
{0 1,1 1,2 1,3 1}
{0 2,1 1,2 3,3 1,4 fake}
Please note that real is the first label of the nominal class attribute, so it is the default value (internal index 0) and is not printed in sparse format, just like a numeric 0. Please note as well that attribute indices start at 0.
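The conversion rule can be reproduced in a few lines of Python. This is only a sketch of the rule as described here, not the filter's actual implementation; `to_sparse` and its `first_labels` argument (a mapping from the index of each nominal attribute to its first declared label) are assumptions made for illustration.

```python
def to_sparse(row, first_labels):
    """Convert one dense row (a list of strings) to a sparse ARFF row.

    first_labels maps the index of each nominal attribute to its first
    declared label, which is the value omitted in sparse format."""
    pairs = []
    for index, value in enumerate(row):
        value = value.strip()
        if value == "?":
            pairs.append(f"{index} ?")  # missing values are written explicitly
            continue
        if index in first_labels:
            if value == first_labels[index]:
                continue  # first nominal label is stored as 0 and omitted
        elif float(value) == 0:
            continue  # numeric zeros are omitted
        pairs.append(f"{index} {value}")
    return "{" + ",".join(pairs) + "}"


first_labels = {4: "real"}  # attribute 4 is class {real, fake}
for dense in ["0,1,0,0,real", "0,0,0,1,fake", "1,1,1,1,real", "2,1,3,1,fake"]:
    print(to_sparse(dense.split(","), first_labels))
```

For the four rows above this prints the same four sparse rows as the filter output shown earlier.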
I have been using an SVM classifier with the following data:
@relation whatever
@attribute mfe numeric
@attribute GB numeric
@attribute GTB numeric
@attribute Seeds numeric
@attribute ABP numeric
@attribute AU_Seed numeric
@attribute GC_Seed numeric
@attribute GU_Seed numeric
@attribute UP numeric
@attribute AU numeric
@attribute GC numeric
@attribute GU numeric
@attribute A-U_L numeric
@attribute G-C_L numeric
@attribute G-U_L numeric
@attribute (G+C) numeric
@attribute MFEi1 numeric
@attribute MFEi2 numeric
@attribute MFEi3 numeric
@attribute MFEi4 numeric
@attribute dG numeric
@attribute dP numeric
@attribute dQ numeric
@attribute dD numeric
@attribute Outcome {Yes,No}
@data
-24.3,1,18,2,9,4,3,0.5,8,10,7,1,0.454545455,0.318181818,0.045454545,7,-0.157792208,-0.050206612,-1.104545455,-1.35,-1.104545455,0,0,0,Yes
-24.8,2,15,2,7.5,2,3,1,7,5,8,2,0.208333333,0.333333333,0.083333333,8,-0.129166667,-0.043055556,-0.516666667,-1.653333333,-1.033333333,0,0,0,No
-24.4,1,16,3,5.333333333,1.666666667,2.666666667,1,4,5,8,3,0.217391304,0.347826087,0.130434783,8,-0.132608696,-0.046124764,-1.060869565,-1.525,-1.060869565,0,0,0,Yes
-24.2,1,18,2,9,2,2.5,1,10,5,11,2,0.227272727,0.5,0.090909091,11,-0.1,-0.05,-1.1,-1.344444444,-1.1,0,0,0,Yes
-24.5,3,17,2,8.5,2,3,1,5,6,9,2,0.272727273,0.409090909,0.090909091,9,-0.123737374,-0.050619835,-0.371212121,-1.441176471,-1.113636364,-0.12244898,0,0,Yes
This is my training set, in which each instance is labeled as class Yes or class No. My test data comes from an unknown source, and I have no idea which class it belongs to. Without the Outcome attribute, Weka gives the error "Data mismatch". How should I prepare the test set so that the SVM can separate my instances into the Yes and No classes?
Steps to prepare the test set:
1. Create the training set in CSV format.
2. Create the test set in CSV format as well, with the same number of attributes and the same types.
3. Copy the test set, paste it at the end of the training set, and save the result as a new CSV file.
4. Import the CSV file saved in step 3 using Weka >> Explorer >> Preprocess.
5. Under Filter, choose filters >> unsupervised >> instance >> RemoveRange.
6. Click the field that reads "RemoveRange -R first-last".
7. Specify the range you want to remove. For example, if the training data has 100 instances, enter first-100 and apply the filter.
8. Save the result as an ARFF file; this can be used as the test set.
Then apply this set. If you still get errors, reply to this post.
If you want to avoid that hassle, you can prepare your test set with exactly the same attribute names, data types and data ranges as your training set, and with the attribute values filled in. The class attribute must be present, but its value should be a question mark (?). For instance, your training set can be converted to a test set as follows:
@relation whatever-TEST
@attribute mfe numeric
@attribute GB numeric
@attribute GTB numeric
@attribute Seeds numeric
@attribute ABP numeric
@attribute AU_Seed numeric
@attribute GC_Seed numeric
@attribute GU_Seed numeric
@attribute UP numeric
@attribute AU numeric
@attribute GC numeric
@attribute GU numeric
@attribute A-U_L numeric
@attribute G-C_L numeric
@attribute G-U_L numeric
@attribute (G+C) numeric
@attribute MFEi1 numeric
@attribute MFEi2 numeric
@attribute MFEi3 numeric
@attribute MFEi4 numeric
@attribute dG numeric
@attribute dP numeric
@attribute dQ numeric
@attribute dD numeric
@attribute Outcome {Yes,No}
@data
-24.3,1,18,2,9,4,3,0.5,8,10,7,1,0.454545455,0.318181818,0.045454545,7,-0.157792208,-0.050206612,-1.104545455,-1.35,-1.104545455,0,0,0,?
-24.8,2,15,2,7.5,2,3,1,7,5,8,2,0.208333333,0.333333333,0.083333333,8,-0.129166667,-0.043055556,-0.516666667,-1.653333333,-1.033333333,0,0,0,?
-24.4,1,16,3,5.333333333,1.666666667,2.666666667,1,4,5,8,3,0.217391304,0.347826087,0.130434783,8,-0.132608696,-0.046124764,-1.060869565,-1.525,-1.060869565,0,0,0,?
-24.2,1,18,2,9,2,2.5,1,10,5,11,2,0.227272727,0.5,0.090909091,11,-0.1,-0.05,-1.1,-1.344444444,-1.1,0,0,0,?
-24.5,3,17,2,8.5,2,3,1,5,6,9,2,0.272727273,0.409090909,0.090909091,9,-0.123737374,-0.050619835,-0.371212121,-1.441176471,-1.113636364,-0.12244898,0,0,?
Do we need to replace the values of the last attribute with question marks in the test data?
I am confused.
I tested my data in two ways:
1. Removing the values of the last attribute and putting ? in their place.
2. Using the test data as it is (without removing the class attribute).
Whether you are evaluating a trained model on a dataset or trying to make predictions with a trained model, the dataset has to have the exact same structure as the training data (attribute names, attribute types, order of nominal labels). This includes the class attribute.
If you want to test your model, then you need ground truth values to compare the predictions against. Otherwise you cannot generate statistics.
If you want to make predictions, then the class values should be all missing.
For removing the class values, you can either do that manually, or you can use the missing-values-imputation Weka package. Use the weka.filters.unsupervised.attribute.MissingValuesInjection filter in conjunction with the ClassOnly injection scheme.
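For the manual route, blanking the class column is easy to script outside Weka. The Python sketch below is one possible way to do it; `mask_class_values` is a hypothetical helper, and it assumes a dense ARFF file whose class is the last attribute and whose class values contain no commas.

```python
def mask_class_values(arff_text):
    """Replace the last value of every data row with '?',
    leaving the header (including the class declaration) untouched."""
    output = []
    in_data = False
    for raw in arff_text.splitlines():
        line = raw.strip()
        if in_data and line and not line.startswith("%"):
            if "," in raw:
                head, _, _ = raw.rpartition(",")  # drop the last value
                output.append(head + ",?")
            else:
                output.append("?")  # single-attribute row: the class itself
        else:
            output.append(raw)
            if line.lower() == "@data":
                in_data = True
    return "\n".join(output) + "\n"


labeled = """@relation whatever
@attribute mfe numeric
@attribute Outcome {Yes,No}
@data
-24.3,Yes
-24.8,No
"""
print(mask_class_values(labeled))
```

The header is copied verbatim, so the resulting file keeps the same structure as the training data, which is what Weka requires.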