What should I do with unknown data while creating Weka ARFF files - weka

I am trying to format my dataset as a Weka ARFF file. This is a sample of my ARFF file:
@relation my_relation
@attribute 'attrib_1' numeric
@attribute 'attrib_2' numeric
@attribute 'attrib_3' numeric
...
@attribute 'class' {1,2,3,4,5}
@data
6,6,55,0,0,0,18.9,0,1,2,'?',14,15,20,'?','?','?','?',28,29,1
54,25,19,4.85,0,1,10,13,'?','?','?','?','?','?',15,16,19,20,21,0,3
...
My features are numeric (real-valued), but some values are missing for each feature in different instances. How should I indicate that a feature value is missing?
I used '?' for missing values, but this error occurs when I try to open mydata.arff:
number expected, read token[?], line 746
Edit: I changed '?' to ? and tried to load the file again. This time the following error occurs:
nominal value not declared in header, read Token[86], line 746

This is too long to fit into a comment. I think I can see a likely problem with your data: it contains some bad characters. You are probably reading this in a web browser. If so, view the HTML source for this page and scroll down to your data. In Internet Explorer, I was able to save this web page as a text file and then look at the text in an editor to see the bad characters. In many places throughout the data, I see &zwnj;&#8203;. These are zero-width characters: the zero-width non-joiner (zwnj, U+200C) and the zero-width space (decimal 8203, U+200B). That is, they are present in the data but do not show up on the screen, not even as blank space. Because your data contains these spurious characters, WEKA cannot read it. Please check whether your original data contains these hidden characters.
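A quick way to run the check suggested above is a small script. This is only a sketch in Python: the list of code points is my own guess at the usual suspects, and the names `INVISIBLE`, `find_invisible` and `strip_invisible` are made up for the example.

```python
# Invisible code points that commonly sneak in through copy/paste from web pages.
# This list is an assumption, not an exhaustive inventory.
INVISIBLE = {
    "\u200b": "ZERO WIDTH SPACE",
    "\u200c": "ZERO WIDTH NON-JOINER",
    "\u200d": "ZERO WIDTH JOINER",
    "\ufeff": "ZERO WIDTH NO-BREAK SPACE (BOM)",
}


def find_invisible(text):
    """Return (line_number, column, name) for each invisible character found."""
    hits = []
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in INVISIBLE:
                hits.append((line_no, col, INVISIBLE[ch]))
    return hits


def strip_invisible(text):
    """Return the text with every character listed in INVISIBLE removed."""
    return "".join(ch for ch in text if ch not in INVISIBLE)


# Made-up sample: the second row hides two zero-width characters before "25".
sample = "6,6,55\n54,\u200c\u200b25,19\n"
hits = find_invisible(sample)
cleaned = strip_invisible(sample)
print(hits)
print(repr(cleaned))
```

For a real file, read it with `open(path, encoding="utf-8")`, pass the contents to `find_invisible`, and write the result of `strip_invisible` to a new file before loading it in WEKA.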

Related

Test and Train sets are not compatible

I'm using Weka and trying to test my file, but I always get a popup window showing "Train and Test set are not compatible". I'm using CSV files. All the attributes are the same in both files. Out of 30 attributes, I divided them into two parts: the first 20 attributes as the training set and the remaining 10 as the test set. Please help me.
Your attributes and their order must be the same in both files. See the following Weka Wiki post and Stack Overflow question 1 and question 2. Even a small difference may cause this error.
According to you their order may be the same, but according to Weka they are not. Convert both files to ARFF format and try again; you will see that their ARFF headers are not the same. See the example below.
CSV file1
Feature A
true
false
CSV file2
Feature A
false
true
The ARFF headers generated from these two CSV files are not the same. Because the order in which the values first occur differs between the files, the order of the nominal values in the ARFF header differs too.
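To make the point about first occurrence concrete, here is a rough Python sketch. It does not call Weka; it only mimics the behaviour described above (nominal labels listed in order of first appearance), and the helper names `nominal_labels` and `arff_declaration` are invented for the illustration.

```python
import csv
import io


def nominal_labels(csv_text, column):
    """Collect a column's distinct values in order of first appearance."""
    labels = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        value = row[column]
        if value not in labels:
            labels.append(value)
    return labels


def arff_declaration(csv_text, column):
    """Build the @attribute line that such a conversion would produce."""
    return "@attribute '%s' {%s}" % (column, ",".join(nominal_labels(csv_text, column)))


# The two CSV files from the example above.
file1 = "Feature A\ntrue\nfalse\n"
file2 = "Feature A\nfalse\ntrue\n"

decl1 = arff_declaration(file1, "Feature A")
decl2 = arff_declaration(file2, "Feature A")
print(decl1)
print(decl2)
print("headers match:", decl1 == decl2)
```

The two declarations list the same labels in a different order, which is enough for the headers to count as different.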

Weka does not show ARFF feature values

I have an .arff file with 5 features named:
JaccardCoefficient,adamicadar,commonneighbors,katz,rootedpagerank
I open the file in Weka, but it does not show the katz values. It shows max: 0, min: 0, mean: 0, stddev: 0.
Note that the katz values are very small, e.g. 0.0000312. What should I do to see the katz values?
I have had a look at your sample file in Weka and have found the zero values that were reported. The data itself appears to be loaded correctly, but the precision of the attribute summaries appears to be limited to three decimal places. For this reason, the values are too small to be represented in the attribute list.
One way to change this for use with Weka's prediction models is to pre-process the data to a more suitable range. This could be done using normalisation or other rescaling techniques, as required for your purposes. For example, I adjusted the data by simply multiplying the attribute by 100, which brought the attribute summaries into a visible range on the screen.
Hope this helps!
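As a rough illustration of the rescaling idea in the answer above, the Python sketch below shows how values of this size vanish when rounded to three decimals, and how min-max normalisation brings them back into view. The katz values are made up, and `min_max_scale` is a hypothetical helper, not a Weka function.

```python
def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0.0 and the largest to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: nothing to spread out.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Made-up katz scores of the same magnitude as in the question.
katz = [0.0000312, 0.0000104, 0.0000520]

# Rounding to three decimals mimics the limited display precision.
shown = [round(v, 3) for v in katz]
scaled = min_max_scale(katz)

print(shown)
print([round(v, 3) for v in scaled])
```

Inside Weka itself, an unsupervised attribute filter such as Normalize should achieve the same effect without leaving the Explorer.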

How do I replace poorly formatted ZIP codes with proper ones?

I have a data set that looks like this:
adjuster adjuster_zip
A-20 98216
A-14 98214
A-17 98216
A-20 California
I need to format this data set so that adjuster_zip is all numeric. I have several hundred adjusters, and each of them shows up several hundred times. However, each adjuster has only one zip code. As you can see with A-20, this adjuster has both a valid and an invalid zip code. All of the adjusters that have invalid zip codes also have valid zip codes. How can I automate this so that SAS replaces invalid zip codes with valid ones by adjuster?
Thanks for any and all help.
Also, I couldn't figure out how to format the data so that it shows up in a table. Sorry.
My suggestion would be to build a format table per adjuster. Start with your input dataset, then filter to only valid zip codes (you could use NOTDIGIT to check for any nondigit values, and LENGTH to check that the value is exactly five characters long). Then create a dataset with FMTNAME as a constant string containing any legal format name you wish, preceded by $ ($ADJZIPF would be a good choice), START equal to the variable that contains the adjuster name, and LABEL equal to the zip. Then use PROC FORMAT with cntlin= set to the dataset you just defined.
That would allow you to look up the zip for each adjuster using PUT and your custom format. You still have to worry about a few things: the table must contain no duplicates per adjuster, so you need to decide how to handle adjusters with two or more zips, and you need to check, when you use PUT, that it actually finds a zip code.
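The logic of the lookup described above can be sketched outside SAS. The following Python is not SAS code and does not use PROC FORMAT; it only illustrates the same steps (filter to valid zips, build one entry per adjuster, flag duplicates, look up with a fallback). All function names are invented for the example.

```python
from collections import defaultdict


def is_valid_zip(value):
    """Exactly five ASCII digits, the condition the NOTDIGIT and LENGTH checks test."""
    return len(value) == 5 and value.isascii() and value.isdigit()


def build_lookup(rows):
    """Map each adjuster to its single valid zip; fail if an adjuster has several."""
    seen = defaultdict(set)
    for adjuster, zip_code in rows:
        if is_valid_zip(zip_code):
            seen[adjuster].add(zip_code)
    conflicts = {a: sorted(z) for a, z in seen.items() if len(z) > 1}
    if conflicts:
        raise ValueError("adjusters with several valid zips: %r" % conflicts)
    return {a: next(iter(z)) for a, z in seen.items()}


def repair(rows):
    """Replace invalid zips with the adjuster's valid one (None if none is known)."""
    lookup = build_lookup(rows)
    return [(a, z if is_valid_zip(z) else lookup.get(a)) for a, z in rows]


# The sample rows from the question.
rows = [("A-20", "98216"), ("A-14", "98214"), ("A-17", "98216"), ("A-20", "California")]
fixed = repair(rows)
print(fixed)
```

The `None` fallback plays the role of the "check that PUT does find a zip code" caveat: any row that still has no zip after the repair needs manual attention.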

Arff File - Nominal Value not declared in header.

I am generating an .arff file using a Java program. The file has about 600 attributes.
I am unable to open the file in Weka Explorer.
It says: "nominal value not declared in header, read Token[0], line 626."
Here is the first attribute line: @attribute vantuono numeric
Here are the first few chars of line 626: 0,0,0,0,1,0,0,0,0,1,0,1...
Why is WEKA unable to parse '0' as a numeric value?
Interestingly, this happens only in this file. I have other files with numeric attributes accepting '0' for a value.
Are you sure that your declaration is correct? The WEKA FAQ says:
nominal value not declared in header, read Token[X], line Y
If you get this error message then you seem to have declared a nominal attribute in the ARFF header section, but Weka came across a value ("X") in the data (in line Y) for this particular attribute that wasn't listed as a possible value.
All nominal values that appear in the data must be declared in the header.
There is also a bug regarding sparse ARFF files
Increase the memory to accommodate all the rows using -B #noOfRecords option.
java weka.core.converters.CSVLoader filename.csv filename.arff -B 33000
If you get this error, it is likely that in your dataset (after the @data line) you kept the header row (column names) that you have already declared as attributes. Remove that header line and you should be good to go.
I got the same error. Then I saw that my program was adding an extra apostrophe. When I removed the apostrophe, it worked.
I had this problem and it cost me a lot of time, so hopefully it won't cost you. Just put the class attribute last, and make sure the attributes are in the same order as in the text.
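Several of the causes mentioned in the answers above (an undeclared nominal value, a leftover column-name row after @data, a stray apostrophe or a quoted '?') can be caught before opening the file in Weka. Below is a minimal checker sketch in Python. It assumes a dense (non-sparse) ARFF file with no commas inside quoted values, it is not Weka's parser, and the messages are my own wording.

```python
import re

# Matches "@attribute <name> <type>" where the name may be single-quoted.
ATTR_RE = re.compile(r"^@attribute\s+('[^']*'|\S+)\s+(.*)$", re.IGNORECASE)


def parse_header(lines):
    """Return [(name, spec)]; spec is a set of labels for nominal attributes, else the type text."""
    attrs = []
    for line in lines:
        m = ATTR_RE.match(line.strip())
        if not m:
            continue
        name, spec = m.group(1).strip("'"), m.group(2).strip()
        if spec.startswith("{") and spec.endswith("}"):
            spec = {label.strip().strip("'\"") for label in spec[1:-1].split(",")}
        attrs.append((name, spec))
    return attrs


def check_arff(text):
    """Return (line_number, attribute, message) for every suspicious data value."""
    lines = text.split("\n")
    data_at = next(i for i, l in enumerate(lines) if l.strip().lower() == "@data")
    attrs = parse_header(lines[:data_at])
    problems = []
    for line_no, line in enumerate(lines[data_at + 1:], start=data_at + 2):
        line = line.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        values = [v.strip() for v in line.split(",")]
        if len(values) != len(attrs):
            problems.append((line_no, None, "expected %d values, found %d" % (len(attrs), len(values))))
            continue
        for (name, spec), value in zip(attrs, values):
            if value == "?":
                continue  # an unquoted ? marks a missing value
            if isinstance(spec, set):
                if value.strip("'\"") not in spec:
                    problems.append((line_no, name, "nominal value %s not declared" % value))
            elif spec.lower() in ("numeric", "real", "integer"):
                try:
                    float(value)
                except ValueError:
                    problems.append((line_no, name, "not a number: %s" % value))
    return problems


# Made-up file: line 6 is a leftover column-name row, line 8 quotes the ?,
# and line 9 uses a class label that the header does not declare.
sample_arff = """@relation demo
@attribute a numeric
@attribute class {1,2,3}
@data
1.5,2
a,class
?,3
'?',1
7,86
"""
problems = check_arff(sample_arff)
for problem in problems:
    print(problem)
```

Running this on the real file and looking at the first reported line is usually quicker than bisecting the data by hand.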

Weka: training and test set are not compatible

Each row of my training and test datasets has intensity values for the pixels in an image, with the last column holding the label that tells which digit is represented in the image; the label can be any number from 0 to 9 in the training set and is always ? in the test set. I loaded the training dataset in Weka Explorer, passed the data through the NumericToNominal filter and used the RemovePercentage filter to split the data in a 70-30 ratio, with the 30% file being used as the cross-validation set. I built a classifier and saved the model.
Then I loaded the test data, which has ? as the label for each row, applied the NumericToNominal filter and saved it as an ARFF file. Now, when I load the test data and try to use the model against it, I always get the error message "training and test set are not compatible". Both datasets have undergone the same processing. What could possibly have gone wrong?
As you can read in the ARFF manual (http://www.cs.waikato.ac.nz/ml/weka/arff.html):
Nominal values are defined by providing an <nominal-specification> listing the possible values: {<nominal-name1>, <nominal-name2>, <nominal-name3>, ...}
For example, the class value of the Iris dataset can be defined as follows:
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
So when you apply NumericToNominal to your test file, one or more attributes can end up with a different set of possible values in the train and test ARFF files. It really can happen, and it has bothered me many times. One solution is to check your ARFF files manually (if they are not too big), or just copy and paste the attribute declarations of the ARFF file, e.g.
@attribute 'My first binary attribute' {0,1}
(...)
@attribute 'My last binary attribute' {0,1}
from the train file to the test file. That should work.
You can also use batch filtering in Weka, so that the same filter configuration is applied to both files.
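If the files are too big to compare by eye, the manual header check can be scripted. The Python sketch below only compares the @attribute lines of two ARFF texts; it is a plain text comparison written for this example, not part of Weka, and the sample headers are made up.

```python
def attribute_lines(arff_text):
    """Return the @attribute declarations in order, with whitespace normalised."""
    decls = []
    for line in arff_text.split("\n"):
        stripped = line.strip()
        if stripped.lower().startswith("@attribute"):
            decls.append(" ".join(stripped.split()))
        elif stripped.lower() == "@data":
            break  # the header ends here
    return decls


def header_differences(train_text, test_text):
    """List every position where the two headers disagree."""
    train, test = attribute_lines(train_text), attribute_lines(test_text)
    diffs = []
    if len(train) != len(test):
        diffs.append("attribute count differs: %d vs %d" % (len(train), len(test)))
    for i, (a, b) in enumerate(zip(train, test), start=1):
        if a != b:
            diffs.append("attribute %d: %r vs %r" % (i, a, b))
    return diffs


# Made-up headers: pixel0 took two values in the train file but only one in the test file.
train_arff = "@relation digits\n@attribute pixel0 {0,255}\n@attribute label {0,1,2}\n@data\n0,1\n"
test_arff = "@relation digits\n@attribute pixel0 {0}\n@attribute label {0,1,2}\n@data\n0,?\n"

diffs = header_differences(train_arff, test_arff)
for diff in diffs:
    print(diff)
```

Any line it prints points at an attribute whose declaration needs to be copied from the train file to the test file.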